Vocal Modulation Features in the Prediction of Major Depressive Disorder Severity

by

Rachelle L. Horwitz

B.S. Electrical and Computer Engineering
B.S. Biomedical Engineering
Worcester Polytechnic Institute, 2008

SUBMITTED TO THE DEPARTMENT OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE IN ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

AT THE

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

SEPTEMBER 2014

© Massachusetts Institute of Technology. All rights reserved.

The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

Signature of Author: Signature redacted
Department of Electrical Engineering and Computer Science
August 12, 2014

Certified by: Signature redacted
Thomas F. Quatieri, Ph.D.
Senior Member of Technical Staff, MIT Lincoln Laboratory
Faculty of the Harvard-MIT SHBT Program
Thesis Supervisor

Accepted by: Signature redacted
Leslie Kolodziejski, Ph.D.
Professor of Electrical Engineering and Computer Science
Chairman, Committee for Graduate Students

ABSTRACT

This thesis develops a model of vocal modulations up to 50 Hz in sustained vowels as a basis for biomarkers of neurological disease, particularly Major Depressive Disorder (MDD).
Two model components contribute to amplitude modulation (AM): AM from respiratory muscles and AM from the interaction between formants and frequency modulation in the fundamental frequency harmonics. Based on the modulation model, we test three methods of extracting the envelope of the third formant, from which features are then computed, using sustained vowels from the 2013 Audio/Visual Emotion Challenge. Using a Gaussian-Mixture-Model-based predictor, we evaluate the performance of each feature in predicting subjects' Beck MDD severity score by the root mean square error (RMSE), mean absolute error (MAE), and Spearman correlation between the actual Beck score and the predicted score. Our lowest MAE and RMSE values are 8.46 and 10.32, respectively (Spearman correlation = 0.487, p < 0.001), relative to a mean-score baseline MAE of 10.05 and RMSE of 11.86.

Thesis Supervisor: Thomas F. Quatieri
Title: Senior Member of Technical Staff, MIT Lincoln Laboratory; Faculty of the Harvard-MIT SHBT Program

Acknowledgements

First and foremost, I would like to thank my thesis supervisor, The Legendary Tom Quatieri, for the time he has spent over the past two years providing guidance, and sharing his enthusiasm and vast expertise in speech signal processing. He encouraged me to take several steps back and see the forest for the trees, and for that, I am a better researcher.

I would also like to thank the following colleagues at MIT Lincoln Laboratory: Daryush Mehta, Jim Williamson, Brian Helfer, and Elizabeth Godoy. Daryush wrote the code to extract the formants, aided me with some of the signal processing, and provided me with encouragement and advice throughout the thesis. I appreciate his time and effort in helping me fully explore some of the technical difficulties I had encountered, and in providing sound advice. Jim wrote the code for the Gaussian Mixture Model and for the cross-correlation features, which he and Brian took the time to explain to me and help me to incorporate into my thesis.
Elizabeth Godoy wrote the code for one of the techniques to extract the envelope, and provided additional insights into processing speech data.

I would like to thank my parents, Marshall and Susan Horwitz, for all the encouragement they provided throughout this process. I enjoyed being able to talk to my mom, a speech pathologist, about the nontechnical aspects of my thesis. My classmates, Sonam Dilwali, Koeun Lim, Jordan Whitlock, and Nate Zuk, were also excellent resources of both knowledge and support. I'm particularly grateful to Nate for allowing me to explain the details of the signal processing to him.

Finally, I would like to thank my fiancé, Rob Martin. Although he would prefer to spend his free time writing code for his game engine, he now knows all about the source-filter theory of speech, Fourier transforms, demodulation, and Gaussian Mixture Models! Rob, I'm looking forward to spending the rest of my life with you.

Table of Contents

Chapter 1 Introduction
  1.1 Thesis Outline
  1.2 Contributions
Chapter 2 Background
  2.1 Acoustics of Vocal Modulations and Vibrato
    2.1.1 FM in Vocal Modulations
    2.1.2 AM in Vocal Modulations
    2.1.3 Periodicity in Vocal Tremor
  2.2 Physiology of Vocal Tremor
  2.3 Prior Work on Modulation Features of Vocal Tremor
  2.4 Conclusions
Chapter 3 Relationship between Perceived Amount of Vocal Modulation and MDD
  3.1 Experimental Set-Up
  3.2 Results
  3.3 Conclusions
Chapter 4 Model
  4.1 Motivation for the Model and Processing
    4.1.1 Motivation for an AM and FM Model During a Sustained Vowel
    4.1.2 Motivation for Processing the Output of the Model
  4.2 Implementation of the Model
  4.3 Envelope Component Estimation
    4.3.1 Stockham's Method
    4.3.2 Hilbert-Stockham Method
    4.3.3 Nonlinear, Iterative Envelope (NLIE)-Stockham Method
    4.3.4 Comparison of the Three Processing Methods
  4.4 Conclusions
Chapter 5 Feature Extraction
  5.1 The AVEC Database
  5.2 Pre-Processing
  5.3 Frequency Domain-Based Features
    5.3.1 Average of the STFT Magnitude of the Log-Envelope
    5.3.2 Variance of the Magnitude of the STFT of the Log-Envelope
    5.3.3 Coefficient of Variation (CV) of the Magnitude of the STFT of the Log-Envelope
    5.3.4 Unnormalized Energy in the Frequency Band Corresponding to the AM due to the Respiratory Muscles
    5.3.5 Ratios of Energy in Various Frequency Bands
  5.4 Time Domain Features
  5.5 Conclusions
Chapter 6 Regression and Prediction Using the AVEC MDD Database
    6.1.1 Introduction
    6.1.2 Training and Testing Procedure
  6.2 Average STFT Magnitude of the NLIE-Stockham and Hilbert-Stockham Envelopes
    6.2.1 NLIE-Stockham Envelope with a Maximum Frequency of 20 Hz
    6.2.2 Hilbert-Stockham Envelope with a Maximum Frequency of 20 Hz
    6.2.3 Stockham Envelope with a Maximum Frequency of 20 Hz
    6.2.4 NLIE-Stockham Envelope with a Maximum Frequency of 50 Hz
    6.2.5 Hilbert-Stockham Envelope with a Maximum Frequency of 50 Hz
  6.3 Variance of the Magnitude of the STFT of the Log-Envelope
    6.3.1 NLIE-Stockham
    6.3.2 Hilbert-Stockham
  6.4 Coefficient of Variation (CV) of the Magnitude of the STFT of the Log-Envelope
    6.4.1 NLIE-Stockham
    6.4.2 Hilbert-Stockham
  6.5 Unnormalized Energy in the Low Frequency Band
    6.5.1 NLIE-Stockham
    6.5.2 Hilbert-Stockham
  6.6 Energy Ratio
  6.7 Time-Domain Features
    6.7.1 Eigenvalues and Summary Statistic from NLIE Envelope
    6.7.2 Eigenvalues and Summary Statistic from Hilbert Envelope
  6.8 Conclusions
Chapter 7 Conclusions and Future Work
  7.1 Improvement to the Underlying Model
  7.2 Implementation of the Model
  7.3 Envelope Extraction
  7.4 Pre-Processing the Envelopes
  7.5 Features
Appendix A: Subjectively Rating Vocal Modulation
Appendix B: Beck Depression Inventory
Appendix C: Derivation of Equations for AM, FM, and FM-and-AM
Appendix D: Model of the Vocal Tract
Bibliography

Chapter 1 Introduction

Vocalizations are not constant, even during the phonation of a sustained vowel. Commonly studied modulations are jitter and shimmer, which are cycle-to-cycle variations in the fundamental frequency ($f_0$) and in amplitude, respectively. While the fundamental period ranges between approximately 3 and 20 ms, other modulations occur on longer timescales. In this thesis, modulations up to 50 Hz will be investigated. The modulations under investigation in this range can be periodic, quasi-periodic, or irregular.

Vocal tremor is a characteristic of many pathologies, some of which are due to anatomical or functional abnormalities (e.g. reflux laryngitis and laryngeal web) while others are neurological in origin (e.g. Parkinson's disease). Perceptually, vocal tremor is similar to vibrato (perceptually salient undulations in a singing voice) because both contain amplitude and frequency modulation [1][2][3]. Vocal tremor can arise from modulations in the respiratory system, the larynx (vocal folds), or the vocal tract cavity.
Although some studies describe the perceptual characteristics of vocal tremor [2], and quantitative features of tremor have been proposed [4], the relationship between the severity of tremor and its quantitative characteristics has yet to be adequately explored. Often thought to be periodic in nature, vocal tremor can be quasi-periodic or irregular [2]. Furthermore, there has been no attempt to separate out the effects of the three possible physiological origins of tremor on vocal amplitude and frequency modulation.

The focus of this thesis is on vocal modulations that occur in the time range of vocal tremor, without being limited to modulations that are periodic or quasi-periodic. For the remainder of this thesis, we therefore use the term "vocal modulation" as a generalization of tremor. This describes modulations in voice that can be periodic, quasi-periodic, or irregular, and may be less perceptually salient than true vocal tremor. Technically, rhythm, jitter, and shimmer are also vocal modulations, but they are beyond the scope of this thesis. The rhythm of running speech is not investigated because this thesis focuses on sustained vowels. Jitter and shimmer are not investigated because they occur on a shorter timescale than the focus of this thesis.

Differences in vocal modulation may be present in Major Depressive Disorder (MDD), a common mood disorder characterized by emotional, cognitive, and/or somatic abnormalities such as melancholia, feelings of guilt, and sleep abnormalities [5]. Numerous studies have shown that MDD causes changes in voice [6][7][8][9][10][11][12][13]. This thesis aims to apply vocal modulation features toward predicting the severity of subjects' MDD, under the hypothesis that subjects with more severe MDD have more erratic modulation in their voices.

1.1 Thesis Outline

The remainder of this thesis contains six chapters. Chapter 2 provides the background for the thesis.
It discusses previous work that characterizes the acoustics and physiology of vocal tremor, as well as prior work where modulation features are used to predict the severity of patients' Major Depressive Disorder (MDD). Chapter 3 discusses the perceptual origins of our main hypothesis: there is a relationship between the regularity of vocal modulation in a sustained vowel and the severity of a patient's MDD. Chapter 4 introduces our model of vocal modulation, as well as three demodulation methods. Chapter 5 describes the features we extract from a database of sustained /a/ vowels. We use the features from Chapter 5 to predict the severity of the subjects' MDD, as detailed in Chapter 6. We summarize our findings and outline future work in Chapter 7.

1.2 Contributions

This thesis accomplishes the following:

1. We create a model of vocal modulation that incorporates contributions from both the muscles of respiration and the laryngeal muscles. The modulation model is developed in the context of a sustained vowel, assuming that two components contribute to amplitude modulation (AM): 1) AM from the respiratory muscles, and 2) AM from interaction between formants and the FM from the fundamental frequency harmonics, i.e., a mapping of FM to AM. Modulation from both components is incorporated into the envelope of the waveform. This novel approach allows us to explore the contributions from each source of modulation independently.

2. We test the separability of the modulation from the two sources by implementing three envelope extraction techniques. The first technique is Stockham's method, where the logarithm of the magnitude of the signal is extracted [14]. The second technique involves computing the logarithm of the magnitude of the Hilbert envelope, and is called the Hilbert-Stockham method. The third method uses a nonlinear, iterative envelope (NLIE) estimation method, combined with the Stockham approach, referred to as the NLIE-Stockham method [15].
We find that the Hilbert-Stockham envelope and the nonlinear, iterative method enable improved separability compared to the Stockham envelope.

3. We extract frequency- and time-domain features that capture the modulation characteristics of the envelopes of the speech signals, and use those features to predict the severity of a patient's MDD. When the mean Beck depression severity score of all patients is used as the prediction for each session, the baseline mean absolute error (MAE) is 10.05, and the root mean square error (RMSE) is 11.86. In contrast, using our extracted modulation features, we obtain a decrease of 1.59 points in the MAE, from 10.05 to 8.46, and a decrease of 1.54 points in the RMSE, from 11.86 to 10.32. The Spearman correlation between the actual and predicted MDD severity scores is 0.487 (p < 0.001).

4. Our modulation model provides a framework to study other neurological disorders that affect vocal features. This may aid us in uncovering unseen abnormalities in the nervous systems of patients with various neurological disorders, as well as providing a foundation for early detection and monitoring of such disorders.

Chapter 2 Background

In the early 1960s, Brown and Simonson [14][15] studied patients with essential vocal tremor and reported qualitative characteristics associated with it. Essential tremor is a neurological disease of unknown etiology that causes involuntary movements in the upper body, most notably the hands and head, but may also cause vocal tremor. Brown and Simonson [15] noted the following:

"The speech symptom common to all of these patients was a disorder of the respiratory-phonatory phase of motor speech. The dysphonia, characterized by rhythmic alterations of pitch, and, particularly, of loudness of vowels and continuant consonants in each word, resulted in tremulous, quavering speech. Short phonated sounds could be uttered without impairment, but tremor appeared when the patients made the indicated sounds of longer duration.
It was most apparent when patients were asked to hold a sound such as 'ah' as long as possible."

This chapter aims to provide background on the acoustics and physiology of vocal modulations, and to review modulation features that have been used in Major Depressive Disorder (MDD) severity prediction.

2.1 Acoustics of Vocal Modulations and Vibrato

Parameters to characterize vocal vibrato have been outlined by Sundberg [1]. The same parameters have also been used to characterize vocal tremor [16][17][2][4]. Sundberg describes vibrato in terms of four parameters relating to frequency or amplitude modulation: the rate, extent, regularity, and similarity to a pure sinusoid. The first two are depicted in Figure 1 [1].

Figure 1: Vibrato period, rate, and extent [1]. (Plot annotations: RATE = 1/PERIOD and EXTENT; horizontal axis: time in ms.)

In Figure 1, the vertical axis is the deviation from the mean frequency. The rate refers to the number of periods per second, and the extent refers to the difference between the peak value and the mean value of the frequency. The regularity refers to the differences in the cycle-to-cycle variations in the frequency, and is also known as jitter [1].

The two types of modulation that have been explored are frequency modulation (FM) and amplitude modulation (AM). The remainder of this sub-section discusses both.

2.1.1 FM in Vocal Modulations

FM in the fundamental frequency ($f_0$) occurs by changing the rate at which the vocal folds adduct and abduct, but it can also result from AM originating subglottally. An increase in the fundamental frequency of 0.5-6 Hz per cm H2O of subglottal pressure has been predicted based on stress-strain curves derived from fundamental frequency and length measurements on human subjects [18]. Therefore, a change in the fundamental frequency would be predicted to occur as a result of subglottal changes in AM.

Numerous frequencies of FM have been reported.
Winholtz and Ramig [16] created the "Vocal Demodulator", which demodulates $f_0$ to obtain both an amplitude contour and a frequency contour, and measures the amplitude and frequency of each contour. They tested the Vocal Demodulator on subjects with vocal tremor due to Parkinson's disease, essential tremor, spasmodic dysphonia, and spinal muscular atrophy. They reported variations in $f_0$ between 4 Hz and 6.9 Hz.

Other studies have reported different frequencies at which FM occurs [16]. Aronson et al. [19] reported that when patients with amyotrophic lateral sclerosis (ALS) utter a sustained vowel, peaks in the spectrum of $f_0$ range from 1.1 Hz to 23.9 Hz. Their results from control subjects are similar: spectral peaks were observed between 1.1 Hz and 25 Hz. However, the extents of the spectral peaks of the FM differed between the ALS patients and the controls: compared to a spectral amplitude of 6.9-134.8 mV in the controls' peaks due to FM, the ALS patients' spectral amplitudes for the FM in their vowels ranged between 17.7 and 637 mV [19]. Kreiman et al. claim lower frequencies in the FM of $f_0$: they state that the range is between 2 and 12 Hz for both normal and pathological phonation [2].

2.1.2 AM in Vocal Modulations

Unlike those of FM, the origins of AM in an acoustic signal are more complicated. AM in an acoustic signal originates from at least one of three different sources: the $f_0$ variation itself, oscillations in subglottal pressure, and/or the vocal tract shape [1].

Acoustic theory explains how $f_0$ variation over time can cause AM, assuming that the voice source and formant frequencies are constant [20]. In the frequency domain, when voicing is present, the glottis produces multiples of the fundamental frequency, the harmonics, as shown in the top panel of Figure 2. These are represented by the magnitude of the source spectrum, $|S(f)|$.
Figure 2: Spectra of the source $|S(f)|$, vocal tract transfer function (VTTF) $|T(f)|$, radiation characteristic $|R(f)|$, and output $|P_r(f)|$, each plotted against frequency (kHz) (modified from [21]).

The shape of the vocal tract has a transfer function, $T(f)$, that varies depending on the sound produced, as shown in the second panel of Figure 2. The formants are the peaks in $|T(f)|$. The third panel shows the radiation characteristic. The output, which is the sound pressure $P$ at a given distance $r$, denoted by $P_r(f)$, is equal to the product of the three quantities:

$$P_r(f) = S(f)\,T(f)\,R(f).$$

Consequently, the output is a harmonic series of filtered glottal pulses at various amplitudes.

One source of AM is the interaction between the changing harmonics of $f_0$ and the formants. The observed amplitude of the formants depends on $f_0$ because the spacing of the fundamental frequency determines where the vocal tract transfer function (VTTF) is "sampled" by the glottal pulse. If the VTTF is sampled near its peaks, the formants appear louder; if it is sampled away from its peaks, the formants appear quieter. Thus, if $f_0$ changes and a harmonic of $f_0$ becomes closer to the first peak of the VTTF, the first formant will be louder; the opposite happens if $f_0$ changes and a harmonic becomes farther away from the peak of the VTTF. Therefore, AM can be a secondary effect of FM [20]. Both cases are illustrated in Figure 3 (modified from [1]).

Figure 3: Relationship between frequency and amplitude modulation (modified from [1]). (Panels: Case A, mean $f_0$ = 100 Hz; Case B, mean $f_0$ = 104.1 Hz; Case C, mean $f_0$ = 107 Hz. Left column: harmonic spectra against frequency in Hz. Right column: $f_0$ and intensity of the formant against time in seconds.) In each of the figures in the left column, the peak of a hypothetical VTTF is at 415 Hz (ignoring other formants). Each of the stems in the plots represents a harmonic of $f_0$, which varies sinusoidally about a mean, shown in the red plots in the right-hand column. The harmonics of the mean $f_0$ are the black stems, the minimum of each harmonic is shown in green, and the maximum of each harmonic is magenta. In Case A, $f_0$ = 100 Hz at the first time instant, moves sinusoidally up to 103 Hz, and then down to 97 Hz, as shown in the top panel of the right-hand column. In the left-hand column of Case A, the mean harmonic closest to the peak of the VTTF is at 400 Hz (black stem). As it increases to 412 Hz (magenta stem), the intensity increases. When it decreases, the opposite occurs. As a result, the intensity of the formant moves in phase with the harmonic when the location of the harmonic is on the increasing side of the peak of the VTTF. In Case C, when the mean harmonic is on the decreasing side of the peak of the VTTF (here, it is 428 Hz), the intensity of the formant moves out of phase with the motion of the harmonic. In Case B, when the maximum and minimum harmonics nearest to the peak of the VTTF are on opposite sides of the peak of the VTTF, the intensity of the formant varies with twice the frequency of $f_0$.

In Case A of Figure 3, when $f_0$ increases by a few Hz, the spacing becomes larger, and the second harmonic becomes closer to the first formant. Therefore, as the frequency increases, the energy also increases.
In Case C, an increase of a few Hz in the fundamental frequency will push the third harmonic farther away from the peak of the VTTF (in this case, the second harmonic will also be pushed closer to the peak of the VTTF, but the difference seen in the third harmonic will be greater than that in the second). In Case B, the harmonic of $f_0$ might also oscillate symmetrically about the peak of the VTTF, causing the frequency of the amplitude modulation to be twice that of the frequency modulation [1].

A second source of AM originates from oscillations in subglottal pressure. In certain types of vocal modulation, oscillations in the subglottal pressure cause the voice source to vary in amplitude. Consequently, although the frequencies of the harmonics of $f_0$ remain fixed, the amplitudes of the harmonics change together over time, causing AM [1].

A third type of AM occurs when the shape of the vocal tract changes. For example, the tongue and pharynx can move. When this occurs, the formants of the vocal tract change, and AM occurs, for example, when the harmonics are fixed and the formants move through the harmonics. In a sense, this is the "converse" of the cases illustrated in Figure 3.

In addition to FM causing AM, AM has been predicted to cause FM. Experimentally, a 1 cm H2O increase in pressure changes $f_0$ by 1-10 Hz, and the amount of $f_0$ change per unit pressure increase differs based on the vocal register. An increase in $f_0$ of 0.5-6 Hz per cm H2O of subglottal pressure has been predicted based on stress-strain curves derived from $f_0$ and length measurements on human subjects [18]. Therefore, a change in $f_0$ would also be predicted to occur as a result of AM.

Like the frequencies of FM in vocal tremor, the frequencies of AM are also a source of debate. Aronson et al. [19] report that patients with ALS and controls had similar ranges of peaks in their spectra from the AM: 1.1 Hz to 25 Hz and 1.1 Hz to 24 Hz, respectively.
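The formant-sampling mechanism of Figure 3 (Cases A to C) can be checked numerically. The sketch below is only an illustration under stated assumptions, not the model developed in this thesis: the single resonance is a hypothetical Lorentzian-shaped peak at 415 Hz with an assumed 60 Hz bandwidth, the tracked harmonic is assumed to be the fourth (400 Hz for a mean $f_0$ of 100 Hz, as in the Figure 3 caption), $f_0$ swings by 3 Hz about its mean, and the 5 Hz modulation rate is an arbitrary choice. All function names are invented for the example.

```python
import math

def resonance_gain_db(freq_hz, peak_hz=415.0, bandwidth_hz=60.0):
    """Gain in dB of one hypothetical resonance (Lorentzian-shaped; an assumption)."""
    detune = (freq_hz - peak_hz) / (bandwidth_hz / 2.0)
    return -10.0 * math.log10(1.0 + detune * detune)

def simulate_case(mean_f0_hz, harmonic=4, extent_hz=3.0, rate_hz=5.0,
                  sample_rate_hz=1000, duration_s=1.0):
    """Return (f0 contour, intensity in dB of the harmonic nearest the peak)."""
    n = int(sample_rate_hz * duration_s)
    f0 = [mean_f0_hz
          + extent_hz * math.sin(2.0 * math.pi * rate_hz * i / sample_rate_hz)
          for i in range(n)]
    intensity = [resonance_gain_db(harmonic * f) for f in f0]
    return f0, intensity

def count_peaks(x):
    """Count strict interior local maxima of a sampled contour."""
    return sum(1 for i in range(1, len(x) - 1) if x[i - 1] < x[i] > x[i + 1])

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def summarize(mean_f0_hz, duration_s=1.0):
    """Return (AM rate in Hz, correlation between f0 and intensity)."""
    f0, intensity = simulate_case(mean_f0_hz, duration_s=duration_s)
    return count_peaks(intensity) / duration_s, pearson(f0, intensity)

if __name__ == "__main__":
    for label, mean_f0 in (("A", 100.0), ("B", 104.1), ("C", 107.0)):
        rate, corr = summarize(mean_f0)
        print(f"Case {label}: AM rate = {rate:.1f} Hz, corr(f0, intensity) = {corr:+.2f}")
```

Under these assumed values, the intensity follows the 5 Hz FM rate in Cases A and C (in phase and out of phase with $f_0$, respectively) and shows two intensity peaks per FM cycle in Case B, where the harmonic crosses the resonance peak.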
Winholtz and Ramig [16] report AM rates between 4.5 Hz and 12.6 Hz.

2.1.3 Periodicity in Vocal Tremor

Kreiman et al. tested whether vocal tremor is periodic or irregular by creating two models of tremor and asking expert listeners to rate the similarity of the synthesized waveforms to the true voice samples. The first model consisted of an FM source in which $f_0$ was modulated by a sine wave, while $f_0$ in the second model was modulated by an irregular waveform. Based on the ratings of the expert listeners, they found that in general, both tremor models provided "excellent" matches to the original voices. This suggests that vocal tremor can be periodic or irregular.

One of the shortcomings of this study is that it does not address AM; it only addresses FM. Kreiman et al. argue that they did not address AM because most AM is an artifact of FM. Given that FM in $f_0$ causes AM, it is possible that the listeners were using AM cues from frequency modulation, as described earlier in this section.

2.2 Physiology of Vocal Tremor

Rhythmic changes in the movements of the muscles of the larynx, respiratory muscles, and muscles of the supraglottal vocal tract might be responsible for producing modulations in pitch, loudness, or both. Our focus in this thesis is on the muscles of the larynx and the respiratory muscles.

The larynx is the organ that is responsible for producing voicing, which occurs when the vocal folds adduct and abduct. Changing the length of the vocal folds changes the pitch. The muscle responsible for vocal fold lengthening is the cricothyroid muscle, while the thyroarytenoid (TA) muscles shorten the vocal folds. The posterior cricoarytenoid (PCA) muscle abducts the vocal folds, while the interarytenoid (IA) and lateral cricoarytenoid (LCA) muscles adduct the vocal folds. When the cricothyroid muscle contracts, the vocal folds lengthen. This increases the tension of the vocal folds, which increases their rate of vibration. This is measured as an increase in $f_0$.
The opposite occurs when the TA muscles contract: the vocal folds shorten, decreasing the tension of the vocal folds, resulting in a decrease in f0 [22] [23]. When f0 changes, its harmonics change, resulting in modulations in amplitude by formant sampling as described in Section 2.1.

Respiratory tremor is also thought to cause modulations in f0 and intensity. The lungs provide the air pressure to the larynx during expiration. When tremor affects the chest wall, the motion of the chest wall is modulated, causing modulations in the glottal flow and therefore in the pressure. Lester and Story [24] simulated respiratory tremor in healthy adults by mechanically compressing subjects' chest walls, and found that modulations of f0 and intensity occurred.

The muscles of the supraglottal vocal tract can also be affected by tremor. These muscles modulate the length and width of the various parts of the vocal tract. When the shape of the vocal tract changes, the resonant frequencies, or formants, also change. Therefore, tremor in the vocal tract would be expected to change the formant frequencies. As the formant frequencies change, the harmonics of f0 will "sample" them at different locations, so the intensity of the voice signal would also be expected to change [22].

2.3 Prior Work on Modulation Features of Vocal Tremor

Many studies have measured both the amplitude and frequency modulation of vocal tremor in the voice signal. This section aims to describe the methodology and results of relevant experiments.

Vocal Demodulator: The Vocal Demodulator, developed by Winholtz and Ramig [16], was created to quantify the amplitude and frequency modulation components of vocal tremor. The envelope of the entire signal is extracted, and the frequency and amplitude of that signal are analyzed. A similar procedure is performed, but for the frequency contour of f0. Next, spectral analysis is performed on the frequency and amplitude contours.
The amplitude and frequency levels were measured over 0.5-second intervals based on full-wave rectification of the demodulated signals. The following equation was used to compute the amplitude modulation level:

    Amplitude Modulation Level (%) = \frac{V_{max} - V_{min}}{V_{max} + V_{min}}

where V_{max} is the maximum voltage measured in the amplitude envelope of the signal, and V_{min} is the minimum voltage measured in the amplitude envelope of the signal. The following equation was used to compute the frequency modulation level:

    Frequency Modulation Level (%) = \frac{f_{0,deviation}}{f_{0,mean}}

where f_{0,deviation} is the peak-to-peak variation in f0 when the f0 contour is taken, and f_{0,mean} is the mean f0 in the signal.

To test the Vocal Demodulator, the authors recorded the /a/ vowel from individuals with vocal tremor, individuals without a history of neurological or phonatory disorders, and singers who sang with vibrato. To investigate the rate of modulation in AM and FM, they used six target frequencies (3 Hz, 6 Hz, 9 Hz, 12 Hz, 15 Hz, and 18 Hz) and correlated those frequencies with the demodulated frequencies from the subjects. To verify their measurements for the level of amplitude and frequency modulation, they again performed correlations. The target levels they used were 5%, 10%, 15%, 20%, and 25% for amplitude modulation, and 1%, 2.5%, 7.5%, and 10% for frequency modulation. When they compared the groups of subjects, they found that the AM and FM rates of modulation were higher within the tremor and control subjects than within the vibrato group; the median levels and ranges of AM were significantly larger within the tremor and vibrato groups than within the control group; the median extents of FM in the tremor and vibrato groups were higher than in the control group; and the range of FM was largest for the tremor group [16].

Multi-Dimensional Voice Program: Currently, the Multi-Dimensional Voice Program (MDVP) is a clinical tool that can be used to measure vocal tremor.
It provides the following measurements, which are similar to the measurements provided by Winholtz and Ramig [16]:
1) Fatr (frequency of amplitude tremor, measured in Hz): frequency of the strongest low-frequency AM component within a specified range.
2) Fftr (frequency of frequency tremor, measured in Hz): frequency of the strongest low-frequency f0-modulating component within a specified range.
3) ATrI (amplitude tremor intensity index, measured in %): average ratio of the strongest low-frequency AM component to the amplitude of the signal, within a specified range:

    ATrI = 100 \cdot \frac{absATrI}{A}

where absATrI represents the absolute tremor intensity in Pascals, and A represents the mean amplitude in Pascals.
4) FTrI (frequency tremor intensity index, measured in %): average ratio of the strongest low-frequency f0-modulating component to the mean f0 of the signal, within a specified range:

    FTrI = 100 \cdot \frac{absFTrI}{\bar{f}_0}

where absFTrI represents the absolute tremor intensity in Hz, and \bar{f}_0 represents the mean f0, also in Hz [4][25].

Two other measurements have been proposed by Brückl [4]:
1) Frequency tremor power index (FTrP):

    FTrP = FTrI \cdot \frac{FTrF}{FTrF + 1}

where FTrF is the frequency of the frequency tremor, measured in Hz, and
2) Amplitude tremor power index (ATrP):

    ATrP = ATrI \cdot \frac{ATrF}{ATrF + 1}

where ATrF is the frequency of the amplitude tremor, measured in Hz.

The power indices weight the intensity indices by the tremor frequency: for a given intensity, a lower tremor frequency results in a lower index, and a higher tremor frequency results in a greater index.

Praat Software Tool: Brückl [4] provides code that runs in an acoustic analysis program called Praat [26], which computes the autocorrelation of the amplitude and frequency contours to determine the tremor frequency. FTrI and ATrI are subsequently computed based on the mean maximum and mean minimum of the contours, and the other four measurements are computed based on FTrI, ATrI, and the frequencies of the tremor.
Brückl validated the algorithm on sounds with given parameters [4]. Some of the drawbacks associated with this algorithm are the inherent limitations of Praat's pitch estimator, and the irregularities in vocal tremor, noted by Kreiman et al. [2], which are not taken into account by Brückl's algorithm.

AM-FM Decomposition Algorithm: Another method has been proposed to extract vocal tremor. The method proposed by Pantazis et al. [27] uses an AM-FM decomposition algorithm. The advantage of this method is that it adapts to nonstationary signals. Their method can also be used to analyze any frequency component of the speech signal, not only the first harmonic. Their method works by first demodulating the signal using the AM-FM decomposition algorithm, subtracting out modulations less than 2 Hz using a smoothing filter, and then estimating the modulation frequency and modulation level, which are time-varying attributes. Pantazis et al. [28] validated the method on normophonic subjects, but it has yet to be tested on pathological subjects or on a population with vocal tremor.

Modulation Spectrogram: Using various techniques to reduce the dimensionality of the modulation spectrogram [30], Cummins et al. [29] classified various levels of depression in both the two-class case and the five-class case, both with and without log-mean subtraction [31] in an attempt to isolate the dynamics of the supraglottal vocal tract. The modulation spectrogram is the Fourier transform of the temporal trajectory of each frequency channel in the spectrogram [29]. It is defined by:

    X(\theta, \omega) = \sum_{n=0}^{N-1} \sum_{m=0}^{M-1} x[n, m] e^{-j(\theta n + \omega m)}

where x[n, m] is a short-term speech segment, n is the frame index, m is the time index, \omega is the acoustic frequency, and \theta is the modulation frequency [29].

To classify the MDD severity of the subjects, Cummins et al. partitioned the data into two classes, and then into five classes.
When partitioning the data into two classes, the first class contained sessions with utterances from patients who were not depressed, mildly depressed, or moderately depressed. The second class contained patients who were moderately-severely depressed, severely depressed, and very severely depressed. The classes in the five-class case were the following: normal, mild depression, moderate depression, severe depression, and very severe depression. The best classification accuracies Cummins et al. obtained using Gaussian Mixture Models (GMMs) were 66.9% for the two-class case and 36.4% for the five-class case, suggesting that there is a link between the modulation characteristics and the depressed voice. The modulation spectrogram in this case was derived with the use of a 24-element gammatone filter bank using a 35-subject depression database. The spectral analysis of the temporal envelope of each filter bank output was used as a basis for modulation features at short and long time scales. This modulation characterization reflects the various sources of tremor discussed above but makes no attempt to represent individual components or origins of AM and FM. This system also analyzed only the phrase "pa-ta-ka."

2.4 Conclusions

Vocal tremor can be due to at least one of three sources: (1) interactions between the harmonics of f0 and the formants, (2) oscillations in subglottal pressure, and (3) movement of the formants through the harmonics. The rate and depth of both AM and FM can be used to characterize the modulation [1], although the modulation can be irregular [2]. The modulation spectrogram has aided in classification of the severity of MDD [29]. Therefore, it is possible that other modulation characteristics can further aid in the prediction of a patient's MDD severity.

Chapter 3 Relationship between Perceived Amount of Vocal Modulation and MDD

Cummins et al. [29] showed that features extracted from the modulation spectrogram aided in classifying MDD.
When classifying depressed and non-depressed subjects, the highest percentage of correct classification they achieved was 66.9%. Our goal is to further motivate the hypothesis that the modulation characteristics of more depressed individuals are different from the modulation characteristics of less depressed individuals.

3.1 Experimental Set-Up

Seven employees in MIT Lincoln Laboratory's Bioengineering Systems and Technology group listened to and visually inspected the spectrograms and waveforms of 25 randomly selected /a/ vowels at a comfortable loudness level. The vowels were selected from the training set of the Audio/Visual Depression and Emotion Challenge (AVEC 2013) [32], which is described in Section 5.1. For each vowel, the raters evaluated the following vocal characteristics:
1) Aurally perceived vocal modulation. The raters listened to the recording without viewing the spectrogram or waveform. They rated the vocal modulation on a scale from 1 to 5, where 1 indicated very little or no vocal modulation, and 5 indicated severe vocal modulation. Examples of recordings with minuscule or absent modulation and with significant modulation were provided for reference.
2) Visually perceived sub-harmonics. The raters viewed the spectrogram to evaluate the presence of sub-harmonics, rating between 1 and 5. (See Appendix A for details regarding the process used to rate the sub-harmonics.)
3) Visually perceived FM. The raters viewed the spectrogram to evaluate the amount of change in the harmonics of the fundamental frequency. (See Appendix A for details regarding the process used to rate the FM.)
4) Visually perceived AM. The raters viewed the waveform to evaluate the amount of AM. (See Appendix A for details regarding the process used to rate the AM.)

The raters were instructed to perform one task on all of the recordings before proceeding to the next task. This was necessary to reduce bias.
If raters had performed all four tasks on one recording before proceeding to the next recording, it is possible that their judgments could have been influenced by prior evaluations of the same recording. For example, if raters had heard a large amount of vocal modulation but did not visually perceive a large amount of FM in the same recording, they could have been unintentionally biased by their previous assessment of the vowel and indicated the presence of a greater amount of FM than they would have indicated if they had viewed the FM without hearing the utterance.

3.2 Results

The results suggest that both AM and FM characteristics may be useful in predicting patients' Beck scores. The Beck Depression Inventory is commonly used to assess the severity of depression, and is described further in Appendix B. The scores from each of the raters were combined and the Spearman correlation between each of the subjective ratings and the Beck score was computed. The results are shown in Table 1.

Table 1: Spearman ρ values (and p-values in parentheses) among Beck score, aurally perceived modulation severity, visually perceived sub-harmonics, visually perceived FM, and visually perceived AM. Statistically significant correlations, accounting for the Bonferroni correction, are marked with an asterisk.

                       Modulation Severity   Sub-Harmonics       FM                  AM
Beck Score             0.235* (p=0.00178)    0.018 (p=0.809)     0.265* (p<0.001)    0.268* (p<0.001)
Modulation Severity                          0.262* (p<0.001)    0.324* (p<0.001)    0.345* (p<0.001)
Sub-Harmonics                                                    0.140 (p=0.0640)    0.284* (p<0.001)
FM                                                                                   0.280* (p<0.001)

Accounting for the Bonferroni correction, and assuming a desired significance level of 0.05, the Beck score has a statistically significant correlation with the aural perception of modulation severity, and the visual perceptions of AM and FM. The Spearman correlation between the Beck score and aurally perceived vocal modulation is weak but statistically significant: ρ=0.235 (p=0.00178).
There is also a statistically significant correlation between visually perceived FM and Beck score (ρ=0.265, p<0.001), and between visually perceived AM and Beck score (ρ=0.268, p<0.001).

In addition to the Spearman correlations between each of the perceptual ratings and the Beck score, there are also statistically significant correlations among the perceptual ratings. The greatest of these are the correlation between visually perceived AM and aurally perceived modulation severity (ρ=0.345, p<0.00001), and between visually perceived FM and aurally perceived modulation severity (ρ=0.324, p<0.0001). There are also statistically significant correlations between visually perceived sub-harmonics and visually perceived AM (ρ=0.284, p=0.0001), visually perceived AM and visually perceived FM (ρ=0.280, p=0.0002), and visually perceived sub-harmonics and aurally perceived vocal modulation severity (ρ=0.262, p=0.005).

3.3 Conclusions

The co-occurrence of visually perceived FM and visually perceived AM is expected because it is consistent with Sundberg [1]. The statistically significant correlations between visually perceived FM and aurally perceived modulation severity, visually perceived AM and aurally perceived modulation severity, aurally perceived modulation severity and Beck score, visually perceived FM and Beck score, and visually perceived AM and Beck score indicate that attempting to characterize the FM and AM in the sustained vowels may lead to improved automated Beck score prediction.

Chapter 4 Model

In this chapter, we propose a model for vocal modulation based on the source-filter theory of speech [21]. The premise of the model is that in the frequency domain, AM and FM can be separated in a signal by viewing the logarithm of the signal's envelope, whereby FM is mapped to an AM contribution. Subsequently, the AM and FM contributions can be used to obtain features to predict the MDD severity of the subjects.
To simplify the model, only a single rate and extent of modulation for both AM and FM are represented, although this can be generalized to multiple rates and extents. For example, the AM in the model occurs at a single frequency of 4 Hz, and the FM in the model occurs at a single rate of modulation of 7 Hz. Several assumptions are also made based on the anatomy and physiology of speech production, as well as prior research on the acoustics and physiology of vocal tremor and vibrato. One of the assumptions is that the frequency of the AM is lower than the frequency of the FM, where AM arises from respiratory muscles and FM arises from laryngeal muscles.

To test the model, three envelope types are analyzed: the Stockham envelope, the Hilbert-Stockham envelope, and the Nonlinear, Iterative Envelope (NLIE). Three test signals are created and the three envelope types are extracted from each. The first signal type consists of a source with both AM and FM as inputs to a synthetic vocal tract. The second signal type consists of a source with AM only, and the third of a source with FM only.

4.1 Motivation for the Model and Processing

4.1.1 Motivation for an AM and FM Model During a Sustained Vowel

As noted in Chapter 2, the anatomical structures that can contribute to modulation in a sustained vowel are the muscles of respiration, the extrinsic and intrinsic laryngeal muscles, and the supraglottal vocal tract. All are involved during the utterance of a vowel: the muscles of respiration act as pressure sources, forcing air through the glottis [21]. If the subglottal pressure exceeds a threshold, the vocal folds oscillate, and phonation occurs [33][34]. The supraglottal vocal tract adds "color" to the sound by introducing formants. Oscillations in any of the structures throughout the vocal tract can cause AM and/or FM.

To apply constraints to the model, the following were assumed in the AM-FM model:

1. The muscles of respiration are responsible for AM only [24].
As the subglottal pressure increases, in general, the amplitude of the vibration of the vocal folds also increases [18], resulting in an increase in sound intensity [35]. An increase in subglottal pressure has been theorized to produce an increase in f0 of 2-6 Hz/cm H2O [18] and has been experimentally shown to produce both intensity and fundamental frequency increases under the simulated condition of respiratory-induced vocal tremor [24], as discussed in Chapter 2. However, for simplicity, the model assumes that the muscles of respiration produce AM only. The frequency of the AM from the muscles of respiration is assumed to be around 5 Hz, as this is the frequency used by Farinella et al. [36] and Lester and Story [24] when determining the relationship between respiratory oscillation and perception of vocal tremor.

2. The AM component contributed by the respiratory muscles is slower than the FM component contributed by the intrinsic and extrinsic muscles of the larynx. This is assumed because air from the lungs is expelled over time as the vowel is held. There is some auditory feedback that occurs: when the subjects hear themselves sounding quieter, they compensate by increasing the loudness of the phonation. This is assumed to be slower than the AM due to the interaction between the harmonics of f0 and the formants.

3. The extrinsic and intrinsic muscles of the larynx are responsible for FM in the 2-12 Hz region. Although the muscles involved with vibrato have been identified, the muscles involved in vocal tremor have not, and may be different from those in vibrato [37]. Shipp et al. [38] found that while vibrato can be mediated by either the abdominal muscles or by the larynx, mainly from the cricothyroid muscle, the latter source of vibrato appears to be more common. The sources of vibrato are assumed to be mutually exclusive [38], and it is assumed that these observations carry over to vocal modulation in a held vowel as well.
The frequency of the AM and FM from the muscles in this region is hypothesized to occur in the 2-12 Hz region, a range of frequencies of vocal tremor proposed by Kreiman et al. [2].

4. The formants are constant. In the simulated /a/ vowel, the vocal tract and thus the formants are assumed to be constant.

5. As the pitch changes, AM also occurs from the resonance-harmonics interaction. This is described in Chapter 2.

4.1.2 Motivation for Processing the Output of the Model

The goal is to be able to separate the envelope e[n] of a speech signal into two components: the envelope due to AM, eA[n], and the envelope due to the resonance-harmonics interaction, eF[n]. In the case of vocal modulation, both eA[n] and eF[n] are assumed to be slowly-varying relative to the formants, and are assumed to be the only elements in the envelope e[n]:

    e[n] = e_A[n] e_F[n].    (1)

Linear system analysis can be applied to multiplicative signals where one signal is slowly-varying and the other is quickly-varying. This is accomplished by taking the natural logarithm of the magnitude of the signals, resulting in the sum of the logarithms [39]. According to Stockham [39], an approach to model an acoustic signal, y[n], is to let it be the product of a slowly-varying envelope, e[n], where e[n] > 0, and a quickly-varying signal, v[n]:

    y[n] = e[n] v[n].    (2)

If the natural logarithm of the magnitude of both sides is taken, the logarithm of the composite signal becomes the sum of the logarithms of the magnitudes of e[n] and v[n]:

    \log(|y[n]|) = \log(|e[n]|) + \log(|v[n]|).    (3)

Since the Fourier transform is a linear operation, the Fourier transform of log(|y[n]|) is a linear combination of the Fourier transforms of the components in Eq. 3. One of the goals of this thesis is to separate e[n] from a vowel into its AM and FM components, eA[n] and eF[n] (Eq. 1). Stockham's technique provides a basis for this to be accomplished.
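The additivity in Eq. 3 can be checked numerically. The following is a minimal sketch, not the thesis implementation: the 8 kHz sampling rate, the 4 Hz envelope with depth 0.2, the 200 Hz cosine standing in for v[n], and the helper name `envelope_peak_hz` are all assumptions chosen for illustration.

```python
import numpy as np

fs = 8000                      # sampling rate in Hz (assumed for illustration)
n = np.arange(3 * fs)          # sample indices of a 3-second signal

# Slowly-varying, strictly positive envelope e[n]: 4 Hz AM with depth 0.2 (assumed values).
e = 1.0 + 0.2 * np.cos(2 * np.pi * 4.0 * n / fs)
# Quickly-varying component v[n]: a 200 Hz cosine; the phase offset keeps
# the samples away from exact zeros, so the logarithm stays finite.
v = np.cos(2 * np.pi * 200.0 * n / fs + 0.3)
y = e * v                      # Eq. 2

log_y = np.log(np.abs(y))                          # left-hand side of Eq. 3
log_sum = np.log(np.abs(e)) + np.log(np.abs(v))    # right-hand side of Eq. 3


def envelope_peak_hz(x, fs, f_lo=1.0, f_hi=50.0):
    """Return the frequency (Hz) of the largest DFT magnitude between f_lo and f_hi."""
    x = x - np.mean(x)                                     # remove DC
    spectrum = np.abs(np.fft.rfft(x * np.hamming(len(x)))) # Hamming window over the whole signal
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    return freqs[band][np.argmax(spectrum[band])]


print(np.allclose(log_y, log_sum))   # checks the additivity stated in Eq. 3
print(envelope_peak_hz(log_y, fs))   # dominant low-frequency component of log|y[n]|
```

In this toy case log(|v[n]|) is periodic at a much higher rate than the envelope, so its spectral lines fall well outside the 0-50 Hz band, and the low-frequency part of the spectrum of log(|y[n]|) is governed by log(|e[n]|).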
4.2 Implementation of the Model

For illustration, eA[n] is chosen to be sinusoidal, with a given depth of modulation (aa) and rate of modulation (fa). The AM envelope from the respiratory muscles, eA[n], multiplies a source harmonic signal, the FM of which is produced by the intrinsic and extrinsic muscles of the larynx. The FM is more generally non-sinusoidal, but a sinusoidal FM signal is also used here for illustration. In the model of a vowel with both frequency modulation and amplitude modulation, five inputs are required:
1) Fundamental frequency, f0
2) Rate of FM, ff
3) Depth of FM, af
4) Rate of AM, fa
5) Depth of AM, aa

The FM source signal is denoted by pF[n]. The result of the multiplication is the FM-and-AM pulse train, pAF[n]:

    p_{AF}[n] = e_A[n] p_F[n].    (4)

The FM-and-AM pulse train, pAF[n], is then convolved with the impulse response of the vocal tract, h[n], which is configured to model the /a/ vowel. This vowel is chosen for two reasons: the third formant is far from the first and second formants, and it is the vowel that is uttered in the AVEC database from which features are derived, as described in Chapters 5 and 6. The output of the vocal tract is xAF[n]. A block diagram of the model is depicted in Figure 4.

Figure 4: Model. The inputs are the fundamental frequency (f0), rate of FM (ff), depth of FM (af), and AM envelope (eA[n]). The output of the model is xAF[n]. The AM envelope eA[n] requires two inputs: a rate (fa) and an extent (aa) of modulation.

The AM from the respiratory muscles, eA[n], is expressed by the following, where aa is the depth of AM and fa is the rate of the AM:

    e_A[n] = 1 + a_a \cos(2\pi f_a n / f_s).    (5)

Substituting Eq. 5 into Eq. 4 and using the derivation for the FM described in Appendix C, pAF[n] is expressed as:

    p_{AF}[n] = [1 + a_a \cos(2\pi f_a n / f_s)] \sum_k \cos(2\pi k f_0 n / f_s + (a_f k / f_f) \sin(2\pi f_f n / f_s)).    (6)

The AM-and-FM-modulated signal pAF[n] is then input to the vocal tract, hv[n].
The equations used to model the vocal tract are described in Appendix D.

For example, Figure 5 shows the output from the vocal tract, xAF[n], with parameters f0=200 Hz, ff=7 Hz, af=10 Hz, aa=0.2, and fa=4 Hz. The formant frequencies in this example are 820, 1220, and 2810 Hz, and the formant bandwidths are 125, 125, and 250 Hz.

Figure 5: xAF[n] in the time domain, with f0=200 Hz, ff=7 Hz, af=10 Hz, aa=0.2, and fa=4 Hz. The formants are 820 Hz, 1220 Hz, and 2810 Hz, with bandwidths of 125 Hz, 125 Hz, and 250 Hz.

There is clearly a slow envelope component at 4 Hz, which results from the respiratory muscles (eA[n]), and a higher-frequency envelope component, which results from the interaction between the harmonics and the formants (eF[n]). However, when a 1-second Hamming window is applied to xAF[n] to reduce the spectral sidelobes, and the mean of the magnitude of the Short-Time Fourier Transform (STFT) is computed, it is clear that demodulation needs to occur in order to extract the frequency content of the envelope. Three processing methods to obtain the envelope were explored: the Stockham-only method, the Hilbert-Stockham method, and the nonlinear iterative envelope (NLIE) method. The next section details each of them.

4.3 Envelope Component Estimation

The first processing scheme is a direct application of Stockham's method for estimation of FM and AM signals, with the exception that a bandpass filter is used immediately after the synthesis of the /a/ vowel.

4.3.1 Stockham's method

In a speech waveform, the majority of the energy is concentrated around harmonics of f0 when the frequencies of the harmonics are near a formant frequency. The purpose of the bandpass filter is to isolate the spectral region where FM is accentuated around the third formant.
We chose to isolate the third formant instead of one of the first two formants because the multiple of the modulated f0 through the third formant has a greater depth of FM than the multiple of the modulated f0 through one of the two lower formants, as described in Chapter 2.

Figure 6: Block diagram of Stockham's method and envelope generation. The turquoise box denotes the model. In the mathematical description, the purple boxes, Hv(f) and Hb(f), are combined to form H(f). The time-domain signal that is the output from Stockham's method is log(|bAF[n]|).

A block diagram for the method is detailed in Figure 6. Here, the vocal tract and bandpass filter can be multiplied together to create a new filter to replace the individual filters. Let this filter be H(f), where

    H(f) = H_v(f) H_b(f).    (7)

The rapidly-varying component of the speech signal, v[n], is composed of the convolution of two components, p̃[n] and h[n]:

    v[n] = \tilde{p}[n] * h[n].    (8)

Since the envelope of the signal is e[n], p̃[n] is a series of impulses that, when convolved with h[n], yield a signal that has a flat amplitude. In other words,

    \tilde{p}[n] * h[n] \approx \frac{p_{AF}[n] * h[n]}{e[n]}.    (9)

Since pAF[n] is implicitly flat, the spacing of the impulses of p̃[n] is equivalent to the spacing of the impulses in pAF[n], but the amplitude of the pulses is different: the pulses in pAF[n] have the same amplitude, but in p̃[n], the amplitudes of the pulses are different. Based on Figure 6, bAF[n] can be expressed by:

    b_{AF}[n] = p_{AF}[n] * h[n].    (10)

Substituting Eq. 10 into Eq. 9, Eq. 11 is obtained:

    b_{AF}[n] \approx e[n] (\tilde{p}[n] * h[n]).    (11)

Figure 7 displays bAF[n] in the time domain when f0=200 Hz, aa=0.2, fa=4, af=10, and ff=7.

Figure 7: bAF[n] in the time domain when f0=200 Hz, aa=0.2, fa=4, af=10, and ff=7. The formant parameters are the same as in Figure 5 and Figure 6.
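The source signal of Section 4.2 (Eq. 4) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the derivation of Appendix C: the 8 kHz sampling rate, the number of harmonics, the choice of af/ff as the modulation index of the first harmonic, and the function name `synthesize_source` are all hypothetical. The vocal tract and bandpass filters are not included.

```python
import numpy as np


def synthesize_source(f0=200.0, ff=7.0, af=10.0, fa=4.0, aa=0.2,
                      fs=8000, duration=3.0, n_harmonics=19):
    """Return (e_a, p_f, p_af, phase) for a sinusoidally FM'd, AM'd harmonic source.

    e_a   : AM envelope with depth aa and rate fa
    p_f   : sum of n_harmonics cosines whose f0 is frequency modulated
    p_af  : e_a * p_f, as in Eq. 4
    phase : phase of the first harmonic in radians
    """
    n = np.arange(int(duration * fs))
    e_a = 1.0 + aa * np.cos(2 * np.pi * fa * n / fs)
    # f0 swings sinusoidally by +/- af Hz at rate ff (assumed FM convention).
    phase = 2 * np.pi * f0 * n / fs + (af / ff) * np.sin(2 * np.pi * ff * n / fs)
    # Harmonic k carries k times the phase, so its FM depth is k * af.
    k = np.arange(1, n_harmonics + 1)[:, None]
    p_f = np.cos(k * phase[None, :]).sum(axis=0)
    return e_a, p_f, e_a * p_f, phase


fs = 8000
e_a, p_f, p_af, phase = synthesize_source(fs=fs)
inst_f0 = np.diff(phase) * fs / (2 * np.pi)   # instantaneous f0 in Hz
print(inst_f0.min(), inst_f0.max())           # f0 swings by about +/- af around 200 Hz
print(e_a.min(), e_a.max())                   # envelope stays within 1 -/+ aa
```

With these assumed values the 14th harmonic, which lies closest to the 2810 Hz third formant, deviates by roughly 14 x 10 = 140 Hz, which illustrates why the FM depth is larger near the third formant than near the lower formants.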
Compared to Figure 5, where it was difficult to discern the 7 Hz FM, this faster frequency component in the envelope of bAF[n] is seen more clearly in Figure 7. Stockham's method provides an approach to separate the fast-moving component from the 4 Hz component. Taking the logarithm of the magnitude of both sides of Eq. 11 and knowing that e[n] > 0,

    \log(|b_{AF}[n]|) = \log(|e[n]|) + \log(|\tilde{p}[n] * h[n]|).    (12)

Substituting Eq. 1 into Eq. 12 and taking the Fourier transform of both sides,

    F\{\log(|b_{AF}[n]|)\} = F\{\log(|e_A[n]|)\} + F\{\log(|e_F[n]|)\} + F\{\log(|\tilde{p}[n] * h[n]|)\}.    (13)

The spectrum of log(|bAF[n]|) is shown in Figure 8. This is obtained by removing DC from log(|bAF[n]|), applying a Hamming window to the entire 3-second signal, and then taking the Fourier transform. It is necessary to remove the DC component because the DC component is often sufficiently large for its spectral sidelobes to obscure the low-frequency components. Thus, unless otherwise specified, all Fourier transforms in this chapter are plotted after removing DC from the signal and applying a single Hamming window to the entire length of the 3-second signal.

Figure 8: Frequency components in the log-envelope (DFT of log(|bAF[n]|), f0=200 Hz, aa=0.2, fa=4, af=10, ff=7, DC removed). Green arrow: envelope region, from log(|e[n]|). Purple arrow: fast-moving region, from log(|v[n]|), which shows the FM from the harmonics (7 Hz between impulses).

The envelope region primarily consists of components belonging to the log-envelope, log(|e[n]|). This contains components from both log(|eA[n]|) and log(|eF[n]|). The "fast-moving" region corresponds to the frequency components belonging to log(|v[n]|) = log(|p̃[n] * h[n]|). The impulses have the greatest amplitude at around 200 Hz, 400 Hz, and 600 Hz because the pitch of the signal is 200 Hz.
The FM of each of the harmonics in the original signal can be expressed using a Bessel function representation [40]. The individual impulses are spaced 7 Hz apart because the rate of the FM is 7 Hz. This explains some of the variation in the magnitudes of the impulses in the "fast-moving" region. The other component that contributes to the magnitude of each of the impulses is the overall envelope in the frequency domain, which is denoted by the blue dashed lines in Figure 8 and is due to formant shaping.

Figure 9 shows the regions to which eA[n], the AM due to the respiratory muscles, and eF[n], the AM due to the interaction between the harmonics and the formants, map within the envelope region.

Figure 9: Components of the envelope: AM due to respiratory muscles ("AM region") and AM due to FM ("FM region"), shown in the DFT of log(|bAF[n]|) from 0 to 50 Hz (f0=200 Hz, aa=0.2, fa=4, af=10, ff=7, DC removed). The peaks represent the locations of fa and ff. Since it is assumed that fa<ff, the lower frequencies are assumed to be due to the AM, and the higher frequencies are assumed to be due to the FM. (There is some overlap due to a harmonic of the AM, but that becomes negligible after the first harmonic.)

The first major peak is at 4 Hz, which corresponds to the frequency of the AM. The next peak is at 7 Hz, which is the rate at which the FM changes (ff). The next large peaks are harmonics of 7 Hz because the AM due to FM is not a pure sinusoid.

Log-FM-and-AM Envelope as a Sum of the Log-FM Envelope and the Log-AM Envelope: Figure 10 depicts the block diagram used to show that the log-FM-and-AM envelope is the sum of the log-FM and log-AM envelopes for the "true" low frequencies, defined as the spectra of log(|eA[n]|) and log(|eF[n]|).
The difference between Figure 10 and Figure 6 is that in Figure 6, the FM and AM are multiplicative, while in Figure 10, the envelopes are processed individually and then added.

Figure 10: Block diagram for log-FM envelope plus log-AM envelope by using Stockham's method.

Following the same process as described earlier in this chapter, and letting Eq. 7 define H(f), it follows that:

    F\{\log(|b_{AF}[n]|)\} = F\{\log(|e_A[n]|)\} + F\{\log(|e_F[n]|)\} + 2F\{\log(|\tilde{p}[n] * h[n]|)\}.    (14)

We are ignoring the last term of Eq. 14 because we are primarily interested in the envelopes. When viewed on a frequency scale from 0 to 50 Hz, the Fourier transforms of log(|eA[n]|) and log(|eF[n]|) are visible. This is depicted in Figure 11.

Figure 11: Comparison of the Fourier transform of the log-FM-and-AM envelope and the Fourier transform of the log-FM plus log-AM envelopes. Top plot: Fourier transform of log(|bA[n]|), the log-AM envelope. Middle plot: Fourier transform of log(|bF[n]|), the log-FM envelope. Bottom plot: Fourier transform of log(|bA[n]|) + Fourier transform of log(|bF[n]|) (red), and the Fourier transform of log(|bAF[n]|) (green). Observe that they are approximately the same. The parameters used in these plots are the same as those used in Figure 7, Figure 8, and Figure 9.
Figure 11 depicts the Fourier transform of the log-AM envelope (top plot), derived by following the top branch of Figure 10; the Fourier transform of the log-FM envelope (middle plot), derived by following the bottom branch of Figure 10; the Fourier transform of the sum of the log-AM and log-FM envelopes (the sum of the top and middle plots, shown in red in the bottom plot); and the Fourier transform of the log-FM-and-AM envelope, shown in green in the bottom plot. The two curves in the bottom plot closely follow each other, supporting the conclusion that the log-envelopes can be summed to form the combined log-FM-and-AM envelope.

Results from Stockham's Method: Figure 12 compares b_AF[n] and the DFT of log|b_AF[n]| when the parameters chosen for f_0, a_a, f_a, a_f, and f_f are modified.

[Figure 12 plots: five rows by two columns. Left column: b_AF[n] versus Time (s), 0 to 3. Right column: DFT of log|b_AF[n]| versus Freq (Hz). Rows: baseline (f_0 = 200 Hz, a_a = 0.2, f_a = 4, a_f = 10, f_f = 7); f_f increased to 13; f_a decreased to 2.5 Hz; f_0 decreased to 190 Hz; a_f increased to 20 Hz.]

Figure 12: Comparison of b_AF[n] and DFT of log|b_AF[n]| when the formants are 820 Hz, 1220 Hz, and 2810 Hz, their bandwidths are 125 Hz, 125 Hz, and 250 Hz, respectively, and the bandpass filter is set to formant 3, with corner frequencies at 2810 Hz ± 250 Hz. Before taking the DFT, DC was removed from the signal and a Hamming window was applied.
Left column: b_AF[n]. Right column: DFT of log|b_AF[n]| on a frequency scale from 0 to 50 Hz. First row: contains the same figures as shown earlier in this section; they are present for reference. f_0 = 200 Hz, a_a = 0.2, f_a = 4 Hz, a_f = 10 Hz, f_f = 7 cycles/sec^2. Second row: f_f is increased from 7 cycles/sec^2 to 13 cycles/sec^2. Third row: f_a is decreased from 4 Hz to 2.5 Hz. Fourth row: f_0 is decreased from 200 Hz to 190 Hz. Fifth row: a_f is increased from 10 Hz to 20 Hz.

Although not visible in the spectra in Figure 12 due to the limits on the frequency axis, all instances of b_AF[n] contain two envelope components: a slow envelope and a fast envelope. In all cases, the frequency of the slow envelope corresponds to the frequency of the AM from the muscles of respiration, which is f_a. In the first, second, and fourth rows, f_a is 4 Hz, but in the third row it slows to 2.5 Hz. When f_f is increased from 7 cycles/sec^2 to 13 cycles/sec^2, as shown in the change from the first row of Figure 12 to the second, the higher-frequency component of the envelope appears to increase as well. In the third row, when f_a is changed from 4 Hz to 2.5 Hz, the frequency of the slow envelope decreases from 4 Hz to 2.5 Hz, as expected. When the modulation parameters are kept constant but f_0 is decreased from 200 Hz to 190 Hz, the frequency components due to the AM from the respiratory muscles remain the same, while the frequency components due to the interaction between the FM and AM stay at the same locations but change in amplitude (for example, the peak at 14 Hz is attenuated). This is expected because the harmonics of f_0 that are closest to the third formant are now at different frequencies. Increasing the depth of the FM, a_f, keeps the locations of the harmonics of the FM at multiples of 7 Hz, but the harmonics with the greatest energy have a higher frequency.
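One way to see why the FM-induced AM shows up at f_f and at its multiples is a quasi-static sketch. The sketch assumes that the amplitude of a single harmonic simply tracks the gain of one resonance as the harmonic's instantaneous frequency sweeps across it. The Lorentzian-shaped `resonance_gain`, the choice of the 14th harmonic, and the sampling rate are all illustrative assumptions rather than the thesis's vocal-tract model.

```python
import cmath
import math

def resonance_gain(freq_hz, center_hz, bandwidth_hz):
    """Gain of one Lorentzian-shaped resonance, a simplified stand-in for a formant."""
    half_bw = bandwidth_hz / 2
    return half_bw / math.hypot(freq_hz - center_hz, half_bw)

def dft_mag(x, fs, freq_hz):
    """Magnitude of the DFT of x at a single frequency (Hz)."""
    acc = sum(v * cmath.exp(-2j * math.pi * freq_hz * n / fs)
              for n, v in enumerate(x))
    return abs(acc)

fs = 400                               # envelope sampling rate (Hz), chosen for the sketch
f_0, a_f, f_f = 200.0, 10.0, 7.0       # baseline pitch, FM depth (Hz), FM rate
harmonic = 14                          # 14 * 200 Hz = 2800 Hz, close to formant 3
formant_hz, bandwidth_hz = 2810.0, 250.0

log_env = []
for n in range(fs):                    # 1 s of signal, so 7 Hz completes whole cycles
    inst_f0 = f_0 + a_f * math.cos(2 * math.pi * f_f * n / fs)
    gain = resonance_gain(harmonic * inst_f0, formant_hz, bandwidth_hz)
    log_env.append(math.log(gain))

mean = sum(log_env) / len(log_env)
log_env = [v - mean for v in log_env]  # remove DC before inspecting the spectrum

spectrum = {f: dft_mag(log_env, fs, f) for f in (4, 7, 10, 14, 21)}
for f, mag in spectrum.items():
    print(f"{f:>2} Hz: {mag:.4f}")
```

In this simplified setting the energy falls on 7 Hz and its multiples and is absent at 4 Hz and 10 Hz, because the gain curve is a nonlinear function of the sinusoidally varying frequency.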
The spectra of log|b_AF[n]| contain peaks at the AM and FM frequencies and at harmonics of those frequencies, but they also contain some high-frequency artifact in between. This artifact is a byproduct of taking the magnitude of b_AF[n]: the negative time-domain components are flipped about the time axis, which introduces an additional high-frequency component. Previous studies that used the envelope of a signal to study dysphonic speech used the Hilbert transform to obtain the envelope [41], so the combination of the Hilbert transform and Stockham's method was explored next.

4.3.2 Hilbert-Stockham Method

In communications signals and in dysphonic speech, incoherent envelope detection is performed by bandpass filtering around the carrier frequency. The resulting signal is then converted to a Hilbert envelope by performing the Hilbert transform on the bandpassed signal and then taking the magnitude [41]. In this thesis, the logarithm of the magnitude is taken in an effort to further separate the fast-moving component from e[n].

Description of Hilbert-Stockham Method: Figure 13 shows the block diagram for the processing method using both the Hilbert transform and Stockham's method.

[Figure 13 diagram: harmonic synthesizer and bandpass filter (the model), followed by the Hilbert operator G(f), magnitude, and logarithm.]

Figure 13: Block diagram representing the model (turquoise box) and processing when the Hilbert transform and Stockham's method were both used. The Hilbert transform of the bandpassed signal, b_AF[n], is taken, resulting in y_AF[n]. The log of the magnitude of y_AF[n] is subsequently taken. Note that the only difference between this processing method and the previous method is that the Hilbert transform is taken prior to taking the magnitude of the time-domain signal.

From the block diagram in Figure 13 and Eq. 11, y_AF[n], the Hilbert transform of b_AF[n], is obtained as:

y_{AF}[n] = \big(e[n]\,(p[n] * h[n])\big) * g[n]  (15)

where g[n] is the Hilbert operator.
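The contrast between rectification and the Hilbert envelope can be illustrated with a short sketch. It assumes that the Hilbert envelope is the magnitude of the analytic signal, built here by zeroing the negative-frequency DFT bins. The 40 Hz carrier is a low-rate stand-in for the formant-3 band so that a naive DFT stays fast; that value and the sampling rate are illustrative assumptions.

```python
import cmath
import math

def dft(x):
    """Naive O(N^2) DFT, adequate for the short record used here."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n))
            for m in range(n)]

def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[m] * cmath.exp(2j * math.pi * m * k / n) for m in range(n)) / n
            for k in range(n)]

def hilbert_envelope(x):
    """Magnitude of the analytic signal, formed by discarding negative-frequency bins."""
    n = len(x)
    spectrum = dft(x)
    gain = [0.0] * n
    gain[0] = 1.0                      # keep DC
    for m in range(1, (n + 1) // 2):
        gain[m] = 2.0                  # double the positive frequencies
    if n % 2 == 0:
        gain[n // 2] = 1.0             # keep the Nyquist bin once
    analytic = idft([g * s for g, s in zip(gain, spectrum)])
    return [abs(v) for v in analytic]

fs = 200                               # samples per second; the record is 1 s long
carrier_hz, f_a, a_a = 40.0, 4.0, 0.2  # carrier value is an assumption for the sketch
true_env = [1 + a_a * math.cos(2 * math.pi * f_a * k / fs) for k in range(fs)]
x = [e * math.cos(2 * math.pi * carrier_hz * k / fs) for k, e in enumerate(true_env)]

rectified = [abs(v) for v in x]        # magnitude only, as in the first method
envelope = hilbert_envelope(x)         # magnitude after the Hilbert step

doubled_bin = int(2 * carrier_hz)      # bins are 1 Hz apart for a 1 s record
rect_at_2fc = abs(dft(rectified)[doubled_bin])
env_at_2fc = abs(dft(envelope)[doubled_bin])
print(rect_at_2fc, env_at_2fc)
```

In this idealized case the rectified signal carries a strong component at twice the carrier frequency, while the Hilbert envelope follows the slow 4 Hz modulation and has essentially no energy there.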
Since only the magnitude is important in this case, the following approximation can be made [42]:

y_{AF}[n] \approx e[n]\,p[n] * (h[n] * g[n])  (16)

Letting \hat{h}[n] = h[n] * g[n], the Hilbert transform of h[n], then:

y_{AF}[n] \approx (e[n]\,p[n]) * \hat{h}[n]  (17)

Using the same approximation that was used to obtain Eq. 16 from Eq. 15, it follows that:

y_{AF}[n] \approx e[n]\,(p[n] * \hat{h}[n])  (18)

Substituting Eq. 8, letting \hat{v}[n] be the Hilbert transform of v[n], and with e[n] > 0:

|y_{AF}[n]| \approx e[n]\,|\hat{v}[n]|  (19)

Figure 14 compares b_AF[n] to its Hilbert transform, y_AF[n], over two timescales, when a bandpass filter is applied around the third formant.

[Figure 14 plots: two panels. Top: "Comparison of bAF and yAF over entire waveform", Time (sec), 0 to 3. Bottom: "Comparison of bAF and yAF over segment of waveform", Time (sec).]

Figure 14: Comparison of b_AF[n] (blue) and y_AF[n] (red). f_0 = 200 Hz, a_a = 0.2, f_a = 4 Hz, a_f = 10 Hz, f_f = 7 cycles/sec^2. The first through third formant frequencies are 820 Hz, 1220 Hz, and 2810 Hz, and the bandwidths are 125 Hz, 125 Hz, and 250 Hz, respectively. Top: b_AF[n] and y_AF[n], plotted over the entire 3-second duration of the waveforms. Bottom: b_AF[n] and y_AF[n], plotted between times 0.4 and 0.7 seconds.

In the bottom plot of Figure 14, y_AF[n] appears to be the envelope of b_AF[n]. It preserves the envelope of the AM from the respiratory muscles and the AM due to the interaction between the formant and the harmonics. In contrast, when the magnitude of b_AF[n] is taken, the negative samples become positive, resulting in more high-frequency artifact.

To separate the different sources of AM from the Hilbert envelope, Stockham's method is once again utilized. Taking the log of the magnitude of both sides of Eq. 19, with e_A[n] > 0 and e_F[n] > 0, and substituting e[n] = e_A[n]e_F[n], the following is obtained:

\log|y_{AF}[n]| = \log(e_A[n]) + \log(e_F[n]) + \log|\hat{v}[n]|  (20)

Taking the Fourier transform of both sides,

\mathcal{F}\{\log|y_{AF}[n]|\} = \mathcal{F}\{\log(e_A[n])\} + \mathcal{F}\{\log(e_F[n])\} + \mathcal{F}\{\log|\hat{v}[n]|\}  (21)

The envelope region of the Fourier transform of log|y_AF[n]| is shown in Figure 15. As a result of the smoothing from the Hilbert envelope, there is less low-amplitude, high-frequency artifact between the peaks at 4 Hz, 7 Hz, 8 Hz, 14 Hz, 21 Hz, 28 Hz, etc. Therefore, this processing method might yield features that are more strongly correlated with vocal modulation than Stockham's method without the Hilbert transform.

[Figure 15 plot: DFT of log|y_AF[n]| with f_0 = 200 Hz, a_a = 0.2, f_a = 4, a_f = 10, f_f = 7, DC removed; horizontal axis: Freq (Hz), 0 to 50; annotated regions: "AM due to respiratory muscles" and "AM due to FM".]

Figure 15: Spectrum of log|y_AF[n]|, showing the regions where there is AM due to the muscles of respiration and AM due to the interaction between the formants and harmonics of the fundamental.

However, when DC is removed, a Hamming window is applied, and the Fourier transform of log|y_AF[n]| is viewed over a wider frequency range, it is clear that many high-frequency components remain in the Hilbert envelope. This is shown in Figure 16.

[Figure 16 plots: two panels versus Freq (Hz), 0 to 12000. Top: Fourier transform of log|b_AF[n]|. Bottom: Fourier transform of log|y_AF[n]|.]

Figure 16: Comparison of the spectra of log|b_AF[n]| and of log|y_AF[n]|. log|y_AF[n]| is nearly a lowpassed version of log|b_AF[n]|.

There is a considerable amount of energy at around 5800 Hz in the plot of the spectrum of log|b_AF[n]| in Figure 16.
This occurs because b_AF[n] contains both positive and negative values. When |b_AF[n]| is obtained, the negative components become positive, effectively doubling the frequency of the fastest component in the signal, which is formant 3. The third formant is at 2810 Hz, so twice the third formant frequency is 5620 Hz, which is consistent with the top plot in Figure 16.

Log-FM-and-AM Envelope as a Sum of the Log-FM Envelope and the Log-AM Envelope: As with Stockham's method without the Hilbert transform, the demonstration that the log-FM-and-AM envelope is the sum of the log-FM and log-AM envelopes is based on Figure 17.

[Figure 17 diagram: two parallel branches, each with a harmonic synthesizer, bandpass filter, Hilbert operator G(f), magnitude, and logarithm; the branch outputs, log|y_A[n]| and log|y_F[n]|, are summed.]

Figure 17: Block diagram for log-FM envelope plus log-AM envelope by using the Hilbert-Stockham method.

Following the same process as described in Section 4.3, and letting Eq. 7 define H(f):

\mathcal{F}\{\log|y_{AF}[n]|\} = \mathcal{F}\{\log(e_A[n])\} + \mathcal{F}\{\log(e_F[n])\} + 2\mathcal{F}\{\log|\hat{v}[n]|\}  (22)

Since only the envelopes are of interest, the last term of Eq. 22 is ignored. When viewed on a frequency scale from 0 to 50 Hz, the Fourier transforms of log|e_A[n]| and log|e_F[n]| are visible. This is depicted in Figure 18.

[Figure 18 plots: three panels versus Freq (Hz), 0 to 50, with f_0 = 200 Hz, a_a = 0.2, f_a = 4, a_f = 10, f_f = 7. Top: Fourier transform of log|y_A[n]|. Middle: Fourier transform of log|y_F[n]|. Bottom: Fourier transform of log|y_A[n]| + log|y_F[n]| overlaid on the Fourier transform of log|y_AF[n]|.]

Figure 18: Comparison of Fourier transform of log-FM-and-AM envelope, and Fourier transform of log-FM plus log-AM envelopes, obtained using the Hilbert-Stockham method.
Top plot: Fourier transform of log|y_A[n]|, the log-AM envelope. Middle plot: Fourier transform of log|y_F[n]|, the log-FM envelope. Bottom plot: Fourier transform of log|y_A[n]| + Fourier transform of log|y_F[n]| (red), and the Fourier transform of log|y_AF[n]| (green). Observe that they are approximately the same.

Much like the Stockham method without the Hilbert transform, the Fourier transform of the log-FM-and-AM envelope is very similar to the sum of the Fourier transforms of the log-AM envelope and the log-FM envelope.

Results from Hilbert Transform and Stockham's Method: Figure 19 shows the resulting signal when the Hilbert transform is taken before taking the magnitude and then the logarithm, for several sets of modulation parameters.

[Figure 19 plots: four rows by two columns. Left column: y_AF[n] versus Time (s). Right column: DFT of log|y_AF[n]| versus Freq (Hz), 0 to 50. Rows: baseline (f_0 = 200 Hz, a_a = 0.2, f_a = 4, a_f = 10, f_f = 7); f_f = 13; f_a = 2.5; f_0 = 190 Hz.]

Figure 19: Comparison of y_AF[n] and DFT of log|y_AF[n]| when the formants are 820 Hz, 1220 Hz, and 2810 Hz, their bandwidths are 125 Hz, 125 Hz, and 250 Hz, respectively, and the bandpass filter is set to formant 3, with corner frequencies at 2810 Hz ± 250 Hz. Before taking the DFT, DC was removed from the signal and a Hamming window was applied. The first column plots y_AF[n], and the second column plots the DFT of log|y_AF[n]| on a frequency scale from 0 to 50 Hz.
The first row contains the same figures as shown earlier in this section; they are present for reference. In the second row, f_f is increased from 7 cycles/sec^2 to 13 cycles/sec^2. In the third row, f_a is decreased from 4 Hz to 2.5 Hz. In the fourth row, f_0 is decreased from 200 Hz to 190 Hz.

The pattern is identical to what was seen in the method without the Hilbert transform, except that the Hilbert transform appears to remove some of the high-frequency artifact. Therefore, using the Hilbert method might provide better results when extracting features to detect depression.

4.3.3 Nonlinear, Iterative Envelope (NLIE)-Stockham Method

Based on [43], we applied a nonlinear, iterative algorithm to estimate the envelope of a signal*.

Description of the NLIE-Stockham Method: Similar to the two other processing methods, the first step is to bandpass the output from the vocal tract, resulting in b_AF[n]. Then, the NLIE is obtained by convolving |b_AF[n]| with an equally weighted moving-average filter of length 2.5 ms. For each point along the length of |b_AF[n]|, the maximum of |b_AF[n]| and the convolution is kept. Then, the process repeats 150 times. The resulting signal is called z_AF[n]. Finally, the natural logarithm of the magnitude of z_AF[n] is taken because it is desirable for the natural logarithm of the AM from the respiratory muscles and the AM from the resonance-harmonics interaction to be additive. This process is illustrated in Figure 20.

* The code for the NLIE envelope was written by Dr. Elizabeth Godoy, a Technical Staff member at MIT-Lincoln Laboratory, Human Language Systems and Technology Group.

[Figure 20 diagram: harmonic synthesizer, bandpass filter (output b_AF[n]), magnitude, NLIE (output z_AF[n]), and logarithm.]

Figure 20: NLIE-Stockham method. The "NLIE" box represents the nonlinear, iterative envelope computation. The output from the envelope is called z_AF[n].
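The iteration described above can be sketched as follows. This is a reading of the prose, not the code used in the thesis: in particular, it assumes that each pass smooths the current envelope estimate (rather than the original |b_AF[n]|) and keeps the pointwise maximum of the estimate and its smoothed version. The test signal, its 1000 Hz carrier, and the 8 kHz sampling rate are illustrative values.

```python
import math

def moving_average(x, width):
    """Equally weighted moving average over `width` samples (zero-padded at the edges)."""
    prefix = [0.0]
    for v in x:
        prefix.append(prefix[-1] + v)   # running sum for O(N) smoothing
    n = len(x)
    out = []
    for i in range(n):
        lo = max(0, i - width // 2)
        hi = min(n, i - width // 2 + width)
        out.append((prefix[hi] - prefix[lo]) / width)
    return out

def nlie(x, fs, window_s=2.5e-3, iterations=150):
    """Nonlinear, iterative envelope: repeatedly keep max(estimate, smoothed estimate)."""
    width = max(1, round(window_s * fs))
    envelope = [abs(v) for v in x]
    for _ in range(iterations):
        smoothed = moving_average(envelope, width)
        envelope = [max(e, s) for e, s in zip(envelope, smoothed)]
    return envelope

# Hypothetical test signal: a 1000 Hz carrier with 4 Hz AM of depth 0.2.
fs = 8000
carrier_hz, f_a, a_a = 1000.0, 4.0, 0.2
b = [(1 + a_a * math.cos(2 * math.pi * f_a * k / fs))
     * math.sin(2 * math.pi * carrier_hz * k / fs)
     for k in range(fs // 4)]

z = nlie(b, fs)                         # envelope estimate
log_z = [math.log(v) for v in z]        # log-envelope for the Stockham step
```

Under this reading the estimate never falls below |b[n]| and the valleys between carrier peaks are progressively filled in, which is consistent with the observation that the fast-moving component is absent from the NLIE output.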
It is important to note that with the parameters used for the NLIE (an equally weighted moving-average filter length of 2.5 ms and 150 iterations), the envelope does not contain the fast-moving component of the signal, v[n], that was seen in the other two processing methods.

[Figure 21 plots: two panels. Top: "bAF[n] and zAF[n] over entire waveform", Time (sec), 0 to 3. Bottom: "bAF[n] and zAF[n] over a segment of the waveform", Time (sec), 0.5 to 0.9.]

Figure 21: b_AF[n] and z_AF[n]. The blue line represents b_AF[n] and the magenta line represents z_AF[n].

Figure 21 displays b_AF[n] and the envelope output from the NLIE, over two timescales. The NLIE appears to perform very well at estimating the envelope in the time domain. Compared to the Hilbert envelope in Figure 14, the NLIE appears to be an envelope of the Hilbert envelope: it removes additional high frequencies that are present in the Hilbert envelope, and thus removes artifacts that are not components of the "true envelope" of the speech waveform.

The frequency-domain representation of the log of the magnitude of the envelope is shown in Figure 22. There is a clear peak at 4 Hz, corresponding to the rate of the AM from the muscles of respiration, and there are also clear peaks at 7 Hz and its harmonics.

[Figure 22 plot: Fourier transform of log|z_AF[n]| versus Freq (Hz), 0 to 50, with labeled peaks including 4 Hz, 7 Hz, 14 Hz, 21 Hz, and 28 Hz.]

Figure 22: Fourier transform of log|z_AF[n]|.

Log-FM-and-AM Envelope as a Sum of the Log-FM Envelope and the Log-AM Envelope: Similar to Stockham's method and the Hilbert-Stockham method, the verification that the log-FM-and-AM envelope is the sum of the log-FM and log-AM envelopes with the NLIE-Stockham method is based on Figure 23.
[Figure 23 diagram: two parallel branches, each with a harmonic synthesizer, bandpass filter, NLIE, and logarithm; the branch outputs, log|z_A[n]| and log|z_F[n]|, are summed.]

Figure 23: NLIE-Stockham method for log-FM-and-AM envelope, and log-FM plus log-AM envelope.

When viewed on a frequency scale from 0 to 50 Hz, the Fourier transforms of log|e_A[n]| and log|e_F[n]|, derived from Stockham's method, are visible. This is depicted in Figure 24.

[Figure 24 plots: three panels versus Freq (Hz), 0 to 50, with f_0 = 200 Hz, a_a = 0.2, f_a = 4, a_f = 10, f_f = 7. Top: Fourier transform of log|z_A[n]|. Middle: Fourier transform of log|z_F[n]|. Bottom: Fourier transform of log|z_A[n]| + log|z_F[n]| overlaid on the Fourier transform of log|z_AF[n]|.]

Figure 24: Comparison of Fourier transform of log-FM-and-AM envelope, and Fourier transform of log-FM plus log-AM envelopes, obtained using the NLIE and Stockham methods. Top plot: Fourier transform of log|z_A[n]|, the log-AM envelope. Middle plot: Fourier transform of log|z_F[n]|, the log-FM envelope. Bottom plot: Fourier transform of log|z_A[n]| + Fourier transform of log|z_F[n]| (red), and the Fourier transform of log|z_AF[n]| (green). Note that they are approximately the same.

As in Stockham's method and the Hilbert-Stockham method, when the NLIE-Stockham method is applied, the Fourier transform of the log-FM-and-AM envelope is very similar to the sum of the Fourier transforms of the log-AM envelope and the log-FM envelope.

Results from the NLIE-Stockham Method: Figure 25 shows the resulting signals when the NLIE is extracted and the logarithm of its magnitude is then taken. Similar to the two previous methods, the main component of the AM from the respiratory muscles, located at 4 Hz, is present, as well as the harmonics of the 7 Hz FM component.

[Figure 25 plots: four rows by two columns. Left column: z_AF[n] versus Time (s). Right column: DFT of log|z_AF[n]| versus Freq (Hz), 0 to 50. Rows: baseline (f_0 = 200 Hz, a_a = 0.2, f_a = 4, a_f = 10, f_f = 7); f_f = 13; f_a = 2.5; f_0 = 190 Hz.]

Figure 25: Comparison of z_AF[n] and DFT of log|z_AF[n]| when the formants are 820 Hz, 1220 Hz, and 2810 Hz, their bandwidths are 125 Hz, 125 Hz, and 250 Hz, respectively, and the bandpass filter is set to formant 3, with corner frequencies at 2810 Hz ± 250 Hz. Before taking the DFT, DC was removed from the signal and a Hamming window was applied. The first column plots z_AF[n], and the second column plots the DFT of log|z_AF[n]| on a frequency scale from 0 to 50 Hz. The first row contains the same figures as shown earlier in this section; they are present for reference. In the second row, f_f is increased from 7 cycles/sec^2 to 13 cycles/sec^2. In the third row, f_a is decreased from 4 Hz to 2.5 Hz. In the fourth row, f_0 is decreased from 200 Hz to 190 Hz.

4.3.4 Comparison of the Three Processing Methods

Figure 26 displays the three processing methods when used on four different sets of modulation parameters. The sets of modulation parameters displayed are the same as those used in the previous parts of this section.
[Figure 26 plots: four panels versus Freq (Hz), 0 to 50, each comparing the Fourier transforms of log|b_AF[n]| (Stockham), log|y_AF[n]| (Hilbert and Stockham), and log|z_AF[n]| (NLIE and Stockham). Panels: baseline (f_0 = 200 Hz, a_a = 0.2, f_a = 4, a_f = 10, f_f = 7); f_f = 13; f_a = 2.5; f_0 = 190 Hz.]

Figure 26: Comparison of the three different processing methods for the four different parameter combinations used.

Ideally, there would be peaks only at 4 Hz, 7 Hz, and multiples of 7 Hz. Based on these plots, the processing method with the largest peaks and the least high-frequency artifact is the NLIE-Stockham method. This occurs because the NLIE-Stockham method finds the envelope of the Hilbert envelope, removing the component of the Hilbert envelope that represents the pitch period. Therefore, it is hypothesized that this method will prove to be the most useful when deriving features from the waveforms in the AVEC database.

4.4 Conclusions

The Stockham, Hilbert-Stockham, and NLIE-Stockham methods were successful in demodulating synthetic FM-and-AM signals. Because the NLIE-Stockham envelope contains the least high-frequency artifact, it is hypothesized that features extracted from this envelope will best predict subjects' MDD severity. In the next chapter, we will extract features from the three methods for comparison.

Chapter 5 Feature Extraction

The purpose of this chapter is to describe the features that will be used in Chapter 6 to predict depression ratings.
Motivated by the perceptual task described in Chapter 3, the model described in Chapter 4, the theory of motor incoordination in MDD [12], and the finding that spectrotemporal information at low frequencies (up to 25 Hz) can be used to predict Hamilton Depression (HAM-D) scores [29], various features from each of the speech signals were extracted and used for prediction. The features are used to test the hypothesis that individuals with MDD have more erratic modulation in their voices, where "modulation" is reflected by each of the features. The model of Chapter 4 predicts that many of the features can be found in the low-frequency band of the logarithm (log) of the envelope of the acoustic speech signal. The features can be outlined as follows:

1) Frequency domain-based features
   a. Mean of the magnitude of the STFT of the log-envelope
   b. Variance of the magnitude of the STFT of the log-envelope
   c. Coefficient of variation of the STFT of the log-envelope
   d. Amount of unnormalized energy in the low-frequency region
   e. Ratio of energy in low-frequency region to energy in high-frequency region
2) Time domain-based features: eigenvalues of cross-correlations among envelopes containing different frequencies

Prior to extracting the features, the waveforms must be pre-processed. The purpose of the pre-processing step is to extract the frequency band of interest. Since the Hilbert-Stockham and NLIE envelopes appeared to contain less high-frequency artifact than the Stockham envelope when they were tested on the model, both the Hilbert-Stockham and NLIE-Stockham envelopes were used.

This chapter is outlined as follows: Section 5.1 describes the database used. Section 5.2 describes the pre-processing that is common to both the frequency domain-based features and the time domain-based features. Section 5.3 details each of the frequency domain-based features, while Section 5.4 details the time domain-based features.
5.1 The AVEC Database

The 2013 Audio/Visual Emotion Challenge (AVEC) database is used for feature extraction and classification/regression. The AVEC 2013 challenge contains a subset of an audio-visual depression language corpus that includes 340 video recordings of 292 subjects performing a human-computer interaction task while wearing a headset and being recorded by a webcam and a microphone. The 16-bit audio was recorded using a laptop's sound card at a sampling rate of 41 kHz or 32 kHz. The video was recorded using a variety of codecs and frame rates, and was re-sampled to a uniform 30 frames per second. For the challenge, the recordings were split into three partitions of 50 recordings each: a training, a development, and a test set. Recording lengths fall between 20 and 50 minutes, with a mean of 25 minutes. The mean age is 31.5 years, with a standard deviation of 12.3 years over a range of 18 to 63 years. The recordings took place in different quiet environments [32].

Subjects were required to utter the /a/ vowel at a comfortable sound level for part of the task. That /a/ vowel from the training and development sets was the portion of the recordings used for this thesis in feature extraction and classification/regression. Of the 100 recordings in the training and development sets, 3 did not contain the /a/ vowel at a comfortable level, 1 contained the subject laughing during the utterance, 9 were too short, and 1 was too quiet; a total of 87 /a/ vowels was used.

The subjects' MDD severity ratings were scored using the self-reported Beck assessment, described in Appendix B. Low Beck scores correspond to little or absent MDD, while high Beck scores correspond to severe MDD. The scores are rated on a scale between 0 and 63. Among the 87 vowels extracted from the AVEC database, the Beck scores varied between 0 and 45.

5.2 Pre-Processing

The purpose of the pre-processing is to extract the NLIE-Stockham and Hilbert-Stockham envelopes from the 87 raw waveforms.
The recordings sampled at 41.8 kHz are downsampled to 32 kHz to standardize the sampling rates of all of the recordings. Next, the middle 3 seconds of each /a/ vowel are extracted to standardize the number of windows and the window lengths used among different sessions when the Fourier transform is taken. Using a Kalman-based autoregressive moving average framework [44], the formants of each of the waveforms are computed. A bandpass filter centered on the third formant is applied for the same reasons as described in Chapter 4: the frequency of the third formant is well separated from the first and second formants, and the greatest amount of modulation of the harmonics of f_0 is expected to occur within the third formant. For each of the waveforms, the bandwidth of the bandpass filter is set to 250 Hz on each side of the formant. Finally, the NLIE-Stockham and Hilbert-Stockham envelopes are computed, giving two envelopes per waveform. Each is denoted by e_L[n] = log|e[n]| and referred to as the general log-envelope. The NLIE and Hilbert envelopes, without taking the log of the magnitude of the envelope, are denoted by e[n].

5.3 Frequency Domain-Based Features

This section details each of the frequency domain-based features. Our model and the findings of Cummins et al. [29] serve as the motivation for this type of feature: these features are derived from the low-frequency (<20 Hz to <50 Hz) content in the log-envelope of the waveform. The definition of "low frequency" needed to be established empirically.

To compute the STFT over the middle three seconds, the Hilbert-Stockham and NLIE-Stockham envelopes are segmented into 5 sections, each 1 second long, with an overlap of 0.5 seconds between consecutive sections. The DC value of each segment is removed, and a 1-second Hamming window is applied to each segment of the envelope.
It is necessary to remove the DC value prior to taking the STFT because, in the frequency domain, the energy from the mainlobe at DC often leaks into the frequencies of interest. A 262,144-point FFT is applied when the Fourier transform is computed. The large number of points was necessary to obtain the desired resolution in the low-frequency region of the signal; at a sampling rate of 32 kHz, the FFT bins are spaced about 0.122 Hz apart.

5.3.1 Average of the STFT Magnitude of the Log-Envelope

These features are motivated by Cummins et al. [29], who used spectrotemporal information from utterances of "pa-ta-ka" to predict subjects' clinically assessed HAM-D scores. The data used by Cummins et al. originated from a database different from AVEC. Although it is difficult to extract physiological meaning from the average STFT magnitude of the log-envelope, this set of features can lay the groundwork for features that provide physiological meaning. The STFT magnitude of log|e[n]| at time n, denoted by |E_L(n, f)|, is given by Eq. 23:

|E_L(n, f)| = \left| \sum_{m=-\infty}^{\infty} \log(|e[m]|)\, w[n - m]\, e^{-j 2\pi f m} \right|  (23)

where w[n - m] is the analysis window. Its length is N_w, and it is nonzero only over the interval [0, N_w - 1]. A single 3-second window is not applied to the signal because the signal is time-variant; "noise" appears in the STFT if the window is too long. If the window is too short, the frequency resolution suffers [30]. The STFT is taken over P windows, and the result is averaged. Assuming that each STFT is computed at times that are multiples of N_w/2, that f_s is the sampling frequency, and that p is an integer representing the multiple of N_w/2, the time point at which the STFT is taken can be expressed as:

n = \frac{p N_w}{2 f_s}.

Therefore, the mean of the magnitude of the STFT of log|e[n]|, |E_L(f)|, is computed as:

|E_L(f)| = \frac{1}{P} \sum_{p=1}^{P} \left| E_L\!\left( \frac{p N_w}{2 f_s}, f \right) \right|.  (24)

In this case, w[n - m] is a 1-second Hamming window, and the STFT is taken at 0.5-second intervals of log|e[n]|. There is a total of P = 5 STFTs computed.
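The windowing and averaging can be sketched in a few lines. The sketch is a scaled-down illustration, not the thesis's implementation: it uses a hypothetical 100 Hz sampling rate and evaluates the transform directly on a coarse 0.5 Hz grid instead of using a 262,144-point FFT at 32 kHz, and the synthetic log-envelope (a 4 Hz raised cosine) is an assumed test input.

```python
import cmath
import math

def hamming(n_w):
    """Hamming window of length n_w."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n_w - 1)) for k in range(n_w)]

def segment_magnitude(segment, fs, freqs_hz):
    """Remove DC, apply a Hamming window, and return |spectrum| at freqs_hz."""
    dc = sum(segment) / len(segment)
    window = hamming(len(segment))
    weighted = [(v - dc) * w for v, w in zip(segment, window)]
    return [abs(sum(v * cmath.exp(-2j * math.pi * f * k / fs)
                    for k, v in enumerate(weighted)))
            for f in freqs_hz]

def mean_stft_magnitude(log_env, fs, freqs_hz, win_s=1.0, hop_s=0.5):
    """Magnitude spectrum of each window plus the mean over windows."""
    n_w, hop = round(win_s * fs), round(hop_s * fs)
    per_window = [segment_magnitude(log_env[s:s + n_w], fs, freqs_hz)
                  for s in range(0, len(log_env) - n_w + 1, hop)]
    mean = [sum(col) / len(per_window) for col in zip(*per_window)]
    return per_window, mean

fs = 100                                        # hypothetical rate for the sketch
log_env = [math.log(1 + 0.2 * math.cos(2 * math.pi * 4.0 * k / fs))
           for k in range(3 * fs)]              # 3 s of a synthetic log-envelope
freqs_hz = [0.5 * i for i in range(41)]         # 0 to 20 Hz in 0.5 Hz steps

per_window, mean_mag = mean_stft_magnitude(log_env, fs, freqs_hz)
peak_hz = freqs_hz[mean_mag.index(max(mean_mag))]
print(len(per_window), peak_hz)
```

A 3-second record with 1-second windows and a 0.5-second hop yields (3 - 1) / 0.5 + 1 = 5 windows, which matches P = 5 in the text.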
As an example, Figure 27 shows the magnitude of the STFT of the NLIE envelope for each of the windows, as well as the mean STFT magnitude (dark black line).

[Figure 27 plots: eight panels, one per recording, each titled with a recording identifier and Beck score (top row: scores of 0, 0, 0, and 1; bottom row: scores of 31, 31, 32, and 34); curves for Windows 1 through 5; horizontal axis: Freq (Hz), 0 to 20.]

Figure 27: Mean STFT magnitude of NLIE-Stockham envelope for subjects with low Beck scores (top) and high Beck scores (bottom), when the frequencies of the envelope range between 0 and 20 Hz. Five 1-second windows over the middle 3 seconds of each vowel are used in the computation of the STFTs. The delay between each window is 0.5 sec. All subjects uttering the waveforms shown in this figure are female.

A more periodic, less erratic structure is apparent in the average of the STFT magnitude of the log-envelope in two of the less depressed subjects (228_2 and 322_1) than in the four depressed subjects. However, this observation does not generally hold. Across the ensemble, it is difficult to detect patterns that could be used to distinguish the subjects with low Beck scores (top row) from the subjects with high Beck scores (bottom row). Consequently, each of the 164 frequency samples between 0 and 20 Hz is input to a dimensionality reduction scheme prior to Beck score prediction. These processes are described in Chapter 6.

5.3.2 Variance of the Magnitude of the STFT of the Log-Envelope

Motivated by the hypothesis of motor incoordination in MDD [12], it is hypothesized that there is greater variance over time in the spectrum of a depressed subject.
This variance is computed as $\mathrm{var}(|E_L(n_p, f)|)$, where $n_p$ is the time of the $p$-th window, as $p$ varies from 1 to 5 over the different segments of the signal. In other words, for all of the windows from one subject in Figure 27, the variance of the magnitude of the STFT of $\log(|e[n]|)$ is computed. The result is shown in Figure 28.

[Figure 28: eight panels, one per utterance, each titled with the utterance ID and Beck score; horizontal axis: Freq (Hz), 0 to 20.]

Figure 28: Variance of the STFT magnitude of NLIE-Stockham envelope for subjects with low Beck scores (top) and high Beck scores (bottom), when the frequencies of the envelope used in classification and regression ranged between 0 and 20 Hz. Five 1-second windows over the middle 3 seconds of each vowel were used in the STFTs. Each window was delayed by 0.5 seconds from the previous. All subjects uttering the waveforms shown in this figure are female. The utterances are the same as those shown in Figure 27.

It is difficult to discern a pattern that would distinguish the depressed subjects from those who are not depressed. Therefore, the 164 frequency samples between 0 and 20 Hz are reduced in dimensionality prior to Beck score prediction.

5.3.3 Coefficient of Variation (CV) of the Magnitude of the STFT of the Log-Envelope

The coefficient of variation (i.e. the variance normalized by the mean) is also used as a feature. This measurement is similar to the variance of $|E_L(n,f)|$, but it is hypothesized to yield improved prediction because it normalizes the variance by the mean of the FFT magnitude up to 20 Hz. The features from 0 to 20 Hz for females with low Beck scores and high Beck scores are shown in Figure 29.
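The three spectral statistics described so far (mean, variance, and CV across windows) can be sketched per frequency bin as follows. This is a minimal illustration, not the thesis code: the function name `per_bin_features` is hypothetical, the input is assumed to be a list of P magnitude spectra on a common frequency grid, the sample (N-1) variance is an assumption, and the CV follows the definition given in the text (variance divided by mean).

```python
from statistics import mean, variance

def per_bin_features(stft_mags):
    """Mean, variance, and CV across windows for every frequency bin.

    stft_mags: list of P lists; stft_mags[p][k] is the STFT magnitude of the
    log-envelope in window p at frequency bin k.
    """
    n_bins = len(stft_mags[0])
    feats = {"mean": [], "var": [], "cv": []}
    for k in range(n_bins):
        column = [frame[k] for frame in stft_mags]   # one bin across all windows
        m = mean(column)
        v = variance(column)                         # sample variance (assumption)
        feats["mean"].append(m)
        feats["var"].append(v)
        # CV as defined in the text: variance normalized by the mean.
        feats["cv"].append(v / m if m != 0 else float("nan"))
    return feats

# Toy example with P=3 windows and 2 frequency bins.
toy = [[1.0, 2.0], [3.0, 2.0], [2.0, 2.0]]
features = per_bin_features(toy)
```

In the toy input, the first bin fluctuates across windows while the second is constant, so only the first bin has nonzero variance and CV.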
[Figure 29: eight panels, one per utterance, each titled with the utterance ID and Beck score; horizontal axis: Freq (Hz), 0 to 20.]

Figure 29: CV of the STFT magnitude of NLIE-Stockham envelope for subjects with low Beck scores (top) and high Beck scores (bottom), when the frequencies of the envelope used in classification and regression ranged between 0 and 20 Hz. Five 1-second windows over the middle 3 seconds of each vowel were used in the STFTs. Each window was delayed by 0.5 seconds from the previous.

Similar to the mean of the STFT magnitude of the log-envelope, it is difficult to determine a pattern that would separate the patients into depressed and non-depressed categories.

5.3.4 Unnormalized Energy in the Frequency Band Corresponding to the AM due to the Respiratory Muscles

Motivated by the model and the hypothesis that patients with MDD have more erratic modulation in their voices, the unnormalized energy in a low-frequency band of $|E_L(f)|$ is computed over frequencies from 0 Hz to an upper limit, denoted by $f_{a_{max}}$. The computation is:

$\sum_{f=0}^{f_{a_{max}}} |E_L(f)|^2$

This low-frequency unnormalized energy approximates the energy in the AM region, $|E_{AL}(f)|$, but without taking the normalization over the frequency range into account. The ideal "low-frequency range" is tested empirically by varying the upper limit from 1 Hz to 12 Hz in steps of 1 Hz.

5.3.5 Ratios of Energy in Various Frequency Bands

Based on the hypothesis, put forth by the model, that the ratios of energy in various frequency bands might differ, ratios of the energy in the $|E_{AL}(f)|$ region to the energy in the $|E_{FL}(f)|$ region are computed.
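The band energy of Section 5.3.4, which is also the building block for the band-energy ratios, can be sketched as a sum of squared magnitudes over a frequency range. The function names below are hypothetical, and treating both band edges as inclusive is an assumption.

```python
def band_energy(freqs_hz, mean_mag, f_lo, f_hi):
    """Sum of squared mean STFT magnitudes for bins with f_lo <= f <= f_hi."""
    return sum(m * m for f, m in zip(freqs_hz, mean_mag) if f_lo <= f <= f_hi)

def low_band_sweep(freqs_hz, mean_mag, upper_limits_hz=range(1, 13)):
    """Unnormalized low-band energy for each candidate upper limit (1 to 12 Hz)."""
    return {f_hi: band_energy(freqs_hz, mean_mag, 0.0, f_hi)
            for f_hi in upper_limits_hz}

# Toy spectrum sampled at 0, 1, 2, and 3 Hz.
toy_freqs = [0.0, 1.0, 2.0, 3.0]
toy_mags = [1.0, 2.0, 3.0, 4.0]
sweep = low_band_sweep(toy_freqs, toy_mags)
```

Each entry of `sweep` is one candidate feature; the sweep mirrors the empirical search over upper limits from 1 Hz to 12 Hz.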
One of the challenges of this feature is that the frequency regions of the AM due to the muscles of respiration, and of the AM due to the interaction between the formants and the harmonics, are unknown. To resolve this, numerous different ranges are tested. In all cases, it is assumed that any frequency up to $f_{a_{max}}$ is a frequency region where AM due to the muscles of respiration could occur. However, both the lower and upper frequency bounds on $|E_{FL}(f)|$ are unknown, so both bounds are varied.

There are three forms of the energy ratio feature. Unlike the other features, the average STFT magnitude is computed for frequencies up to 50 Hz. This is to ensure that the maximum frequency of the FM is captured. In each case,

* $|E_{AL}(f)|$ is the mean of the magnitude of the short-time Fourier transform of $\log(|e_A[n]|)$.
* $|E_{FL}(f)|$ is the mean of the magnitude of the short-time Fourier transform of $\log(|e_F[n]|)$.
* $f_{a_{max}}$ is the highest frequency at which $|E_{AL}(f)|$ occurs. This is varied from 1 Hz to 12 Hz in steps of 1 Hz.
* $f_{f1}$ is the lowest frequency at which $|E_{FL}(f)|$ occurs. This is varied from 4 Hz to 16 Hz in steps of 1 Hz.
* $f_{f2}$ is the highest frequency at which $|E_{FL}(f)|$ occurs. This is varied from 20 Hz to 50 Hz in steps of 5 Hz.

Therefore, for each of the three types of ratio features, 12x13x7 = 1092 combinations of frequency region bounds are tested. The three forms of the energy ratio features are the following:

1) The ratio of the energy in the $|E_{FL}(f)|$ region to the energy in the $|E_{AL}(f)|$ region:

$\dfrac{\frac{1}{f_{f2}-f_{f1}} \sum_{f=f_{f1}}^{f_{f2}} |E_{FL}(f)|^2}{\frac{1}{f_{a_{max}}} \sum_{f=0}^{f_{a_{max}}} |E_{AL}(f)|^2}$

2) The ratio of the energy in the $|E_{AL}(f)|$ region to the energy in the $|E_{FL}(f)|$ region:

$\dfrac{\frac{1}{f_{a_{max}}} \sum_{f=0}^{f_{a_{max}}} |E_{AL}(f)|^2}{\frac{1}{f_{f2}-f_{f1}} \sum_{f=f_{f1}}^{f_{f2}} |E_{FL}(f)|^2}$

3) The difference between the energy in the $|E_{FL}(f)|$ region and the energy in the $|E_{AL}(f)|$ region:

$\frac{1}{f_{f2}-f_{f1}} \sum_{f=f_{f1}}^{f_{f2}} |E_{FL}(f)|^2 - \frac{1}{f_{a_{max}}} \sum_{f=0}^{f_{a_{max}}} |E_{AL}(f)|^2$

5.4 Time Domain Features

The time-domain features are the eigenvalues from time-delayed autocorrelation matrices of various segments of the following envelope grades, computed from both the Hilbert and NLIE envelopes:

1) A lowpass filtered version of the Hilbert or NLIE envelope of each waveform. The cutoff frequency of the filter is set to 25 Hz, the upper range of the modulation frequency studied by Cummins [29]. This envelope is called the fine (F) envelope.
2) A lowpass filtered version of the fine envelope. Since Lester and Story compressed subjects' chest walls at a rate of 5 Hz in their study of respiratory tremor [24], we set the cutoff frequency of the lowpass filter at 5 Hz. In this thesis, the lowpass filtered version of the fine envelope is called the coarse (C) envelope.
3) The difference between the fine envelope and the coarse envelope (FMC)
4) The log of the fine envelope (LF)
5) The log of the coarse envelope (LC)
6) The difference between LF and LC (LFMLC)

After computing the 6 envelope grades from both the Hilbert and NLIE envelopes, the features are extracted. The procedure is described in detail in the context of epileptic seizure prediction [12]. Williamson et al. [45] also used the procedure on the first three formants of the AVEC database and found that subjects with higher Beck scores exhibited less "coordination" in the formants.

The first step in the procedure is to z-score the envelope grade, which sets the mean of the envelope grade to 0 and the variance of the envelope grade to 1. The z-scored envelope grade is then separated into five segments, and the mean of each segment is removed. The final correlation matrix has 25 (5x5) blocks, where each block contains a subset of the correlation coefficients between the two segments correlated against each other, as shown in Table 2.
Table 2: Structure of the correlation matrix from which the eigenvalues are calculated. Since there are 5 segments, each of the 25 blocks of the whole correlation matrix contains a subset of the correlation coefficients between two segments.

$R = \begin{bmatrix} R_{1,1} & \cdots & R_{1,5} \\ \vdots & \ddots & \vdots \\ R_{5,1} & \cdots & R_{5,5} \end{bmatrix}$

The cross-correlation of each combination of the 5 segments is computed. To reduce the dimensionality of the cross-correlation matrix, the result of each cross-correlation is downsampled by keeping samples at multiples of 16 points (0.5 ms). This spacing is called the delay. Thirty points, or taps, of each cross-correlation are then sampled. The result is that each block is a cross-correlation matrix of size 30x30.

The values of 5 for the number of segments, 16 for the delay, and 30 for the number of taps were chosen because they are similar to the values chosen by Williamson et al. [12] when they extracted features from the formants on the same database. For the envelope grades in this thesis, we attempted using 3, 4, and 5 segments; 25 and 30 for the number of taps; and 4, 8, and 16 for the number of delays. After obtaining the eigenvalues and summary statistic for each combination, we computed the Spearman correlation between each eigenvalue and the patients' Beck scores. We looked for the combination of (number of delays, number of segments, number of taps) that produced the eigenvalues most strongly correlated with the patients' Beck scores.

Once the final matrix is built for each envelope grade, its eigenvalues and a summary statistic are computed. The eigenvalues are ordered from largest to smallest, and the summary statistic is computed by taking the log of the trace of the covariance matrix. While the eigenvalues capture only frequency and phase-related information, the summary statistic contains information about both the entropy of the 5 segments and their relative amplitudes.
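The construction of the block correlation matrix can be sketched as follows. This is one plausible reading of the procedure, not the thesis implementation: it builds time-shifted copies of each segment and correlates every copy against every other, which yields the same block layout (segments by segments, taps by taps). All function names are hypothetical, the toy signal and small parameter values are chosen only to keep the example fast, and the eigenvalue step is omitted because it needs a numerical library.

```python
import math
from statistics import mean, pstdev

def zscore(x):
    """Scale a sequence to zero mean and unit variance."""
    m, s = mean(x), pstdev(x)
    return [(v - m) / s for v in x]

def split_and_demean(x, n_segments):
    """Cut x into equal-length segments and remove each segment's mean."""
    seg_len = len(x) // n_segments
    segments = [x[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]
    return [[v - mean(seg) for v in seg] for seg in segments]

def delayed_copies(segments, n_taps, delay):
    """Return n_taps time-shifted copies of every segment, all of equal length."""
    usable = len(segments[0]) - (n_taps - 1) * delay
    return [seg[t * delay:t * delay + usable]
            for seg in segments for t in range(n_taps)]

def correlation_matrix(channels):
    """Pearson correlation between every pair of channels."""
    centered = [[v - mean(ch) for v in ch] for ch in channels]
    norms = [math.sqrt(sum(v * v for v in ch)) for ch in centered]
    return [[sum(p * q for p, q in zip(a, b)) / (na * nb)
             for b, nb in zip(centered, norms)]
            for a, na in zip(centered, norms)]

# Toy envelope and small parameters (the text uses 5 segments, 30 taps, delay 16).
envelope = [math.sin(0.3 * n) + 0.1 * math.cos(1.7 * n) for n in range(200)]
segments = split_and_demean(zscore(envelope), n_segments=5)
channels = delayed_copies(segments, n_taps=3, delay=4)
R = correlation_matrix(channels)   # 15 x 15: a 5 x 5 grid of 3 x 3 blocks
```

With the parameters from the text (5 segments, 30 taps), the same construction would give a 150x150 matrix, whose eigenvalues would then be sorted from largest to smallest.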
Since not all of the eigenvalues and summary statistics are useful, Spearman correlations between each feature (i.e. eigenvalues and summary statistic) and the Beck depression scores are computed, and are used to gauge the effectiveness of each feature in predicting Beck scores. As an example, the results of the correlations between each of the features from the 6 grades of the Hilbert envelopes are shown in Figure 30.

[Figure 30: six panels titled "Correlation Between Beck Score and Features from" the C, LC, F, LF, FMC, and LFMLC grades; horizontal axis: Feature Number, 0 to 150.]

Figure 30: Spearman correlations between each feature and Beck score, computed for each envelope grade from the NLIE envelope. The horizontal axis in each plot represents the feature index. The features are ordered from largest eigenvalue to smallest eigenvalue, with the last feature being the summary statistic (i.e. feature number 1 represents the correlation between the largest eigenvalue and the Beck scores, etc.). Top left: Spearman correlation between Beck scores and features from the C grade of the Hilbert envelope. Top right: Spearman correlation between Beck scores and features from the LC grade of the Hilbert envelope. Middle left: Spearman correlation between Beck scores and features from the fine grade of the Hilbert envelope. Middle right: Spearman correlation between Beck scores and features from the log(fine) grade of the Hilbert envelope. Bottom left: Spearman correlation between Beck scores and features from the fine-coarse grade.
Bottom right: Spearman correlation between Beck scores and features from the log(fine)-log(coarse) grade.

All of the features in Figure 30 are extracted using the optimal delays, taps, and segments, discussed in Section 5.4. The correlations for each of the grades in Figure 30 follow the same pattern: the eigenvalues with larger absolute values generally correlate negatively with the Beck scores. After the correlations between each of the eigenvalues and Beck score are computed, the indices of the most strongly correlated eigenvalues are used in Beck score prediction.

5.5 Conclusions

Two principal types of features were introduced in the prediction of a patient's Beck depression score: features extracted from the spectrum of the envelope or log-envelope, and features extracted from the time-domain representation of the envelope. These extracted features are input to the predictor of the Beck depression score, described in Chapter 6.

Chapter 6 Regression and Prediction Using the AVEC MDD Database

This chapter describes the regression and prediction procedure and then discusses the results we obtain using the features from Chapter 5.

6.1 Regression and Prediction Procedure

6.1.1 Gaussian Mixture Model as a Foundation

As a basis for prediction, the type of classifier we use is a Gaussian Mixture Model (GMM). This is a standard classifier and has been demonstrated to be effective in predicting a patient's MDD severity [29][45]. As described in [12], instead of training the GMM using Expectation-Maximization with two classes (depressed and not depressed), a different technique, called Gaussian Staircase Regression (GSR), is used. GSR uses multiple data partitions to create a GMM for Class 1 and Class 2. The features from the 87 vowels are partitioned into seven bins based on the Beck score associated with each vowel.
Vowels corresponding to a Beck score of 0-4 are in the first (least depressed) bin, 5-11 are in the second, 12-19 in the third, 20-26 in the fourth, 27-34 in the fifth, 35-41 in the sixth, and above 41 in the seventh, representing the most depressed subjects. Figure 31 displays the distribution of the partition bins.

[Figure 31: bar chart; horizontal axis: Partition Bins, 1 to 7; vertical axis from 0 to 0.25.]

Figure 31: Distribution of partition bins.

Therefore, the GMM is formed from an ensemble of Gaussian classifiers that are trained from the multiple partitions. This is depicted in Figure 32.

[Figure 32: diagram with regions labeled Class 1 and Class 2 along a Beck score axis.]

Figure 32: Illustration of the score segmentation in the Gaussian Staircase Method.

In the first partition, Class 1 contains data corresponding to Beck scores between 0 and 4, and Class 2 contains data corresponding to Beck scores above 4. A Gaussian classifier for the first partition is produced, where one Gaussian represents the features corresponding to Beck scores between 0 and 4, and a second Gaussian represents the features corresponding to Beck scores above 4. In the second partition, Class 1 contains data corresponding to Beck scores between 0 and 11, and Class 2 contains data corresponding to Beck scores above 11. A second Gaussian classifier follows from that partition, with one Gaussian for Class 1 and one Gaussian for Class 2. Since there are 6 partitions, there are 12 Gaussians that form the GMM.

The advantage of the Gaussian Staircase method is improved resolution for Class 1 among lower Beck scores and better resolution for Class 2 among higher Beck scores. This allows for a test statistic that tends to increase smoothly as the Beck score increases.

The Gaussian densities use full covariance matrices. A constant called the Gaussian regularization factor (GRF) is added to the diagonal of the covariance matrix to prevent overfitting the data. Since some of the subjects appear more than once, the means in the Gaussian model can be adapted toward the mean for the subject.
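The score partitions used by the Gaussian Staircase method described above can be sketched as follows. Only the partitioning is shown; fitting the Gaussians is not. The names `BIN_UPPER_EDGES`, `bin_index`, and `staircase_partitions` are hypothetical, and the toy scores are made up for illustration.

```python
# Upper edges of the first six Beck-score bins; scores above 41 fall in bin 7.
BIN_UPPER_EDGES = [4, 11, 19, 26, 34, 41]

def bin_index(beck_score):
    """Return the 1-based partition bin (1 = least depressed, 7 = most depressed)."""
    for i, edge in enumerate(BIN_UPPER_EDGES):
        if beck_score <= edge:
            return i + 1
    return len(BIN_UPPER_EDGES) + 1

def staircase_partitions(scores):
    """For each of the 6 thresholds, split sample indices into (Class 1, Class 2).

    Class 1 holds scores at or below the threshold, Class 2 holds scores above it.
    One Gaussian per class per partition gives 12 Gaussians in total.
    """
    partitions = []
    for edge in BIN_UPPER_EDGES:
        class1 = [i for i, s in enumerate(scores) if s <= edge]
        class2 = [i for i, s in enumerate(scores) if s > edge]
        partitions.append((class1, class2))
    return partitions

toy_scores = [0, 10, 30, 45]
parts = staircase_partitions(toy_scores)
n_gaussians = 2 * len(parts)
```

As the threshold moves up the staircase, samples migrate from Class 2 to Class 1, which is what lets the combined test statistic vary gradually with the Beck score.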
This process has been called feature adaptation. If a subject utters the /a/ vowels in two sessions, feature adaptation is performed on the second session. The mixing weights are computed as n/(0.5+n), where n is the number of sessions in which the features from the subject have been evaluated (i.e. if feature adaptation is being performed, n=2). The factor of 0.5 is chosen because it was used by Williamson et al. [12]. The purpose of feature adaptation is to smooth the features extracted from a subject; it is similar to the Universal Background Model [46], a widely used technique in speaker recognition.

6.1.2 Training and Testing Procedure

Leave-one-out cross-validation is performed on each of the 87 waveforms: one waveform for testing and 86 for training. Some subjects performed the test during two separate sessions. In those scenarios, only the session being tested is left out (i.e. data from the patient under test is in the training set).

A common problem is the presence of too many features from a particular feature type. For example, this occurs when the average STFT magnitude and variance of the STFT magnitude are being tested. In each case, there are over 100 features when frequencies up to 20 Hz are under test, where each feature corresponds to the average STFT magnitude or variance of the STFT magnitude at a particular frequency, and each feature may be only weakly correlated with the Beck scores. Principal component analysis (PCA) is required to reduce the dimensionality of the feature matrix. Prior to PCA, the size of the feature matrix is 87xN, where N is the number of features (over 100 in the case of the average STFT magnitude and variance of the STFT magnitude). PCA is performed to reduce the dimensionality of the matrix to 87xK, where K<N, and the K components account for the largest amount of variance in the data.
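The mixing weight for feature adaptation and the leave-one-out splits from this section can be sketched as follows. How the weight is applied to the Gaussian means is not spelled out here, so only the weight itself is computed. The function names and the toy recording list are hypothetical.

```python
def mixing_weight(n_sessions, factor=0.5):
    """Mixing weight n / (0.5 + n) for a subject evaluated in n sessions."""
    return n_sessions / (factor + n_sessions)

def leave_one_out(recordings):
    """Yield (test_index, train_indices) with exactly one recording held out.

    recordings: list of (subject_id, session_id) tuples. Only the tested
    session is removed, so another session from the same subject stays in
    the training set, as described in the text.
    """
    for i in range(len(recordings)):
        yield i, [j for j in range(len(recordings)) if j != i]

# Toy data: subject "A" has two sessions, subject "B" has one.
toy_recordings = [("A", 1), ("A", 2), ("B", 1)]
splits = list(leave_one_out(toy_recordings))
weight_two_sessions = mixing_weight(2)
```

For a subject with two sessions the weight is 2/2.5 = 0.8, and in the first toy split the second session of subject "A" remains available for training.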
Without utilizing machine learning, the baseline mean absolute error (MAE) and root mean square error (RMSE) were computed. The baseline MAE is 10.05 and the baseline RMSE is 11.86. In the context of the Beck scores, if $s_i$ is the actual Beck score for subject $i$ and $\hat{s}_i$ is the predicted Beck score for subject $i$, then the MAE is defined as follows:

$\mathrm{MAE} = \frac{1}{87} \sum_{i=1}^{87} |\hat{s}_i - s_i|$

The RMSE is defined as follows:

$\mathrm{RMSE} = \sqrt{\frac{1}{87} \sum_{i=1}^{87} (\hat{s}_i - s_i)^2}$

6.2 Average STFT Magnitude of the NLIE-Stockham and Hilbert-Stockham Envelopes

We compute the average STFT magnitude of the NLIE-Stockham envelope, including frequencies up to both 20 Hz and 50 Hz. The upper limit of 20 Hz is chosen because most of the energy in the NLIE-Stockham envelope is contained below 20 Hz. A second upper limit is set to 50 Hz because this is half of the average assumed fundamental frequency for a male.

6.2.1 NLIE-Stockham Envelope with a Maximum Frequency of 20 Hz

Values for the Gaussian regularization factor (GRF) are varied from 0.1 to 1.5 in steps of 0.1, while the number of PCA components is varied from 2 to 7. This is executed without feature adaptation and subsequently with feature adaptation. The features from the STFT are taken from frequencies between 0 and 20 Hz.

[Figure 33: four heat-map panels titled RMSEs, MAEs, Spearman ρ, and Spearman p-values Without Feature Adapt.; labeled axis: GRF.]

Figure 33: RMSEs, MAEs, and Spearman correlations of the average STFT magnitude of the NLIE-Stockham envelope, with a maximum frequency of 20 Hz, without feature adaptation. In all plots, unless otherwise noted, the GRF is varied from 0.1 to 1.5, and K, the number of PCA components, is varied from 2 to 7. For all plots except the bottom left, a cooler color indicates a lower value, which is desirable.
Top left: RMSEs when using the average STFT magnitude feature on the NLIE-Stockham envelope, up to 20 Hz, without feature adaptation, while varying the number of PCA components from 2 to 7 and simultaneously varying the GRF from 0.1 to 1.5. Top right: MAEs on the same data. Bottom left: Spearman correlations (ρ). Bottom right: Spearman p-values.

The results obtained using this strategy are displayed in Figure 33. The lowest RMSE and lowest MAE occur at different (GRF, K) coordinates, where K is the number of PCA components used and the GRF is the Gaussian regularization factor, described in Section 6.1. The lowest RMSE, 11.07, occurs at (GRF, K) coordinates of (1.5, 7) (ρ=0.41, p<0.001). However, the lowest MAE, 9.01, occurs at (0.5, 5) (ρ=0.363, p<0.001). The greatest Spearman correlation, 0.430, occurs at (0.1, 2). At that point, the RMSE is 11.28 and the MAE is 9.15, both of which are slightly lower than baseline.

The same procedure is executed again, except that feature adaptation is performed. Figure 34 shows the RMSEs, MAEs, Spearman ρ, and Spearman p-values between the predicted Beck score and actual Beck score when the number of PCA components and the GRF are varied.

[Figure 34: four heat-map panels titled RMSEs, MAEs, Spearman ρ, and Spearman p-values With Feature Adapt.; labeled axis: GRF.]

Figure 34: RMSEs, MAEs, and Spearman correlations of the average STFT magnitude of the NLIE-Stockham envelope, with a maximum frequency of 20 Hz, with feature adaptation.

The lowest RMSE, lowest MAE, and highest Spearman correlation are 10.58, 8.52, and 0.477 (p<0.001), respectively. Unlike the case without feature adaptation, all of these extrema occur at (0.1, 2). These coordinates are the same as those in the case without feature adaptation where the highest Spearman correlation was found.
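The three evaluation measures used throughout this chapter can be sketched in plain Python as follows. This is an illustration rather than the thesis code: the function names and toy scores are made up, ties in the Spearman correlation are handled with average ranks, and the constant-mean predictor is only an assumed stand-in for the baseline, which the text does not define explicitly.

```python
import math

def mae(actual, predicted):
    """Mean absolute error between actual and predicted scores."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error between actual and predicted scores."""
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = math.sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))
    return num / den

toy_actual = [0, 10, 20, 30]
toy_predicted = [5, 8, 25, 28]
# Assumed baseline: predict the mean actual score for every subject.
toy_baseline = [sum(toy_actual) / len(toy_actual)] * len(toy_actual)
```

On the toy data, the predictions preserve the ordering of the actual scores, so the Spearman correlation is 1 even though the MAE and RMSE are nonzero; this is why the chapter reports all three measures side by side.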
With 2 PCA components and a GRF of 0.1, the average STFT magnitude of the NLIE-Stockham envelope, using frequencies up to 20 Hz, yields an RMSE of 10.58, which is more than one point below the baseline RMSE of 11.86. The baseline MAE is 10.05. Similarly, the lowest MAE from this feature, 8.52, is more than a point below baseline. At (0.1, 2), the actual Beck score, predicted Beck score, and line of best fit are shown in Figure 35.

[Figure 35: scatter plot of predicted score versus Beck score with a line of best fit; horizontal axis: Beck Score, 0 to 50.]

Figure 35: Predicted score vs. Beck score from the average STFT magnitude with a maximum frequency of 20 Hz feature, using feature adaptation, when GRF=0.1 and K=2 (ρ=0.477, p=0.001). Red line shows line of best fit.

6.2.2 Hilbert-Stockham Envelope, with a Maximum Frequency of 20 Hz

The procedure described in Section 6.2.1 is used to obtain the RMSEs, MAEs, and Spearman correlations on the Hilbert-Stockham envelopes: the GRF and K values are varied from 0.1 to 1.5 and from 2 to 7, respectively. Cross-validation shows that when feature adaptation is not performed, regardless of the parameters used when computing the average of the STFT magnitude of the Hilbert-Stockham envelope, the RMSE and MAE are higher than those achieved when guessing. The complete results for the Hilbert-Stockham envelope up to 20 Hz, without feature adaptation, are shown in Figure 36.
[Figure 36: four heat-map panels titled RMSEs, MAEs, Spearman ρ, and Spearman p-values Without Feature Adapt.; labeled axis: GRF.]

Figure 36: RMSEs, MAEs, and Spearman correlations of the average STFT magnitude, with a maximum frequency of 20 Hz, of the Hilbert-Stockham envelope, without feature adaptation.

Observe that in Figure 36, none of the RMSEs are below the baseline value of 11.86, and none of the MAEs are below the baseline value of 10.05. However, when feature adaptation is performed, the Hilbert-Stockham envelope performs marginally better than guessing under certain combinations of (GRF, K). The results are shown in Figure 37.

[Figure 37: four heat-map panels titled RMSEs, MAEs, Spearman ρ, and Spearman p-values With Feature Adapt.; labeled axis: GRF.]

Figure 37: RMSEs, MAEs, and Spearman correlations of the average STFT magnitude of the Hilbert-Stockham envelope, using a maximum frequency of 20 Hz, with feature adaptation. Top left: RMSEs. Top right: MAEs. Bottom left: Spearman correlations (ρ). Bottom right: Spearman p-values.

Although the average STFT magnitude of the Hilbert-Stockham envelope, using a maximum frequency of 20 Hz, does not accurately predict subjects' Beck scores without feature adaptation, there is a marginal gain when feature adaptation is performed.

6.2.3 Stockham Envelope, with a Maximum Frequency of 20 Hz

We also attempt to predict MDD severity using the average STFT magnitude of the Stockham envelope up to 20 Hz. The results without feature adaptation are shown in Figure 38.

[Figure 38: four heat-map panels titled RMSEs, MAEs, Spearman ρ, and Spearman p-values Without Feature Adapt.; labeled axis: GRF.]

Figure 38: RMSEs, MAEs, and Spearman correlations of the average STFT magnitude of the Stockham envelope, with a maximum frequency of 20 Hz, without feature adaptation.
Compared to the NLIE-Stockham envelope, the Hilbert-Stockham and Stockham envelopes perform poorly. The NLIE-Stockham envelope shows an improvement in the predicted Beck score relative to baseline, whereas the Hilbert-Stockham and Stockham envelopes are less accurate than baseline unless feature adaptation is performed. Table 3 summarizes the results from all envelopes, with and without feature adaptation, and compares them to baseline. Some cells in the table contain "N/A" because the classifier performed less accurately than baseline.

Table 3: Comparison of results from NLIE-Stockham (N-S), Hilbert-Stockham (H-S), and Stockham envelopes, using the average STFT magnitude up to 20 Hz. The values inside the parentheses indicate the (GRF, K) values. The lowest MAE, RMSE, and p-value, and highest Spearman correlation, are in blue font.

Envelope | Lowest RMSE | MAE at lowest RMSE | Spearman ρ at lowest RMSE | Spearman p at lowest RMSE | (GRF, K) at lowest RMSE | Highest Spearman ρ | Spearman p at highest ρ | RMSE at highest ρ | MAE at highest ρ | (GRF, K) at highest ρ
Baseline | 11.86 | 10.05 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A
N-S w/o Feat. Adapt. | 11.07 | 9.08 | 0.402 | <0.001 | (1.5,7) | 0.430 | <0.001 | 11.28 | 9.15 | (0.1,2)
H-S w/o Feat. Adapt. | 12.30 | 10.47 | N/A | N/A | (1.5,2) | N/A | N/A | N/A | N/A | N/A
Stockham w/o Feat. Adapt. | 12.30 | 10.46 | N/A | N/A | (0.1,2) | N/A | N/A | N/A | N/A | N/A
N-S w/ Feat. Adapt. | 10.58 | 8.52 | 0.477 | <0.001 | (0.1,2) | 0.477 | <0.001 | 10.58 | 8.52 | (0.1,2)
H-S w/ Feat. Adapt. | 11.62 | 9.63 | 0.245 | 0.0223 | (0.4,2) | 0.247 | 0.0212 | 11.62 | 9.66 | (0.5,2)
Stockham w/ Feat. Adapt. | 11.64 | 9.61 | 0.234 | 0.0294 | (0.4,2) | 0.237 | 0.0274 | 11.65 | 9.59 | (0.3,2)

Table 3 indicates that features extracted from the NLIE-Stockham envelope predict Beck scores more accurately than those from either the Hilbert-Stockham or Stockham envelopes.
Spearman correlations and p-values are not provided when the RMSE and MAE are higher than baseline because the correlation between the predicted Beck score and actual Beck score is no longer meaningful when that occurs.

6.2.4 NLIE-Stockham Envelope, With a Maximum Frequency of 50 Hz

Since the exact range of frequencies in e[n] is unknown, the average STFT magnitude up to 50 Hz is also computed, and the same procedure as outlined in Section 6.1.1 is performed. Without performing feature adaptation, the MAEs, RMSEs, and Spearman correlations are shown in Figure 39.

[Figure 39: four heat-map panels titled RMSEs, MAEs, Spearman ρ, and Spearman p-values Without Feature Adapt.; labeled axis: GRF.]

Figure 39: RMSEs, MAEs, and Spearman correlations of the average STFT magnitude of NLIE-Stockham envelope, up to 50 Hz, without feature adaptation. Top left: RMSEs. Top right: MAEs. Bottom left: Spearman correlations (ρ). Bottom right: Spearman p-values.

The lowest MAE achieved is 9.20, which occurs at (0.2, 2) and has a corresponding Spearman correlation between the actual Beck score and predicted Beck score of 0.421 (p<0.001). The lowest RMSE is 11.17, which occurs at (0.1, 2) and has a Spearman correlation of 0.423 (p<0.001). The (GRF, K) coordinates at which the highest Spearman correlation occurs are (0.1, 2), the same coordinates at which the lowest RMSE occurs. At (0.1, 2), the Spearman correlation is 0.425 (p<0.001).

Although these MAE and RMSE values are not as low as those found when the upper limit on the frequency range is 20 Hz, the 50 Hz feature yields a lower MAE and RMSE than the 20 Hz feature when feature adaptation is performed. Figure 40 displays the results when feature adaptation is performed.
[Figure 40: four heat-map panels titled RMSEs, MAEs, Spearman ρ, and Spearman p-values With Feature Adapt.; labeled axis: GRF.]

Figure 40: RMSEs, MAEs, and Spearman correlations of the average STFT magnitude of the log-envelope from NLIE-Stockham envelope, up to 50 Hz, with feature adaptation. In all plots, the GRF is varied from 0.1 to 1.5 and K, the number of PCA components, is varied from 2 to 7. For all plots except the bottom left, a cooler color indicates a lower value, which is desirable. Top left: RMSEs. Top right: MAEs. Bottom left: Spearman correlations (ρ). Bottom right: Spearman p-values.

The lowest MAE achieved is 8.46, which occurs at (0.2, 2), and has a corresponding Spearman correlation of 0.487 (p<0.001). This is more than 1.5 points lower than baseline. The lowest RMSE achieved is 10.32, which also occurs at (0.2, 2) and has a corresponding Spearman correlation of 0.487 (p<0.001). The highest Spearman correlation is 0.512, at (0.1, 4) (p<0.001). Figure 41 shows the predicted Beck score versus actual Beck score using the average STFT magnitude of NLIE-Stockham envelope, up to 50 Hz, with feature adaptation, at (0.2, 2).

[Figure 41: scatter plot titled "Predicted vs. Actual Beck Score, Average STFT, With Adaptation" with a line of best fit; horizontal axis: Beck Score, 0 to 50.]

Figure 41: Predicted Beck score vs. actual Beck score from the average STFT magnitude of NLIE-Stockham up to 50 Hz features, using feature adaptation, when GRF=0.2 and K=2 (ρ=0.487, p<0.001). Red line shows line of best fit.

Figure 41 shows the correlation between predicted and actual Beck scores, and the line of best fit. This is the strongest correlation obtained of all features.

6.2.5 Hilbert-Stockham Envelope with a Maximum Frequency of 50 Hz

We also investigate the average STFT magnitude of the Hilbert-Stockham envelope, using frequencies up to 50 Hz. Figure 42 shows the results when feature adaptation is not performed.
[Figure 42: four heat-map panels titled RMSEs, MAEs, Spearman ρ, and Spearman p-values Without Feature Adapt.; labeled axis: GRF.]

Figure 42: RMSEs, MAEs, and Spearman correlations of the average STFT magnitude, with a maximum frequency of 50 Hz, of the Hilbert-Stockham envelope, without feature adaptation.

Even when frequencies up to 50 Hz are included in the Hilbert-Stockham envelope STFT, the classifier performs less accurately than baseline when feature adaptation is not performed. When feature adaptation is performed, some (GRF, K) combinations produce RMSEs and MAEs that are lower than baseline. However, the MAEs and RMSEs are not as low as those produced by the average STFT magnitude on the NLIE-Stockham envelope up to 50 Hz. The results when feature adaptation is performed are shown in Figure 43.

[Figure 43: four heat-map panels titled RMSEs, MAEs, Spearman ρ, and Spearman p-values With Feature Adapt.; labeled axis: GRF.]

Figure 43: RMSEs, MAEs, and Spearman correlations of the average STFT magnitude, with a maximum frequency of 50 Hz, of the Hilbert-Stockham envelope, with feature adaptation.

The features derived from the Hilbert-Stockham envelope never predict the subjects' Beck scores as accurately as the features derived from the NLIE-Stockham envelope. Table 4 summarizes the results from the average STFT magnitude, with a maximum frequency of 50 Hz, extracted from the NLIE-Stockham and Hilbert-Stockham envelopes.

Table 4: Comparison of results from NLIE-Stockham (N-S) and Hilbert-Stockham (H-S) envelopes, using the average STFT magnitude with a maximum frequency of 50 Hz. The values inside the parentheses indicate the (GRF, K) values.
The lowest MAE, RMSE, and p-value, and highest Spearman correlation, are in blue font.

Metric                            Baseline   N-S w/o Feat. Adapt.   H-S w/o Feat. Adapt.   N-S w/ Feat. Adapt.   H-S w/ Feat. Adapt.
Lowest RMSE                       11.86      11.17                  12.02                  10.32                 11.48
MAE at lowest RMSE                10.05      9.26                   10.24                  8.46                  9.45
Spearman ρ at lowest RMSE         N/A        0.425                  N/A                    0.487                 0.287
Spearman p-value at lowest RMSE   N/A        <0.001                 N/A                    <0.001                0.007
(GRF, K) at lowest RMSE           N/A        (0.1, 2)               (1.5, 4)               (0.2, 2)              (1.5, 5)
Highest Spearman ρ                N/A        0.425                  N/A                    0.518                 0.287
Spearman p-value at highest ρ     N/A        <0.001                 N/A                    <0.001                0.007
RMSE at highest ρ                 N/A        11.17                  N/A                    10.57                 11.48
MAE at highest ρ                  N/A        9.26                   N/A                    8.58                  9.45
(GRF, K) at highest ρ             N/A        (0.1, 2)               N/A                    (0.1, 4)              (1.5, 5)

The average STFT magnitude, with a maximum frequency of 50 Hz, of the NLIE-Stockham envelope, with feature adaptation, most accurately predicts the subjects' Beck scores. This remains true even when considering the features derived from the average STFT magnitude, with a maximum frequency of 20 Hz, of the NLIE-Stockham envelope.

6.3 Variance of the Magnitude of the STFT of the Log-Envelope

6.3.1 NLIE-Stockham

Figure 44 shows the RMSEs, MAEs, Spearman ρ values, and Spearman p-values between the predicted Beck score and actual Beck score when the number of PCA components and the GRF are varied, and when feature adaptation is not performed. The only feature used is the variance of the STFT magnitude of the NLIE-Stockham envelope, using frequencies between 0 and 20 Hz.

[Figure 44 panels: RMSEs Without Feature Adapt.; MAEs Without Feature Adapt.; Spearman ρ Without Feature Adapt.; Spearman p-value Without Feature Adapt. Horizontal axis: GRF.]

Figure 44: RMSEs, MAEs, and Spearman correlations of the variance of the STFT magnitude of the envelope, with a maximum frequency of 20 Hz, from NLIE-Stockham, without feature adaptation.

The lowest RMSE, 11.67, occurs at (0.1, 3). The value of 11.67 is only marginally lower than baseline.
The lowest MAE, 9.60, occurs at (0.1, 2). Again, it is only marginally lower than baseline. The highest Spearman correlation, 0.341 (p=0.001), occurs at (0.1, 3).

The experiments over K and the GRF are also performed using feature adaptation. The results are shown in Figure 45.

[Figure 45 panels: RMSEs With Feature Adapt.; MAEs With Feature Adapt.; Spearman ρ With Feature Adapt.; Spearman p-value With Feature Adapt. Horizontal axis: GRF.]

Figure 45: RMSEs, MAEs, and Spearman correlations of the variance of the STFT magnitude of the envelope from NLIE-Stockham (up to 20 Hz) with feature adaptation.

Both the RMSE and MAE reach a minimum at (0.2, 6), and their values are 11.33 and 9.32, respectively. Similar to the case without feature adaptation, this is only a marginal gain over baseline.

6.3.2 Hilbert-Stockham

Similar to the mean STFT feature of the Hilbert-Stockham envelope, the variance of the STFT magnitude of the Hilbert-Stockham envelope does not produce meaningful results. Without feature adaptation, the lowest RMSE achieved is 11.90, which is slightly worse than baseline. The best MAE is 9.98, which is less than a tenth of a point better than baseline. Further, there is no correlation between the actual Beck score and the predicted Beck score.

The results are not improved when feature adaptation is performed. The best RMSE is 11.81, which is 0.05 points more accurate than baseline. However, again, there is no correlation between the predicted Beck score and the actual Beck score. The lowest MAE achieved is 9.88, which is less than 0.20 points better than baseline.

It can be concluded that, with the parameters used, the variance of the STFT magnitude of the Hilbert-Stockham envelope is not helpful in predicting a subject's Beck score.
It is possible that different window lengths might be useful, but we do not find this feature helpful when each window is 1 second long and applied at half-second delays along the 3-second signal.

6.4 Coefficient of Variation (CV) of the Magnitude of the STFT of the Log-Envelope

6.4.1 NLIE-Stockham

When the number of PCA components and the GRF are varied in the same manner as when testing the mean and the variance of the magnitude of the STFT of the log-envelope, the classifier often performs worse than guessing. Figure 46 illustrates the results when feature adaptation is not performed.

[Figure 46 panels: RMSEs Without Feature Adapt.; MAEs Without Feature Adapt.; Spearman ρ Without Feature Adapt.; Spearman p-value Without Feature Adapt. Axis label: K.]

Figure 46: RMSEs, MAEs, and Spearman correlations of the coefficient of variation of the STFT magnitude of the envelope from NLIE-Stockham (up to 20 Hz) without feature adaptation.

For most (GRF, K) combinations, the classifier performs less accurately than baseline. The lowest MAE is 9.99, which occurs at (1.5, 3). At those coordinates, the RMSE is 11.89, which is slightly less accurate than baseline. There is no statistically significant correlation between the predicted Beck score and the actual Beck score (ρ=-0.103, p=0.342). The lowest RMSE is 11.74, which occurs at (0.3, 5). At those (GRF, K) coordinates, the MAE is 10.14, which is worse than guessing. Again, there is no statistically significant correlation between the predicted Beck score and the actual Beck score (ρ=0.006, p=0.953).

When feature adaptation is performed, the MDD severity is predicted even less accurately, as shown in Figure 47. The RMSE never reaches a value below baseline; thus, the coefficient of variation feature performs less accurately than baseline.
[Figure 47 panels: RMSEs With Feature Adapt.; MAEs With Feature Adapt.; Spearman ρ With Feature Adapt.; Spearman p-value With Feature Adapt. Axis label: K.]

Figure 47: RMSEs, MAEs, and Spearman correlations of the coefficient of variation of the STFT magnitude of the NLIE-Stockham envelope, with frequencies up to 20 Hz, with feature adaptation.

With the poor Spearman correlations, RMSEs, and MAEs, it can be concluded that with the parameters used, the CV of the NLIE-Stockham envelope is not a satisfactory feature for predicting subjects' Beck scores. It is possible that if a different signal processing method were used, or if different window lengths in the STFT were used, this could be a useful feature. Further experiments need to be performed before this feature is determined to be unhelpful in predicting subjects' Beck scores.

6.4.2 Hilbert-Stockham

Similar to the CV feature extracted from the NLIE-Stockham envelope, and the mean STFT magnitude and variance from the Hilbert-Stockham envelope, the predictor performs less accurately than baseline in many cases. There is no correlation between the predicted Beck score and actual Beck score, both with and without feature adaptation. The results are displayed in Figure 48 and Figure 49 for the cases without feature adaptation and with feature adaptation, respectively.

[Figure 48 panels: RMSEs Without Feature Adapt.; MAEs Without Feature Adapt.; Spearman ρ Without Feature Adapt.; Spearman p-value Without Feature Adapt. Axis label: K.]

Figure 48: RMSEs, MAEs, and Spearman correlations of the coefficient of variation of the STFT magnitude of the Hilbert-Stockham envelope, with frequencies up to 20 Hz, without feature adaptation.
[Figure 49 panels: RMSEs With Feature Adapt.; MAEs With Feature Adapt.; Spearman ρ With Feature Adapt.; Spearman p-value With Feature Adapt. Axis label: K.]

Figure 49: RMSEs, MAEs, and Spearman correlations of the coefficient of variation of the STFT magnitude of the Hilbert-Stockham envelope, with frequencies up to 20 Hz, with feature adaptation.

Based on these results, using the CV of the STFT magnitude of the Hilbert-Stockham envelope with the parameters we chose does not aid in predicting subjects' MDD severity.

6.5 Unnormalized Energy in the Low Frequency Band

The feature explored in this section is the unnormalized energy hypothesized to lie in the frequency band of the AM due to the muscles of respiration. The GRF is varied from 0.1 to 1.5 in steps of 0.1, and the upper frequency limit of the band is varied from 1 Hz to 12 Hz. Unlike the previous features, a single value for each session is generated, so the number of PCA components does not need to be varied.

6.5.1 NLIE-Stockham

Figure 50 shows the RMSEs, MAEs, and Spearman correlations for the low-frequency feature when there is no feature adaptation, as the GRF is varied from 0.1 to 1 and the upper frequency limit is varied from 1 Hz to 12 Hz.

[Figure 50 panels: MAEs Without Feature Adapt.; RMSEs Without Feature Adapt.; Spearman ρ Without Feature Adapt.; Spearman p-value Without Feature Adapt. Horizontal axis: GRF.]

Figure 50: RMSEs, MAEs, and Spearman correlations of the unnormalized energy in the low-frequency region of the NLIE-Stockham envelope, without feature adaptation. The horizontal axis is the GRF, but the vertical axis is now the upper frequency limit.

In the following, points are given as (GRF, upper frequency limit in Hz). The lowest RMSE achieved is 10.73, which occurs when the upper frequency limit is 12 Hz and the GRF is 1. At (1, 12), the MAE is 9.11. Both of these values are improvements over baseline. However, the lowest MAE occurs when the upper frequency limit is 11 Hz and the GRF is 0.1.
At (0.1, 11), the MAE is 8.95, 1.1 points below baseline, and the RMSE is 10.91, 0.95 points below baseline.

It is interesting that the RMSEs and MAEs improve as the upper frequency limit increases. If there were a clear boundary between the AM region due to the respiratory muscles and the AM region due to the interaction between the harmonics and the formants, it would be expected that there would be little energy in that boundary region. As a result, as the upper frequency limit were increased, there would be a local minimum in the RMSE and MAE, and then the RMSE and MAE would once again increase. However, this pattern is not seen.

Regardless, the highest Spearman correlation is achieved at (1, 12), which is the same point at which the RMSE is lowest. At those values of the GRF and upper frequency bound on |E_FL[n]|, the Spearman correlation is 0.466 (p<0.001). The plot showing the predicted Beck score and actual Beck score is displayed in Figure 51. The correlation between the predicted score and actual score is one of the higher correlations obtained.

[Figure 51: scatter plot titled "Predicted vs. Actual Beck Score, Energy in Frequencies from 0 to 12Hz, No Adaptation"; horizontal axis: Beck Score.]

Figure 51: Predicted score vs. Beck score using the energy in frequencies from 0 to 12 Hz feature from the NLIE-Stockham envelope, without feature adaptation. Red line shows line of best fit. Here, the GRF is set to 1.

The results from the same features but with feature adaptation are shown in Figure 52. When the GRF and the upper frequency limit are varied and the features are adapted toward the means for the subjects, the lowest RMSE achieved is 10.86. This occurs at the point (1, 11). Interestingly, this is higher than the RMSE achieved without feature adaptation. The lowest MAE is 8.94, which is more than a point lower than baseline. This occurs at the point (0.3, 11). The highest Spearman correlation is 0.476, and this occurs at the point (0.9, 12).
[Figure 52 panels: MAEs With Feature Adapt.; RMSEs With Feature Adapt.; Spearman ρ With Feature Adapt.; Spearman p-value With Feature Adapt. Horizontal axis: GRF.]

Figure 52: RMSEs, MAEs, and Spearman correlations of the unnormalized energy in the low-frequency region of the NLIE-Stockham envelope, with feature adaptation. The horizontal axis is the GRF, but the vertical axis is now the upper frequency limit.

There are some slight performance improvements achieved with regard to the MAE and RMSE by using the energy in the low-frequency region. Compared to a baseline value of 11.86, the lowest RMSE is 10.73. This occurs when the upper frequency limit is 12 Hz and the GRF is 1, without feature adaptation. This is an improvement of 1.1 points on the Beck scale. The classifier was usually guessing a mid-range Beck score.

One of the disadvantages of simply computing the energy in the low-frequency range is that it is likely to be correlated with the overall sound intensity level of the signal. To remove dependence on this intensity level, an estimate of the ratio of the AM from the respiratory muscles to the AM from the harmonics-formants interaction is computed.

6.5.2 Hilbert-Stockham

Similar to the previous features computed with the Hilbert-Stockham envelope, there are no statistically significant correlations seen when the unnormalized energy in frequencies up to 12 Hz is computed.

6.6 Energy Ratio

Since frequency-domain features extracted from the Hilbert-Stockham envelope do not perform as accurately as those from the NLIE-Stockham envelope, the energy ratio features are computed only for the NLIE-Stockham envelope. The energy ratios lead to a slight improvement in the MAEs and RMSEs, even when all 3 types of features and 1,092 frequency ranges are attempted. In each case, the GRF is set to 0.2 to reduce computation time. The lowest MAE for each type of feature and each frequency range is shown in Figure 53.
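The band-energy features of this section can be sketched as follows. This is only an illustration under stated assumptions, not the thesis implementation: the function names are hypothetical, energy is taken as the sum of squared spectral magnitudes, band edges are inclusive, and the sign convention of the log-energy difference is chosen arbitrarily. No claim is made here about which of the two bands the labels "F" and "A" refer to.

```python
import math


def band_energy(magnitudes, bin_hz, lo_hz, hi_hz):
    """Sum of squared magnitudes over bins whose frequency lies in [lo_hz, hi_hz].

    magnitudes[k] is assumed to be the spectral magnitude at k * bin_hz.
    """
    return sum(
        m * m
        for k, m in enumerate(magnitudes)
        if lo_hz <= k * bin_hz <= hi_hz
    )


def band_features(magnitudes, bin_hz, low_top_hz, high_bottom_hz, high_top_hz):
    """Two energy ratios and a log-energy difference between a low and a high band.

    The low band is [0, low_top_hz]; the high band is
    [high_bottom_hz, high_top_hz]. All three thresholds are inputs.
    """
    low = band_energy(magnitudes, bin_hz, 0.0, low_top_hz)
    high = band_energy(magnitudes, bin_hz, high_bottom_hz, high_top_hz)
    return {
        "low_over_high": low / high,
        "high_over_low": high / low,
        # Assumed sign convention: log(low) minus log(high).
        "log_difference": math.log(low) - math.log(high),
    }
```

With 1 Hz bin spacing, calling `band_features(spectrum, 1.0, 12.0, 16.0, 50.0)` would compare the 0 to 12 Hz band against the 16 to 50 Hz band. Unlike the raw low-band energy, the ratios and the log-difference do not change when the whole spectrum is scaled by a constant.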
[Figure 53 panels: Lowest MAE for Each Energy Ratio Feature, No Adaptation; Lowest MAE for Each Energy Ratio Feature, With Adaptation. Horizontal axis categories: F/A, A/F, F-A.]

Figure 53: Lowest MAEs. Baseline is 10.05. Top: Lowest MAEs for each energy ratio feature, over all 1,092 frequency ranges tested, when feature adaptation was not performed. Bottom: lowest MAEs for the same features and frequency ranges, but when feature adaptation was performed. The bottom axes of both bar graphs indicate the type of feature to which the numbers correspond. F/A is the ratio of the energy in the |E_FL(f)| region to the energy in the |E_A(f)| region; A/F is the ratio of the energy in the |E_A(f)| region to the energy in the |E_FL(f)| region; F-A is the difference between the two regions.

The feature with the lowest MAE is the difference between frequency regions (F-A). The MAE for the F-A feature is 8.95 when feature adaptation is not performed, and 8.94 when feature adaptation is performed. These MAEs are approximately 1.1 points better than baseline, which is 10.05. The lowest MAE for the F-A feature, both with and without feature adaptation, occurs when the three frequency thresholds (the upper limit of the low band, followed by the lower and upper limits of the high band) are (11 Hz, 16 Hz, 50 Hz). This is interesting because 16 Hz is the highest value tested for the second threshold, and 50 Hz is the lowest value tested for the third threshold, while 11 Hz is close to the upper limit on the first threshold.

[Figure 54 panels: Lowest RMSE for Each Energy Ratio Feature, No Adaptation; Lowest RMSE for Each Energy Ratio Feature, With Adaptation. Horizontal axis categories: F/A, A/F, F-A.]

Figure 54: Lowest RMSEs. Baseline is 11.86. Top: Lowest RMSEs for each energy ratio feature, over all 1,092 frequency ranges tested, when feature adaptation was not performed. Bottom: lowest RMSEs for the same features and frequency ranges, but when feature adaptation was performed.
The bottom axes of both bar graphs indicate the type of feature to which the numbers correspond. F/A is the ratio of the energy in the |E_FL(f)| region to the energy in the |E_A(f)| region; A/F is the ratio of the energy in the |E_A(f)| region to the energy in the |E_FL(f)| region; F-A is the difference between the two regions.

Figure 54 displays the lowest RMSE for each type of feature and each frequency range. The difference between the frequency regions again yields the lowest RMSEs, both with and without feature adaptation. However, the decreases in the RMSE are less than 1 point. In the case without feature adaptation, the lowest RMSE, 10.91, occurs when the three frequency thresholds (the upper limit of the low band, followed by the lower and upper limits of the high band) are (12 Hz, 16 Hz, 50 Hz). When feature adaptation is performed, the lowest RMSE achieved, 11.00, occurs when the thresholds are (11 Hz, 16 Hz, 50 Hz). Those values are identical to the values that produce the lowest MAE.

Without performing feature adaptation on the log-energy difference feature group, the highest Spearman correlation between the actual and predicted scores also occurs when the thresholds are (12 Hz, 16 Hz, 50 Hz). At those frequency thresholds, the Spearman correlation is 0.450 (p<0.001). This is consistent with the frequency thresholds that yielded the lowest RMSE without feature adaptation.

The actual Beck scores and predicted Beck scores, along with the line of best fit, are shown for the log-energy difference feature, without performing feature adaptation, in Figure 55.

[Figure 55: scatter plot titled "Predicted vs. Actual Beck Score, Energy in Frequencies from 0 to 12Hz, No Adaptation"; horizontal axis: Beck Score.]

Figure 55: Predicted score vs. Beck score for the energy difference feature, when the frequency thresholds are (12 Hz, 16 Hz, 50 Hz) and no feature adaptation is performed. Red line shows line of best fit.
The correlation obtained between the predicted score and Beck score with the energy difference feature, without feature adaptation, is 0.016 lower than that obtained from the unnormalized energy feature (0.450 versus 0.466). Thus, it seems likely that the majority of the correlation in the energy difference feature is due to the energy below 12 Hz.

Overall, the log-energy difference feature appears to be promising. Most variations reveal that taking the log-energy between 0 and 12 Hz and the log-energy between 16 and 50 Hz provides the most accurate Beck score predictions. If the assumption that the AM due to the respiratory muscles is at a lower frequency than the AM due to the interaction between the formants and the harmonics is true, it appears that a reasonable estimate is that the AM due to the respiratory muscles occurs between 0 and 12 Hz, and the interaction between the formants and harmonics of the fundamental occurs between 16 Hz and 50 Hz. However, it is possible that these region boundaries are inexact, because the maximum upper limit tested for the low band is 12 Hz, and the minimum lower limit and maximum upper limit tested for the high band are 16 Hz and 50 Hz, respectively.

6.7 Time-Domain Features

This sub-section presents the results obtained when predicting subjects' Beck scores based on the time-domain features described in Section 5.4.

The correlation between each of the features from the envelope grade and the Beck scores allows us to identify the eigenvalues that are the most likely to predict the patients' Beck scores. We perform the regression/prediction step 43 times for each envelope grade, gradually increasing the number of features used from 1 to 151. If more than five features are used, PCA is performed and the data is reduced to five dimensions. For example, during the first run of the GMM on the coarse grade, we use only the feature that has the strongest correlation with the Beck scores. In the case of the coarse grade from the NLIE envelope, this is the third eigenvalue.
During the second run, we use the features with the strongest and second-strongest Spearman correlations. For the coarse grade from the NLIE envelope, these are the third and seventh eigenvalues. The GRF is set to 0.2 because this is the value used by Williamson et al. [12].

6.7.1 Eigenvalues and Summary Statistic from NLIE Envelope

Table 5 summarizes the results from the NLIE envelope at each of the six grades when feature adaptation was not performed. The lowest RMSE, 11.00, occurs when the LFMLC grade is used. This is an improvement of approximately one point on the Beck scale. When the RMSE is 11.00, the Spearman correlation is 0.380 (p<0.001), and the 59th, 13th, 9th, and 6th largest eigenvalues are input to the GMM. It is not surprising that the LFMLC feature yields the lowest RMSE and highest Spearman correlation, because we had hypothesized that the differences in the frequency content would aid in Beck score prediction.

Table 5: Summary of results from NLIE envelope at each grade, without feature adaptation. C denotes the coarse envelope, F denotes the fine envelope, FMC denotes the fine minus coarse envelope, LC denotes the log of the coarse envelope, LF denotes the log of the fine envelope, and LFMLC denotes the log of the fine envelope minus the log of the coarse envelope.

Envelope   Grade   Lowest RMSE   MAE at lowest RMSE   Spearman ρ at lowest RMSE   Spearman p-value at lowest RMSE   Eigenvalue indices at lowest RMSE
NLIE       C       11.33         9.12                 0.367                       <0.001                            3, 7
NLIE       F       11.54         9.46                 0.296                       0.005                             3
NLIE       FMC     11.38         9.09                 0.371                       <0.001                            1, 57, 40, 50, ... (a total of 110 eigenvalues are used in PCA)
NLIE       LC      11.44         9.27                 0.339                       0.001                             3, 7, 17
NLIE       LF      11.60         9.39                 0.274                       0.010                             9, 23, 3, 5
NLIE       LFMLC   11.00         8.86                 0.380                       <0.001                            59, 13, 9, 60

Figure 56 illustrates the predicted vs. actual Beck scores using the 59th, 13th, 9th, and 6th eigenvalues from the LFMLC grade of the NLIE envelope. Each grade has its own set of eigenvalues that are the most strongly correlated with the Beck scores.
All grades except the difference grades contain the third largest eigenvalue as one of the three eigenvalues that is most strongly correlated with the Beck scores. It is interesting that this eigenvalue index does not appear in the difference grades. Since the lowest RMSEs are still close to baseline, it is not appropriate to draw a conclusion about the significance of the third largest eigenvalue.

[Figure 56: scatter plot titled "Predicted vs. Actual Beck Score, NLIE, LFMLC, No Feature Adaptation"; horizontal axis: Beck Score.]

Figure 56: Predicted vs. actual Beck score using the LFMLC grade from the NLIE feature, without feature adaptation.

The same experiments are also performed with feature adaptation. The results from those experiments are shown in Table 6. For this case, the features that lead to the lowest RMSE are the 5 features that result from performing PCA on 110 of the eigenvalues from the correlation matrix. The lowest RMSE is 10.66, more than a full point lower than the baseline of 11.86. The corresponding MAE is 8.49, more than 1.5 points below the baseline value of 10.05. However, the RMSE, MAE, and Spearman correlation achieved by the LFMLC grade are also competitive. Figure 57 shows the predicted versus actual Beck score when PCA on 110 eigenvalues from the FMC grade is performed.

Table 6: Summary of results from the NLIE envelope at each grade, with feature adaptation.

Envelope   Grade   Lowest RMSE   MAE at lowest RMSE   Spearman ρ at lowest RMSE   Spearman p-value at lowest RMSE   Eigenvalue indices at lowest RMSE
NLIE       C       11.28         9.07                 0.349                       0.001                             3, 7
NLIE       F       11.28         9.31                 0.298                       0.005                             3, 9, 12
NLIE       FMC     10.66         8.49                 0.465                       <0.001                            1, 57, 40, ... (a total of 110 eigenvalues are used in PCA)
NLIE       LC      11.44         9.34                 0.377                       <0.001                            3
NLIE       LF      11.41         9.31                 0.299                       0.005                             9
NLIE       LFMLC   10.78         8.67                 0.420                       <0.001                            59, 13, 9, ... (a total of 110 eigenvalues are used in PCA)
The Spearman correlation between the actual and predicted Beck scores in Figure 57 is 0.465 (p<0.001), as listed for the FMC grade in Table 6. This is the strongest correlation observed among the time-domain features, yet it is not as strong as the highest Spearman correlation achieved by the frequency-domain feature with the lowest RMSE.

[Figure 57: scatter plot titled "Predicted vs. Actual Beck Score, NLIE, FMC, PCA on 110 Eigenvalues, With Feature Adaptation"; horizontal axis: Beck Score.]

Figure 57: Predicted vs. actual Beck score using the FMC grade from the NLIE feature, with feature adaptation.

6.7.2 Eigenvalues and Summary Statistic from Hilbert Envelope

The same procedure is performed on the grades derived from the Hilbert envelope. Table 7 summarizes the results. Similar to the NLIE envelope, the lowest RMSE is achieved when the LFMLC grade is used. The RMSE from the Hilbert envelope-derived LFMLC grade is lower than the RMSE from the NLIE envelope-derived LFMLC grade. Here, it is 10.89, approximately a tenth of a point lower than the RMSE from the NLIE envelope-derived LFMLC grade, and almost a point lower than the baseline RMSE of 11.86. This is unexpected because the frequency-domain features from the NLIE-Stockham envelope were able to predict subjects' Beck scores more accurately than the frequency-domain features from the Hilbert-Stockham envelope.

The eigenvalues used to derive the lowest RMSE from the Hilbert LFMLC grade are the 65th, 66th, and 15th largest eigenvalues. It is surprising that the eigenvalues from the LFMLC grade most strongly correlated with the Beck score are much smaller in absolute value than the eigenvalues. The significance of this has yet to be explored.

Table 7: Summary of results from Hilbert envelope at each grade, without feature adaptation.
Envelope   Grade   Lowest RMSE   MAE at lowest RMSE   Spearman ρ at lowest RMSE   Spearman p-value at lowest RMSE   Eigenvalue indices at lowest RMSE
Hilbert    C       11.37         9.35                 0.299                       0.005                             3, 7, 43
Hilbert    F       11.32         9.38                 0.338                       0.001                             3
Hilbert    FMC     11.28         9.29                 0.361                       0.001                             13
Hilbert    LC      11.53         9.32                 0.269                       0.012                             17, 2, 43
Hilbert    LF      11.23         9.21                 0.332                       0.002                             4, 5
Hilbert    LFMLC   10.89         8.72                 0.388                       <0.001                            65, 66, 15

The same experiments are also performed with feature adaptation. The results from those experiments are shown in Table 8.

Table 8: Summary of results from Hilbert envelope at each grade, with feature adaptation.

Envelope   Grade   Lowest RMSE   MAE at lowest RMSE   Spearman ρ at lowest RMSE   Spearman p-value at lowest RMSE   Eigenvalue indices at lowest RMSE
Hilbert    C       11.49         9.46                 0.363                       0.001                             3
Hilbert    F       11.33         9.43                 0.358                       0.001                             3
Hilbert    FMC     11.21         9.13                 0.377                       0.000                             13, 31, 16
Hilbert    LC      11.50         9.40                 0.300                       0.005                             17, 2
Hilbert    LF      11.19         9.15                 0.356                       0.001                             4, 5
Hilbert    LFMLC   10.81         8.72                 0.448                       0.000                             65, 66

The results are very similar to those obtained from the Hilbert envelope without feature adaptation. Again, the grade that best predicts subjects' Beck scores is the LFMLC. Unlike the case without feature adaptation, where three eigenvalues of the cross-correlation matrix are used, only two eigenvalues are used with feature adaptation. The lowest RMSE is 0.08 points lower than without feature adaptation (10.81 versus 10.89), and the MAE is exactly the same.

6.8 Conclusions

Using two types of envelope extraction methods, the Hilbert-Stockham and NLIE-Stockham, we tested seven types of features extracted from the envelopes of the AVEC held vowels: the mean STFT of the envelope, variance of the STFT of the envelope, coefficient of variation of the STFT of the envelope, energy in the low frequency band, difference in energy between two frequency bands, and eigenvalues from the correlation matrices of various grades extracted from the envelopes.
Of all of the features tested, the features that most accurately predicted the subjects' Beck scores are the average STFT magnitude of the NLIE-Stockham envelope reduced by PCA, where frequencies up to 50 Hz are extracted, and the eigenvalues from the correlation matrix of the FMC grade, also computed from the NLIE-Stockham envelope. In the first case, when feature adaptation is performed, the GRF is set to 0.2, and 2 components from PCA are input to the GMM, the RMSE is 10.32, the MAE is 8.46, and the Spearman correlation is 0.487 (p<0.001). This represents decreases in error of approximately 1.5 points for both the RMSE and MAE. When PCA is performed on 110 eigenvalues from the cross-correlation matrix of the FMC grade of the NLIE envelope, the MAE is 8.49, the RMSE is 10.66, and the Spearman correlation is 0.465 (p<0.001). These features are fairly consistent with the hypothesis that subjects with MDD have different modulation patterns in their held vowel than subjects without MDD.

Chapter 7 Conclusions and Future Work

In this thesis, we proposed a model of vocal modulation as a basis for developing biomarkers of neurological disease and, in particular, Major Depressive Disorder (MDD). The modulation model was developed in the context of a sustained vowel, assuming that two components contribute to amplitude modulation (AM): AM from the respiratory muscles and AM from interaction between formants and the FM from the fundamental frequency harmonics, i.e., from a mapping of FM to AM. This model was motivated by the perceptual task of Chapter 3, the hypothesis of motor incoordination in MDD [12], and the finding that spectrotemporal information at low frequencies (up to 25 Hz) can be used to predict MDD severity [29]. The two AM components were represented in the model as multiplicative contributions to the speech signal's envelope.
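As a brief illustration of why a multiplicative envelope model pairs naturally with a logarithmic envelope, suppose the envelope e[n] is the product of a respiratory component a_R[n] and a formant-harmonic interaction component a_F[n]. These symbols are assumed here purely for illustration and are not the thesis's own notation.

```latex
e[n] = a_R[n]\, a_F[n], \qquad a_R[n] > 0,\quad a_F[n] > 0
\quad\Longrightarrow\quad
\log e[n] = \log a_R[n] + \log a_F[n]
```

Taking the logarithm turns the product into a sum, so if the two components occupy different modulation-frequency ranges, their contributions add in the spectrum of the log-envelope and can, in principle, be examined band by band.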
We explored the separability of the modulation contributions by implementing three envelope extraction techniques: (1) Stockham's method, where the logarithm of the magnitude of the signal is extracted [39]; (2) computing the logarithm of the magnitude of the Hilbert envelope, referred to as the Hilbert-Stockham envelope; and (3) a nonlinear, iterative envelope (NLIE) estimation method [43], combined with the Stockham approach, referred to as the NLIE-Stockham method. We found that the Hilbert-Stockham and the NLIE-Stockham estimation methods enable improved separability compared to the Stockham envelope.

With these envelope estimation approaches as a basis, we derived frequency-domain and time-domain features from bandpass-filtered speech signals, and predicted the subjects' Beck scores using a GMM. Bandpass filters were centered around the 3rd formant to accentuate the envelope contribution from the fundamental frequency FM. The frequency-domain features were the following: the average STFT magnitude of the logarithm of the envelope, the variance of the STFT magnitude of the logarithm of the envelope, the coefficient of variation of the STFT magnitude of the logarithm of the envelope, the unnormalized energy in a low-frequency band, and the difference in the energy in two frequency bands. The time-domain features were the eigenvalues of the cross-correlation matrix of the envelope over five time segments.

For the frequency-domain features, the most accurate Beck score prediction gave a decrease of 1.54 points from baseline (from 11.86 to 10.32) in the RMSE, and a decrease of 1.59 (from 10.05 to 8.46) in the MAE. The corresponding Spearman correlation between the predicted Beck score and actual Beck score was 0.487 (p<0.001). We accomplished this by performing PCA on the average STFT magnitude of the NLIE-Stockham envelope, reducing the dimensionality to 2 components, and performing feature adaptation.
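The three evaluation measures quoted throughout (RMSE, MAE, and Spearman correlation) can be computed as in the sketch below. It is a minimal illustration, not the thesis code: the function names are hypothetical, ties are handled with average ranks, and the "baseline" is assumed here to be a predictor that always outputs the mean of the actual scores, which this section does not spell out. Any scores passed in for experimentation are the reader's own; none are taken from the thesis data.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient (undefined for constant input)."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_baseline(actual):
    """Assumed baseline: predict the mean actual score for every subject."""
    mean_score = sum(actual) / len(actual)
    return [mean_score] * len(actual)
```

Comparing `rmse(actual, predicted)` with `rmse(actual, mean_baseline(actual))` gives the kind of "points below baseline" figure reported above, under the stated assumption about the baseline.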
For the time-domain features, the most accurate Beck score prediction gave a decrease of 1.20 points from baseline (from 11.86 to 10.66) in the RMSE, and a decrease of 1.56 (from 10.05 to 8.49) in the MAE. The Spearman correlation between the actual Beck score and predicted Beck score was 0.465 (p<0.001). The time-domain features that produced these results were obtained by pre-processing the acoustic signal, creating a sampled correlation matrix, computing the eigenvalues of the matrix, performing PCA on 110 of the eigenvalues to yield 5 features, and performing feature adaptation, as described in Sections 5.4 and 6.7. Together, the features are fairly consistent with the hypothesis that the modulation patterns of the sustained vowels uttered by subjects with MDD are different from those uttered by subjects without MDD.

The thesis modeling and prediction methodologies provide a foundation for future work. This includes improvement to the underlying model, implementation of the model, the pre-processing methods, and the feature extraction methods. Application to other neurological disorders such as ALS, Parkinson's disease, and early dementia is another rich area, as is further investigation of other MDD speech types and conditions, such as running speech with more emotional content or under fatigue.

7.1 Improvement to the Underlying Model

The assumptions underlying the model are overly simplistic. These include: (1) the frequency of the AM from the respiratory muscles is lower than that of the AM due to the interaction between the harmonics of the fundamental and the formants, (2) there is no relationship between amplitude and change in fundamental frequency, and (3) the frequency and bandwidth of the formant remain constant through the duration of the vowel. This section describes the limitations in each of these assumptions.
A literature review did not reveal modulation frequencies associated with the muscles of respiration during a held vowel, but it was assumed that such frequencies are lower than the frequencies at which the harmonics and formants interact. To verify this assumption, respiration modulation frequencies during a held vowel would need to be measured, and the bandwidth, shape, and magnitude of the formants would need to be known. Further, the coordination of the various muscles of respiration would need to be measured. We assumed that the incoordination would occur between the muscles of respiration and the interaction between the harmonics of the pitch and the formants, but another possible source of incoordination is within the muscles of respiration themselves.

To further improve our model, we should add the possibility that the vocal folds can introduce AM as well as FM; currently we assume FM only, as revealed in pitch modulation. Moreover, we may want to exploit a possible relationship between the amplitude and fundamental frequency modulation in this production component, as described by Titze [18].

The model also assumes that the formants are held constant throughout the duration of the vowel. When we viewed the spectrograms, this appeared to be mostly true, although there were some instances where the formants appeared to move by approximately 100 Hz. A small movement in the frequency location or bandwidth of the formant can cause different modulation patterns as the harmonics of the fundamental frequency move through the formant.

7.2 Implementation of the Model

The implementation of the model could be improved by introducing a time-varying depth of modulation and frequency of modulation of both the AM and FM to represent a more erratic modulation condition, and by modeling each opening/closure of the vocal folds as a glottal pulse instead of an impulse.
We have hypothesized in this thesis that the depressed voice is characterized by erratic modulation, yet in our model, we implemented a single, time-invariant depth and frequency of modulation for both AM and FM. In addition, modeling each opening/closure of the vocal folds as a glottal pulse instead of an impulse will introduce low-frequency weighting. This time-varying low-frequency weighting might interfere with the frequencies at which the AM and FM occur, therefore complicating the challenge of identifying the frequencies in the AM and FM.

7.3 Envelope Extraction

Three methods of envelope extraction were explored in this thesis: Stockham, Hilbert-Stockham, and NLIE-Stockham. A fourth method of extracting the envelope that could have been performed is bandpassing the Hilbert envelope a second time, passing only the frequencies at which the fundamental frequency is expected to occur. The Hilbert transform would then be performed a second time, and the magnitude and logarithm of that envelope taken. This might offer an improvement over the Hilbert-Stockham method because the Hilbert-Stockham envelope also has a clear envelope component. Thus, band-passing the Hilbert envelope and computing the Hilbert transform envelope a second time might improve the envelope extraction. Alternatively, a fifth method that uses a novel non-linear demodulation algorithm based on complex optimization, and allows different temporal resolutions, should be considered [47].

7.4 Pre-Processing the Envelopes

Due to limited signal duration, the STFT of each envelope was computed over 1-second windows, shifted at half-second delays over the middle 3 seconds of each speech waveform. Other window lengths could be tested when computing the STFT. In addition, we might focus our processing at different formants and, more generally, different bands. Using multiple bands and later fusing results may lead to more robust estimators.
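The windowing described above (1-second windows, half-second shifts, middle 3 seconds) can be sketched as a frame-index computation. This is only an illustration: the function name, the 8 kHz sampling rate, and the 5-second signal length are assumptions made for the example, not values taken from the thesis.

```python
def stft_frames(n_samples, fs, span_sec=3.0, win_sec=1.0, hop_sec=0.5):
    """Return (start, stop) sample indices of the analysis windows that
    tile the middle `span_sec` seconds of a signal of length n_samples."""
    span = int(round(span_sec * fs))
    win = int(round(win_sec * fs))
    hop = int(round(hop_sec * fs))
    if n_samples < span:
        raise ValueError("signal is shorter than the analysis span")
    start = (n_samples - span) // 2          # left edge of the middle span
    frames = []
    pos = start
    while pos + win <= start + span:         # keep only complete windows
        frames.append((pos, pos + win))
        pos += hop
    return frames

# Hypothetical 5-second vowel sampled at 8 kHz.
fs = 8000
frames = stft_frames(5 * fs, fs)
print(len(frames), frames[0], frames[-1])
```

With these settings the middle 3 seconds yield five overlapping windows, so testing other window lengths, as suggested above, amounts to changing `win_sec` and `hop_sec`.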
The frequency-domain features in this thesis mainly explored the middle three seconds. In the future, we could also explore the features extracted from the envelopes of the onset and offset of each vowel. It is possible that psychomotor retardation and/or psychomotor agitation may affect the rise time of the vowel's envelope, and the time constant of the offset, assuming the offset of the envelope is roughly exponential.

7.5 Features

Additional features that could be tested involve using the features extracted from the Multi-Dimensional Voice Program (MDVP), applying the cross-correlation and covariance features to the time-domain envelope after application of a gammatone filter bank to the envelope, applying the cross-correlation and covariance features to the envelope spectra in the frequency domain, and developing features that relate the modulation in the envelope to the DC component of the envelope.

As described in Chapter 2, MDVP outputs features that quantify the depths and rates of modulation. These features could be directly input to the GMM. Alternatively, the features from MDVP could be extracted at different times throughout the waveform, and the relationships among those features at different times could be explored.

Applying a gammatone filter bank to the original signal or to its envelope (thus further generalizing the filtering described in Section 7.3), and computing the cross-correlation and covariance features, would be similar to the process used by Williamson et al. [12] on the formants. However, the features extracted from the gammatone filter bank would reveal information about the envelopes instead of the formants.

The eigenvalues of the correlation and covariance matrices of the spectrum of the envelopes may also be features that differentiate depressed from control subjects.
The difference between these features and the features described in Sections 5.4 and 6.7 is that the cross-correlations and covariances would be performed on signals in the frequency domain.

Another class of features that could be computed relates the modulation to the DC component. In this thesis, we removed the DC component when computing the STFT because the spectral sidelobes of the component at DC were too large relative to the low-frequency components of the envelope. By measuring the ratio of the energy of the low-frequency components to the energy at DC, we would create another class of features.

Appendix A: Subjectively Rating Vocal Modulation

Members of MIT Lincoln Laboratory were asked to rate the amount of vocal modulation in 25 held /a/ vowels. For the listening task, raters listened to each of the waveforms. Before they began the test, they were provided aurally with examples of significant vocal modulation and little/no modulation. They were told to rate each waveform using the following rating scheme:

1 - very little/no vocal modulation
2 - mild/moderate vocal modulation
3 - moderate vocal modulation
4 - moderate/severe vocal modulation
5 - severe vocal modulation

The objective of the second task was to visually rate the presence of sub-harmonics from spectrograms. As an example, Figure 58 contains two waveforms: one with sub-harmonics, and one without sub-harmonics.

[Figure 58 panels: spectrograms from 0 to 1400 Hz versus time (sec); top panel labeled "Distinct sub-harmonics", bottom panel labeled "No visible sub-harmonics".]

Figure 58: Sub-harmonics in a held vowel. Top spectrogram: /a/ vowel with a region with faint sub-harmonics, and two regions where there are clearly sub-harmonics. Bottom spectrogram: /a/ vowel with no sub-harmonics present.

The raters were instructed to avoid listening to the waveforms from the AVEC database.
The rating scale used was:

1 - no sub-harmonics present
2 - sub-harmonics appear once and last < 0.15 seconds/unclear sub-harmonics
3 - sub-harmonics appear once and last > 0.15 seconds
4 - clear sub-harmonics appear 2 or 3 times and last > 0.15 seconds
5 - clear sub-harmonics appear more than 3 times and occur throughout the spectrogram

The third task consisted of rating the amount of FM in each of the waveforms while viewing 7-10 harmonics on the spectrogram. Examples are illustrated in Figure 59.

[Figure 59 panels: spectrograms versus time (sec); top panel labeled "Significant amount of f_0 modulation", bottom panel labeled "Relatively flat f_0 modulation".]

Figure 59: FM in a held vowel. Top spectrogram: /a/ vowel with region containing significant amounts of FM. Bottom spectrogram: /a/ vowel with little FM.

The rating scale was:

1 - nearly constant frequency
2 - mild FM
3 - moderate FM
4 - moderate/severe FM
5 - severe FM

Finally, the raters evaluated the AM in each of the waveforms by viewing the waveforms in the time domain.

[Figure 60 panels: time-domain waveforms versus time (sec); top panel labeled "Relatively little AM", bottom panel labeled "Significant amount of AM".]

Figure 60: AM in a held vowel. Top panel: /a/ vowel with relatively little AM. Bottom panel: /a/ vowel with a significant amount of AM.

Figure 60 illustrates the range of AM seen in the waveforms.
The rating scale used was:

1 - nearly constant amplitude
2 - mild AM
3 - moderate AM
4 - moderate/severe AM
5 - severe AM

Appendix B: Beck Depression Inventory

The self-reported Beck Depression Inventory rates the following symptoms of depression:

1) Sadness
2) Pessimism
3) Past failure
4) Loss of pleasure
5) Guilty feelings
6) Punishment feelings
7) Self-dislike
8) Self-criticalness
9) Suicidal thoughts
10) Crying
11) Agitation
12) Loss of interest
13) Indecisiveness
14) Worthlessness
15) Loss of energy
16) Change in sleeping
17) Irritability
18) Change in appetite
19) Concentration difficulty
20) Tiredness or fatigue
21) Loss of interest in sex

Each symptom is rated on a scale between 0 and 3, and then all 21 scores are summed. Thus, each subject's depression is rated on a scale between 0 and 63 [48][49].

Appendix C: Derivation of Equations for AM, FM, and FM-and-AM

In Figure 4, the AM- and FM-modulated pulse train from the glottis, p_{AF}[n], is the product of the AM envelope and the cosine of a function of the FM, summed over all harmonics of f_0. In other words, p_{AF}[n] can be expressed as:

p_{AF}[n] = e_A[n] \sum_{k=1}^{K} \cos(\phi_k[n])    (25)

where K is the number of harmonics, k is the index of the harmonic, e_A[n] is the envelope of the AM, described in Chapter 4, and \phi_k[n] is a phase function of the FM signal, described later in this appendix.

As discussed in Chapter 4, the AM envelope, e_A[n], is assumed to originate from the muscles of respiration. It shapes the harmonics of f_0, which originate from the opening and closing of the vocal folds. The equation for e_A[n] is assumed to be sinusoidal:

e_A[n] = \frac{1}{2} + \frac{a_a}{2} \cos(2\pi f_a n / f_s)

where a_a is the AM depth, f_a is the AM rate, and f_s is the sampling rate. It assumes the muscles of respiration control both the AM extent and AM rate. If there is AM but no FM in the source signal, the model appears as shown in Figure 61.

Figure 61: Model with AM-only input signal. e_A[n] is the AM envelope that shapes the harmonics from the harmonic synthesizer.
The output from the harmonic synthesizer is an AM pulse train, denoted p_A[n], which is sent through the vocal tract, H_v(z). The output from the vocal tract is x_A[n], an AM signal. If the depth of AM, a_a, is set to 0.2 and the frequency of the AM, f_a, is 4 Hz, e_A[n] appears as shown in Figure 62. The depth of modulation is constrained to 0 < a_a \le 1.

[Figure 62 plot: AM envelope e_A[n] versus time (sec) over 3 seconds; f_0 = 200 Hz, a_a = 0.2, f_a = 4 Hz.]

Figure 62: AM envelope, e_A[n], when a_a = 0.2 and f_a = 4 Hz.

The AM envelope, e_A[n], is passed through a harmonic synthesizer, which is modeled as a sum of sinusoids [30]. The output of the harmonic synthesizer when only AM is present, p_A[n], represents the opening and closing of the vocal folds shaped by the AM envelope, and approximates a series of impulses. The equation for p_A[n] is given in Eq. 26:

p_A[n] = e_A[n] \sum_{k=1}^{K} \cos(2\pi k (f_0 / f_s) n)    (26)

where f_s is the sampling rate. Figure 63 displays plots of p_A[n] over two timescales. In those plots, the AM depth and rate are 0.2 and 4 Hz, respectively, as example values.

[Figure 63 panels: p_A[n] versus time (sec) with f_0 = 200 Hz, a_a = 0.2, f_a = 4 Hz; upper panel displayed over 3 sec, lower panel displayed over 0.2 sec.]

Figure 63: AM from respiratory muscles shaping glottal impulses. Upper plot: AM signal over a 3-second signal. Lower plot: 0.2 seconds of the signal, showing each of the impulses.

The upper plot of Figure 63 displays p_A[n] over a 3-second signal. It is difficult to discern the individual impulses because they are spaced only 5 ms apart (the pitch period at f_0 = 200 Hz). However, the shaping of the impulses at a rate of 4 Hz is visible. The lower plot of Figure 63 is a zoomed-in view of p_A[n], from 0.1 seconds to 0.3 seconds. The AM shaping of the impulses is clearly visible.
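The AM-only pulse train of Eq. 26 can be sketched in a few lines. Several details here are assumptions made for illustration rather than values confirmed by the thesis: the raised-cosine form of the envelope e_A[n], the sampling rate of 8 kHz, and the choice of K = 10 harmonics.

```python
import math

def am_envelope(n, fs, a_a=0.2, f_a=4.0):
    """Assumed sinusoidal AM envelope with depth a_a and rate f_a (Hz)."""
    return 0.5 + 0.5 * a_a * math.cos(2 * math.pi * f_a * n / fs)

def am_pulse_train(n, fs, f0=200.0, K=10, a_a=0.2, f_a=4.0):
    """Eq. 26: envelope times a sum of K harmonics of f0."""
    harmonics = sum(math.cos(2 * math.pi * k * (f0 / fs) * n)
                    for k in range(1, K + 1))
    return am_envelope(n, fs, a_a, f_a) * harmonics

fs = 8000                      # hypothetical sampling rate
period = int(fs / 200.0)       # 40 samples, i.e. 5 ms between impulses

# The harmonics add coherently once per pitch period and cancel in between,
# which is why the sum approximates a train of impulses.
print(am_pulse_train(0, fs), am_pulse_train(period // 2, fs), am_pulse_train(period, fs))
```

The slow 4 Hz envelope only scales the height of successive impulses, which matches the description of the upper and lower plots of Figure 63.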
When p_A[n] is passed through the vocal tract transfer function, bandpass filtered to isolate formant 3 (F3), and Hilbert transformed, as depicted in the bottom branch of Figure 17, the resulting signals appear as shown in Figure 64.

[Figure 64 panels, each plotted against time (sec) over the entire 3-second waveform (left column) and over 0.2 sec (right column): p_A[n] with f_0 = 200 Hz, a_a = 0.2, f_a = 4 Hz; x_A[n]; b_A[n]; and b_A[n] with y_A[n].]

Figure 64: First row: p_A[n], the AM source signal. Second row: x_A[n], the output from the vocal tract when p_A[n] is the input. Third row: b_A[n], the F3-bandpass-filtered waveform when x_A[n] is the input. Fourth row: y_A[n], the Hilbert transform of b_A[n]. The left column shows each of the entire 3-second waveforms; the right column shows 0.2 seconds of the waveform to depict activity in higher frequencies. The first three formant frequencies are 820 Hz, 1220 Hz, and 2810 Hz, with bandwidths of 125 Hz, 125 Hz, and 250 Hz, respectively. The center frequency of the bandpass filter is 2810 Hz and the bandwidth is 250 Hz.

The first row illustrates p_A[n], and is identical to the graphs in Figure 63. The output of the model, x_A[n], occurs when p_A[n] is input to the vocal tract. In this case, f_0 is 200 Hz, and the three formants are 820 Hz, 1220 Hz, and 2810 Hz, forming the /a/ vowel. The bandwidths are 125, 125, and 250 Hz for formants 1-3, respectively. The second row in Figure 64 displays the output from the vocal tract, x_A[n], on the same two timescales. Over the three-second signal, only the slow AM from the respiratory muscles is visible. When zoomed into a 0.2-second segment of the synthetic vowel, higher frequency components are visible.
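The vocal tract used above is a cascade of three formant resonators, detailed in Appendix D. The sketch below computes the resonator coefficients for the stated /a/ formants and evaluates the cascade's magnitude response. It assumes the standard Klatt resonator form [50], including a negative exponent in the expression for alpha; the sampling rate of 8 kHz and the function names are hypothetical.

```python
import cmath
import math

def resonator_coeffs(F, B, fs):
    """Second-order resonator coefficients for formant frequency F (Hz)
    and 3-dB bandwidth B (Hz), following the Klatt-style form."""
    gamma = -math.exp(-2 * math.pi * B / fs)
    alpha = 2 * math.exp(-math.pi * B / fs) * math.cos(2 * math.pi * F / fs)
    beta = 1 - alpha - gamma            # normalizes each resonator to unity gain at DC
    return beta, alpha, gamma

def vocal_tract_gain(f, formants, fs):
    """Magnitude of the cascaded resonators at frequency f (Hz)."""
    z_inv = cmath.exp(-2j * math.pi * f / fs)
    h = 1 + 0j
    for F, B in formants:
        beta, alpha, gamma = resonator_coeffs(F, B, fs)
        h *= beta / (1 - alpha * z_inv - gamma * z_inv ** 2)
    return abs(h)

fs = 8000                                        # hypothetical sampling rate
formants = [(820, 125), (1220, 125), (2810, 250)]  # /a/ vowel values from the text

for f in (0, 820, 2000, 2810):
    print(f, round(vocal_tract_gain(f, formants, fs), 3))
```

Because the third resonator contributes far less gain than the first two near their peaks, the F3-filtered signal b_A[n] is much smaller than x_A[n], consistent with the description of Figure 64.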
x_A[n] is then input to a bandpass filter that passes only F3. The result, b_A[n], is shown over the two timescales in the third row of Figure 64. It is much smaller in amplitude than x_A[n] because there is less energy in the third formant than in the first two formants. Finally, the Hilbert transform of b_A[n] is taken, and the result, y_A[n], is displayed in the bottom row of Figure 64. The period of the Hilbert transform approximately traces each pitch period.

Figure 65 illustrates the block diagram for the FM-only model. The extent, rate, and center frequency are specified to generate the FM signal, which represents the instantaneous frequency. Harmonics of that signal are generated and added together.

Figure 65: FM-only model. \phi_k[n] represents the phase of the FM and is described later in this appendix.

The following equation describes vibrato, which is the frequency modulation in the signal. The source of the FM described in this appendix is assumed to be the laryngeal muscles. Letting r[n] be the vibrato as a function of time, a_f be the extent of the frequency modulation (or FM index), and f_f be the rate of the vibrato, r[n] can be expressed as:

r[n] = a_f \cos(2\pi f_f n / f_s)    (27)

The instantaneous frequency for the kth harmonic, \Omega_k[n], can be expressed as:

\Omega_k[n] = k \Omega_0[n]    (28)

where \Omega_0[n] is the instantaneous frequency of the fundamental. The instantaneous frequency is a function of r[n]; it is the fundamental frequency offset by r[n]:

\Omega_0[n] = f_0 + r[n]    (29)

Substituting Eq. 27 into Eq. 29, the following expression for \Omega_k[n] is obtained:

\Omega_k[n] = k (f_0 + a_f \cos(2\pi f_f n / f_s))    (30)

The phase \phi_k[n], in radians, is the integral of the frequency of the kth harmonic, scaled by 2\pi / f_s to convert from Hz per sample to radians (modified from [30]):

\phi_k[n] = \frac{2\pi}{f_s} \int_0^n \Omega_k[\sigma] \, d\sigma    (31)

Substituting Eq. 30 into Eq. 31, Eq. 32 is obtained:

\phi_k[n] = \frac{2\pi}{f_s} \int_0^n (k f_0 + a_f k \cos(2\pi f_f \sigma / f_s)) \, d\sigma    (32)

Therefore,

\phi_k[n] = \frac{2\pi k f_0 n}{f_s} + \frac{a_f k}{f_f} \sin(2\pi f_f n / f_s)    (33)

When the cosine of \phi_k[n] is summed over all harmonics, the pure-FM output from the laryngeal muscles, p_F[n], is obtained:

p_F[n] = \sum_{k=1}^{K} \cos\left(2\pi k f_0 n / f_s + \frac{a_f k}{f_f} \sin(2\pi f_f n / f_s)\right)    (34)

The 3-second harmonic FM signal generated using f_0 = 200 Hz, a_f = 10 Hz, and f_f = 7 cycles/sec is shown in Figure 66. The bottom panel of Figure 66 shows the same signal, but zoomed into a range of times from 0.1 seconds to 0.3 seconds.

[Figure 66 panels: p_F[n] versus time (sec) with f_0 = 200 Hz, a_f = 10, f_f = 7; top panel displayed over 3 sec, bottom panel displayed over 0.2 sec.]

Figure 66: p_F[n] over two timescales. Top: p_F[n] over 3 seconds. Bottom: p_F[n] between times 0.1 and 0.3 seconds.

In the bottom panel of Figure 66, the changes in frequency are visually imperceptible; the purpose is to show that a series of impulses is obtained. Figure 67 illustrates the instantaneous frequency of the same 2-second FM signal, as well as the spectrogram of the signal.

[Figure 67 panels: top, instantaneous frequency of p_F[n] (f_0 = 200 Hz, a_f = 10, f_f = 7) versus time (sec); bottom, spectrogram of p_F[n] versus time (sec).]

Figure 67: Top plot: instantaneous frequency of p_F[n]. Bottom plot: spectrogram of p_F[n] from 0 to 2000 Hz.

The upper plot of Figure 67 shows the instantaneous frequency of the fundamental frequency of the signal created when the center fundamental frequency is 200 Hz, the frequency index (a_f) is 10 Hz, the rate of frequency change (f_f) is 7 cycles/sec, and the signal is 2 seconds long. The lower plot is the spectrogram of the FM-only signal with the same parameters, but after the first 10 harmonics are included. As k increases, the extent of the frequency also increases, but the rate at which the frequency changes remains constant at 7 cycles/sec.
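The FM derivation can be checked numerically. The sketch below compares a closed-form phase against a trapezoidal integration of the instantaneous frequency and evaluates the frequency excursion of the 10th harmonic. It assumes the phase is expressed in radians (the integral of the instantaneous frequency in Hz scaled by 2*pi/fs) and uses a hypothetical sampling rate of 8 kHz; the function names are illustrative.

```python
import math

def inst_freq(n, k, fs, f0=200.0, a_f=10.0, f_f=7.0):
    """Instantaneous frequency (Hz) of the k-th harmonic, as in Eq. 30."""
    return k * (f0 + a_f * math.cos(2 * math.pi * f_f * n / fs))

def phase_closed_form(n, k, fs, f0=200.0, a_f=10.0, f_f=7.0):
    """Phase in radians obtained by integrating inst_freq analytically."""
    return (2 * math.pi * k * f0 * n / fs
            + (a_f * k / f_f) * math.sin(2 * math.pi * f_f * n / fs))

def phase_numeric(n, k, fs):
    """Same phase via trapezoidal integration, one step per sample."""
    total = 0.0
    for m in range(n):
        total += 0.5 * (inst_freq(m, k, fs) + inst_freq(m + 1, k, fs))
    return 2 * math.pi * total / fs

fs = 8000                                   # hypothetical sampling rate
n = 800                                     # 0.1 seconds
print(phase_closed_form(n, 1, fs), phase_numeric(n, 1, fs))

# Frequency range swept by the 10th harmonic over one second.
sweep = [inst_freq(m, 10, fs) for m in range(fs)]
print(max(sweep), min(sweep))
```

The 10th harmonic swings roughly 100 Hz on either side of 2 kHz while still completing 7 cycles per second, which illustrates that the FM extent scales with k but the FM rate does not.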
The harmonic centered at 2 kHz reaches a maximum frequency of 2100 Hz and a minimum frequency of 1900 Hz, with an extent of 100 Hz.

When p_F[n] is passed through the vocal tract, bandpass filtered to formant 3, and Hilbert transformed, as depicted in the top branch of Figure 17, the resulting signals appear as shown in Figure 68.

[Figure 68 panels, each plotted against time (sec) over the entire 3-second waveform (left column) and over 0.2 sec (right column): p_F[n] with f_0 = 200 Hz, a_f = 10, f_f = 7; x_F[n]; b_F[n]; and b_F[n] with y_F[n].]

Figure 68: First row: p_F[n], the FM source signal. Second row: x_F[n], the output from the vocal tract when p_F[n] is the input. Third row: b_F[n], the formant 3-bandpass-filtered waveform when x_F[n] is the input. Fourth row: y_F[n], the Hilbert transform of b_F[n], plotted on top of b_F[n]. The left column shows each of the entire 3-second waveforms; the right column shows 0.2 seconds of the waveform to depict activity in higher frequencies. The formants are 820 Hz, 1220 Hz, and 2810 Hz, with bandwidths of 125 Hz, 125 Hz, and 250 Hz. The bandpass filter is set to formant 3, with a center frequency of 2810 Hz and a bandwidth of 250 Hz.

The top plots illustrate p_F[n], which is identical to the graphs in Figure 66. The output of the model, x_F[n], occurs when p_F[n] is put through the vocal tract. In this case, the fundamental frequency is 200 Hz (for a female speaker), and the three formants are 820 Hz, 1220 Hz, and 2810 Hz, forming the /a/ vowel. The bandwidths are 125, 125, and 250 Hz for formants 1-3, respectively. The second row in Figure 68 shows x_F[n] on the same two timescales. Although there was no AM as an input to the model, it appears as though there is AM!
This occurs because the harmonics of the fundamental interact with the formants, as discussed in Chapter 4. When zoomed into a 0.2-second segment of the synthetic vowel, higher frequency components are visible, but the AM from the interaction between the harmonics and the formants is still present. The FM-only output x_F[n] is then passed through a bandpass filter that passes only formant 3. The result, b_F[n], is shown over the two timescales in the third row of Figure 68. There is still AM due to FM present, although the shape of the envelope appears to be different. This is because there is only one formant with which the harmonics can interact, instead of three. Finally, the Hilbert transform of b_F[n] is taken, and the result, y_F[n], is shown in the bottom row of Figure 68. The period of the Hilbert transform approximately traces out the fundamental frequency, and the envelope is still present.

Appendix D: Model of the Vocal Tract

The vocal tract transfer function, H_v(z), shapes the pulse train p_{AF}[n] by utilizing three all-pole filters in series, as described in [50], one for each of the three formants. Letting i be the index of the formant, the equation for H_v(z) is given in Eq. 35 (modified from [50]):

H_v(z) = \prod_{i=1}^{3} \frac{\beta(i)}{1 - \alpha(i) z^{-1} - \gamma(i) z^{-2}}    (35)

In Eq. 35, \beta(i), \alpha(i), and \gamma(i) are given by the following, where B(i) is the 3-dB bandwidth of formant i and F(i) is the formant frequency of formant i (also modified from [50]):

\gamma(i) = -\exp(-2\pi B(i) / f_s)
\alpha(i) = 2 \cos(2\pi F(i) / f_s) \exp(-\pi B(i) / f_s)
\beta(i) = 1 - \alpha(i) - \gamma(i)

Bibliography

[1] J. Sundberg, "Acoustic and psychoacoustic aspects of vocal vibrato," Vibrato, pp. 35-62, 1995.
[2] J. Kreiman, B. Gabelman, and B. R. Gerratt, "Perception of vocal tremor," J. Speech Lang. Hear. Res., vol. 46, pp. 203-214, Feb. 2003.
[3] L. A. Ramig and T. Shipp, "Comparative measures of vocal tremor and vocal vibrato," J. Voice, vol. 1, no. 2, pp. 162-167, 1987.
[4] M. Bruckl, "Vocal Tremor Measurement Based on Autocorrelation of Contours," in 13th Annual Conference of the International Speech Communication Association, Portland, OR, 2012, pp. 715-718.
[5] D. J. Kupfer, E. Frank, and M. L. Phillips, "Major Depressive Disorder: New Clinical, Neurobiological, and Treatment Perspectives," The Lancet, vol. 379, no. 9820, pp. 1045-1055, Mar. 2012.
[6] S. S. Newman and V. G. Mather, "Analysis of Spoken Language of Patients with Affective Disorder," Am. J. Psychiatry, vol. 94, pp. 913-942, 1938.
[7] A. Nilsonne, J. Sundberg, S. Ternstrom, and A. Askenfelt, "Measuring the Rate of Change of Voice Fundamental Frequency in Fluent Speech During Mental Depression," J. Acoust. Soc. Am., vol. 83, no. 2, pp. 716-728, 1988.
[8] J. C. Mundt, "Voice Acoustic Measures of Depression Severity and Treatment Response Collected Via Interactive Voice Response (IVR) Technology," J. Neurolinguistics, vol. 20, no. 1, pp. 50-64, Jan. 2007.
[9] A. C. Trevino, T. F. Quatieri, and N. Malyska, "Phonologically-based biomarkers for major depressive disorder," EURASIP J. Adv. Signal Process., vol. 2011, no. 1, pp. 1-18, 2011.
[10] E. Moore, M. Clements, J. Peifer, and L. Weisser, "Analysis of Prosodic Variation in Speech for Clinical Depression," in Engineering in Medicine and Biology Society, 2003. Proceedings of the 25th Annual International Conference of the IEEE, Cancun, Mexico, 2003, vol. 3, pp. 2925-2928.
[11] R. Horwitz, T. F. Quatieri, B. S. Helfer, B. Yu, J. R. Williamson, and J. C. Mundt, "On the Relative Importance of Vocal Source, System, and Prosody in Human Depression," presented at the 2013 IEEE International Conference on Body Sensor Networks, Cambridge, MA, 2013, pp. 1-6.
[12] J. R. Williamson, T. F. Quatieri, B. S. Helfer, R. Horwitz, B. Yu, and D. D. Mehta, "Vocal biomarkers of depression based on motor incoordination," in Proceedings of the 3rd ACM International Workshop on Audio/Visual Emotion Challenge, 2013, pp. 41-48.
[13] N. Cummins, J. Epps, V. Sethu, M. Breakspear, and R. Goecke, "Modeling spectral variability for the classification of depressed speech," in Interspeech, Lyon, France, 2013, pp. 857-861.
[14] J. R. Brown and J. Simonson, "Organic voice tremor," Trans. Am. Neurol. Assoc., no. 87, pp. 179-180, 1962.
[15] J. R. Brown and J. Simonson, "Organic voice tremor: A tremor of phonation," Neurology, vol. 13, pp. 520-525, 1963.
[16] W. S. Winholtz and L. O. Ramig, "Vocal tremor analysis with the vocal demodulator," J. Speech Lang. Hear. Res., vol. 35, no. 3, pp. 562-573, 1992.
[17] L. A. Ramig, I. R. Titze, R. C. Scherer, and S. P. Ringel, "Acoustic analysis of voices of patients with neurologic disease: rationale and preliminary data," Ann. Otol. Rhinol. Laryngol., vol. 97, no. 2 (pt 1), pp. 164-172, Apr. 1988.
[18] I. R. Titze, "On the relation between subglottal pressure and fundamental frequency in phonation," J. Acoust. Soc. Am., vol. 85, no. 2, Feb. 1989.
[19] A. Aronson, W. S. Winholtz, L. O. Ramig, and S. R. Sibler, "Rapid voice tremor, or 'flutter,' in amyotrophic lateral sclerosis," Ann. Otol. Rhinol. Laryngol., vol. 101, pp. 511-518, 1992.
[20] Y. Horii, "Acoustic analysis of vocal vibrato: a theoretical interpretation of data," J. Voice, vol. 3, no. 1, pp. 36-43, 1989.
[21] K. N. Stevens, Acoustic Phonetics. Cambridge, MA: MIT Press, 1998.
[22] R. A. Lester, J. Barkmeier-Kraemer, and B. H. Story, "Physiologic and Acoustic Patterns of Essential Vocal Tremor," J. Voice, vol. 27, no. 4, pp. 422-432, Jul. 2013.
[23] P. MacKinnon and J. Morris, Oxford Textbook of Functional Anatomy: Head and Neck, vol. 3. Oxford University Press, 1990.
[24] R. A. Lester and B. H. Story, "Acoustic characteristics of simulated respiratory-induced vocal tremor," Am. J. Speech Lang. Pathol., vol. 22, pp. 205-211, May 2013.
[25] Multi-Dimensional Voice Program. KayPENTAX.
[26] P. Boersma and D. Weenink, Praat.
[27] Y. Pantazis, M. Koutsogiannaki, and Y. Stylianou, "A novel method for the extraction of vocal tremor," in Models and Analysis of Vocal Emissions for Biomedical Applications: 6th International Workshop, Firenze, Italy, 2009.
[28] Y. Pantazis, O. Rosec, and Y. Stylianou, "Adaptive AM-FM Signal Decomposition With Application to Speech Analysis," IEEE Trans. Audio Speech Lang. Process., vol. 19, no. 2, pp. 290-300, Feb. 2011.
[29] N. Cummins, J. Epps, and E. Ambikairajah, "Spectro-temporal analysis of speech affected by depression and psychomotor retardation," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, Vancouver, Canada, 2013, pp. 7542-7546.
[30] T. F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice. Upper Saddle River, NJ: Prentice Hall, 2002.
[31] A. Ivanov and X. Chen, "Modulation Spectrum Analysis for Speaker Personality Trait Recognition," presented at InterSpeech 2012, Portland, OR, 2012.
[32] M. Valstar, F. Eyben, S. Schnieder, B. Schuller, B. Jiang, R. Cowie, K. Smith, S. Bilakhia, and M. Pantic, "AVEC 2013 - The Continuous Audio/Visual Emotion and Depression Recognition Challenge," in Proceedings of the 3rd ACM International Workshop on Audio/Visual Emotion Challenge, Barcelona, Spain, 2013, pp. 3-10.
[33] I. R. Titze, "The physics of small-amplitude oscillation of the vocal folds," J. Acoust. Soc. Am., vol. 83, no. 4, pp. 1536-1552, Apr. 1988.
[34] I. R. Titze, "Phonation threshold pressure: A missing link in glottal aerodynamics," J. Acoust. Soc. Am., vol. 91, no. 5, pp. 2926-2935, May 1992.
[35] R. L. Plant and R. M. Younger, "The interrelationship of subglottic air pressure, fundamental frequency, and vocal intensity during speech," J. Voice, vol. 14, no. 2, pp. 170-177, 2000.
[36] K. A. Farinella, T. J. Hixon, B. H. Story, and P. J. Jones, "Listener Perception of Respiratory-Induced Vocal Tremor," Am. J. Speech Lang. Pathol., vol. 15, no. 1, pp. 72-84, Feb. 2006.
[37] C. Dromey and M. E. Smith, "Vocal Tremor and Vibrato in the Same Person: Acoustic and Electromyographic Differences," J. Voice, vol. 22, no. 5, pp. 541-545, Sep. 2008.
[38] T. Shipp, E. T. Doherty, and S. Haglund, "Physiologic factors in vocal vibrato production," J. Voice, vol. 4, no. 4, pp. 300-304, 1990.
[39] T. G. Stockham, "The Application of Generalized Linearity to Automatic Gain Control," IEEE Trans. Audio Electroacoustics, vol. AU-16, no. 2, pp. 267-270, Jun. 1968.
[40] T. F. Quatieri and R. J. McAulay, "Audio Signal Processing Based on Sinusoidal Analysis/Synthesis," in Applications of Digital Signal Processing to Audio and Acoustics, Boston: Kluwer Academic Publishers, 1998, pp. 343-413.
[41] N. Malyska, T. F. Quatieri, and D. Sturim, "Automatic Dysphonia Recognition Using Biologically Inspired Amplitude-Modulation Features," presented at the Proceedings of the ICASSP, Prague, 2005, pp. 873-876.
[42] T. F. Quatieri, "Phase Estimation with Application to Speech Analysis-Synthesis," Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, MA, 1979.
[43] A. Robel and X. Rodet, "Efficient Spectral Envelope Estimation and Its Application to Pitch Shifting and Envelope Preservation," in Proceedings of the 8th International Conference on Digital Audio Effects, Madrid, Spain, 2005, pp. DAFX1-DAFX6.
[44] D. D. Mehta, D. Rudoy, and P. J. Wolfe, "Kalman-based autoregressive moving average modeling and inference for formant and antiformant tracking," J. Acoust. Soc. Am., vol. 132, no. 3, pp. 1732-1746, Sep. 2012.
[45] J. R. Williamson, D. W. Bliss, and D. W. Browne, "Epileptic seizure prediction using the spatiotemporal correlation structure of intracranial EEG," in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, 2011, pp. 665-668.
[46] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker Verification Using Adapted Gaussian Mixture Models," Digit. Signal Process., vol. 10, no. 1-3, pp. 19-41, Jan. 2000.
[47] G. Sell and M. Slaney, "Solving demodulation as an optimization problem," Audio Speech Lang. Process. IEEE Trans. On, vol. 18, no. 8, pp. 2051-2066, 2010.
[48] A. T. Beck, R. A. Steer, R. Ball, and W. F. Ranieri, "Comparison of Beck Depression Inventories -IA and -II in Psychiatric Outpatients," J. Pers. Assess., vol. 67, no. 3, pp. 588-597, 1996.
[49] K. L. Smarr and A. L. Keefer, "Measures of depression and depressive symptoms: Beck Depression Inventory-II (BDI-II), Center for Epidemiologic Studies Depression Scale (CES-D), Geriatric Depression Scale (GDS), Hospital Anxiety and Depression Scale (HADS), and Patient Health Questionna," Arthritis Care Res., vol. 63, no. S11, pp. S454-S466, Nov. 2011.
[50] D. Klatt, "Software for a cascade/parallel formant synthesizer," J. Acoust. Soc. Am., vol. 67, no. 3, pp. 971-995, Mar. 1980.
