Acoustic Analysis of Voice: A Tutorial

James M. Hillenbrand
Department of Speech Pathology and Audiology
Western Michigan University
Kalamazoo, MI 49008

Abstract

This tutorial reviews acoustic methods that have been used to characterize vocal function. The most persuasive argument for the use of acoustic measures is that all of the information used by listeners to make judgments about speech is to be found in the acoustic signal. Acoustic methods have been used clinically to differentiate normal from abnormal voices, to aid in differential diagnosis, to evaluate the relative effectiveness of different treatment approaches, and to track progress in voice therapy. The measures discussed here focus on quantifying the degree of periodicity, the shape of the spectrum, and the range of vocal intensity.

Introduction

Fundamental to the logic underlying many acoustic measures of voice is the simple observation that the voice is produced to please the human ear. Listeners make both linguistic and aesthetic judgments about voice signals, and essentially all of the information that controls these judgments is embedded within the acoustic signal. It is also the case that the disturbances in glottal vibratory patterns that are associated with laryngeal pathologies will be reflected in changes in the acoustic signal that is radiated from the lips. Also, as has been noted by many investigators, the voice signal is perhaps the most noninvasive and easily obtained measure of vocal performance. As straightforward as this logic may seem, uncovering the fundamental relationships between acoustic properties and perceived vocal qualities, or the relationships between the characteristics of the acoustic signal and the underlying laryngeal vibratory patterns that shape the sound wave, has not proven to be an especially easy problem.
Goals of Acoustic Analysis

The major clinical applications of acoustic analysis tend to fall into four broad categories: (1) screening, (2) diagnostic support, (3) evaluation of the relative effectiveness of different treatment approaches, and (4) assessment of progress throughout the course of treatment (see Laver, Hiller & Beck, 1992, for discussion and a slightly expanded list of applications). Screening applications are often directed at rapid and inexpensive early detection of serious conditions such as laryngeal carcinoma, although there have also been many attempts to develop screening methods to detect the presence versus absence of laryngeal pathologies as a general category. Many techniques have been described that are capable of discriminating normal from abnormal voices with reasonably good accuracy (e.g., Hadjitodorov & Mitev, 2002), although the usefulness of automating this kind of decision making has been questioned by some investigators (e.g., Hirano et al., 1988). The closely related goal of diagnostic support involves the use of acoustic measurements of voice to differentiate among different pathological conditions. The clinical utility of discriminating among underlying pathological conditions is quite clear; however, this is a considerably more difficult problem than separating normal from abnormal voices, and success in this area has been limited. The use of acoustic measurements to provide objective indices of progress throughout the course of a treatment program, or to compare the relative effectiveness of alternative treatment strategies, has received a good deal of attention in the literature. This is perhaps the single most promising application of acoustic measurements of vocal function.

Signals for Analysis

Airborne Microphone Signal

The simplest and most commonly used input signal is an unprocessed sound wave transduced by a conventional airborne microphone.
The chief advantage of this approach is that this signal is, for all practical purposes, equivalent to the one that reaches the ear. The principal drawback of this simple signal is that the acoustic properties of the waveform reflect contributions not only of the laryngeal source function that is of greatest interest but also the resonance characteristics of the vocal tract.

Contact Microphone Signal

A commonly used alternative is to record the voice signal using a contact microphone placed on the throat. The resulting signal is generally claimed to be less contaminated by vocal tract events. A primary effect of this recording technique is to strongly emphasize the lower frequency components of the voice signal at the expense of upper harmonics in formant regions. This often simplifies technical problems associated with measuring features such as instantaneous fundamental frequency (Schoentgen, 1989; Hirano, 1981).

Glottal Inverse Filtering

A more direct approach to removing the effects of vocal tract filtering involves the use of a pneumotachograph mask and inverse filter to recover an approximation to the glottal volume velocity function (Rothenberg, 1973). The basic idea behind this technique is that if the formant frequencies and bandwidths are known, an inverse filter can be designed to significantly reduce the effects of vocal tract filtering. An approximation to the glottal flow signal is derived by passing the oral flow signal through the inverse filter. There is, however, a certain amount of error involved in estimating formant parameters, and it is not always clear how precisely the glottal flow signal is approximated. The middle panel of Figure 1 shows an example of a glottal flow signal. (See Sondhi (1975) and Javkin, Antonanzas-Barroso & Maddieson (1987) for other approaches to the problem of recovering the glottal flow signal.)
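The principle of cancelling known formants can be illustrated with a small numerical sketch. It is not Rothenberg's procedure: it assumes that each formant behaves as a second-order digital resonance and cancels it with the matching second-order anti-resonance. The function names, sampling rate, and formant values are hypothetical and chosen only for illustration.

```python
import numpy as np

def antiresonator(freq_hz, bw_hz, fs):
    """FIR coefficients of a second-order anti-resonance with unity gain at 0 Hz."""
    r = np.exp(-np.pi * bw_hz / fs)        # zero radius set by the bandwidth
    theta = 2.0 * np.pi * freq_hz / fs     # zero angle set by the formant frequency
    a = np.array([1.0, -2.0 * r * np.cos(theta), r * r])
    return a / a.sum()

def inverse_filter(signal, formants, fs):
    """Cascade one anti-resonator per (frequency, bandwidth) pair."""
    y = np.asarray(signal, dtype=float)
    for freq_hz, bw_hz in formants:
        y = np.convolve(y, antiresonator(freq_hz, bw_hz, fs))[:len(y)]
    return y

def resonator(signal, freq_hz, bw_hz, fs):
    """Matching all-pole resonator, used here only to build a test signal."""
    x = np.asarray(signal, dtype=float)
    b = antiresonator(freq_hz, bw_hz, fs)
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = x[n]
        if n >= 1:
            acc -= b[1] * y[n - 1]
        if n >= 2:
            acc -= b[2] * y[n - 2]
        y[n] = acc / b[0]
    return y

# Hypothetical demonstration: an impulse train stands in for the source.
fs = 10000
source = np.zeros(400)
source[::80] = 1.0                                        # one pulse every 8 ms
formants = [(500.0, 60.0), (1500.0, 90.0), (2500.0, 120.0)]  # illustrative values
filtered = source
for f, b in formants:
    filtered = resonator(filtered, f, b, fs)
recovered = inverse_filter(filtered, formants, fs)
print(np.max(np.abs(recovered - source)))                 # residual error
```

Recovery is essentially exact here only because the same formant values are used for synthesis and for inversion. With formant parameters estimated from real speech, the residual error would reflect the estimation problems noted above.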
LPC Residue Signal

A very different approach to inverse filtering is based on a signal analysis method called linear predictive coding (LPC). If certain simplifying assumptions are made, LPC can be used to separate the source and filter components of a speech signal. When applied to voice signals, LPC is based on the assumption that voiced speech can be modeled as the output of a time-varying filter that is excited by quasi-periodic pulses. The filter represents the combined effects of vocal-tract filtering, lip radiation, and shape of the glottal pulse. LPC uses time-series analysis to produce a sequence of estimates of the frequency response curve of this time-varying filter. The filter shapes that are recovered can then be inverted to produce an estimate of the pulse train that is implicitly assumed to excite the filter. The top and bottom panels of Figure 1 show a speech signal along with the "LPC residue signal," the source waveform derived through the application of an LPC inverse filter. Because of the simplifying assumptions that are made when LPC is applied to speech modeling, the residue signal does not resemble a naturally occurring glottal source waveform. Whether this is an advantage or a drawback depends on the goals of the analysis technique. For example, since the recovered source signal tends to show relatively sharp pulses at the onsets of individual pitch periods, the LPC residue can simplify the measurement of acoustic features that depend on detecting the onsets of individual pitch pulses (see Davis, 1976; 1981).

Speech Material

The most commonly used utterance for acoustic voice analysis is a monotone sustained vowel. The use of sustained vowels makes it much easier to obtain acoustically based estimates of vocal performance that are comparable across a population of talkers who may display a broad range of speaking styles in material such as read text or conversational speech.
It is also a much simpler matter to separate glottal source parameters from those associated with the vocal tract when a relatively fixed vocal tract posture is maintained. The most important limitation of sustained vowels is that these simple utterances are sometimes not sufficiently revealing of the underlying pathological condition. As Schoentgen (1989) noted, speakers are often able to compensate for pathological conditions during constant-pitch phonation but may be unable to do so during more natural utterances involving dynamic adjustments of both laryngeal and vocal tract structures. Although the issue has received little systematic study, several investigators have noted that there are many pathological conditions which tend to produce vocal quality disturbances that are conditioned by the type of speech task (e.g., Hirano, 1981; Sapienza, Walton & Murry, 2000). For example, in a comparison of isolated vowels and connected speech, Klingholtz (1990) reported two findings that are of considerable interest. First, measurements of spectral noise from dysphonic speakers' isolated vowels accounted for less than half of the variance in spectral noise measures from connected speech material spoken by the same talkers. This finding suggests that vocal performance measured from isolated vowels is not an accurate gauge of a speaker's performance in connected speech. Second, the separation of normal from pathological speakers based on the noise measures was four times more accurate when the measures were made from connected speech than from isolated vowels (see also Hillenbrand, Cleveland & Erickson, 1994; Hillenbrand & Houde, 1996).

Acoustic Measures of Vocal Function

An enormous variety of acoustic methods for assessing vocal function have been proposed. In this brief review it will be possible to provide only a limited discussion of some of the more commonly used procedures.
For the purposes of organizing the discussion, the techniques will be grouped somewhat arbitrarily into: (1) periodicity measures, (2) measures of spectral shape, and (3) intensity measures.

Periodicity Measures

As von Leden, Moore and Timke (1960) noted, "... the commonest observation in pathologic conditions is a strong tendency for frequent and rapid changes in the regularity of the vibratory pattern" (p. 34). It is therefore not surprising that the bulk of work on acoustic analysis has focused on the measurement of signal periodicity. While these departures from perfect periodicity may be observed directly with high-speed film or videostroboscopy, a wide range of instrumentally simple, noninvasive methods are available to measure the acoustic consequences of these disturbances in periodicity.

Perturbation Analysis

Pitch Perturbation

There is no single area of acoustic voice analysis that has received more attention or generated more debate than the measurement of vocal perturbation. The point of departure for this work is Lieberman's observation, first with high-speed film (Lieberman, 1961) and later with acoustic measurements (Lieberman, 1963), that pathological voices tend to show unusually large cycle-to-cycle fluctuations in the fundamental period. The phenomenon of cycle-to-cycle fluctuations in the fundamental period is referred to variously as pitch perturbation, fundamental frequency perturbation, or vocal jitter. Despite the simplicity of the underlying concept, a dizzying array of methods have been used to quantify pitch perturbation. The simplest of these is mean jitter, defined as the average absolute difference in fundamental period between adjacent pitch pulses. Since there is a strong correlation between mean jitter and the average fundamental period (e.g., Lieberman, 1961; Koike, 1973), it is common to express jitter as a percentage of the average fundamental period.
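Both quantities can be computed directly from a sequence of fundamental periods. The sketch below assumes that the periods have already been extracted, which is the difficult part in practice. The function names and the period values are hypothetical.

```python
import numpy as np

def mean_jitter(periods):
    """Average absolute difference between adjacent fundamental periods."""
    periods = np.asarray(periods, dtype=float)
    return float(np.mean(np.abs(np.diff(periods))))

def percent_jitter(periods):
    """Mean jitter expressed as a percentage of the mean period."""
    periods = np.asarray(periods, dtype=float)
    return 100.0 * mean_jitter(periods) / float(np.mean(periods))

# Hypothetical period sequence in milliseconds (illustrative values only).
periods_ms = [10.0, 10.2, 10.0, 10.2, 10.0]
print(mean_jitter(periods_ms))      # same units as the input (ms)
print(percent_jitter(periods_ms))   # dimensionless percentage
```

The same two functions give mean shimmer and its percentage form if a sequence of pitch-pulse amplitudes is supplied in place of the periods.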
Percent jitter is defined as mean jitter divided by the mean period, multiplied by 100. However, there are some rather clear indications that this simple method does not, in fact, adequately normalize for differences in average fundamental frequency (see Horii (1979) and Orlikoff and Baken (1990) for a discussion). Another approach that is in common use involves calculating a sequence of differences between the instantaneous fundamental period and a running average of adjacent periods (e.g., Koike, 1973). The purpose of the running average is to minimize the effects of slowly varying changes in fundamental frequency and emphasize the quasi-random, rapid fluctuations initially observed by Lieberman. Numerous other methods have been used to calculate pitch perturbation (see Baken, 1987; Laver et al., 1992; Pinto and Titze, 1990, for reviews), and this wide variation in computational methods is at least partly responsible for the limited agreement across laboratories on a host of issues related to perturbation measures.

Amplitude Perturbation

Amplitude perturbation, or vocal shimmer, is defined as cycle-to-cycle fluctuation in the amplitudes of adjacent pitch pulses. As with jitter, a wide variety of calculation methods have been used. The simplest is mean shimmer, which is simply the average absolute difference in amplitude between adjacent pitch pulses. Methods based on running averages are also in widespread use (e.g., Takahashi & Koike, 1975).

Clinical Relevance of Perturbation Measures

There is no clear consensus on the clinical relevance of perturbation measures. Many studies have confirmed Lieberman's report that perturbation measures distinguish normal from pathological voices with reasonable accuracy, although even on this relatively simple point there is disagreement (e.g., Hecker & Kreul, 1971; Ludlow, Bassich, Connor, Coulter & Lee, 1987; Davis, 1981; Zyski, Bull, McDonald & Johns, 1984).
Although some investigators have reported positive findings, there is no consistent evidence across studies that perturbation measures, alone or in combination with other acoustic measures, can sort patients by diagnostic category with the kind of precision that would be needed in clinical settings. In terms of perceptual correlates, there is good evidence from several synthesis studies showing that increased jitter is associated with a sensation of roughness; the evidence regarding the perceptual correlates of shimmer is mixed (Wendahl, 1966; Heiberger & Horii, 1982; Hillenbrand, 1988). Due, in part, to a variety of technical problems in measuring perturbation from natural speech signals (discussed below), there is no consistent evidence for a straightforward relationship between perturbation values measured from natural speech and perceptual dimensions such as roughness, hoarseness, or overall severity of dysphonia, prompting some investigators to question the utility of perturbation measures for characterizing voice quality (Kreiman & Gerratt, 2005).

Methodological Concerns

Issues related to the measurement of jitter and shimmer have been examined in minute detail in a large number of methodological studies. The majority of the technical limitations that have been reported in these studies have their origin in two fundamental facts about perturbation measures: (1) the phenomena being measured are quite small (e.g., percent jitter is generally below about 1% in normal voices and seldom exceeds a few percent in disordered voices), and (2) the technique requires very precise determination of the onsets and offsets of individual pitch pulses. The net result is that even relatively small errors in fundamental frequency measurement can have a large effect on perturbation measures, and seemingly minor differences in technique from one measurement system to another can result in poor reliability.
Further, uncertainty regarding the precise locations of pitch pulses is especially great for the marginally periodic voices that are of greatest interest in most clinical settings (Rabinov, Kreiman, Gerratt & Bielamowicz, 1995). These technical problems, in combination with a considerable diversity in methods across systems, have resulted in the disappointing levels of reliability and validity that have been reported in several studies (e.g., Rabinov et al., 1995; Bielamowicz, Kreiman, Gerratt, Dauer & Berke, 1996; Karnell, Scherer & Fischer, 1991; Karnell, Hall & Landahl, 1995; Hillenbrand, 1987). For example, Bielamowicz et al. made perturbation measures from voice samples recorded from both normal and dysphonic speakers. The measures were obtained with three different commercial systems and a semiautomatic system in which errors in locating pitch-pulse boundaries could be corrected by hand. For jitter measures, some pairs of measurement systems agreed reasonably well, with correlations of approximately 0.80. However, other pairs of systems agreed very poorly, with correlations as low as 0.23. Reliability for shimmer measures was better, with most correlations in the 0.80-0.90 range.

Other Periodicity Measures

Many other methods have been used to measure the degree of periodicity in a voice waveform. Described below is a sampling of some of the techniques that have been developed.

Harmonics-to-Noise Ratio (HNR)

A measure called harmonics-to-noise ratio (HNR; Yumoto, Gould, & Baer, 1982) begins with a sustained vowel that has been delineated into individual pitch pulses. The numerator in the HNR calculation is simply the RMS energy in the average of all individual pitch pulses. (For a perfectly periodic signal, the RMS energy in the average pitch pulse will be equal to the total RMS energy in the original waveform since the waveform would consist of the exact repetition of the average pitch pulse.)
The noise component is estimated by subtracting this average pitch pulse from each individual pitch pulse in the original signal. The RMS energy in this difference waveform serves as the denominator in the HNR calculation. Studies using this technique have reported large differences in HNR between normal and disordered talkers, as well as strong correlations with subjective ratings of hoarseness severity (Yumoto et al., 1982; Yumoto, Sasaki & Okamura, 1984). Since this technique, like the perturbation measures discussed above, depends on identifying the precise locations of the boundaries of individual pitch pulses, HNR suffers from many of the reliability and validity problems described in the previous section (e.g., Hillenbrand, 1987).

Spectral Methods

Many techniques have been developed that measure signal periodicity in the frequency domain. In a perfectly periodic signal whose spectrum is measured with an ideal analyzer, all of the spectral energy will be found at harmonically related frequencies. With increasing departures from perfect periodicity, the ratio of harmonic energy to energy at non-harmonic frequencies will decrease. Panel a of Figure 2 shows the spectrum of a normally phonated vowel with a well defined harmonic structure. The spectrum of the moderately breathy voice in panel b shows less prominent harmonics and greater energy in the regions between harmonics. Several periodicity metrics have been developed around the straightforward principle of measuring the extent to which signal energy is concentrated in the harmonic regions of the spectrum. For example, Kasuya, Ogawa, Mashima, and Ebihara (1986) developed a spectral periodicity measure called Normalized Noise Energy (NNE). NNE measures correctly identified all T2-T4 glottic cancers as abnormal.
However, the technique missed roughly a fourth of T1 cancers, and there was no evidence that NNE measures could separate the glottic cancer group from other laryngeal pathologies (see also Hirano et al., 1988; Kojima, Gould, Lambaise, & Isshiki, 1980; Emanuel & Sansone, 1969; Deal & Emanuel, 1978; Lively & Emanuel, 1970; Cox, Ito & Morrison, 1989; Klingholtz, 1990).

Cepstral Methods

Another way to capture the degree of harmonic organization in a spectrum is to compute a spectrum of the log power spectrum, known as the cepstrum. For a periodic signal with a well defined harmonic structure, the second spectrum will show a strong component corresponding to the regularity of the harmonic peaks. With increasing departures from perfect periodicity, the prominence of this cepstral peak will be reduced. Hillenbrand et al. (1994) developed a metric called Cepstral Peak Prominence (CPP), which was simply an amplitude-normalized measure of the amplitude of the cepstral peak corresponding to the harmonic regularity (see panels c and d of Figure 2). CPP was shown to correlate strongly with listener ratings of breathiness for both sustained vowels and continuous speech (see also Hillenbrand & Houde, 1996; de Krom, 1993). In addition, multi-factor acoustic models in which CPP measures figure prominently have been found to accurately predict listener ratings of severity of dysphonia (Awan, Roy & Dromey, 2009), differentiate perceptually distinct pathologic voice types (Awan & Roy, 2005), and classify voices by diagnostic category (Callan, Kent, Roy & Tasko, 1999).

Autocorrelation

Autocorrelation analysis computes the correlation between a signal and a delayed copy of the same signal at delays corresponding to the minimum and maximum expected fundamental periods. The basic idea behind the technique is that the correlation between the signal and the delayed copy will reach a maximum when the copy is offset by exactly one fundamental period.
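A minimal sketch of this search is given below. It assumes a simple normalization by the zero-lag energy and a hypothetical search range of 75 to 500 Hz. Neither choice is taken from the studies cited here, and the function name is illustrative.

```python
import numpy as np

def autocorr_peak(signal, fs, f0_min=75.0, f0_max=500.0):
    """Return (period in seconds, peak height) of the normalized autocorrelation.

    The search is restricted to lags between 1/f0_max and 1/f0_min.
    """
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    lag_min = int(fs / f0_max)
    lag_max = min(int(fs / f0_min), len(x) - 1)
    energy = np.dot(x, x)                      # zero-lag value used for normalization
    lags = np.arange(lag_min, lag_max + 1)
    r = np.array([np.dot(x[:-lag], x[lag:]) / energy for lag in lags])
    best = int(np.argmax(r))
    return float(lags[best] / fs), float(r[best])

# Hypothetical test signals: a 100 Hz harmonic complex, with and without noise.
fs = 16000
t = np.arange(1600) / fs
clean = sum(a * np.cos(2 * np.pi * 100 * k * t)
            for k, a in [(1, 1.0), (2, 0.5), (3, 0.25)])
noisy = clean + np.random.default_rng(0).normal(0.0, 1.0, len(t))
print(autocorr_peak(clean, fs))   # period estimate and peak height
print(autocorr_peak(noisy, fs))   # added noise lowers the peak
```

Because only the overlapping samples contribute at each lag, the peak for a perfectly periodic signal falls slightly below 1.0 in this sketch. Other normalizations are possible.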
The delay at which the peak occurs serves as an estimate of the fundamental period, and the prominence of the peak in the autocorrelation function can serve as a measure of the degree of signal periodicity; i.e., highly periodic signals should show more prominent peaks than marginally periodic signals (see Figure 3). Davis (1976) developed a measure called Pitch Amplitude (PA) that was based on autocorrelation analysis of the LPC residue signal. Significant differences in PA measures were found between normal and pathological speakers, although the overlap between the groups was sufficiently great that a screening tool based on this measure does not seem feasible. A closely related measure was applied to ordinary voice waveforms by Hillenbrand et al. (1994; 1996) and was found to account for a large percentage of the variance in breathiness ratings from sustained vowels and connected speech samples produced by normal and pathological speakers.

Measures of Spectral Shape

Spectral shape measures have been directed primarily at quantifying the level of aspiration noise that is associated with insufficient glottal closure. Since aspiration noise is relatively strong in the higher frequencies, a voice signal produced with incomplete glottal closure might be expected to show a greater ratio of high- to low-frequency energy than normally phonated voice signals. Figure 4 shows long-term average spectra (Fourier spectra calculated every 5 ms and averaged over the full utterance) computed from short sentences read by five normal speakers and five speakers with voice disorders that are characterized by moderate to severe breathiness. Large differences can be seen in the levels of high- versus low-frequency energy. Frokjaer-Jensen and Prytz (1976) developed a straightforward measure of spectral shape that consisted of the ratio of energy above 1 kHz to energy below 1 kHz from long-term average spectra of 45-second samples of read text.
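A ratio of this kind can be sketched in a few lines. The 5 ms frame advance follows the description of Figure 4, but the frame length, the Hann window, and the function names are assumptions made for illustration and are not details of the Frokjaer-Jensen and Prytz procedure.

```python
import numpy as np

def long_term_average_spectrum(signal, fs, frame_len=512, hop_s=0.005):
    """Average the power spectra of Hann-windowed frames taken every hop_s seconds."""
    x = np.asarray(signal, dtype=float)
    if len(x) < frame_len:
        raise ValueError("signal is shorter than one analysis frame")
    hop = max(1, int(hop_s * fs))
    window = np.hanning(frame_len)
    frames = [x[s:s + frame_len] * window
              for s in range(0, len(x) - frame_len + 1, hop)]
    power = np.mean([np.abs(np.fft.rfft(f)) ** 2 for f in frames], axis=0)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    return freqs, power

def high_low_ratio_db(signal, fs, split_hz=1000.0):
    """Energy above split_hz relative to energy below it, in dB."""
    freqs, power = long_term_average_spectrum(signal, fs)
    high = power[freqs >= split_hz].sum()
    low = power[freqs < split_hz].sum()
    return float(10.0 * np.log10(high / low))

# Hypothetical signal: a strong 200 Hz component plus a weak 3 kHz component.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t) + 0.1 * np.sin(2 * np.pi * 3000 * t)
print(high_low_ratio_db(x, fs))   # negative: most energy lies below 1 kHz
```

Raising the level of the high-frequency component, as would be expected with added aspiration noise, moves the ratio toward zero or above.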
Frokjaer-Jensen and Prytz found significant reductions in high frequency energy following treatment. Conceptually similar measures described by Klich (1982), Fukazawa, El-Assuooty and Honjo (1988), Shrivastav and colleagues (Shrivastav & Sapienza, 2003; Shrivastav & Camacho, 2010) and Hillenbrand et al. (1996) were found to correlate well with listener ratings of breathiness (but see Klatt & Klatt, 1990, and Hillenbrand, 1988, for negative findings). Two significant features of this family of spectral shape measures are worth noting. First, the methods are technically very straightforward and, although it has not been studied, it is difficult to envision reliability problems of the kind that have been reported for measures such as pitch and amplitude perturbation. Second, the measures are quite suitable for read texts as well as sustained vowels. In fact, recent results indicate that spectral tilt measures are more closely associated with perceived voice quality for connected speech than for sustained vowels (Hillenbrand et al., 1996).

Intensity Measures

Measures of vocal intensity have received rather limited attention in comparison with measures of periodicity and spectral shape. Although reduced intensity is associated with many laryngeal pathologies, the lack of standardized procedures for measuring intensity and the difficulty of controlling vocal effort complicate the interpretation of these measurements. A reduction in the dynamic range of phonation intensity is frequently associated with laryngeal pathology. Coleman, Mabis and Hinson (1977) reported an intensity range of approximately 50 dB in the middle of the pitch range for a group of young normal subjects, although much smaller normative values for intensity range were reported by Gramming (1988) and Colton, Casper and Leonard (2011). Vocal intensity range is known to depend on fundamental frequency (e.g., Wolfe, Stanley & Sette, 1935).
Consequently, a complete picture of a speaker's dynamic range for intensity requires sampling the minimum and maximum vocal intensity for a broad range of fundamental frequencies. A graphic presentation of the minimum and maximum phonation intensity for different fundamental frequencies is known variously as a fundamental frequency-intensity profile, a voice range profile (VRP), a phonetogram, or a phonogram. Several techniques have been used to elicit VRPs (see Coleman, 1993). In the method described by Coleman et al. (1977), the minimum and maximum phonation intensities are measured every 10% of the total fundamental frequency range. An averaged VRP for a group of ten normal men from Coleman et al. (1977) is shown in Figure 5. The compression of the dynamic range at the low and high ends of the fundamental frequency range, and the upward trend in minimum phonation intensities at the high end of the fundamental frequency range, have been observed in other studies (e.g., Gramming, 1988).

Conclusions

A great deal of effort has gone into the development of acoustic measures that can be used to quantify various aspects of vocal function. These efforts have been more successful in some areas than others. In particular, efforts to infer underlying pathological conditions based on the acoustic signal have been limited in part by our incomplete understanding of some fundamental relationships between physiology and acoustics, and in part by the inherently complex, one-to-many relationships that exist between acoustic features and underlying pathological conditions. For example, there is no single pathological condition that produces the turbulence noise that is captured by many of the measures that have been developed, nor is there a single laryngeal pathology that can produce aperiodicities in the laryngeal vibratory pattern that other acoustic measures have attempted to quantify.
This should not lead us to conclude that measures of the type described here are necessarily of limited value in the assessment of vocal function. Medicine is replete with examples of diagnostic measures whose outcomes are not uniquely associated with any individual pathological condition, but which can be of considerable significance in combination with other diagnostic information. Commonplace examples include leukocyte counts, blood pressure, and body temperature. It is doubtful that even the most enthusiastic proponent of acoustic measures of voice would place any of the acoustic methods alongside these everyday diagnostic workhorses, but the best of the acoustic measures can contribute significantly to the assessment of vocal function.

References

Awan, S., & Roy, N. (2005). Acoustic prediction of voice type in women with functional dysphonia. Journal of Voice, 19, 268-282.
Awan, S., Roy, N., & Dromey, C. (2009). Estimating dysphonia severity in continuous speech: Application of a multi-parameter spectral/cepstral model. Clinical Linguistics and Phonetics, 23, 825-841.
Baken, R.J. (1987). Clinical Measurement of Speech and Voice. Boston: College Hill.
Bielamowicz, S., Kreiman, J., Gerratt, B.R., Dauer, M.S., & Berke, G.S. (1996). Comparison of voice analysis systems for perturbation measurement. Journal of Speech and Hearing Research, 39, 126-134.
Callan, D.E., Kent, R.D., Roy, N., & Tasko, S.M. (1999). Self-organizing map for the classification of normal and disordered voices. Journal of Speech, Language and Hearing Research, 42, 355-366.
Coleman, R.F. (1993). Sources of variation in phonetograms. Journal of Voice, 7, 1-14.
Coleman, R.F., Mabis, J.H., & Hinson, J.K. (1977). Fundamental frequency-sound pressure level profiles of adult male and female voices. Journal of Speech and Hearing Research, 20, 197-204.
Colton, R., Casper, J.K., & Leonard, R. (2011). Understanding Voice Problems: A Physiological Perspective for Diagnosis and Management.
4th Ed. Baltimore: Lippincott Williams & Wilkins.
Cox, N.B., Ito, M., & Morrison, M. (1989). Technical considerations in computation of spectral harmonics-to-noise ratios for sustained vowels. Journal of Speech and Hearing Research, 32, 203-218.
Davis, S.B. (1976). Computer evaluation of laryngeal pathology based on inverse filtering of speech. SCRL Monograph 13, Speech Communication Research Laboratory, Santa Barbara, CA.
Davis, S.B. (1981). Acoustic characteristics of normal and pathological voices. ASHA Reports, 11, 97-115.
de Krom, G. (1993). A cepstrum-based technique for determining a harmonic-to-noise ratio in speech signals. Journal of Speech and Hearing Research, 36, 254-266.
Deal, R., & Emanuel, F.W. (1978). Some waveform and spectral features of vowel roughness. Journal of Speech and Hearing Research, 21, 250-264.
Emanuel, F.W., & Sansone, F. (1969). Some spectral features of "normal" and simulated "rough" vowels. Folia Phoniatrica, 21, 401-415.
Frokjaer-Jensen, B., & Prytz, S. (1976). Registration of voice quality. Bruel and Kjaer Technical Review, 3, 3-17.
Fukazawa, T., El-Assuooty, A., & Honjo, I. (1988). A new index for evaluation of the turbulent noise in pathological voice. Journal of the Acoustical Society of America, 83, 1189-1193.
Gramming, P. (1988). The phonetogram: An experimental and clinical study. Doctoral dissertation, University of Lund, Malmo, Sweden.
Hadjitodorov, S., & Mitev, P. (2002). A computer system for acoustic analysis of pathologic voices and laryngeal diseases screening. Medical Engineering and Physics, 24, 419-429.
Hecker, M.H.L., & Kreul, E.J. (1971). Descriptions of the speech of patients with cancer of the vocal folds. Part I: Measures of fundamental frequency. Journal of the Acoustical Society of America, 49, 1275-1282.
Heiberger, V.L., & Horii, Y. (1982). Jitter and shimmer in sustained phonation. In N.J. Lass (Ed.), Speech and Language: Advances in Basic Research and Practice, Vol. 7 (pp. 299-332).
New York: Academic Press.
Hillenbrand, J.M. (1987). A methodological study of perturbation and additive noise in synthetically generated voice signals. Journal of Speech and Hearing Research, 30, 448-461.
Hillenbrand, J.M. (1988). Perception of aperiodicities in synthetically generated voices. Journal of the Acoustical Society of America, 83, 2361-2371.
Hillenbrand, J.M., Cleveland, R.A., & Erickson, R.L. (1994). Acoustic correlates of breathy vocal quality. Journal of Speech and Hearing Research, 37, 769-778.
Hillenbrand, J.M., & Houde, R.A. (1996). Acoustic correlates of breathy vocal quality: Dysphonic voices and continuous speech. Journal of Speech and Hearing Research, 39, 311-321.
Hirano, M. (1981). Clinical Examination of Voice. New York: Springer-Verlag.
Hirano, M., Hibi, S., Yoshida, T., Hirade, Y., Kasuya, H., & Kikuchi, Y. (1988). Acoustic analysis of pathological voice. Acta Otolaryngologica, 105, 432-438.
Horii, Y. (1979). Fundamental frequency perturbation observed in sustained phonation. Journal of Speech and Hearing Research, 22, 5-19.
Javkin, H., Antonanzas-Barroso, & Maddieson, I. (1987). Digital inverse filtering for linguistic research. Journal of Speech and Hearing Research, 30, 122-129.
Karnell, M.P., Hall, K.D., & Landahl, K.L. (1995). Comparison of fundamental frequency and perturbation measurements among three analysis systems. Journal of Voice, 9, 383-393.
Karnell, M.P., Scherer, R.C., & Fischer, L.B. (1991). Comparison of acoustic voice perturbation measures among three independent voice laboratories. Journal of Speech and Hearing Research, 34, 781-795.
Kasuya, H., Ogawa, S., Mashima, K., & Ebihara, S. (1986). Normalized noise energy as an acoustic measure to evaluate pathologic voice. Journal of the Acoustical Society of America, 80, 1329-1334.
Klatt, D.H., & Klatt, L.C. (1990). Analysis, synthesis, and perception of voice quality variations among female and male talkers.
Journal of the Acoustical Society of America, 87, 820-857. Klich, R.J. (1982). Relationships of vowel characteristics to listener ratings of breathiness. Journal of Speech and Hearing Research, 25, 574-580. Klingholtz, F. (1990). Acoustic recognition of voice disorders: A comparative study of running speech versus sustained vowels. Journal of the Acoustical Society of America, 87, 2218-2224. Koike, Y. (1973). Application of some acoustic measures for the evaluation of laryngeal dysfunction. Studia Phonologica, 7, 17-23. Kojima, H., Gould, W.J., Lambaise, A., and Isshiki, N. (1980). Computer analysis of hoarseness. Acta Otolaryngologica, 89, 547-554. Kreiman, J., and Gerratt, B. (2005). Perception of aperiodicity in pathological voice. Journal of the Acoustical Society of America, 117, 2201–2211. Laver, J., Hiller, S., and Beck, J.M. (1992). Acoustic waveform perturbations and voice disorders. Journal of Voice, 6, 115-126. Lieberman, P. (1961). Perturbations in vocal pitch. Journal of the Acoustical Society of America, 35, 344-353. Lieberman, P. (1963). Some acoustic measures of the fundamental periodicity of normal and pathologic larynges. Journal of the Acoustical Society of America, 35, 344-353. Lively, M.A., and Emanuel, F.W. (1970). Spectral noise levels and roughness severity ratings for normal and simulated rough vowels produced by adult females, Journal of Speech and Hearing Research, 13, 503-517. Ludlow, C.L., Bassich, C., Connor, N.P., Coulter, D.C., and Lee, Y.J. (1987). The validity of using phonatory jitter and shimmer to detect laryngeal pathology. In T. Baer, C. Sasaki, and K. Harris (Eds.), Laryngeal Function in Phonation and Respiration. (pp. 492-508). Boston: College Hill. Orlikoff, R.F., and Baken, R.J. (1990). Consideration of the relationship between fundamental frequency of phonation and vocal jitter. Folia Phoniatrica, 42, 31-40. 19 Pinto, N.B., and Titze, I.R. (1990). Unification of perturbation measures in speech signals. 
Journal of the Acoustical Society of America, 87, 1278-1289. Rabinov, C.R., Kreiman, J., Gerratt, B.R., and Bielamowicz, S. (1995). Comparing reliability of perceptual ratings of roughness and acoustic measures of jitter. Journal of Speech and Hearing Research, 38, 26-32. Rothenberg, M. (1973). A new inverse-filtering technique for deriving the glottal airflow waverorm during voicing. Journal of the Acoustical Society of America, 53, 1632-1645. Sapienza, C.M., Walton, S., & Murry, T. (2000). Adductor spasmodic dysphonia and muscular tension dysphonia: Acoustic analysis of sustained phonation and reading. Journal of Voice, 14, 502-520. Shrivastav, R., & Camacho, A. (2010). A computational model to predict changes in breathiness resulting from variations in aspiration noise level. Journal of Voice, 24, 395-405. Shrivastav, R., & Sapienza, C. (2003). Objective measures of breathy voice quality obtained using an auditory model. Journal of the Acoustical Society of America, 114, 2217–2224. Schoentgen, J. (1989). Jitter in sustained vowels and isolated sentences produced by dysphonic speakers. Speech Communication, 8, 61-79. Smith, B., Weinberg, B., Feth, L., and Horii, Y. (1978). Vocal roughness and jitter characteristics of vowels produced by esophageal speakers. Journal of Speech and Hearing Research, 21, 240-249. Sondhi, M.M. (1975). Measurement of the glottal waveform. Journal of the Acoustical Society of America, 57, 228-232. Takahashi, H., and Koike, Y. (1975). Some perceptual dimensions and acoustical correlates of pathologic voices. Acta Otolaryngologica, Suppl. 338, 1-24. Titze, I.R. (1994). Toward standards in acoustic analysis of voice. Journal of Voice, 8, 1-7. von Leden, H., Moore, P., and Timke, R. (1960). Laryngeal vibrations: Measurements of the glottal wave. Part III, The pathologic larynx. Archives of Otolaryngology, 71, 16-35. Wendahl, R.W. (1966). Laryngeal analog synthesis of jitter and shimmer auditory parameters of harshness. 
Folia Phoniatrica, 18, 98-108. Wolfe, S., Stanley, W., and Sette, W. (1935). Quantitative studies on the singing voice. Journal of the Acoustical Society of America, 6, 255-266. Yumoto, E., Gould, W.J., and Baer, T. (1982). Harmonics-to-noise ratio as an index of the degree 20 of hoarseness. Journal of the Acoustical Society of America, 71, 1544-1550. Yumoto, E., Sasaki, Y., and Okamura, H. (1984). Harmonics-to- noise ratio and psychophysical measurement of the degree of hoarseness. Journal of Speech and Hearing Research, 27, 2-6. Zyski, B., Bull, G., McDonald, W., and Johns, M. (1984). Perturbation analysis of normal and pathologic larynges. Folia Phoniatrica, 36, 190-198. 21 Figure Captions Figure 1. An acoustic signal, a glottal airflow signal estimated by glottal inverse filtering, and an LPC residue signal derived by applying an LPC inverse filter. Figure 2. Top panels: Fourier spectra for a nonbreathy (panel a) and a moderately breathy (panel b) voice signal. Bottom panels: cepstra of a nonbreathy (panel c) and a moderately breathy (panel d) voice signal. Figure 3. Autocorrelation functions for normally phonated and breathy voice signals. Figure 4. Long-term average spectra for five utterances spoken by nonpathological talkers and by five talkers with voice disorders characterized by breathiness. Signals from Hillenbrand et al. (1996). Figure 5. Voice range profile based on an average of ten normal men (from Coleman et al., 1977). 22 Acknowledgments I am grateful to Michael Clark and Robert Erickson for comments on a previous draft. Preparation of this chapter was supported by NIH grant R01-DC01661. 23 Figure 1 24 Figure 2 25 Figure 3 26 Figure 4 27
