Analysis and Synthesis of the ... Lateral Consonant Adrienne Prahler

Analysis and Synthesis of the American English Lateral Consonant
by Adrienne Prahler

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY, June 1998

© Adrienne Prahler, MCMXCVIII. All rights reserved.

The author hereby grants to MIT permission to reproduce and distribute publicly paper and electronic copies of this thesis document in whole or in part, and to grant others the right to do so.

Author: Department of Electrical Engineering and Computer Science, May 22, 1998
Certified by: Kenneth Stevens, Clarence LeBel Professor, Thesis Supervisor
Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Students

Abstract

The lateral consonant in English is generally produced with a backed tongue body, a midline closure of the tongue blade at the alveolar ridge, and a path around one or both of the lateral edges of the tongue blade. In pre-vocalic lateral consonants, the release of the closure causes a discontinuity in the spectral characteristics of the sound. Past attempts to synthesize syllable-initial lateral consonants using formant changes alone have not been entirely satisfactory.
Data from prior research has shown rapid changes not only in the formant frequencies but also in the glottal source amplitude and spectrum, as well as in the amplitudes of the formant peaks at the consonant release. Further measurements have been made on additional utterances, guided by models of lateral production. Synthesized lateral-vowel syllables that include additional changes in bandwidths, pole-zero pairs, spectral tilt, and the amplitude of voicing are judged to be more natural than lateral-vowel syllables with only formant transitions.

Thesis Supervisor: Kenneth Stevens
Title: Clarence LeBel Professor

Acknowledgments

I would like to thank Ken Stevens and all of the Speech Group for everything. Ken has been an inspiration to me both professionally and personally; thank you. Thank you to my parents and brother for all their love and support over the years - I would definitely not be here without all of you. Thank you to Peter for all the laughs and nights out - you kept me sane. Basak, what can I say except we are DONE!

Work supported in part by a LeBel Fellowship and by NIH Grant DC00075.

Contents

1 Introduction
  1.1 Motivation
  1.2 Properties
  1.3 Background
    1.3.1 Poles and Zeros of Vocal Tract Transfer Function
    1.3.2 Glottal Source Reduction
    1.3.3 Acoustic Losses
  1.4 Purpose of Research
2 Modeling
  2.1 Motivation
    2.1.1 Simple Lossless Tube
    2.1.2 Expected Numbers of Poles and Zeros
  2.2 Evaluation of Lateral Side Channels
    2.2.1 Equations
    2.2.2 Boundary Conditions
    2.2.3 Solve for Specific Cases of Lengths and Areas
  2.3 Evaluation of the Entire Model
    2.3.1 Equations
    2.3.2 Boundary Conditions
    2.3.3 Solution for specific cases of lengths and Areas
  2.4 Justification of the Model
    2.4.1 Calculation of the Number of Poles and Zeros and Approximate Locations
  2.5 Limitations of Model
    2.5.1 Assumptions
  2.6 Conclusions
3 Measurements
  3.1 Purpose of Measurements - What Makes a /l/?
  3.2 Method
  3.3 Measurements
    3.3.1 Singleton
    3.3.2 Cluster
  3.4 Singleton Analysis
  3.5 Analysis of Utterances with Consonant Clusters
  3.6 Comparisons of Environments
  3.7 Conclusions
    3.7.1 Abruptness
4 Synthesis
  4.1 Abruptness/Consonantal Quality
    4.1.1 Changes in Frequency
    4.1.2 Changes in Amplitude
  4.2 Method
5 Results of Perceptual Testing
  5.1 First Perceptual Experiments
  5.2 Final testing
6 Conclusions
  6.1 Further Research
A Matlab code for Modeling
  A.1 Solve Lateral Channel Equations
  A.2 Solve Equations for Lateral Branches when One Tube Disappears
  A.3 Solve for Entire Model
  A.4 Whole Model when One Side Branch Disappears

List of Figures

2-1 Model of side channels formed during lateral consonant production
2-2 Transfer function for model of side channels, $20\log|U_{out}/U_{in}|$, plotted for $A_{21} = A_{22} = 0.1$ cm² and $l_{21} = l_{22} = 8$ cm
2-3 Transfer function for model of side channels, $20\log|U_{out}/U_{in}|$, plotted for $A_{21} = 0.3$ cm², $A_{22} = 0.2$ cm² and $l_{21} = 8.5$ cm, $l_{22} = 7.5$ cm
2-4 Transfer function for model of side channels, $20\log|U_{out}/U_{in}|$, plotted for $A_{21} = A_{22} = 0.3$ cm² and $l_{21} = 4$ cm, $l_{22} = 12$ cm
2-5 Transfer function for model of side channels, $20\log|U_{out}/U_{in}|$, plotted for $A_{21} = 0.3$ cm², $A_{22} = 0.1$ cm² and $l_{21} = 4$ cm, $l_{22} = 12$ cm
2-6 Transfer function for model of side channels, $20\log|U_{out}/U_{in}|$, plotted for $A_{21}/A_{22}$ = 0.1 cm² to 0.5 cm² and $l_{21} + l_{22} = 16$ cm
2-7 Model of the vocal tract during lateral consonant production
2-8 Transfer function for model of vocal tract, $20\log|U_{out}/U_{in}|$, plotted for $A_{21} = A_{22} = 0.2$ cm² and $l_{21} = l_{22} = 8$ cm
2-9 Transfer function for model of vocal tract, $20\log|U_{out}/U_{in}|$, plotted for $A_{21} = A_{22} = 0.2$ cm², $l_{21} = 10$ cm, and $l_{22} = 8$ cm
2-10 Transfer function for model of vocal tract, $20\log|U_{out}/U_{in}|$, plotted for $A_{21} = 0.2$ cm², $A_{22} = 0.5$ cm² and $l_{21} = 11.5$ cm and $l_{22} = 4.5$ cm
2-11 Transfer function for model of vocal tract, $20\log|U_{out}/U_{in}|$, plotted for various values of $A_{21}$ and $A_{22}$ and $l_{21} + l_{22} = 16$ cm
3-1 The effect of the pre-emphasis on the spectra
3-2 Spectrogram of luck for male speaker
3-3 Spectrogram of voiced cluster utterance, bleed, for male speaker
3-4 Spectrogram of voiceless cluster utterance, plead, for male speaker
3-5 Changes in amplitudes of formants from lateral to vowel for singleton utterances: error bars represent standard deviation of data
3-6 Changes in formant frequencies from lateral to vowel for singleton utterances: error bars represent standard deviation of data
3-7 Changes in amplitudes of formants from liquid to vowel for voiced and some voiceless stop cluster utterances: error bars represent standard deviation of data
3-8 Changes in formant frequencies from liquid to vowel for voiced and some voiceless stop cluster utterances: error bars represent standard deviation of data
3-9 Comparison of Δ's of amplitudes of formants from lateral to vowel for singleton and cluster utterances: error bars represent standard deviation of data
3-10 Comparison of Δ's of formant frequencies from lateral to vowel for singleton and cluster utterances: error bars represent standard deviation of data
3-11 Measured values of formant frequencies during lateral consonant: error bars represent standard deviation of data
4-1 Spectrogram of natural utterance loot for male speaker
4-2 Spectrogram of first synthesized utterance of loot for male speaker
4-3 Spectrogram of second synthesized utterance of loot for male speaker
4-4 Spectra of natural utterance during /l/ using a 6.4 ms Hamming window
4-5 Spectra of first synthesized utterance during /l/ using a 6.4 ms Hamming window
4-6 Spectra of second synthesized utterance during /l/ using a 6.4 ms Hamming window
4-7 Formant trajectories for the synthesized word loot
4-8 Time varying voicing changes in synthesized utterance
4-9 Time varying additional pole and zero in synthesized utterance
4-10 Time varying bandwidth changes in synthesized utterance
4-11 Spectra of natural utterance during /l/ using 25.6 ms Hamming window
4-12 Spectra of first synthesized utterance during /l/ using 25.6 ms Hamming window
4-13 Spectra of second synthesized utterance during /l/ using 25.6 ms Hamming window
5-1 Results of perceptual experiments: % first utterance rated more natural by listeners
5-2 Total change in F1 and F2 between lateral and vowel vs. % times second synthesized utterances rated more natural than first synthesized utterance

List of Tables

3.1 Singleton utterances
3.2 Cluster utterances
3.3 Δ's of formant frequencies and amplitudes for singleton utterances; amplitude changes are in dB, frequencies are in Hz, and standard deviations (sd) across speakers and repetitions are given in the right half of the table
3.4 Δ's of formant frequencies and amplitudes for voiced and some voiceless stop cluster utterances; amplitude changes are in dB, frequencies are in Hz, and standard deviations (sd) across speakers and repetitions are given in the right half of the table
3.5 Comparison of singleton and cluster utterances for same speakers; amplitude changes are in dB, frequencies are in Hz, and standard deviations (sd) across speakers and repetitions are given in the lower half of the table
3.6 Measured formant frequency values in Hz of the lateral for singleton and cluster utterances
4.1 Constant synthesis parameters of utterance incorporating formant frequency transitions
4.2 Constant synthesis parameters of utterance incorporating voicing changes
5.1 Results of perceptual experiments: % first utterance rated more natural by listeners
5.2 Total change in frequency between lateral and vowel for synthesized utterances

Chapter 1 Introduction

1.1 Motivation

Consonants in most languages are different from vowels in many ways, including abruptness at the consonant release and a reduction in overall power during the sound [9]. The American English lateral consonant is very different from other sounds in that a similar vocal tract configuration produces an extreme variety of frequency spectra. The lateral consonant is sonorant and not continuant, so it is similar to nasal consonants, but differs from glides. The lateral must be examined to develop a better understanding of speech sounds in language.

The lateral is a liquid consonant which is more elusive than other consonants. Most sounds in languages have distinctive contrasts and minimal pairs, but the lateral lacks a clear contrasting sound, allowing more variations in the production of the sound. The lateral, however, can also be a syllable nucleus which lacks abruptness, and the perceptual cues must be contained in the spectrum only.
Understanding the perceptual cues within the frequency spectra and in the transitions to vowels is a practical motivation for studying the lateral. This information can be used for a variety of purposes including synthesis, speech recognition, and teaching second languages (to speakers of languages without a comparable lateral sound, e.g., Japanese).

1.2 Properties

The American English lateral /l/ is one of the semivowel consonants; it is produced with a complete closure on the midline of the vocal tract like many other consonants, and the narrowing of the vocal tract differentiates it from vowels. The lateral is produced with a backed and lowered tongue body and occlusion at the alveolar ridge. The closure does not extend across the full width of the tongue, so airflow continues around the tongue. The first formant (F1) is low, although higher than that typically found for a high vowel, and the second formant (F2) is barely separated from the first formant. The third formant (F3) generally has a relatively strong amplitude and is higher in frequency than the third formant frequency for most vowels [16, 4, 3]. Espy-Wilson reports the average formant frequencies of prevocalic /l/: F1 averaged across all speakers is 399 Hz, F2 is 1074 Hz, and F3 is 2553 Hz [6, 5].

The lateral is prone to considerable variation depending on the individual and the phonetic context, and this variability makes it more difficult to characterize than other consonants [5, 15]. The acoustic analysis of /l/ is not complete, and more work is needed to determine how to characterize the lateral acoustically, both for speech synthesis directly and for other applications including speech recognition and speech pathology.
This thesis will develop a theoretical model of the lateral, and will examine the validity of the model by acoustically analyzing the prevocalic lateral for various speakers and contexts, synthesizing the laterals using the theoretical model, and conducting perceptual experiments to determine the primary acoustic cues for a lateral.

1.3 Background

1.3.1 Poles and Zeros of Vocal Tract Transfer Function

To better understand the acoustic characteristics of the American English laterals, some work has focused on modeling the vocal tract for this class of sounds. The acoustic theory of the laterals was first described by Fant [7]. Assuming that the constriction sizes in the vocal tract are large enough (0.17 cm² or greater [15]) to avoid the production of turbulence noise, the acoustic theory is simple for the laterals. The vocal tract is modeled as a tube with constrictions and side branches; the first formant is approximately a Helmholtz resonance, with the acoustic mass due to the lateral constriction. The low second formant results from the pharyngeal constriction caused by the tongue body backing in the cavity behind the constriction. The third formant is roughly a resonance of the mouth cavity anterior to the constriction, and the fourth formant is determined from the length of the entire cavity system.

The production of the lateral /l/ with an alveolar point of articulation creates an interior cavity formed by the tongue blade. An additional cavity is created under the tongue which couples with the back cavity. The result is that there are zeros as well as poles in the transfer function for a lateral configuration. The lowest zero is possibly caused by the cavity formed by the tongue blade, while an additional pole is due to the entire system.
Fant suggests that the effect of the pole-zero pair is simply that the fourth formant takes on the role of the third and so on, and that a lateral can be synthesized without the additional zero if the formants are simply shifted appropriately [7].

A modification of Fant's theory is proposed by Stevens. The side branches around the tongue will affect the high frequency behavior of the lateral. The transfer function of the vocal tract will have additional zeros when the branches in the airway are asymmetrical. The exact locations of the zeros will vary since the lengths of the side branches vary among speakers. Stevens suggests great variability in the spectrum of the lateral in the frequency range of 2500-4000 Hz depending on the surrounding environment and the individual speaker. The rate of release of the lateral is slower than that of other alveolar consonants, and the possible interaction of the additional poles and zeros with the surrounding speech segments is unknown [15]. Unlike Fant, Stevens suggests that the effect of the poles and zeros, regardless of their actual location, is not simply a shifting of the formants up in frequency, but instead an overall modification of the spectrum in the high frequency range.

Previously, a limiting factor in creating an accurate model of the lateral was the lack of data on the length of the lateral channels. Recent work by Narayanan et al. [12, 13] at UCLA using Magnetic Resonance Imaging (MRI) and electropalatographic techniques (EPG) has provided valuable data that can lead to a more accurate understanding of the geometry, acoustics, and aerodynamics of the lateral. Using the MRI and EPG studies, detailed 3-D data of the vocal tract during the production of the laterals were obtained, and the appearance of side channels during the lateral consonant was confirmed.
However, MRI and EPG data show that the left and right channels are not equal in either area or length, and there is great variation across subjects and phonetic contexts. All sounds in the study, however, are produced with a lingual occlusion approximately 1-1.5 cm away from the lip opening and a closure length of 0.6-1.5 cm. The cross-sectional area of the channels formed varies from 0.1 to 0.5 cm², and the relative areas of the right and left channels show variation even for the same speaker.

The acoustic significance of these data is that the physical dimensions and configuration of the vocal tract and oral cavity during the production of the lateral are finally known, and this knowledge permits more accurate modeling of the acoustics. The lateral channel areas and low flow rates suggest no significant pressure drop in the supraglottal constriction region and negligible chances for frication. These observations support the assumption of Stevens [15] that turbulence noise can be ignored in modeling the vocal tract for the lateral, at least when the glottis is in the normal configuration for phonation. This finding also suggests sustained, uniform flow throughout the duration of the sound.

The first formant frequency can be attributed to the Helmholtz resonance between the back cavity volume and oral constriction, with the low frequency behavior being dominated by the back cavity configuration. The change of F1 at the release is expected to be abrupt due to the abrupt changes in area functions arising from the anterior tongue blade movement. The second formant is associated with the back cavity resonance and can be greatly affected by retracting or raising the posterior tongue body.
Three-dimensional modeling from the collected data suggests that the tongue blade has a tendency towards an inward lateral compression which creates the lateral channels observed [12], supporting the hypothesis of Sproat and Fujimura [14] that the tongue blade narrowing is a feature of laterals.

1.3.2 Glottal Source Reduction

In addition to the extra poles and zeros in the frequency spectrum during laterals, a study by Bickley and Stevens [1] found a change in the glottal waveform when a constriction is formed in the vocal tract during the semivowels. The glottal source is affected by the narrowing of the vocal tract during the production of the lateral /l/, which creates an increased acoustic mass and changes the volume velocity waveform at the glottis. As the acoustic impedance increases due to the decreasing constriction size, the pressure drop across the constriction increases, causing a decrease in the pressure across the glottis. The pressure drop across the constriction causes an increase in the intraoral pressure during the open phase of the glottal waveform. The decreased transglottal pressure affects the forces on the vocal folds and leads to a decreased amplitude of the volume velocity waveform. Although there are individual differences across speakers, the study suggested that there is always a decrease in intensity during the lateral relative to the adjacent vowel. There is also a less abrupt termination of the glottal pulse, leading to a greater spectrum tilt.

1.3.3 Acoustic Losses

Research shows that in addition to the presence of poles and zeros and glottal source reduction during the lateral, acoustic losses in the vocal tract increase. As the acoustic resistance at the constriction increases, there are significant effects on the bandwidth of the first formant. The bandwidth of the second formant also increases significantly during the lateral [16].
The study by Bickley and Stevens [1] found a change in amplitude of the first harmonic to be on average 2.9 dB for /l/, while the average change in amplitude of the first formant was 7.5 dB. This greater change in the first formant amplitude relative to the amplitude of the fundamental suggests an increase in the bandwidth of the first formant. The large bandwidth of F1 can be attributed to the acoustic losses in the oral constriction, which cause the overall reduction of the amplitude of the spectrum during the /l/ [2].

1.4 Purpose of Research

Previous research still leaves many unanswered questions concerning the acoustic properties of the American English lateral /l/. Research has suggested several models of the vocal tract for the production of prevocalic /l/. However, the variability among individuals and phonetic contexts makes it difficult to determine the main features of an accurate model without further research. Additional poles and zeros exist in the frequency spectrum during the production of the /l/, but their exact placement and perceptual importance are not known. Glottal source reduction and acoustic losses also occur, but their amount and the best method to simulate them in synthesis are not known.

This research acoustically analyzes the lateral to determine the essential components and perceptual cues for the /l/ for theoretical modeling and synthesis. A database of utterances containing prevocalic /l/ is created for a variety of male and female speakers. These laterals are acoustically analyzed, looking specifically for pole-zero pairs caused by the unequal lateral channels which have been confirmed by the MRI and EPG studies [13], back reactions on the glottal source during the lateral due to the constriction of the vocal tract [2], and increasing bandwidths caused by suggested acoustic losses [15]. Using the previous research and the gathered acoustic evidence, a theoretical model of the vocal tract for the laterals is developed.
A selected set of database words is synthesized using the Klatt synthesizer, and perceptual experiments in which the listeners rate the naturalness of the various synthesized sounds are used to determine the validity of the theoretical model and pinpoint the acoustic cues for the prevocalic /l/.

One of the primary acoustic features examined is the additional poles and zeros in the spectrum during the lateral. It is hypothesized that the observed poles and zeros found around 1.5-3 kHz are due to some combination of the lateral channels and back cavity, but the great variability of the right and left channel area functions for the same speaker and across speakers makes it difficult to determine the exact effects and their perceptual importance [12]. This research attempts to determine whether the exact placement of the pole and zero is important perceptually, or whether the overall high frequency effects of a fast drop-off of the spectral amplitude above the frequency of the second formant, combined with the higher third formant frequency, are the primary acoustic cues used by listeners for the prevocalic lateral.

Another acoustic feature to be examined in the database of sounds is the glottal source during the lateral. Previous research suggests that with the narrowing of the vocal tract during the lateral, there are some back effects that alter the actual glottal source. Preliminary research also shows some sort of glottal back effect for word-initial prevocalic laterals, but the reduction is not as apparent for prevocalic laterals found in consonant clusters. The variability of the glottal source reduction between speakers and phonetic contexts makes it difficult to determine what is perceptually important. For accurate, natural sounding synthesized laterals and a valid theoretical model of the lateral, the manifestation of glottal source reduction must be determined through acoustic analysis of various speakers and contexts.
Acoustic losses are also examined in prevocalic laterals. Preliminary research demonstrates increased acoustic losses during the lateral, but the exact magnitude is still not determined. The most reliable measure of acoustic loss is the increased bandwidth of the first formant. Acoustic analysis of the database is used to estimate the amount of acoustic loss for modeling and synthesis.

The acoustic analysis of the prevocalic laterals is used to determine the validity of a theoretical model of the vocal tract. This model is then used for synthesis purposes. Previous synthesis work has contrasted the two liquid consonants, /l/ and /r/, and used the differences between the two sounds to define the lateral for synthesis. The work incorporated a prominent F3 and a decreasing F3 value at the release of the lateral to discriminate the sound from /r/ [11]. A change in the voicing is created by lowering the amplitude of voicing during the liquid, but the naturalness of these synthesized laterals is not very good.

Synthesis performed in the present study attempts to determine the parameters inherently important for the natural synthesis of the lateral. The Klatt synthesizer parameters that are specifically examined include TL (tilt), BW (bandwidths of formants), the possibility of poles and zeros, and formant changes [8]. Initial synthesis of sample data suggests that the TL parameter increases, the bandwidth of the first formant increases, a pole and zero are present near the third formant, and abrupt transitions of the formants occur at the release of the lateral.

The importance of the additional details in the model of the prevocalic lateral is determined through perceptual testing. A set of words in the database is synthesized for a speaker based on the theoretical model, and actual utterances are also used to determine the synthesis parameters necessary for a good match of the synthesized lateral to the natural lateral. Two versions of the word are synthesized.
One version varies only the formant transitions over time, while the other varies bandwidths, spectral tilt, and a pole-zero pair in addition to the formant frequencies. Listeners rate the naturalness of the various sounds, spoken and synthesized, in perceptual experiments. These data are then analyzed to determine what parameters are primary and whether a more complicated model of the lateral is necessary for good synthesis.

Chapter 2 Modeling

2.1 Motivation

In this chapter, a model of the acoustic behavior of the vocal tract for lateral consonants is developed. The aim is to interpret the changes in the frequency spectra that occur during the production of the lateral consonant. We are interested in the natural resonance frequencies of the vocal tract configuration during the lateral production, and also the frequencies of possible zeros in the transfer function. Acoustic losses, including bandwidth changes and glottal source changes, are not considered in the model.

The proposed model offers some explanation of the acoustic attributes of the lateral. It also attempts to explain how the same sound can be produced with such large variability in spectral peaks and valleys. This variability suggests that the acoustic cues are not limited to the locations of the resonances, and that there may be more complicated perceptual effects.

2.1.1 Simple Lossless Tube

To understand and model the vocal tract during lateral consonant production, a simple lossless tube is first examined. A combination of such tubes is then used as the basis for the model of the lateral consonant. The two variables of interest in the tube are the sound pressure and the volume velocity.
The sound pressure $p(x, t)$ and volume velocity $U(x, t)$ for one-dimensional propagation are

\[ \frac{\partial p}{\partial x} = -\frac{\rho}{A}\frac{\partial U}{\partial t} \tag{2.1} \]

\[ \frac{\partial U}{\partial x} = -\frac{A}{\gamma P_0}\frac{\partial p}{\partial t} \tag{2.2} \]

where $A$ is the cross-sectional area, $P_0$ is the ambient air pressure, $\rho$ is the ambient density of the air (0.00114 gm/cm³), and $\gamma$ is the ratio of the specific heat at constant pressure to that at constant volume ($\gamma = 1.4$ for air). Assuming an exponential time dependence and a constant area function, these equations reduce to

\[ \frac{d^2 p}{dx^2} + k^2 p = 0 \tag{2.3} \]

and

\[ U = -\frac{A}{j 2\pi f \rho}\frac{dp}{dx} \tag{2.4} \]

where $k = 2\pi f/c$, and $c = \sqrt{\gamma P_0/\rho}$ (the velocity of sound, 35,400 cm/sec in the body) [15].

2.1.2 Expected Numbers of Poles and Zeros

Previous research verifies that adding channels to a model increases the number of poles and in some cases causes zeros to appear in the transfer function. In the simple case of a single tube, the expected number of poles and zeros in the transfer function for a uniform tube is determined by the length of the tube and the position and the type of source. The average spacing of poles for a uniform tube of length $l$ is $c/(2l)$. By adding side branches of total length $l_s$, the average spacing of poles for the system decreases to $c/(2(l + l_s))$. The number of poles up to a certain frequency $f$ is $n_p = 2(l + l_s)f/c$. Zeros also appear in the transfer function, and the number up to frequency $f$ is $n_z = 2 l_s f/c$. The first minimum, or zero, in the transfer function occurs at $f_z = c/(4 l_s)$ for a uniform side branch that is closed at the end [15, 7].

The lateral, however, consists of two parallel paths, not just an additional side branch, and this affects the expected approximate number of poles and zeros. Stevens hypothesizes that poles of the transfer function are replaced by pole-zero-pole clusters and that the first appearance of such a cluster will occur at the frequency for which the lateral channels are half a wavelength long [15]. A difficulty with two parallel paths is determining how they will interact; one path could be considered as the main channel, while the other acts as a side branch.
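The counting rules for a uniform tube with a closed side branch can be tried numerically. The following is a minimal sketch in Python, not part of the thesis; the function names are hypothetical, and the tube and branch lengths used in the usage section are illustrative values chosen only to give round numbers. It assumes the textbook relations of one pole per c/(2 x length) on average and a first side-branch zero at the quarter-wavelength frequency.

```python
# Illustrative sketch (assumed helper names): pole and zero bookkeeping for a
# uniform lossless tube with a closed side branch.

C_SOUND = 35400.0  # speed of sound in cm/s, the value quoted in the text


def pole_spacing(total_length_cm: float, c: float = C_SOUND) -> float:
    """Average spacing (Hz) between poles for a tube system of this total length."""
    return c / (2.0 * total_length_cm)


def first_side_branch_zero(branch_length_cm: float, c: float = C_SOUND) -> float:
    """First zero (Hz) of a uniform side branch closed at its far end (quarter wavelength)."""
    return c / (4.0 * branch_length_cm)


def count_up_to(f_hz: float, length_cm: float, c: float = C_SOUND) -> float:
    """Approximate number of poles (or zeros) below f_hz contributed by this length of tube."""
    return 2.0 * length_cm * f_hz / c


if __name__ == "__main__":
    main_len = 17.7    # hypothetical main-tube length in cm
    branch_len = 2.5   # hypothetical side-branch length in cm
    print("pole spacing, main tube only (Hz):", pole_spacing(main_len))
    print("pole spacing with side branch (Hz):", pole_spacing(main_len + branch_len))
    print("first side-branch zero (Hz):", first_side_branch_zero(branch_len))
    print("poles below 5 kHz:", count_up_to(5000.0, main_len + branch_len))
    print("zeros below 5 kHz:", count_up_to(5000.0, branch_len))
```

These rules apply to a single side branch; as the text notes, two parallel paths may interact differently, so the sketch is only a first estimate.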
2.2 Evaluation of Lateral Side Channels

Fant suggests the possibility of the formation of side channels during lateral consonant production, and this is confirmed by the data gathered at UCLA [7, 13]. The addition of side channels during the lateral consonant changes the transfer function of the vocal tract configuration. Previous research shows that speakers produce the lateral consonant with a variety of vocal tract configurations depending on the individual and the phonetic context [12]. Speakers can have two side channels of different lengths and different areas, or only one channel. The side channels formed during the lateral consonant can be modeled approximately as two uniform tubes, as shown in Figure 2-1. We consider the acoustic behavior of such a configuration.

2.2.1 Equations

Solving Equations 2.3 and 2.4 for the input and output volume velocities and pressures, the equations reduce to

$$\begin{pmatrix} U_{in_{21}} \\ P_{in_{21}} \end{pmatrix} = \begin{pmatrix} \cos(-kl_{21}) & -j\frac{A_{21}}{\rho c}\sin(-kl_{21}) \\ -j\frac{\rho c}{A_{21}}\sin(-kl_{21}) & \cos(-kl_{21}) \end{pmatrix} \begin{pmatrix} U_{out_{21}} \\ P_{out_{21}} \end{pmatrix} \qquad (2.5)$$

and

$$\begin{pmatrix} U_{in_{22}} \\ P_{in_{22}} \end{pmatrix} = \begin{pmatrix} \cos(-kl_{22}) & -j\frac{A_{22}}{\rho c}\sin(-kl_{22}) \\ -j\frac{\rho c}{A_{22}}\sin(-kl_{22}) & \cos(-kl_{22}) \end{pmatrix} \begin{pmatrix} U_{out_{22}} \\ P_{out_{22}} \end{pmatrix} \qquad (2.6)$$

where $l_{21}$ and $l_{22}$ are the lengths of each of the component paths, and $A_{21}$ and $A_{22}$ are the constant cross-sectional areas.

Figure 2-1: Model of side channels formed during lateral consonant production

2.2.2 Boundary Conditions

The boundary conditions of the system are restrictions on the volume velocities and sound pressures of the two tubes at the input and output. The input volume velocity, $U_{in}$, is the sum of $U_{in_{21}}$ and $U_{in_{22}}$, and the output volume velocity, $U_{out}$, is the sum of the two output volume velocities. The volume velocities are assumed to sum without any interference at the two locations. The input pressures of the two tubes are assumed to be equal, as are the output pressures.
The transfer function, $U_{out}/U_{in}$, of the two side channels is obtained by solving Equations 2.5 and 2.6 for the boundary conditions of the system, assuming zero pressure at the output of the system, $P_{out} = 0$:

$$\frac{U_{out}}{U_{in}} = \frac{A_{21}\sin(kl_{22}) + A_{22}\sin(kl_{21})}{A_{21}\cos(kl_{21})\sin(kl_{22}) + A_{22}\sin(kl_{21})\cos(kl_{22})} \qquad (2.7)$$

2.2.3 Solve for Specific Cases of Lengths and Areas

The transfer function of the system depends on the lengths and areas of the two channels. The model is a lossless model and is useful for locating the resonance peaks and valleys of the system, but not the amplitudes. In this simple system, a zero will occur when the numerator of the transfer function is zero and a pole will occur when the denominator is zero. A zero occurs at the frequencies for which

$$\sin(kl_{22}) = -\frac{A_{22}}{A_{21}}\sin(kl_{21}) \qquad (2.8)$$

and a pole occurs when

$$\cos(kl_{21})\sin(kl_{22}) = -\frac{A_{22}}{A_{21}}\sin(kl_{21})\cos(kl_{22}) \qquad (2.9)$$

The transfer function has been calculated for several different cases of lengths and area configurations, reflecting the possible vocal tract configurations used by individuals.

Figure 2-2: Transfer function for model of side channels, $20\log|U_{out}/U_{in}|$, plotted against frequency (0 to 6000 Hz) for $A_{21} = A_{22} = 0.1$ cm$^2$ and $l_{21} = l_{22} = 8$ cm

Side Channels the Same Lengths

The transfer function, $20\log|U_{out}/U_{in}|$, is plotted for $A_{21} = A_{22} = 0.1$ cm$^2$ and $l_{21} = l_{22} = 8$ cm in Figure 2-2. When the two cross-sectional areas, $A_{21}$ and $A_{22}$, and the two lengths, $l_{21}$ and $l_{22}$, are equal, the transfer function has no interference from the side channels and appears to be that of a simple, lossless tube of length $l_{tot} = l_{21} = l_{22} = 8$ cm. Varying the cross-sectional areas of the tubes will not affect the locations of the resonance peaks or valleys when the lengths of the side channels are equal, as Equations 2.8 and 2.9 suggest. Some zeros are cancelled by poles at the same location when the ratio of the lengths is an integer.
In this case, where the lengths are equal, the zeros created are cancelled by the additional poles and the system appears as a single tube of length 8 cm. As expected, the number of poles of the system up to 6 kHz is 3, using the rough approximation of the number of poles described in Section 2.1.2 with a single tube of length $l = 8$ cm.

Side Channels of Different Lengths

The transfer function, $20\log|U_{out}/U_{in}|$, is plotted for $A_{21} = 0.3$ cm$^2$, $A_{22} = 0.2$ cm$^2$ and $l_{21} = 8.5$ cm, $l_{22} = 7.5$ cm in Figure 2-3. The 1 cm difference between side channel lengths produces two pole-zero pairs, the first at approximately 2 kHz and the second at 4 kHz. The first zero location seems to be at a frequency corresponding to a single wavelength of the total length of the side channels (in this case 16 cm).

Figure 2-3: Transfer function for model of side channels, $20\log|U_{out}/U_{in}|$, plotted against frequency (0 to 6000 Hz) for $A_{21} = 0.3$ cm$^2$, $A_{22} = 0.2$ cm$^2$ and $l_{21} = 8.5$ cm, $l_{22} = 7.5$ cm

The transfer function, $20\log|U_{out}/U_{in}|$, is plotted for $A_{21} = A_{22} = 0.3$ cm$^2$ and $l_{21} = 4$ cm, $l_{22} = 12$ cm in Figure 2-4. As the difference between the two side channels increases, the variation from a simple all-pole system is greater. Although the total number of poles present up to 6 kHz is not as expected, bunching may be occurring at higher frequencies and some cancellation of poles and zeros occurs. Since the ratio of the lengths is an integer and the cross-sectional areas are equal, some of the zeros are cancelled by additional poles. A pole-zero pair does appear with the zero at approximately 2100-2300 Hz. Varying the lengths of the side channels greatly alters the shape of the transfer function.

In Figure 2-5, the transfer function, $20\log|U_{out}/U_{in}|$, is plotted for $A_{21} = 0.3$ cm$^2$, $A_{22} = 0.1$ cm$^2$ and $l_{21} = 4$ cm, $l_{22} = 12$ cm. The total length and length ratios are the same as in Figure 2-4, but the ratio of the cross-sectional areas is different.
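The zero condition of Equation 2.8 and the transfer function of Equation 2.7 can be explored numerically. The sketch below is an illustration only: it evaluates Equation 2.7 for the lossless side-channel model and locates zeros by scanning the numerator for sign changes and refining them by bisection. The function names, the 10 Hz scan step, and the search band are assumptions made for this example, not part of the original analysis.

```python
import math

C_SOUND = 35400.0  # cm/s, velocity of sound quoted in Section 2.1.1


def wavenumber(f_hz, c=C_SOUND):
    return 2.0 * math.pi * f_hz / c


def numerator(f_hz, a21, l21, a22, l22):
    """Numerator of Equation 2.7; its roots are the zeros (Equation 2.8)."""
    k = wavenumber(f_hz)
    return a21 * math.sin(k * l22) + a22 * math.sin(k * l21)


def denominator(f_hz, a21, l21, a22, l22):
    """Denominator of Equation 2.7; its roots are the poles (Equation 2.9)."""
    k = wavenumber(f_hz)
    return (a21 * math.cos(k * l21) * math.sin(k * l22)
            + a22 * math.sin(k * l21) * math.cos(k * l22))


def transfer(f_hz, a21, l21, a22, l22):
    """U_out / U_in of the lossless side-channel model (Equation 2.7)."""
    return (numerator(f_hz, a21, l21, a22, l22)
            / denominator(f_hz, a21, l21, a22, l22))


def bisect_root(func, lo, hi, tol=1e-6):
    """Refine a root of func bracketed by a sign change on [lo, hi]."""
    f_lo = func(lo)
    if f_lo * func(hi) > 0:
        raise ValueError("root is not bracketed")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = func(mid)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)


def roots_in_band(func, f_lo, f_hi, step=10.0):
    """Return roots of func found by scanning [f_lo, f_hi] for sign changes."""
    roots = []
    f = f_lo
    while f + step <= f_hi:
        if func(f) * func(f + step) < 0:
            roots.append(bisect_root(func, f, f + step))
        f += step
    return roots


# Zeros for channel areas 0.3 and 0.2 cm^2 and lengths 8.5 and 7.5 cm.
zeros = roots_in_band(lambda f: numerator(f, 0.3, 8.5, 0.2, 7.5), 100.0, 6000.0)
print(zeros)
```

Because the model is lossless, the transfer function is undefined exactly at a pole, so `transfer` should be sampled on a grid that avoids the denominator roots. Passing `denominator` instead of `numerator` to `roots_in_band` locates the poles in the same way.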
The change in area ratio produces a great change in the transfer function when the lengths of the side channels are also different, and an additional pole-zero pair appears in the 1000-2000 Hz range. The second, lower pole-zero pair could possibly appear due to the change in the ratio of the areas. Even with these four limited cases, the variability in the spectra with this vocal tract is extreme, from a simple all-pole system to a system with two pole-zero pairs.

Figure 2-4: Transfer function for model of side channels, $20\log|U_{out}/U_{in}|$, plotted against frequency for $A_{21} = A_{22} = 0.3$ cm$^2$ and $l_{21} = 4$ cm, $l_{22} = 12$ cm

Figure 2-5: Transfer function for model of side channels, $20\log|U_{out}/U_{in}|$, plotted against frequency for $A_{21} = 0.3$ cm$^2$, $A_{22} = 0.1$ cm$^2$ and $l_{21} = 4$ cm, $l_{22} = 12$ cm

To further demonstrate the variability possible with slight modifications of the vocal tract configuration, Figure 2-6 shows the transfer function for pairings of cross-sectional area ratios and channel lengths, with the sum of the side channel lengths held constant, $l_{21} + l_{22} = 16$ cm. These plots illustrate the extreme variability that can occur. An interesting observation is that all configurations produce a pole-zero pair in the 1.5-3 kHz range, corresponding approximately to one wavelength of a tube of length 16 cm.

Figure 2-6: Transfer function for model of side channels, $20\log|U_{out}/U_{in}|$, plotted against frequency (0 to 6000 Hz) for $A_{21}/A_{22}$ = 0.1 cm$^2$ to 0.5 cm$^2$ and $l_{21} + l_{22} = 16$ cm. Curves: $A_{22}/A_{21} = 0.33$, $l_{22}/l_{21} = 0.88$; $A_{22}/A_{21} = 0.2$, $l_{22}/l_{21} = 3$; $A_{22}/A_{21} = 0.5$, $l_{22}/l_{21} = 1.46$; $A_{22}/A_{21} = 0.25$, $l_{22}/l_{21} = 0.52$

2.3 Evaluation of the Entire Model

During lateral production, the vocal tract can be modeled in its entirety as two side channels coupled with a simple uniform tube on each end, as shown in Figure 2-7.

Figure 2-7: Model of the vocal tract during lateral consonant production
2.3.1 Equations

Solving Equations 2.3 and 2.4 for the input and output volume velocities and pressures, the equations for the additional two sections reduce to

$$\begin{pmatrix} U_{in_1} \\ P_{in_1} \end{pmatrix} = \begin{pmatrix} \cos(-kl_1) & -j\frac{A_1}{\rho c}\sin(-kl_1) \\ -j\frac{\rho c}{A_1}\sin(-kl_1) & \cos(-kl_1) \end{pmatrix} \begin{pmatrix} U_{out_1} \\ P_{out_1} \end{pmatrix} \qquad (2.10)$$

and

$$\begin{pmatrix} U_{in_3} \\ P_{in_3} \end{pmatrix} = \begin{pmatrix} \cos(-kl_3) & -j\frac{A_3}{\rho c}\sin(-kl_3) \\ -j\frac{\rho c}{A_3}\sin(-kl_3) & \cos(-kl_3) \end{pmatrix} \begin{pmatrix} U_{out_3} \\ P_{out_3} \end{pmatrix} \qquad (2.11)$$

where $l_1$ and $l_3$ are the lengths of each of the sections, and $A_1$ and $A_3$ are the constant cross-sectional areas of the front and back tubes.

2.3.2 Boundary Conditions

The transfer function, $U_{out}/U_{in}$, of the system is obtained by combining Equations 2.5, 2.6, 2.10 and 2.11. The boundary conditions for the system are determined for each of the sections. The input volume velocity, $U_{in_1}$, is the source volume velocity for the system. The output volume velocity of the left tube, $U_{out_1}$, is the sum of the two input volume velocities of the side channels, $U_{in_{21}}$ and $U_{in_{22}}$. The output pressure, $P_{out_1}$, is the input pressure of the side channels. The input volume velocity of the third section, $U_{in_3}$, is the sum of the output volume velocities of the side channels, $U_{out_{21}}$ and $U_{out_{22}}$. The output pressure of the two side channels is the same as the input pressure to the right section, $P_{in_3}$. The output pressure, $P_{out_3}$, is assumed to be zero because the end of the tube is open, and the output volume velocity, $U_{out_3}$, is the output volume velocity of the system, $U_{out}$.
The transfer function is

$$\frac{U_{out}}{U_{in}} = \frac{-A_3\left(A_{21}\sin(kl_{22}) + A_{22}\sin(kl_{21})\right)}{D} \qquad (2.12)$$

where the denominator $D$ is

$$\begin{aligned} D ={} & -2A_{21}A_{22}\cos(kl_1)\cos(kl_{21})\cos(kl_{22})\sin(kl_3) \\ & + A_1A_{21}\sin(kl_1)\sin(kl_{22})\cos(kl_{21})\sin(kl_3) \\ & + A_1A_3\sin(kl_1)\sin(kl_{22})\sin(kl_{21})\cos(kl_3) \\ & - A_{22}A_3\cos(kl_1)\cos(kl_{22})\sin(kl_{21})\cos(kl_3) \\ & + (A_{21}^2 + A_{22}^2)\cos(kl_1)\sin(kl_{21})\sin(kl_3)\sin(kl_{22}) \\ & + 2A_{21}A_{22}\cos(kl_1)\sin(kl_3) \\ & - A_{21}A_3\cos(kl_1)\cos(kl_{21})\cos(kl_3)\sin(kl_{22}) \\ & + A_1A_{22}\sin(kl_1)\cos(kl_{22})\sin(kl_3)\sin(kl_{21}) \end{aligned}$$

2.3.3 Solution for Specific Cases of Lengths and Areas

The transfer function, $20\log|U_{out}/U_{in}|$, is evaluated for various lengths and cross-sectional areas to determine the effects on the frequency spectra. The lengths and areas of the first and third sections are not varied. In all of the following figures, $A_3 = 5$ cm$^2$, $l_3 = 10$ cm, $A_1 = 2$ cm$^2$, and $l_1 = 1$ cm are kept constant and correspond roughly to measured values gathered by Narayanan et al. [13, 12]. The second formant in the model will not be as low as actually measured because all of the sections are assumed to have constant areas and, according to perturbation theory, squeezing the cross-sectional area of section 1 will cause the second resonance peak to shift lower [15]. Also, squeezing section 3 near the middle (to simulate the backed tongue body) will lower the second resonance.

As in the situation with only the side channels, this system will have a zero when the numerator of the transfer function is zero and a pole when the denominator is zero. The zeros of the system are determined by the same equation as for the simple model. Zeros will occur when

$$\sin(kl_{22}) = -\frac{A_{22}}{A_{21}}\sin(kl_{21}) \qquad (2.13)$$

The locations of the poles in this more complicated model are different from the pole locations for the simple model since the denominator is more complicated. In some cases when poles cancel zeros in the simple model, cancellation may not occur in the more complicated model.
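A closed form as long as Equation 2.12 is easy to mistype, so it can be useful to check it against a direct numerical solution of the section equations and boundary conditions of Sections 2.2 and 2.3. The sketch below is one such check, not the procedure used in the thesis: `transfer_chain` propagates a unit output volume velocity backwards through section 3, the two parallel side channels, and section 1, while `transfer_closed_form` evaluates Equation 2.12 term by term. The function names are hypothetical, and $\rho c$ is set to 1 because it cancels in the ratio $U_{out}/U_{in}$.

```python
import math

C_SOUND = 35400.0  # cm/s, velocity of sound quoted in Section 2.1.1


def transfer_chain(f_hz, a1, l1, a21, l21, a22, l22, a3, l3, c=C_SOUND):
    """U_out/U_in from the tube equations, working back from the open end.

    With rho*c = 1, the characteristic impedance of a tube of area A is 1/A.
    Not valid where sin(k*l21) or sin(k*l22) is exactly zero.
    """
    k = 2.0 * math.pi * f_hz / c
    u_out = 1.0
    # Section 3 with P_out3 = 0: pressure and flow at its input.
    p_b = 1j * math.sin(k * l3) / a3 * u_out
    u_b = math.cos(k * l3) * u_out
    channels = [(a21, l21), (a22, l22)]
    # Shared input pressure p_a follows from requiring the two channel
    # output flows to sum to u_b while both see output pressure p_b.
    y_sum = 0j
    yc_sum = 0j
    for area, length in channels:
        y = area / (1j * math.sin(k * length))
        y_sum += y
        yc_sum += y * math.cos(k * length)
    p_a = (u_b + p_b * yc_sum) / y_sum
    # Total flow into the two channels.
    u_a = 0j
    for area, length in channels:
        s, co = math.sin(k * length), math.cos(k * length)
        u_out_i = (p_a - co * p_b) * area / (1j * s)
        u_a += 1j * area * s * p_b + co * u_out_i
    # Section 1: source flow in terms of the pressure and flow at its output.
    u_in = 1j * a1 * math.sin(k * l1) * p_a + math.cos(k * l1) * u_a
    return u_out / u_in


def transfer_closed_form(f_hz, a1, l1, a21, l21, a22, l22, a3, l3, c=C_SOUND):
    """U_out/U_in from Equation 2.12, written term by term."""
    k = 2.0 * math.pi * f_hz / c
    s1, c1 = math.sin(k * l1), math.cos(k * l1)
    s21, c21 = math.sin(k * l21), math.cos(k * l21)
    s22, c22 = math.sin(k * l22), math.cos(k * l22)
    s3, c3 = math.sin(k * l3), math.cos(k * l3)
    num = -a3 * (a21 * s22 + a22 * s21)
    den = (-2.0 * a21 * a22 * c1 * c21 * c22 * s3
           + a1 * a21 * s1 * s22 * c21 * s3
           + a1 * a3 * s1 * s22 * s21 * c3
           - a22 * a3 * c1 * c22 * s21 * c3
           + (a21 ** 2 + a22 ** 2) * c1 * s21 * s3 * s22
           + 2.0 * a21 * a22 * c1 * s3
           - a21 * a3 * c1 * c21 * c3 * s22
           + a1 * a22 * s1 * c22 * s3 * s21)
    return num / den


# Section dimensions quoted in Section 2.3.3, with one side-channel pairing.
args = dict(a1=2.0, l1=1.0, a21=0.2, l21=11.5, a22=0.5, l22=4.5, a3=5.0, l3=10.0)
for f in (500.0, 1000.0, 2500.0):
    print(f, transfer_chain(f, **args), transfer_closed_form(f, **args))
```

A second sanity check is the degenerate case in which the two side channels have equal lengths and their areas sum to $A_1 = A_3$: the system is then a single uniform tube, and both functions should reduce to $1/\cos(k(l_1 + l_{21} + l_3))$.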
For a general idea of the number and location of the additional zeros, the zeros can be found from Equation 2.8 of the simple model.

Same Lengths

The transfer function, $20\log|U_{out}/U_{in}|$, is plotted in Figure 2-8 for $A_{21} = A_{22} = 0.2$ cm$^2$ and $l_{21} = l_{22} = 8$ cm. The system appears as an all-pole system, as expected. When the side channels are the same length, the effect on the transfer function is to appear as a single tube of the length of the side channels (not the sum). If the model accounted for losses, the area effects would be visible, but this model only locates peaks and valleys of the system. As expected, the number of poles of the system up to 6 kHz is 6, using the approximation described in Section 2.1.2.

Figure 2-8: Transfer function for model of vocal tract, $20\log|U_{out}/U_{in}|$, plotted against frequency for $A_{21} = A_{22} = 0.2$ cm$^2$ and $l_{21} = l_{22} = 8$ cm

Same Areas, Different Lengths

The transfer function, $20\log|U_{out}/U_{in}|$, is plotted in Figure 2-9 for $A_{21} = A_{22} = 0.2$ cm$^2$, $l_{21} = 10$ cm, and $l_{22} = 8$ cm. The different lengths of the side channels produce two pole-zero pairs, with one in the 1800-2200 Hz range. The number of poles is related to the total length of the system, and using the approximation described in Section 2.1.2 the number of poles is expected to be 9 up to 6 kHz. Assuming the additional length to the system is 8 cm (the average length of the two side branches), the expected number of zeros with the addition of the side branches is between two and three. The number of zeros with two side channels is not dependent only on the total length of the side branches but also on the interaction of the side channels described by Equation 2.8. Nevertheless, a rough approximation of the number of zeros can be determined from the additional length that the side channels add to the system.
Different Areas, Different Lengths

The transfer function, $20\log|U_{out}/U_{in}|$, is plotted in Figure 2-10 for $A_{21} = 0.2$ cm$^2$, $A_{22} = 0.5$ cm$^2$ and $l_{21} = 11.5$ cm, $l_{22} = 4.5$ cm. The difference in areas and lengths of the two side channels causes the transfer function to include three pole-zero pairs.

Figure 2-9: Transfer function for model of vocal tract, $20\log|U_{out}/U_{in}|$, plotted against frequency for $A_{21} = A_{22} = 0.2$ cm$^2$, $l_{21} = 10$ cm, and $l_{22} = 8$ cm

Figure 2-10: Transfer function for model of vocal tract, $20\log|U_{out}/U_{in}|$, plotted against frequency for $A_{21} = 0.2$ cm$^2$, $A_{22} = 0.5$ cm$^2$ and $l_{21} = 11.5$ cm, $l_{22} = 4.5$ cm

Figure 2-11: Transfer function for model of vocal tract, $20\log|U_{out}/U_{in}|$, plotted against frequency (0 to 6000 Hz) for various values of $A_{21}$ and $A_{22}$ and $l_{21} + l_{22} = 16$ cm. Curves: $A_{22}/A_{21} = 2.5$, $l_{22}/l_{21} = 3$; $A_{22}/A_{21} = 3$, $l_{22}/l_{21} = 0.88$; $A_{22}/A_{21} = 2$, $l_{22}/l_{21} = 0.52$

The transfer function, $20\log|U_{out}/U_{in}|$, is plotted in Figure 2-11 for various values of $A_{21}$ and $A_{22}$ and $l_{21} + l_{22} = 16$ cm. Even though the total length of the side channels, $l_{21} + l_{22}$, remains constant at 16 cm, the variation in the frequency spectra of the transfer function is huge. The length of the side channels is not necessarily 16 cm for all individuals, but the variation does give a general idea of what occurs with different side channel lengths and areas. Pole-zero pairs appear at approximately 1500-2000 Hz and continue to the high frequencies.

2.4 Justification of the Model

The model suggested here does not account for any acoustic losses, including changes in bandwidth or glottal source. The model does give a general idea of the location of resonance peaks due to the addition of the side channels.

2.4.1 Calculation of the Number of Poles and Zeros and Approximate Locations

The complete model will give rise to zeros at the locations specified by Equation 2.8.
Since the total number of poles and zeros cannot increase for the system, the transfer function, as a general approximation, will also show the formation of the same number of poles as zeros. The number of zeros expected up to a certain frequency, $f$, is $n_z = 2 l_s f/c$, where $l_s$ is the length added by the side channels, as described in Section 2.1.2. The exact locations of the poles are determined by the denominator of the transfer function. When the lengths are integer multiples of each other, pole and zero cancellations can be expected. The simpler model of the side channels in Figure 2-1 can be used to determine the possible locations of the zeros and the expected number of zeros.

2.5 Limitations of Model

2.5.1 Assumptions

Lengths

The total length of the side branches is held constant at 16 cm, which is greater than the data gathered by Narayanan et al. suggest [12, 13]. However, the examination of natural utterances suggests the formation of a pole-zero pair around the third formant. Shortening the side channel lengths increases the frequency of the first zero above the third formant.

Constant Areas

The assumption of constant areas for the model is not realistic, but it simplifies the calculations and allows some generalizations to be made about the expected effects.

Formant Locations

The second formant calculated by this model does not account for the backed tongue body position, and F2 is not as low as observed. Perturbation theory applied to the back cavity accounts for the lower F2 observed and indicates why the simple model does not exhibit this behavior.

2.6 Conclusions

Although the total length of the side channels is kept constant, the locations of the resonance peaks and valleys are not constant. Small changes in the lengths and cross-sectional areas cause huge variations in the spectra. If the key perceptual cue for the lateral consonant were the location of the peaks, then speakers would have to maintain a certain configuration of the vocal tract that is not very stable.
Since similar configurations create extremely different effects, the key perceptual cue cannot be the exact location of the resonance peaks and valleys of the spectrum. Instead, the cues must be more subtle, and possibly the addition of the poles and zeros assists in creating the cues. The constant attributes across the various lengths and cross-sectional areas are low first and second formant resonances and the addition of multiple pole-zero pairs beginning at approximately 1500 Hz.

Chapter 3

Measurements

3.1 Purpose of Measurements - What Makes a /l/?

The lateral consonant is extremely variable depending on the phonetic context and the speaker. Modeling of the vocal tract as a configuration with side channels suggests that the resonance frequencies of the system are susceptible to subtle changes in the cross-sectional areas and lengths of the side channels. If the key perceptual cue cannot be the exact location of the resonance peaks and valleys, then what is the cue used by listeners to identify the lateral? Previous research suggests that other changes in the system occur during the lateral production, including changes in the glottal source [1, 2], bandwidth changes, and the addition of pole-zero pairs. Measurements are made on lateral consonants produced by various speakers in order to better understand the changes occurring during the lateral production and to determine what might be potential perceptual cues used to discriminate the lateral consonant from other sounds.

3.2 Method

All recordings were made with native English speakers with normal hearing. The speakers were recorded in the sound room at the MIT RLE Speech Group lab. The recordings were made onto audio tape or onto DAT tapes. Utterances for male speakers were digitized at a sampling rate of 10 kHz and low-pass filtered at 4.8 kHz, while utterances for female speakers were digitized at a sampling rate of 13 kHz and low-pass filtered at 6.2 kHz.
One set of utterances for a male speaker and a female speaker was directly digitized from DAT tapes and downsampled to 10 kHz and 13 kHz respectively using the Sound Design program.

Singleton

The singleton lateral utterances were recorded for six speakers, three female and three male. Three repetitions of six pre-vocalic /l/ utterances were recorded in isolation, with an extra word at the end of the list to prevent intonation variations. The speakers were instructed to say each word in a normal tone and to maintain a constant intonation for all words. The word list is given in Table 3.1.

    leap   loot   let   lap   law   luck

Table 3.1: Singleton utterances

The lateral consonant is released into vowels at the four extremes of the possible high and low tongue body configurations and into two lax vowels at intermediate tongue-body heights. All speakers of the singleton utterances were also recorded for the cluster utterances.

Cluster

The cluster utterances were recorded for eight speakers, four female and four male. Two repetitions of the cluster words spoken in the phrase "Say word again" were recorded. The speakers were instructed to say the list at a comfortable level and pace. The word list is given in Table 3.2. The stop consonants were combined with /r/ and /l/ in combinations that occur in English and released into the vowels /i/ and /a/. Additionally, singleton stop consonants and /r/ were used for utterances releasing into the two vowels.

    beat     bought   reed     rod
    breed    broad    bleed    block
    clean    clod     keep     cop
    deep     dot      crete    craw
    geese    got      dream    drop
    green    grog     glean    glop
    plead    plod     peat     pot
    team     top      preach   prod
                      treat    trod

Table 3.2: Cluster utterances

3.3 Measurements

Windows, Placements

All measurements of the waveforms were made using a 6.4 ms Hamming window. The release of the liquid consonants was determined by examining the speech waveform and the spectra for the change in formants (particularly the second formant resonance peak) and voicing.
Pre-Emphasis

Pre-emphasis on the spectra was used to ensure a more accurate measurement of the first formant peak, since the first formant frequency is very low for the lateral consonant. The first harmonic interferes with the measurements, and F1 sometimes appears to be lower than 300 Hz even with pre-emphasis. Figure 3-1 plots the attenuation in dB of the pre-emphasis filter for a 13 kHz sampling rate. The effect can be roughly thought of as a 6 dB/octave slope up to about 3 kHz. The pre-emphasis effect is removed from measurements of the spectrum amplitudes.

Figure 3-1: The effect of the pre-emphasis on the spectra

3.3.1 Singleton

Measurements of singleton utterances were made at two points in time using a method similar to that described by Stevens and Blumstein [16]. Measurements of the first three formants and their amplitudes were taken 20 ms prior to the release of the lateral and 20 ms after the release into the vowel. A series of spectra obtained with the 6.4 ms Hamming window were averaged over a 12 ms interval, and therefore included at least one full glottal period. Use of this averaging technique was convenient since it did not require careful placement of the window at the beginning of each glottal period. An example spectrogram of a singleton utterance is shown in Figure 3-2.

Figure 3-2: Spectrogram of luck for male speaker (256-point DFT, 6.4 ms Hamming window every 1 ms; time axis 0 to 1000 ms)

3.3.2 Cluster

Several measurements on utterances with clusters were made at different points in the waveform, although not all of them are reported here. A 6.4 ms Hamming window was centered on the initial burst produced by the stop consonant, and the frequencies and amplitudes of the low-frequency peak and the high-frequency peak (greater than 2.5 kHz for men and 3 kHz for women) were measured.
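The averaging procedure described in Section 3.3.1 can be sketched as follows. This is a minimal illustration and not the analysis software used for the measurements: it assumes that the short-time magnitude spectra (rather than dB or power spectra) are averaged, that the frames are spaced 1 ms apart, and that a 256-point DFT is used; none of these details is stated for the averaging step itself. The function names are hypothetical, and the DFT is written out with the standard library only so that the block is self-contained.

```python
import cmath
import math


def hamming(n):
    """Symmetric Hamming window of n samples."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def dft_magnitude(frame, n_fft):
    """Magnitudes of DFT bins 0..n_fft/2 of a zero-padded frame."""
    mags = []
    for b in range(n_fft // 2 + 1):
        acc = 0j
        for n, x in enumerate(frame):
            acc += x * cmath.exp(-2j * math.pi * b * n / n_fft)
        mags.append(abs(acc))
    return mags


def averaged_spectrum(samples, fs_hz, center_s, window_s=0.0064,
                      span_s=0.012, hop_s=0.001, n_fft=256):
    """Average the magnitude spectra of windowed frames centered on center_s.

    Frame centers are spaced hop_s apart and cover a total of span_s.
    """
    win_len = int(round(window_s * fs_hz))
    window = hamming(win_len)
    hop = int(round(hop_s * fs_hz))
    n_frames = int(round(span_s / hop_s)) + 1
    first_center = int(round(center_s * fs_hz)) - (n_frames // 2) * hop
    total = [0.0] * (n_fft // 2 + 1)
    for i in range(n_frames):
        start = first_center + i * hop - win_len // 2
        frame = [samples[start + n] * window[n] for n in range(win_len)]
        mags = dft_magnitude(frame, n_fft)
        total = [t + m for t, m in zip(total, mags)]
    return [t / n_frames for t in total]


# Synthetic check signal: a 1 kHz sinusoid sampled at 10 kHz for 100 ms.
fs = 10000.0
signal = [math.sin(2.0 * math.pi * 1000.0 * n / fs) for n in range(1000)]
spectrum = averaged_spectrum(signal, fs, center_s=0.05)
peak_bin = max(range(len(spectrum)), key=lambda b: spectrum[b])
print(peak_bin * fs / 256)
```

Because the 12 ms span covers at least one glottal period at typical fundamental frequencies, the averaged spectrum changes little when `center_s` is shifted by a few milliseconds, which is the convenience the text describes.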
If the stop consonant was voiced, additional measurements of the frequencies and amplitudes of the low- and high-frequency peaks were made 20 ms after the burst and 20 ms prior to voicing onset with the Hamming window, averaging for 12 ms. Measurements of the first three formant peaks and amplitudes were taken 20 ms after the release of the liquid consonant or 20 ms after voicing onset (20 ms into the vowel if there was no liquid consonant in the utterance) for voiced and voiceless consonants, using the 6.4 ms Hamming window and averaging for 12 ms. If the liquid consonant was sustained for more than 20 ms by the speaker, an additional measurement was made 20 ms prior to the release of the liquid, using the 6.4 ms Hamming window and averaging for 12 ms. The spectrogram of a voiced labial cluster utterance, bleed, for a male speaker is shown in Figure 3-3, and the spectrogram of a voiceless labial cluster utterance, plead, is shown in Figure 3-4. The lateral is not sustained in the voiceless consonant cluster: it appears that the lateral is being released as voicing onset occurs.

Figure 3-3: Spectrogram of voiced cluster utterance, bleed, for male speaker (256-point DFT, 6.4 ms Hamming window every 1 ms; time axis 0 to 1000 ms)

Figure 3-4: Spectrogram of voiceless cluster utterance, plead, for male speaker (256-point DFT, 6.4 ms Hamming window every 1 ms; time axis 0 to 1000 ms)

3.4 Singleton Analysis

Figures 3-5 and 3-6 show the changes in frequencies and amplitudes of the first three formants from the time 20 ms before the release to 20 ms after the release of the consonant. Clearly, the vowel has higher frequencies and amplitudes than the lateral consonant. The changes in frequencies depend on the vowel that follows. Although the change in F1 is small for both leap and loot, which have high vowels, these two words have the greatest average change in amplitude, A1. This change in amplitude is much greater than what would be expected based on a small change in F1. With greater changes in amplitudes occurring than what is expected based on the shifts in formant frequencies, other changes in the source must occur. Possibly, changes in the formant bandwidths and the glottal source cause a more abrupt transition and enhance the effects resulting from changes in frequency. This enhancement provides strong perceptual cues that distinguish the lateral consonant from, for example, the glide /w/.

Figure 3-5: Changes in amplitudes of formants from lateral to vowel for singleton utterances (leap, let, lap, law, luck, loot): error bars represent standard deviation of data

    word   ΔA1    ΔA2    ΔA3    ΔF1     ΔF2     sd a1  sd a2  sd a3  sd f1  sd f2
    leap    9.4    9.0   19.5    15.9   875.9   2.0    3.2    2.0    14.4   108
    let     6.4   13.8   17.0   237.1   461.2   3.8    3.8    7      13.8   105
    lap     3.9   15.7   20.3   213.9   321.1   2.0    3.0    2.2    29.4   47.8
    law     3.7   10.3    8.2   318.7   358.9   4.2    4.4    3.8    49.6   51
    luck    4.8   13.5   14.2   248.8   119.6   3.6    4.2    3.2    40.6   50.8
    loot   12.7   11.9   15.4    36.1   272.8   3.2    5.0    4.6    30.2   60.6

Table 3.3: Δ's of formant frequencies and amplitudes for singleton utterances; amplitude changes are in dB, frequencies are in Hz, and standard deviations (sd) across speakers and repetitions are given in the right half of the table

Figure 3-6: Changes in formant frequencies from lateral to vowel for singleton utterances (leap, let, lap, law, luck, loot): error bars represent standard deviation of data
3.5 Analysis of Utterances with Consonant Clusters

Figures 3-7 and 3-8 show the changes in frequencies and amplitudes of the first three formants from the point 20 ms before the release to 20 ms after the release of the consonant. The data include the voiced stop consonant clusters and those voiceless stop consonant clusters in which the liquid has a duration of at least 20 ms. The lateral consonant has greater changes in both amplitudes and frequencies than the retroflex consonant for both vowels, especially changes in the amplitude of the third formant. The change in F1 is greater for /la/ than for /li/, but the change in F2 is greater for /li/ than for /la/.

Figure 3-7: Changes in amplitudes of formants (delta A1, delta A2, delta A3) from liquid to vowel for voiced and some voiceless stop cluster utterances: error bars represent standard deviation of data

Figure 3-8: Changes in formant frequencies from liquid to vowel for voiced and some voiceless stop cluster utterances: error bars represent standard deviation of data

    word   ΔA1   ΔA2   ΔA3    ΔF1     ΔF2     sd a1  sd a2  sd a3  sd f1  sd f2
    la     0.8   7.5    8.8   242.4   149.7   3.4    9.4     8.2   167    255
    li     3.9   2.6   12.0    -0.4   684.7   3.8    7.0    11.6   37.0   558
    ra     0.6   3.9    4.8   114.4    41.8   5.8    6.8    14.8   182    148
    ri     0.0   1.5    7.2    -3.3   490.4   5.2    9.8    14.0   53.2   274

Table 3.4: Δ's of formant frequencies and amplitudes for voiced and some voiceless stop cluster utterances; amplitude changes are in dB, frequencies are in Hz, and standard deviations (sd) across speakers and repetitions are given in the right half of the table

3.6 Comparisons of Environments

Figure 3-9: Comparison of Δ's of amplitudes of formants (delta A1, delta A2, delta A3) from lateral to vowel for singleton and cluster utterances: error bars represent standard deviation of data

Figures 3-9 and 3-10 compare the data for the laterals as singletons and in clusters.
The figures show the changes in frequencies and amplitudes of the first three formants between the point 20 ms prior to the release of the lateral consonant and the point 20 ms into the vowel. The data used for the comparisons of cluster and singleton lateral consonant changes are from the same six speakers, three female and three male. The changes in amplitudes and frequencies for the singleton lateral consonant are generally greater than the changes in amplitudes and frequencies for the cluster lateral consonants.

Figure 3-10: Comparison of Δ's of formant frequencies (delta F1, delta F2) from lateral to vowel for singleton and cluster utterances (cluster /a/, single /a/, cluster /i/, single /i/): error bars represent standard deviation of data

    measurement   cluster /a/   single /a/   cluster /i/   single /i/
    ΔA1             0.8           3.7           3.9           9.4
    ΔA2             7.5          10.3           2.6           9.0
    ΔA3             8.8           8.2          12.0          19.5
    ΔF1           242           319            -0.4          15.9
    ΔF2           150           359           685           876
    sd a1           1.6           2.0           1.8           4.2
    sd a2           4.8           3.0           3.4           4.4
    sd a3           4.0           2.2           5.8           3.8
    sd f1          88.2          29.2          18.6          49.4
    sd f2         128            47.8         229            51

Table 3.5: Comparison of singleton and cluster utterances for same speakers; amplitude changes are in dB, frequencies are in Hz, and standard deviations (sd) across speakers and repetitions are given in the lower half of the table

Figure 3-11: Measured values of formant frequencies during lateral consonant: error bars represent standard deviation of data

                  F1     F2     F3
    cluster /a/   482   1022   2694
    single /a/    397   1019   2885
    cluster /i/   398   1244   2450
    single /i/    398   1310   2722

Table 3.6: Measured formant frequency values in Hz of the lateral for singleton and cluster utterances

The averaged measured values of the formant frequencies during the lateral are plotted against the changes in amplitudes in Figure 3-11 for singleton and cluster utterances. The measurements used to determine the formant frequencies were made 20 ms prior to the release of the lateral. The values are also listed in Table 3.6.
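The comparison between environments can be checked directly from the mean values in Table 3.5. The short sketch below is only a bookkeeping aid: it copies the mean changes from the table and lists every measure for which the singleton change does not exceed the cluster change. The dictionary layout and function name are illustrative.

```python
# Mean changes copied from Table 3.5 as (cluster, single) pairs.
# Amplitudes are in dB and frequencies are in Hz.
TABLE_3_5 = {
    "A1": {"/a/": (0.8, 3.7), "/i/": (3.9, 9.4)},
    "A2": {"/a/": (7.5, 10.3), "/i/": (2.6, 9.0)},
    "A3": {"/a/": (8.8, 8.2), "/i/": (12.0, 19.5)},
    "F1": {"/a/": (242.0, 319.0), "/i/": (-0.4, 15.9)},
    "F2": {"/a/": (150.0, 359.0), "/i/": (685.0, 876.0)},
}


def exceptions(table):
    """Return (measure, vowel) pairs where single <= cluster."""
    found = []
    for measure, by_vowel in table.items():
        for vowel, (cluster, single) in by_vowel.items():
            if single <= cluster:
                found.append((measure, vowel))
    return found


print(exceptions(TABLE_3_5))
```

This comparison uses the means only; it does not take the standard deviations in the lower half of the table into account.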
3.7 Conclusions

3.7.1 Abruptness

The lateral is produced with a relatively open vocal tract configuration (compared to other consonants). Possibly, the lateral must increase the abruptness of the release into the vowel when it is a singleton to enhance the consonantal quality. The changes in the amplitudes of the formants are greater for singleton laterals than for cluster laterals for nearly all formants; the one exception in Table 3.5 is A3 before /a/, where the two values differ by less than one standard deviation. The changes in formant frequencies are also greater for singleton laterals than for cluster laterals. The total changes in formant frequencies depend on the vowel into which the lateral is released. It appears that when the change in formant frequency is small, the change in amplitude of that formant increases. The lateral consonant creates abruptness. The change in frequency is limited by the following vowel. When the formant frequency transition is minimal, the amplitude change is increased by some alteration of the glottal source and an increase of acoustic losses to create the consonantal quality. Cluster laterals do not necessarily need to enhance the consonantal quality, since the stop consonant already creates abruptness in the spectra.

Chapter 4

Synthesis

This chapter describes the synthesis of the lateral using the model and the measured values as a guide. Previous attempts at synthesis have not incorporated the observed phenomena of bandwidth changes, additional spectral tilt, or pole-zero pairs, using instead only formant transitions and some alteration of the amplitude of voicing [11]. The method of synthesis is described for a lateral using these additional parameters in conjunction with formant transitions. These synthesized words are then used in perceptual experiments to determine the importance of the various synthesis parameters. Incorporating these changes makes natural synthesis of laterals possible despite the extreme spectral variability observed and expected.
4.1 Abruptness/Consonantal Quality

4.1.1 Changes in Frequency

The lateral has a low F1 and F2, which transition rapidly at the release into the vowel. As previous attempts at synthesis have shown, these transitions are essential for the accurate synthesis of a natural lateral. In singleton laterals, the transitions alone do not create enough abruptness, and changes in the voicing amplitude must be added.

4.1.2 Changes in Amplitude

The model and the gathered data for the lateral exhibit variability in the amplitudes and in the positioning of poles and zeros, making it nearly impossible to make any generalizations. By switching the focus from particular locations of spectral minima or maxima to the general effects occurring during the lateral, a different approach to synthesis can be developed. Since the American English lateral has no minimal contrasting pair, more latitude in frequency characteristics is possible. Past research has shown that changes in the glottal source occur during the lateral consonant [2, 1]. The simple lossless model proposed in Chapter 2 suggests the formation of a pole-zero pair, but the location is variable depending on the lengths and cross-sectional areas of the side channels. Modeling and measured values suggest that glottal source changes are manifested in changes in amplitude of voicing (AV), spectral tilt (TL), bandwidth changes (BW), and a pole and zero. Changes are expected in the glottal source, but the exact mechanisms of the additional loss are not necessarily important for each individual utterance.

4.2 Method

The synthesis of the lateral utterances was performed using the Klatt synthesizer. Six utterances, luck, let, lap, law, leap, and loot, spoken by a male speaker in isolation were used as the basis of the synthesis. The final stop consonant is spliced off (except for law, which is a consonant-vowel (CV) utterance) and only the /l/ and vowel are synthesized.
The final version of the utterance is the synthesized utterance concatenated with the stop consonant. For each utterance, two synthesized versions were created for evaluation. The first version consisted of only the appropriate formant locations and transitions, with constant parameters for voicing quality. Changes in amplitude, changes in spectral tilt, bandwidth changes, and an additional pole and zero are added to the formant transitions to produce the second synthesized version. The synthesis method, including the time-varying parameters used for the word loot, is detailed below. The spectrograms of the three versions of the word (the natural utterance, the first synthesized version including formant transitions only, and the second synthesized version including additional acoustic losses and formant transitions) are shown in Figures 4-1, 4-2, and 4-3.

Figure 4-1: Spectrogram of natural utterance loot for male speaker
Figure 4-2: Spectrogram of first synthesized utterance of loot for male speaker
Figure 4-3: Spectrogram of second synthesized utterance of loot for male speaker
(Spectrograms: 256-pt DFT, 6.4-ms Hamming window every 1 ms; time axis in ms.)

Figure 4-4: Spectra of natural utterance during /l/ using a 6.4 ms Hamming window
Figure 4-5: Spectra of first synthesized utterance during /l/ using a 6.4 ms Hamming window
Figure 4-6: Spectra of second synthesized utterance during /l/ using a 6.4 ms Hamming window
The original utterance was evaluated using a program (lspecto) that estimates the formant locations and amplitude of voicing. Using these values as a template, the first synthesized version was created. The correct formant frequencies were determined by comparing the natural utterance with the synthesized version using a 6.4 ms Hamming window, averaging over 14 ms and placing the window for measurements every 15 ms. Example spectra from this measurement technique are shown in Figures 4-4, 4-5, and 4-6 for the three versions of the utterance. Additionally, a 25.6 ms Hamming window placed every 20 ms was used to match the formant peaks with the correct harmonics during the lateral and vowel. This process allowed the formant frequency peaks to be matched after several iterations. For the first synthesized version, the other parameters of the synthesizer, such as the open quotient (OQ) and spectral tilt (TL), were set as constants chosen to maximize the naturalness of the utterance for the speaker. The final formant transitions used for both versions of the synthesized utterances are plotted in Figure 4-7. The amplitude of voicing parameter and amplitude of aspiration are held constant throughout the utterance, with the aspiration amplitude 15 dB below the amplitude of voicing. The constant parameters used in the first synthesized version to obtain appropriate voicing quality are listed in Table 4.1.

Figure 4-7: Formant trajectories for the synthesized word loot

Parameter  GV  GH  OQ  TL  B1  B2   B3   B4   B5
Value      64  64  55  0   45  200  350  125  150

Table 4.1: Constant synthesis parameters of utterance incorporating formant frequency transitions

The second synthesized version of the utterance is created by adding a pole-zero pair, changing bandwidths, and altering the spectral tilt. Gathered data suggest the AV parameter increases as the /l/ is released.
The observed voicing changes are due to changes in the AV parameter in addition to other mechanisms of acoustic loss, including changes in bandwidths and the formation of pole-zero pairs. Changes in the AV parameter are added first, using the extrapolated voicing values from the original utterance. The amplitude of aspiration is also varied to match the voicing changes and to maintain the 15 dB difference for the appropriate voicing quality. The values of the voicing parameters, including the spectral tilt, which is discussed below, are shown in Figure 4-8. A pole and zero are then added during the lateral, and these combine rapidly at the release into the vowel. Modeling suggests that the pole and zero, while occurring in a pair, do not always occur in the same order. By examining the spectra of the natural utterance, the order of the pole and zero is determined, as well as the bandwidths. The locations of the pole and zero over time are plotted in Figure 4-9. Following the pole and zero placement, the bandwidths of the formants and the spectral tilt are adjusted. If a zero-pole pair occurs (zero below pole), the spectral effect at high frequencies is an overall increase in amplitude, whereas a pole-zero pair (pole below zero) causes an attenuation of the higher frequencies. When a zero-pole pair occurs, the entire high-frequency spectrum is raised, but by increasing the spectral tilt, the appropriate bandwidths and prominences can be matched. When a pole-zero pair occurs, less spectral tilt needs to be added, since the overall effect of the pair in that order is to decrease the amplitude of the higher frequencies. From the ordering of the pair, the spectral tilt and bandwidths are determined by matching the natural utterance with the synthesized utterance. The varying bandwidths of the formant frequencies are shown in Figure 4-10. In this case, the bandwidth of the second formant is actually increased during the vowel to match the natural utterance.
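The effect of pair ordering on the high-frequency level described above can be illustrated with a short calculation. The sketch below is not the Klatt synthesizer implementation; it simply multiplies one conjugate pole pair by one conjugate zero pair, each normalized to 0 dB at 0 Hz. The function name, the 100 Hz bandwidths, and the 1500 Hz and 1800 Hz frequencies are illustrative assumptions, not values measured in this thesis.

```python
import math


def pair_gain_db(f_hz, pole_hz, zero_hz, pole_bw_hz=100.0, zero_bw_hz=100.0):
    """Gain in dB at f_hz of one conjugate pole pair times one conjugate zero pair.

    Both factors are normalized to 0 dB at 0 Hz, so the result isolates the
    effect of the pair on the rest of the spectrum. All argument values used
    in the examples are hypothetical.
    """
    s = 1j * 2 * math.pi * f_hz

    def root(freq_hz, bw_hz):
        # s-plane root: real part set by the bandwidth, imaginary part by the frequency
        return complex(-math.pi * bw_hz, 2 * math.pi * freq_hz)

    p = root(pole_hz, pole_bw_hz)
    z = root(zero_hz, zero_bw_hz)
    pole_term = (p * p.conjugate()) / ((s - p) * (s - p.conjugate()))
    zero_term = ((s - z) * (s - z.conjugate())) / (z * z.conjugate())
    return 20 * math.log10(abs(pole_term * zero_term))


# Pole below zero (pole-zero order): level well above the pair is lowered.
pole_zero_db = pair_gain_db(4000.0, pole_hz=1500.0, zero_hz=1800.0)
# Zero below pole (zero-pole order): level well above the pair is raised.
zero_pole_db = pair_gain_db(4000.0, pole_hz=1800.0, zero_hz=1500.0)
```

Well above both frequencies the gain tends toward 20 log10 of (pole frequency / zero frequency) squared, which is negative when the pole lies below the zero and positive otherwise. This is consistent with the need for more added spectral tilt in the zero-pole case.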
During the lateral, the second formant had a small bandwidth and a more prominent peak than exhibited in other utterances. The constant parameters for the second synthesized utterance are shown in Table 4.2.

Figure 4-8: Time varying voicing changes in synthesized utterance (AV, TL, and AH plotted against time in ms)
Figure 4-9: Time varying additional pole and zero in synthesized utterance (plotted against time in ms)
Figure 4-10: Time varying bandwidth changes in synthesized utterance (plotted against time in ms)

Parameter  GV  GH  OQ  B5
Value      64  64  55  100

Table 4.2: Constant synthesis parameters of utterance incorporating voicing changes

Figure 4-11: Spectra of natural utterance during /l/ using 25.6 ms Hamming window (KLSPEC93 readout: F0 = 125 Hz, RMS = 54 dB; peaks in Hz/dB: 253/53, 990/44, 2734/37, 3272/34, 3984/35)
Figure 4-12: Spectra of first synthesized utterance during /l/ using 25.6 ms Hamming window (KLSPEC93 readout: F0 = 125 Hz, RMS = 57 dB; peaks in Hz/dB: 339/55, 3261/48, 3958/41)
Figure 4-13: Spectra of second synthesized utterance during /l/ using 25.6 ms Hamming window (KLSPEC93 readout: F0 = 123 Hz, RMS = 54 dB; peaks in Hz/dB: 371/53, 938/43, 2793/36, 3209/34, 4009/34)

Spectra for the three versions during the lateral are shown in Figures 4-11, 4-12, and 4-13 using a 25.6 ms Hamming window with no averaging.
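The two window lengths used in this chapter trade time resolution for frequency resolution, which is why the 25.6 ms window is the one used to line formant peaks up with harmonics. The sketch below estimates that trade-off under stated assumptions: a 10 kHz sampling rate (suggested by the 256-point DFT, but not stated explicitly), the usual approximation that a Hamming window has a null-to-null main lobe of about 4/T Hz, and a rule of thumb that harmonics separate when their spacing exceeds half that width. The function names are hypothetical.

```python
ASSUMED_SAMPLE_RATE_HZ = 10000  # assumption: not stated explicitly in the text


def window_samples(window_ms, sample_rate_hz=ASSUMED_SAMPLE_RATE_HZ):
    """Number of samples spanned by a window of the given duration."""
    return round(window_ms * sample_rate_hz / 1000.0)


def hamming_mainlobe_hz(window_ms):
    """Approximate null-to-null main-lobe width of a Hamming window, in Hz."""
    return 4.0 / (window_ms / 1000.0)


def resolves_harmonics(window_ms, f0_hz):
    """Rule of thumb: harmonics separate if F0 exceeds half the main-lobe width."""
    return f0_hz > hamming_mainlobe_hz(window_ms) / 2.0


for window_ms in (6.4, 25.6):
    print(window_ms,
          window_samples(window_ms),
          round(hamming_mainlobe_hz(window_ms), 2),
          resolves_harmonics(window_ms, f0_hz=125.0))
```

With the F0 of about 125 Hz reported for these utterances, the 6.4 ms window smooths over individual harmonics and shows the formant envelope, while the 25.6 ms window is long enough to show the harmonics themselves.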
The minimum occurring at approximately 1800 Hz is apparent in the spectra of the natural utterance and the second synthesized utterance. The first synthesized utterance instead has a level about 20 dB greater in this frequency range. Also, a better match in the higher-frequency bandwidths and relative amplitudes is obtained in the second synthesized version of the utterance. To obtain approximately the correct high-frequency amplitudes in the first synthesized utterance, the bandwidths are increased, thereby decreasing the prominence of the spectral peaks in those areas.

Chapter 5

Results of Perceptual Testing

This chapter presents the results of the perceptual tests, which rate the naturalness of the synthesized utterances in relation to the natural spoken utterance and with respect to each other. Once the synthesis work is performed, the new method of including additional parameters must be evaluated. Two sets of perceptual experiments were run to evaluate different synthesized utterances. The initial perceptual experiments guided the second set of experiments to evaluate more accurately the naturalness of the utterances.

5.1 First Perceptual Experiments

The first set of perceptual experiments was performed with an initial set of synthesized words. In the initial synthesis work, natural utterances spoken between two other words were used to guide the selection of parameters. The synthesized words were presented to listeners in isolation. A disadvantage of this approach was that the synthesized utterances were shorter than what one would expect for isolated words. This work also attempted to separate the effects of all of the individual components contributing to the changes in voicing, including the pole-zero pair, the changes in bandwidths, and the addition of varying spectral tilt. Subjects were asked to rate the naturalness of utterances presented individually on a continuous scale of 0 to 1.
The results were not conclusive except that all the utterances were relatively natural. Since no discrimination between data points could be statistically shown, the data were instead used to guide the second set of more sensitive perceptual experiments. Three issues were identified as needing to be improved for the second set of experiments. To increase the duration of the utterances to be evaluated, additional utterances were recorded in which the speakers produced the lateral words in isolation instead of within a phrase as before. The method of testing was also changed to involve a comparison of two utterances instead of evaluating each utterance in isolation. Additionally, since no significant difference was apparent in the results of testing the various parameters contributing to voicing changes individually, the parameters were grouped as a whole for evaluation.

5.2 Final testing

The naturalness of the synthesized utterances was evaluated using the three versions of the six utterances: the natural spoken utterance, the first synthesized version including only formant frequency transitions, and the second synthesized version including formant frequency transitions and changes of the glottal source and some formant bandwidths. Five English-speaking subjects participated in the experiments. For each word, six stimulus pairs were created, consisting of two different versions of the three utterances in all possible combinations and orders. Five repetitions of each stimulus were presented, so a total of ten comparisons of each of the utterances with each other were made. The subjects were given these instructions: "In this listening test, you will be asked to judge the naturalness of the pairs of utterances. Each stimulus is composed of two utterances of the same word. When the utterances are presented, you decide which of the utterances sounds more natural, specifically listening to the /l/ sound in both utterances. It is okay to guess if you do not hear a difference."
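The stimulus counts described above can be reproduced with a few lines of code. This is a minimal sketch of the pairing arithmetic only; the version labels ("natural", "one", "two") follow the column names of Table 5.1, while the function names and the flat list layout are hypothetical, and nothing here reflects how the stimuli were actually ordered or presented.

```python
from itertools import permutations

VERSIONS = ("natural", "one", "two")
WORDS = ("luck", "let", "lap", "law", "leap", "loot")
REPETITIONS = 5


def build_stimuli(words=WORDS, versions=VERSIONS, repetitions=REPETITIONS):
    """Return (word, first, second) presentations for one subject."""
    ordered_pairs = list(permutations(versions, 2))  # 6 ordered pairs per word
    return [(word, first, second)
            for word in words
            for (first, second) in ordered_pairs
            for _ in range(repetitions)]


def comparisons_per_pair(stimuli, word, version_a, version_b):
    """Count presentations comparing two versions of a word, in either order."""
    return sum(1 for (w, first, second) in stimuli
               if w == word and {first, second} == {version_a, version_b})


stimuli = build_stimuli()
print(len(stimuli))                                            # presentations per subject
print(comparisons_per_pair(stimuli, "loot", "natural", "one"))  # comparisons per version pair
```

Three versions give six ordered pairs per word, and five repetitions of each ordered pair give ten comparisons of any two versions per subject, matching the description in the text.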
The results are presented in Table 5.1 and plotted in Figure 5-1. The percentage ratings are determined by adding the number of times each utterance was rated more natural than the other, regardless of order. Examination of the data suggests that order does not alter the conclusions made here about which utterances are more natural than another.

word   natural-one  one-natural  natural-two  two-natural  one-two  two-one
let    76%          24%          68%          32%          40%      60%
luck   78%          22%          64%          36%          28%      72%
leap   62%          38%          62%          38%          40%      60%
lap    66%          34%          56%          44%          26%      74%
loot   78%          22%          58%          42%          24%      76%
law    72%          28%          62%          38%          38%      62%
total  72%          28%          62%          38%          33%      67%

Table 5.1: Results of perceptual experiments: % first utterance rated more natural by listeners

Figure 5-1: Results of perceptual experiments: % first utterance rated more natural by listeners (bars for the total and for each word, grouped as one over natural, two over natural, and two over one)

word  two-one rating  ΔF1 + ΔF2  ΔF1  ΔF2
let   60%             527.0      195  332
luck  72%             254.0      176  78
leap  60%             977.0      39   938
lap   74%             507.0      214  293
loot  76%             273.0      19   254
law   62%             97.0       137  -40

Table 5.2: Total change in frequency between lateral and vowel for synthesized utterances

Figure 5-2: Total change in F1 and F2 between lateral and vowel vs. % times second synthesized utterance rated more natural than first synthesized utterance (horizontal axis: total change in formants, Hz)

The second synthesized version is rated more natural than the first synthesized version an average of 67% of the time. For all the utterances, the synthesized version incorporating the changes in voicing is rated more natural when compared with the first synthesized version. Additionally, the second synthesized version has an equal or higher rating of naturalness than the first synthesized version when each is compared with the natural utterance.
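The figures in Tables 5.1 and 5.2 can be cross-checked with a small script. The sketch below copies the table values, recomputes the "total" row on the assumption that it is an unweighted mean over the six words (the thesis does not state how it was formed), and computes a Pearson correlation between the total formant change and the two-one rating with law excluded, as in the comparison made in this chapter. The function names are hypothetical, and with only five points the correlation is descriptive at best.

```python
from statistics import mean

# Percent of judgments preferring the first-named version (Table 5.1).
# Tuple order: natural-one, natural-two, two-one.
PREFERENCE = {
    "let": (76, 68, 60),
    "luck": (78, 64, 72),
    "leap": (62, 62, 60),
    "lap": (66, 56, 74),
    "loot": (78, 58, 76),
    "law": (72, 62, 62),
}

# Change in F1 and F2 in Hz between lateral and vowel (Table 5.2).
DELTA_F = {
    "let": (195, 332), "luck": (176, 78), "leap": (39, 938),
    "lap": (214, 293), "loot": (19, 254), "law": (137, -40),
}


def column_totals():
    """Unweighted mean of each column over the six words, rounded to whole percent."""
    return tuple(round(mean(column)) for column in zip(*PREFERENCE.values()))


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den


def formant_change_vs_preference(exclude=("law",)):
    """Correlation between total formant change and the two-one rating."""
    words = [w for w in PREFERENCE if w not in exclude]
    total_change = [sum(DELTA_F[w]) for w in words]
    two_over_one = [PREFERENCE[w][2] for w in words]
    return pearson_r(total_change, two_over_one)


print(column_totals())
print(round(formant_change_vs_preference(), 2))
```

The recomputed column means agree with the "total" row of Table 5.1, and the sign of the correlation is negative: words with larger total formant change show a smaller preference for the second synthesized version.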
The highest naturalness ratings when comparing the two synthesized versions are obtained for loot, lap, and luck (all rated more natural than the simpler version more than 70% of the time). The total change in formant frequencies, ΔF1 and ΔF2, for the utterances (with the exception of law) is compared with the perceptual results for the second synthesized version against the first synthesized version in Table 5.2 and Figure 5-2. As the total change in F1 and F2 increased, the effect of incorporating additional parameters decreased. This would suggest that when a large abruptness occurs due to the change in formant frequencies, additional parameters are not as essential to increasing the naturalness of the utterance. However, even with a large abruptness in formant frequencies, as in the case of leap, the naturalness of the utterance is increased by varying the amplitude of voicing, bandwidths, and spectral tilt, and by adding a pole-zero pair.

Chapter 6

Conclusions

The American English lateral consonant is prone to considerable variation depending on the individual speaker and on the phonetic context, and this variability makes it more difficult to characterize than some other consonants. The first two formants are rather low and barely separated, while the third formant is generally higher in frequency than those for vowels. Characterization of this elusive consonant and identification of the key perceptual cues are necessary for applications such as speech synthesis. Modeling, further measurements on gathered data, synthesis, and evaluation of the synthesis work with perceptual experiments are performed in this thesis in order to begin the complete evaluation of the sound. Source-filter modeling of laterals can help to identify the various acoustic characteristics important for these sounds. The vocal tract during the lateral consonant can be modeled as a tube with constrictions and side branches.
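The side-branch part of such a tube model can be explored numerically. The sketch below evaluates the lossless two-channel expression given as TFside in Appendix A.1 and, for the special case of equal cross-sectional areas, uses the identity TF = cos(k(l21 - l22)/2) / cos(k(l21 + l22)/2), which follows from that expression by a sum-to-product step and places the lowest zero where the length difference is half a wavelength. The speed of sound (35400 cm/s), the channel lengths of 5 cm and 2 cm, and the function names are assumptions chosen only for illustration, not dimensions measured in this thesis.

```python
import math

SPEED_OF_SOUND_CM_S = 35400.0  # assumed value; not taken from the thesis


def tf_side(freq_hz, l21_cm, l22_cm, a21_cm2, a22_cm2, c=SPEED_OF_SOUND_CM_S):
    """Uout/Uin for two parallel lossless side channels (form of TFside, Appendix A.1)."""
    k = 2 * math.pi * freq_hz / c
    num = a21_cm2 * math.sin(k * l22_cm) + a22_cm2 * math.sin(k * l21_cm)
    den = (a21_cm2 * math.cos(k * l21_cm) * math.sin(k * l22_cm)
           + a22_cm2 * math.sin(k * l21_cm) * math.cos(k * l22_cm))
    return num / den


def equal_area_zero_hz(l21_cm, l22_cm, c=SPEED_OF_SOUND_CM_S):
    """Lowest zero when both channels have the same area: c / (2 * |l21 - l22|)."""
    return c / (2.0 * abs(l21_cm - l22_cm))


# Hypothetical channels of 5 cm and 2 cm with equal areas.
zero_hz = equal_area_zero_hz(5.0, 2.0)
print(zero_hz, abs(tf_side(zero_hz, 5.0, 2.0, 1.0, 1.0)))
```

In this simplified case, two channels of equal length produce no zero at all (the expression reduces to 1/cos(kl)), and the zero moves down in frequency as the lengths become more unequal. This is one way to see why zero locations would be expected to vary from speaker to speaker.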
The production of the lateral /l/ with an alveolar point of articulation creates an interior cavity formed by the tongue blade. An additional cavity is created under the tongue, which couples with the back cavity, creating poles and zeros during the lateral. The side branches around the tongue affect the high-frequency components of the lateral and account for effects on the spectrum that can be attributed to pole-zero pairs. Individual differences in the exact locations of the pole-zero pairs are expected, since the lengths and cross-sectional areas of the side branches are so variable among speakers. The locations of the poles are due to the interaction of the entire system, while the locations of the zeros are due to the configuration of the lateral side channels only.

A database of utterances was recorded by six different speakers. The utterances contained prevocalic /l/ followed by six different vowels, and prevocalic /l/ in stop consonant clusters followed by two different vowels. Acoustic analysis of these utterances examined attributes of the sound that provided information about back reactions on the glottal source during the lateral, pole-zero pairs, and increased bandwidths. Measurements of the formant frequencies and amplitudes during the lateral and the vowel show that the changes in amplitude, ΔA1 and ΔA2, are greater for the singleton lateral than for the cluster lateral. The increase in A1 and A2 cannot be accounted for by the changes in formant frequencies alone during the release. The additional increase in amplitudes suggests that changes in the source, the bandwidths, and pole-zero pairs are occurring in addition to the transition of the formant frequencies.

The theoretical model of the lateral, together with data from the measurements, was used to guide the synthesis of two words using the Klatt synthesizer [8].
The Klatt synthesizer parameters that were manipulated were the formant frequencies, TL (spectral tilt), BW (bandwidths of formants), AV (amplitude of voicing), and additional poles and zeros. Six utterances containing singleton prevocalic /l/ followed by six different vowels were synthesized for use in perceptual experiments. The first synthesized utterance was created by altering only the formant frequencies at the lateral release. The second utterance included changes in bandwidths, the spectral tilt, and the addition of a pole-zero pair, as well as the changes in the formant frequencies. These two synthesized utterances were used with the natural utterance to determine the naturalness of the synthesis. The utterances were presented in pairs, and the listeners were asked to judge which utterance in the pair sounded more natural. For all utterances, the second synthesized utterance, containing additional changes in voicing, was judged to be more natural, on average, than the simple synthesized utterance. When each was compared against the natural utterance, the second synthesized version was also judged more natural than the first. The naturalness of the synthesized utterance approaches that of the original utterance with the rapid transition of the formants, changes in the amplitude of voicing, changes in the bandwidths of the formants, and the addition of a pole-zero pair.

As the modeling suggests, the lateral is inherently variable in high-frequency content because of the geometry of the vocal tract configuration. The key acoustic characteristic is not related to the exact placement of the high-frequency formants or their amplitudes. Instead, these results support the conclusion that the key acoustic characteristic of the pre-vocalic lateral /l/ is an abruptness at the release of the lateral into the vowel, and that a good-quality, natural-sounding synthesized lateral must include this abruptness.
This abruptness is created in part by rapid changes in the first two formant frequencies and in part by rapid changes in spectrum amplitude over the speech frequency range. In speech recognition, the analysis should be capable of resolving these rapid changes in order to distinguish /l/ from glides. In synthesis, these changes must be present if the stimuli are to sound natural and are to be discriminated from other sonorant consonants. Therapy to improve a speaker's production of /l/ should emphasize the need for creating this abruptness in the appropriate frequency ranges.

6.1 Further Research

Further research to improve the understanding of the lateral is needed. The creation of a model which incorporates acoustic losses, including varying cross-sectional area functions as discussed by Maeda, would help determine how the acoustic losses are occurring and whether they are speaker dependent or context dependent [10]. Modeling which also incorporates additional geometries of the vocal tract during lateral production should be developed. The models should include configurations in which two side channels are formed, but only one is connected to both the front and back cavities and the other connects only to the front cavity. Further examination of the pre-vocalic cluster laterals and of how they differ from the singleton laterals would also be helpful. Perceptual testing of synthesis work on cluster laterals could begin to unravel the phonetic context mystery of the lateral. Key perceptual cues for the lateral should cross the boundaries of the phonetic context, and the model of the lateral could be improved by including this information. In English, the lateral has more latitude in frequency content because there is no minimal contrasting sound. Examining a language with contrasting lateral sounds, such as Italian or Spanish, could provide more insight into the perceptual cues for the lateral.
Appendix A

Matlab code for Modeling

This is the code used to solve the equations for the modeling section of my thesis.

A.1 Solve Lateral Channel Equations

% Pin3 = 0, solving for Uout
e2='Uin21=cos(-k*l21)*Uout21'
e3='Uin22=cos(-k*l22)*Uout22'
e4='Pout1=(-j*rho*c/A21)*sin(-k*l21)*Uout21'
e6='Pout1=(-j*rho*c/A22)*sin(-k*l22)*Uout22'
[Uout21,Uout22,Pout1,Uin21]=solve(e3,e2,e4,e6,'Uout21,Uout22,Pout1,Uin21')
Uin=symadd(Uin21,'Uin22')
Uout=symadd(Uout21,Uout22)
TF=symdiv(Uout,Uin)
TF=simplify(TF)
TF=simple(TF)
TFside=(A21*sin(k*l22)+sin(k*l21)*A22)./(cos(k*l21).*A21.*sin(k*l22)+sin(k*l21).*A22.*cos(k*l22))

A.2 Solve Equations for Lateral Branches when One Tube Disappears

% Pin3 = 0
e2='Uin21=cos(-k*l21)*Uout21'
e4='Pout1=(-j*rho*c/A21)*sin(-k*l21)*Uout21'
[Uout21,Pout1]=solve(e2,e4,'Uout21,Pout1')
TFnos=symdiv(Uout21,'Uin21')
TFnos=simplify(TFnos)
TFnos=simple(TFnos)

A.3 Solve for Entire Model

% Pin3 = 0
e1='Uin1=(cos(-k*l1)*(Uin21+Uin22))+((A1/(j*rho*c))*sin(-k*l1)*Pout1)'
e2='Pin1=((-j*rho*c/A1)*sin(-k*l1)*(Uin21+Uin22))+(cos(-k*l1)*Pout1)'
e3='Uin21=cos(-k*l21)*Uout21+A21/(j*rho*c)*sin(-k*l21)*Pin3'
e4='Pout1=(-j*rho*c)/A21*sin(-k*l21)*Uout21+cos(-k*l21)*Pin3'
e5='Uin22=cos(-k*l22)*Uout22+A22/(j*rho*c)*sin(-k*l22)*Pin3'
e6='Pout1=(-j*rho*c)/A22*sin(-k*l22)*Uout22+cos(-k*l22)*Pin3'
e7='Uout21+Uout22=cos(-k*l3)*Uout3'
e8='Pin3=(-j*rho*c/A3)*sin(-k*l3)*Uout3'
[Uout3,Uin21,Uin22,Pout1,Uout22,Pin3,Uout21]=solve(e1,e3,e4,e5,e6,e7,e8,'Uout3,Uin21,Uin22,Pout1,Uout22,Pin3,Uout21')
% and the answer is...
TF=symdiv(Uout3,'Uin1')
TF=simplify(TF)
TF=simple(TF)

A.4 Whole Model when One Side Branch Disappears

% Pin3 = 0
e1='Uin1=(cos(-k*l1)*Uin22)+((A1/(j*rho*c))*sin(-k*l1)*Pout1)'
e2='Pin1=(-j*rho*c)/A1*sin(-k*l1)*Uin22+(cos(-k*l1)*Pout1)'
e3='Uin22=cos(-k*l22)*Uout22+A22/(j*rho*c)*sin(-k*l22)*Pin3'
e4='Pout1=(-j*rho*c)/A22*sin(-k*l22)*Uout22+cos(-k*l22)*Pin3'
e7='Uout22=cos(-k*l3)*Uout3'
e8='Pin3=(-j*rho*c/A3)*sin(-k*l3)*Uout3'
[Uout3,Uin22,Pout1,Uout22,Pin3]=solve(e1,e3,e4,e7,e8,'Uout3,Uin22,Pout1,Uout22,Pin3')
TFno=symdiv(Uout3,'Uin1')
TFno=simple(TFno)
TFno=simplify(TFno)

Bibliography

[1] C.A. Bickley and K.N. Stevens. Effects of Vocal-tract Constriction on the Glottal Source: Experimental and Modeling Studies. Journal of Phonetics, 14:373-382, 1986.

[2] C.A. Bickley and K.N. Stevens. Effects of Vocal Tract Constriction on the Glottal Source: Data from Voiced Consonants. In T. Baer, C. Sasaki, and K. Harris, editors, Laryngeal Function in Phonation and Respiration, pages 239-253. College Hill Press, San Diego, 1987.

[3] R.A.W. Bladon. The Production of Laterals: Some Acoustic Properties and their Psychological Implications. In Current Issues in Linguistic Theory, volume 9 of Amsterdam Studies in the Theory and History of Linguistic Science IV, pages 501-508. Amsterdam, 1979.

[4] R.M. Dalston. Acoustic Characteristics of English /w,r,l/ Spoken Correctly by Young Children and Adults. Journal of the Acoustical Society of America, 57:462-469, 1975.

[5] C. Espy-Wilson. An Acoustic-Phonetic Approach to Speech Recognition: Application to the Semivowels. PhD thesis, Massachusetts Institute of Technology, 1987.

[6] C. Espy-Wilson. Acoustic Measures for Linguistic Features Distinguishing the Semivowels /wjrl/ in American English. Journal of the Acoustical Society of America, 92:736-757, 1992.

[7] G. Fant. Acoustic Theory of Speech Production. Mouton, The Hague, 1960.

[8] D.H. Klatt and L.C. Klatt. Analysis, Synthesis, and Perception of Voice Quality Variations among Female and Male Talkers. Journal of the Acoustical Society of America, 87:820-857, 1990.

[9] P. Ladefoged and I. Maddieson. The Sounds of the World's Languages. Blackwell Publishers, Cambridge, MA, 1996.

[10] S. Maeda. A Digital Simulation Method of the Vocal-Tract System. Speech Communication, 1:199-229, 1982.

[11] K. Miyawaki, W. Strange, R. Verbrugge, A.M. Liberman, J.J. Jenkins, and O. Fujimura. An Effect of Linguistic Experience: The Discrimination of [r] and [l] by Native Speakers of Japanese and English. Perception and Psychophysics, 18(5):331-340, 1975.

[12] S. Narayanan, A. Alwan, and K. Haker. An Articulatory Study of Liquid Approximants in American English. In ICPhS Proceedings, volume 3, pages 576-579, 1995.

[13] S. Narayanan, A. Alwan, and K. Haker. Toward Articulatory-acoustic Models for Liquid Approximants Based on MRI and EPG Data. Part I. The Laterals. Journal of the Acoustical Society of America, 101:1064-1078, 1997.

[14] R. Sproat and O. Fujimura. Allophonic Variation in English /l/ and its Implications for Phonetic Implementation. Journal of Phonetics, 21:291-311, 1993.

[15] K.N. Stevens. Acoustic Phonetics. MIT Press, Cambridge, MA, in press.

[16] K.N. Stevens and S.E. Blumstein. Attributes of Lateral Consonants. In Acoustical Society of America Proceedings, 1994.
