Detection of Stop Consonant Voicing: Toward a... Independent Model

Detection of Stop Consonant Voicing: Toward a Speaker Independent Model

by

Xiaomin Mou

B.S., Massachusetts Institute of Technology (2000)

Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degree of Master of Engineering in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June, 2001

©Xiaomin Mou. All rights reserved.

The author hereby grants to M.I.T. permission to reproduce and distribute publicly paper and electronic copies of this thesis document in whole or in part.

Author: Department of Electrical Engineering and Computer Science, May 23, 2001

Certified by: Kenneth N. Stevens, Clarence J. LeBel Professor of Electrical Engineering, Thesis Supervisor

Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Students

Detection of Stop Consonant Voicing: Toward a Speaker Independent Model

by Xiaomin Mou

Submitted to the Department of Electrical Engineering and Computer Science on May 23, 2001, in Partial Fulfillment of the Requirements for the Degree of Master of Engineering in Electrical Engineering and Computer Science

ABSTRACT

In this thesis, a method is described for determining from acoustic analysis whether a stop consonant is voiced or voiceless. Stop consonant production and conditions for voicing are first presented. A preliminary set of acoustic cues for determining voicing is formulated next from knowledge of acoustic theory. The acoustic cues include the fundamental frequency, first formant frequency, and the relative amplitudes of the first harmonic, first formant prominence, and third formant prominence. The fundamental frequency in the adjacent vowel is used to gauge the stiffness of the vocal folds.
Additional cues are the voice onset time (VOT), measured from the release to the onset of voicing, and the voice offset periodicity (VOP), measured immediately after the closure. Some of the measures are used to estimate the spread of the glottis and are measured immediately before the closure and after the onset of voicing, and others provide evidence for stiffening or slackening of the vocal folds. VOT and VOP are the most important voicing cues. The VOT of unvoiced stop consonants is on average 45 ms longer than that of their voiced counterparts, and the VOP of voiced stop consonants is on average significantly greater than that of their unvoiced counterparts. The fundamental frequency, the change in first harmonic amplitude, and the change in the difference between the amplitudes of the first and second harmonics are cues that can contribute to voicing identification. The results show that a small set of acoustic cues based on the theory of speech production may be reliable in determining voicing.

Thesis supervisor: Kenneth N. Stevens
Title: Clarence J. LeBel Professor of Electrical Engineering

Acknowledgements

I thank my advisor, Ken Stevens, who fostered my interest in the field of speech communications, for his unfailing encouragement and illuminating insight, which helped me grow both as a researcher and as a person. His feedback during our meetings raised important questions and challenged me to be more critical with my research. I thank members of the speech communications group for the weekly seminars, which often brought in speakers across the disciplines and gave me a chance to appreciate the complexity of this field. I thank Dan Shub, Laura Dilley, Stefanie Shattuck-Hufnagel, and Ariel Salomon for recording the speech data used in this thesis. I thank Arlene for her support. I am grateful for my best friend Cherry Liu, who has been my cheerleader through frustrating moments and all the way to the final stretch. Finally, I thank my parents for loving me and believing in me, always.

Contents

1. Introduction
  1.1 Lexical representation in terms of segments and features
  1.2 Landmarks and acoustic cues
  1.3 Thesis outline
    1.3.1 Past work
    1.3.2 Present study
2. Production and cues of stop consonant voicing
  2.1 Voicing
  2.2 Stop consonant voicing
  2.3 Acoustic cues for stop consonant voicing
    2.3.1 Voicing context
    2.3.2 Voicing cues
3. Databases and acoustic measurements
  3.1 Description of database
  3.2 Description of analysis tool
  3.3 Measurements
  3.4 Data
    3.4.1 Voice onset time (VOT)
    3.4.2 Voice offset periodicity (VOP)
    3.4.3 Glottalization
    3.4.4 H1, H2, F1, A1, and A3
4. Acoustic cue analysis
  4.1 Analysis of H1, H2, F1, A1, and A3
  4.2 Average data for voicing cues
    4.2.1 Average F0 data
  4.3 Combining acoustic cues
5. Summary and Discussion
  5.1 Voicing in isolated utterances
  5.2 Further work
Appendix A
Appendix B

List of Figures

1.2 Vowel and consonant landmarks for "We took a hike."
2.3 Measures of stop consonant voicing
3.3 An example of gathering measurements for a VCV utterance
3.4.1 Average voice onset time (VOT)
3.4.2 Average voice offset periodicity (VOP)
3.4.3a Glottalization of /t/
3.4.3b Average F1 decrease for unvoiced stops
3.4.4a Measurements for determining voicing at the closure and release for voiced and unvoiced consonants in VCV utterances by male speakers
3.4.4b Measurements for determining voicing at the closure and release for voiced and unvoiced consonants in VCV utterances by female speakers
3.4.4c Measurements for determining voicing at the closure and release for voiced and unvoiced consonants in CVC utterances by male speakers
3.4.4d Measurements for determining voicing at the closure and release for voiced and unvoiced consonants in CVC utterances by female speakers
4.2a Average VCV measurements
4.2b Average CVC measurements
4.2.1a Average F0 values at the release and closure landmarks for VCVs and CVCs
4.2.1b F0 values for male and female utterances
4.3a Scatter plots of voicing cues in VCV utterances using MAX values
4.3b Scatter plots of voicing cues in CVC utterances using MAX values
4.3c Scatter plots of the measurement H1-A1 against H1 using MAX values
4.3d Scatter plots of the measurement H1-A3 against H1 using MAX values

List of Tables

1.1 Feature chart for standard segments in English

Chapter 1

Introduction

The study of speech communication involves an understanding of how a speaker organizes a discrete linguistic representation into an acoustic signal and how a listener decodes that acoustic signal back to the linguistic representation. Quantitative models of the human speech perception and production mechanisms can be used to improve the performance of speech processing applications such as automatic speech recognition. The characteristic information contained in the acoustic signal, derived from phonetic and linguistic theory, may provide crucial cues that are robust even in natural environments where casual speech may be sloppy and background noise high.

In clearly enunciated speech, words are produced with all of their features well represented in the sound, whereas in casual speech the acoustic cues for selected features are often modified in context. Small children manage to develop a system for speech perception that accounts for the variability of speech, which depends on context, speaker, the mode of speaking, and the speech environment. They also learn which modifications are acceptable. Current speech recognition models, which rely on training with large amounts of speech data and on modeling variability probabilistically, fall short of human speech perception for casual speech.
Advances in the study of acoustic phonetics are leading to an understanding of relations between linguistic, articulatory, acoustic, and auditory representations of speech. These relations have shown that the variability in speech is not random. Research has shown that the variability arises from principles of speech production and perception, and suggests that modeling this variability is key to improving existing models of natural speech recognition (Stevens, 1995).

1.1 Lexical representation in terms of segments and features

The procedure for extracting from casual speech a description of an utterance in terms of distinctive features is a component of a lexical access model (Stevens, 1995). The lexical access model assumes that words are stored in memory as sequences of phonological segments, such as vowels or consonants, each of which is describable by a set of distinctive features. Three kinds of evidence for this word representation are provided. First, the words in a language can potentially be organized into minimal pairs such that one feature of one segment has a different binary value. An example of such a pair is "bit" and "pit". Secondly, the constraints on the structure and formation of words can be expressed by rules based on features (Chomsky and Halle, 1968). Thirdly, the anatomical organization of the vocal tract and the respiratory system is responsible for the quantal nature of speech sounds (Stevens, 1989). Over some regions of an articulatory parameter the acoustic properties are relatively stable, but movements outside of these regions show abrupt changes in the acoustics.

The lexical access model deduces word sequences by matching patterns of distinctive features against a stored lexicon of words. There are two types of distinctive features: the articulator-free and articulator-bound features. Table 1.1 (Choi, 1999) is a feature chart for the standard segments in English.
[Table 1.1: feature chart with one column per segment and one row per feature; the individual +/- cell values are not recoverable.
Segments, top part: iy, ih, ey, eh, ae, aa, ao, ow, ah, uw, uh, rr, ex, au, ai, oi, h, w, y, r, l, m, n, ng.
Segments, bottom part: v, dh, z, zh, f, th, s, sh, b, d, g, p, t, k, dj, ch.
Feature rows: vowel, glide, consonant, sonorant, continuant, strident, stiff, slack, spread, constricted, advanced tongue root, constricted tongue root, nasal, tongue body, blade, lips, high, low, back, anterior, distributed, lateral, rhotic, round.]

Table 1.1: Feature chart for standard segments in English. The top part of the table shows the features for vowels and glides, and the bottom part shows the features for consonants. The features for the stop consonants are shown bold-faced. The first six rows are the articulator-free features and the remaining rows are the articulator-bound features.

The articulator-free features refer to how the articulators are manipulated. For example, the features [high], [low], and [back] describe the position of the tongue body, and are represented for the vowels and most glides, as well as consonants produced with the tongue body. As another example, a narrow constriction in the oral cavity produces a consonant.
Stop consonants /b/ through /k/ are marked [-continuant] because airflow in the mouth is completely blocked during production, whereas fricative consonants /v/ through /sh/ are marked [+continuant] because airflow is not completely blocked. The articulator-bound features refer to the configuration and position of the particular articulators that are involved. For example, the features [stiff], [slack], [spread], and [constricted] describe the configuration of the larynx and the state of the vocal folds, and give information about voicing. The lexical representation in terms of distinctive features is independent of the context in which a word is spoken and of the speaking style.

1.2 Landmarks and acoustic cues

The speech signal can be segmented into vowels and consonants. Vowels are marked by regions of maximum low-frequency amplitude corresponding to maximum vocal-tract opening in phonated intervals, and consonants are indicated by regions of discontinuity corresponding to releases and closures. These regions of change are referred to as landmarks. In the vicinity of the landmarks, the speech signal can be analyzed for the acoustic cues that correspond to the articulator-bound features. To better analyze these segments of change, the utterances are usually Fourier transformed from the time domain into the frequency domain. A commonly used representation of the utterance in the frequency domain is the spectrogram.

Figure 1.2 shows the spectrogram of the utterance "We took a hike." spoken by a female speaker. Formants are shown for vowels and consonants, and noise energy is shown for burst regions. In the vicinity of these landmarks, the signal is examined in more detail for acoustic cues that reveal information about the place of articulation, nasality, tenseness, and voicing. The acoustic cues may include formant frequencies, the fundamental frequency, and energy distribution in different frequency ranges.
For example, high vowels such as /i/ have a low first formant frequency while low vowels such as /a/ have a high first formant frequency. Cues such as low-frequency energy near the closure of a consonant may indicate the voicing features (represented by [+stiff] vocal folds and [+slack] vocal folds in Table 1.1).

[Figure 1.2: top panel, waveform of the utterance with segment labels (glide, vowel, stop closure, diphthong); bottom panels, spectrogram with the F1 and F2 tracks and the landmarks. Horizontal axis: TIME (ms), 0 to 1000.]

Figure 1.2: Vowel and consonant landmarks for "We took a hike." The top figure indicates the consonants and vowels in the waveform, and the bottom figures show the vowel and consonant landmarks as well as formant frequencies in the corresponding spectrogram.

The utterance "We took a hike" contains a glide /w/, three stop consonants /t/, /k/, and /k/, four vowels /i/, /uh/, /aa/, and /ay/, as well as a voiceless aspirated consonant /h/. A discontinuity at 275 ms follows the first vowel /i/ and leads to a silence interval. This discontinuity is a landmark for the closure of the first stop consonant /t/. A second discontinuity at 300 ms, from the silence region into a burst of high-frequency energy, is a landmark for the release of the stop consonant. Together, these two regions are landmarks for a stop consonant. The discontinuity between the /k/ and the following vowel just before 460 ms is evidence for voice onset, or phonation following a voiceless aspirated stop consonant.

The existence of landmarks shows where consonants and vowels begin and end but does not give information about the articulators involved in producing the segments. That information is gathered by zooming in on the landmarks to examine acoustic cues such as formant frequencies.
The front vowel /i/ in "We" at 200 ms is marked by a low first formant frequency at 250 Hz and a high second formant frequency at 2 kHz. The back vowel /uh/ in "took" is marked by a high first formant frequency at 600 Hz and a low second formant frequency at 1.6 kHz. The stop consonant /k/ in "took" at 455 ms is marked by the proximity of the second and third formants, at 1850 Hz and 2150 Hz respectively. This proximity reflects the configuration of the /k/ closure. The velar /k/, compared to the labial stops /b/ or /p/, is produced with a vocal-tract constriction located much further from the lips. As a result, the lowest natural frequency of the cavity in front of the constriction likely corresponds to the second formant frequency. Furthermore, one of the natural frequencies of the back cavity will be very close to it (Stevens, 1998).

1.3 Thesis Outline

1.3.1 Past Work

Elizabeth Choi's doctoral work (Choi, 1999), "Detection of Consonant Voicing: A Module for a Hierarchical Speech Recognition System", presented a component of a systematic recognition system which focused on the detection of consonant voicing. Each utterance from a database spoken by two speakers was first transformed into a spectral representation and examined for landmarks. Further processing around the landmarks for acoustic cues led to a deconvolution of segments and their features. The features were finally compared against items in the lexicon for possible matches. Error analysis was performed on the measurements from two databases: isolated CVC and VCV syllables, and continuous reading of short sentences. The error rates, scored separately for each landmark, were in the range of 10 to 20 percent, with the higher error rates occurring for continuous speech. Combining closure and release landmarks reduced error rates by approximately 5 percent (Choi, 1999).

1.3.2 Present Study

This study builds upon Choi's doctoral work by focusing solely on the voicing feature in stop consonants in isolated utterances.
Choi's study involved only two speakers, who were examined separately, so the differences in measurements related to physical dimensions of the vocal tract were not accounted for. In addition, absolute measurements were obtained at a given landmark. For example, in a VCV syllable, measurements were obtained at +30 ms after a closure and -30 ms before a release. This study aims to arrive at a more robust model by limiting the number of consonants, by using data from more speakers, and by considering relative measurements of acoustic cues that do not depend on factors such as speech level, pitch, and gender.

Chapter 2 describes stop consonant production and the measurements that can be used to determine voicing. Chapter 3 presents preliminary measurements from isolated VCV and CVC utterances. Chapter 4 gives an analysis of those preliminary measurements and draws some conclusions about the relative importance of different voicing cues. Chapter 5 gives a summary and discusses future work.

Chapter 2

Production and cues of stop consonant voicing

2.1 Voicing

Voicing refers to the manifestation of vocal fold vibration during the production of a speech segment. It is a distinctive feature in stop consonants because two stop consonants can be identical in all features except voicing. In the feature representation, voicing is determined by the state of the vocal folds, which can be either stiff or slack. When the vocal folds are stiff, vibration in an obstruent consonant (such as a stop consonant) is inhibited; a segment in this state is described as unvoiced. Conversely, when the vocal folds are slack, vibration during an obstruent consonant is facilitated, and a segment in such a state is said to be voiced. This thesis is primarily concerned with the voicing feature of the six English stop consonants, and this chapter gives a brief description of the cues which distinguish a voiced stop consonant from an unvoiced one.
2.2 Stop consonant voicing

Stop consonant production is characterized by a complete closure in the vocal tract, an interval following the closure during which pressure builds up behind the constriction, and finally a sudden release of the constriction. The constriction is formed with the lips in the labials /b/ and /p/, with the tongue tip in the alveolars /d/ and /t/, and with the tongue body raised against the soft palate in the velars /g/ and /k/. The biggest difference between labials and alveolars is the larger amplitude of higher frequencies in the spectrum of the release burst for alveolars. The bursts for velar stop consonants are characterized by peaks in the mid-frequency range of the spectrum, a consequence of the tongue body position, which produces a front cavity with resonances at frequencies near F2 and F3.

At the instant of closure, the intraoral pressure is zero. During the closure interval, as the intraoral pressure increases toward the subglottal pressure, glottal airflow decreases. At the release, air abruptly flows through the constriction and the intraoral pressure falls, which restores the pressure drop across the glottis.

During the closure interval, the pressure built up behind the constriction leads to a smaller pressure difference across the glottis. When this difference is small enough, the vocal folds cease to vibrate and voicing stops. The vocal folds can be further manipulated, by spreading them apart or stiffening them, to maintain this voiceless state. However, if the pressure built up behind the constriction is not great enough to reduce the pressure difference across the glottis sufficiently, airflow will continue and the vocal folds will continue to vibrate. The pressure buildup is limited by expanding the pharyngeal region (Stevens, 1998). The vocal folds remain slack in this voiced state.
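The pressure argument of this section can be written compactly. The symbols below are notation assumed here for illustration only: $P_{\text{sub}}$ is the subglottal pressure, $P_{\text{oral}}(t)$ is the intraoral pressure during the closure interval, and $\Delta P_{\min}$ is the smallest pressure difference that still sustains vocal fold vibration. No numerical value for $\Delta P_{\min}$ is implied.

```latex
\Delta P_{\text{glottis}}(t) = P_{\text{sub}} - P_{\text{oral}}(t),
\qquad
\text{vibration continues only while } \Delta P_{\text{glottis}}(t) > \Delta P_{\min}.
```

In these terms, an unvoiced stop is one in which $P_{\text{oral}}(t)$ is allowed to rise until the condition fails, while in a voiced stop the expansion of the pharyngeal region slows the rise of $P_{\text{oral}}(t)$ so that the condition holds for longer.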
2.3 Acoustic cues for stop consonant voicing

Acoustic cues for voicing depend on the position and context in which the stop consonant occurs. The cues are also influenced by the speaking rate and speaking style, and the variability in these types of speech also needs to be characterized. Absolute measurements of some cues may not provide adequate evidence for voicing, and relative measurements taken over a period of time near the closure and release landmarks may reflect differences in voicing more robustly. For example, immediately prior to closure, there is a decrease in the low-frequency amplitude of the radiated sound. Following the closure, the amplitude of the radiated sound is small compared to that in the preceding vowel because sound is no longer radiated from the mouth opening. The amplitude of the glottal pulses would be expected to decrease abruptly at the instant of closure, and the amplitude in the first formant region would be expected to decrease immediately after the closure. This amplitude will then increase immediately following the release. The most abrupt decrease in amplitude of F1 in stop consonants would be expected for labials. By monitoring cues such as the low-frequency amplitude over time, the voicing feature may be revealed.

2.3.1 Voicing context

The voicing feature can be examined in three contexts. In syllable-initial position, the stop consonant is not preceded by a sonorant segment, and the voicing evidence can only be found in the region immediately preceding and following the release. In syllable-final position, the stop is not followed by a sonorant segment, and voicing evidence can be found in the region leading from the preceding vowel segment into the closure for the consonant. In the intersonorant position, voicing evidence can be found in the regions of both the closure and the release.

Several voicing cues exist following the release of a stop consonant.
These cues include the time from the release to the onset of low-frequency amplitude (the VOT, which tends to be shorter for voiced consonants than for unvoiced consonants), a measure of the breathiness of voicing after the onset of glottal vibration (represented in part by the difference between the amplitudes of the first two harmonics, H1-H2), the change over a period of time in the first formant frequency, and the fundamental frequency following the onset of glottal vibration.

Cues which exist in the syllable-final position include lengthening of the preceding vowel, possible glottalization of the stop consonant release, and the duration of the time interval in which glottal vibration continues. An example of vowel lengthening occurs in the pair of words "rider" and "writer". The vowel which precedes the voiced consonant /d/ is longer in duration than the vowel which precedes the unvoiced consonant /t/. An example of glottalization is the /t/ in the word "can't", in which the release of /t/ is not registered as a landmark in the sound. However, cues for the presence of glottalization are evidence for a following voiceless stop consonant, since glottal adduction is often used to enhance the termination of glottal vibration for syllable-final voiceless stops (Stevens, 1998). For stop consonants in the intervocalic position, there is a combination of cues from the closure and release.

2.3.2 Voicing cues

Voicing cues are acoustic cues which reflect the articulatory movements of stop consonant production and distinguish a voiced stop from an unvoiced one. Voicing cues should be subject to minimal variation in spite of differences in the context in which stop consonants appear. When the vocal tract forms a constriction or closure in the oral region, the cues for voicing should permit estimates of the vocal fold stiffness and the degree of abduction or adduction of the glottis.
Figure 2.3 is a schematic articulatory representation that accounts for the two types of cues that will be used as the preliminary cues in this study. Based on knowledge from acoustic phonetics, the main indicators of voicing are the stiffness of the vocal folds and the spreading of the glottis. The top panel shows a schematic representation of the change in the area of the glottal opening in a VCV sequence, where the consonant is in the intervocalic position, for both a voiced and an unvoiced consonant. Consonant closure is marked arbitrarily by -100 and the release is marked by 100. For the VCV utterances, in which closure precedes release, -100 refers to the point of closure. For the CVC utterances, where release precedes closure, -100 refers to the point of release. The bottom panel shows the change in stiffness of the vocal folds that is postulated to occur during the closure interval and immediately following the release.

The top panel shows that the glottis is more spread in the unvoiced stop consonant during the closure interval and immediately prior to the closure as well as after the release. The glottis has to be spread in the unvoiced stop consonant to keep the vocal folds from vibrating. The configuration of the glottis changes relatively little for the voiced stop consonant. The bottom panel shows that the vocal folds are increasingly stiff for the unvoiced consonant and increasingly slack for the voiced consonant. The stiffness of the vocal folds keeps them from vibrating in producing an unvoiced consonant, and the slackness of the vocal folds allows vibration to continue in producing a voiced consonant (Stevens, 1999). The acoustic cues for the voicing distinction are proposed based on this view of the adjustments of vocal-fold configuration and stiffness.

[Figure 2.3: two schematic panels for a VCV sequence, with curves for the voiced and unvoiced consonant and with closure, release, and voice onset marked on a time axis from -100 to 100.]

Figure 2.3: Schematic measures of glottal opening and stiffness associated with stop consonant voicing. The top panel shows estimates of the area of the glottal opening and the bottom panel shows the change in vocal fold stiffness during the closure interval and immediately after the release. Closure is marked at -100 and release at 100.

The next chapter discusses the particular cues involved in determining voicing, and applies them to recorded speech data. The goal is to determine which voicing cues are the most important in an isolated context of CVC and VCV utterances.

Chapter 3

Databases and Acoustic Measurements

3.1 Description of database

This study focuses on the measurement of the relative values of certain acoustic parameters at different times within utterances. Data from two female and two male speakers is adequate for that purpose. The stop consonants are /b/, /p/, /g/, /k/, /d/, and /t/. Vowel-consonant-vowel (VCV) clusters such as /ahdah/ and consonant-vowel-consonant (CVC) clusters such as /dahd/ are first recorded and then digitized. Results from Choi's study suggested that it may be possible to pool measurements from utterances where the vowels adjacent to the consonants are variable (Choi, 1999). Therefore, a neutral vowel /ah/, as in "cut", is chosen for this thesis study.

CVC utterances cover the syllable-initial and syllable-final voicing contexts, and VCV utterances cover the intervocalic position. For analysis purposes, VCV utterances are split into the VC and CV pairs which correspond to the two landmarks of a stop consonant. Similarly, CVC utterances are split into the CV and VC pairs. The relative importance of the acoustic cues at either landmark can then be assessed to arrive at a best combination of these cues. Primary spectral measurements from the center of the landmark are extracted at 10 ms intervals.
For the VCV utterances, where the closure landmark precedes the release landmark, the closure landmark is arbitrarily assigned -100 ms and the release landmark is assigned 100 ms. Times relative to these landmarks are selected as measurement points. At the closure landmark, measurements are taken from -150 to -70 ms; at the release, data are sampled from 100 to 170 ms. For the CVC utterances, the release landmark is assigned -100 ms and the closure landmark is assigned 100 ms, and measurements are taken from -100 to -30 ms at the release and from 50 to 130 ms at the closure.

3.2 Description of analysis tool

After the utterances are digitized, the waveforms are analyzed by using XKL, an X-windows port of the interactive speech analysis package originally developed by Dennis Klatt. The XKL program makes a spectral representation of the waveform. XKL computes a 512-point discrete Fourier transform on a length of waveform that is first differenced and then multiplied by a Hamming window. In this thesis, a long window length of 30 ms is used to measure H1, H2, and F0, and a short window of 6.4 ms is used to measure the formant frequencies as well as formant amplitudes. The longer time window, corresponding to a higher frequency resolution, is used to better capture harmonics. A shorter time window, corresponding to a lower frequency resolution, is adequate to measure formant frequencies.

The XKL program computes and displays a fundamental frequency if it finds local spectral maxima at regular intervals. The fundamental frequency is computed by collecting the frequencies of local maxima in the DFT spectrum. Only peaks below 3 kHz are considered, and the frequency is specified if the program judges the peaks to be equally spaced. If there is little low-frequency energy present in the spectrum, or if the distribution of differences is too spread in frequency, no fundamental frequency is displayed. See Klatt (1980) for details.
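The analysis described in this section can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not XKL's code: the sampling rate of 10 kHz, the function names `spectrum_db`, `estimate_f0`, and `harmonic_frame`, the 20 dB peak floor, and the 20 percent spacing tolerance are all assumptions introduced here, and the peak-spacing rule only mimics the idea of accepting F0 when the spectral peaks below 3 kHz are equally spaced.

```python
import cmath
import math

FS = 10_000   # sampling rate in Hz (assumed; not stated in the text)
N_FFT = 512   # DFT length given in the text


def spectrum_db(samples, n_fft=N_FFT):
    """dB magnitude spectrum of a first-differenced, Hamming-windowed frame."""
    # First difference: a simple high-frequency pre-emphasis.
    diff = [b - a for a, b in zip(samples, samples[1:])]
    n = len(diff)
    if not 2 <= n <= n_fft:
        raise ValueError("frame must give between 2 and n_fft differenced samples")
    windowed = [d * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, d in enumerate(diff)]
    mags = []
    # Zero padding to n_fft is implicit: the padded samples add nothing to the sum.
    for k in range(n_fft // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n_fft)
                  for i, x in enumerate(windowed))
        mags.append(20 * math.log10(abs(acc) + 1e-12))
    return mags


def estimate_f0(mags_db, fs, n_fft=N_FFT, max_hz=3000.0, floor_db=20.0, tol=0.2):
    """Return the mean spacing (Hz) of spectral peaks if they are evenly spaced, else None."""
    bin_hz = fs / n_fft
    top = max(mags_db)
    peaks = [k * bin_hz for k in range(1, len(mags_db) - 1)
             if k * bin_hz < max_hz
             and mags_db[k] > mags_db[k - 1]
             and mags_db[k] >= mags_db[k + 1]
             and mags_db[k] > top - floor_db]      # ignore weak peaks (hypothetical rule)
    if len(peaks) < 3:
        return None                                # too little harmonic structure
    gaps = [b - a for a, b in zip(peaks, peaks[1:])]
    mean_gap = sum(gaps) / len(gaps)
    if any(abs(g - mean_gap) > tol * mean_gap for g in gaps):
        return None                                # spacings too spread out
    return mean_gap


def harmonic_frame(f0, n_harmonics=5, duration_s=0.030, fs=FS):
    """Synthetic test frame: a sum of harmonics of f0 with 1/h amplitudes."""
    n = int(round(duration_s * fs))
    return [sum(math.sin(2 * math.pi * h * f0 * t / fs) / h
                for h in range(1, n_harmonics + 1))
            for t in range(n)]


f0_estimate = estimate_f0(spectrum_db(harmonic_frame(200.0)), FS)
```

Under the assumed 10 kHz sampling rate, the 30 ms window holds 300 samples and the 6.4 ms window holds 64 samples, both of which fit in the 512-point transform. The longer window gives narrower spectral peaks, which is why it separates individual harmonics, while the shorter window smooths over them and leaves the formant envelope.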
3.3 Measurements

The first step in the detection of consonant voicing is to locate the closure and release landmarks of the stop consonants. Locating the landmarks is aided by the knowledge that a complete closure somewhere in the vocal tract is required and that pressure builds up as a result of the constriction, leading to a reduction or extinction of glottal vibration. The pressure buildup is followed by a sudden release of the constriction, which can result in the generation of turbulence noise. In this thesis, the landmarks are determined by hand by examining the waveform for abrupt discontinuities.

Around each landmark, the primary acoustic parameters measured with XKL are the fundamental frequency (F0), the amplitude of the first harmonic (H1), the first formant frequency (F1), the amplitude of the second harmonic (H2), and the amplitudes of the first and third formant spectral prominences (A1 and A3). Figure 3.3 shows the different parameters that are measured in different time regions of a VCV utterance. Immediately before closure, the measures that reflect the spread of the glottis are H1, H2, F1, A1, and A3, and the measure that reflects the stiffness of the vocal folds is F0. The same is true after voicing resumes some time after release. In between, however, the presence or absence of F0 is used to determine the periodicity of the vocal fold vibration immediately after closure and around the release. At each 10 ms sample point, the signal is taken to be periodic if XKL finds an F0, and non-periodic if it fails to find an F0.

[Figure 3.3: timeline of a VCV utterance marking a 50 ms interval before closure, a 30 ms interval after closure, the VOT, and a 40 ms interval after voicing onset, with the parameters H1, H2, F1, A1, A3, and F0 labeled. Horizontal axis: time.]

Figure 3.3: An example of gathering measurements for a VCV utterance. In the 50 ms interval before the closure, parameters for spread and stiffness are measured. In the 30 ms after the closure, periodicity is measured.
VOT is determined after the release, and parameters for spread and stiffness are measured in the 40ms after the onset of glottal vibration.

H1 is a measure of the strength of the vocal fold vibration. The difference between H1 and H2 is an indirect measurement of the amount of glottal spreading; H1-A1 and H1-A3 also reflect the state of the glottis. A larger value for any of these differences in a vowel immediately adjacent to a stop consonant corresponds to a more spread glottis and a broader first formant bandwidth. A1 may be smaller in a vowel adjacent to an unvoiced stop consonant, where the vocal folds are spread or constricted, leading to acoustic loss. Therefore, H1-A1 would be expected to be higher for unvoiced stop consonants than for voiced stop consonants.

Formation of the constriction leads to an abrupt termination of the phonation source, which is reflected in a sharp decrease of F1. F1 reflects the degree of the glottal constriction. After the sudden closure, F1 would be expected to fall off substantially, and to rise after the release. The fall in F1 at consonant closure and the rise in F1 at voicing onset are expected to be greater for voiced consonants.

Another cue for stop consonant voicing is a measure of how long the vocal folds are spread before they come together at the release of the consonant, called the voice onset time (VOT). The VOT is a cue found at the release landmark, as voicing begins some time after the release of a stop consonant. The VOT is determined in this thesis by taking the difference between the time of release, which is marked by a burst of energy in the waveform, and the time from which XKL detects periodicity consistently.

Another measurement preceding consonant closure is vowel duration. Vowel duration can be used as a cue for voicing of final-position stop consonants when the vowel is in the utterance-final position (Crystal and House, 1990).
A vowel followed by a voiceless consonant, such as the /eh/ in "bet", is usually shorter in duration than one followed by a voiced consonant, such as the /eh/ in "bed". There is a natural tendency to make a slightly early glottal opening gesture for a postvocalic voiceless consonant in order to ensure that no low-frequency voicing cue is generated (Klatt, 1975).

At the release landmark, the quantity of interest is how many time frames elapse before the onset of periodicity. Once periodicity is detected, how do the parameters change with time? At the closure landmark, a similar cue for stop consonant voicing is a measure of how long the vocal folds continue to vibrate after the point of closure. This cue at the closure landmark will be referred to in this thesis as the voice offset periodicity (VOP). This parameter is defined here as the percentage of the 4 time frames after the point of closure for which XKL returns a fundamental frequency. Together, the VOP after the closure (in percent) and the VOT after the release (in ms) might give valuable information about the behavior of the vocal folds during the consonant interval, as reflected by the presence or absence of F0. The duration of the vowel may also be a cue for voicing of final stop consonants in CVC utterances.

Away from the consonant, before the closure and after the release landmarks, the behavior of the vocal folds in preparation for the transition into or out of the consonant might also provide a clue about voicing, as reflected by the change in other parameters (H1, H2, F1, A1, and A3).

3.4 Data

This section presents data obtained with XKL on the VCV and CVC utterances in two parts. The first part is concerned with the behavior of the parameters during the closure interval. The parameter used is the F0 returned by XKL, and this information is used to calculate the VOT and VOP.
The second part is concerned with the behavior of the parameters H1, H2, F1, A1, and A3 before the closure and after the onset of voicing at the release landmark.

3.4.1 Voice onset time (VOT)

Panel (a) of Figure 3.4.1 shows the average VOT for stop consonants from data provided by Crystal and House (1988). Notice that the VOT for unvoiced stop consonants is about 45ms longer than for voiced stop consonants. Panels (b) and (c) of Figure 3.4.1 show the VOT calculated in this study from the VCV and CVC utterances and reflect the same relationship. Panel (d) of Figure 3.4.1 shows the vowel duration for CVC utterances. On top of each average measurement value, the standard error of that measurement across all speakers is shown.

The small and non-overlapping standard errors for unvoiced and voiced stops suggest that it is possible to distinguish voicing by examining the VOT alone: it is simple to distinguish between a voiced and an unvoiced stop. It is, however, difficult to distinguish place of articulation among voiced stop consonants or among unvoiced stop consonants based on VOT, although there are some systematic differences. The labials have the shortest VOT and the velars have the longest VOT. In a cross-language study of voicing, Lisker and Abramson (1964) found that the velars have consistently higher VOT values than the other stops and suggested that the VOT is, to a certain extent, sensitive to the place of stop closure. To avoid producing overlapping distributions, data from stops of the same manner but different places of articulation were kept separate in that study.

Vowel duration is taken to be the time from the release of the syllable-initial stop to the closure of the syllable-final stop.
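The three timing cues defined so far (VOT, VOP, and vowel duration) can be computed from the landmark times and the frame-by-frame F0 values. The following Python sketch is only an illustration under assumptions, not the procedure used with XKL in this thesis. The function names are hypothetical, the first F0 frame is assumed to fall exactly at the landmark, and "consistent" periodicity is interpreted as every remaining frame having a non-zero F0.

```python
FRAME_MS = 10  # spacing between F0 sample points, as used in this thesis


def vot_ms(f0_after_release, frame_ms=FRAME_MS):
    """Time from the release to the first frame after which F0 stays non-zero.

    f0_after_release: F0 values (0 = no periodicity found), first frame at
    the release landmark. Returns None if voicing never becomes consistent.
    """
    for i in range(len(f0_after_release)):
        # Assumed reading of "consistently": all later frames are periodic.
        if all(f > 0 for f in f0_after_release[i:]):
            return i * frame_ms
    return None


def vop_percent(f0_after_closure, n_frames=4):
    """Percentage of the first n_frames after closure that have an F0."""
    if len(f0_after_closure) < n_frames:
        raise ValueError("need at least n_frames F0 values after the closure")
    voiced = sum(1 for f in f0_after_closure[:n_frames] if f > 0)
    return 100.0 * voiced / n_frames


def vowel_duration_ms(release_ms, closure_ms):
    """Time from the syllable-initial release to the syllable-final closure."""
    return closure_ms - release_ms


# Made-up F0 tracks for illustration only; these are not thesis data.
print(vot_ms([0, 0, 0, 0, 110, 112, 115, 114]))
print(vop_percent([120, 118, 0, 0]))
print(vowel_duration_ms(-100, 100))
```

Treating an isolated periodic frame followed by a dropout as not yet voiced is one way to make the onset decision robust to spurious F0 detections; other definitions of consistency would shift the VOT by a frame or two.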
Studies on fricatives by Crystal and House show that the duration of a vowel up to the time of the frication onset tends to be shorter when the vowel is followed by a voiceless fricative than by a voiced fricative, when the consonant is in the syllable-final position. This duration difference for vowels is negligible when the fricative is in a non-syllable-final position. These results also apply to obstruent consonants in general. Previous studies showed that the average duration of vowels followed by /t/ is 160ms, by /d/ 210ms, by /p/ 150ms, by /b/ 210ms, by /k/ 150ms, and by /g/ 210ms (Crystal and House, 1990). Panel (d) shows vowel duration data from this study. There is a similar relationship between the vowel durations before voiced and before voiceless stop consonants. Vowels followed by voiced stop consonants are on average 70ms longer than vowels followed by unvoiced stop consonants. This large difference suggests that vowel duration can be a cue that contributes to voicing identification in syllable-final stop consonants.

Figure 3.4.1: Average voice onset time (VOT). Panel (a) shows data from a prior study (Crystal and House, 1990). Panel (b) shows VOT for VCV utterances, panel (c) shows VOT for CVC utterances, and panel (d) shows the average vowel duration in CVC utterances. The bottom portion of each stacked bar shows the actual VOT in ms and the top portion shows the standard error of the data.

3.4.2 Voice offset periodicity (VOP)

Figure 3.4.2 shows the average VOP.
The top plot represents the average VOP for VCV utterances and the bottom plot gives the average VOP for CVC utterances. Average VOP for the voiced utterances is higher than that for the unvoiced utterances. Average VOP for the voiced /ahgah/ and /gahg/ is more than 65% higher than the average VOP of their counterparts /ahkah/ and /kahk/. The same is true for the /d/ and /t/ utterances. Average VOP for the voiced /ahbah/ and /bahb/ is only about 20% higher than that of their unvoiced counterparts. Average VOP for the unvoiced CVC /taht/ is 0. Again, the small standard error indicates that VOP is helpful in distinguishing voicing for postvocalic stop consonants.

Figure 3.4.2: Average voice offset periodicity (VOP). The top panel shows percent periodicity for VCV utterances and the bottom panel shows percent periodicity for CVC utterances. The standard error of all the measurements across speakers is shown above each bar.

Note that the VOT for unvoiced stop consonants is higher than the VOT for voiced stop consonants, and that the VOP for unvoiced stop consonants is lower than the VOP for voiced stop consonants. It takes longer for an unvoiced stop than for a voiced stop to resume voicing at the release, and it is less likely that there is continued voicing after the closure. Note that while the VOP for /g/ is more than four times that for /k/, the VOP for /b/ is only twice that for /p/. The smaller difference in VOP for the /p/ and /b/ pairs could arise because the closure of the labial stops is faster, resulting in a more rapid drop in F1 and hence a more rapid decrease in periodicity as measured by XKL. On the other hand, there is a large difference between the VOP of /dahd/ and /taht/. The VOP of /taht/ is zero.
The large difference between the VOP of /taht/ and /dahd/ suggests that the final /t/ is much more frequently glottalized than the final voiceless stops /k/ and /p/. That is, the glottis closes and voicing is cut off even before closure.

3.4.3 Glottalization

Glottalization is a phonological modification made mostly in continuous speech. Often, stop consonants in the final position are not fully released. The VOP data for /taht/ may be an indication that the final /t/ is more frequently glottalized than other stop consonants in the final position. Figure 3.4.3a shows two spectrograms. The top one is of /taht/ and the bottom one is of /kahk/. Immediately after 610ms in /taht/, voicing stops altogether and there is no release of the final /t/. On the other hand, there is a definite release of the final /k/ beginning at 700ms.

Figure 3.4.3a: Glottalization of /t/. The top panel shows the spectrogram for /taht/ and the bottom panel shows the spectrogram for /kahk/ by the same speaker.

If the final /t/ is glottalized, F1 should drop less for /t/ than for /p/ and /k/. Presumably voicing is cut off so abruptly that F1 will not have had a chance to decrease significantly. Figure 3.4.3b shows that after the point of closure, on average, F1 decreases less for /t/ than for /p/ and /k/. F1 for /k/ decreases the most, from 680Hz at the point of closure to 270Hz 10ms after. F1 for /t/ decreases the least, from 615Hz to 500Hz. F1 for /p/ decreases by an intermediate amount, from 605Hz to 400Hz.
Figure 3.4.3b: The average F1 decrease for unvoiced stops. The stars indicate values for /t/, the circles for /p/, and the dots for /k/. Averages are over four utterances, one each by four speakers. Note that the closure is at 100ms and measurements are taken from 50ms to 110ms.

3.4.4 H1, H2, F1, A1, and A3

The previous section discussed the importance of VOT, VOP, and vowel duration as cues for voicing. One might conclude that VOT, VOP, or vowel duration alone can determine the presence of voicing. In other contexts, such as a noisy environment, however, VOT and VOP often cannot be detected. In casual speech, vowel duration may vary as a result of changes such as emphasis or speaker mood. What other cues can be used to determine voicing in these contexts? The answer might be on the other side of the landmarks.

Figure 3.4.4a shows the behavior of the H1, H2, A1, A3, F0, and F1 measures over time for the isolated VCV utterances /ahgah/ and /ahkah/ spoken by two male speakers, as and ds. The following section remarks on general trends in the behavior of these measures, and the next chapter provides an analysis of these measures, relates the behavior to theory, and draws conclusions about their strength in determining voicing.

In Figure 3.4.4a, measures for speaker as are shown in the left column and measures for speaker ds in the right column. Figure 3.4.4b shows the same measures for the female speakers ld and ss. The fundamental frequency F0 and first formant frequency F1 are displayed alone, while the amplitudes H2, A1, and A3 are shown relative to H1, the amplitude of the first harmonic.

Data for the male speaker as show that at the closure, H1 (the amplitude at F0) begins to drop from 50dB to 25dB. At the release landmark, voicing begins about 35ms after the release, and H1 rises from about 30dB to 50dB.
Data for the other male speaker, ds, follow the same trend, and voicing begins about 40ms after the release.

After the closure, F0 for the voiced /ahgah/ remains constant while F0 for the unvoiced /ahkah/ becomes undetectable by XKL. XKL returns 0 for F0 when the fundamental frequency cannot be detected because there is a lack of regular harmonics. In this thesis, a "0" is assigned to F0 as a null value. The figures show that F0 drops abruptly from some value to 0 at the closure. At the release, as soon as voicing begins, F0 for the unvoiced stop is detectable again and is higher than F0 for the voiced stop, presumably because the vocal folds were held stiff to keep them from vibrating before voicing. At that point, F0 of the unvoiced stop is about 30Hz greater than F0 of the voiced stop.

After the closure, H1-H2 of the unvoiced /ahkah/ is about 15dB greater than that of the voiced /ahgah/. H1-H2 of the voiced stop falls while that of the unvoiced stop rises. At the release, H1-H2 of the unvoiced stop rises from the onset of voicing and then falls, while that of the voiced stop stays relatively constant. The overall change in H1-H2 of the unvoiced stop is greater than that of the voiced stop.

The profile of H1-A1 over time is the same for the unvoiced and voiced stops at the closure. At the release, however, H1-A1 of the voiced stop decreases steadily while that of the unvoiced stop rises and falls after voicing begins. Again, the overall change in H1-A1 of the unvoiced stop is greater than that of the voiced stop.

H1-A3 of the unvoiced stop falls after the closure while that of the voiced stop continues to rise. Because A3 is falling more than A1, H1-A1 is almost constant around the closure for both the voiced and unvoiced stops. At the release, when voicing begins, H1-A3 rises and then falls after reaching a maximum around 50ms after the release. Again, the overall change in H1-A3 is bigger for the unvoiced stop than for the voiced stop.
At the closure, there is almost no difference in the behavior of F1 between the voiced and unvoiced stops. At the release, F1 is higher for the unvoiced stop than for the voiced stop, and it falls about 50ms after the release.

These general observations apply to the female speaker data shown in Figure 3.4.4b. Figures 3.4.4c and 3.4.4d also show similar trends for the CVC utterances /gahg/ and /kahk/. The closure and release landmarks are switched from those of the VCV utterances. Data for all speakers are shown in the appendix. The next chapter gives an analysis of these cues and assesses their validity and importance in determining voicing.

Figure 3.4.4a: Measurements for determining voicing at the closure and release for voiced (dot) and unvoiced (open circle) consonants in VCV utterances with vowel /ah/ for male speakers as and ds. Closure occurs at -100ms and release occurs at 100ms. Data are taken every 10ms from 50ms before closure to 30ms after closure, and from the release to 70ms after release.
Figure 3.4.4b: Measurements for determining consonant voicing at closure and release for voiced (dot) and unvoiced (open circle) consonants in VCV utterances with vowel /ah/ for female speakers ld and ss.

Figure 3.4.4c: Measurements for determining consonant voicing at release and closure for voiced (dot) and unvoiced (open circle) consonants in CVC utterances with vowel /ah/ for male speakers as and ds.
Figure 3.4.4d: Measurements for determining consonant voicing at release and closure for voiced (dot) and unvoiced (open circle) consonants in CVC utterances with vowel /ah/ for female speakers ld and ss.

Chapter 4

Acoustic cue analysis

Much of the data analysis is concerned with the relative importance of the cues near the two landmarks. In the vicinity of each landmark, the importance of the cues is further scrutinized in segments of time which represent periods before the closure, after the closure, and after the onset of voicing at the release. For each CVC or VCV utterance, the behavior of the parameters H1, H2, F1, A1, and A3 is examined for the 50ms leading up to closure. After the point of closure, F0 (indicating presence or absence of periodicity) is examined for 30ms. At the release, F0 is examined for 40ms. After that point, H1, H2, F1, A1, and A3 are examined for 40 more ms.

This analysis of voicing can be thought of as four consecutive stages. The first stage concentrates on the time leading up to closure, the second stage focuses on the time after closure, the third stage looks at the time leading up to voice onset, and the fourth stage examines the time after voice onset. In effect, this analysis concentrates on the acoustic manifestation of laryngeal gestures that overlap into regions adjacent to the landmarks. A typical question that needs to be addressed is: in the transition from vowel to closure, is there an adjustment of the vocal folds to prepare for a voiced or unvoiced stop? Is there something done differently by the vocal folds during the vowel interval before the consonant?

4.1 Analysis of H1, H2, F1, A1, and A3

In addition to the primary cues VOT, VOP, and vowel duration, several important cues exist just before the closure and after voicing onset following the release.
This section examines the importance of these cues by tracking the behavior of the parameters H1, H2, F1, A1, and A3 over a period of 50ms before closure and 40ms after periodicity begins following the release. Average data and standard errors are presented first, followed by a discussion of the relative importance of each cue.

4.2 Average data for voicing cues

Figure 4.2a shows the average data for 24 VCV utterances by four speakers, two female and two male, and Figure 4.2b shows the average data for the CVC utterances. The first column represents activity at the closure of a voiced utterance, the second column at the release of the same voiced utterance, the third column at the closure of the unvoiced counterpart, and the fourth column at the release of the same unvoiced utterance. The first row plots the behavior of H1, the second row F1, the third row H1-H2, the fourth row H1-A1, and the last row H1-A3.

Figure 4.2a: Average VCV measurements. The first item in each panel (Max) represents the average maximum difference in the measure between two neighboring time frames. Max is the maximum rate of change over 10ms. The second item (Diff) shows the difference between the value of the measure in the last time frame and that in the first time frame.
The light part of each stacked bar shows the actual measure and the dark portion indicates the standard error. VCV closure is defined at -100 and release at 100. Measurements are taken from -150 to -110ms at the closure, and from 140 to 170ms at the release.

Figure 4.2b: Average CVC measurements. The first measure in each plot is the maximum difference between two neighboring time frames. Max is the maximum rate of change over 10ms. The second measure is the difference between the last and first time frames. The light portion of each stacked bar shows the actual measure and the dark portion represents the standard error. CVC release is defined at -100 and closure at 100. Measurements are taken at the release from -60 to -30ms, and at the closure from 100 to 140ms.

The first item of each bar graph (Max) shows the average maximum difference in the measurement between any two neighboring 10ms time frames during either the 40ms release interval or the 50ms closure interval. The second item (Diff) shows the average difference between the first and last time frames. The bars shown are all stacked, with the bottom part representing the actual measure and the top portion representing the standard error of that measure across speakers. Together, the values Max and Diff and their respective standard errors can give insight into the general effectiveness of a parameter in characterizing a voiced or a voiceless stop.
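The Max and Diff summaries described above can be written out as a brief sketch. This is an illustration under assumptions rather than the exact computation used for the figures: the function names are hypothetical, Max is taken here as the frame-to-frame change of largest magnitude with its sign kept (which is how the negative Max values reported for F1 are read), and the standard error is taken as the sample standard deviation divided by the square root of the number of values.

```python
def max_and_diff(track):
    """Summarize one measure sampled every 10 ms within one interval.

    Returns (max_step, diff):
      max_step: the change between neighboring frames with the largest
                magnitude, sign preserved (assumed reading of "Max").
      diff:     value in the last frame minus value in the first frame.
    """
    if len(track) < 2:
        raise ValueError("need at least two time frames")
    steps = [b - a for a, b in zip(track, track[1:])]
    max_step = max(steps, key=abs)
    diff = track[-1] - track[0]
    return max_step, diff


def mean_and_standard_error(values):
    """Mean and standard error (sample std / sqrt(n)) of per-utterance values."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, (variance / n) ** 0.5


# Made-up F1 track (Hz) approaching a closure; these are not thesis data.
f1_track = [700, 690, 600, 520, 500]
print(max_and_diff(f1_track))
```

In this reading, Max and Diff are first computed per utterance with `max_and_diff`, and the per-utterance values are then averaged across speakers with `mean_and_standard_error` to give the bar heights and error portions.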
Comparison of H1 for the voiced and unvoiced VCV utterances at the closure shows that their behavior is very similar. The Max and Diff values are both slightly negative, indicating a slightly decreasing H1 at the closure for both the voiced and unvoiced VCV utterances. However, at the release, the Max and Diff values for the unvoiced VCVs, at 8 dB and 11 dB respectively, are much larger than those for the voiced VCVs, which are both about 0 dB. Furthermore, because the standard errors of Max and Diff for the unvoiced utterances at the release do not overlap with those for the voiced utterances, there is clearly a separation in the behavior of H1 at the release between the voiced and unvoiced VCVs. Because H1 is a measure of the amplitude of the first harmonic, the large rise in H1 at the release of the unvoiced stop consonant reflects the closing of the glottis as glottal vibration begins. Figure 4.2b shows the same behavior of H1 in CVC utterances. H1 is a valid cue for determining voicing near the release landmark for both VCV and CVC utterances.

At the release, the maximum rate of change in F1 is greater for the voiced than for the unvoiced stops in both the VCV and CVC utterances. This could arise because at the release, there is an increase in the low-frequency amplitude of radiated sound. At the release of voiced VCVs, F1 rises, with a Max of 47Hz (standard error 18) and an overall Diff of 64Hz (standard error 21). At the release of unvoiced VCVs, F1 drops, with a Max of -22Hz (standard error 20) and a Diff of -27Hz (standard error 23). At the release of voiced CVCs, F1 rises, with a Max of 32Hz (standard error 14) and a Diff of 39Hz (standard error 16). At the release of unvoiced CVCs, F1 drops, with a Max of -35Hz (standard error 8) and a Diff of -48Hz (standard error 31). At the closure, there are even more considerable differences between the voiced and unvoiced stops.
In the VCV utterances, the decrease in F1 at the closure of the unvoiced utterances is indicated by a Max of -147Hz (standard error 38) and a Diff of -209Hz (standard error 40). The decrease in F1 at the closure of the voiced utterances is indicated by a Max of -178Hz (standard error 40) and a Diff of -236Hz (standard error 36). At the closure of the voiced CVCs, F1 decreases, with a Max of -178Hz (standard error 40) and a Diff of -236Hz (standard error 36). At the closure of the unvoiced CVCs, F1 decreases, with a Max of -122Hz (standard error 33) and a Diff of -185Hz (standard error 36). The larger decrease in F1 at the closure of the voiced utterances is consistent with the expectation that immediately after the closure, there is a decrease in the low-frequency amplitude of radiated sound. The parameter F1 is therefore a valid cue in determining voicing at both the release and the closure landmarks.

Comparison of the Max and Diff in H1-H2 for the voiced and unvoiced VCVs shows that it is a robust cue for voicing both at the release and at the closure. At the closure, Diff is -1 dB (standard error 0.9) for the voiced and 5 dB (standard error 1.5) for the unvoiced stops. The positive value of Diff for the unvoiced stops indicates that H1-H2 becomes larger approaching the closure. This is consistent with the physiological behavior: as the glottal opening increases, a smoother glottal waveform results in a lower H2, and the last few pitch periods become breathy. At the release, Diff is 0.2 dB (standard error 0.2) for the voiced and -4 dB (standard error 1) for the unvoiced stops. The negative value of Diff indicates that H1-H2 is decreasing after the release. This is again consistent with the behavior of the glottis as it becomes less spread to prepare for voicing. The large differences in Diff at both landmarks and the non-overlapping standard errors indicate that H1-H2 is a valid measure for voicing.
Therefore, H1-H2 is a fairly valid measure of voicing both at the closure and at the release for the VCVs. Figure 4.2b shows that for the CVCs, however, H1-H2 is not as robust a cue at either the release or the closure. Although there is a similar pattern of H1-H2 increase at the closure and decrease at the release, the large standard errors make it difficult to draw any conclusions about its contribution to voicing identification.

The large standard errors associated with the parameters H1-A1 and H1-A3 make it difficult to draw conclusions about their validity as cues for voicing. Figure 4.2a shows that in the VCV utterances, H1-A1 increases more for the voiced than for the unvoiced stops at the closure landmark, and it decreases more for the unvoiced than for the voiced stops at the release landmark. Figure 4.2b shows that in the CVC utterances, it also increases more for the voiced than for the unvoiced stops at the closure, and it decreases more for the unvoiced than for the voiced stops at the release. H1-A3 follows the same trend at the closure for both the VCV and CVC utterances. There is more of an H1-A3 increase at the release of both the VCV and CVC utterances. Compared with the other parameters, H1-A1 and H1-A3 do not contribute much to voicing detection.

It should be noted that these are average data over both female and male speakers. Some parameters, such as H1, might change more for the female than for the male speakers. In "Glottal characteristics of male speakers: Acoustic correlates and comparison with female data", Hanson and Chuang (1999) looked for parameters for differentiating male and female speech. Acoustic measurements reflecting glottal characteristics, made on recordings of 21 male speakers and 22 female speakers, showed that while there was significant overlap across gender, the male data showed lower average values and less interspeaker variation for all measures (first formant bandwidth, H1-H2, H1-A1, and H1-A3).
Males tend to have a more complete glottal closure, leading to less energy loss at the glottis and less spectral tilt (Hanson and Chuang, 1999). Because this thesis looks for significant parameters for differentiating voicing in stop consonants, the data collected on 2 female speakers and 2 male speakers were analyzed together.

The next section discusses the validity of the actual value of F0 at the release or closure landmark as a voicing cue. Average F0 data are provided first, followed by separate data from male and female speakers.

4.2.1 Average F0 data

In the previous chapter, the presence of F0, as detected by the XKL program, was the basis for determining VOT and VOP. In addition, the actual value of F0 at the closure or the release may be a valid cue for voicing. F0 at the release of an unvoiced stop consonant is expected to be higher than F0 at the release of a voiced stop because the vocal folds were held stiff during the unvoiced interval. During the transition from the unvoiced stop consonant to the following vowel segment, F0 will fall. Similarly, at the closure, F0 is expected to be somewhat higher for the unvoiced stop consonant than for the voiced one, assuming that the stiffness or the slackness of the vocal folds is implemented before the closure.

It should be noted that F0 is determined in this thesis based on the output of the XKL program. Somewhere near the closure landmark, XKL stops detecting periodicity and returns 0 for F0. Somewhere near the release landmark, it begins to detect periodicity and returns an actual value for F0.

Figure 4.2.1a shows the average F0 values at the release and closure landmarks for both VCV and CVC utterances. Each bar represents the average F0 over 3 utterances (voiced or unvoiced) spoken by each of 4 speakers. Data for VCV utterances are shown in the top plot and data for CVC utterances are shown in the bottom plot.
In the VCV utterances, the average F0 is 10 Hz higher for the unvoiced than for the voiced at the closure and 20 Hz higher for the unvoiced than for the voiced at the release. In the CVC utterances, the average F0 is 5 Hz higher for the unvoiced than for the voiced at the closure and 30 Hz higher for the unvoiced than for the voiced at the release.

[Bar charts: "Average F0 values in VCVs" (top) and "Average F0 values in CVCs" (bottom), with bars for unvoiced release, voiced release, unvoiced closure, and voiced closure.]

Figure 4.2.1a: Average F0 values at the release and closure landmarks for VCV (top) and CVC (bottom) utterances. Each bar represents the average F0 over 3 utterances, either voiced or unvoiced, spoken by 4 speakers.

F0 is on average nearly 100 Hz higher for the female speakers. Before drawing any conclusions about the validity of F0 as a speaker-independent acoustic cue for voicing, it is important to examine whether this average trend in F0 is actually more influenced by one gender than the other. Figure 4.2.1b shows that this trend holds in both genders. The light bars represent male data and the darker bars to the right represent female data. The measure of F0 at the closure or release landmark is therefore a valid cue for voicing.

[Bar charts: "F0 male and female values in VCVs" (top) and "F0 male and female values in CVCs" (bottom), with female and male bars for unvoiced release, voiced release, unvoiced closure, and voiced closure.]

Figure 4.2.1b: F0 values shown separately for males and females in VCVs (top plot) and CVCs (bottom plot).
The first bar in each group represents the average F0 for female speakers and the second bar represents the average F0 for male speakers.

4.3 Combining acoustic cues

The previous section summarized the behavior of six measures over time. The measures H1, F0, F1, and H1-H2 were determined to be robust cues for voicing in both the CVC and VCV utterances. The measures H1-A1 and H1-A3 were determined to be poor cues for voicing, partly because of the small sample size. This section tries to address the following question: given these valid cues, could a combination of them yield a better estimate of voicing? In certain contexts, a single cue may be too weak to distinguish voicing, but a combination of cues may be strong enough.

Figure 4.3a shows some examples of such combinations of voicing cues using the MAX values determined in section 4.2. The first row plots MAX H1-H2 against MAX H1, the second row plots MAX F1 against MAX H1, and the last row plots the actual values of F0 (as obtained in section 4.2.1) against MAX H1. H1 is used as the basis for these combinations because it is already a valid cue for voicing on its own. The voiced data are indicated by filled circles and the unvoiced data by unfilled circles. The 24 data points in each graph represent 6 stops in the VCV or CVC utterances spoken once by each of the 4 speakers.

At the release, the scatter plot of MAX H1-H2 against MAX H1 shows a clear separation between the voiced and unvoiced data points. The same observation applies to the two scatter plots below it. Except for one or two overlapping data points, the otherwise clean separation indicates that the combination of cues helps to distinguish voicing. The right side of Figure 4.3a shows that there is no such clear separation at the closure landmark. In the bottom plot, notice that there is a separation of data points depending on the gender of the speaker.
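The gender separation in the F0 values, and the fact that the unvoiced-voiced difference at the release holds for both genders, can be checked against the VCV data in Appendix B. The sketch below is illustrative only: the grouping of speakers AS and DS as male and LD and SS as female follows the labels in Appendix B, while the data layout and the function names are assumptions introduced here.

```python
from statistics import mean

# VCV F0 values in Hz from Appendix B, keyed by speaker. Each tuple is
# (voiced closure, unvoiced closure, voiced release, unvoiced release)
# for the b/p, g/k and d/t pairs in turn. Layout is an assumption.
VCV_F0 = {
    "AS": [(93, 104, 123, 105), (104, 101, 114, 146), (100, 100, 120, 137)],
    "DS": [(108, 128, 119, 157), (106, 112, 120, 140), (110, 107, 112, 146)],
    "LD": [(198, 201, 214, 215), (197, 211, 229, 248), (197, 212, 229, 239)],
    "SS": [(169, 181, 243, 255), (180, 199, 198, 254), (165, 191, 208, 243)],
}
GROUPS = {"male": ("AS", "DS"), "female": ("LD", "SS")}


def group_rows(group):
    """All (vc, uc, vr, ur) rows for the speakers in one gender group."""
    return [row for speaker in GROUPS[group] for row in VCV_F0[speaker]]


def group_mean(group):
    """Mean F0 over every landmark value of one gender group, in Hz."""
    return mean(value for row in group_rows(group) for value in row)


def release_gap(group):
    """Mean unvoiced-release F0 minus mean voiced-release F0, in Hz."""
    rows = group_rows(group)
    return mean(row[3] for row in rows) - mean(row[2] for row in rows)


print(f"female - male mean F0: {group_mean('female') - group_mean('male'):.1f} Hz")
for group in GROUPS:
    print(f"{group} release gap: {release_gap(group):+.1f} Hz")
```

On these values the female mean is about 94 Hz above the male mean, and the unvoiced release exceeds the voiced release by roughly 20 Hz for the male speakers and 22 Hz for the female speakers.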
Data from the female speakers are about 100 Hz higher than data from the male speakers. Figure 4.3b shows that for the CVC utterances there is no separation at the closure, and that while there is some separation at the release, it is not as clear as in the VCV utterances.

[Scatter plots in two columns, "VCV release" and "VCV closure", each with horizontal axis MAX H1 [dB]. Legend: open = unvoiced, filled = voiced.]

Figure 4.3a: Scatter plots of the voicing cues in VCV utterances using MAX values. MAX, as described before, is the maximum rate of change over 10 ms. The left side shows trends at the release and the right side at the closure. Voiced utterances are indicated by filled circles and unvoiced utterances by unfilled circles.

[Scatter plots in two columns, "CVC release" and "CVC closure", each with horizontal axis MAX H1 [dB]. Legend: open = unvoiced, filled = voiced.]

Figure 4.3b: Scatter plots of the voicing cues in CVC utterances using MAX values. MAX is the maximum rate of change over 10 ms. The left side shows trends at the release and the right side at the closure.
Voiced utterances are indicated by filled circles and unvoiced utterances by unfilled circles.

The other two measures, H1-A1 and H1-A3, which were determined in the previous section to be poor for distinguishing voicing, are shown in Figures 4.3c and 4.3d, combined with the measure H1. The goal is to determine whether these less robust cues are valuable in combination with another, more robust cue or cues. Figure 4.3c shows the combination of H1-A1 and H1 using their MAX values. There is virtually no activity at the closure, as both cues have a value close to zero. While it was difficult to determine voicing based on data from H1-A1 alone, the combined cues H1-A1 and H1 show a clear separation at the release of the VCV utterances. The separation is not as clear for the CVC utterances. Figure 4.3d shows that the combination of the H1-A3 and H1 cues yields similar results.

[Scatter plots in two columns, release and closure, with VCV data in the first row and CVC data in the second; horizontal axis MAX H1 [dB]. Legend: open = unvoiced, filled = voiced.]

Figure 4.3c: Scatter plots of the measurement H1-A1 against H1 using MAX values. The left side shows trends at the release and the right side at the closure. Data from VCV utterances are shown in the first row and data from CVC utterances are shown in the second row. Voiced utterances are indicated by filled circles and unvoiced utterances by unfilled circles.

[Scatter plots in two columns, release and closure, with VCV data in the first row and CVC data in the second; horizontal axis MAX H1 [dB]. Legend: open = unvoiced, filled = voiced.]

Figure 4.3d: Scatter plots of the measurement H1-A3 against H1 using MAX values.
The left side shows trends at the release and the right side at the closure. Data from VCV utterances are shown in the first row and data from CVC utterances are shown in the second row. Voiced utterances are indicated by filled circles and unvoiced utterances by unfilled circles.

In different speech environments, some cues for voicing will not be found at all while others will be more prominent. The discussion above showed that the voicing cues, even relatively weak ones, can be combined to take advantage of every cue that might be present.

Chapter 5

Summary and Discussions

This thesis examined the acoustic cues for the voicing contrast that exist in the vicinity of closure and release landmarks of stop consonants. Preliminary data collected from four speakers on isolated VCV and CVC utterances were analyzed for trends that prevailed across speakers. The following sections summarize these results and suggest further study.

5.1 Voicing cues in isolated utterances

The voicing cues discussed in this thesis were determined from examining isolated CVC and VCV utterances. The cues were chosen initially from knowledge of the speech production mechanism, which involved matching the physiological activities with their acoustic manifestations. The parameters chosen were the fundamental frequency, first formant frequency, first and second harmonic amplitudes, first and third formant amplitudes, time to onset of periodicity at the release, and time to offset of periodicity at the closure. A measure of periodicity during the region immediately after the stop consonant closure was used, and the same measure was used for determining the voice onset time at the release. The other measures were relevant in inferring the stiffness and spread of the vocal folds at 50 ms before closure and 40 ms after the onset of voicing.
Data showed that the voice onset time (VOT), the percentage of four 10-ms time frames at the closure that were not periodic (VOP), and vowel duration were the most important cues in determining voicing, and that the measures for spread and stiffness of the vocal folds can be secondary cues when the primary cues are absent from an utterance. Analysis of the secondary cues for the VCV utterances shows that H1, F0, F1, and H1-H2 can contribute to voicing identification at the release landmark. The same analysis showed that H1-A1 and H1-A3 are poor measures of voicing, given the size of the data set and the associated variability. When two voicing cues are combined, such as H1 and H1-H2, they contribute more to voicing detection for stop consonants.

5.2 Further work

The combination of one acoustic cue with H1, as shown in section 4.3, showed that two cues together can be more robust at distinguishing voicing than either cue alone. More work needs to concentrate on linear discriminant analysis or other methods to devise ways of best combining these cues. This combination needs to take into account different speech environments where one cue might be stronger than another or one cue might be missing altogether.

This thesis examined all parameters except F0 as they changed with time within an utterance, in order to draw conclusions based on the trend across all speakers. F0 was analyzed only at the point of voice offset or the point of voice onset. Utterances were examined for only 70 ms after the release. Voice onset time was on average 40 ms longer at the release of unvoiced stops than of voiced stops. The long voice onset time for unvoiced stops frequently extended beyond the 70-ms time frame used to capture change in the F0 parameter. This change was not monitored in this thesis as a result of the short time frame allowed after the release landmark. The time frame after the release should be extended so that the behavior of F0 can be more carefully analyzed.
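As one possible starting point for the linear discriminant analysis suggested earlier in this section, the sketch below computes a two-class Fisher discriminant for two cues at the release landmark: MAX H1 and F0. It is a minimal illustration under stated assumptions, not the method used in this thesis. The data points are the VCV values recoverable from Appendix B, read on the assumption that the first number listed for H1 at the voiced and unvoiced release is the MAX value; all function names are hypothetical.

```python
# (MAX H1 in dB, F0 in Hz) at the release landmark of VCV utterances,
# read from Appendix B: ahbah/ahpah for AS, DS, LD, SS, followed by
# ahgah/ahkah for AS, DS, LD. Pairing the two tables is an assumption.
VOICED = [(-0.4, 123), (-0.4, 119), (-0.3, 214), (-0.2, 243),
          (-0.3, 114), (-0.9, 120), (-0.2, 229)]
UNVOICED = [(0.9, 105), (8.1, 157), (-0.5, 215), (6.0, 255),
            (8.6, 146), (5.5, 140), (8.2, 248)]


def centroid(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def scatter(points, center):
    """Entries (sxx, sxy, syy) of the 2x2 scatter matrix about center."""
    cx, cy = center
    sxx = sum((x - cx) ** 2 for x, _ in points)
    sxy = sum((x - cx) * (y - cy) for x, y in points)
    syy = sum((y - cy) ** 2 for _, y in points)
    return sxx, sxy, syy


def project(w, point):
    """Scalar projection of a point onto the discriminant direction w."""
    return w[0] * point[0] + w[1] * point[1]


def fisher_discriminant(class_a, class_b):
    """Return (w, threshold) with w = Sw^-1 (mean_b - mean_a)."""
    ma, mb = centroid(class_a), centroid(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    sxx, sxy, syy = (a + b for a, b in zip(sa, sb))  # within-class scatter
    det = sxx * syy - sxy * sxy
    dx, dy = mb[0] - ma[0], mb[1] - ma[1]
    # Closed-form inverse of the symmetric 2x2 matrix applied to (dx, dy).
    w = ((syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det)
    # Decision threshold halfway between the projected class means.
    threshold = 0.5 * (project(w, ma) + project(w, mb))
    return w, threshold


w, threshold = fisher_discriminant(VOICED, UNVOICED)


def classify(point):
    return "unvoiced" if project(w, point) > threshold else "voiced"


correct = (sum(classify(p) == "voiced" for p in VOICED)
           + sum(classify(p) == "unvoiced" for p in UNVOICED))
print(f"w = ({w[0]:.4f}, {w[1]:.5f}), threshold = {threshold:.3f}")
print(f"{correct} of {len(VOICED) + len(UNVOICED)} points on the expected side")
```

On these 14 points the discriminant is dominated by MAX H1, and the two unvoiced tokens with small MAX H1 (ahpah for speakers AS and LD) fall on the voiced side. Because the same points are used to fit and to evaluate the discriminant, the count is only a description of the sample and not an estimate of performance on new speakers.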
The voice offset periodicity was defined in this thesis as the percentage of periodicity detected by the XKL program over four 10-ms time frames. Whether XKL returns a value for F0 depends on factors such as the window size. The same window size would encompass more points for female speakers, so it could be easier to detect F0 in female speech. Periodicity might not be returned if each period within a time window is different as a result of changes in the frequency, amplitude, and spectrum of each pulse. A more reliable way to measure periodicity needs to be developed.

The voicing cues discussed in this thesis are found in relatively simple contexts, such as consonant-vowel (CV) or vowel-consonant (VC) utterances. By limiting context, it was possible to examine patterns in stop consonant voicing. The cues apply in contexts containing stressed vowels, such as in the word /cut/. When the stop consonant occurs in more complex contexts such as casually spoken sentences, however, those cues may not be consistently present in the sound. Depending on factors such as noise and speaker modification, the relative importance of these cues will vary. For example, in noisy speech, the primary voicing cue in citation speech, VOT, may not be very apparent.

An example of variability is the change from the voiceless and aspirated /t/ in "top" to the voiced and unaspirated /t/ in "stop". A similar change occurs in /p/ from "pot" to "spot". These modifications could be the result of overlapping articulatory gestures for voicelessness which obscure voicing or vice versa. Another example of modification is a blending of the place of articulation of a syllable-final alveolar noncontinuant consonant with that of a following consonant, as in the sequence "note book" (Stevens, 1996). The /t/ is glottalized and the /o/ is shortened in this sequence. Other examples of variability include the effect of flaps, as in the word "writer".
The /t/ in "writer" has been flapped so that it is difficult to distinguish from the /d/ in "rider". Effects of vowel duration also need to be examined. For instance, the second vowel in "paper" is a reduced vowel, and this causes the second /p/ to be produced unaspirated. Despite these kinds of variability, a native speaker of English can still rely on some set of clues to help extract what has been heard. A speech recognizer must be able to do the same. A goal for further research would be to determine which voicing cues that apply in citation speech would still apply in continuous speech and which new cues, if any, should be considered.

The utterances used in this thesis were recorded by 4 speakers, 2 female and 2 male. To make sure that the observations made in this thesis hold in general, more speaker data need to be examined in order to discount particular speaker differences such as breathiness. The variability of articulation rate should be controlled by training speakers beforehand so that this variability does not contribute as much to the standard error of the vowel duration. More speaker data are needed to provide a more meaningful conclusion about each measurement.

Bibliography

[1] Blumstein, S. E., K. N. Stevens, L. Glicksman, M. Burton, K. Kurowski. Acoustic and perceptual characteristics of voicing in fricatives and fricative clusters. Journal of the Acoustical Society of America, Vol. 91, 2979-3000, 1992.
[2] Choi, J. Y. Detection of consonant voicing: A module for a hierarchical speech recognition system. Ph.D. thesis, MIT, 1999.
[3] Chomsky, N. and M. Halle. The Sound Pattern of English. New York, Harper and Row, 1968.
[4] Crystal, T. H. and A. S. House. Articulation rate and the duration of syllables and stress groups in connected speech. Journal of the Acoustical Society of America, Vol. 88, 101-112, 1990.
[5] Duda, Richard. Pattern classification and scene analysis. New York, Wiley, 1973.
[6] Hanson, H. M. and E. S. Chuang.
Glottal characteristics of male speakers: Acoustic correlates and comparison with female data. Journal of the Acoustical Society of America, Vol. 106, 1064-1077, 1999.
[7] House, A. S. On vowel duration in English. Journal of the Acoustical Society of America, Vol. 33, 1174-1178, 1961.
[8] Klatt, Dennis. Linguistic uses of segmental duration in English: Acoustic and perceptual evidence. Journal of the Acoustical Society of America, Vol. 59, 1208-1221, 1976.
[9] Klatt, Dennis. MIT SpeechVAX user's guide. Internal Memorandum, 1984.
[10] Klatt, Dennis. Software for a cascade/parallel formant synthesizer. Journal of the Acoustical Society of America, Vol. 67, 971-995, 1980.
[11] Lindblom, B. Spectrographic study of vowel reduction. Journal of the Acoustical Society of America, Vol. 35, 1773-1781, 1963.
[12] Lisker, L. and A. S. Abramson. A cross-language study of voicing in initial stops: Acoustic measurements. Word, Vol. 20, 384-422, 1964.
[13] Manuel, S. Y. and K. N. Stevens. Revisiting place of articulation measures for stop consonants: Implications for models of consonant production. ICPhS, San Francisco, 1999.
[14] Nigrin, A. Neural Networks for Pattern Recognition. MIT Press, 1993.
[15] Ohde, R. N. Fundamental frequency as an acoustic correlate of stop consonant voicing. Journal of the Acoustical Society of America, Vol. 33, 224-230, 1984.
[16] S. K. Keyser and K. N. Stevens. Feature geometry and the vocal tract. Phonology, Vol. 11, 207-236, 1994.
[17] Stevens, K. N. Acoustic Phonetics. MIT Press, 1998.
[18] Stevens, K. N. On the quantal nature of speech. Journal of Phonetics, Vol. 17, 3-46, 1989.
[19] Stevens, K. N. Applying phonetic knowledge to lexical access. 4th European Conference on Speech Communication and Technology, Madrid, Spain, Vol. 1, 3-11, 1995.
[20] Stevens, K. N. Understanding variability in speech: A requisite for advances in speech synthesis and recognition. ASA-ASJ Third Joint Meeting, 1996.
[21] Stevens, K. N. and A. S. House.
Perturbation of vowel articulations by consonant context: An acoustical study. Journal of Speech and Hearing Research, Vol. 6, 111-128, 1963.

Appendix A

[Plots not recoverable from the extracted text. Each plot compares a voiced utterance (filled symbols) with its unvoiced counterpart (open symbols) for one speaker, in panels aligned at the closure and release landmarks with time in ms on the horizontal axis. Titles that survive extraction name the CVC pairs bahb/pahp, dahd/taht, and gahg/kahk, the VCV pairs ahbah/ahpah, ahdah/ahtah, and ahgah/ahkah, and the speakers as, ds, ld, and ss.]

Appendix B

F0 [Hz] data from speakers as, ds, ld, and ss that were used to plot Figure 4.2.1a. vc: voiced closure, uc: unvoiced closure, vr: voiced release, ur: unvoiced release.
VCV utterances

Utterance  Speaker    vc    uc    vr    ur
ahbah      as         93   104   123   105
ahgah      as        104   101   114   146
ahdah      as        100   100   120   137
ahbah      ds        108   128   119   157
ahgah      ds        106   112   120   140
ahdah      ds        110   107   112   146
ahbah      ld        198   201   214   215
ahgah      ld        197   211   229   248
ahdah      ld        197   212   229   239
ahbah      ss        169   181   243   255
ahgah      ss        180   199   198   254
ahdah      ss        165   191   208   243

CVC utterances

Utterance  Speaker    vc    uc    vr    ur
bahb       as         87    94   110   122
gahg       as        105    97   108   113
dahd       as        104    98   108   142
bahb       ds        100   101   127   115
gahg       ds         95   100   121   160
dahd       ds         94    95   119   137
bahb       ld        179   197   228   245
gahg       ld        175   181   228   252
dahd       ld        166   168   226   223
bahb       ss        151   160   201   271
gahg       ss        163   188   213   275
dahd       ss        162   164   207   299

F0 [Hz], H1 [dB], F1 [Hz], H1-H2 [dB], H1-A1 [dB], and H1-A3 [dB] data used to plot Figures 4.2a and 4.2b. The first number represents the Max value and the second represents the Diff value. vr: voiced release, vc: voiced closure, ur: unvoiced release, uc: unvoiced closure.
VCV utterances

Male Speaker AS ahbah/ahpah
F0_vr: -8.0000 -25.0000
F0_vc: -2.0000 -3.0000
F0_ur: 105.0000 103.0000
F0_uc: -2.0000 -3.0000
H1_vr: -0.4000 -0.6000
H1_vc: -1.6000 -2.0000
H1_ur: 0.9000 1.3000
H1_uc: -1.8000 -1.9000
F1_vr: 22.0000 39.0000
F1_vc: -58.0000 -117.0000
F1_ur: -20.0000 -20.0000
F1_uc: -39.0000 -98.0000
H1H2_vr: -0.3000 -0.4000
H1H2_vc: 1.1000 1.7000
H1H2_ur: -2.6000 -3.7000
H1H2_uc: 1.0000 2.3000
H1_A1_vr: -0.8000 -1.7000
H1_A1_vc: 1.1000 3.3000
H1_A1_ur: -5.2000 -6.8000
H1_A1_uc: 1.1000 2.2000
H1A3_vr: -0.7000 -1.2000
H1A3_vc: 5.5000 9.5000
H1A3_ur: -3.8000 -3.9000
H1A3_uc: -3.4000 -7.5000

Male Speaker DS ahbah/ahpah
F0_vr: -4.0000 -6.0000
F0_vc: -108.0000 -111.0000
F0_ur: 157.0000 157.0000
F0_uc: 1.0000 1.0000
H1_vr: -0.4000 -1.1000
H1_vc: -1.1000 -2.1000
H1_ur: 8.1000 18.5000
H1_uc: -1.9000 -2.0000
F1_vr: 20 20
F1_vc: -58.0000 -137.0000
F1_ur: -58.0000 -117.0000
F1_uc: -39.0000 -79.0000
H1H2_vr: -0.3000 -0.6000
H1H2_vc: -0.7000 -0.8000
H1H2_ur: -1.9000 -1.9000
H1H2_uc: 1.8000 2.5000
H1_A1_vr: -1.0000 -2.3000
H1_A1_vc: 1.3000 -0.2000
H1_A1_ur: -13.2000 -5.3000
H1_A1_uc: 1.7000 2.5000
H1A3_vr: -1.3000 -1.7000
H1A3_vc: 5.2000 7.8000
H1A3_ur: 8.9000 14.3000
H1A3_uc: 3.2000 4.4000

Female Speaker LD ahbah/ahpah
F0_vr: 95.0000 93.0000
F0_vc: 2.0000 -1.0000
F0_ur: 26.0000 30.0000
F0_uc: -5.0000 -10.0000
H1_vr: -0.3000 -0.4000
H1_vc: -0.5000 0.4000
H1_ur: -0.5000 -0.6000
H1_uc: -1.7000 -2.4000
F1_vr: 156.0000 176.0000
F1_vc: -19.0000 -19.0000
F1_ur: -20 -20
F1_uc: -175.0000 -195.0000
H1H2_vr: 1.0000 1.3000
H1H2_vc: 1.0000 -0.1000
H1H2_ur: -10.7000 -2.2000
H1H2_uc: 1.3000 0.9000
H1_A1_vr: 1.2000 0.6000
H1_A1_vc: 2.6000 3.3000
H1_A1_ur: 0.4000 0.5000
H1_A1_uc: 0.9000 1.7000
H1A3_vr: -1.4000 -0.3000
H1A3_vc: 2.8000 7.0000
H1A3_ur: 1.9000 2.2000
H1A3_uc: 2.6000 6.5000

Female Speaker SS ahbah/ahpah
F0_vr: 243.0000 197.0000
F0_vc: 2 6
F0_ur: 255.0000 238.0000
F0_uc: 3.0000 3.0000
H1_vr: -0.2000 -0.3000
H1_vc: -0.9000 -0.4000
H1_ur: 6.0000 1.3000
H1_uc: -1.7000 -0.6000
F1_vr: 137.0000 156.0000
F1_vc: -117.0000 -195.0000
F1_ur: 39.0000 19.0000
F1_uc: -117.0000 -196.0000
H1H2_vr: -0.5000 -0.8000
H1H2_vc: -3.700 -7.9000
H1H2_ur: -4.2000 -6.1000
H1H2_uc: -2.0000 0.3000
H1_A1_vr: -0.9000 -1.4000
H1_A1_vc: 1.4000 0.3000
H1_A1_ur: -11.4000 -19.8000
H1_A1_uc: 2.2000 4.3000
H1A3_vr: 1.4000 3.5000
H1A3_vc: 3.3000 6.6000
H1A3_ur: -15.6000 -9.2000
H1A3_uc: 3.3000 8.1000

Male Speaker AS ahgah/ahkah
F0_vr: 114.0000 102.0000
F0_vc: -2.0000 -3.0000
F0_ur: 146.0000 129.0000
F0_uc: -2.0000 -2.0000
H1_vr: -0.3000 -0.4000
H1_vc: -1.0000 -1.3000
H1_ur: 8.6000 18.4000
H1_uc: -1.6000 -1.1000
F1_vr: 156.0000 195.0000
F1_vc: -215.0000 -313.0000
F1_ur: -90.0000 -156.000
F1_uc: -391.0000 -489.0000
H1H2_vr: 0.3000 0.5000
H1H2_vc: -0.9000 -1.4000
H1H2_ur: -2.1000 -3.0000
H1H2_uc: 0.9000 2.6000
H1_A1_vr: -1.6000 -0.4000
H1_A1_vc: -2.1000 -2.1000
H1_A1_ur: 7.0000 -0.2000
H1_A1_uc: 2.9000 6.4000
H1A3_vr: -2.8000 -3.6000
H1A3_vc: 3.7000 5.7000
H1A3_ur: 7.7000 15.3000
H1A3_uc: 4.6000 9.9000

Male Speaker DS ahgah/ahkah
F0_vr: 120.0000 110.0000
F0_vc: -107.0000 -108.0000
F0_ur: 140.0000 124.0000
F0_uc: 3.0000 4.0000
H1_vr: -0.9000 -1.4000
H1_vc: 0.6000 0.4000
H1_ur: 5.5000 3.9000
H1_uc: 2.0000 3.8000
F1_vr: 39.0000 79.0000
F1_vc: -156.0000 -312.0000
F1_ur: -100.0000 18.0000
F1_uc: -137.0000 -274.0000
H1H2_vr: 0.4000 0
H1H2_vc: 0.8000 1.4000
H1H2_ur: -4.2000 -9.6000
H1H2_uc: 2.0000 7.1000
H1_A1_vr: -1.8000 -2.4000
H1_A1_vc: 1.6000 4.1000
H1_A1_ur: -12.5000 -19.4000
H1_A1_uc: 3.3000 11.0000
H1A3_vr: -0.7000 -0.9000
H1A3_vc: 2.3000 3.6000
H1A3_ur: -9.8000 -7.7000
H1A3_uc: 2.2000 7.0000

Female Speaker LD ahgah/ahkah
F0_vr: 229.0000 222.0000
F0_vc: 3.0000 8.0000
F0_ur: 248 248
F0_uc: -6.0000 -6.0000
H1_vr: -0.2000 -0.3000
H1_vc: -0.4000 0.3000
H1_ur: 8.2000 12.0000
H1_uc: -3.4000 -4.2000
F1_vr: -20 -20
F1_vc: -411.0000 -411.0000
F1_ur: 39.0000 78.0000
F1_uc: -430.0000 -449.0000
H1H2_vr: 0.3000 0.5000
H1H2_vc: 3.0000 3.6000
H1H2_ur: -0.7000 -1.7000
H1H2_uc: 3.0000 6.9000
H1_A1_vr: -3.9000 -6.1000
H1_A1_vc: 5.4000 7.5000
H1_A1_ur: -8.9000 -13.7000
H1_A1_uc: 4.0000 3.6000
H1A3_vr: -1.7000 -1.9000
H1A3_vc: 6.7000 10.2000
H1A3_ur: 7.2000 0.9000
H1A3_uc: 6.4000 16.6000

Female Speaker SS ahgah/ahkah
F0_vr: 103.0000 193.0000
F0_vc: -8.0000 -9.0000
F0_ur: 254.0000 254.0000
F0_uc: 1.0000 1.0000
H1_vr: -0.2000 -0.4000
H1_vc: -1.5000 -3.2000
H1_ur: 21.5000 30.6000
H1_uc: -1.9000 0
F1_vr: 0 0
F1_vc: -332.0000 -273.0000
F1_ur: -125.0000 -86.0000
F1_uc: -157.0000 -195.0000
H1H2_vr: 0.5000 1.4000
H1H2_vc: -2.3000 -5.8000
H1H2_ur: 2.2000 0.6000
H1H2_uc: 12.6000 18.8000
H1_A1_vr: -3.2000 -5.5000
H1_A1_vc: 2.8000 1.6000
H1_A1_ur: 13.7000 6.2000
H1_A1_uc: 3.4000 9.2000
H1A3_vr: -2.6000 -5.8000
H1A3_vc: 4.7000 8.2000
H1A3_ur: 21.2000 30.1000
H1A3_uc: 4.8000 9.8000

Male Speaker AS ahdah/ahtah
F0_vr: 183.0000 101.0000
F0_vc: -1.0000 -1.0000
F0_ur: 137.0000 137.0000
F0_uc: -2.0000 0
H1_vr: 0.3000 0.5000
H1_vc: -1.1000 -1.3000
H1_ur: 9.3000 17.7000
H1_uc: -2.3000 -2.5000
F1_vr: 20.0000 58.0000
F1_vc: -58.0000 -156.0000
F1_ur: 137.0000 117.0000
F1_uc: -39.0000 -136.0000
H1H2_vr: -0.5000 -0.5000
H1H2_vc: -0.2000 -0.2000
H1H2_ur: -6.3000 -0.7000
H1H2_uc: 0.4000 1.3000
H1_A1_vr: 1.9000 1.0000
H1_A1_vc: -0.3000 -0.1000
H1_A1_ur: 11.6000 11.4000
H1_A1_uc: 1.6000 3.4000
H1A3_vr: -2.2000 -1.3000
H1A3_vc: 1.5000 3.5000
H1A3_ur: 7.5000 15.7000
H1A3_uc: 3.6000 7.8000

Male Speaker DS ahdah/ahtah
F0_vr: 112.0000 107.0000
F0_vc: 1.0000 0
F0_ur: 146.0000 140.0000
F0_uc: -107.0000 -115.0000
H1_vr: -0.3000 -0.2000
H1_vc: -1.7000 -2.1000
H1_ur: 6.3000 8.7000
H1_uc: 2.3000 1.4000
F1_vr: 39.0000 59.0000
F1_vc: -58.0000 -137.0000
F1_ur: -78.0000 -59.0000
F1_uc: -109.0000 -188.0000
H1H2_vr: 0.4000 0.2000
H1H2_vc: -0.3000 -0.4000
H1H2_ur: -2.4000 -6.4000
H1H2_uc: 8.6000 6.0000
H1_A1_vr: 1.4000 2.4000
H1_A1_vc: -1.4000 -0.7000
H1_A1_ur: -7.0000 -7.5000
H1_A1_uc: 4.6000 13.4000
H1A3_vr: 2.1000 5.4000
H1A3_vc: -3.4000 -9.4000
H1A3_ur: -18.0000 -8.5000
H1A3_uc: 4.0000 8.9000

Female Speaker LD ahdah/ahtah
F0_vr: 229.0000 218.0000
F0_vc: 1.0000 0
F0_ur: 239.0000 239.0000
F0_uc: 2.0000 1.0000
H1_vr: -0.2000 -0.4000
H1_vc: -0.8000 -0.4000
H1_ur: 13.6000 23.8000
H1_uc: -2.1000 -2.2000
F1_vr: 0 0
F1_vc: -352.0000 -372.0000
F1_ur: -59.0000 -136.0000
F1_uc: 0 0
H1H2_vr: 0.7000 1.4000
H1H2_vc: 1.9000 1.0000
H1H2_ur: -1.9000 -3.5000
H1H2_uc: 1.3000 2.2000
H1_A1_vr: -3.6000 -5.3000
H1_A1_vc: 2.5000 6.3000
H1_A1_ur: 11.8000 0.8000
H1_A1_uc: 1.4000 2.9000
H1A3_vr: -3.2000 -4.8000
H1A3_vc: 1.5000 1.2000
H1A3_ur: 15.3000 21.8000
H1A3_uc: 3.3000 10.4000

Female Speaker SS ahdah/ahtah
F0_vr: 208.0000 193.0000
F0_vc: 4.0000 7.0000
F0_ur: -50.0000 -54.0000
F0_uc: 2.0000 3.0000
H1_vr: -0.4000 -0.8000
H1_vc: -2.4000 -2.3000
H1_ur: 4.2000 -1.5000
H1_uc: -2.4000 -3.7000
F1_vr: 0 0
F1_vc: -312.0000 -391.0000
F1_ur: 69.0000 30.0000
F1_uc: -137.0000 -215.0000
H1H2_vr: -0.3000 -0.7000
H1H2_vc: -0.8000 -1.1000
H1H2_ur: -5.8000 -10.2000
H1H2_uc: 3.9000 5.8000
H1_A1_vr: -4.7000 -2.3000
H1_A1_vc: 10.8000 13.7000
H1_A1_ur: -11.1000 -14.6000
H1_A1_uc: 2.6000 4.2000
H1A3_vr: -2.3000 -3.6000
H1A3_vc: 4.5000 8.7000
H1A3_ur: -7.5000 -8.3000
H1A3_uc: 1.9000 2.5000

CVC utterances

bahb/pahp Male Speaker AS
F0_vr: 110.0000 102.0000
F0_vc: -5.0000 -8.0000
F0_ur: 122.0000 122.0000
F0_uc: -56.0000 -102.0000
H1_vr: -0.5000 -0.2000
H1_vc: -2.2000 -6.7000
H1_ur: 10.1000 7.8000
H1_uc: -2.2000 -6.7000
F1_vr: 78.0000 137.0000
F1_vc: -293.0000 -351.0000
F1_ur: -97.0000 -39.0000
F1_uc: -117.0000 -157.0000
H1H2_vr: 0.3000 0.5000
H1H2_vc: 0.3000 0.8000
H1H2_ur: 8.3000 4.1000
H1H2_uc: 6.0000 12.6000
H1_A1_vr: 1.3000 0.7000
H1_A1_vc: -0.8000 -0.7000
H1_A1_ur: 10.7000 7.0000
H1_A1_uc: 6.6000 7.1000
H1A3_vr: -1.2000 -2.0000
H1A3_vc: 5.6000 8.4000
H1A3_ur: 12.3000 10.1000
H1A3_uc: 15.0000 17.7000

bahb/pahp Male Speaker DS
F0_vr: 127.0000 131.0000
F0_vc: -6.0000 -19.0000
F0_ur: 68.0000 115.0000
F0_uc: -101.0000 -101.0000
H1_vr: -1.1000 -1.0000
H1_vc: -2.3000 -6.7000
H1_ur: -5.5000 -6.6000
H1_uc: -2.3000 -6.7000
F1_vr: 20.0000 20.0000
F1_vc: -39.0000 -78.0000
F1_ur: -59.0000 -19.0000
F1_uc: -19.0000 -19.0000
H1H2_vr: -1.1000 -0.5000
H1H2_vc: -1.4000 -4.2000
H1H2_ur: -5.8000 -14.2000
H1H2_uc: 17.7000 19.7000
H1_A1_vr: 0.5000 0.6000
H1_A1_vc: -4.7000 -6.1000
H1_A1_ur: -12.1000 -27.3000
H1_A1_uc: 7.6000 16.4000
H1A3_vr: -1.6000 -2.5000
H1A3_vc: -2.4000 -0.7000
H1A3_ur: -12.6000 -27.5000
H1A3_uc: 7.6000 16.4000

bahb/pahp Female Speaker LD
F0_vr: 228.0000 232.0000
F0_vc: -4.0000 -10.0000
F0_ur: 245.0000 242.0000
F0_uc: 18.0000 13.0000
H1_vr: -0.4000 -0.2000
H1_vc: -1.3000 -2.5000
H1_ur: 1.8000 1.7000
H1_uc: -1.3000 -2.5000
F1_vr: 19.0000 19.0000
F1_vc: -59.0000 -98.0000
F1_ur: 58.0000 39.0000
F1_uc: 19.0000 19.0000
H1H2_vr: 0.8000 1.6000
H1H2_vc: -1.8000 -2.9000
H1H2_ur: -1.0000 -1.9000
H1H2_uc: 3.6000 2.7000
H1_A1_vr: 0.7000 1.7000
H1_A1_vc: 5.7000 4.0000
H1_A1_ur: -5.9000 -12.9000
H1_A1_uc: -2.0000 2.1000
H1A3_vr: -0.8000 -2.3000
H1A3_vc: 4.3000 8.2000
H1A3_ur: -6.3000 -10.6000
H1A3_uc: 8.7000 8.1000

bahb/pahp Female Speaker SS
F0_vr: 201.0000 184.0000
F0_vc: -7.0000 -17.0000
F0_ur: 270.0000 224.0000
F0_uc: -31.0000 -91.0000
H1_vr: -0.8000 -2.2000
H1_vc: 1.1000 0
H1_ur: -3.9000 -4.0000
H1_uc: 1.1000 0
F1_vr: -19.0000 -19.0000
F1_vc: -40.0000 -78.0000
F1_ur: 50.0000 50.0000
F1_uc: -39.0000 -39.0000
H1H2_vr: -1.9000 -3.5000
H1H2_vc: 1.4000 3.3000
H1H2_ur: 3.4000 0.3000
H1H2_uc: -3.2000 -4.6000
H1_A1_vr: -2.7000 -6.1000
H1_A1_vc: 3.0000 8.0000
H1_A1_ur: -6.6000 -11.0000
H1_A1_uc: -3.1000 -6.4000
H1A3_vr: -1.0000 -0.8000
H1A3_vc: 3.2000 7.1000
H1A3_ur: -5.3000 -14.7000
H1A3_uc: -5.5000 -10.3000

gahg/kahk Male Speaker AS
F0_vr: 108.0000 101.0000
F0_vc: -100 -100 40
F0_ur: 113.0000 113.0000
F0_uc: -1.0000 -2.0000
H1_vr: 1.8000 2.4000
H1_vc: -1.5000 -2.4000
H1_ur: 13.1000 14.9000
H1_uc: -1.5000 -2.4000
F1_vr: 98.0000 98.0000
F1_vc: -156.0000 -235.0000
F1_ur: -98.0000 -157.0000
F1_uc: 20.0000 -19.0000
H1H2_vr: -1.5000 -2.0000
H1H2_vc: -0.2000 -0.6000
H1H2_ur: -8.0000 -4.7000
H1H2_uc: 11.3000 4.5000
H1_A1_vr: -2.4000 -2.9000
H1_A1_vc: -1.4000 -1.6000
H1_A1_ur: -12.8000 0.4000
H1_A1_uc: 10.6000 10.3000
H1A3_vr: -1.3000 -1.4000
H1A3_vc: 3.0000 7.6000
H1A3_ur: 13.3000 8.3000
H1A3_uc: 12.3000 7.1000

gahg/kahk Male Speaker DS
F0_vr: 121.0000 104.0000
F0_vc: -95.0000 -7.0000
F0_ur: 100.0000 145.0000
F0_uc: 100.0000 0
H1_vr: 0.6000 0.1000
H1_vc: -1.2000 -3.3000
H1_ur: 8.1000 10.3000
H1_uc: -1.2000 -3.3000
F1_vr: 0 0 0
F1_vc: -59.0000 -215.0000
F1_ur: -164.0000 -254.0000
F1_uc: -39.0000 -59.0000
H1H2_vr: 0.5000 0.8000
H1H2_vc: -1.7000 -1.8000
H1H2_ur: -1.7000 -3.4000
H1H2_uc: 2.1000 1.6000
H1_A1_vr: 1.4000 2.5000
H1_A1_vc: 2.2000 4.3000
H1_A1_ur: -8.1000 -17.6000
H1_A1_uc: -4.0000 -5.8000
H1A3_vr: -1.6000 0.4000
H1A3_vc: 3.8000 7.6000
H1A3_ur: 9.0000 -0.7000
H1A3_uc: -7.1000 -9.8000

gahg/kahk Female Speaker LD
F0_vr: 228.0000 101.0000
F0_vc: 1.0000 1.0000
F0_ur: -91.0000 -91.0000
F0_uc: -43.0000 -66.0000
H1_vr: -0.3000 -0.7000
H1_vc: -1.6000 -1.0000
H1_ur: 11.7000 19.5000
H1_uc: -1.6000 -1.0000
F1_vr: 0 0
F1_vc: -371.0000 -371.0000
F1_ur: 58.0000 39.0000
F1_uc: -39.0000 -97.0000
H1H2_vr: -5.7000 -5.4000
H1H2_vc: -1.6000 -0.4000
H1H2_ur: 3.4000 1.1000
H1H2_uc: -2.3000 -6.5000
H1_A1_vr: -3.7000 -7.1000
H1_A1_vc: 8.2000 18.7000
H1_A1_ur: -5.3000 -8.6000
H1_A1_uc: -2.3000 -3.5000
H1A3_vr: -1.1000 -1.9000
H1A3_vc: 9.5000 11.6000
H1A3_ur: 7.2000 1.8000
H1A3_uc: -4.0000 -9.1000

gahg/kahk Female Speaker SS
F0_vr: 213.0000 193.0000
F0_vc: -4.0000 -3.0000
F0_ur: 275.0000 275.0000
F0_uc: -169.0000 -169.0000
H1_vr: -1.3000 -1.6000
H1_vc: 1.5000 1.8000
H1_ur: 3.7000 5.4000
H1_uc: 1.5000 1.8000
F1_vr: -20.0000 -20.0000
F1_vc: -176.0000 -351.0000
F1_ur: 19.0000 33.0000
F1_uc: -39.0000 -58.0000
H1H2_vr: -2.1000 -3.1000
H1H2_vc: 3.3000 2.1000
H1H2_ur: -1.7000 -4.1000
H1H2_uc: 4.0000 0.7000
H1_A1_vr: -4.9000 -3.4000
H1_A1_vc: 4.2000 10.0000
H1_A1_ur: -4.2000 -6.8000
H1_A1_uc: -4.5000 -0.1000
H1A3_vr: -1.4000 -0.6000
H1A3_vc: 8.6000 17.4000
H1A3_ur: -8.1000 -2.7000
H1A3_uc: -4.1000 -3.1000

dahd/taht Male Speaker AS
F0_vr: 108.0000 98.0000
F0_vc: -3.0000 -3.0000
F0_ur: 142.0000 142.0000
F0_uc: -6.0000 -5.0000
H1_vr: -0.6000 -1.4000
H1_vc: 6.3000 5.4000
H1_ur: 8.0000 9.7000
H1_uc: 6.3000 5.4000
F1_vr: 39.0000 78.0000
F1_vc: -195.0000 -253.0000
F1_ur: 83.0000 156.0000
F1_uc: -39.0000 0
H1H2_vr: 0.6000 0.8000
H1H2_vc: 14.4000 16.200
H1H2_ur: 1.3000 1.3000
H1H2_uc: 13.6000 14.4000
H1_A1_vr: 1.4000 0.8000
H1_A1_vc: 8.1000 9.6000
H1_A1_ur: 6.2000 8.1000
H1_A1_uc: 11.9000 14.1000
H1A3_vr: 1.4000 1.2000
H1A3_vc: 8.9000 11.5000
H1A3_ur: 8.0000 3.7000
H1A3_uc: 10.0000 17.6000

dahd/taht Male Speaker DS
F0_vr: 119.0000 100.0000
F0_vc: 89.0000 -3.0000
F0_ur: 137.0000 138.0000
F0_uc: 76.0000 0
H1_vr: -1.5000 -3.1000
H1_vc: -4.1000 -7.1000
H1_ur: 6.1000 8.0000
H1_uc: -4.1000 -7.1000
F1_vr: 39.0000 19.0000
F1_vc: -20.0000 -39.0000
F1_ur: 29.0000 39.0000
F1_uc: -39.0000 -78.0000
H1H2_vr: -0.7000 -0.9000
H1H2_vc: -3.4000 -5.7000
H1H2_ur: -2.8000 -5.6000
H1H2_uc: 4.9000 13.5000
H1_A1_vr: 1.1000 1.9000
H1_A1_vc: -2.9000 -5.6000
H1_A1_ur: -11.3000 -18.0000
H1_A1_uc: 6.0000 -0.7000
H1A3_vr: -1.3000 -1.6000
H1A3_vc: -4.0000 -5.6000
H1A3_ur: 4.9000 -0.5000
H1A3_uc: 6.0000 -1.1000

dahd/taht Female Speaker LD
F0_vr: 226.0000 220.0000
F0_vc: -3.0000 -5.0000
F0_ur: 223.0000 223.0000
F0_uc: -178.0000 -188.0000
H1_vr: -0.1000 -0.2000
H1_vc: -1.3000 -4.0000
H1_ur: 6.1000 13.7000
H1_uc: -1.3000 -4.0000
F1_vr: 0 0
F1_vc: -20.0000 -39.0000
F1_ur: -98.0000 -127.0000
F1_uc: -20.0000 -39.0000
H1H2_vr: 1.1000 2.5000
H1H2_vc: 12.8000 12.4000
H1H2_ur: 3.6000 4.2000
H1H2_uc: 6.6000 2.8000
H1_A1_vr: -3.2000 -4.0000
H1_A1_vc: 4.3000 7.2000
H1_A1_ur: -4.5000 1.2000
H1_A1_uc: 4.0000 0.1000
H1A3_vr: 2.4000 1.1000
H1A3_vc: 1.0000 0.7000
H1A3_ur: 6.8000 16.5000
H1A3_uc: 7.6000 4.0000

dahd/taht Female Speaker SS
F0_vr: -4.0000 -6.0000
F0_vc: -8.0000 -26.0000
F0_ur: 278.0000 278.0000
F0_uc: -164.0000 -179.0000
H1_vr: -0.4000 -1.0000
H1_vc: -1.4000 -3.8000
H1_ur: 7.8000 8.4000
H1_uc: -1.4000 -3.8000
F1_vr: 137.0000 137.0000
F1_vc: -39.0000 -117.0000
F1_ur: -210.0000 -341.0000
F1_uc: -39.0000 -39.0000
H1H2_vr: -1.5000 -2.8000
H1H2_vc: 1.3000 0.4000
H1H2_ur: -3.4000 -6.6000
H1H2_uc: -23.9000 -22.9000
H1_A1_vr: 2.5000 3.0000
H1_A1_vc: 3.0000 5.0000
H1_A1_ur: -7.2000 -13.2000
H1_A1_uc: -12.8000 -18.4000
H1A3_vr: 0.5000 -0.3000
H1A3_vc: 3.3000 3.6000
H1A3_ur: -8.6000 0.3000
H1A3_uc: -4.5000 -10.1000
