Vowel Recognition: Formants, Spectral Peaks, and Spectral Shape Representations

James Hillenbrand
Speech Pathology and Audiology, Western Michigan University, Kalamazoo, MI 49008

Robert A. Houde
RIT Research Corporation, 125 Tech Park Drive, Rochester, NY 14623

Invited presentation, Acoustical Society of America, St. Louis, November, 1995.

Hillenbrand & Houde (1995), St. Louis ASA Talk

We have decided to restrict our comments today to just a few issues having to do with spectral representations for vowels. One of the main issues that we'd like to focus on is the general question of formants versus spectral shape representations for speech, with a particular focus on the representation of vowel quality. Before getting into that material, we would like to describe a new database of acoustic measurements that we collected for vowels in /hVd/ syllables (Hillenbrand, Getty, Clark, and Wheeler, 1995). The experiment was designed as a replication and extension of the classic Bell Labs study by Peterson & Barney (1952). The measurements that Peterson & Barney reported have obviously had a profound influence on vowel perception research. Our study was intended to address a number of limitations of this classic work, the most important of which are: (1) the unavailability of the original recordings, and (2) the absence of spectral change and duration measurements. The design of the study was quite simple. Recordings were made of /hVd/ utterances spoken by 139 talkers: 45 men, 48 women, and 46 10- to 12-year-old children. The utterances included the 10 vowels recorded by Peterson & Barney, plus /e/ and /o/, for a total of 12 vowels in /hVd/ context. The majority of the talkers (87%) were from the Lower Peninsula of Michigan; the remainder were primarily from other areas of the upper Midwest (e.g., Illinois, Wisconsin, Ohio, Minnesota). The talkers were screened for dialect using a fairly simple procedure that we will not have time to describe.
The acoustic measurements consisted of: (1) vowel duration and "steady state" times, which were measured by hand from gray-scale spectrograms, (2) f0 contours, which were automatically tracked and then hand edited, and (3) contours of F1-F4, which were hand edited from LPC peak displays.

_______________________________________________
Figure 1
_______________________________________________

Figure 1 shows an LPC peak display before and after hand editing. From left to right, the dashed lines indicate the start of the vowel, an "eyeball" measurement of the steadiest portion of the vowel, and the end of the vowel.

_______________________________________________
Figure 2
_______________________________________________

The signals were presented to a panel of listeners for identification. Although some details differ, the overall intelligibility of our vowels is nearly identical to that of Peterson & Barney's. In both cases, the error rate is about 5%, where an error is any trial in which the listener reports hearing a vowel other than the one intended by the talker. It is also of some interest that identification rates do not differ much across the three groups of talkers. (Peterson & Barney did not report results separately for men, women, and child talkers.)

_______________________________________________
Figure 3
_______________________________________________

In contrast to the identification results, there were some very large differences in the acoustic measurements. The two main differences were: (1) our data showed somewhat more overlap among vowel categories than the Peterson & Barney data when only F1 and F2 are considered, and (2) the average positions of some of the vowels in F1-F2 space are different. Figure 3 shows a standard F1-F2 plot of the data from our study.
Individual tokens that were not well identified by the listeners have been removed, and the database has been thinned of redundant data points. It can be seen that there is a large amount of overlap among adjacent vowels. The overlap is especially large for the vowels /æ/ and /ɛ/, /u/ and /ʊ/, and the group /ɑ-ɔ-ʌ/. As will be seen in a moment, these vowels can be separated rather well when duration and spectral change are taken into account.

_______________________________________________
Figure 4
_______________________________________________

Figure 4 compares average formant values from our study with those of Peterson & Barney. The figure shows data from the adult females only, although very similar patterns are seen for the men. (The children present a somewhat different case; this is discussed in our JASA paper.) The data are plotted in the form of an acoustic vowel diagram; the Peterson & Barney data are shown in dotted lines, and our data are shown in solid lines. The central vowels /ɝ/ and /ʌ/ occupy nearly identical positions in the two data sets, but substantial differences are seen for all other vowels. To the extent that we can rely on traditional articulatory interpretations of formant data, these findings suggest the following: (1) the high vowels in our data are produced somewhat lower in the mouth, (2) the back vowels in our data seem to be produced with somewhat more anterior tongue positions, and (3) the low back vowels /ɑ/ and /ɔ/ in our data are shifted down and forward. There are also very large differences in the production of the vowels /æ/ and /ɛ/. The Peterson & Barney formant data are consistent with the textbook description of these vowels, suggesting a higher and more anterior tongue position for /æ/ than for /ɛ/. Our data suggest rather similar tongue positions for these two vowels, at least when the formants are sampled at "steady state".
Our results are consistent with several studies by Labov suggesting that the vowel /æ/ is drifting upward in many dialects of American English, particularly the "Northern Cities" dialect spoken by our subjects (e.g., Labov, Yaeger, and Steiner, 1972). We believe that our speakers are differentiating these two vowels primarily by differences in spectral change patterns and duration. These comparisons raise a rather obvious question: What accounts for the differences between the two sets of formant measurements? Very broadly, there seem to be only two possibilities: (1) methodological differences in the measurement of the formant frequencies (we used LPC spectral peaks, while Peterson & Barney used narrow band spectra produced on an analog spectrograph), and (2) dialect differences. Although we do not have time to develop the argument here, we believe that dialect is by far the more important of the two factors. There are two dialect factors that may have played a role. First, as discussed above, there are dialect differences based on region. (This is somewhat difficult to evaluate since so little is known about the dialect of Peterson & Barney's talkers, although the majority of the women and children were from the Mid-Atlantic region.) Second, the two sets of recordings are separated by approximately 40 years. An excellent study of British Received Pronunciation by Bauer (1985) shows clearly that substantial shifts in vowel formant frequencies can be expected over a period of 40 years. There is one final point worth making about these comparisons. There has been a tendency over the years to view the Peterson & Barney measurements as some sort of "gold standard" that tells us how the vowels of American English are produced. We think that we all understand that these values vary across region and that they can show significant changes over time. If nothing else, our study serves to remind us of these simple facts.
_______________________________________________
Figure 5
_______________________________________________

All of the measurements that have been discussed up to this point were taken at the point in the vowel where the formant pattern was judged to be maximally steady. Figure 5 shows what the spectral change patterns look like for our vowels. What we have done here is simply to connect a line between the spectral pattern sampled at 20% of vowel duration and the spectral pattern sampled at 80% of vowel duration. The symbol for each vowel is plotted at the location of the second sample. There are just two simple points that we would like to make about this figure. First, it is obvious that nearly all of the vowels show a fair amount of spectral movement. The only notable exceptions are /i/ and /u/, which show very little formant movement. The more important point, however, is that the spectral changes occur in such a way as to increase the contrast between adjacent vowels. For example, vowel pairs such as /æ/ and /ɛ/, /u/ and /ʊ/, and /ɑ/ and /ɔ/, which occupy very similar static positions, show very different patterns of spectral change. One obvious limitation of this figure is that we are displaying averages rather than individual data points. We have not found a way to make a coherent display of the spectral change patterns that shows individual data points, but this issue can be addressed by looking at the performance of a discriminant classifier that is trained on either static or dynamic formant patterns.

_______________________________________________
Figure 6
_______________________________________________

Figure 6 shows the classification accuracy for a quadratic discriminant classifier that has been trained on either a single sample of the formant pattern or two samples, taken at 20% and 80% of vowel duration. The open bars show the results for the single-sample case, and the filled bars show the results for two samples.
In the left half of the figure, duration information was not used, while the right half of the figure shows comparable results with vowel duration added to the parameter set. Results are shown for several combinations of parameters: F1-F2, F1-F2-F3, f0-F1-F2, and f0-F1-F2-F3. The discriminant analysis results are quite straightforward: the pattern classifier does a much better job of sorting the vowels when it is provided with information about the pattern of spectral change. The effect is particularly large for the simpler parameter sets. For example, with F1 and F2 alone, the classification accuracy is 71% for a single sample at "steady state" and 91% for two samples of the formant pattern. Adding vowel duration to the parameter set also improves classification accuracy, although the effect is not particularly large if the formant pattern is sampled twice. The pattern classification approach has some obvious limitations (e.g., Nearey, 1992; Hillenbrand & Gayvert, 1993): showing that a pattern classifier is strongly affected by spectral change is not the same as showing that human listeners rely on spectral change patterns. We recently completed an experiment that was designed to examine this question more closely using synthesized signals. We selected 300 signals from our /hVd/ database. Listeners were asked to identify the original signals and two different synthesized versions. The two synthesis methods are shown in Figure 7.

_______________________________________________
Figure 7
_______________________________________________

One set of signals was synthesized using the original measured formant contours, shown here by the dashed curves. The vowel here is /æ/, and the original measured formant contour shows a pronounced offglide in which the formants converge, a pattern that was quite common in our data for /æ/.
A second set of signals was synthesized with flat formants, fixed at the values measured at "steady state," shown here by the solid curves. For these signals, a 40 ms transition connected the steady-state values to the formant values that were measured at the end of the vowel. Both sets of synthetic signals were generated with the original measured f0 contours and at the original measured durations.

_______________________________________________
Figure 8
_______________________________________________

The labeling results are shown in Figure 8. The findings are rather clear: identification rates are highest for the original signals and lowest for the signals that were synthesized with flattened formant contours. The primary purpose of this experiment was to measure the relative importance of formant movements in vowel recognition, but the significant drop in intelligibility between the naturally produced signals and the signals that were synthesized to match the measured formant patterns is quite important, in our view. This drop in intelligibility can be attributed to some unknown combination of two factors: (1) errors in the estimation of the acoustic measurements, especially the formant frequencies, and (2) the loss of phonetically relevant information in the transformation to a formant representation, as has been suggested by Bladon (Bladon, 1982; Bladon and Lindblom, 1981) and Zahorian and Jagharghi (1986, 1993). We will have more to say about the general question of formant representations later. The effect of flattening the formant contours is rather large, and not especially surprising given the significant body of evidence that has accumulated over the last 20 years or so implicating a role for spectral change in vowel recognition (e.g., Jenkins et al., 1983; Nearey & Assmann, 1986; Nearey, 1989; Hillenbrand and Gayvert, 1993).
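The discriminant analysis behind Figure 6 can be illustrated with a small, hand-rolled quadratic (Gaussian) classifier. The sketch below uses made-up formant data for two hypothetical vowel classes that overlap at "steady state" but differ in their offglide targets; all numbers are invented for illustration (Python/numpy):

```python
import numpy as np

class QuadraticDiscriminant:
    # A separate Gaussian (mean plus full covariance) is fit per vowel class;
    # samples are assigned to the class with the highest log density.
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.params = {}
        for c in self.classes:
            Xc = X[y == c]
            cov = np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1])
            self.params[c] = (Xc.mean(axis=0), np.linalg.inv(cov),
                              np.linalg.slogdet(cov)[1])
        return self

    def predict(self, X):
        scores = []
        for c in self.classes:
            mu, icov, logdet = self.params[c]
            d = X - mu
            scores.append(-0.5 * (np.einsum('ij,jk,ik->i', d, icov, d) + logdet))
        return self.classes[np.argmax(scores, axis=0)]

# Two made-up vowel classes: nearly identical static F1-F2 targets,
# but different offglide targets (all values hypothetical, in Hz).
rng = np.random.default_rng(0)
n = 200
a_static = rng.normal([660.0, 1700.0], 60.0, size=(n, 2))
b_static = rng.normal([650.0, 1720.0], 60.0, size=(n, 2))
a_glide = rng.normal([600.0, 1900.0], 60.0, size=(n, 2))
b_glide = rng.normal([550.0, 1600.0], 60.0, size=(n, 2))
y = np.array([0] * n + [1] * n)

X_static = np.vstack([a_static, b_static])                  # one sample
X_dynamic = np.vstack([np.hstack([a_static, a_glide]),      # two samples
                       np.hstack([b_static, b_glide])])

acc_static = np.mean(QuadraticDiscriminant().fit(X_static, y).predict(X_static) == y)
acc_dynamic = np.mean(QuadraticDiscriminant().fit(X_dynamic, y).predict(X_dynamic) == y)
```

With these synthetic values the static classifier performs near chance while the two-sample classifier separates the classes almost perfectly, the same qualitative pattern as the 71% versus 91% F1-F2 result reported above.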
We see this study not so much as a confirmation that spectral change is important, since this is already well established, but as a way to measure the size of the effect using signals that are closely modeled after naturally produced utterances, rather than isolated vowels. An obvious limitation of this study is that vowel duration was not manipulated. We are working on that experiment right now and should have some results to report soon. [Note: This study has since been completed; see Hillenbrand, Clark, and Houde, 2000.]

FORMANTS VERSUS SPECTRAL SHAPE

The next issue that we would like to discuss is the longstanding debate regarding formant representations versus overall spectral shape. This is an issue that has received only sporadic attention over the years, but it has been revived to some extent recently, due in part to the work of Bladon and Lindblom (Bladon, 1982; Bladon and Lindblom, 1981) and Zahorian and Jagharghi (1986, 1993). The essence of formant theory is that phonetic quality is controlled not by the fine details of the spectrum, but by the frequency locations of spectral peaks that correspond to the natural resonant frequencies of the vocal tract.

_______________________________________________
Figure 9
_______________________________________________

The primary virtue of formant theory is that formant representations account for a very large number of findings in the phonetic perception literature. The most straightforward demonstration is the high intelligibility of speech that has been synthesized from measured formant patterns. There are, however, several important problems with formant representations. The most important of these is the one that Bladon has called the determinacy problem. Quite simply, after several decades of concentrated effort by many imaginative investigators, no one has developed a reliable method for measuring formant frequencies automatically from natural speech.
The problem, of course, is that formant tracking involves more than just extracting spectral peaks; it is necessary to assign each of the peaks to specific formant slots corresponding to F1, F2, and F3. A second problem has to do with explaining listening errors. As Klatt (1982a; 1982b) has pointed out, the errors that are made by human listeners do not appear to be consistent with the idea that the auditory system is tracking formants. When formant trackers make errors, they tend to be large ones, since the errors involve missing formants entirely or tracking spurious peaks. Human listeners, on the other hand, nearly always hear a vowel that is very close in phonetic space to the intended vowel. This fact is easily accommodated by a whole-spectrum model.

_______________________________________________
Figure 10
_______________________________________________

An alternative to formant representations is a whole-spectrum or spectral shape model. The basic idea is essentially the inverse of formant theory: phonetic quality is assumed to be controlled not by formants, but by the overall shape of the spectrum. According to this view, formants only seem to be important because spectral peaks have such a large effect on the overall shape of the spectrum. The main advantage of this view is that it does not require formant tracking, which we know to be somewhere between difficult and impossible. Spectral shape models can also easily accommodate the data on listening errors. But there is a very significant problem with whole-spectrum models: there is very solid evidence that large alterations can be made in spectral shape without affecting phonetic quality, as long as formant frequencies are preserved. There is a fair amount of evidence on this issue, but the key experiment, in our view, comes from a rather simple study by Klatt (1982a).
Listeners in Klatt's study were presented with pairs of synthetic signals consisting of a standard /æ/ and one of 13 comparison signals.

_______________________________________________
Figure 11
_______________________________________________

The comparison signals were generated by manipulating one of two kinds of parameters: (1) those that affect overall spectral shape (e.g., high- or low-pass filtering, spectral tilt, changes in formant amplitudes or bandwidths, the introduction of spectral notches, etc.), and (2) changes in formant frequencies. Listeners were asked to judge either the overall psychophysical distance or the phonetic distance between the standard and comparison stimuli. The results were very clear: changes in formant frequency were easily the most important factor affecting phonetic quality. Other changes in spectral shape, while clearly audible to the listeners, did not result in large changes in phonetic quality. The conclusions that we draw from this literature as a whole (not just the Klatt experiment) are the following:

1. Phonetic quality seems to be controlled primarily by the frequencies of spectral peaks. However, we do not believe that a traditional formant analysis is feasible.

2. We believe that the evidence is consistent with a spectral shape model, but one which is strongly dominated by energy in the vicinity of spectral peaks.

In response to these findings we have developed a spectrum analysis technique called the Masked Peak Representation (MPR). The MPR is a relatively simple transform that was designed to: (1) show the kind of sensitivity to spectral peaks that is exhibited by human listeners, but without requiring explicit formant tracking, (2) be relatively insensitive to spectral shape details other than peaks, and (3) be relatively insensitive to formant mergers, split formants, and related phenomena that cause such difficulty for traditional formant frequency representations.
The representation shares some features with a cepstrum-based processing scheme described by Aikawa et al. (1993), which we recently became aware of.

_______________________________________________
Figure 12
_______________________________________________

The spectrum analysis begins with a smoothed Fourier power spectrum; the smoothing is carried out both across frequency and across time. The only unconventional feature of the analysis is the calculation of a running average of spectral amplitudes, computed over a rather wide (800 Hz) window. This running average is subtracted from the smoothed spectrum, producing what we have called the masked spectrum. Subtraction of the running average is a rough simulation of lateral suppression, and one practical result is that it allows you to find spectral peaks that would otherwise be missed. Although it may not be obvious from this figure, the bump corresponding to F2 in this signal is not, in fact, a spectral peak but a soft shoulder that would be missed by a conventional peak picker. We find this pattern to be rather common, and there is little doubt that listeners are sensitive to these shoulders, because the phonetic quality is off for signals that are synthesized without representing them.

_______________________________________________
Figure 13
_______________________________________________

The display on top is a normal FFT spectrogram of the utterance "OK, hello," and the display below shows what the utterance looks like after subtraction of the running average. It can be seen that most of what is left after the subtraction is the energy in the vicinity of spectral peaks; the overall spectral tilt has been removed, as has the energy in the valleys between spectral peaks. However, although the formants are quite visible, there are many peaks in the masked spectrum that clearly do not correspond to formants.
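As a rough sketch of the masked-spectrum computation just described (smoothing, a wide running average, and subtraction), the following illustration operates on a single analysis frame. The smoothing bandwidth and the test signal are stand-in values of our own, not the exact parameters of the MPR, and the sketch smooths across frequency only, not across time (Python/numpy):

```python
import numpy as np

def masked_spectrum(frame, fs, smooth_bw=100.0, avg_bw=800.0):
    # Smoothed log power spectrum of one analysis frame.
    spec = 20 * np.log10(np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) + 1e-12)
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
    binw = freqs[1] - freqs[0]
    k = max(1, int(smooth_bw / binw))
    smooth = np.convolve(spec, np.ones(k) / k, mode="same")
    # Running average over a wide (800 Hz) window, then subtract it,
    # a rough simulation of lateral suppression.
    m = max(1, int(avg_bw / binw))
    running = np.convolve(smooth, np.ones(m) / m, mode="same")
    return freqs, smooth - running

# Stand-in test frame: two damped resonances at 500 and 1500 Hz
fs = 10000
n = np.arange(1024)
frame = sum(np.exp(-np.pi * 80 * n / fs) * np.cos(2 * np.pi * f * n / fs)
            for f in (500.0, 1500.0))
freqs, masked = masked_spectrum(frame, fs)
```

After the subtraction, energy near the peaks stands above the zero line while tilt and between-peak valleys are suppressed, so a soft shoulder riding on a strong neighboring formant can emerge as a local maximum even though it is not a peak in the raw spectrum.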
We are just beginning a series of experiments that are intended to test the validity of the MPR. One specific claim that we are making is that the MPR retains those aspects of the original speech signal that are most relevant to phonetic quality. As one test of this claim, we resynthesized signals from the /hVd/ database from masked spectra. The synthesizer that was used in this experiment was a damped sinewave vocoder. The vocoder is driven by three kinds of information:

1. Spectral Peaks
2. Instantaneous Fundamental Frequency
3. A Measure of the Degree of Periodicity

_______________________________________________
Figure 14
_______________________________________________

The spectral peaks that drive the synthesizer are extracted from the masked spectrum. The signal is then resynthesized by summing exponentially damped sinusoids at frequencies corresponding to the spectral peaks. This figure shows three damped sinusoids that would be summed to produce the vowel /ɑ/. If the periodicity measure indicates that the speech is voiced, the damped sinusoids are repeated at a rate corresponding to the measured fundamental period. For unvoiced speech, the damped sinusoids are pulsed on and off at random intervals.

__________________________________________________________________
Audio Tape: Damped Sinewave Synthesis Demo
__________________________________________________________________

The vowel intelligibility tests that we conducted with the damped sinewave synthesizer used the same 300 signals from the /hVd/ database that were used in the formant synthesis experiment described previously.
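The voiced branch of the damped sinewave synthesis just described can be sketched as follows. The /ɑ/-like peak frequencies, bandwidths, and amplitudes are stand-in values, and the real vocoder is driven frame by frame from the masked spectrum rather than by fixed parameters (Python/numpy):

```python
import numpy as np

def damped_sine(freq, bw, dur, fs):
    # One exponentially damped sinusoid: the impulse response of a
    # single resonance with the given center frequency and bandwidth.
    t = np.arange(int(dur * fs)) / fs
    return np.exp(-np.pi * bw * t) * np.sin(2 * np.pi * freq * t)

def synthesize_voiced(peaks, bws, amps, f0, dur, fs=10000):
    # Sum one damped sinusoid per spectral peak, then repeat the summed
    # burst (overlap-add) once per fundamental period for voiced speech.
    out = np.zeros(int(dur * fs))
    burst = sum(a * damped_sine(f, b, 0.03, fs)
                for f, b, a in zip(peaks, bws, amps))
    period = int(round(fs / f0))
    for start in range(0, len(out), period):
        end = min(len(out), start + len(burst))
        out[start:end] += burst[:end - start]
    return out

# Stand-in /ɑ/-like peaks (Hz), bandwidths (Hz), and relative amplitudes
signal = synthesize_voiced([730.0, 1090.0, 2440.0], [90.0, 110.0, 170.0],
                           [1.0, 0.7, 0.3], f0=100.0, dur=0.5)
```

For unvoiced stretches the same bursts would simply be triggered at random intervals instead of once per fundamental period.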
The degree-of-periodicity measure that is normally used to control the voiced/unvoiced balance in the damped sinewave vocoder was not used in this experiment: the damped sinusoids were pulsed at random intervals during the /h/, and at intervals derived from the hand-edited fundamental frequency measurements during the remainder of the syllable. Listeners were asked to identify: (1) the original, naturally produced utterances, (2) utterances that were resynthesized from the MPR spectra using the damped sinewave vocoder, and (3) formant-synthesized utterances that followed the measured pitch and formant contours.

_______________________________________________
Figure 15
_______________________________________________

There are several details of the labeling results with these signals that are of some interest, but in the limited time available we will only be able to take a brief look at overall intelligibility. The main finding is that the signals that were generated from the MPR-derived spectral peaks are quite intelligible, although less intelligible than the original signals. The main point to be made about these results is that vowel quality (and probably phonetic quality in general) is maintained quite well if spectral peaks are preserved. Further, the spectral peaks do not necessarily need to correspond to formants. We should point out that we do not regard this as an especially strong test of the validity of the MPR. A stronger test would involve demonstrating that phonetic distance judgments such as those reported in Klatt's (1982a) study can be predicted well from masked spectra. We are currently developing a distance measure based on masked spectra that could be used in an experiment of this kind. We hope to have results to report soon.

REFERENCES

Aikawa, K., Singer, H., Kawahara, H., and Tohkura, Y. (1993).
A dynamic cepstrum incorporating time-frequency masking and its application to continuous speech recognition. Proceedings of ICASSP-93, 668-671.

Assmann, P., Nearey, T., and Hogan, J. (1982). Vowel identification: Orthographic, perceptual, and acoustic aspects. Journal of the Acoustical Society of America, 71, 975-989.

Bauer, L. (1985). Tracing phonetic change in the received pronunciation of British English. Journal of Phonetics, 13, 61-81.

Bennett, D.C. (1968). Spectral form and duration as cues in the recognition of English and German vowels. Language and Speech, 11, 65-85.

Bennett, S. (1983). A three-year longitudinal study of school-aged children's fundamental frequencies. Journal of Speech and Hearing Research, 26, 137-142.

Bladon, A. (1982). Arguments against formants in the auditory representation of speech. In R. Carlson and B. Granstrom (Eds.), The Representation of Speech in the Peripheral Auditory System. Amsterdam: Elsevier Biomedical Press, 95-102.

Bladon, A., and Lindblom, B. (1981). Modeling the judgment of vowel quality differences. Journal of the Acoustical Society of America, 69, 1414-1422.

Hedlin, P. (1982). A representation of speech with partials. In R. Carlson and B. Granstrom (Eds.), The Representation of Speech in the Peripheral Auditory System. Amsterdam: Elsevier Biomedical Press, 247-250.

Hillenbrand, J.M., Clark, M.J., and Houde, R.A. (2000). Some effects of duration on vowel recognition. Journal of the Acoustical Society of America, 108, 3013-3022.

Hillenbrand, J., and Gayvert, R.T. (1993). Vowel classification based on fundamental frequency and formant frequencies. Journal of Speech and Hearing Research, 36, 694-700.

Hillenbrand, J., and Gayvert, R.T. (1993). Identification of steady-state vowels synthesized from the Peterson and Barney measurements. Journal of the Acoustical Society of America, 94, 668-674.

Hillenbrand, J., Getty, L.A., Clark, M.J., and Wheeler, K. (1995). Acoustic characteristics of American English vowels. Journal of the Acoustical Society of America, 97, 3099-3111.
Hillenbrand, J.M., and Nearey, T.M. (1999). Identification of resynthesized /hVd/ syllables: Effects of formant contour. Journal of the Acoustical Society of America, 105, 3509-3523.

Jenkins, J.J., Strange, W., and Edman, T.R. (1983). Identification of vowels in 'vowelless' syllables. Perception & Psychophysics, 34, 441-450.

Klatt, D.H. (1982a). Prediction of perceived phonetic distance from critical-band spectra: A first step. Proceedings of the IEEE ICASSP, 1278-1281.

Klatt, D.H. (1982b). Speech processing strategies based on auditory models. In R. Carlson and B. Granstrom (Eds.), The Representation of Speech in the Peripheral Auditory System. Amsterdam: Elsevier Biomedical Press, 181-196.

Labov, W., Yaeger, M., and Steiner, R. (1972). A quantitative study of sound change in progress. Report on National Science Foundation Contract NSF-FS-3287.

Miller, J.D. (1989). Auditory-perceptual interpretation of the vowel. Journal of the Acoustical Society of America, 85, 2114-2134.

Nearey, T.M. (1989). Static, dynamic, and relational properties in vowel perception. Journal of the Acoustical Society of America, 85, 2088-2113.

Nearey, T.M. (1992). Applications of generalized linear modeling to vowel data. In J. Ohala, T. Nearey, B. Derwing, M. Hodge, and G. Wiebe (Eds.), Proceedings of ICSLP 92, 583-586. Edmonton, AB: University of Alberta.

Nearey, T.M., Hogan, J., and Rozsypal, A. (1979). Speech signals, cues and features. In G. Prideaux (Ed.), Perspectives in Experimental Linguistics. Amsterdam: Benjamins.

Nearey, T.M., and Assmann, P. (1986). Modeling the role of vowel inherent spectral change in vowel identification. Journal of the Acoustical Society of America, 80, 1297-1308.

Peterson, G., and Barney, H.L. (1952). Control methods used in a study of the vowels. Journal of the Acoustical Society of America, 24, 175-184.

Strange, W. (1989). Dynamic specification of coarticulated vowels spoken in sentence context. Journal of the Acoustical Society of America, 85, 2135-2153.
Strange, W., Jenkins, J.J., and Johnson, T.L. (1983). Dynamic specification of coarticulated vowels. Journal of the Acoustical Society of America, 74, 695-705.

Syrdal, A.K. (1985). Aspects of a model of the auditory representation of American English vowels. Speech Communication, 4, 121-135.

Syrdal, A.K., and Gopal, H.S. (1986). A perceptual model of vowel recognition based on the auditory representation of American English vowels. Journal of the Acoustical Society of America, 79, 1086-1100.

Zahorian, S., and Jagharghi, A. (1986). Matching of 'physical' and 'perceptual' spaces for vowels. Journal of the Acoustical Society of America, Suppl. 1, 79, S8.

Zahorian, S., and Jagharghi, A. (1993). Spectral shape features versus formants as acoustic correlates for vowels. Journal of the Acoustical Society of America, 94, 1966-1982.

Figure 1.

Figure 2.

Figure 3.

Figure 4.

Figure 5.

Figure 6.

Figure 7.

Figure 8. (Note: This study has been completed and the results published in Hillenbrand & Nearey, JASA, 1999.)

Formant Representations

Basic Idea: Phonetic quality is controlled NOT by the detailed shape of the spectrum, but by the formant frequency pattern.

Advantage: Formants account for a large number of findings in speech perception. The high intelligibility of formant synthesis (and even sinewave speech) confirms that formants convey a great deal of phonetic information.

Problems:
(1) Formant tracking is an unresolved problem
(2) Many listening errors are inconsistent with formants

Figure 9.

Spectral Shape Representations

Basic Idea: Phonetic quality is controlled NOT by the formant pattern, but by the overall shape of the spectrum.

Advantages:
(1) Does not require formant tracking
(2) Accommodates data on listening errors (and vowel similarity judgments)

Problem: Large alterations can be made in spectral shape without affecting phonetic quality as long as the formant frequencies are preserved.

Figure 10.

Klatt (1982) Experiment

                                     Psy    Phon
Spectral Shape Manipulations
 1. Random Phase                    20.0     2.9
 2. High-Pass Filter                15.8     1.6
 3. Low-Pass Filter                  7.6     2.2
 4. Spectral Tilt                    9.6     1.5
 5. Overall Amplitude                4.5     1.5
 6. Formant Bandwidth                3.4     2.4
 7. Spectral Notch                   1.6     1.5
Formant Frequency Manipulations
 8. F1 ±8%                           6.5     6.2
 9. F2 ±8%                           5.0     6.1
10. F1, F2 ±8%                       8.2     8.6
11. F1, F2, F3, F4, F5 ±8%          10.7     8.1

Psy: Perceived psychophysical distance
Phon: Perceived phonetic distance

Figure 11.

Figure 12.

Figure 13.

Figure 14.

IDENTIFICATION RESULTS

ORIGINAL SIGNALS     96.9
MPR SYNTHESIS        90.9
FORMANT SYNTHESIS    88.7

Figure 15.