Vowel recognition: Formants, Spectral Peaks,
and Spectral Shape Representations
James Hillenbrand
Speech Pathology and Audiology
Western Michigan University
Kalamazoo MI 49008
and
Robert A. Houde
RIT Research Corporation
125 Tech Park Drive
Rochester, NY 14623
Invited presentation, Acoustical Society of America, St. Louis, November, 1995.
Hillenbrand & Houde (1995), St. Louis ASA Talk
We have decided to restrict our comments today to just a few issues having to do with the
spectral representations for vowels. One of the main issues that we'd like to focus on is the general
question of formants versus spectral shape representations for speech, with a particular focus on the
representation of vowel quality. Before getting into that material, we would like to describe a new
database that we collected of acoustic measurements for vowels in /hVd/ syllables (Hillenbrand,
Getty, Clark, and Wheeler, 1995). The experiment was designed as a replication and extension of
the classic Bell Labs study by Peterson & Barney (1952). The measurements that Peterson &
Barney reported have obviously had a profound influence on vowel perception research. Our study
was intended to address a number of limitations of this classic work, the most important of which
are: (1) unavailability of the original recordings, and (2) the absence of spectral change and
duration measurements.
The design of the study was quite simple. Recordings were made of /hVd/ utterances
spoken by 139 talkers: 45 men, 48 women, and 46 children aged 10-12 years. The utterances
included the 10 vowels recorded by Peterson & Barney, plus /e/ and /o/. The majority of the talkers
(87%) were from the Lower Peninsula of Michigan; the remainder were primarily from other areas
of the upper midwest (e.g., Illinois, Wisconsin, Ohio, Minnesota). The talkers were screened for
dialect using a fairly simple procedure that I will not have time to describe. Recordings were made
of the 12 vowels in /hVd/ context. The acoustic measurements consisted of: (1) vowel duration and
"steady state" times, which were measured by hand from gray-scale spectrograms, (2) f0 contours,
which were automatically tracked and then hand edited, and (3) contours of F1-F4, which were
hand-edited from LPC peak displays.
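For readers unfamiliar with the analysis behind those peak displays, the following is a minimal sketch of generic root-based LPC formant estimation (autocorrelation method); it is not the study's hand-editing procedure, and all signal parameters below are invented for illustration.

```python
import numpy as np

def lpc_formants(x, order, fs):
    """Sketch of root-based LPC formant estimation (autocorrelation method)."""
    # Autocorrelation at lags 0..order.
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    # Levinson-Durbin recursion for predictor coefficients a[1..order],
    # where A(z) = 1 - sum_k a_k z^-k.
    a = np.zeros(order + 1)
    e = r[0]
    for i in range(1, order + 1):
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / e
        a[1:i + 1] = np.concatenate((a[1:i] - k * a[i - 1:0:-1], [k]))
        e *= 1.0 - k * k
    # Poles of 1/A(z) above the real axis are formant candidates.
    roots = np.roots(np.concatenate(([1.0], -a[1:])))
    roots = roots[np.imag(roots) > 0]
    return np.sort(np.angle(roots) * fs / (2.0 * np.pi))

# Check on a synthetic all-pole signal: two resonances at 500 and 1500 Hz
# (80 Hz bandwidths, fs = 10 kHz), cascaded second-order sections.
fs = 10000.0
pole_r = np.exp(-np.pi * 80.0 / fs)
den = np.array([1.0])
for f in (500.0, 1500.0):
    den = np.convolve(den, [1.0, -2.0 * pole_r * np.cos(2.0 * np.pi * f / fs),
                            pole_r ** 2])
# Impulse response of the all-pole filter via its difference equation.
h = np.zeros(2000)
for n in range(len(h)):
    acc = 1.0 if n == 0 else 0.0
    for j in range(1, 5):
        if n >= j:
            acc -= den[j] * h[n - j]
    h[n] = acc
formants = lpc_formants(h, order=4, fs=fs)  # close to [500, 1500]
```

On a clean all-pole signal like this the pole angles are recovered almost exactly; the hard part with natural speech, as discussed later in the talk, is deciding which poles correspond to which formants.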
_______________________________________________
Figure 1
_______________________________________________
Figure 1 shows an LPC peak display before and after hand editing. From left to right, the
dashed lines indicate the start of the vowel, an "eyeball" measurement of the steadiest portion of the
vowel, and the end of the vowel.
_______________________________________________
Figure 2
_______________________________________________
The signals were presented to a panel of listeners for identification. Although there are
some details that differ, the overall intelligibility of our vowels is nearly identical to that reported by
Peterson & Barney. In both cases, the error rate is about 5%, where an error is any trial in which the listener
reports hearing a vowel other than the one intended by the talker. It is also of some interest to note
that identification rates do not differ much across the three groups of talkers. (Peterson & Barney
did not report results separately for men, women, and child talkers.)
_______________________________________________
Figure 3
_______________________________________________
In contrast to the identification results, there were some very large differences in the
acoustic measurements. The two main differences were: (1) our data showed somewhat more
overlap among vowel categories than the Peterson & Barney data when only F1 and F2 are
considered, and (2) the average positions of some of the vowels in F1-F2 space are different. Figure
3 shows a standard F1-F2 plot of the data from our study. Individual tokens that were not well
identified by the listeners have been removed, and the database has been thinned of redundant data
points. It can be seen that there is a large amount of overlap among adjacent vowels. The overlap is
especially large for the vowels /æ/ and /ɛ/, /u/ and /ʊ/ and the group /ɑ-ɔ-ʌ/. As will be seen in a
moment, these vowels can be separated rather well when duration and spectral change are taken
into account.
_______________________________________________
Figure 4
_______________________________________________
Figure 4 compares average formant values from our study with Peterson & Barney. The
figure shows data from the adult females only, although very similar patterns are seen for the men.
(The children present a somewhat different case; this is discussed in our JASA paper). The data are
plotted in the form of an acoustic vowel diagram; the Peterson & Barney data are shown in dotted
lines, and our data are shown in solid lines. The central vowels /ɝ/ and /ʌ/ occupy nearly identical
positions in the two data sets, but substantial differences are seen for all other vowels. To the extent
that we can rely on traditional articulatory interpretations of formant data, these findings suggest
the following: (1) the high vowels in our data are produced somewhat lower in the mouth, (2) the
back vowels in our data seem to be produced with somewhat more anterior tongue positions, and
(3) the low back vowels /ɑ/ and /ɔ/ in our data are shifted down and forward. There are also very
large differences in the production of the vowels /æ/ and /ɛ/. The Peterson & Barney formant data
are consistent with the textbook description of these vowels, suggesting a higher and more anterior
tongue position for /æ/ than /ɛ/. Our data suggest rather similar tongue positions for these two
vowels, at least when the formants are sampled at "steady state". Our results are consistent with
several studies by Labov, suggesting that the vowel /æ/ is drifting upward in many dialects of
American English, particularly the “Northern Cities” dialect spoken by our subjects (e.g., Labov,
Yaeger, and Steiner, 1972). We believe that our speakers are differentiating these two vowels
primarily by differences in the spectral change patterns and duration.
These comparisons raise a rather obvious question: What accounts for the differences
between the two sets of formant measurements? Very broadly, there only seem to be two
possibilities: (1) methodological differences in the measurement of the formant frequencies (we
used LPC spectral peaks and Peterson & Barney used narrow band spectra produced on an analog
spectrograph), and (2) dialect differences. Although we do not have time to develop the argument
here, we believe that dialect is by far the more important of the two factors. There are two dialect
factors that may have played a role. First, as discussed above, there are dialect differences based on
region. (This is somewhat difficult to evaluate since so little is known about the dialect of Peterson
& Barney's talkers, although the majority of the women and children were from the Mid-Atlantic
region). Second, the two sets of recordings are separated by approximately 40 years. An excellent
study of British RP by Bauer (1985) shows clearly that substantial shifts in vowel formant
frequencies can be expected over a period of 40 years.
There is one final point that is worth making on these comparisons. There has been a
tendency over the years to view the Peterson & Barney measurements as some sort of a "gold
standard" that tells us how the vowels of American English are produced. I think that we all
understand that these values vary across region and that they can show significant changes over
time. If nothing else, our study serves to remind us of these simple facts.
_______________________________________________
Figure 5
_______________________________________________
All of the measurements that have been discussed up to this point were taken at the point in
the vowel where the formant pattern was judged to be maximally steady. Figure 5 shows what the
spectral change patterns look like for our vowels. What we have done here is to simply connect a
line between the spectral pattern sampled at 20% of vowel duration and the spectral pattern
sampled at 80% of vowel duration. The symbol for each vowel is plotted at the location of the
second sample. There are just two simple points that we would like to make about this figure. First,
it is obvious that nearly all of the vowels show a fair amount of spectral movement. The only
notable exceptions are /i/ and /u/, which show very little formant movement. The more important
point, however, is that the spectral changes occur in such a way as to increase the contrast between
adjacent vowels. For example, vowel pairs such as /æ/ and /ɛ/, /u/ and /ʊ/, and /ɑ/ and /ɔ/, which
occupy very similar static positions, show very different patterns of spectral change.
One obvious limitation of this figure is that we are displaying averages rather than
individual data points. We have not found a way to make a coherent display of the spectral change
patterns that shows individual data points, but this issue can be addressed by looking at the
performance of a discriminant classifier that is trained on either static or dynamic formant patterns.
_______________________________________________
Figure 6
_______________________________________________
Figure 6 shows the classification accuracy for a quadratic discriminant classifier that has
been trained on either a single sample of the formant pattern or two samples, taken at 20% and 80%
of vowel duration. The open bars show the results for the single-sample case, and the filled bars
show the results for two samples. In the left half of the figure duration information was not used,
while the right half of the figure shows comparable results with vowel duration added to the
parameter set. Results are shown for several combinations of parameters: F1-F2, F1-F2-F3, f0-F1-F2,
and f0-F1-F2-F3.
The discriminant analysis results are quite straightforward: the pattern classifier does a
much better job of sorting the vowels when it is provided with information about the pattern of
spectral change. The effect is particularly large for simpler parameter sets. For example, with F1
and F2 alone, the classification accuracy is 71% for a single sample at "steady state" and 91% for
two samples of the formant pattern. Adding vowel duration to the parameter set also improves
classification accuracy, although the effect is not particularly large if the formant pattern is sampled
twice.
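As a concrete illustration of this kind of discriminant analysis, here is a minimal numpy sketch of a quadratic (per-class Gaussian) classifier trained on a single "steady-state" formant sample versus two samples. The data are synthetic stand-ins for the /hVd/ measurements: two hypothetical vowel classes with identical static F1-F2 distributions that differ only in their offglides. The class structure, sample sizes, and formant values are invented, not taken from the study.

```python
import numpy as np

rng = np.random.default_rng(0)

def qda_fit(X, y):
    """Per-class mean, covariance, and prior for a quadratic discriminant."""
    return {c: (X[y == c].mean(axis=0), np.cov(X[y == c].T), np.mean(y == c))
            for c in np.unique(y)}

def qda_predict(models, X):
    classes = sorted(models)
    scores = []
    for c in classes:
        mu, S, prior = models[c]
        d = X - mu
        # Gaussian log-density plus log prior (shared constants dropped).
        scores.append(np.log(prior) - 0.5 * np.log(np.linalg.det(S))
                      - 0.5 * np.einsum("ij,jk,ik->i", d, np.linalg.inv(S), d))
    return np.array(classes)[np.argmax(scores, axis=0)]

# Two synthetic "vowels" with identical static F1-F2 distributions...
n = 200
f1 = rng.normal(650.0, 40.0, (2, n))   # onset F1, one row per class
f2 = rng.normal(1800.0, 80.0, (2, n))  # onset F2
# ...but different offglides: class 0 moves (F1 down, F2 up), class 1 stays put.
X0 = np.column_stack([f1[0], f2[0],
                      f1[0] - 150.0 + rng.normal(0.0, 30.0, n),
                      f2[0] + 200.0 + rng.normal(0.0, 60.0, n)])
X1 = np.column_stack([f1[1], f2[1],
                      f1[1] + rng.normal(0.0, 30.0, n),
                      f2[1] + rng.normal(0.0, 60.0, n)])
X = np.vstack([X0, X1])
y = np.repeat([0, 1], n)

acc_static = np.mean(qda_predict(qda_fit(X[:, :2], y), X[:, :2]) == y)
acc_dynamic = np.mean(qda_predict(qda_fit(X, y), X) == y)
# acc_static hovers near chance; acc_dynamic is near perfect.
```

Duration could be added to the parameter set in the same way, as a fifth column.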
The pattern classification approach has some obvious limitations (e.g., Nearey, 1992;
Hillenbrand & Gayvert, 1993). Showing that a pattern classifier is strongly affected by spectral
change is not the same as showing that human listeners rely on spectral change patterns. We
recently completed an experiment that was designed to examine this question more closely using
synthesized signals. We selected 300 signals from our /hVd/ database. Listeners were asked to
identify the original signals and two different synthesized versions. The two synthesis methods are
shown in Figure 7. One set of signals was synthesized using the original measured formant
_______________________________________________
Figure 7
_______________________________________________
contours, shown here by the dashed curves. The vowel here is /æ/, and the original measured
formant contour shows a pronounced offglide in which the formants converge, a pattern which was
quite common in our data for /æ/. A second set of signals was synthesized with flat formants, fixed
at the values measured at "steady state," shown here by the solid curves. For these signals a 40 ms
transition connected the steady-state values to the formant values that were measured at the end of
the vowel. Both sets of synthetic signals were generated with the original measured f0 contours and
at the original measured durations.
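The flat-formant construction can be sketched as follows. The contour values and sampling rate here are hypothetical, and details of the original synthesis procedure may differ; the sketch only illustrates the "flat until a 40 ms transition into the measured offset value" manipulation.

```python
import numpy as np

def flatten_contour(t, contour, steady_value, trans_s=0.040):
    """Flat contour fixed at the steady-state value, with a linear 40 ms
    transition into the formant value measured at vowel offset."""
    t = np.asarray(t, dtype=float)
    out = np.full_like(t, float(steady_value))
    t0 = t[-1] - trans_s                     # transition onset
    trans = t >= t0
    # Linear interpolation from the steady-state value to the offset value.
    out[trans] = np.interp(t[trans], [t0, t[-1]], [steady_value, contour[-1]])
    return out

# Hypothetical rising F2 contour for a 250 ms vowel, flattened at 1850 Hz.
t = np.linspace(0.0, 0.25, 251)
f2 = np.linspace(1800.0, 2100.0, 251)
flat = flatten_contour(t, f2, steady_value=1850.0)
```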
_______________________________________________
Figure 8
_______________________________________________
The labeling results are shown in Figure 8. The findings are rather clear: identification rates
are highest for the original signals, and lowest for the signals that were synthesized with flattened
formant contours. The primary purpose of this experiment was to measure the relative importance
of formant movements in vowel recognition, but the significant drop in intelligibility between the
naturally produced signals and the signals that were synthesized to match the measured formant
patterns is quite important, in our view. This drop in intelligibility can be attributed to some
unknown combination of two factors: (1) errors in the estimation of the acoustic measurements,
especially the formant frequencies, and (2) the loss of phonetically relevant information in the
transformation to a formant representation, as has been suggested by Bladon (Bladon, 1982; Bladon
and Lindblom, 1981) and Zahorian and Jagharghi (1986, 1993). We will have more to say about the
general question of formant representations later.
The effect of flattening the formant contours is rather large, and not especially surprising
given the significant body of evidence that has accumulated over the last 20 years or so implicating
a role for spectral change in vowel recognition (e.g., Jenkins et al., 1983; Nearey & Assmann,
1986; Nearey, 1989; Hillenbrand and Gayvert, 1993). We see this study not so much as a
confirmation that spectral change is important, since this is already well established, but as a way to
measure the size of the effect using signals that are closely modeled after naturally produced
utterances, rather than isolated vowels. An obvious limitation of this study is that vowel
duration was not manipulated. We are working on that experiment right now and should have some
results to report soon. [Note: This study has been completed – see Hillenbrand, Clark, and Houde,
2000.]
FORMANTS VERSUS SPECTRAL SHAPE
The next issue that we would like to discuss is the longstanding debate regarding formant
representations versus overall spectral shape. This is an issue that has received only sporadic
attention over the years, but has been revived to some extent recently, due in part to the work of
Bladon and Lindblom (Bladon, 1982; Bladon and Lindblom, 1981) and Zahorian and Jagharghi
(1986, 1993). The essence of formant theory is that phonetic quality is controlled not by the fine
_______________________________________________
Figure 9
_______________________________________________
details of the spectrum, but by the frequency locations of spectral peaks that correspond to the
natural resonant frequencies of the vocal tract. The primary virtue of formant theory is that formant
representations account for a very large number of findings in the phonetic perception literature.
The most straightforward demonstration is the high intelligibility of speech that has been
synthesized from measured formant patterns.
There are, however, several important problems with formant representations. The most
important of these is the one that Bladon has called the determinacy problem. Quite simply, after
several decades of concentrated effort by many imaginative investigators, no one has developed a
reliable method to measure formant frequencies automatically from natural speech. The problem, of
course, is that formant tracking involves more than just extracting spectral peaks; it is necessary to
assign each of the peaks to specific formant slots corresponding to F1, F2, and F3.
A second problem has to do with explaining listening errors. As Klatt (1982a; 1982b) has
pointed out, errors that are made by human listeners do not appear to be consistent with the idea
that the auditory system is tracking formants. When formant trackers make errors, they tend to be
large ones since the errors involve missing formants entirely or tracking spurious peaks. Human
listeners, on the other hand, nearly always hear a vowel that is very close in phonetic space to the
intended vowel. This fact is easily accommodated by a whole-spectrum model.
_______________________________________________
Figure 10
_______________________________________________
An alternative to formant representations is a whole spectrum or spectral shape model. The
basic idea is essentially the inverse of formant theory: Phonetic quality is assumed to be controlled
not by formants, but by the overall shape of the spectrum. According to this view, formants only
seem to be important because spectral peaks have such a large effect on the overall shape of the
spectrum. The main advantage of this view is that it does not require formant tracking, which we
know to be somewhere between difficult and impossible. Spectral shape models can also easily
accommodate the data on listening errors. But there is a very significant problem with whole
spectrum models: There is very solid evidence that large alterations can be made in spectral shape
without affecting phonetic quality, as long as formant frequencies are preserved. There is a fair
amount of evidence on this issue, but the key experiment, in our view, comes from a rather simple
study by Klatt (1982a). Listeners in Klatt's study were presented with pairs of synthetic signals
_______________________________________________
Figure 11
_______________________________________________
consisting of a standard /æ/ and one of 13 comparison signals. The comparison signals were
generated by manipulating one of two kinds of parameters: (1) those that affect overall spectral
shape (e.g., high- or low-pass filtering, spectral tilt, changes in formant amplitudes or bandwidths,
the introduction of spectral notches, etc.) and (2) changes in formant frequencies. Listeners were
asked to judge either the overall psychophysical distance or the phonetic distance between the
standard and comparison stimuli. Results were very clear: changes in formant frequency were
easily the most important affecting phonetic quality. Other changes in spectral shape, while clearly
audible to the listeners, did not result in large changes in phonetic quality.
The conclusions that we draw from this literature as a whole (not just the Klatt experiment)
are the following:
1. Phonetic quality seems to be controlled primarily by the frequencies of spectral peaks.
However, we do not believe that a traditional formant analysis is feasible.
2. We believe that the evidence is consistent with a spectral shape model, but one which is
strongly dominated by energy in the vicinity of spectral peaks.
In response to these findings we have developed a spectrum analysis technique called the
Masked Peak Representation (MPR). The MPR is a relatively simple transform that was designed
to: (1) show the kind of sensitivity to spectral peaks that is exhibited by human listeners but without
requiring explicit formant tracking, (2) be relatively insensitive to spectral shape details other than
peaks, and (3) be relatively insensitive to formant mergers, split formants, and related phenomena
that cause such difficulty for traditional formant frequency representations. The representation
shares some features with a cepstrum-based processing scheme described by Aikawa et al. (1993),
which we recently became aware of.
_______________________________________________
Figure 12
_______________________________________________
The spectrum analysis begins with a smoothed Fourier power spectrum -- the smoothing is
carried out both across frequency and across time. The only thing unconventional about the
analysis is the calculation of a running average of spectral amplitudes, calculated over a rather wide
(800 Hz) window. This running average is subtracted from the smoothed spectrum, producing what
we have called the masked spectrum. Subtraction of the running average is a rough simulation of
lateral suppression, and one practical result is that it allows you to find spectral peaks that would
otherwise be missed. Although it may not be obvious from this figure, the bump corresponding to
F2 in this signal is not, in fact, a spectral peak, but a soft shoulder that would be missed by a
conventional peak picker. We find this pattern to be rather common, and there is little doubt that
listeners are sensitive to these shoulders because the phonetic quality is off for signals that are
synthesized without representing these shoulders.
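The shoulder effect can be reproduced with a toy spectrum. The sketch below subtracts an 800 Hz running average from a synthetic dB spectrum consisting of a steep tilt plus two resonance bumps; the tilt values and bump locations are invented, and the real analysis also smooths across frequency and time, which is omitted here. The second bump is only a shoulder in the raw spectrum (no local maximum), but it emerges as a clear peak in the masked spectrum.

```python
import numpy as np

freqs = np.arange(0.0, 5000.0, 10.0)                # 10 Hz bins

def bump(f0, amp, sd=100.0):
    """Gaussian resonance bump, amplitude in dB."""
    return amp * np.exp(-((freqs - f0) ** 2) / (2.0 * sd ** 2))

# The steep tilt swamps the small 1200 Hz bump, leaving only a shoulder.
spectrum = -0.06 * freqs + bump(600.0, 25.0) + bump(1200.0, 8.0)

win = 81                                            # 81 bins * 10 Hz ~ 800 Hz
running_avg = np.convolve(spectrum, np.ones(win) / win, mode="same")
masked = spectrum - running_avg                     # the "masked spectrum"
```

Because the running average of a linear tilt is (away from the edges) the tilt itself, the subtraction removes the tilt and leaves mostly the energy near the bumps, which is where the shoulder becomes a pickable peak.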
_______________________________________________
Figure 13
_______________________________________________
The display on top is a normal FFT spectrogram of the utterance, "OK, Hello.", and the
display below shows what the utterance looks like after subtraction of the running average. It can
be seen that most of what is left after the subtraction of the running average is the energy in the
vicinity of spectral peaks; overall spectral tilt has been removed, as has the energy in the valleys between
spectral peaks. However, although the formants are quite visible, there are many peaks in the
masked spectrum that clearly do not correspond to formants.
We are just beginning a series of experiments that are intended to test the validity of the MPR.
One specific claim that we are making is that the MPR retains those aspects of the original speech
signal that are most relevant to phonetic quality. As one test of this claim, we resynthesized signals
from the /hVd/ database from masked spectra. The synthesizer that was used in this experiment was
a damped sinewave vocoder. The vocoder is driven by three kinds of information:
1. Spectral Peaks
2. Instantaneous Fundamental Frequency
3. A Measure of the Degree of Periodicity
_______________________________________________
Figure 14
_______________________________________________
The spectral peaks that drive the synthesizer are extracted from the masked spectrum. The
signal then is resynthesized by summing exponentially damped sinusoids at frequencies
corresponding to the spectral peaks. This figure shows three damped sinusoids that would be
summed to produce the vowel /ɑ/. If the periodicity measure indicates that the speech is voiced, the
damped sinusoids are repeated at a rate corresponding to the measured fundamental period. For
unvoiced speech, the damped sinusoids are pulsed on and off at random intervals.
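In outline, the voiced branch of the synthesis looks something like the sketch below. The peak frequencies, nominal bandwidth, and f0 are illustrative; the actual vocoder follows measured peak and pitch contours frame by frame rather than holding them fixed.

```python
import numpy as np

def damped_sine_synthesis(peaks, f0, dur, fs=10000, bw=80.0):
    """Sum exponentially damped sinusoids at the given peak frequencies,
    restarted once per fundamental period (voiced branch only)."""
    n_period = int(round(fs / f0))          # samples per pitch period
    t = np.arange(n_period) / fs
    # One period: each spectral peak contributes a damped sinusoid whose
    # decay rate corresponds to a nominal formant bandwidth.
    period = np.zeros(n_period)
    for f in peaks:
        period += np.exp(-np.pi * bw * t) * np.sin(2.0 * np.pi * f * t)
    n_periods = int(dur * fs / n_period)
    return np.tile(period, n_periods)

# Illustrative peaks roughly in the range of an /a/-like vowel, f0 = 125 Hz.
y = damped_sine_synthesis(peaks=[700.0, 1200.0, 2600.0], f0=125.0, dur=0.3)
```

For unvoiced speech the same damped sinusoids would instead be pulsed at random intervals, as described above.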
__________________________________________________________________
Audio Tape: Damped Sinewave Synthesis Demo
__________________________________________________________________
The vowel intelligibility tests that we conducted with the damped sinewave synthesizer used
the same 300 signals from the /hVd/ database that were used in the formant synthesis experiment
described previously. The degree-of-periodicity measure that is normally used to control the
voice/unvoice balance in the damped sinewave vocoder was not used in this experiment: The
damped sinusoids were pulsed at random intervals during the /h/, and at intervals derived from the
hand-edited fundamental frequency measurements during the remainder of the syllable.
Listeners were asked to identify: (1) the original, naturally produced utterances, (2)
utterances that were resynthesized from the MPR spectra using the damped sinewave vocoder, and
(3) formant synthesized utterances that followed the measured pitch and formant contours.
_______________________________________________
Figure 15
_______________________________________________
There are several details of the labeling results with these signals that are of some interest,
but in the limited time that is available we will only be able to take a brief look at overall
intelligibility. The main finding here is that the signals that were generated from the MPR-derived
spectral peaks are quite intelligible, although less intelligible than the original signals. The main
point to be made about these results is that vowel quality (and probably phonetic quality in general)
is maintained quite well if spectral peaks are preserved. Further, the spectral peaks do not
necessarily need to correspond to formants. We should point out that we do not regard this as an
especially strong test of the validity of the MPR. A stronger test would involve demonstrating that
phonetic distance judgments such as those reported in Klatt's (1982a) study can be predicted well
from masked spectra. We are currently developing a distance measure using masked spectra that
could be used in an experiment of this kind. We hope to have results to report soon.
REFERENCES
Aikawa, K., Singer, H., Kawahara, H., and Tohkura, Y. (1993). A dynamic cepstrum incorporating
time-frequency masking and its application to continuous speech recognition. ICASSP-93,
668-671.
Assmann, P., Nearey, T., and Hogan, J. (1982). Vowel identification: orthographic, perceptual, and
acoustic aspects. Journal of the Acoustical Society of America, 71, 975-989.
Bauer, L. (1985). Tracing phonetic change in the received pronunciation of British English. J.
Phon., 13, 61-81.
Bennett, D.C. (1968). Spectral form and duration as cues in the recognition of English and German
vowels. Lang. & Speech, 11, 65-85.
Bennett, S. (1983). A three-year longitudinal study of school-aged children's fundamental
frequencies. J. Speech Hear. Res., 26, 137-142.
Bladon, A. (1982). Arguments against formants in the auditory representation of speech. In R.
Carlson and B. Granstrom (Eds.) The Representation of Speech in the Peripheral Auditory
System, Amsterdam: Elsevier Biomedical Press, 95-102.
Bladon, A. and Lindblom, B. (1981). Modeling the judgment of vowel quality differences. J.
Acoust. Soc. Am., 69, 1414-1422.
Hedlin, P. (1982). A representation of speech with partials. In R. Carlson and B. Granstrom (Eds.)
The Representation of Speech in the Peripheral Auditory System, Amsterdam: Elsevier
Biomedical Press, 247-250.
Hillenbrand, J.M., Clark, M.J., and Houde, R.A. (2000). Some effects of duration on vowel
recognition. Journal of the Acoustical Society of America, 108, 3013-3022.
Hillenbrand, J., and Gayvert, R.T. (1993). Vowel classification based on fundamental frequency
and formant frequencies. Journal of Speech and Hearing Research, 36, 694-700.
Hillenbrand, J., and Gayvert, R.T. (1993). Identification of steady-state vowels synthesized from
the Peterson and Barney measurements. Journal of the Acoustical Society of America, 94,
668-674.
Hillenbrand, J., Getty, L.A., Clark, M.J., and Wheeler, K. (1995). Acoustic characteristics of
American English vowels. Journal of the Acoustical Society of America, 97, 1300-1313.
Hillenbrand, J.M., and Nearey, T.M. (1999). Identification of resynthesized /hVd/ syllables: Effects
of formant contour. Journal of the Acoustical Society of America, 105, 3509-3523.
Jenkins, J.J., Strange, W., and Edman, T.R. (1983). Identification of vowels in 'vowelless' syllables.
Percept. Psychophys., 34, 441-450.
Klatt, D.H. (1982a). Prediction of perceived phonetic distance from critical-band spectra: A first
step. IEEE ICASSP, 1278-1281.
Klatt, D.H. (1982b). Speech processing strategies based on auditory models. In R. Carlson and B.
Granstrom (Eds.) The Representation of Speech in the Peripheral Auditory System,
Amsterdam: Elsevier Biomedical Press, 181-196.
Labov, W., Yaeger, M., and Steiner, R. (1972). "A quantitative study of sound change in progress,"
Report on National Science Foundation Contract NSF-FS-3287.
Miller, J.D. (1989). Auditory-perceptual interpretation of the vowel. Journal of the Acoustical
Society of America, 85, 2114-2134.
Nearey, T.M. (1989). Static, dynamic, and relational properties in vowel perception. Journal of the
Acoustical Society of America, 85, 2088-2113.
Nearey, T.M. (1992). Applications of generalized linear modeling to vowel data. In J. Ohala, T.
Nearey, B. Derwing, M. Hodge, G. Wiebe (Eds.), Proceedings of ICSLP 92, 583-586.
Edmonton, AB: University of Alberta.
Nearey, T.M., Hogan, J., and Rozsypal, A. (1979). Speech signals, cues and features. In G. Prideaux
(Ed.) Perspectives in Experimental Linguistics. Amsterdam: John Benjamins.
Nearey, T.M., and Assmann, P. (1986). Modeling the role of vowel inherent spectral change in
vowel identification. Journal of the Acoustical Society of America, 80, 1297-1308.
Peterson, G., and Barney, H.L. (1952). Control methods used in a study of the vowels. Journal of
the Acoustical Society of America, 24, 175-184.
Strange, W. (1989). Dynamic specification of coarticulated vowels spoken in sentence context.
Journal of the Acoustical Society of America, 85, 2135-2153.
Strange, W., Jenkins, J.J., and Johnson, T.L. (1983). Dynamic specification of coarticulated vowels.
Journal of the Acoustical Society of America, 74, 695-705.
Syrdal, A.K. (1985). Aspects of a model of the auditory representation of American English
vowels. Speech Communication, 4, 121-135.
Syrdal, A.K., and Gopal, H.S. (1986). A perceptual model of vowel recognition based on the
auditory representation of American English vowels. Journal of the Acoustical Society of
America, 79, 1086-1100.
Zahorian, S., and Jagharghi, A. (1986). Matching of 'physical' and 'perceptual' spaces for vowels.
Journal of the Acoustical Society of America, Suppl. 1, 79, S8.
Zahorian, S., and Jagharghi, A. (1993). Spectral shape features versus formants as acoustic
correlates for vowels. Journal of the Acoustical Society of America, 94, 1966-1982.
Figure 1.
Figure 2.
Figure 3.
Figure 4.
Figure 5.
Figure 6.
Figure 7.
Figure 8. (Note: This study has been completed and the results
published in Hillenbrand & Nearey, JASA, 1999)
Formant Representations

Basic Idea:
Phonetic quality is controlled NOT by the
detailed shape of the spectrum, but by the
formant frequency pattern.

Advantage:
Formants account for a large number of
findings in speech perception. The high
intelligibility of formant synthesis (and
even sinewave speech) confirms that
formants convey a great deal of phonetic
information.

Problems:
(1) Formant tracking is an unresolved problem.
(2) Many listening errors are inconsistent with formants.

Figure 9.
Spectral Shape Representations

Basic Idea:
Phonetic quality is controlled NOT by the
formant pattern, but by the overall shape
of the spectrum.

Advantages:
(1) Does not require formant tracking.
(2) Accommodates data on listening errors (and vowel similarity judgments).

Problem:
Large alterations can be made in spectral
shape without affecting phonetic quality
as long as the formant frequencies are
preserved.

Figure 10.
Klatt (1982) Experiment
-----------------------------------------------------------------
                                           Psy     Phon
-----------------------------------------------------------------
Spectral Shape Manipulations
-----------------------------------------------------------------
 1. Random Phase                          20.0      2.9
 2. High-Pass Filter                      15.8      1.6
 3. Low-Pass Filter                        7.6      2.2
 4. Spectral Tilt                          9.6      1.5
 5. Overall Amplitude                      4.5      1.5
 6. Formant Bandwidth                      3.4      2.4
 7. Spectral Notch                         1.6      1.5
-----------------------------------------------------------------
Formant Frequency Manipulations
-----------------------------------------------------------------
 8. F1 ±8%                                 6.5      6.2
 9. F2 ±8%                                 5.0      6.1
10. F1, F2 ±8%                             8.2      8.6
11. F1, F2, F3, F4, F5 ±8%                10.7      8.1
-----------------------------------------------------------------
Psy: Perceived Psychophysical Distance
Phon: Perceived Phonetic Distance
Figure 11.
Figure 12.
Figure 13.
Figure 14.
________________________________________________________________________________
IDENTIFICATION RESULTS
________________________________________________________________________________
ORIGINAL SIGNALS       96.9
MPR SYNTHESIS          90.9
FORMANT SYNTHESIS      88.7
________________________________________________________________________________
Figure 15.