Acoustic Properties of Vocalic Nuclei Associated with Prosodic Stress in
Spontaneous American English Discourse
By Leah Hitchcock
Introduction and Background
As speech recognition and speech synthesis technology continue to improve, it
becomes increasingly important in the field of linguistics to understand all the details of
speech such as the cues in the sound wave that signal accent in stress-accent languages
like English. Until very recently, it was generally accepted that f0 variation, or pitch
change, was the main factor involved in English stress. However, recent research has
shown that in fact f0 variation is not the only indicator of prosodic stress-accent, or even
the primary cue (Silipo and Greenberg, 1999 & 2000, Beckman 1986). Even as early as
1955 there was evidence that duration and amplitude are cues for perceived stress (Fry,
1955). Stress-accent languages, like English, differ from pitch-accent languages like
Japanese in that they use other features in addition to pitch change to denote phrasal
accent (Beckman, 1986). While there is much evidence that f0 change is an important
cue of stress, research on automatic stress labeling shows that labeling algorithms
are most accurate when they use a combination of duration and amplitude rather than
pitch change (Silipo and Greenberg, 1999 and 2000; Van Kuijk and Boves, 1999).
This suggests that while stressed syllables do tend to have pitch variation, it may
not be the most important cue of stress in English, and that the most important cue is
actually duration, or a combination of duration and amplitude (Silipo and Greenberg,
1999 and 2000; Beckman, 1986; Van Kuijk and Boves, 1999). Further, it has been
suggested that some of the pitch differences exhibited in stressed syllables are an artifact
of duration: longer segments have more time for variation of features like f0 (Silipo and
Greenberg 1999). This is not to say that f0 variation isn’t an important cue of
stress-accent; it is. However, it is clear that vowel duration and amplitude also play a role in
determining which vowels are perceived as stressed, and automatic speech recognition
systems could benefit greatly from an understanding of exactly how different factors
affect the realization of stress-accent in spontaneous speech.
Many studies have looked into the phenomenon of reduction of unstressed vowels
(e.g. Lindblom, 1963, van Bergem, 1993, Koopmans-van Beinum, 1987, Fourakis, 1991,
Engstrand, 1988). Typically, vowels that become reduced in speech are short and
unstressed. Vowel reduction is not necessarily a trend toward centralization of the vowel
(see Lindblom, 1963), as other factors, such as consonantal context, also seem to play a
role in the shape of the formant trajectories of unstressed vowels. An increased
knowledge of the acoustics of stressed and unstressed vowels could be very useful in the
study of vowel reduction in spontaneous speech, which in turn should aid the fields of
speech recognition and speech synthesis.
This paper explores the roles of duration and amplitude in determining whether
specific vowels are perceived as stressed in spontaneous American English dialogues.
Based on previous research, we would expect duration to play the clearest role in
determining perceived stress-accent, and that amplitude would play a secondary role for
most vowels (Silipo and Greenberg, 1999 and 2000, Van Kuijk and Boves, 1999,
Beckman, 1986, Fry, 1955). However, most previous research has been done on lab
speech, so a major purpose of this study is to determine general patterns related to
stress-accent in spontaneous speech, and to determine to what extent these patterns support the
results of past research on lab speech.
The concept of stress is difficult to define. Many different definitions exist (see
Asher, 1994 and Crystal, 1992), so it is important to clarify what type of stress this paper
deals with. Lexical, or word stress, is the type most commonly referred to. Lexical stress
is canonical, and is what is marked in pronunciation guides in dictionaries. Phrasal stress
refers to which syllables are accented within an utterance. “While this might often be
related to the grammatical structure of the utterance, one cannot predict when the
grammatical structure will be reflected phonetically. The speaker may decide to
recognize a syntactic unit or to overlook it,” (Asher, 1994:4357). Semantics, syntax, and
other factors affect which syllables actually end up as being accented in an utterance (see
Vanderslice and Ladefoged, 1972). This paper deals with phrasal stress-accent rather
than lexical stress. In other words, rather than the dictionary’s markings of lexical stress,
the stress patterns that are realized phonetically in dialogues are what are being studied.
This encompasses many different types of stress; sometimes a syllable that is expected,
according to canonical lexical stress, to be accented is in fact accented in an utterance,
and sometimes a syllable is accented for other reasons, such as emphasis. Because there
are not definitive rules that determine the conditions under which a person will stress a
certain syllable, the data are based on perceived stress.
Methods and Corpus Data
The data for this paper are from the Switchboard corpus. The Switchboard corpus
contains over 140 hours worth of short telephone conversations on various topics. For
this paper, approximately 54 minutes worth of utterances were transcribed at the phonetic
segment level by University of California, Berkeley, linguistics students. The level of
stress-accent of each syllabic nucleus was also manually labeled by an independent set
of transcribers. Of the 54 minutes of speech, 45.43 minutes were analyzable, the
remainder being filled pauses, stutters, and other non-speech. The usable 45.43 minutes
consist of 9,922 words, 13,446 syllables, and 33,370 phones in 674 utterances. The
average length of an utterance was 4.76 seconds, the average number of words per
utterance was 18.5 and the average number of syllables was 23.25. (Only utterances
between 2 and 17 seconds were used, and about 60% of these were between 4 and 8
seconds. The number of words ranged from 2 to 64, and the number of syllables from 5
to 81.) The corpus includes 581 speakers, of whom 288 were female and 293 were male.
Most speakers were represented by only a single utterance in the data set; a few had
multiple utterances that were used. Speakers were asked for the region they lived in
during their formative years in order to ascertain what dialect of American English they
speak. The number of speakers for each dialect region is given in Table 1.

Number of speakers by dialect region
Dialect region    Speakers
New England       37
New York          41
North             88
North Mid         89
South             65
South Mid         147
West              72
Mixed             42
Table 1. Total number of speakers from each of the seven dialect regions, as well as
mixed dialect. Mixed speakers are people who moved around a lot during their formative
years.
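As a quick consistency check on the figures above, the regional counts in Table 1 can be summed and compared with the reported number of speakers. The following is only an illustrative sketch; the variable names are my own and are not part of the original analysis.

```python
# Speaker counts by dialect region, copied from Table 1.
speakers_by_region = {
    "New England": 37, "New York": 41, "North": 88, "North Mid": 89,
    "South": 65, "South Mid": 147, "West": 72, "Mixed": 42,
}

# Gender counts as reported in the text.
female, male = 288, 293

total = sum(speakers_by_region.values())

# Share of speakers per region, in percent (one decimal place).
shares = {region: round(100 * n / total, 1)
          for region, n in speakers_by_region.items()}

print("speakers in Table 1:", total)
print("female + male:", female + male)
for region, share in shares.items():
    print(f"{region:12s} {share:5.1f}%")
```

Both totals come to the 581 speakers reported for the corpus, and South Mid is the largest single group.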
Of the actual speech data, 769 syllables from the utterances were not used because
they had syllabic consonants (such as “el”) for their nucleus rather than a vocalic nucleus.
631 filled pauses (“um,” “uh,” etc.) were also excluded from analysis because of the
drastically different patterns exhibited in filled pauses compared to all other words in the
data set.
The data were phonetically transcribed by three individuals using a variant of
Arpabet (which was originally used to label the TIMIT corpus) (Greenberg 1997). The
interlabeler agreement was approximately 74%. When transcribers disagreed on the
identity of a vocalic segment, it was generally only a slight disagreement, i.e. one level of
frontness or height. It was rare for transcribers to disagree on whether a segment was a
diphthong or a monophthong.
Two different individuals marked the corpus material for stress-accent. I was one
of the transcribers, but I was hired to do the transcription work long before the idea for
this paper came about, so the fact that half of the labeling was done by me should not
create a bias in the data. The material was marked for three levels of accent: fully
accented, completely unaccented, and an intermediate level. The intermediate level
includes all syllables judged to be not primary stress, but not completely lacking stress
either. Fully unaccented vowels were not necessarily reduced, but most of the
occurrences of [ix] and [ax], which are both types of schwa, fell into the fully unaccented
category (see Table 2). Many of these were probably reduced forms of other vowels;
since the corpus was transcribed phonetically rather than canonically the proportion of
reduced vowels is not known. The nuclei were labeled according to perceived
stress-accent, not according to knowledge of canonical (dictionary-based) lexical stress. The
transcribers and a supervisor met weekly to ensure that the proper criteria were being used
for labeling. Other research has relied on modifications of canonical lexical stress
patterns when studying prosody (e.g. Van Kuijk and Boves, 1999, Beckman, 1986), but
since spontaneous speech differs greatly from canonical speech, perceived stress-accent
should be a more accurate representation of the way people speak. When speaking,
people place more emphasis on words or ideas that are most important to the message
they are trying to express. Labeling perceived stress-accent therefore enables someone
reading a transcription to see what the speaker intended to stress in his or her speech,
something that lexical stress does not convey.
All material used was labeled by both transcribers, and the stress-accent markings
were averaged. As with the phonetic transcription, the transcribers generally agreed.
Interlabeler agreement was 85% for unstressed nuclei, 78% for fully stressed nuclei, and
95% for any level of accent (both transcribers attributed some amount of stress to the
nucleus). When the transcribers disagreed, it was usually by only one step (i.e. fully
stressed vs. intermediate, not fully stressed vs. fully unstressed). Generally when the
transcribers disagreed, a third observer attested to the ambiguity of the level of accent of
the syllable in question. Table 2 includes the averaged duration and amplitude data for
each level of stress accent, with 0 referring to completely unaccented nuclei, 0.5 to the
intermediate level, and 1 referring to fully accented nuclei. Levels 0.25 and 0.75 are the
result of the averaging of the two transcriptions.
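The averaging step described above can be sketched as follows. This is a minimal illustration under my own assumptions: the function names are hypothetical, the six labels are invented for the example and are not corpus data, and each transcriber is assumed to use only the three levels 0, 0.5 and 1.

```python
def average_labels(labels_a, labels_b):
    """Average two transcribers' stress labels (each 0, 0.5 or 1) per nucleus."""
    return [(a + b) / 2 for a, b in zip(labels_a, labels_b)]


def disagreement_steps(labels_a, labels_b):
    """Number of labeling steps (of size 0.5) separating the two transcribers."""
    return [round(abs(a - b) / 0.5) for a, b in zip(labels_a, labels_b)]


# Hypothetical labels for six nuclei (not taken from the corpus).
transcriber_1 = [0, 0.5, 1, 1, 0, 0.5]
transcriber_2 = [0, 1, 1, 0.5, 0, 0]

averaged = average_labels(transcriber_1, transcriber_2)
steps = disagreement_steps(transcriber_1, transcriber_2)

print(averaged)  # [0.0, 0.75, 1.0, 0.75, 0.0, 0.25]
print(steps)     # [0, 1, 0, 1, 0, 1]
```

A one-step disagreement therefore yields 0.25 or 0.75, which is why those two levels appear in the tables even though neither transcriber used them directly.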
The duration of the segments was computed from the hand-labeled material.
About a third of the material was hand-segmented by transcribers and the remainder was
automatically segmented using 72 minutes of hand segmented material to train the
automatic labeler (Greenberg, Chang, and Hollenback, 2000). The amplitude, expressed
in terms of log base e, of each segment’s pressure waveform was computed and
normalized relative to the mean of the entire utterance (Greenberg, Chang and
Hollenback, 2000). Integrated energy has been shown to be the most accurate means of
determining the level of stress-accent for a vocalic segment in past research (Silipo and
Greenberg, 1999 and 2000) as it reflects both duration and amplitude. For the purposes
of this paper, an approximation of integrated energy was calculated for each vocalic
nucleus by multiplying the duration in milliseconds by the normalized amplitude.
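The computation described above can be sketched roughly as follows. The text does not spell out the exact normalization, so this sketch makes several assumptions that should not be read as the procedure actually used: amplitude is taken as the natural log of the RMS of the segment's samples, normalization is a ratio to the mean log amplitude of the segments in the utterance, and all names and sample values are hypothetical.

```python
import math


def log_amplitude(samples):
    """Natural log of the RMS amplitude of one segment's waveform samples."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return math.log(rms)


def normalize(segment_log_amps):
    """Divide each log amplitude by the utterance mean (assumed scheme)."""
    mean = sum(segment_log_amps) / len(segment_log_amps)
    return [a / mean for a in segment_log_amps]


def energy_approx(duration_ms, normalized_amp):
    """Approximate integrated energy: duration (ms) times normalized amplitude."""
    return duration_ms * normalized_amp


# Hypothetical utterance of three segments with made-up sample values.
segments = [[500, -500, 500, -500],
            [1000, -1000, 1000, -1000],
            [2000, -2000, 2000, -2000]]
durations_ms = [60, 100, 140]

normalized = normalize([log_amplitude(s) for s in segments])
energies = [energy_approx(d, a) for d, a in zip(durations_ms, normalized)]

for d, a, e in zip(durations_ms, normalized, energies):
    print(f"duration {d:3d} ms  normalized amplitude {a:.3f}  product {e:.1f}")
```

Under this ratio scheme the normalized values cluster around 1, so the product is dominated by duration and adjusted up or down by relative loudness.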
Duration (ms)
Stress    0     0.25  0.5   0.75  1     all
[iy]      78    98    114   122   132   100
[ey]      90    94    122   130   155   129
[ay]      108   113   126   143   174   141
[aw]      103   121   150   156   203   168
[ow]      102   117   126   150   170   136
[uw]      70    101   104   153   152   103
[ih]      65    78    86    89    95    75
[ix]      49    53    51                50
[eh]      67    82    79    97    96    82
[ah]      77    89    96    102   115   93
[ax]      54    78    76    62    70    56
[uh]      61    74    71    70    78    67
[ae]      91    113   123   144   165   137
[aa]      86    94    110   116   134   114
[ao]      100   79    87    107   143   115

Amplitude (normalized log)
Stress    0     0.25  0.5   0.75  1     all
[iy]      0.96  0.97  0.99  0.99  1.02  0.98
[ey]      0.99  1.01  1.03  1.03  1.05  1.03
[ay]      1     1.02  1.03  1.05  1.08  1.04
[aw]      1.04  1.02  1.05  1.05  1.06  1.05
[ow]      0.98  1     1.02  1.04  1.07  1.03
[uw]      0.95  0.96  0.97  0.98  1.03  0.98
[ih]      0.96  1     1.01  1.02  1.06  0.99
[ix]      0.92  0.97  1.01              0.92
[eh]      0.97  1.02  1.03  1.05  1.08  1.02
[ah]      0.98  1.02  1.03  1.05  1.08  1.03
[ax]      0.94  1     1.03  1.04  1.09  0.95
[uh]      0.97  1.02  1.05  1.05  1.09  1.01
[ae]      0.98  1.02  1.03  1.04  1.07  1.04
[aa]      1     1.03  1.05  1.07  1.09  1.06
[ao]      1     1     1.03  1.04  1.08  1.05

Duration x Amplitude
Stress    0     0.25  0.5   0.75  1     all
[iy]      75    95    111   120   134   97
[ey]      90    94    126   132   162   132
[ay]      108   115   129   149   186   147
[aw]      105   122   157   162   213   175
[ow]      100   116   129   155   182   140
[uw]      68    98    99    151   156   101
[ih]      62    78    86    91    101   74
[ix]      45    52    52                46
[eh]      66    83    81    101   104   85
[ah]      75    90    98    107   124   95
[ax]      51    77    77    65    75    53
[uh]      59    75    75    73    85    68
[ae]      88    113   126   148   175   142
[aa]      86    96    115   123   144   121
[ao]      102   80    91    112   154   122

% of total occurrences
Stress    0     0.25  0.5   0.75  1     total
[iy]      44.8  14.3  13.4  9.3   18.2  1270
[ey]      16.4  9.1   17.3  18.1  39    525
[ay]      16.6  12.8  19.7  14.7  36.2  790
[aw]      8     9.6   15.5  23    43.9  187
[ow]      22.6  15    17.6  13.8  31    646
[uw]      49.4  7.3   10.9  8.6   23.8  478
[ih]      56.7  13    9.9   7.4   12.9  2126
[ix]      89.1  7.4   2.3               433
[eh]      37    10.8  11.7  12    28.6  1217
[ah]      35.6  14.4  15.6  12    22.5  1060
[ax]      89.3  6.7   2.4   0.8   0.8   1729
[uh]      54    11.3  11.3  8.8   14.6  328
[ae]      16.3  11.2  15.8  15.3  41.4  823
[aa]      17    12.5  14.5  14.8  41.3  690
[ao]      13.4  6.8   17.7  21.1  41    351

[oy] (24 occurrences) has values at only three stress levels, and the source layout
does not preserve which levels they are. In source order: duration 98, 111, 168
(all: 154); amplitude 0.97, 1.04, 1.06 (all: 1.04); duration x amplitude 94, 114, 177
(all: 161); % of total occurrences 16.7, 4.2, 79.2.

Table 2. The relationship of stress accent to several acoustic properties of vocalic nuclei.
The vowels are grouped into the categories of diphthongs, lax monophthongs, and tense
monophthongs, because the vowels within each of these groups tend to behave similarly.
The first group of data is the average duration of the vowels, in milliseconds, for each
stress level and the intrinsic duration of the vowel. The second group of data is the
normalized amplitude for each stress level as well as the intrinsic amplitude. The
amplitude for each vocalic segment was normalized (log base e) with respect to the entire
utterance within which it occurred. The third group is the simple product of duration
times amplitude. The duration and amplitude of each vowel were multiplied together,
and then the average of the products was calculated. The final group of data is the
percentage of the time each vowel appears at each level of stress accent, along with a
column denoting the total number of occurrences of each vowel. Blank cells in the table
indicate fewer than four occurrences of the vowel for that stress level.
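Because the products were formed token by token and then averaged, the duration x amplitude entries need not equal the product of the average duration and the average amplitude. The sketch below illustrates the difference with three invented tokens; the numbers are hypothetical and are not corpus data.

```python
from statistics import mean

# Hypothetical tokens of one vowel at one stress level (not corpus data).
durations_ms = [60, 80, 140]
amplitudes = [0.94, 0.98, 1.08]

# Average of the per-token products (the order of operations in the caption).
mean_of_products = mean(d * a for d, a in zip(durations_ms, amplitudes))

# Product of the two averages (what multiplying two table cells would give).
product_of_means = mean(durations_ms) * mean(amplitudes)

print(f"mean of products: {mean_of_products:.1f}")
print(f"product of means: {product_of_means:.1f}")
```

The two quantities differ whenever duration and amplitude covary across tokens, which is why the third group of data in Table 2 cannot be recomputed exactly from the first two.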
Results
The data reveal several patterns about stress-accent in English. First, duration and
amplitude clearly play a large role in determining whether a vowel is perceived as
stressed. Second, high vowels, which are intrinsically shorter and quieter than low
vowels, (see also Lehiste and Peterson, 1959 and 1960, Beckman, 1986, Black 1949) also
tend not to be perceived as stressed nearly as often as their lower counterparts (Hitchcock
& Greenberg, 2001). Third, the shortest vowels, i.e. the lax monophthongs, tend to
exhibit greater amplitude differences than the intrinsically longer tense monophthongs
and diphthongs. These tendencies are very consistent, with a few minor exceptions that
appear to be the result of an insufficient amount of data (see Table 2).
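The third pattern can be checked directly against the amplitude values in Table 2. The sketch below is an illustration only and was not part of the original analysis. It takes the amplitude range of a vowel to be its normalized amplitude at stress level 1 minus that at level 0, and it assumes the grouping follows the row order of the table: diphthongs first, then lax monophthongs, then tense monophthongs. [oy] and [ix] are left out because they lack values at one or both levels.

```python
from statistics import mean

# Normalized amplitude at stress levels 0 and 1, copied from Table 2.
amplitude = {
    "iy": (0.96, 1.02), "ey": (0.99, 1.05), "ay": (1.00, 1.08),
    "aw": (1.04, 1.06), "ow": (0.98, 1.07), "uw": (0.95, 1.03),
    "ih": (0.96, 1.06), "eh": (0.97, 1.08), "ah": (0.98, 1.08),
    "ax": (0.94, 1.09), "uh": (0.97, 1.09),
    "ae": (0.98, 1.07), "aa": (1.00, 1.09), "ao": (1.00, 1.08),
}

# Grouping assumed from the row order of Table 2.
groups = {
    "diphthongs": ["iy", "ey", "ay", "aw", "ow", "uw"],
    "lax monophthongs": ["ih", "eh", "ah", "ax", "uh"],
    "tense monophthongs": ["ae", "aa", "ao"],
}


def amplitude_range(vowel):
    """Fully accented minus fully unaccented normalized amplitude."""
    unaccented, accented = amplitude[vowel]
    return accented - unaccented


mean_range = {name: mean(amplitude_range(v) for v in vowels)
              for name, vowels in groups.items()}

for name, value in mean_range.items():
    print(f"{name:20s} {value:.3f}")
```

Under these assumptions the lax monophthongs show the largest average range, followed by the tense monophthongs and then the diphthongs.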
Figure 1. (reprinted from Hitchcock and Greenberg, 2001)
The graphs above show the relationship between the position of the tongue and
factors associated with prosodic stress-accent. The Y-axis shows the factor (duration,
amplitude, or duration x amplitude) being measured for each graph. The Y-axes were
inverted in order to show the relationship of vowel height to the factor in question. The
graphs in the first column show normalized amplitude, log base e, which was calculated
with respect to the average amplitude of the entire utterance in which the vowel occurs.
The graphs in the second column show the duration of the vowels in milliseconds. The
graphs in the third column show the product of duration times amplitude. The X-axis of
all of the graphs shows the vowels, either diphthongs or monophthongs, arranged
approximately by the horizontal tongue position of the vowel. The resulting shape of the
graphs is strikingly similar to that of a vowel space chart, suggesting that there is a very
close relationship between place of articulation and factors such as vowel duration and
amplitude.
Figure 2. Reprinted from Hitchcock and Greenberg, 2001.
The proportion of fully accented and fully unaccented monophthongs and diphthongs.
Along the X-axis are the vowels arranged approximately by horizontal position of the
tongue. The Y-axis shows the proportion in terms of percentage of the total number of
occurrences of the vowel. The Y-axis is inverted for the graphs showing proportion of
fully stressed nuclei to better show the relationship of vowel height to stress-accent.
There is a clear relationship between vowel height and stress accent. High vowels
tend to be accented much less often than low vowels. This is true of both the diphthongs
and monophthongs. This tendency is consistent with the hypothesis that vowel duration
is one of the main cues of stress accent in English because the low vowels, which tend to
be longer, also tend to be accented more often. Vowel height, duration and stress-accent
appear to be closely tied together. It is clear from looking at the figures and tables that
there is a correlation between these three factors. To a lesser degree, vowel amplitude is
also associated with vowel height and stress-accent, but the correlation between duration
and stress-accent is clearer and more consistent. What is not entirely clear about the
relationship between stress-accent and vowel height, duration and amplitude is whether
the correlation is due to the fact that low vowels tend to be intrinsically longer and louder
than high vowels, and therefore are perceived as stressed more often, or vice versa. We
know that long, loud vowels tend to be perceived as stressed, and that low vowels are
longer and louder than high vowels, and more often stressed than high vowels, but which
tendency is responsible for the other is not clear.
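The association between intrinsic duration and how often a vowel is fully accented can be quantified from Table 2. The sketch below is a rough illustration rather than part of the original analysis: it computes a Pearson correlation across vowels between the overall ("all") mean duration and the percentage of occurrences at stress level 1. [oy] and [ix] are omitted because their rows are incomplete, and the helper function is my own.

```python
import math

# (overall mean duration in ms, % of occurrences at stress level 1), from Table 2.
table2 = {
    "iy": (100, 18.2), "ey": (129, 39.0), "ay": (141, 36.2), "aw": (168, 43.9),
    "ow": (136, 31.0), "uw": (103, 23.8), "ih": (75, 12.9), "eh": (82, 28.6),
    "ah": (93, 22.5), "ax": (56, 0.8), "uh": (67, 14.6), "ae": (137, 41.4),
    "aa": (114, 41.3), "ao": (115, 41.0),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


durations = [d for d, _ in table2.values()]
pct_accented = [p for _, p in table2.values()]
r = pearson(durations, pct_accented)

print(f"r = {r:.2f} across {len(table2)} vowels")
```

The coefficient is strongly positive, but as noted above it describes an association only and does not show which tendency is responsible for the other.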
Problems
The patterns exhibited in the data are strikingly clear. However, they deal with
averages: average duration, and average amplitude for a given category. Therefore,
standard deviations were calculated for all of the duration and amplitude data (see Tables
3 and 4).
Standard Deviations of Vowel Durations (dur in ms, sd in seconds)
Stress    0            0.25         0.5          0.75         1            all
          dur   sd     dur   sd     dur   sd     dur   sd     dur   sd     dur   sd
[iy]      78    0.035  98    0.05   114   0.057  122   0.068  132   0.066  100   0.055
[ey]      90    0.049  94    0.038  122   0.054  130   0.057  155   0.074  129   0.066
[ay]      108   0.059  113   0.048  126   0.049  143   0.066  174   0.075  141   0.069
[aw]      103   0.077  121   0.043  150   0.051  156   0.057  203   0.074  168   0.073
[ow]      102   0.043  117   0.052  126   0.053  150   0.076  170   0.084  136   0.071
[uw]      70    0.046  101   0.063  104   0.052  153   0.089  152   0.099  103   0.077
[ih]      65    0.034  78    0.036  86    0.041  89    0.039  95    0.055  75    0.041
[ix]      49    0.023  53    0.019  51    0.016                            50    0.023
[eh]      67    0.038  82    0.043  79    0.035  97    0.047  96    0.049  82    0.044
[ah]      77    0.049  89    0.043  96    0.064  102   0.063  115   0.068  93    0.059
[ax]      54    0.026  78    0.055  76    0.056  62    0.027  70    0.04   56    0.031
[uh]      61    0.036  74    0.053  71    0.036  70    0.05   78    0.049  67    0.042
[ae]      91    0.055  113   0.062  123   0.059  144   0.073  165   0.071  137   0.072
[aa]      86    0.043  94    0.038  110   0.047  116   0.044  134   0.059  114   0.054
[ao]      100   0.058  79    0.047  87    0.05   107   0.057  143   0.072  115   0.066

[oy] has values at only three stress levels, and the source layout does not preserve
which levels they are. In source order: 98 (sd 0.013), 111 (no sd given), 168 (sd 0.075);
all: 154 (sd 0.072). The source also gives a standard deviation of 0.028 for [ix] at
level 0.75 or 1 without a mean, and an unattached value of 0.025 next to the [oy] label.
Table 3. Standard deviations of vowel durations. Each stress-accent level has two
columns, the first being the average duration of the vowel in milliseconds, and the second
being the standard deviation (in seconds) of the mean duration. The vowels are grouped
into the categories of diphthongs, lax monophthongs, and tense monophthongs.
Standard Deviations of Vowel Amplitudes
Stress    0            0.25         0.5          0.75         1            all
          amp   sd     amp   sd     amp   sd     amp   sd     amp   sd     amp   sd
[iy]      0.96  0.08   0.97  0.064  0.99  0.074  0.99  0.066  1.02  0.064  0.98  0.077
[ey]      0.99  0.081  1.01  0.059  1.03  0.056  1.03  0.064  1.05  0.061  1.03  0.068
[ay]      1     0.085  1.02  0.063  1.03  0.057  1.05  0.059  1.08  0.057  1.04  0.07
[aw]      1.04  0.061  1.02  0.084  1.05  0.058  1.05  0.057  1.06  0.058  1.05  0.061
[ow]      0.98  0.085  1     0.074  1.02  0.057  1.04  0.067  1.07  0.056  1.03  0.076
[uw]      0.95  0.08   0.96  0.055  0.97  0.064  0.98  0.084  1.03  0.064  0.98  0.08
[ih]      0.96  0.083  1     0.067  1.01  0.073  1.02  0.073  1.06  0.069  0.99  0.086
[ix]      0.92  0.104  0.97  0.089  1.01  0.095                            0.92  0.105
[eh]      0.97  0.097  1.02  0.058  1.03  0.083  1.05  0.063  1.08  0.059  1.02  0.091
[ah]      0.98  0.079  1.02  0.068  1.03  0.072  1.05  0.058  1.08  0.055  1.03  0.079
[ax]      0.94  0.097  1     0.075  1.03  0.069  1.04  0.048  1.09  0.055  0.95  0.097
[uh]      0.97  0.079  1.02  0.076  1.05  0.062  1.05  0.061  1.09  0.075  1.01  0.086
[ae]      0.98  0.083  1.02  0.076  1.03  0.065  1.04  0.065  1.07  0.061  1.04  0.076
[aa]      1     0.086  1.03  0.071  1.05  0.061  1.07  0.057  1.09  0.062  1.06  0.074
[ao]      1     0.068  1     0.071  1.03  0.067  1.04  0.071  1.08  0.058  1.05  0.071

[oy] has values at only three stress levels, and the source layout does not preserve
which levels they are. In source order: 0.97 (sd 0.054), 1.04 (no sd given), 1.06
(sd 0.049); all: 1.04 (sd 0.058). The source also gives a standard deviation of 0.035
for [ix] at level 0.75 or 1 without a mean, and an unattached value of 0.035 next to the
[oy] label.
Table 4. Standard deviations of vowel amplitudes. Each stress-accent level has two
columns, the first being the normalized average amplitude and the second being the
standard deviation of the mean amplitude. The vowels are grouped into the categories of
diphthongs, lax monophthongs, and tense monophthongs.
These standard deviations show that there is a lot of variation in the durations and
amplitudes of the vowels; however the standard deviations are fairly consistent across
categories, and vowels with higher intrinsic durations and amplitudes tended to have
higher standard deviations. There are many possible reasons for the amount of variance
in the data. An obvious contribution to the variance is the fact that the speakers of the
corpus were of both genders and were from all over the United States, and therefore their
speech can be expected to exhibit gender differences as well as dialect differences.
Additionally, there may have been a slight dialect-bias among the phonetic segment
transcribers for segments such as [ao], which differ according to dialect regions of the
US.
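One way to put the size of this variation in perspective is the coefficient of variation, that is, the standard deviation divided by the mean. The sketch below applies it to the overall ("all") columns of Table 3; it is an illustration added here, not part of the original analysis. Note that Table 3 reports means in milliseconds and standard deviations in seconds, so the standard deviation is converted first. [oy] is omitted.

```python
# Overall mean duration (ms) and standard deviation (s), copied from Table 3.
table3_all = {
    "iy": (100, 0.055), "ey": (129, 0.066), "ay": (141, 0.069),
    "aw": (168, 0.073), "ow": (136, 0.071), "uw": (103, 0.077),
    "ih": (75, 0.041), "ix": (50, 0.023), "eh": (82, 0.044),
    "ah": (93, 0.059), "ax": (56, 0.031), "uh": (67, 0.042),
    "ae": (137, 0.072), "aa": (114, 0.054), "ao": (115, 0.066),
}


def coefficient_of_variation(mean_ms, sd_s):
    """Standard deviation divided by the mean, after converting the sd to ms."""
    return (sd_s * 1000) / mean_ms


cv = {vowel: coefficient_of_variation(mean_ms, sd_s)
      for vowel, (mean_ms, sd_s) in table3_all.items()}

# Print vowels from the smallest to the largest relative spread.
for vowel, value in sorted(cv.items(), key=lambda item: item[1]):
    print(f"[{vowel}] {value:.2f}")
```

For every vowel the standard deviation is between roughly 40% and 75% of the mean duration, so the relative spread is large but of a similar order across vowels.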
Analysis of the data for the different genders shows only very slight differences in
the vowels of female versus male speakers. The vowels of female speakers were on
average 9 milliseconds longer than the vowels of male speakers, with intrinsically short
vowels exhibiting a smaller gender difference and intrinsically long vowels exhibiting a
greater gender difference. Male speakers tended to have a slightly larger dynamic range
of amplitude between accented and unaccented occurrences of each vowel than female
speakers. These two tendencies suggest that female speakers may speak at a slightly
slower rate than male speakers, and that male speakers might utilize amplitude
differences a little more to convey accent. However, since these differences are very
slight, more research would need to be done to substantiate these findings. Overall, the
patterns of female and male speakers did not differ significantly from the patterns for all
speakers (see figures 1 and 3-6).
[Figure 3 image not reproduced. Legend: female, male. Axis ticks: 0 to 0.2. Vowels plotted: iy, uw, ey, ay, aw, ow, oy.]
Figure 3. Duration in seconds, male vs. female speakers, diphthongs. The Y-axis is the
average duration, in seconds. The vowels are arranged along the X-axis approximately
according to horizontal tongue position. The Y-axis was inverted to show the
relationship between vowel height and duration.
[Figure 4 image not reproduced. Legend: female, male. Axis ticks: 0 to 0.16. Vowels plotted: ix, ih, ax, eh, ae, aa, uh, ah, ao.]
Figure 4. Duration in seconds, male vs. female speakers, monophthongs. The Y-axis is
the average duration, in seconds. The vowels are arranged along the X-axis
approximately according to horizontal tongue position. The Y-axis was inverted to show
the relationship between vowel height and duration.
[Figure 5 image not reproduced. Legend: female, male. Axis ticks: 0.14 to -0.02. Vowels plotted: uw, ay, iy, ey, oy, ow, aw.]
Figure 5. Amplitude range for male and female speakers for the diphthongs. Range is
defined as the normalized average energy of the stressed occurrences of the vowel minus
the normalized average energy of the unstressed occurrences. The Y-axis shows the
range, and the vowels are arranged approximately according to horizontal tongue position
along the X-axis.
[Figure 6 image not reproduced. Legend: female, male. Axis ticks: 0 to 0.2. Vowels plotted: ax, ix, ih, eh, ae, aa, uh, ah, ao.]
Figure 6. Amplitude range for male and female speakers for the monophthongs.
Amplitude range is defined as the normalized average amplitude of the stressed
occurrences of the vowel minus the normalized average amplitude of the unstressed
occurrences. The Y-axis shows the range, and the vowels are arranged approximately
according to horizontal tongue position along the X-axis.
Unfortunately, there was an insufficient amount of data from each dialect region
to obtain information about the general patterns associated with each region. The
expectation, had there been more data, is that speakers from certain regions would tend
to, on average, have longer vowels than speakers from other dialect regions, and that
some regions would exhibit a more pronounced amplitude difference between accented
and unaccented vowels than other regions. Theoretically, this would account for some of
the variance.
Most likely, however, the variance among the duration data is due to different
speaking rates among different speakers, and the variance in amplitude data, similarly, is
due to the fact that individual speakers vary the volume of their voices differently. What
is important in this research is not that there is variance among the data; in fact, with the
high number of factors, such as different speakers, genders, and dialects, which
contribute to variance in spontaneous speech, a relatively large amount of variance in the
data is expected. What is interesting about the data is that despite the variance, the
patterns came out so clearly. The study of spontaneous speech is a fairly new area of
linguistics, made possible by extensive improvements to computer technology in recent
years, and the fact that such clear patterns exist is very encouraging to the field of speech
technology, particularly for speech recognition. Understanding the patterns that exist in
actual speech should make possible much more accurate recognition systems than those
that currently exist.
Future research projects could attempt to correct for factors like gender, dialect
and speaking rate in order to understand even better how stress-accent is exhibited in
spontaneous speech, but it is doubtful that the results of such research would contradict
the general patterns shown in this paper. However, an understanding of the factors that
contribute variety to spontaneous speech should have a profound effect on the field of
speech recognition, and therefore deserves attention. This paper is a first look at general
patterns in spontaneous speech; hopefully these patterns will be studied much more
closely in the future.
Conclusions
The data in this paper clearly show a relationship between stress-accent and vowel
height, duration and amplitude in American English. This supports the results of much
recent research on stress accent (Beckman, 1986, Van Kuijk and Boves, 1999, Silipo and
Greenberg, 1999 and 2000) which has shown this relationship in both lab speech and
spontaneous speech to a certain extent. The data on vowel duration and amplitude
support Lehiste and Peterson’s research (1959 and 1960) on intrinsic duration and
amplitude of vowels. While their intrinsic durations, which were based on lab speech, are
much longer than what was shown in spontaneous speech, the overall patterns are the
same. Low vowels and diphthongs tend to be louder and longer than high vowels and
monophthongs. The prosodic data clearly show a close relationship between vowel
duration and stress-accent, as well as amplitude and stress-accent. Additionally, duration
and amplitude are clearly related to vowel height, which is also related to stress-accent.
Longer, louder vowels tend to be perceived as accented; low vowels tend to be longer and
louder than high vowels; therefore low vowels tend to be perceived as accented more
often than high vowels.
There are a couple of theories as to why low vowels are longer and louder than
high vowels (there is a good summary of some of these in Beckman, 1986). One
explanation for the durational differences is that physically it takes longer to transition
from a low vowel to a surrounding consonantal environment because the constrictions
made to produce consonants are high in the mouth, and the tongue is low in the mouth
when producing low vowels. The traditional explanation for the amplitude differences is
also based on physics. Basically, to produce a high vowel that is equal in loudness to a
low vowel, a different amount of effort is required of the speaker depending on the
constriction in the vocal tract, which has a different shape for low vowels than high
vowels (see Beckman, 1986, Lehiste and Peterson, 1959 and 1960).
For the purposes of this study, the physical reason for the different intrinsic
amplitudes and durations of vowels is not as important as the effect that these differences
have on stress-accent. A very short vowel, like [ix], probably will be perceived as
stressed at a shorter duration than a very long vowel, like [aw]. It has been shown in
several languages that intrinsic duration differences affect phonemic length contrasts (see
Beckman, 1986, chapter 5). This is consistent with the findings of the current study.
Table 2 shows the varying average length and amplitude of vowels, and clearly some
vowels are perceived as stressed at much shorter durations than others.
However, despite different length requirements for different vowels, it is also true
that vowels that are intrinsically longer and louder tend to be perceived as stressed much
more frequently than their shorter, quieter counterparts. Apparently, the actual duration
and amplitude of a vowel relative to other vowels affect perceived stress-accent, as do the
proportional duration and amplitude of a vowel to its intrinsic duration and amplitude.
Vowels with high intrinsic durations and amplitudes (i.e. the low vowels) are perceived
as stressed much more frequently than those with lower intrinsic durations and
amplitudes. However, even those vowels that are typically very long and loud are
sometimes unstressed, generally when the duration and amplitude of the vowel are much
lower than usual.
This finding is consistent with the theory of prosody proposed by Vanderslice and
Ladefoged (1972) that separates syllables into heavy versus light and accented versus
unaccented. According to this theory, light syllables are typically reduced, or in any case
unstressed, while heavy syllables tend to not be reduced, even when they are unaccented.
Light syllables can only contain “one of the vowels , I [], o (the monophthongal reduction of
ou) or u [] or a syllabic consonant. To these we would add [r] and, occasionally in
final open syllables, [i] as in northern US city, and [u] as in Hindu,” (p.823). This list
corresponds very closely with the high vowels, both monophthongs and diphthongs,
which the data from this paper clearly show are the vowels most likely to appear in
unaccented syllables. Vanderslice and Ladefoged’s paper suggests that all other vowels
(i.e. the low vowels) appear in heavy (i.e. likely to be accented and unreduced) syllables.
This paper makes a high versus low distinction between the two classes of vowels, but
the result is similar: some vowels are much more likely to be accented than others are.
The difference is that their theory is a theory on how to label prosodic features, while this
study shows concretely the patterns that exist in spontaneous discourse.
The extent to which the data collected for this study support various research
projects done on lab speech is encouraging. The finding that low vowels are intrinsically
longer and louder than high vowels has been shown by Lehiste and Peterson (1959 and
1960, Lehiste, 1996) in studies done on lab speech. Fry (1955) first showed with
synthesized speech that duration and amplitude are important cues of perceived
stress-accent in English. Now that technology has made it possible to conduct in-depth research
on spontaneous speech, many studies done previously on lab speech will be questioned;
the results of this study show that some patterns exhibited in lab speech are very similar
to spontaneous speech.
Acknowledgements
I would like to thank Steve Greenberg and John Ohala for their advice and
assistance, Joy Hollenback and Shawn Chang for their help with the research and
compiling data, Jeff Good for prosodically transcribing the data, and Candace Cardinal,
Rachel Coulston and Colleen Richey for phonetically transcribing the corpus.