Acoustic Properties of Vocalic Nuclei Associated with Prosodic Stress in Spontaneous American English Discourse

By Leah Hitchcock

Introduction and Background

As speech recognition and speech synthesis technology continue to improve, it becomes increasingly important in the field of linguistics to understand all the details of speech, such as the cues in the sound wave that signal accent in stress-accent languages like English. Until very recently, it was generally accepted that f0 variation, or pitch change, was the main factor involved in English stress. However, recent research has shown that f0 variation is not the only indicator of prosodic stress-accent, or even the primary cue (Silipo and Greenberg, 1999 and 2000; Beckman, 1986). Even as early as 1955 there was evidence that duration and amplitude are cues for perceived stress (Fry, 1955). Stress-accent languages like English differ from pitch-accent languages like Japanese in that they use other features in addition to pitch change to denote phrasal accent (Beckman, 1986). While there is much evidence that f0 change is an important cue of stress, research on automatic stress labeling shows that labeling algorithms are most accurate when they use a combination of duration and amplitude rather than pitch change (Silipo and Greenberg, 1999 and 2000; Van Kuijk and Boves, 1999). This suggests that while stressed syllables do tend to have pitch variation, it may not be the most important cue of stress in English, and that the most important cue is actually duration, or a combination of duration and amplitude (Silipo and Greenberg, 1999 and 2000; Beckman, 1986; Van Kuijk and Boves, 1999). Further, it has been suggested that some of the pitch differences exhibited in stressed syllables are an artifact of duration: longer segments have more time for variation of features like f0 (Silipo and Greenberg, 1999).
This is not to say that f0 variation isn’t an important cue of stress-accent; it is. However, it is clear that vowel duration and amplitude also play a role in determining which vowels are perceived as stressed, and automatic speech recognition systems could benefit greatly from an understanding of exactly how different factors affect the realization of stress-accent in spontaneous speech.

Many studies have looked into the reduction of unstressed vowels (e.g., Lindblom, 1963; van Bergem, 1993; Koopmans-van Beinum, 1987; Fourakis, 1991; Engstrand, 1988). Typically, vowels that become reduced in speech are short and unstressed. Vowel reduction is not necessarily a trend toward centralization of the vowel (see Lindblom, 1963), as other factors, such as consonantal context, also seem to play a role in the shape of the formant trajectories of unstressed vowels. An increased knowledge of the acoustics of stressed and unstressed vowels could be very useful in the study of vowel reduction in spontaneous speech, which in turn should aid the fields of speech recognition and speech synthesis.

This paper explores the roles of duration and amplitude in determining whether specific vowels are perceived as stressed in spontaneous American English dialogues. Based on previous research, we would expect duration to play the clearest role in determining perceived stress-accent, and amplitude to play a secondary role for most vowels (Silipo and Greenberg, 1999 and 2000; Van Kuijk and Boves, 1999; Beckman, 1986; Fry, 1955). However, most previous research has been done on lab speech, so a major purpose of this study is to determine general patterns related to stress-accent in spontaneous speech, and to determine to what extent these patterns support the results of past research on lab speech.

The concept of stress is difficult to define.
Many different definitions exist (see Asher, 1994 and Crystal, 1992), so it is important to clarify what type of stress this paper deals with. Lexical stress, or word stress, is the type most commonly referred to. Lexical stress is canonical, and is what is marked in pronunciation guides in dictionaries. Phrasal stress refers to which syllables are accented within an utterance. “While this might often be related to the grammatical structure of the utterance, one cannot predict when the grammatical structure will be reflected phonetically. The speaker may decide to recognize a syntactic unit or to overlook it” (Asher, 1994:4357). Semantics, syntax, and other factors affect which syllables actually end up being accented in an utterance (see Vanderslice and Ladefoged, 1972).

This paper deals with phrasal stress-accent rather than lexical stress. In other words, what is studied here is not the dictionary’s marking of lexical stress but the stress patterns that are realized phonetically in dialogues. This encompasses many different types of stress; sometimes a syllable that is expected, according to canonical lexical stress, to be accented is in fact accented in an utterance, and sometimes a syllable is accented for other reasons, such as emphasis. Because there are no definitive rules that determine the conditions under which a person will stress a certain syllable, the data are based on perceived stress.

Methods and Corpus Data

The data for this paper are from the Switchboard corpus. The Switchboard corpus contains over 140 hours’ worth of short telephone conversations on various topics. For this paper, approximately 54 minutes’ worth of utterances were transcribed at the phonetic segment level by University of California, Berkeley, linguistics students. Level of stress-accent was also manually labeled for syllabic nuclei by an independent set of transcribers.
Of the 54 minutes of speech, 45.43 minutes were analyzable, the remainder being filled pauses, stutters, and other non-speech. The usable 45.43 minutes consist of 9,922 words, 13,446 syllables, and 33,370 phones, comprising 674 utterances. The average length of an utterance was 4.76 seconds, the average number of words per utterance was 18.5, and the average number of syllables was 23.25. (Only utterances between 2 and 17 seconds were used, and about 60% of these were between 4 and 8 seconds. The number of words ranged from 2 to 64, and the number of syllables from 5 to 81.)

The corpus includes 581 speakers, of whom 288 were female and 293 were male. Most speakers were represented by only a single utterance in the data set; a few had multiple utterances that were used. Speakers were asked for the region they lived in during their formative years in order to ascertain what dialect of American English they speak. The number of speakers for each dialect region is given in Table 1.

Dialect region    Number of speakers
New England       37
New York          41
North             88
North Mid         89
South             65
South Mid         147
West              72
Mixed             42

Table 1. Total number of speakers from each of the seven dialect regions, as well as mixed dialect. Mixed speakers are people who moved around a lot during their formative years.

Of the actual speech data, 769 syllables from the utterances were not used because they had syllabic consonants (such as “el”) for their nucleus rather than a vocalic nucleus. In addition, 631 filled pauses (“um,” “uh,” etc.) were excluded from analysis because of the drastically different patterns exhibited in filled pauses compared to all other words in the data set.

The data were phonetically transcribed by three individuals using a variant of Arpabet (which was originally used to label the TIMIT corpus) (Greenberg, 1997). The interlabeler agreement was approximately 74%. When transcribers disagreed on the identity of a vocalic segment, it was generally only a slight disagreement, i.e.
one level of frontness or height. It was rare for transcribers to disagree on whether a segment was a diphthong or a monophthong.

Two different individuals marked the corpus material for stress-accent. I was one of the transcribers, but I was hired to do the transcription work long before the idea for this paper came about, so the fact that half of the labeling was done by me should not create a bias in the data. The material was marked for three levels of accent: fully accented, completely unaccented, and an intermediate level. The intermediate level includes all syllables judged not to carry primary stress, but not to lack stress completely either. Fully unaccented vowels were not necessarily reduced, but most of the occurrences of [ix] and [ax], which are both types of schwa, fell into the fully unaccented category (see Table 2). Many of these were probably reduced forms of other vowels; since the corpus was transcribed phonetically rather than canonically, the proportion of reduced vowels is not known.

The nuclei were labeled according to perceived stress-accent, not according to knowledge of canonical (dictionary-based) lexical stress. The transcribers and a supervisor met weekly to ensure that the proper criteria were being used for labeling. Other research has relied on modifications of canonical lexical stress patterns when studying prosody (e.g., Van Kuijk and Boves, 1999; Beckman, 1986), but since spontaneous speech differs greatly from canonical speech, perceived stress-accent should be a more accurate representation of the way people speak. When speaking, people place more emphasis on the words or ideas that are most important to the message they are trying to express. Labeling perceived stress-accent therefore enables someone reading a transcription to see what the speaker intended to stress in his or her speech, something that lexical stress does not convey. All material used was labeled by both transcribers, and the stress-accent markings were averaged.
As with the phonetic transcription, the transcribers generally agreed. Interlabeler agreement was 85% for unstressed nuclei, 78% for fully stressed nuclei, and 95% for any level of accent (both transcribers attributed some amount of stress to the nucleus). When the transcribers disagreed, it was usually by only one step (i.e. fully stressed vs. intermediate, not fully stressed vs. fully unstressed). Generally when the transcribers disagreed, a third observer attested to the ambiguity of the level of accent of the syllable in question. Table 2 includes the averaged duration and amplitude data for each level of stress-accent, with 0 referring to completely unaccented nuclei, 0.5 to the intermediate level, and 1 to fully accented nuclei. Levels 0.25 and 0.75 are the result of averaging the two transcriptions.

The duration of the segments was computed from the hand-labeled material. About a third of the material was hand-segmented by transcribers, and the remainder was automatically segmented using 72 minutes of hand-segmented material to train the automatic labeler (Greenberg, Chang, and Hollenback, 2000). The amplitude of each segment’s pressure waveform, expressed in terms of log base e, was computed and normalized relative to the mean of the entire utterance (Greenberg, Chang, and Hollenback, 2000). Integrated energy has been shown in past research to be the most accurate means of determining the level of stress-accent for a vocalic segment (Silipo and Greenberg, 1999 and 2000), as it reflects both duration and amplitude. For the purposes of this paper, an approximation of integrated energy was calculated for each vocalic nucleus by multiplying the duration in milliseconds by the normalized amplitude.
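The two numerical steps just described, averaging the two transcribers' labels and approximating integrated energy, can be illustrated with a minimal Python sketch. It is not the procedure actually used for the study: the function names are hypothetical, and the normalized amplitude is taken as an already-computed input because the text does not give the exact normalization formula.

```python
# Stress labels as coded in the text:
# 0 = completely unaccented, 0.5 = intermediate, 1 = fully accented.
VALID_LABELS = (0.0, 0.5, 1.0)


def averaged_stress(label_a, label_b):
    """Average two transcribers' labels (hypothetical helper).

    The result is one of 0, 0.25, 0.5, 0.75, or 1.
    """
    for label in (label_a, label_b):
        if label not in VALID_LABELS:
            raise ValueError(f"unexpected stress label: {label}")
    return (label_a + label_b) / 2


def energy_proxy(duration_ms, normalized_amplitude):
    """Approximate integrated energy as duration times normalized amplitude."""
    return duration_ms * normalized_amplitude


def mean_energy_proxy(tokens):
    """Multiply per token first, then average the products.

    `tokens` is a non-empty list of (duration_ms, normalized_amplitude) pairs.
    """
    products = [energy_proxy(d, a) for d, a in tokens]
    return sum(products) / len(products)
```

Note that under this scheme an averaged label of 0.5 can arise either from two intermediate judgments or from one fully accented and one fully unaccented judgment; the sketch does not distinguish the two cases.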
Duration (ms), by stress level
Vowel   0      0.25   0.5    0.75   1      all
[iy]    78     98     114    122    132    100
[ey]    90     94     122    130    155    129
[ay]    108    113    126    143    174    141
[aw]    103    121    150    156    203    168
[oy]    -      -      -      -      -      154
[ow]    102    117    126    150    170    136
[uw]    70     101    104    153    152    103
[ih]    65     78     86     89     95     75
[ix]    49     53     51     -      -      50
[eh]    67     82     79     97     96     82
[ah]    77     89     96     102    115    93
[ax]    54     78     76     62     70     56
[uh]    61     74     71     70     78     67
[ae]    91     113    123    144    165    137
[aa]    86     94     110    116    134    114
[ao]    100    79     87     107    143    115

Amplitude (normalized log), by stress level
Vowel   0      0.25   0.5    0.75   1      all
[iy]    0.96   0.97   0.99   0.99   1.02   0.98
[ey]    0.99   1.01   1.03   1.03   1.05   1.03
[ay]    1      1.02   1.03   1.05   1.08   1.04
[aw]    1.04   1.02   1.05   1.05   1.06   1.05
[oy]    -      -      -      -      -      1.04
[ow]    0.98   1      1.02   1.04   1.07   1.03
[uw]    0.95   0.96   0.97   0.98   1.03   0.98
[ih]    0.96   1      1.01   1.02   1.06   0.99
[ix]    0.92   0.97   1.01   -      -      0.92
[eh]    0.97   1.02   1.03   1.05   1.08   1.02
[ah]    0.98   1.02   1.03   1.05   1.08   1.03
[ax]    0.94   1      1.03   1.04   1.09   0.95
[uh]    0.97   1.02   1.05   1.05   1.09   1.01
[ae]    0.98   1.02   1.03   1.04   1.07   1.04
[aa]    1      1.03   1.05   1.07   1.09   1.06
[ao]    1      1      1.03   1.04   1.08   1.05

Duration x Amplitude, by stress level
Vowel   0      0.25   0.5    0.75   1      all
[iy]    75     95     111    120    134    97
[ey]    90     94     126    132    162    132
[ay]    108    115    129    149    186    147
[aw]    105    122    157    162    213    175
[oy]    -      -      -      -      -      161
[ow]    100    116    129    155    182    140
[uw]    68     98     99     151    156    101
[ih]    62     78     86     91     101    74
[ix]    45     52     52     -      -      46
[eh]    66     83     81     101    104    85
[ah]    75     90     98     107    124    95
[ax]    51     77     77     65     75     53
[uh]    59     75     75     73     85     68
[ae]    88     113    126    148    175    142
[aa]    86     96     115    123    144    121
[ao]    102    80     91     112    154    122

% of total occurrences, by stress level
Vowel   0      0.25   0.5    0.75   1      total
[iy]    44.8   14.3   13.4   9.3    18.2   1270
[ey]    9.1    17.3   18.1   16.4   39     525
[ay]    16.6   12.8   19.7   14.7   36.2   790
[aw]    8      9.6    15.5   23     43.9   187
[oy]    -      -      -      -      -      24
[ow]    22.6   15     17.6   13.8   31     646
[uw]    49.4   7.3    10.9   8.6    23.8   478
[ih]    56.7   13     9.9    7.4    12.9   2126
[ix]    89.1   7.4    2.3    -      -      433
[eh]    37     10.8   11.7   12     28.6   1217
[ah]    35.6   14.4   15.6   12     22.5   1060
[ax]    89.3   6.7    2.4    0.8    0.8    1729
[uh]    54     11.3   11.3   8.8    14.6   328
[ae]    16.3   11.2   15.8   15.3   41.4   823
[aa]    17     12.5   14.5   14.8   41.3   690
[ao]    13.4   6.8    17.7   21.1   41     351

[oy] has values at only three stress levels, and the levels to which they belong are not recoverable here. In ascending order of level they are: duration 98, 111, 168 ms; amplitude 0.97, 1.04, 1.06; duration x amplitude 94, 114, 177; share of occurrences 16.7%, 4.2%, 79.2%.

Table 2. The relationship of stress-accent to several acoustic properties of vocalic nuclei. The vowels are grouped into the categories of diphthongs, lax monophthongs, and tense monophthongs, because the vowels within each of these groups tend to behave similarly. The first group of data is the average duration of the vowels, in milliseconds, for each stress level, together with the intrinsic duration of the vowel (column “all”). The second group of data is the normalized amplitude for each stress level as well as the intrinsic amplitude. The amplitude for each vocalic segment was normalized (log base e) with respect to the entire utterance within which it occurred. The third group is the simple product of duration times amplitude. The duration and amplitude of each vowel token were multiplied together, and then the average of the products was calculated. The final group of data is the percentage of the time each vowel appears at each level of stress-accent, along with a column giving the total number of occurrences of each vowel. Blank cells (shown as “-”) indicate fewer than four occurrences of the vowel at that stress level.

Results

The data reveal several patterns about stress-accent in English. First, duration and amplitude clearly play a large role in determining whether a vowel is perceived as stressed. Second, high vowels, which are intrinsically shorter and quieter than low vowels (see also Lehiste and Peterson, 1959 and 1960; Beckman, 1986; Black, 1949), also tend not to be perceived as stressed nearly as often as their lower counterparts (Hitchcock and Greenberg, 2001). Third, the shortest vowels, i.e. the lax monophthongs, tend to exhibit greater amplitude differences than the intrinsically longer tense monophthongs and diphthongs. These tendencies are very consistent, with a few minor exceptions that appear to be the result of an insufficient amount of data (see Table 2).

Figure 1. (Reprinted from Hitchcock and Greenberg, 2001.) The graphs show the relationship between the position of the tongue and factors associated with prosodic stress-accent. The Y-axis shows the factor (duration, amplitude, or duration x amplitude) being measured for each graph.
The Y-axes were inverted in order to show the relationship of vowel height to the factor in question. The graphs in the first column show normalized amplitude, log base e, which was calculated with respect to the average amplitude of the entire utterance in which the vowel occurs. The graphs in the second column show the duration of the vowels in milliseconds. The graphs in the third column show the product of duration times amplitude. The X-axis of all of the graphs shows the vowels, either diphthongs or monophthongs, arranged approximately by the horizontal tongue position of the vowel. The resulting shape of the graphs is strikingly similar to that of a vowel space chart, suggesting that there is a very close relationship between place of articulation and factors such as vowel duration and amplitude.

Figure 2. (Reprinted from Hitchcock and Greenberg, 2001.) The proportion of fully accented and fully unaccented monophthongs and diphthongs. Along the X-axis are the vowels, arranged approximately by horizontal position of the tongue. The Y-axis shows the proportion in terms of percentage of the total number of occurrences of the vowel. The Y-axis is inverted for the graphs showing proportion of fully stressed nuclei to better show the relationship of vowel height to stress-accent.

There is a clear relationship between vowel height and stress-accent. High vowels tend to be accented much less often than low vowels. This is true of both the diphthongs and the monophthongs. This tendency is consistent with the hypothesis that vowel duration is one of the main cues of stress-accent in English, because the low vowels, which tend to be longer, also tend to be accented more often. Vowel height, duration and stress-accent appear to be closely tied together. It is clear from looking at the figures and tables that there is a correlation between these three factors.
To a lesser degree, vowel amplitude is also associated with vowel height and stress-accent, but the correlation between duration and stress-accent is clearer and more consistent. What is not entirely clear about the relationship between stress-accent and vowel height, duration and amplitude is the direction of the effect: whether low vowels are perceived as stressed more often because they tend to be intrinsically longer and louder than high vowels, or vice versa. We know that long, loud vowels tend to be perceived as stressed, and that low vowels are longer and louder than high vowels, and more often stressed than high vowels, but which tendency is responsible for the other is not clear.

Problems

The patterns exhibited in the data are strikingly clear. However, they deal with averages: average duration and average amplitude for a given category. Therefore, standard deviations were calculated for all of the duration and amplitude data (see Tables 3 and 4).

Standard Deviations of Vowel Durations (dur in ms, sd in s)
        0             0.25          0.5           0.75          1             all
Vowel   dur   sd      dur   sd      dur   sd      dur   sd      dur   sd      dur   sd
[iy]    78    0.035   98    0.05    114   0.057   122   0.068   132   0.066   100   0.055
[ey]    90    0.049   94    0.038   122   0.054   130   0.057   155   0.074   129   0.066
[ay]    108   0.059   113   0.048   126   0.049   143   0.066   174   0.075   141   0.069
[aw]    103   0.077   121   0.043   150   0.051   156   0.057   203   0.074   168   0.073
[oy]    -     -       -     -       -     -       -     -       -     -       154   0.072
[ow]    102   0.043   117   0.052   126   0.053   150   0.076   170   0.084   136   0.071
[uw]    70    0.046   101   0.063   104   0.052   153   0.089   152   0.099   103   0.077
[ih]    65    0.034   78    0.036   86    0.041   89    0.039   95    0.055   75    0.041
[ix]    49    0.023   53    0.019   51    0.016   -     -       -     -       50    0.023
[eh]    67    0.038   82    0.043   79    0.035   97    0.047   96    0.049   82    0.044
[ah]    77    0.049   89    0.043   96    0.064   102   0.063   115   0.068   93    0.059
[ax]    54    0.026   78    0.055   76    0.056   62    0.027   70    0.04    56    0.031
[uh]    61    0.036   74    0.053   71    0.036   70    0.05    78    0.049   67    0.042
[ae]    91    0.055   113   0.062   123   0.059   144   0.073   165   0.071   137   0.072
[aa]    86    0.043   94    0.038   110   0.047   116   0.044   134   0.059   114   0.054
[ao]    100   0.058   79    0.047   87    0.05    107   0.057   143   0.072   115   0.066

[oy] has entries at only three stress levels, and the levels to which they belong are not recoverable here. In ascending order of level they are 98 ms (sd 0.013), 111 ms (no sd) and 168 ms (sd 0.075); one further sd value, 0.025, cannot be placed. For [ix], an sd of 0.028 appears without a mean at one of the two highest stress levels.

Table 3. Standard deviations of vowel durations. Each stress-accent level has two columns, the first being the average duration of the vowel in milliseconds, and the second being the standard deviation (in seconds) of the mean duration. The vowels are grouped into the categories of diphthongs, lax monophthongs, and tense monophthongs.

Standard Deviations of Vowel Amplitudes
        0             0.25          0.5           0.75          1             all
Vowel   amp   sd      amp   sd      amp   sd      amp   sd      amp   sd      amp   sd
[iy]    0.96  0.08    0.97  0.064   0.99  0.074   0.99  0.066   1.02  0.064   0.98  0.077
[ey]    0.99  0.081   1.01  0.059   1.03  0.056   1.03  0.064   1.05  0.061   1.03  0.068
[ay]    1     0.085   1.02  0.063   1.03  0.057   1.05  0.059   1.08  0.057   1.04  0.07
[aw]    1.04  0.061   1.02  0.084   1.05  0.058   1.05  0.057   1.06  0.058   1.05  0.061
[oy]    -     -       -     -       -     -       -     -       -     -       1.04  0.058
[ow]    0.98  0.085   1     0.074   1.02  0.057   1.04  0.067   1.07  0.056   1.03  0.076
[uw]    0.95  0.08    0.96  0.055   0.97  0.064   0.98  0.084   1.03  0.064   0.98  0.08
[ih]    0.96  0.083   1     0.067   1.01  0.073   1.02  0.073   1.06  0.069   0.99  0.086
[ix]    0.92  0.104   0.97  0.089   1.01  0.095   -     -       -     -       0.92  0.105
[eh]    0.97  0.097   1.02  0.058   1.03  0.083   1.05  0.063   1.08  0.059   1.02  0.091
[ah]    0.98  0.079   1.02  0.068   1.03  0.072   1.05  0.058   1.08  0.055   1.03  0.079
[ax]    0.94  0.097   1     0.075   1.03  0.069   1.04  0.048   1.09  0.055   0.95  0.097
[uh]    0.97  0.079   1.02  0.076   1.05  0.062   1.05  0.061   1.09  0.075   1.01  0.086
[ae]    0.98  0.083   1.02  0.076   1.03  0.065   1.04  0.065   1.07  0.061   1.04  0.076
[aa]    1     0.086   1.03  0.071   1.05  0.061   1.07  0.057   1.09  0.062   1.06  0.074
[ao]    1     0.068   1     0.071   1.03  0.067   1.04  0.071   1.08  0.058   1.05  0.071

[oy] has entries at only three stress levels, and the levels to which they belong are not recoverable here. In ascending order of level they are 0.97 (sd 0.054), 1.04 (no sd) and 1.06 (sd 0.049); one further sd value, 0.035, cannot be placed. For [ix], an sd of 0.035 appears without a mean at one of the two highest stress levels.

Table 4. Standard deviations of vowel amplitudes. Each stress-accent level has two columns, the first being the normalized average amplitude and the second being the standard deviation of the mean amplitude. The vowels are grouped into the categories of diphthongs, lax monophthongs, and tense monophthongs.
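As an illustration of how per-category figures like those in Tables 3 and 4 can be derived from token-level measurements, the following Python sketch groups tokens by vowel and averaged stress level and reports the mean, standard deviation and count for each group. It is only a sketch under stated assumptions: the function name and the input layout are hypothetical, and the sample standard deviation is an assumption, since the text does not say whether the sample or the population formula was used. The sd is converted to seconds to mirror the units of Table 3.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_durations(tokens):
    """Group vowel tokens and summarize their durations.

    `tokens` is an iterable of (vowel, stress_level, duration_ms) tuples.
    Returns {(vowel, stress_level): (mean_ms, sd_seconds_or_None, count)}.
    """
    groups = defaultdict(list)
    for vowel, level, duration_ms in tokens:
        groups[(vowel, level)].append(duration_ms)

    summary = {}
    for key, durations in groups.items():
        # A sample standard deviation needs at least two tokens (assumption:
        # sample rather than population formula).
        sd_s = stdev(durations) / 1000.0 if len(durations) >= 2 else None
        summary[key] = (mean(durations), sd_s, len(durations))
    return summary
```

The count is returned alongside each mean so that sparsely populated cells, such as those left blank in the tables for fewer than four occurrences, can be filtered out before reporting.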
These standard deviations show that there is a lot of variation in the durations and amplitudes of the vowels; however, the standard deviations are fairly consistent across categories, and vowels with higher intrinsic durations and amplitudes tended to have higher standard deviations. There are many possible reasons for the amount of variance in the data. An obvious contribution to the variance is the fact that the speakers in the corpus were of both genders and were from all over the United States, and therefore their speech can be expected to exhibit gender differences as well as dialect differences. Additionally, there may have been a slight dialect bias among the phonetic segment transcribers for segments such as [ao], which differ according to dialect regions of the US.

Analysis of the data for the different genders shows only very slight differences in the vowels of female versus male speakers. The vowels of female speakers were on average 9 milliseconds longer than the vowels of male speakers, with intrinsically short vowels exhibiting a smaller gender difference and intrinsically long vowels exhibiting a greater gender difference. Male speakers tended to have a slightly larger dynamic range of amplitude between accented and unaccented occurrences of each vowel than female speakers. These two tendencies suggest that female speakers may speak at a slightly slower rate than male speakers, and that male speakers might utilize amplitude differences a little more to convey accent. However, since these differences are very slight, more research would need to be done to substantiate these findings. Overall, the patterns of female and male speakers did not differ significantly from the patterns for all speakers (see Figures 1 and 3-6).

Figure 3. Duration in seconds, male vs. female speakers, diphthongs. The Y-axis is the average duration, in seconds. The vowels are arranged along the X-axis approximately according to horizontal tongue position. The Y-axis was inverted to show the relationship between vowel height and duration.

Figure 4. Duration in seconds, male vs. female speakers, monophthongs. The Y-axis is the average duration, in seconds. The vowels are arranged along the X-axis approximately according to horizontal tongue position. The Y-axis was inverted to show the relationship between vowel height and duration.

Figure 5. Amplitude range for male and female speakers for the diphthongs. Range is defined as the normalized average energy of the stressed occurrences of the vowel minus the normalized average energy of the unstressed occurrences. The Y-axis shows the range, and the vowels are arranged approximately according to horizontal tongue position along the X-axis.

Figure 6. Amplitude range for male and female speakers for the monophthongs. Amplitude range is defined as the normalized average amplitude of the stressed occurrences of the vowel minus the normalized average amplitude of the unstressed occurrences. The Y-axis shows the range, and the vowels are arranged approximately according to horizontal tongue position along the X-axis.

Unfortunately, there was an insufficient amount of data from each dialect region to obtain information about the general patterns associated with each region. The expectation, had there been more data, is that speakers from certain regions would tend, on average, to have longer vowels than speakers from other dialect regions, and that some regions would exhibit a more pronounced amplitude difference between accented and unaccented vowels than other regions.
Theoretically, this would account for some of the variance. Most likely, however, the variance among the duration data is due to different speaking rates among different speakers, and the variance in amplitude data, similarly, is due to the fact that individual speakers vary the volume of their voices differently.

What is important in this research is not that there is variance among the data; in fact, with the high number of factors, such as different speakers, genders, and dialects, that contribute to variance in spontaneous speech, a relatively large amount of variance in the data is expected. What is interesting about the data is that despite the variance, the patterns came out so clearly. The study of spontaneous speech is a fairly new area of linguistics, made possible by extensive improvements to computer technology in recent years, and the fact that such clear patterns exist is very encouraging to the field of speech technology, particularly for speech recognition. Understanding the patterns that exist in actual speech should make possible much more accurate recognition systems than those that currently exist.

Future research projects could attempt to correct for factors like gender, dialect and speaking rate in order to understand even better how stress-accent is exhibited in spontaneous speech, but it is doubtful that the results of such research would contradict the general patterns shown in this paper. However, an understanding of the factors that contribute variety to spontaneous speech should have a profound effect on the field of speech recognition, and therefore deserves attention. This paper is a first look at general patterns in spontaneous speech; hopefully these patterns will be studied much more closely in the future.

Conclusions

The data in this paper clearly show a relationship between stress-accent and vowel height, duration and amplitude in American English.
This supports the results of much recent research on stress-accent (Beckman, 1986; Van Kuijk and Boves, 1999; Silipo and Greenberg, 1999 and 2000), which has shown this relationship in both lab speech and, to a certain extent, spontaneous speech. The data on vowel duration and amplitude support Lehiste and Peterson’s research (1959 and 1960) on intrinsic duration and amplitude of vowels. While their intrinsic durations, which were based on lab speech, are much longer than what was shown in spontaneous speech, the overall patterns are the same. Low vowels and diphthongs tend to be louder and longer than high vowels and monophthongs.

The prosodic data clearly show a close relationship between vowel duration and stress-accent, as well as between amplitude and stress-accent. Additionally, duration and amplitude are clearly related to vowel height, which is also related to stress-accent. Longer, louder vowels tend to be perceived as accented; low vowels tend to be longer and louder than high vowels; therefore low vowels tend to be perceived as accented more often than high vowels.

There are a couple of theories as to why low vowels are longer and louder than high vowels (there is a good summary of some of these in Beckman, 1986). One explanation for the durational differences is that physically it takes longer to transition from a low vowel to a surrounding consonantal environment, because the constrictions made to produce consonants are high in the mouth, and the tongue is low in the mouth when producing low vowels. The traditional explanation for the amplitude differences is also based on physics. Basically, to produce a high vowel that is equal in loudness to a low vowel, a different amount of effort is required of the speaker depending on the constriction in the vocal tract, which has a different shape for low vowels than for high vowels (see Beckman, 1986; Lehiste and Peterson, 1959 and 1960).
For the purposes of this study, the physical reason for the different intrinsic amplitudes and durations of vowels is not as important as the effect that these differences have on stress-accent. A very short vowel, like [ix], probably will be perceived as stressed at a shorter duration than a very long vowel, like [aw]. It has been shown in several languages that intrinsic duration differences affect phonemic length contrasts (see Beckman, 1986, chapter 5). This is consistent with the findings of the current study. Table 2 shows the varying average length and amplitude of vowels, and clearly some vowels are perceived as stressed at much shorter durations than others. However, despite different length requirements for different vowels, it is also true that vowels that are intrinsically longer and louder tend to be perceived as stressed much more frequently than their shorter, quieter counterparts.

Apparently, the actual duration and amplitude of a vowel relative to other vowels affect perceived stress-accent, as do its duration and amplitude relative to its own intrinsic duration and amplitude. Vowels with high intrinsic durations and amplitudes (i.e. the low vowels) are perceived as stressed much more frequently than those with lower intrinsic durations and amplitudes. However, even those vowels that are typically very long and loud are sometimes unstressed, generally when the duration and amplitude of the vowel are much lower than usual.

This finding is consistent with the theory of prosody proposed by Vanderslice and Ladefoged (1972), which separates syllables into heavy versus light and accented versus unaccented. According to this theory, light syllables are typically reduced, or in any case unstressed, while heavy syllables tend not to be reduced, even when they are unaccented. Light syllables can contain only “one of the vowels , I [], o (the monophthongal reduction of ou) or u [] or a syllabic consonant.
To these we would add [r] and, occasionally in final open syllables, [i] as in northern US city, and [u] as in Hindu” (p. 823). This list corresponds very closely with the high vowels, both monophthongs and diphthongs, which the data from this paper clearly show are the vowels most likely to appear in unaccented syllables. Vanderslice and Ladefoged’s paper suggests that all other vowels (i.e. the low vowels) appear in heavy syllables (i.e. syllables likely to be accented and unreduced). This paper makes a high versus low distinction between the two classes of vowels, but the result is similar: some vowels are much more likely to be accented than others are. The difference is that their theory is a theory of how to label prosodic features, while this study shows concretely the patterns that exist in spontaneous discourse.

The extent to which the data collected for this study support various research projects done on lab speech is encouraging. The finding that low vowels are intrinsically longer and louder than high vowels has been shown by Lehiste and Peterson (1959 and 1960; Lehiste, 1996) in studies done on lab speech. Fry (1955) first showed with synthesized speech that duration and amplitude are important cues of perceived stress-accent in English. Now that technology has made it possible to conduct in-depth research on spontaneous speech, many studies done previously on lab speech will be questioned; the results of this study show that some patterns exhibited in lab speech are very similar to those of spontaneous speech.

Acknowledgements

I would like to thank Steve Greenberg and John Ohala for their advice and assistance, Joy Hollenback and Shawn Chang for their help with the research and compiling data, Jeff Good for prosodically transcribing the data, and Candace Cardinal, Rachel Coulston and Colleen Richey for phonetically transcribing the corpus.
