Chapter 1

1 Introduction

1.1 Background

This dissertation investigates the realisation of the intonational high target in falling nuclear pitch accents in English. This work arises from the observation that the high target is often realised not as a single, well-defined peak, but as a plateau in the contour. The existence of such plateaux raises the question of what a speaker’s real intonational target is. In situations where there is a well-defined turning point in the contour it is likely that this turning point is the speaker’s target, but when the target is realised as a plateau it is more difficult to know which parts of the plateau are linguistically relevant, and which parts are merely the results of interpolations between targets and smoothing in the production process.

One question that initially arises is why we should expect there to be an intonational target in the contour. Phonetically, any given pitch change can, of course, be expressed by reference to its trajectory, for example rising or falling, or can be seen as two different pitch targets, such as high and low, joined by an interpolation. These two alternative representations for a falling intonation contour are shown diagrammatically in Figure 1.1.

Figure 1.1 A falling intonation contour expressed firstly as an indivisible whole (Fall) and secondly as a sequence of two intonational targets (High, Low)

From a phonological point of view, however, these two alternatives are not equivalent, as it is important to know what the meaningful features of intonation in any particular language are. Establishing the correct representation means that rules can be more economical and may capture generalisations that might otherwise be missed.

These two different ways of representing the same contour have influenced the phonological description of intonation.
Ladd (1996: 11) states that, like any other phonological description, a complete phonological description of intonation must minimally include a level at which sounds are represented as a small number of categorically distinct elements and another level at which these elements are mapped onto the physical continuously varying parameters of the sound, in this case the intonation contour. Different methods for describing intonation use different types of categorically distinct elements, influenced by the decision to see the contour in terms of either movements or pitch targets. This debate is a long-standing one, often referred to, following Bolinger (1951), as the ‘levels versus configurations’ debate. Descriptions in the British tradition, for example, focus on the movements in the intonation contour.

1.1.1 Contours as the units of analysis

The British tradition of intonational description, begun by Palmer (1922) and exemplified in the work of O’Connor and Arnold (1973) and Halliday (1967), emphasises the importance of movement in the intonation contour. In this tradition, the tone group or tone unit, the equivalent of the intonational phrase, is separated into several units. The nucleus is the syllable of maximum prominence. The syllables before the first accent make up the pre-head. The syllables from the first accent to the nucleus are termed the head and the syllables following the nucleus are the tail. For example, as shown in Figure 1.2, in a neutral reading of the sentence “I don’t remember the number”, the nucleus falls on the syllable ‘num’, the pre-head is the word ‘I’, the head consists of the words ‘don’t remember the’ and the tail is the final syllable of the word ‘number’.

    I           don’t remember the    num        ber
    pre-head    head                  nucleus    tail

Figure 1.2 The units of analysis used in the British tradition

The intonation in each of these units is described with reference to the movements in pitch that are associated with them.
For example, the intonation within the nucleus is described by stating the height at which the movement starts and the movement or movements after the nucleus. This description gives rise to nuclear tones such as high-rise or rise-fall.

1.1.2 Evidence for tonal targets

Despite the descriptions of intonation within the British tradition, which emphasise the importance of movement in the contour, there is abundant evidence that in both production and perception movements in pitch are rather unimportant. Instead, as discussed in more detail in the following sections, the crucial factor is that a particular height of contour is reached at a particular time in relation to the text. So, speakers aim to produce particular pitch targets, and the links between these targets are linguistically irrelevant interpolations.

It is, of course, not the case that pitch movements are linguistically unimportant in all languages. In fact, languages vary in terms of whether movements or pitch targets are prime. The earliest discussion of this variation related to the representation of contour tones (such as rises and falls) in tone languages. As early as Pike (1948) it had been suggested that different languages behave differently according to whether their contour tones behave as unitary wholes or sequences of level tones. Many lines of evidence have been brought to bear on this issue and are summarised by Anderson (1978). Pike (1948) suggested that Asian tone languages represent a situation in which contour tones are unitary wholes, whilst African tone languages represent a case where contour tones are better described as sequences of level tones. This dichotomy now seems a little too clear-cut, as it has recently been demonstrated not only that these geographical boundaries are largely irrelevant but that different dialects of the same language may in fact behave differently in this respect.
Chen (2000, cited in Zhiming 2003: 150) shows, for example, that in Mandarin and Min dialects of Chinese, contour tones are single units, whereas in Wu dialects contour tones are best treated as strings of level tones. What is clear is that for any particular language or variety the evidence must be assessed separately. The initial evidence for the importance of pitch targets in non-tone languages began with Bruce’s (1977) study of the Swedish word accents.

1.1.2.1 The evidence from Swedish word-accents

Bruce (1977) produced a study of the Swedish word accents. In Swedish, the stressed syllable of a word carries one of two word accents. These accents are termed acute (Accent 1) or grave (Accent 2). For example, the word anden with Accent 1 means ‘the duck’, whilst anden with Accent 2 means ‘the spirit’. The word accents are distinguished by means of two distinctive pitch accents.

Bruce (1977) studies the Stockholm variety of Swedish, in which it appears from earlier studies that the word accent distinction is based on the number of peaks in the accent. Thus Accent 1 words have a single peak whilst Accent 2 words have two peaks. In a production study of one primary and two secondary informants, Bruce attempts to tease apart the relative contributions of different levels of intonation. When word accents are isolated so that they are not affected by sentence accents or terminal junctures, both Accent 1 and Accent 2 are actually realised as a single peak in the contour, rather than the double peak for Accent 2 words that had previously been proposed. The actual difference between the two single-peaked accents was found to be primarily one of timing. As shown in Figure 1.3, the peak for Accent 1 words occurs in the prevocalic consonant, and the peak for Accent 2 words occurs in the stressed vowel itself (Bruce 1977: 49). There is also a difference in the gradient of the fall from the peak, the gradient being steeper in Accent 1 than in Accent 2 words.
Figure 1.3 Difference in timing of two Stockholm Swedish word accents (adapted from Bruce (1977: 64))

In order to assess which aspects of the contour are perceptually relevant for the word accent distinction, Bruce (1977: ch. 7) carries out a perceptual experiment using synthetic speech, where the start, end and gradient of the fall are varied. In the test stimuli the phrase ‘inga malmer’ was used, where ‘inga’ was in focus and ‘malmer’ could be one of two words depending on the word accent. With Accent 1 the phrase would be a woman’s name, whereas with Accent 2 it would be translated as ‘no ores’ (as the first name ‘Inga’ and the plural of the negative, indefinite pronoun ‘inga’ are homophonous). The timing of the fall varied in eight steps of 10 ms, whilst at each timing position the fall then occurred over 40 ms, 60 ms or 80 ms to create the different gradients. Subjects were asked whether they perceived each stimulus as a woman’s name or as the phrase ‘no ores’.

When the mid-point of the fall is taken as the measure of timing, it is clear that the gradient of the fall has no effect on the identification of the word accents. For each gradient, when the midpoint is 15 ms or less into the vowel, the word is identified as Accent 1. When the midpoint occurs at 35 ms into the vowel, however, a shift of identification has occurred and the word is identified as Accent 2.

Bruce (1977: ch. 8) uses these findings in deciding how to formulate the pitch rules for his model of Swedish intonation. He states that he will describe pitch in terms of levels rather than in terms of movements as “reaching a certain pitch level at a particular point in time is the important thing, not the movement (rise or fall) itself” (p.132). In this way pitch movements become mere transitions (often over voiceless sections of the utterance) between pitch levels, and pitch change has to be inferred by comparing the pitch levels in different vowels.
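The logic of Bruce's perceptual result can be illustrated with a small sketch. The function names, the convention of measuring times in milliseconds from vowel onset, and the label "unreported" for midpoints between the two values quoted above are assumptions made here for illustration; they are not part of Bruce's (1977) model. Treating every midpoint of 35 ms or later as Accent 2 is likewise an assumption, since the text above reports only the shift observed at 35 ms.

```python
def fall_midpoint_ms(fall_onset_ms: float, fall_duration_ms: float) -> float:
    """Return the temporal midpoint of a pitch fall.

    Times are in milliseconds relative to vowel onset (an assumed
    convention); a negative onset means the fall starts before the vowel.
    """
    return fall_onset_ms + fall_duration_ms / 2.0


def classify_accent(midpoint_ms: float) -> str:
    """Classify a stimulus from the midpoint of its fall only.

    Thresholds follow the values quoted in the text: 15 ms or less into
    the vowel gives Accent 1, 35 ms gives Accent 2. The region in between
    is labelled 'unreported' (hypothetical label), and extending Accent 2
    to all later midpoints is an assumption of this sketch.
    """
    if midpoint_ms <= 15.0:
        return "Accent 1"
    if midpoint_ms >= 35.0:
        return "Accent 2"
    return "unreported"


# The gradient (fall duration of 40, 60 or 80 ms) does not enter the
# decision: falls with the same midpoint receive the same label.
for duration in (40.0, 60.0, 80.0):
    onset = 10.0 - duration / 2.0  # choose the onset so the midpoint is 10 ms
    print(duration, classify_accent(fall_midpoint_ms(onset, duration)))
```

The point of the sketch is that the classifier takes a single time point as input and never sees the slope of the fall, which mirrors the claim that reaching a pitch level at a particular time matters more than the movement itself.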
There is also abundant evidence that pitch targets or levels are the important part of the contour in English intonation as well. Ladd (1996) discusses three main ways in which an analysis using pitch targets is to be preferred to one using contour tones. These are tone timing, tone scaling and the failure of predictions based on pitch excursions, as discussed in the following subsections.

1.1.2.2 The evidence from English

1.1.2.2.1 Timing

Ashby (1978) notes that the timing of accentual peaks is very constant. In his study he compares several tokens of three different speakers’ productions of high falls and low rises. He notes, in falls, for example, that although there are differences between speakers, each speaker’s timing of the peak is largely constant. For example, one speaker consistently times the peak just after the onset of the vowel (p.332), while another speaker aligns peaks later (p.334) but just as consistently. These findings suggest that the timing of peaks is something that the speaker tightly controls.

1.1.2.2.2 Scaling

Liberman and Pierrehumbert (1984) demonstrate that the scaling of pitch targets is also very constant. In one experiment, the phrase ‘Anna came with Manny’ is read with either ‘Anna’ in focus (in answer to the question ‘Who came with Manny?’) or with ‘Manny’ in focus (in answer to the question ‘Who did Anna come with?’). These phrases were read by four speakers in ten different pitch ranges. The first important finding is that in falling accents, low tones (the lowest point of the fall) show little variation. They remain at a constant fundamental frequency (F0) regardless of pitch range and are “nearly constant for a given speaker” (p.181). High tones (the peak of the accent) are found to vary with pitch range, getting higher with increasing range. Nevertheless, another important result suggests that the relative heights of peaks are also being precisely scaled.
Looking at the ratio between the first and second peak (pp. 173-176), the results show that there is a constant relation between the two peaks regardless of pitch range, suggesting that their height is being carefully controlled.

1.1.2.2.3 Failure of predictions based on excursions

Finally, the evidence also suggests that when the difference between an approach based on targets and one based on contours can be teased apart, the results are better explained by representations in terms of targets or levels. Liberman and Pierrehumbert (1984: 211) compare the stability of the peak height ratios (discussed in the previous section) to ratios derived from the size of the pitch excursion. They find that the results using rises and falls are not correlated well and conclude that “it appears that rises and falls are not being as carefully controlled as relative peak levels are” (p. 211).

It is clear for English, then, that the important parts of the intonation contour are pitch levels or targets. This finding suggests that descriptions of English intonation should use pitch levels or targets rather than pitch movements as the phonologically distinct elements.

1.1.3 Early descriptions using levels as the units of analysis

The American tradition of intonation description sees the pitch contour as a string of level tones that are joined together. As early as work by Pike (1945), systems were being suggested in which the phonological units were level tones rather than the movements within the contour. In these early American analyses, each utterance is described in terms of pitch levels and terminal junctures. There are four pitch levels (with 1 being either the highest or the lowest depending on the system). One of these four levels is specified at the beginning of each utterance and then whenever a change to a different pitch level takes place. One of three terminal junctures (level, falling or rising) is marked on the last movement within the utterance.
These terminal junctures, then, are different in kind to the pitch levels used to represent the pitch in other parts of the utterance, possibly reflecting an acknowledgement of the special status of the final movement within the phrase.

The initial criticisms of American levels approaches are expressed by Bolinger (1951) and are summarised in Ladd (1983) and Ladd (1996: 60). The first criticism focuses on the fact that the number of levels into which the pitch range is divided is arbitrary. There is no theoretical reason why four levels should be used in preference to any other number. In addition, it is unclear how the pitch range should be divided up into these levels. It is impossible to suggest that pitch levels should change at certain Hertz values, as pitch range varies considerably both within and between speakers. Finally, there was no sense of an association between the tune and the text, meaning that the representation did not make clear where exactly tones were located in relation to the utterance.

1.1.4 Autosegmental-Metrical Approaches

Autosegmental-metrical (AM) descriptions of intonation solve many of the problems inherent in traditional levels analyses, and Ladd (1984, 1996) suggests that AM approaches effectively resolve the ‘levels versus configurations’ debate. The success of AM approaches depends on two features of the theory. Firstly, the pitch accent is considered to be the prime unit of analysis but can be further divided into level tones. Secondly, only two level tones, H (high) and L (low), are used so that linguistically unimportant variations in pitch range are not represented phonologically.

Pierrehumbert’s (1980) thesis represents the beginning of the popular acceptance of AM approaches to the phonological description of English intonation. In Pierrehumbert’s work, the basic categorical element is the pitch accent, a local pitch event associated with the prominent syllables in the utterance (Ladd 1996: 46).
Pitch accents are further analysed into sequences of H and L tones and may consist of either a single tone (monotonal) or a combination of two tones (bitonal). Also crucial is how pitch accents align with stressed syllables. In bitonal accents, one tone is aligned with the stressed syllable. This tone is called the ‘starred’ tone and is noted by a following asterisk (*). The other tone is said to be either ‘leading’ or ‘trailing’ and is noted by the addition of a raised hyphen (¯). The two tones are linked by a plus sign (+). For example, the notation L* + H¯ indicates that a pitch accent contains a low tone associated with the stressed syllable, which is then followed immediately by a high (trailing) tone.

In addition to pitch accents, intonation contours also contain phrase accents and boundary tones, both of which occur at the edges of prosodic domains. In the original Pierrehumbert (1980) work, phrase accents occur immediately after the pitch accent in the main stress in the phrase, and boundary tones are found at the end of every intonational phrase. In contrast to pitch accents, phrase and boundary tones may only be monotonal. In the notation of this system, phrase accents are followed by a hyphen (-) and boundary tones are followed by a percentage sign (%).

In addition to the phonological specification of tones, phonetic interpretation rules also play a large part in establishing the contour in AM approaches. In Pierrehumbert (1980: 50-4) one set of rules specifies the fundamental frequency (F0) of the tone, and a second set of rules interpolates between the tones and accounts for the F0 transitions between them. Firstly, tones are evaluated from left to right and each tone’s F0 will depend on its own phonological specification (i.e. H or L) and also the phonological and phonetic value of the previous tone. The rules for interpolation come into play only when a tone’s F0 value has been specified.
These rules are sensitive to that value, to the tone’s temporal location and also to the phonological specification of the tone. In general, interpolation is linear between two tones of different specification (i.e. HL or LH) or between two low tones (LL). A dipping transition occurs between two H (HH) tones.

Later developments of the Pierrehumbert (1980) approach have led to changes in the notation system used and also to the theoretical status of some tones. For example, Beckman and Pierrehumbert (1986) suggest that a phrase tone actually marks an intermediate phrase, and that an intonation phrase is marked by both a phrase and a boundary tone. In terms of notation, some systems, such as those used by Grice (1995) and Gussenhoven (1984), no longer use the plus sign or the hyphen when indicating bitonal pitch accents. Nevertheless, the basic premises of the original theory, such as the pitch accent and two level tones, are maintained in most recent developments of AM theory.

1.1.5 Autosegmental Phonology

Autosegmental-metrical (AM) approaches to intonation arose from the development of autosegmental-metrical approaches to phonology in general. Before the rise of AM phonology, the standard generative approach had been the most prominent theory of phonology, and phonemes were considered to be of prime importance. Chomsky and Halle’s (1968) Sound Pattern of English (SPE) and its precursors, such as Jakobson et al. (1952), had little to say about suprasegmentals such as intonation. As phonemes were considered to be bundles of binary-valued distinctive features (represented in feature matrices), suprasegmental features, which could spread over more than one phoneme, did not sit easily within this approach. The earliest of the objections to the SPE approach arose in respect to the treatment of prosodic features such as tone.
In his seminal work, Goldsmith (1976) brings together the deficiencies of the SPE approach in this respect, and suggests that an autosegmental approach can account for phenomena in the world’s languages that an SPE approach cannot account for. In essence, the autosegmental approach to phonology allows certain features to be removed from the feature bundle representing a phoneme and placed on other tiers of representation, where they are associated with phonemes in the segmental tier but retain autonomy. Goldsmith (1976) states a number of now widely accepted arguments in support of the autosegmental approach, which will be briefly outlined in the following sections.

1.1.5.1 Tonal stability

A serious problem for standard generative approaches had been that of tonal stability. It is often the case in the world’s languages that if a vowel is deleted by some phonological rule, or is made unable to carry a tone, then any tones associated with that vowel are not similarly deleted but are instead moved to another vowel. This fact is of prime importance in the autosegmental tradition, as it suggests that the representation of tone is autonomous from that of the segments, and hence a representation with a separate tier for tone is required. As an example, Goldsmith (1976: 43) discusses the case of Lomongo. In Lomongo, if two vowels are juxtaposed the first is elided, as shown in example 1.

(1) Bàlóngó băkáé → bàlóngãkáé (his book)¹

However, although the second /o/ vowel of bàlóngó has been deleted, its high tone has not. Instead, the /a/ vowel of băkáé has undergone a change of tone where the high tone from the deleted segment is combined with its own rising tone. The phenomenon of tonal stability is, therefore, a strong demonstration that the features of tone are not properties of the vowel as are other features such as roundness or tenseness.
If tone were a distinctive feature of the vowel, then it should be deleted along with all the other features when the vowel is deleted. Instead, if tone is considered to exist on a separate tier, there is no reason why it should be deleted with the vowel.

1.1.5.2 Melody Levels

Another classic line of evidence in support of an autosegmental representation of tone is that some tone languages show grammatically distinctive melody levels. These melody levels are sequences of tones that serve grammatical functions in the language, such as marking tense, for example. Goldsmith (1976) discusses Leben’s (1973) account of melody levels in Mende, a language of Sierra Leone. In Mende, there are five tones: low, high, rising, falling and rising-falling. There are also morphemes with one, two and three syllables. As Goldsmith (1976: 46) explains, if the distribution of tones over syllables were random, then we should be able to find every combination of tones in every morpheme type. This is not the case, however. For each morpheme length, only the same five patterns are found. Leben (1973: 65) proposes a tone-mapping rule where tones are mapped from left to right onto a morpheme. If there are more vowels than tones, the last tone of the pattern is copied onto the remaining vowels. If there are more tones than vowels, mapping proceeds in a one-to-one fashion and the final vowel receives all the remaining tones, as shown in example 2.

1 Symbols for tones in this dissertation will follow those typically used in phonological descriptions of tone languages. á is a high tone, à is a low tone, ā is a mid tone, ă is a rising tone and â is a falling tone. ã is a high tone combined with a rising tone.
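The left-to-right mapping rule attributed to Leben (1973) above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `map_tones`, the representation of a melody as a list of "H" and "L" strings, and the use of syllable counts in place of actual Mende words are all choices made here, not part of Leben's or Goldsmith's formalism.

```python
from typing import List


def map_tones(melody: List[str], n_syllables: int) -> List[List[str]]:
    """Map a tonal melody onto syllables from left to right.

    Returns one list of tones per syllable. If tones run out, the last
    tone is copied onto the remaining syllables; if syllables run out,
    the final syllable receives all the remaining tones (a contour).
    """
    if not melody or n_syllables < 1:
        raise ValueError("need at least one tone and one syllable")

    # One-to-one association for as far as both tiers allow.
    mapped = [[tone] for tone in melody[:n_syllables]]

    if len(melody) < n_syllables:
        # More vowels than tones: copy the last tone of the pattern.
        mapped += [[melody[-1]] for _ in range(n_syllables - len(melody))]
    elif len(melody) > n_syllables:
        # More tones than vowels: the final vowel takes the leftovers.
        mapped[-1].extend(melody[n_syllables:])
    return mapped


# The rising-falling melody LHL on one, two and three syllables.
for n in (1, 2, 3):
    print(n, map_tones(["L", "H", "L"], n))
```

On this sketch the same five underlying melodies yield different surface patterns only because the number of syllables differs, which is the generalisation the melody-level argument relies on.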
(2)
    Tone                    1 syllable    2 syllable    3 syllable
    High (H)                k             pp
    Low (L)                 kp            bl
    Falling (HL)            mb            kny
    Rising (LH)             mb            navo
    Rising-Falling (LHL)    mb            nyh           nkl

It is suggested, then, that there are in fact only five underlying melodies, and that these occur irrespective of the segmental make-up, a further argument for separate representations of tone and other features.

1.1.5.3 Floating tones

Another of Goldsmith’s (1976) arguments in favour of treating tone as an autosegment is the existence of floating tones. Just as a tone may be left behind when a vowel is deleted, it seems that tones may also exist independently of vowels to begin with. These are ‘floating tones’. Odden (1995: 447) demonstrates the existence of floating tones in languages such as Ewe. In Ewe, a language with three level tones spoken in Ghana, the final mid tone (M) of a word is usually lowered to a low tone, which spreads backwards to any other mid tones, as shown for the word eto meaning ‘buffalo’ in example 3. However, some words, such as eto meaning ‘mortar’, do not lower the final M. This is considered to be because eto (mortar) actually ends with a floating H tone in the underlying representation and therefore the mid tone will not lower, as it is not word final.

(3) Underlying Surface t t t (buffalo) t (mortar)

Thus, the existence of floating tones also demonstrates that tone must be an autosegment, as it can exist even when no vowel is present.

1.1.5.4 Contour tones

As we have seen, tone languages might have level tones (such as high or low) or contour tones (such as rising and falling), where the pitch changes in the course of a tone-bearing unit. Goldsmith (1976) discusses the case of Igbo. In Igbo the tone that occurs in the verb stem is determined by the form of the clause. In the “I Main” form of a clause the verb is low toned.
When the subject noun phrase would, in isolation, end in a low tone, the tones remain unchanged, as shown in example 4 taken from Goldsmith (1976: 33), where the tone on the second vowel of Eze is low.

(4) Ézè (the chief)
    Ézè cì àkhwá (the chief was carrying eggs)

If, however, the subject noun phrase ends in a high tone when spoken in isolation, when it is placed before the low toned verb the situation is different. Instead of retaining the high tone it would have in isolation, the tone changes to a falling tone, as shown in example 5.

(5) Ékwé (a name)
    Ékwê cì àkhwá (Ekwe was carrying eggs)

This situation is considered to demonstrate that in this language, contour tones are, in reality, sequences or combinations of level tones. It is this sequencing aspect of contour tones that had caused problems for SPE approaches in which tone was considered to be a part of the feature bundle. In a word such as Ékwê in example 5, the falling contour is realised on a single short vowel². In an SPE approach this would be doubly difficult to account for. Firstly, if contour tones are to be represented as sequences of level tones, the vowel would need to have two specifications for the features [high] and [low]. One set of values would represent the beginning of the falling contour, which would be [+high], [-low]. The other set of values would represent the end of the falling contour and would have the opposite values [-high], [+low]. These two sets of values were problematic as all other features in the matrix had only a single value for any one segment, and therefore the rules would have no way of stating how the double-valued feature should be interpreted.

2 The concentration here is on short vowels, as long vowels could conceivably be represented with two feature matrices, with each matrix specified differently for tone.
The second problem was that, even if two specifications are allowed for [high] and [low], there is no way to ensure that they would occur in the right order and create a falling rather than a rising tone. As the feature matrix is inherently an unordered bundle of distinctive features, ordering the features in the vertical dimension so that the specification for the high tone comes above that for the low tone has no effect on the temporal order in production. The issues arising from contour tones were therefore a severely complicating factor in the SPE approach, where the specification of tone was considered to be an integral part of the distinctive feature specification.

In an autosegmental representation where the tones and vowels are represented on different tiers, the change of tone at the end of the noun phrase in example 5 can be seen as anticipatory assimilation. As the one-to-one relationship between features and phonemes inherent in the SPE approach had been removed, the vowel can be linked to, or associated with, more than one tone at a time. In this formulation, then, the contour tone on the final vowel of Ekwe results from the final vowel being associated with two level tones, a high followed by a low, as shown in example 6.

(6)   E    kwe    ci    a    khwa
      |    |  \  /      |    |
      H    H    L       L    H

1.2 Tones, targets and turning points

So, AM approaches to intonation owe much to the development of autosegmental phonology in general, including not only the separation of intonation and text onto different tiers, but also the representation of intonation in terms of level tones or pitch targets. When dealing with AM descriptions of intonation it is important to distinguish between the notions of tone, pitch target and turning point. The main distinction is one based on different levels of analysis.
Tones are the phonologically specified elements of the contour such as H and L, which might either form part of a pitch accent and be associated with stressed syllables, or be associated with a prosodic boundary as a boundary or edge tone. At the phonetic level, tones are implemented as intonational targets, which speakers aim to produce at a particular F0 level at a particular time. As there is a distinction between the phonological tone and its phonetic implementation, a single tone may be realised in a range of different ways. Turning points in the contour are the most explicit realisation of a pitch target. In Bruce (1977), for example, tones are implemented phonetically as pitch targets, and are realised as the maximally high or low parts of the contour: turning points where the contour changes from rising to falling or vice versa. So, a maximum in the contour is a H pitch target, and a minimum is a L pitch target.

In Pierrehumbert (1980), however, as Ladd (1996: 103) describes, the close association between tones, targets and turning points, present in Bruce (1977), has been lost. This results in a rather unclear definition of tone and is considered by Ladd (1996) to be one of the unresolved problems with AM analyses of intonation. There are some cases in the theory, for example, where a turning point in the contour is not considered to be either a phonological tone or a phonetic pitch target. There are other cases where a tone in the phonological description of an intonation pattern is not realised as a turning point in the contour and is sometimes not even meant to represent a pitch target. Both these cases are discussed below.

1.2.1 Turning points that are not tones or targets

Firstly, then, it is sometimes the case that a turning point in the contour is not identified with a tone in the phonological description and is not considered to be a pitch target.
For example, although, in general, interpolations between targets are considered to be fitted to straight lines, this is not always the case. As mentioned in section 1.1.4, the transition between two H* accents is said to be ‘dipping’ (Pierrehumbert 1980: 70). This dipping transition is illustrated in Pierrehumbert’s figures 2.10 (1980: 281) and 2.11 (1980: 282), and is redrawn below in Figure 1.4. Pierrehumbert (1980: 70) acknowledges that the need for different interpolation rules for different tones is a problem, and states that she has tried to model this type of contour by positing a L tone between the two H*s. However, this would be problematic for other aspects of the theory since the postulated L could only be trailing or leading (since it occurs on a metrically weak syllable), but both H*+L¯ and L¯+H* have already been used to account for downstep (as will be discussed in the next section). Therefore, the low turning point is not expressed as a low tone in the phonological description and is not considered to be a low target, but is merely a special type of transition.

Figure 1.4 A dipping transition between two H* accents (redrawn from Pierrehumbert 1980: 282)

1.2.2 Tones that are not targets or turning points

The opposite situation is also found when a tone in the phonological specification is not considered to be a target and does not appear as a turning point in the contour. In strings of L tones, for example, the individual tones do not appear as turning points. For example, Pierrehumbert’s figures 2.14 (1980: 282) and 2.15 (1980: 283) show that in sequences such as L* L- L% and L* L* L-, the contour is fairly flat, and each independently specified tone does not appear as a turning point. The low boundary tone does not relate to a low pitch target but rather signals the absence of the rise that would be present if the boundary were specified as H% (Ladd 1996: 103).
As can be seen in Pierrehumbert’s Figure 4.2 (1980: 329), redrawn below in Figure 1.5, the contour remains flat between the low phrase accent and boundary tone. As Pierrehumbert (1980: 72) states, “cases like this are a major source of ambiguity in the intonational system; in cases where the L tones are on different levels, the location of accents is not readily recovered from the F0 contour”. A further complication involves the downstep rule. Pierrehumbert (1980) suggests that four pitch accents can trigger downstep of a following H. These are H¯+L*, H*+L¯, L*+H¯ and L¯+H*. As Pierrehumbert (1980: 152) states, three of these accents “occur transparently in the contour”. For H*+L¯, however, the “unstarred tone of the pitch accent does not show up in the same obvious way” (1980: 152). As can be seen below in Figure 1.5, there seems to be a smooth transition from one accented syllable to the next, and the L¯ does not appear as a turning point in the contour. Again, the L is not considered to be a pitch target, but merely triggers downstep on the following H*. Pierrehumbert (1980) explains that the reason for using H*+L¯ as one of the set that triggers downstep is that it is the only pitch accent left in the inventory that has not yet been used for another purpose.

Figure 1.5 Downstepping H*+L- accents (redrawn from Pierrehumbert 1980: 329) [contour labelled H*+L¯ H*+L¯ H* L¯ L%]

1.2.3 Tones that are targets but not turning points

The lack of a clear definition for tones is not peculiar to the Pierrehumbert approach. It is a problem for AM approaches in general and still has not been resolved. In addition to the problems discussed above, where tone, target and turning point are not associated in a one to one fashion, there is also another problem. It is often the case that researchers agree that there is a tone in the phonological description, and that this tone is implemented as a tonal target.
However, the realisation of the tonal target may often not be a single turning point, making it difficult to decide with any precision which part of the contour should be labelled as the target. This difficulty in locating targets is especially problematic as the current focus in the discipline is on target alignment (the precise timing relationship between a target and some tone-bearing unit), and therefore researchers wish to identify with some precision the location of targets in the contour. Nevertheless, they are often frustrated in their attempts to find a single turning point with which they can identify a pitch target. The following quotations illustrate some of the problems they face.

(1) “The primary difficulty in using the F0 contour as the phonetic representation of intonation is the extent to which it is affected by the speech segments…Such effects on F0 have been the object of much study in their own right…However, from the point of view of characterising the intonation system synchronically, they are a source of noise which must be factored out…it is often difficult to determine precisely the location of tones with respect to the text” Pierrehumbert (1980: 14)

(2) “in order to abstract away from a surface F0 contour that is unavoidably polluted by segmental influences and voicing irregularities we argue that one should consider models that are able to dissociate these influences and irregularities from the primary features of interest…” van Santen (2002: 107)

(3) “The time point for the F0 maximum is probably the least reliable of measurements. Segmental effects from the nasal and irregularities in the pitch made its location rather uncertain in many cases. Where there were two alternative locations that could be construed to be the relevant F0 peak, we took the average between the two.” Silverman and Pierrehumbert (1990: 79)

(4) “In identifying peaks (H) and troughs (L) in the F0 contour itself, in many cases the physical evidence is not clear cut. For example, the high or low value may be sustained over a period of time, and analysts must make a principled decision: should they choose a point at one or other edge of this sustained frequency, or select a mid-point?” House and Wichmann (1996: 3)

(5) “there are cases in which only stipulative decisions can be made regarding tonal target position, either because a single maximum (or minimum) cannot be discerned (as in plateaus …), or where a perturbation is created by some consonant, or because the target is masked by the presence of a voiceless sound” D’Imperio (2002a: 103)

There are several points that arise from these quotations. Quotations 1, 2, 3 and 5 all note that segmental effects on the contour can make it difficult to tell where targets are located. For example, voiceless segments may disguise the location of a target, and certain consonants may create perturbations that make the F0 rise or fall. Quotations 1 and 2 both see the effects of segmental factors to be only noise that should be removed if generalisations are to be drawn from the data. Quotation 3 demonstrates the type of solution researchers resort to when they are unsure where a target is or which part of the contour is to be identified with the target. Although these solutions are principled ones, different researchers use different methods, and the solutions are generally arbitrary and not based on any external evidence that might suggest which point is likely to be a speaker’s target. Quotations 4 and 5 illustrate a different but related problem: how can a pitch target be identified when the F0 stays level for some time?
If there is no clear peak or trough in the contour, how can the exact location of the target be specified? Should the target be identified with a particular part of the level section (e.g. the beginning, middle or end) or with the whole section of contour at that level? It is important to note that this problem does not only apply when plateaux are completely flat. Even if there is a point that is slightly higher than the surrounding contour, it is not clear that the listener’s ear will be sensitive enough to tell that the pitch has changed. This dissertation will focus on situations where there is a plateau in the contour. In these situations, it is clear that there is a tone in the phonological description and that this tone is implemented as a pitch target. However, the realisation as a level stretch of a particular pitch makes it unclear exactly where the speaker’s target is located.

1.3 Previous studies of intonational plateaux

There are a few cases in the literature where intonational plateaux themselves are the objects of study, rather than just being noted as an interesting point or presenting a problem to overcome. These cases will be discussed below, in relation to the individual languages under study. It is important to make clear at this point that these plateaux are those found as the phonetic implementation of a single phonological tone, where the pitch target is realised as a plateau rather than the well-defined peak or trough that a strong version of AM theory would predict. These plateaux are different in kind from those found in rise-plateau-slump configurations (see e.g. Cruttenden 1997: 133-136) in varieties of Urban Northern British. In rise-plateau-slump configurations, the plateau is sustained over several unaccented syllables, whereas the focus in the studies below is on plateaux found in a single accented syllable, as the realisation of a single pitch target.
1.3.1 Tokyo Japanese

Pierrehumbert and Beckman (1988) discuss a case of plateaux found as low boundary tones in Tokyo Japanese. Traditional accounts of this dialect suggest that the accentual phrase starts with a L tone and is followed by a H tone. The rise is considered to mark the boundary. However, in some cases, such as when the first syllable is accented or long (containing two sonorant morae), the L tone is usually assumed to be absent. Nevertheless, when Pierrehumbert and Beckman (1988) examine utterances with long and accented initial syllables they find that there is still a rise, as shown in the left panel of Figure 1.6 when the accented word uni is initial. The main difference between that contour and the one for the unaccented word ume, shown in the right panel of Figure 1.6, is in the timing and the duration of the low tone. In the unaccented case, the low tone is lower and appears to be sustained, forming a plateau.

Figure 1.6 F0 contours for an accented (left panel) and unaccented (right panel) first syllable (redrawn from Pierrehumbert and Beckman 1988: 27) [panel labels: a mai u ni (left), a mai u me (right)]

Pierrehumbert and Beckman (1988: 14) term this difference in the realisation ‘L-% tone allophony’ which they believe is governed by the weight and accentuation of the first syllable of the following phrase. They suggest that the differences may be due to different associations to the prosodic hierarchy. When the initial mora of a phrase is accented (i.e. bears a H tone) the L is associated with the boundary and is realised as a single low turning point; this is termed the ‘weak allophone’. However, when the initial syllable is unaccented and the first mora does not bear a H tone, the L tone moves from the boundary and becomes associated with this mora; this is the ‘strong allophone’, and the L is realised with a longer duration and a lower F0 than the weak allophone.
It seems that the strong allophone is considered to be the normal case that occurs if other tones do not intrude on the associated mora. Pierrehumbert and Beckman (1988: 28) state that in the accented case the L tone “cannot have the extended duration that it has when it alone is associated with the first syllable”. They further suggest that these differences are due to the signalling of prominence relations; “by lengthening its duration and by lowering its pitch relative to that of adjacent H tones, a L tone can be made prosodically stronger” (1988: 29). In this account, then, the plateau is seen as a normal realisation of a low target and its presence is conditioned by prominence relations within the utterance. Unfortunately we know little about the detailed phonetic realisation of each allophone. We are told that the strong allophone is 10 Hz lower than the weak allophone (1988: 29), but we are not told how long the strong allophone is or if the difference between the two allophones is gradient or categorical.

1.3.2 British English

1.3.2.1 Wichmann et al. (1999)

Wichmann et al. (1999) find plateaux when studying the effects of discourse structure on peak timing in English. The stimuli consisted of texts where the same word occurred in three different positions: sentence final (nuclear), paragraph initial (the first accented syllable in a paragraph) and sentence initial (the first accented syllable in a paragraph medial sentence). Findings suggested that peaks are later in the more initial positions (p.1766). However, in addition to peaks, plateaux were also found for some speakers in topic-initial position (p.1768) as shown below in Figure 1.7.

Figure 1.7 A plateau in paragraph initial position (upper panel) and peak in sentence initial position (lower panel) (taken from Wichmann et al. 1999: 1767)

No detail is given about the duration or other aspects of these plateaux, but the authors do suggest two reasons for the occurrence of the plateaux.
One is that the plateau itself may be a device for signalling structure. In this version the realisation of the high target as a plateau would itself signal initiality. This explanation sees the whole plateau as the target and sees the target’s realisation as influenced by discourse factors. The second explanation is that the plateau may be an alternative way of delaying the peak. The speaker separates the rise and fall by a plateau, thus delaying the peak without delaying the rise. This explanation, then, sees the end of the plateau as the speaker’s target, whilst the plateau itself is merely a way of creating peak delay without delaying the preceding rise. The diagrams given to illustrate the plateau (p.1767, and Figure 1.7 above) suggest, however, that the end of the rise may be delayed as well. If this were the case for all the plateaux found then this second explanation would seem unlikely.

1.3.2.2 ProSynth

The ProSynth³ group conducted the largest scale study of intonational plateaux. Their work was motivated by a desire to produce natural-sounding synthetic speech that would be robust in difficult listening conditions. Their work on intonation focuses on that of a single male speaker of Southern British English. The database consists of 467 utterances all with a falling (H*L) nuclear accent. Utterances have either one or two syllables in the final foot and the accented syllable varies in terms of onset and coda type. House et al. (1999b) present an account of the procedures used to model the intonation, whilst Ogden et al. (2000) present the results of several perceptual experiments.

1.3.2.2.1 House et al. (1999a and 1999b)

House et al. (1999b) describe how they began their attempt at modelling intonation by reducing the contour to a small number of turning points.
They found that the high tone of nuclear accents could not be successfully modelled using a single turning point representing a peak, but that instead two points marking the beginning (PON) and end (POF) of a plateau were needed (p.2344) as shown in Figure 1.8. In order to automatically derive these turning points an algorithm was implemented to identify the point in the contour where the pitch was within 4% of the absolute peak value. This 4% range was motivated by studies cited in Rosen and Fourcin (1986) (which will be discussed further in Chapter 2) that suggest that 4% approximates the range of perceptual equality, so that everything within this range should sound of an equal pitch to the listener.

Figure 1.8 A plateau in the contour and the turning points identified for use in speech synthesis (taken from House et al. 1999b: 3)

³ “An integrated prosodic approach to device-independent, natural-sounding speech synthesis”. EPSRC grant numbers: GR/L53069, GR/L51829 and GR/L52109

A statistical analysis of the whole database revealed that both the beginning and the end of the plateau covary systematically with linguistic structure (pp.2344-2345). Both the onset (PON) and the offset (POF) of the plateau are found to occur later in feet with two syllables than in feet with one syllable. Segmental effects also influence the timing of the turning points. In feet with only one syllable, PON is earlier with sonorant onsets than other types, and POF is later with voiceless than with sonorant onsets. In feet with two syllables, PON is later when the onset of the syllable is voiced, whereas POF is later with empty onsets and earlier with voiceless codas. House et al. (1999b) make two main claims about the status of the plateau. One is that the plateau might be an alternative phonetic implementation of H*L resulting from systematic structural differences (p.2346).
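As an aside, the 4% criterion used to derive PON and POF can be illustrated with a short sketch. This is not the ProSynth implementation, whose details are not given in the account summarised here. It assumes that the contour is available as two parallel lists of sample times and F0 values in Hz, that all samples are voiced, that the 4% is measured in Hz relative to the absolute peak, and that the plateau is the unbroken stretch around the peak that stays within that range. The function name `plateau_bounds` and its parameters are hypothetical.

```python
def plateau_bounds(times, f0, tolerance=0.04):
    """Return (PON, POF) times for the plateau around the F0 peak.

    Hypothetical helper: the plateau is taken to be the contiguous
    run of samples around the absolute peak whose F0 is no more than
    `tolerance` (a proportion of the peak value) below the peak.
    """
    if not f0 or len(times) != len(f0):
        raise ValueError("times and f0 must be non-empty and equal in length")

    # Locate the absolute F0 maximum (first occurrence if there are ties).
    peak_index = max(range(len(f0)), key=lambda i: f0[i])
    threshold = f0[peak_index] * (1.0 - tolerance)

    # Walk leftwards from the peak to find the plateau onset (PON).
    start = peak_index
    while start > 0 and f0[start - 1] >= threshold:
        start -= 1

    # Walk rightwards from the peak to find the plateau offset (POF).
    end = peak_index
    while end < len(f0) - 1 and f0[end + 1] >= threshold:
        end += 1

    return times[start], times[end]


# Invented contour sampled every 10 ms, for illustration only.
times_ms = [0, 10, 20, 30, 40, 50, 60, 70]
f0_hz = [150, 180, 196, 200, 198, 193, 170, 140]
pon, pof = plateau_bounds(times_ms, f0_hz)
```

With a 200 Hz peak the threshold is about 192 Hz, so in this invented contour the samples at 20 ms to 50 ms fall inside the plateau. A real contour would also need unvoiced frames and microprosodic perturbations to be handled before such a search, which this sketch leaves out.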
Although the statistical analysis of the database treated all high targets as plateaux, the visual analysis suggests that plateaux are found more often in two-syllable feet with voiced obstruent onsets and more variably with sonorant onsets. In one-syllable feet the simple peak was more common but again plateaux were sometimes found in syllables with sonorant onsets (p.2344). The second suggestion (p.2346) is that the end of the plateau (POF) may be the speaker’s real target, as it is less prone to microprosodic effects and less likely to be affected by interpolation from turning points in the previous accent than PON. There is also a suggestion that the alignment of POF may create a different semantic effect, as a later alignment of this point within the syllable is described as giving an impression of greater finality.

1.3.2.2.2 Ogden et al. (2000)

In Ogden et al. (2000) the effectiveness of the modelling is tested by means of a number of perceptual tests, two of which focus on intonation. In one experiment, the effect of correct modelling on the impression of naturalness or neutralness is investigated. In this experiment, utterances are synthesised which have a monosyllabic final foot. For each utterance, POF and LON (the turning point representing the low elbow in the contour) are aligned either correctly (for a monosyllabic foot) or incorrectly (for a polysyllabic foot). Exact alignment is based on onset and coda type and is modelled in relation to the vowel. Pairs of segmentally identical utterances were created where intonation was correct in one member of the pair and incorrect in the other. Eleven subjects pressed buttons to indicate which version in each pair sounded more neutral. Seventy-eight percent of responses favoured correct alignment and this was found to be significantly more than would be predicted by chance (p.203).
One subject, however, was given different instructions and asked to indicate not which version sounded more neutral, but which he preferred. This subject preferred the version with the incorrect alignment 89% of the time, indicating that instructions have an important effect on subjects’ responses (p.204). A second experiment was designed to see if the correct modelling of intonation facilitated listeners’ understanding of the linguistic message. Subjects read a story and were then given written questions, the answers to which were utterances synthesised with either the correct or incorrect alignment of POF and LON. They had to indicate if the synthesised answer was true or false, and reaction times and errors were measured. There was a trend for reaction times to be quicker with correct alignment, but this trend was not significant (p.204). Error data are not reported. It is suggested (p.206) that there are problems with the experiment. The questions differ in difficulty, there is a high cognitive load and the subjects were not forced to respond fast, so seem to have very variable reaction times and not to make many errors. It seems that the alignment of the end of the plateau or the duration of the plateau may create different perceptual effects. It is noted that a later POF in the stimuli creates a longer plateau, which is perceived as a high-fall, as opposed to the low-fall perceived when the plateau is shorter. It is noted that the high-fall is usually associated with emphasis. It is not possible to tell whether the difference in perception is created by the alignment, or the duration of the plateau, or a combination of the two.

The results from the ProSynth group demonstrate that the phenomenon of the plateau is subject to influences from linguistic structure, and that the alignment of the end point (in combination with that of the low tone) may affect perceptions of naturalness and might possibly affect comprehension.
1.3.3 Neapolitan Italian

D’Imperio (2002a) explicitly addresses the question of what the tonal target really is inside a plateau, an example of which is shown in Figure 1.9. Data is reported from D’Imperio et al. (2000), who use the timing differences between statements and questions in Neapolitan Italian as a test-bed to determine the perceived location of tonal targets. In Neapolitan Italian both questions and (narrow-focus) statements are realised by rise-falls, the peaks of which are aligned around 40 ms later in questions (D’Imperio 2002a: 103). D’Imperio and House (1997) and D’Imperio et al. (2000) show that in perceptual experiments where the peak of a rise-fall configuration is shifted later in time, subjects identify earlier peaks as statements and later peaks as questions.

Figure 1.9 An example of a plateau in a H+L* accent in Neapolitan Italian (taken from D'Imperio 2002a: 101)

This situation was used to test what would happen if plateau stimuli were used instead of peak stimuli. D’Imperio et al. (2000) conducted the same perceptual experiment again but with a 45 ms plateau spreading rightwards from the peak. The argument is that if the real tonal target is the beginning of the plateau then there should be no difference between responses to plateau and peak stimuli (as the plateau is extended forwards in time). The different stimuli are shown below in Figure 1.10. However, when the results from plateau stimuli are compared to those from peak stimuli, it is clear that they are actually very different. Specifically, plateau stimuli elicit more question responses at each timing point. For example, at the second time step, plateau stimuli gained 75% question responses, whilst peak stimuli gained only around 10%. This result was taken to indicate that the perceived target was not located at the beginning of the plateau.

An alternative hypothesis was that the perceived target might be at the end of the plateau.
This hypothesis was tested by comparing the results from the plateau experiment to the results from the peak stimulus corresponding to the plateau offset (that is the peak stimulus three time steps later than the peak stimulus timed at plateau onset). Here, the differences between peak and plateau stimuli were much smaller. For example, at the second time step, 82% of responses favoured the question interpretation, compared to 75% for plateau stimuli. This result leads to the suggestion that the perceived tonal target is located near to but not exactly (as the results were not identical) at the end of a plateau.

Figure 1.10 Plateau stimuli and peak stimuli timed at plateau onset and offset (adapted from D'Imperio 2002a: 104) [stimulus labels: plateau; peak at plateau onset (time step 1); peak at plateau offset (time step 4)]

1.3.4 Summary and discussion of previous work on plateaux

The plateau has been the subject of only four main investigations, all with different aims, methods and conclusions. The Japanese plateaux of Pierrehumbert and Beckman (1988) are unusual in that they are low tones that are in boundary position. In this case, the plateau is seen as being an alternative realisation of a low tone with realisation conditioned by prominence relations within the utterance. The Wichmann et al. (1999) study of discourse variables in British English finds plateaux in paragraph initial (prenuclear) position. It is again suggested that the plateau could be an alternative realisation of a target alternating with a single high turning point with realisation constrained by discourse factors. Alternatively, it is suggested that the end of the plateau might be the speaker’s real target. House et al. (1999a and 1999b) find high plateaux in nuclear position that covary with onset, coda and foot type.
Again it is suggested that the plateau may be an alternative realisation of a peak, constrained by structural factors, but it is also suggested that the end of the plateau may be a more suitable target due to fewer microprosodic influences. This fact is reflected in the perceptual work of Ogden et al. (2000), who model differences in foot structure in relation to the end of the plateau (and low tone) and find that there is an effect on perception of naturalness and possibly also on processing. D’Imperio et al. (2000) and D’Imperio (2002a) are the only studies to explicitly address the question of tonal targets in plateaux. In Neapolitan Italian it appears that a point near to the end of the plateau is likely to be the perceived tonal target. Thus, there are two basic views. One is that the whole plateau is an alternative realisation of a simple turning point conditioned by various discourse, prominence or structural factors. In this view the plateau is an intentional device on the part of the speaker to signal something meaningful about the utterance by means of contrast to a simple peak. The alternative view suggests that the end of the plateau is the speaker’s real target and is in some sense equivalent to a simple peak. In these cases it is unclear why a speaker might produce a plateau rather than a peak, but the plateau itself is not linguistically significant, only the timing of the end point.

It is interesting that the beginning of the plateau has not been postulated as the speaker’s intended target. As House et al. (1999b: 2346) state, the beginning of the plateau is probably more prone to microprosodic effects than the end. In addition, it seems likely that target location is influenced by perceptual constraints. In the high target in falling accents the end of the plateau is equivalent to the beginning of the fall (especially in the ProSynth work where it represents the end of the range of perceptual equality).
It is likely that this is a perceptually salient part of the contour for listeners, more so than the beginning of the plateau, at which point they still have to wait to find out what will happen next. The only information given by the beginning of the plateau is that the pitch is no longer rising. At the end of the plateau, however, the listener knows that pitch is no longer level and has begun to fall.

1.4 Aims, scope and structure of this dissertation

It seems that, despite the strong predictions of descriptions based on pitch targets and turning points, the plateau is a fairly robust phenomenon, found for both high and low tones, for both pitch accents and boundary tones, and in at least three different languages. Nevertheless, we still know very little, and thus the main aim of this dissertation is to extend the set of facts known about intonational plateaux. This dissertation will focus on high targets in nuclear falls in British English, as these have been the most extensively studied. In addition, English is the language for which most is known about peaks, meaning that results obtained for plateaux can be compared to previous findings. The dissertation will aim to answer a number of questions that arise from the presence of plateaux in the contour and from the previous four studies in the literature. For example, House et al. (1999a and 1999b) suggest that plateaux covary with linguistic structure but this claim is based on studies of only one speaker and in relation to only three linguistic variables. This dissertation will begin by studying the realisations of high targets in the productions of five speakers of Standard Southern British English, and additional linguistic structures will also be considered. Although Wichmann et al. (1999) have considered discourse factors, other factors known to affect peak alignment, such as non-linguistic variables like pitch span, have not yet been considered.
The third chapter will, therefore, consider how the plateau is affected by some of these variables, and how the effects found are related to previous findings for peak alignment. The results given in this third chapter allow us to choose between the possibilities that the whole plateau or one or other of its end points is the speaker’s real pitch target. Little is known about listeners’ use of the plateau. Although Ogden et al. (2000) have demonstrated that the alignment of the end of the plateau can affect naturalness judgements, it is not known if alignment affects speech processing, as the results from Ogden et al.’s (2000) second experiment were not significant. Chapter 4 considers the possible use of plateau alignment in spoken word recognition.

Chapter 5 returns to the issue of why speakers produce plateaux rather than simple peaks. This chapter considers the perceptual effects of having a plateau in the contour rather than a peak and the physiological mechanisms behind the production of plateaux. Finally, a reason for the occurrence of plateaux is presented and discussed in relation to the previous studies of plateaux. Chapter 6 summarises the work presented in the other chapters and draws general conclusions.