Chapter 1

Introduction
1.1 Background
This dissertation investigates the realisation of the intonational high target in falling
nuclear pitch accents in English. This work arises from the observation that the high
target is often realised not as a single, well-defined peak, but as a plateau in the contour.
The existence of such plateaux raises the question of what a speaker’s real intonational
target is. In situations where there is a well-defined turning point in the contour it is
likely that this turning point is the speaker’s target, but when the target is realised as a
plateau it is more difficult to know which parts of the plateau are linguistically relevant,
and which parts are merely the results of interpolations between targets and smoothing in
the production process.
One question that initially arises is why we should expect there to be an intonational
target in the contour. Phonetically, any given pitch change can, of course, be expressed
by reference to its trajectory, for example rising or falling, or can be seen as two different
pitch targets, such as high and low, joined by an interpolation. These two alternative
representations for a falling intonation contour are shown diagrammatically in Figure 1.1.
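[Figure: the same falling contour labelled once as a single movement, ‘Fall’, and once as two targets, ‘High’ and ‘Low’]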
Figure 1.1 A falling intonation contour expressed firstly as an indivisible whole and secondly as a
sequence of two intonational targets
From a phonological point of view, however, these two alternatives are not equivalent, as
it is important to know what the meaningful features of intonation in any particular
language are.
Establishing the correct representation means that rules can be more
economical and may capture generalisations that might otherwise be missed.
These two different ways of representing the same contour have influenced the
phonological description of intonation.
Ladd (1996: 11) states that, like any other
phonological description, a complete phonological description of intonation must
minimally include a level at which sounds are represented as a small number of
categorically distinct elements and another level at which these elements are mapped onto
the physical continuously varying parameters of the sound, in this case the intonation
contour. Different methods for describing intonation use different types of categorically
distinct elements, influenced by the decision to see the contour in terms of either
movements or pitch targets.
This debate is a long-standing one, often referred to,
following Bolinger (1951), as the ‘levels versus configurations’ debate. Descriptions in
the British tradition, for example, focus on the movements in the intonation contour.
1.1.1 Contours as the units of analysis
The British tradition of intonational description, begun by Palmer (1922) and exemplified
in the work of O’Connor and Arnold (1973) and Halliday (1967), emphasises the
importance of movement in the intonation contour. In this tradition, the tone group or
tone unit, the equivalent of the intonational phrase, is separated into several units. The
nucleus is the syllable of maximum prominence. The syllables before the first accent
make up the pre-head. The syllables from the first accent to the nucleus are termed the
head and the syllables following the nucleus are the tail. For example, as shown in Figure
1.2 in a neutral reading of the sentence, “I don’t remember the number”, the nucleus falls
on the syllable ‘num’, the pre-head is the word ‘I’, the head consists of the words, ‘don’t
remember the’ and the tail is the final syllable of the word ‘number’.
pre-head   head                 nucleus   tail
I          don’t remember the   num       ber
Figure 1.2 The units of analysis used in the British tradition
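The division of a tone unit described above can be sketched in code. The following is only an illustrative sketch: the `Syllable` class, the `split_tone_unit` function and the syllabification of ‘remember’ are assumptions introduced here, and the first accent is assumed to fall on ‘don’t’, as the description of the head implies.

```python
from dataclasses import dataclass


@dataclass
class Syllable:
    text: str
    accented: bool = False  # carries an accent
    nuclear: bool = False   # syllable of maximum prominence


def split_tone_unit(syllables):
    """Split a tone unit into pre-head, head, nucleus and tail."""
    nucleus_idx = next(i for i, s in enumerate(syllables) if s.nuclear)
    first_accent = next(
        i for i, s in enumerate(syllables) if s.accented or s.nuclear
    )
    return {
        "pre-head": [s.text for s in syllables[:first_accent]],
        "head": [s.text for s in syllables[first_accent:nucleus_idx]],
        "nucleus": [syllables[nucleus_idx].text],
        "tail": [s.text for s in syllables[nucleus_idx + 1:]],
    }


# Hypothetical syllabification of "I don't remember the number".
utterance = [
    Syllable("I"),
    Syllable("don't", accented=True),
    Syllable("re"), Syllable("mem"), Syllable("ber"),
    Syllable("the"),
    Syllable("num", accented=True, nuclear=True),
    Syllable("ber"),
]
units = split_tone_unit(utterance)
```

If the nucleus is itself the first accent, the head is simply empty under this sketch.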
The intonation in each of these units is described with reference to the movements in
pitch that are associated with them. For example, the intonation within the nucleus is
described by stating the height at which the movement starts and the movement/s after the
nucleus. This description gives rise to nuclear tones such as high-rise or rise-fall.
1.1.2 Evidence for tonal targets
Despite the descriptions of intonation within the British tradition, which emphasise the
importance of movement in the contour, there is abundant evidence that in both
production and perception movements in pitch are rather unimportant.
Instead, as
discussed in more detail in the following sections, the crucial factor is that a particular
height of contour is reached at a particular time in relation to the text. So, speakers aim to
produce particular pitch targets, and the links between these targets are linguistically
irrelevant interpolations.
It is, of course, not the case that pitch movements are linguistically unimportant in all
languages. In fact, languages vary in terms of whether movements or pitch targets are
prime. The earliest discussion of this variation related to the representation of contour
tones (such as rises and falls) in tone languages. As early as Pike (1948) it had been
suggested that different languages behave differently according to whether their contour
tones behave as unitary wholes or sequences of level tones. Many lines of evidence have
been brought to bear on this issue and are summarised by Anderson (1978). Pike (1948)
suggested that Asian tone languages represent a situation in which contour tones are
unitary wholes, whilst African tone languages represent a case where contour tones are
better described as sequences of level tones. This dichotomy now seems a little too clear-cut, as it has recently been demonstrated not only that these geographical boundaries are
largely irrelevant but that different dialects of the same language may in fact behave
differently in this respect. Chen (2000 cited in Zhiming 2003: 150) shows, for example,
that in Mandarin and Min dialects of Chinese, contour tones are single units, whereas in
Wu dialects contour tones are best treated as strings of level tones.
What is clear is that for any particular language or variety the evidence must be assessed
separately. The initial evidence for the importance of pitch targets in non-tone languages
began with Bruce’s (1977) study of the Swedish word accents.
1.1.2.1 The evidence from Swedish word-accents
Bruce (1977) produced a study of the Swedish word accents. In Swedish, the stressed
syllable of a word carries one of two word accents. These accents are termed acute
(Accent 1) or grave (Accent 2). For example the word anden with Accent 1 means ‘the
duck’, whilst anden with Accent 2 means ‘the spirit’. The word accents are distinguished
by means of two distinctive pitch accents. Bruce (1977) studies the Stockholm variety of
Swedish, in which it appears from earlier studies that the word accent distinction is based
on the number of peaks in the accent. Thus Accent 1 words have a single peak whilst
Accent 2 words have two peaks.
In a production study of one primary and two secondary informants, Bruce attempts to
tease apart the relative contributions of different levels of intonation. When word accents
are isolated so that they are not affected by sentence accents or terminal junctures, both
Accent 1 and Accent 2 are actually realised as a single peak in the contour, rather than the
double peak for Accent 2 words that had previously been proposed. The actual difference
between the two single peaked accents was found to be primarily one of timing. As
shown in Figure 1.3, the peak for Accent 1 words occurs in the prevocalic consonant, and
the peak for Accent 2 words occurs in the stressed vowel itself (Bruce 1977: 49). There is
also a difference in the gradient of the fall from the peak, the gradient being steeper in
Accent 1 than in Accent 2 words.
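[Figure: schematic pitch contours for Accent I and Accent II, each aligned with a sequence of consonant (C) and vowel (V) segments]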
Figure 1.3 Difference in timing of two Stockholm Swedish word accents (adapted from Bruce (1977:
64))
In order to assess which aspects of the contour are perceptually relevant for the word
accent distinction, Bruce (1977: chpt. 7) carries out a perceptual experiment using
synthetic speech, where the start, end and gradient of the fall are varied. In the test
stimuli the phrase ‘inga malmer’ was used where ‘inga’ was in focus and ‘malmer’ could
be one of two words depending on the word accent. With Accent 1 the phrase would be a
woman’s name whereas with Accent 2 it would be translated as ‘no ores’ (as the first
name ‘Inga’ and the plural of the negative, indefinite pronoun ‘inga’ are homophonous).
The timing of the fall varied in eight steps of 10 ms whilst at each timing position the fall
then occurred over 40 ms, 60 ms or 80 ms to create the different gradients. Subjects were
asked whether they perceived each stimulus as a woman’s name or as the phrase ‘no
ores’.
When the mid-point of the fall is taken as the measure of timing, it is clear that the
gradient of the fall has no effect on the identification of the word accents. For each
gradient, when the midpoint is 15 ms or less into the vowel, the word is identified as
Accent 1. When the midpoint occurs at 35 ms into the vowel, however, a shift of
identification has occurred and the word is identified as Accent 2.
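The design of this experiment can be sketched as a grid of stimuli. The sketch below is illustrative only and rests on several assumptions that are not taken from Bruce (1977): that it is the midpoint of the fall that is stepped, that the first midpoint lies 25 ms before the vowel onset (`FIRST_MIDPOINT_MS`), and that midpoints later than 35 ms pattern with Accent 2. Only the eight 10 ms steps, the three fall durations and the two reported midpoint values come from the description above.

```python
from itertools import product

FALL_DURATIONS_MS = (40, 60, 80)   # three gradients
N_TIMING_STEPS = 8                 # eight timing positions
STEP_MS = 10                       # 10 ms apart
FIRST_MIDPOINT_MS = -25            # hypothetical starting position


def build_stimuli(first_midpoint_ms=FIRST_MIDPOINT_MS):
    """Cross every timing position with every fall duration."""
    stimuli = []
    for step, duration in product(range(N_TIMING_STEPS), FALL_DURATIONS_MS):
        midpoint = first_midpoint_ms + step * STEP_MS
        stimuli.append({
            "midpoint_ms": midpoint,      # relative to vowel onset
            "duration_ms": duration,
            "fall_start_ms": midpoint - duration / 2,
            "fall_end_ms": midpoint + duration / 2,
        })
    return stimuli


def reported_identification(midpoint_ms):
    """Label a stimulus by the midpoint of its fall only.

    The gradient is deliberately ignored, mirroring the finding that it
    has no effect on identification.
    """
    if midpoint_ms <= 15:
        return "Accent 1"
    if midpoint_ms >= 35:   # assumption: later midpoints behave like 35 ms
        return "Accent 2"
    return "not reported"   # the region between 15 ms and 35 ms


stimuli = build_stimuli()
```

Because the label depends on `midpoint_ms` alone, the three stimuli that share a timing position always receive the same label, whatever their gradient.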
Bruce (1977: chpt. 8) uses these findings in deciding how to formulate the pitch rules for
his model of Swedish intonation. He states that he will describe pitch in terms of levels
rather than in terms of movements as “reaching a certain pitch level at a particular point
in time is the important thing, not the movement (rise or fall) itself” (p.132). In this way
pitch movements become mere transitions (often over voiceless sections of the utterance)
between pitch levels, and pitch change has to be inferred by comparing the pitch levels in
different vowels.
There is abundant evidence that pitch targets or levels are also the important part of the
contour in English intonation. Ladd (1996) discusses three main ways in which
an analysis using pitch targets is to be preferred to one using contour tones. These are
tone timing, tone scaling and the failure of predictions based on pitch excursions, as
discussed in the following subsections.
1.1.2.2 The evidence from English
1.1.2.2.1 Timing
Ashby (1978) notes that the timing of accentual peaks is very constant. In his study he
compares several tokens of three different speakers’ productions of high falls and low
rises.
He notes, in falls, for example, that although there are differences between
speakers, each speaker’s timing of the peak is largely constant. For example, one speaker
consistently times the peak just after the onset of the vowel (p.332), while another speaker
aligns peaks later (p.334) but just as consistently. These findings suggest that the timing
of peaks is something that the speaker tightly controls.
1.1.2.2.2 Scaling
Liberman and Pierrehumbert (1984) demonstrate that the scaling of pitch targets is also
very constant. In one experiment, the phrase ‘Anna came with Manny’ is read with either
‘Anna’ in focus (in answer to the question ‘Who came with Manny?’) or with ‘Manny’ in
focus (in answer to the question ‘Who did Anna come with?’). These phrases were read
by four speakers in ten different pitch ranges.
The first important finding is that in falling accents, low tones (the lowest point of the
fall) show little variation.
They remain at a constant fundamental frequency (F0)
regardless of pitch range and are “nearly constant for a given speaker” (p.181). High
tones (the peak of the accent) are found to vary with pitch range, getting higher with
increasing range. Nevertheless, another important result suggests that the relative heights
of peaks are also being precisely scaled. Looking at the ratio between the first and second
peak (pp. 173-176), the results show that there is a constant relation between the two
peaks regardless of pitch range, suggesting that their height is being carefully controlled.
1.1.2.2.3 Failure of predictions based on excursions
Finally, the evidence also suggests that when the difference between an approach based
on targets and one based on contours can be teased apart, the results are better explained
by representations in terms of targets or levels. Liberman and Pierrehumbert (1984: 211)
compare the stability of the peak height ratios (discussed in the previous section) to ratios
derived from the size of the pitch excursion. They find that the results using rises and
falls are not correlated well and conclude that “it appears that rises and falls are not being
as carefully controlled as relative peak levels are” (p. 211).
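The arithmetic behind this comparison can be illustrated with a toy calculation. The figures below are invented purely for illustration and are not data from Liberman and Pierrehumbert (1984); the names `LOW_HZ`, `level_ratio` and `excursion_ratio` are likewise assumptions. The sketch supposes a constant low of 80 Hz and two peaks whose levels stand in a fixed relation across three pitch ranges.

```python
LOW_HZ = 80  # hypothetical constant low for one speaker

# Hypothetical (first peak, second peak) pairs in three pitch ranges.
peak_pairs = [(120, 96), (160, 128), (200, 160)]


def level_ratio(first, second):
    """Ratio between the two peak levels themselves."""
    return second / first


def excursion_ratio(first, second, low=LOW_HZ):
    """Ratio between the sizes of the two falls, measured from the low."""
    return (second - low) / (first - low)


level_ratios = [round(level_ratio(a, b), 2) for a, b in peak_pairs]
excursion_ratios = [round(excursion_ratio(a, b), 2) for a, b in peak_pairs]
```

With these invented values the level ratio is 0.8 in every pitch range, whereas the excursion ratio changes from 0.4 to 0.6 to 0.67. A speaker who controls relative peak levels would therefore show unstable excursion ratios, which is the pattern described above.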
It is clear for English, then, that the important parts of the intonation contour are pitch
levels or targets. This finding suggests that descriptions of English intonation should use
pitch levels or targets rather than pitch movements as the phonologically distinct
elements.
1.1.3 Early descriptions using levels as the units of analysis
The American tradition of intonation description sees the pitch contour as a string of level
tones that are joined together. As early as work by Pike (1945), systems were being
suggested in which the phonological units were level tones rather than the movements
within the contour. In these early American analyses, each utterance is described in terms
of pitch levels and terminal junctures. There are four pitch levels (with 1 being either the
highest or the lowest depending on the system). One of these four levels is specified at
the beginning of each utterance and then whenever a change to a different pitch level
takes place. One of three terminal junctures (level, falling or rising) is marked on the last
movement within the utterance. These terminal junctures, then, are different in kind to
the pitch levels used to represent the pitch in other parts of the utterance, possibly
reflecting an acknowledgement of the special status of the final movement within the
phrase.
The initial criticisms of American levels approaches are expressed by Bolinger (1951)
and are summarised in Ladd (1983) and Ladd (1996: 60). The first criticism focuses on
the fact that the number of levels into which the pitch range is divided is arbitrary. There
is no theoretical reason why four levels should be used in preference to any other number.
In addition, it is unclear how the pitch range should be divided up into these levels. It is
impossible to suggest that pitch levels should change at certain Hertz values, as pitch
range varies considerably both within and between speakers. Finally, there was no sense
of an association between the tune and the text, meaning that the representation did not
make clear where exactly tones were located in relation to the utterance.
1.1.4 Autosegmental-Metrical Approaches
Autosegmental-metrical (AM) descriptions of intonation solve many of the problems
inherent in traditional levels analyses, and Ladd (1996 and 1984) suggests that AM
approaches effectively resolve the ‘levels versus configurations’ debate. The success of
AM approaches depends on two features of the theory. Firstly, the pitch accent is
considered to be the prime unit of analysis but can be further divided into level tones.
Secondly, only two level tones H (high) and L (low) are used so that linguistically
unimportant variations in pitch range are not represented phonologically.
Pierrehumbert’s (1980) thesis represents the beginning of the popular acceptance of AM
approaches to the phonological description of English intonation. In Pierrehumbert’s
work, the basic categorical element is the pitch accent, a local pitch event associated with
the prominent syllables in the utterance (Ladd 1996: 46). Pitch accents are further
analysed into sequences of H and L tones and may consist of either a single tone
(monotonal) or a combination of two tones (bitonal). Also crucial is how pitch accents
align with stressed syllables. In bitonal accents, one tone is aligned with the stressed
syllable. This tone is called the ‘starred’ tone and is noted by a following asterisk (*).
The other tone is said to be either ‘leading’ or ‘trailing’ and is noted by the addition of a
raised hyphen (¯). The two tones are linked by a plus sign (+). For example, the notation
L* + H¯ indicates that a pitch accent contains a low tone associated with the stressed
syllable, which is then followed immediately by a high (trailing) tone.
In addition to pitch accents, intonation contours also contain phrase accents and boundary
tones, both of which occur at the edges of prosodic domains. In the original
Pierrehumbert (1980) work, phrase accents occur immediately after the pitch accent in the
main stress in the phrase, and boundary tones are found at the end of every intonational
phrase. In contrast to pitch accents, phrase and boundary tones may only be monotonal.
In the notation of this system, phrase accents are followed by a hyphen (-) and boundary
tones are followed by a percentage sign (%).
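The notation just described is regular enough to be read mechanically. The following sketch is offered only as an illustration of the conventions above; the function name `parse_label` and the dictionary it returns are assumptions made here, not part of any published system. It expects the raised hyphen (¯) on unstarred tones of bitonal accents and the plain hyphen (-) on phrase accents.

```python
def parse_label(label):
    """Parse one tonal label written in the notation described above."""
    label = label.replace(" ", "")
    # Edge tones: phrase accents end in "-", boundary tones in "%".
    for suffix, kind in (("%", "boundary tone"), ("-", "phrase accent")):
        if label.endswith(suffix):
            level = label[:-1]
            if level not in ("H", "L"):
                raise ValueError(f"edge tones must be monotonal: {label!r}")
            return {"type": kind, "tones": [(level, "edge")]}
    # Pitch accents: one or two tones joined by "+", exactly one starred.
    parts = label.split("+")
    starred = [i for i, part in enumerate(parts) if part.endswith("*")]
    if len(parts) > 2 or len(starred) != 1:
        raise ValueError(f"not a well-formed pitch accent: {label!r}")
    tones = []
    for i, part in enumerate(parts):
        if len(part) != 2 or part[0] not in ("H", "L"):
            raise ValueError(f"unexpected tone {part!r} in {label!r}")
        if part[1] == "*":
            role = "starred"
        elif part[1] == "¯":
            # Unstarred tones lead or trail the starred tone.
            role = "leading" if i < starred[0] else "trailing"
        else:
            raise ValueError(f"unexpected diacritic in {part!r}")
        tones.append((part[0], role))
    return {"type": "pitch accent", "tones": tones}


example = parse_label("L* + H¯")
```

Here `example` records a low starred tone followed by a high trailing tone, matching the reading of L* + H¯ given in the text.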
In addition to the phonological specification of tones, phonetic interpretation rules also
play a large part in establishing the contour in AM approaches. In Pierrehumbert (1980:
50-4) one set of rules specifies the fundamental frequency (F0) of the tone, and a second
set of rules interpolates between the tones and accounts for the F0 transitions between
them. Firstly, tones are evaluated from left to right and each tone’s F0 will depend on its
own phonological specification (i.e. H or L) and also the phonological and phonetic value
of the previous tone. The rules for interpolation come into play only when a tone’s F0
value has been specified. These rules are sensitive to that value, to the tone’s temporal
location and also to the phonological specification of the tone. In general, interpolation is
linear between two tones of different specification (i.e. HL or LH) or between two low
tones (LL). A dipping transition occurs between two H (HH) tones.
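The interpolation step can be sketched as follows. This is a minimal illustration and not Pierrehumbert’s (1980) actual rule set: the shape of the dipping transition is not specified in the description above, so the sketch simply assumes a parabolic sag whose greatest depth, `dip_hz`, is a hypothetical parameter. Each target is given as a (time, F0, tone) triple that is assumed to have been assigned its F0 value already.

```python
def interpolate(t, left, right, dip_hz=10.0):
    """F0 at time t between two already-specified targets.

    left and right are (time, f0_hz, tone) triples with tone "H" or "L".
    """
    (t0, f0, tone0), (t1, f1, tone1) = left, right
    x = (t - t0) / (t1 - t0)          # 0 at the left target, 1 at the right
    linear = f0 + x * (f1 - f0)
    if tone0 == "H" and tone1 == "H":
        # Assumed shape: a sag that is zero at both targets and deepest
        # (dip_hz below the straight line) midway between them.
        return linear - dip_hz * 4 * x * (1 - x)
    return linear                      # HL, LH and LL: straight line


fall_midway = interpolate(0.5, (0.0, 200.0, "H"), (1.0, 100.0, "L"))
dip_midway = interpolate(0.5, (0.0, 200.0, "H"), (1.0, 200.0, "H"))
```

Under these assumptions the fall passes through 150 Hz at its midpoint, while the contour between the two equal H targets sags to 190 Hz even though no L tone has been specified.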
Later developments of the Pierrehumbert (1980) approach have led to changes in the
notation system used and also to the theoretical status of some tones. For example,
Beckman and Pierrehumbert (1986) suggest that a phrase tone actually marks an
intermediate phrase, and that an intonation phrase is marked by both a phrase and a
boundary tone. In terms of notation, some systems, such as those used by Grice (1995)
and Gussenhoven (1984), no longer use the plus sign or the hyphen when indicating
bitonal pitch accents. Nevertheless, the basic premises of the original theory, such as the
pitch accent and two level tones, are maintained in most recent developments of AM
theory.
1.1.5 Autosegmental Phonology
Autosegmental-metrical (AM) approaches to intonation arose from the development of
autosegmental-metrical approaches to phonology in general. Before the rise of AM
phonology, the standard generative approach had been the most prominent theory of
phonology, and phonemes were considered to be of prime importance. Chomsky and
Halle’s (1968) Sound Pattern of English (SPE) and its precursors, such as Jakobson et
al. (1952), had little to say about suprasegmentals such as intonation. As phonemes were
considered to be bundles of binary valued distinctive features (represented in feature
matrices), suprasegmental features, which could spread over more than one phoneme, did
not sit easily within this approach. The earliest of the objections to the SPE approach
arose in respect to the treatment of prosodic features such as tone. In his seminal work,
Goldsmith (1976) brings together the deficiencies of the SPE approach in this respect, and
suggests that an autosegmental approach can account for phenomena in the world’s
languages that an SPE approach cannot account for.
In essence, the autosegmental
approach to phonology allows certain features to be removed from the feature bundle
representing a phoneme and placed on other tiers of representation, where they are
associated with phonemes in the segmental tier but retain autonomy. Goldsmith (1976)
states a number of now widely accepted arguments in support of the autosegmental
approach, which will be briefly outlined in the following sections.
1.1.5.1 Tonal stability
A serious problem for standard generative approaches had been that of tonal stability. It
is often the case in the world’s languages that if a vowel is deleted by some phonological
rule, or is made unable to carry a tone, then any tones associated with that vowel are not
similarly deleted but are instead moved to another vowel.
This fact is of prime
importance in the autosegmental tradition, as it suggests that the representation of tone is
autonomous from that of the segments, and hence a representation with a separate tier for
tone is required. As an example, Goldsmith (1976: 43) discusses the case of Lomongo.
In Lomongo, if two vowels are juxtaposed the first is elided, as shown in example 1.
(1)  Bàlóngó băkáé → bàlóngãkáé (his book)1
However, although the second /o/ vowel of bàlóngó has been deleted, its high tone has
not. Instead, the /a/ vowel of băkáé has undergone a change of tone where the high tone
from the deleted segment is combined with its own rising tone.
The phenomenon of tonal stability is, therefore, a strong demonstration that the features
of tone are not properties of the vowel as are other features such as roundness or
tenseness. If tone was a distinctive feature of the vowel, then it should be deleted along
with all the other features when the vowel is deleted. Instead, if tone is considered to
exist on a separate tier, there is no reason why it should be deleted with the vowel.
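The contrast between the two kinds of representation can be made concrete in a small sketch. It is illustrative only: the list-of-pairs encoding and the function `elide` are assumptions made here, only the vowel tier of the Lomongo example is modelled, and a following vowel is assumed to be available to receive the stranded tone.

```python
# Vowel tier of 'balongo bakae', each vowel paired with its tones.
# L = low, H = high; the first vowel of the second word is rising (L H).
vowels = [
    ("a", ["L"]), ("o", ["H"]), ("o", ["H"]),
    ("a", ["L", "H"]), ("a", ["H"]), ("e", ["H"]),
]


def elide(vowel_tier, index):
    """Delete the vowel at index but keep its tones on the tonal tier.

    The stranded tones are re-associated with the following vowel.
    """
    stranded = vowel_tier[index][1]
    result = [(v, list(tones)) for v, tones in vowel_tier]
    del result[index]
    next_vowel, next_tones = result[index]
    result[index] = (next_vowel, stranded + next_tones)
    return result


def melody(vowel_tier):
    """The tone sequence read off the tonal tier from left to right."""
    return [tone for _, tones in vowel_tier for tone in tones]


after_elision = elide(vowels, 2)
```

In `after_elision` the vowel that followed the deleted /o/ carries H L H, a high tone combined with its own rising tone, and the overall melody is unchanged. Had tone been stored as a feature inside the vowel, deleting the vowel would have removed the high tone with it.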
1.1.5.2 Melody Levels
Another classic line of evidence in support of an autosegmental representation of tone is
that some tone languages show grammatically distinctive melody levels. These melody
levels are sequences of tones that serve grammatical functions in the language, such as
marking tense, for example. Goldsmith (1976) discusses Leben’s (1973) account of
melody levels in Mende, a language of Sierra Leone. In Mende, there are five tones: low,
high, rising, falling and rising-falling. There are also morphemes with one, two and three
syllables. As Goldsmith (1976: 46) explains, if the distribution of tones over syllables
were random, then we should be able to find every combination of tones in every
morpheme type. This is not the case however. For each morpheme length, only the same
five patterns are found. Leben (1973: 65) proposes a tone-mapping rule where tones are
mapped from left to right onto a morpheme. If there are more vowels than tones, the last
tone of the pattern is copied onto the remaining vowels. If there are more tones than
vowels, mapping proceeds in a one to one fashion and the final vowel receives all the
remaining tones as shown in example 2.
1 Symbols for tones in this dissertation will follow those typically used in phonological descriptions of tone
languages. á is a high tone, à is a low tone, ā is a mid tone, ă is a rising tone and â is a falling tone.
ã is a high tone combined with a rising tone.
(2)  Tone                    1 syllable   2 syllable   3 syllable
     High (H)                k            pp
     Low (L)                 kp           bl
     Falling (HL)            mb           kny
     Rising (LH)             mb           navo
     Rising-Falling (LHL)    mb           nyh          nkl
It is suggested, then, that there are in fact only five underlying melodies, and that these
occur irrespective of the segmental make-up, a further argument for separate
representations of tone and other features.
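The tone-mapping rule described above lends itself to a short sketch. The function name `map_tones` and the list-based encoding are assumptions introduced here for illustration; the behaviour follows the three clauses of the rule as summarised in the text.

```python
def map_tones(melody, n_syllables):
    """Associate a tonal melody with syllables from left to right.

    Returns one list of tones per syllable.
    """
    tones = list(melody)
    if len(tones) <= n_syllables:
        # One tone per syllable; the last tone is copied onto any
        # syllables that remain.
        spare = n_syllables - len(tones)
        return [[t] for t in tones] + [[tones[-1]] for _ in range(spare)]
    # More tones than syllables: one-to-one mapping, then the final
    # syllable receives all the remaining tones.
    head = [[t] for t in tones[:n_syllables - 1]]
    return head + [tones[n_syllables - 1:]]


rising_falling = {n: map_tones("LHL", n) for n in (1, 2, 3)}
```

For the rising-falling melody this yields a single syllable carrying L, H and L together, a disyllable carrying L and then a falling HL, and a trisyllable carrying one level tone per syllable, so the same underlying melody surfaces regardless of morpheme length.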
1.1.5.3 Floating tones
Another of Goldsmith’s (1976) arguments in favour of treating tone as an autosegment is
the existence of floating tones. Just as a tone may be left behind when a vowel is deleted,
it seems that tones may also exist independently of vowels to begin with. These are
‘floating tones’. Odden (1995: 447) demonstrates the existence of floating tones in
languages such as Ewe. In Ewe, a language with three level tones spoken in Ghana, a
word that ends in a mid tone (M) is usually lowered to a low tone, which spreads
backwards to any other mid tones, as shown for the word eto meaning ‘buffalo’ in
example 3. However, some words, such as eto meaning ‘mortar’, do not lower the final
M. This is considered to be because eto (mortar) actually ends with a floating H tone in
the underlying representation and therefore the mid tone will not lower, as it is not word
final.
(3)  Underlying               Surface
     eto (M M)                eto (L L)    (buffalo)
     eto (M M, floating H)    eto (M M)    (mortar)
Thus, the existence of floating tones also demonstrates that tone must be an autosegment,
as it can exist even when no vowel is present.
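The Ewe pattern can be sketched as a rule over the tonal tier. The sketch is illustrative only: the function `lower_final_mid` is an assumption made here, and the backward spreading is interpreted as affecting only the unbroken run of mid tones that ends the word, which is one possible reading of the description above.

```python
def lower_final_mid(tones, floating=None):
    """Surface tones for a word given its linked tones and any floating tone.

    tones holds one of "H", "M" or "L" per syllable; floating is an
    unlinked tone that follows the last syllable, or None.
    """
    surface = list(tones)
    # A floating tone means the last linked tone is not final on the
    # tonal tier, so lowering does not apply.
    if floating is not None or not surface or surface[-1] != "M":
        return surface
    i = len(surface) - 1
    while i >= 0 and surface[i] == "M":   # lower, spreading backwards
        surface[i] = "L"
        i -= 1
    return surface


buffalo = lower_final_mid(["M", "M"])
mortar = lower_final_mid(["M", "M"], floating="H")
```

The two words have identical linked tones, yet `buffalo` surfaces as low-low and `mortar` remains mid-mid. The only difference between the inputs is a tone that has no vowel of its own.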
1.1.5.4 Contour tones
As we have seen, tone languages might have level tones (such as high or low) or contour
tones (such as rising and falling), where the pitch changes in the course of a tone-bearing
unit. Goldsmith (1976) discusses the case of Igbo. In Igbo the tone that occurs in the
verb stem is determined by the form of the clause. In the “I Main” form of a clause the
verb is low toned. When the subject noun phrase would, in isolation, end in a low tone,
the tones remain unchanged, as shown in example 4 taken from Goldsmith (1976: 33),
where the tone on the second vowel of Eze is low.
(4)
Ézè (the chief)
Ézè cì àkhwá (the chief was carrying eggs)
If, however, the subject noun phrase ends in a high tone when spoken in isolation, when it
is placed before the low toned verb the situation is different. Instead of retaining the high
tone it would have in isolation, the tone changes to a falling tone, as shown in example 5.
(5)
Ékwé (a name)
Ékwê cì àkhwá (Ekwe was carrying eggs)
This situation is considered to demonstrate that in this language, contour tones are, in
reality, sequences or combinations of level tones. It is this sequencing aspect of contour
tones that had caused problems for SPE approaches in which tone was considered to be a
part of the feature bundle. In a word such as Ékwê in example 5, the falling contour is
realised on a single short vowel2. In an SPE approach this would be doubly difficult to
account for. Firstly, if contour tones are to be represented as sequences of level tones, the
vowel would need to have two specifications for the values [high] and [low]. One set of
values would represent the beginning of the falling contour, which would be [+high], [-low]. The other set of values would represent the end of the falling contour and would
have the opposite values [-high], [+low]. These two sets of values were problematic as all
other features in the matrix had only a single value for any one segment, and therefore the
rules would have no way of stating how the double-valued feature should be interpreted.
2 The concentration here is on short vowels, as long vowels could conceivably be represented with two
feature matrices, with each matrix specified differently for tone.
The second problem was that, even if two specifications are allowed for [high] and [low],
there is no way to ensure that they would occur in the right order and create a falling
rather than a rising tone. As the feature matrix is inherently an unordered bundle of
distinctive features, ordering the features in the vertical dimension so that the
specification for the high tone comes above that for the low tone has no effect on the
temporal order in production.
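The ordering problem can be illustrated by treating the feature matrix as an unordered set and the tonal tier as an ordered sequence. The encoding below is an assumption made for illustration and is not a claim about how any particular SPE analysis was formalised.

```python
# A feature matrix is an unordered bundle, modelled here as a set of
# (feature, value) pairs. Both contours need both sets of values.
falling_bundle = frozenset(
    {("high", "+"), ("low", "-"), ("high", "-"), ("low", "+")}
)
rising_bundle = frozenset(
    {("high", "-"), ("low", "+"), ("high", "+"), ("low", "-")}
)

# A tonal tier is an ordered sequence of level tones.
falling_tier = ("H", "L")
rising_tier = ("L", "H")

bundles_distinct = falling_bundle != rising_bundle
tiers_distinct = falling_tier != rising_tier
```

The two bundles compare as equal, so `bundles_distinct` is false: writing the values in a different order does not create a different set. The two tiers differ, because order is part of what a sequence represents.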
The issues arising from contour tones were therefore a severely complicating factor in the
SPE approach, where the specification of tone was considered to be an integral part of the
distinctive feature specification. In an autosegmental representation where the tones and
vowels are represented on different tiers, the change of tone at the end of the noun phrase
in example 5 can be seen as anticipatory assimilation. As the one-to-one relationship
between features and phonemes inherent in the SPE approach had been removed, the
vowel can be linked to, or associated with, more than one tone at a time. In this
formulation, then, the contour tone on the final vowel of Ekwe results from the final
vowel being associated with two level tones, a high followed by a low, as shown in
example 6.
(6)  E    kwe    ci    a    khwa
     |    |    \ |     |    |
     H    H      L     L    H
1.2 Tones, targets and turning points
So, AM approaches to intonation owe much to the development of autosegmental
phonology in general, including not only the separation of intonation and text onto
different tiers, but also the representation of intonation in terms of level tones or pitch
targets. When dealing with AM descriptions of intonation it is important to distinguish
between the notions of tone, pitch target and turning point. The main distinction is one
based on different levels of analysis. Tones are the phonologically specified elements of
the contour such as H and L, which might either form part of a pitch accent and be
associated with stressed syllables, or be associated with a prosodic boundary as a
boundary or edge tone. At the phonetic level, tones are implemented as intonational
targets, which speakers aim to produce at a particular F0 level at a particular time. As
there is a distinction between the phonological tone and its phonetic implementation, a
single tone may be realised in a range of different ways. Turning points in the contour are
the most explicit realisation of a pitch target. In Bruce (1977), for example, tones are
implemented phonetically as pitch targets, and are realised as the maximally high or low
parts of the contour; turning points where the contour changes from rising to falling or
vice versa. So, a maximum in the contour is a H pitch target, and a minimum is a L pitch
target.
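The notion of a turning point can be made explicit with a simple detector over a sampled contour. This is a deliberately naive sketch: the function `turning_points` and the sample values are assumptions introduced here, and real F0 tracks would need smoothing and a treatment of voiceless stretches.

```python
def turning_points(f0):
    """Return (index, label) for each strict local maximum or minimum.

    A maximum is labelled "H" and a minimum "L", following the
    identification of turning points with targets described above.
    """
    points = []
    for i in range(1, len(f0) - 1):
        if f0[i - 1] < f0[i] > f0[i + 1]:
            points.append((i, "H"))
        elif f0[i - 1] > f0[i] < f0[i + 1]:
            points.append((i, "L"))
    return points


# Hypothetical contours in Hz, sampled at equal intervals.
peaked = [100, 120, 140, 120, 100]
plateau = [100, 140, 140, 140, 100]

peak_found = turning_points(peaked)
plateau_found = turning_points(plateau)
```

The peaked contour yields one H turning point at the third sample. The plateau yields none under this strict definition, since no single sample is higher than both its neighbours, which illustrates why plateaux make the location of the target hard to determine.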
In Pierrehumbert (1980), however, as Ladd (1996: 103) describes, the close association
between tones, targets and turning points, present in Bruce (1977), has been lost. This
results in a rather unclear definition of tone and is considered by Ladd (1996) to be one of
the unresolved problems with AM analyses of intonation. There are some cases in the
theory, for example, where a turning point in the contour is not considered to be either a
phonological tone or a phonetic pitch target. There are other cases where a tone in the
phonological description of an intonation pattern is not realised as a turning point in the
contour and is sometimes not even meant to represent a pitch target. Both these cases are
discussed below.
1.2.1 Turning points that are not tones or targets
Firstly, then, it is sometimes the case that a turning point in the contour is not identified
with a tone in the phonological description and is not considered to be a pitch target. For
example, although, in general, interpolations between targets are considered to be fitted to
straight lines, this is not always the case. As mentioned above in section 1.1.4, the transition
between two H* accents is said to be ‘dipping’ (Pierrehumbert 1980: 70). This dipping
transition is illustrated in Pierrehumbert’s figures 2.10 (1980: 281) and 2.11 (1980: 282),
and is redrawn below in Figure 1.4. Pierrehumbert (1980: 70) acknowledges that having
different interpolation rules for different tones is a problem, and states that she has tried to
model this type of contour by positing a L tone between the two H*s. However, this
would be problematic for other aspects of the theory since the postulated L could only be
trailing or leading (since it occurs on a metrically weak syllable), but both H*+L¯ and
L¯+H* have already been used to account for downstep (as will be discussed in the next
section).
Therefore, the low turning point is not expressed as a low tone in the
phonological description and is not considered to be a low target, but is merely a special
type of transition.
[Figure: contour with two peaks, each labelled H*, and a dip between them]
Figure 1.4 A dipping transition between two H* accents (redrawn from Pierrehumbert 1980: 282)
1.2.2 Tones that are not targets or turning points
The opposite situation is also found when a tone in the phonological specification is not
considered to be a target and does not appear as a turning point in the contour. In strings
of L tones, for example, the individual tones do not appear as turning points. For
example, Pierrehumbert’s figures 2.14 (1980: 282) and 2.15 (1980: 283) show that in
sequences such as L* L- L% and L* L* L-, the contour is fairly flat, and each
independently specified tone does not appear as a turning point. The low boundary tone
does not relate to a low pitch target but rather signals the absence of the rise that would be
present if the boundary were specified as H% (Ladd 1996: 103). As can be seen in
Pierrehumbert’s Figure 4.2 (1980: 329), redrawn below in Figure 1.5, the contour remains
flat between the low phrase accent and boundary tone. As Pierrehumbert (1980: 72)
states, “cases like this are a major source of ambiguity in the intonational system; in cases
where the L tones are on different levels, the location of accents is not readily recovered
from the F0 contour”.
A further complication involves the downstep rule. Pierrehumbert (1980) suggests that
four pitch accents can trigger downstep of a following H. These are H¯+L*, H*+L¯,
L*+H¯ and L¯+H*. As Pierrehumbert (1980: 152) states, three of these accents “occur
transparently in the contour”. For H*+L¯, however, the “unstarred tone of the pitch
accent does not show up in the same obvious way” (1980: 152). As can be seen below in
Figure 1.5, there seems to be a smooth transition from one accented syllable to the next,
and the L¯ does not appear as a turning point in the contour. Again, the L is not
considered to be a pitch target, but merely triggers downstep on the following H*.
Pierrehumbert (1980) explains that the reason for using H*+L¯ as one of the set that
triggers downstep is that it is the only pitch accent left in the inventory that has not yet
been used for another purpose.
Figure 1.5 Downstepping H*+L¯ accents, with the tone sequence H*+L¯ H*+L¯ H* L¯ L%
(redrawn from Pierrehumbert 1980: 329)
1.2.3 Tones that are targets but not turning points
The lack of a clear definition for tones is not peculiar to the Pierrehumbert approach. It is
a problem for AM approaches in general and still has not been resolved. In addition to
the problems discussed above, where tone, target and turning point are not associated in a
one-to-one fashion, there is a further problem. It is often the case that researchers
agree that there is a tone in the phonological description, and that this tone is
implemented as a tonal target. However, the realisation of the tonal target may often not
be a single turning point, making it difficult to decide with any precision which part of the
contour should be labelled as the target. This difficulty in locating targets is especially
problematic as the current focus in the discipline is on target alignment (the precise
timing relationship between a target and some tone-bearing unit), and therefore
researchers wish to identify with some precision the location of targets in the contour.
Nevertheless, they are often frustrated in their attempts to find a single turning point with
which they can identify a pitch target. The following quotations illustrate some of the
problems they face.
(1)
“The primary difficulty in using the F0 contour as the phonetic
representation of intonation is the extent to which it is affected by the
speech segments…Such effects on F0 have been the object of much
study in their own right…However, from the point of view of
characterising the intonation system synchronically, they are a source
of noise which must be factored out…it is often difficult to determine
precisely the location of tones with respect to the text”
Pierrehumbert (1980: 14)
(2)
“in order to abstract away from a surface F0 contour that is
unavoidably polluted by segmental influences and voicing
irregularities we argue that one should consider models that are able to
dissociate these influences and irregularities from the primary features
of interest…”
van Santen (2002: 107)
(3)
“The time point for the F0 maximum is probably the least reliable of
measurements. Segmental effects from the nasal and irregularities in
the pitch made its location rather uncertain in many cases. Where
there were two alternative locations that could be construed to be the
relevant F0 peak, we took the average between the two.”
Silverman and Pierrehumbert (1990: 79)
(4)
“In identifying peaks (H) and troughs (L) in the F0 contour itself, in
many cases the physical evidence is not clear cut. For example, the
high or low value may be sustained over a period of time, and analysts
must make a principled decision: should they choose a point at one or
other edge of this sustained frequency, or select a mid-point?”
House and Wichmann (1996: 3)
(5)
“there are cases in which only stipulative decisions can be made
regarding tonal target position, either because a single maximum (or
minimum) cannot be discerned (as in plateaus …), or where a
perturbation is created by some consonant, or because the target is
masked by the presence of a voiceless sound”
D’Imperio (2002a: 103)
There are several points that arise from these quotations. Quotations 1, 2, 3 and 5 all note
that segmental effects on the contour can make it difficult to tell where targets are located.
For example, voiceless segments may disguise the location of a target, and certain
consonants may create perturbations that make the F0 rise or fall. Quotations 1 and 2
both regard the effects of segmental factors as mere noise that should be removed if
generalisations are to be drawn from the data. Quotation 3 demonstrates the type of
solution researchers resort to when they are unsure where a target is or which part of the
contour is to be identified with the target. Although these solutions are principled ones,
different researchers use different methods, and the solutions are generally arbitrary and
not based on any external evidence that might suggest which point is likely to be a
speaker’s target.
Quotations 4 and 5 illustrate a different but related problem: how can a pitch target be
identified when the F0 stays level for some time? If there is no clear peak or trough in the
contour, how can the exact location of the target be specified? Should the target be
identified with a particular part of the level section (e.g. the beginning, middle or end) or
with the whole section of contour at that level? It is important to note that this problem
does not only apply when plateaux are completely flat. Even if there is a point that is
slightly higher than the surrounding contour, it is not clear that the listener’s ear will be
sensitive enough to tell that the pitch has changed.
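The three labelling conventions raised in these questions can be made concrete with a short sketch. It is illustrative only: the function name target_time, the convention labels and the example times are hypothetical, and no claim is made that any of the studies cited used this procedure.

```python
def target_time(plateau_start, plateau_end, convention="end"):
    """Return the time (in ms) labelled as the target for a level stretch.

    convention is one of "start", "mid" or "end" (hypothetical labels for
    the beginning, middle and end of the plateau).
    """
    if plateau_end < plateau_start:
        raise ValueError("plateau_end must not precede plateau_start")
    if convention == "start":
        return plateau_start
    if convention == "mid":
        return (plateau_start + plateau_end) / 2
    if convention == "end":
        return plateau_end
    raise ValueError("unknown convention: " + repr(convention))


# The same hypothetical plateau (100 ms to 160 ms) under each convention.
labels = {c: target_time(100.0, 160.0, c) for c in ("start", "mid", "end")}
```

For a plateau of this length the choice of convention alone moves the labelled target by up to 60 ms, which is large relative to the alignment differences that such studies typically set out to measure.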
This dissertation will focus on situations where there is a plateau in the contour. In these
situations, it is clear that there is a tone in the phonological description and that this tone
is implemented as a pitch target. However, the realisation as a level stretch of a particular
pitch makes it unclear exactly where the speaker’s target is located.
1.3 Previous studies of intonational plateaux
There are a few cases in the literature where intonational plateaux themselves are the
objects of study, rather than just being noted as an interesting point or presenting a
problem to overcome. These cases will be discussed below, in relation to the individual
languages under study. It is important to make clear at this point that these plateaux are
those found as the phonetic implementation of a single phonological tone, where the pitch
target is realised as a plateau rather than the well-defined peak or trough that a strong
version of AM theory would predict. These plateaux are different in kind from those
found in rise-plateau-slump configurations (see e.g. Cruttenden 1997: 133-136) in
varieties of Urban Northern British. In rise-plateau-slump configurations, the plateau is
sustained over several unaccented syllables, whereas the focus in the studies below is on
plateaux found in a single accented syllable, as the realisation of a single pitch target.
1.3.1 Tokyo Japanese
Pierrehumbert and Beckman (1988) discuss a case of plateaux found as low boundary
tones in Tokyo Japanese. Traditional accounts of this dialect suggest that the accentual
phrase starts with a L tone and is followed by a H tone. The rise is considered to mark
the boundary. However, in some cases, such as when the first syllable is accented or long
(containing two sonorant morae), the L tone is usually assumed to be absent.
Nevertheless, when Pierrehumbert and Beckman (1988) examine utterances with long
and accented initial syllables they find that there is still a rise, as shown in the left panel
of Figure 1.6 when the accented word uni is initial. The main difference between that
contour and the one for the unaccented word ume, shown in the right panel of Figure 1.6,
is in the timing and the duration of the low tone. In the unaccented case, the low tone is
lower and appears to be sustained, forming a plateau.
Figure 1.6 F0 contours for an accented (left panel, labelled a mai u ni) and unaccented (right
panel, labelled a mai u me) first syllable (redrawn from Pierrehumbert and Beckman 1988: 27)
Pierrehumbert and Beckman (1988: 14) term this difference in the realisation ‘L-% tone
allophony’ which they believe is governed by the weight and accentuation of the first
syllable of the following phrase. They suggest that the differences may be due to
different associations to the prosodic hierarchy. When the initial mora of a phrase is
accented (i.e. bears a H tone) the L is associated with the boundary and is realised as a
single low turning point; this is termed the ‘weak allophone’. However, when the initial
syllable is unaccented and the first mora does not bear a H tone, the L tone moves from
the boundary and becomes associated with this mora; this is the ‘strong allophone’, and
the L is realised with a longer duration and a lower F0 than the weak allophone.
It seems that the strong allophone is considered to be the normal case that occurs if other
tones do not intrude on the associated mora. Pierrehumbert and Beckman (1988: 28) state
that in the accented case the L tone “cannot have the extended duration that it has when it
alone is associated with the first syllable”. They further suggest that these differences are
due to the signalling of prominence relations; “by lengthening its duration and by
lowering its pitch relative to that of adjacent H tones, a L tone can be made prosodically
stronger” (1988: 29).
In this account, then, the plateau is seen as a normal realisation of a low target and its
presence is conditioned by prominence relations within the utterance. Unfortunately we
know little about the detailed phonetic realisation of each allophone. We are told that the
strong allophone is 10 Hz lower than the weak allophone (1988: 29), but we are not told
how long the strong allophone is or if the difference between the two allophones is
gradient or categorical.
1.3.2 British English
1.3.2.1 Wichmann et al. (1999)
Wichmann et al. (1999) find plateaux when studying the effects of discourse structure on
peak timing in English. The stimuli consisted of texts where the same word occurred in
three different positions: sentence final (nuclear), paragraph initial (the first accented
syllable in a paragraph) and sentence initial (the first accented syllable in a paragraph
medial sentence). Findings suggested that peaks are later in the more initial positions
(p.1766). However, in addition to peaks, plateaux were also found for some speakers in
topic-initial position (p.1768) as shown below in Figure 1.7.
Figure 1.7 A plateau in paragraph initial position (upper panel) and a peak in sentence initial position
(lower panel) (taken from Wichmann et al. 1999: 1767)
No detail is given about the duration or other aspects of these plateaux, but the authors do
suggest two reasons for the occurrence of the plateaux. One is that the plateau itself may
be a device for signalling structure. In this version the realisation of the high target as a
plateau would itself signal initiality. This explanation sees the whole plateau as the target
and sees the target’s realisation as influenced by discourse factors. The second
explanation is that the plateau may be an alternative way of delaying the peak. The
speaker separates the rise and fall by a plateau, thus delaying the peak without delaying
the rise. This explanation, then, sees the end of the plateau as the speaker’s target, whilst
the plateau itself is merely a way of creating peak delay without delaying the preceding
rise. The diagrams given to illustrate the plateau (p.1767, and Figure 1.7 above) suggest,
however, that the end of the rise may be delayed as well. If this were the case for all the
plateaux found then this second explanation would seem unlikely.
1.3.2.2 ProSynth
The ProSynth³ group conducted the largest-scale study of intonational plateaux. Their
work was motivated by a desire to produce natural-sounding synthetic speech that would
be robust in difficult listening conditions. Their work on intonation focuses on that of a
single male speaker of Southern British English. The database consists of 467 utterances,
all with a falling (H*L) nuclear accent. Utterances have either one or two syllables in the
final foot and the accented syllable varies in terms of onset and coda type. House et al.
(1999b) present an account of the procedures used to model the intonation, whilst Ogden
et al. (2000) present the results of several perceptual experiments.
1.3.2.2.1 House et al. (1999a and 1999b)
House et al. (1999b) describe how they began their attempt at modelling intonation by
reducing the contour to a small number of turning points. They found that the high tone
of nuclear accents could not be successfully modelled using a single turning point
representing a peak, but that instead two points marking the beginning (PON) and end
(POF) of a plateau were needed (p.2344), as shown in Figure 1.8. In order to derive these
turning points automatically, an algorithm was implemented to identify the
points in the contour where the pitch was within 4% of the absolute peak value. This 4%
range was motivated by studies cited in Rosen and Fourcin (1986) (which will be
discussed further in Chapter 2) that suggest that 4% approximates the range of perceptual
equality, so that everything within this range should sound of an equal pitch to the
listener.
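The thresholding step just described can be sketched in code. The following is a minimal illustration only, not the ProSynth implementation: the function name plateau_bounds, the representation of the contour as parallel lists of sample times and F0 values in Hz, and the decision to take the contiguous stretch around the maximum are all assumptions made here for the purposes of exposition.

```python
def plateau_bounds(times, f0, tolerance=0.04):
    """Return (PON, POF) for a sampled F0 contour.

    PON and POF are the first and last sample times of the contiguous
    stretch around the F0 maximum whose values lie within `tolerance`
    (a proportion, 0.04 = 4%) of that maximum.
    """
    if len(times) != len(f0) or len(f0) == 0:
        raise ValueError("times and f0 must be non-empty and of equal length")

    # Locate the absolute peak (first occurrence if there are ties).
    peak_index = max(range(len(f0)), key=lambda i: f0[i])
    threshold = f0[peak_index] * (1.0 - tolerance)

    # Extend leftwards from the peak while values stay within the range.
    start = peak_index
    while start > 0 and f0[start - 1] >= threshold:
        start -= 1

    # Extend rightwards from the peak in the same way.
    end = peak_index
    while end < len(f0) - 1 and f0[end + 1] >= threshold:
        end += 1

    return times[start], times[end]


# Hypothetical contour sampled every 10 ms (values in Hz).
example_times = [0, 10, 20, 30, 40, 50, 60]
example_f0 = [150, 180, 196, 200, 198, 193, 160]
pon, pof = plateau_bounds(example_times, example_f0)
```

With this hypothetical contour the peak is 200 Hz, so the threshold is 192 Hz, and the function returns the times of the third and sixth samples as PON and POF.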
Figure 1.8 A plateau in the contour and the turning points identified for use in speech synthesis
(taken from House et al. 1999b: 3)
³ “An integrated prosodic approach to device-independent, natural-sounding speech synthesis”. EPSRC
grant numbers: GR/L53069, GR/L51829 and GR/L52109
A statistical analysis of the whole database revealed that both the beginning and the end
of the plateau covary systematically with linguistic structure (pp.2344-2345). Both the
onset (PON) and the offset (POF) of the plateau are found to occur later in feet with two
syllables than in feet with one syllable. Segmental effects also influence the timing of the
turning points. In feet with only one syllable, PON is earlier with sonorant onsets than
other types, and POF is later with voiceless than with sonorant onsets. In feet with two
syllables, PON is later when the onset of the syllable is voiced, whereas POF is later with
empty onsets and earlier with voiceless codas.
House et al. (1999b) make two main claims about the status of the plateau. One is that
the plateau might be an alternative phonetic implementation of H*L resulting from
systematic structural differences (p.2346). Although the statistical analysis of the
database treated all high targets as plateaux, the visual analysis suggests that plateaux are
found more often in two-syllable feet with voiced obstruent onsets and more variably
with sonorant onsets. In one-syllable feet the simple peak was more common but again
plateaux were sometimes found in syllables with sonorant onsets (p.2344).
The second suggestion (p.2346) is that the end of the plateau (POF) may be the speaker’s
real target, as it is less prone to microprosodic effects and less likely to be affected by
interpolation from turning points in the previous accent than PON. There is also a
suggestion that the alignment of POF may create a different semantic effect, as a later
alignment of this point within the syllable is described as giving an impression of greater
finality.
1.3.2.2.2 Ogden et al. (2000)
In Ogden et al. (2000) the effectiveness of the modelling is tested by means of a number
of perceptual tests, two of which focus on intonation. In one experiment, the effect of
correct modelling on the impression of naturalness or neutralness is investigated. In this
experiment, utterances are synthesised which have a monosyllabic final foot. For each
utterance, POF and LON (the turning point representing the low elbow in the contour) are
aligned either correctly (for a monosyllabic foot) or incorrectly (for a polysyllabic foot).
Exact alignment is based on onset and coda type and is modelled in relation to the vowel.
Pairs of segmentally identical utterances were created where intonation was correct in one
member of the pair and incorrect in the other. Eleven subjects pressed buttons to indicate
which version in each pair sounded more neutral. Seventy-eight percent of responses
favoured correct alignment and this was found to be significantly more than would be
predicted by chance (p.203). One subject, however, was given different instructions and
asked to indicate not which version sounded more neutral, but which he preferred. This
subject preferred the version with the incorrect alignment 89% of the time, indicating that
instructions have an important effect on subjects’ responses (p.204).
A second experiment was designed to see if the correct modelling of intonation facilitated
listeners’ understanding of the linguistic message. Subjects read a story and were then
given written questions, the answers to which were utterances synthesised with either the
correct or incorrect alignment of POF and LON. They had to indicate if the synthesised
answer was true or false, and reaction times and errors were measured. There was a trend
for reaction times to be quicker with correct alignment, but this trend was not significant
(p.204). Error data are not reported. It is suggested (p.206) that there are problems with
the experiment. The questions differ in difficulty, there is a high cognitive load and the
subjects were not forced to respond fast, so seem to have very variable reaction times and
not to make many errors.
It seems that the alignment of the end of the plateau or the duration of the plateau may
create different perceptual effects. It is noted that a later POF in the stimuli creates a
longer plateau, which is perceived as a high-fall, as opposed to the low-fall perceived
when the plateau is shorter. It is noted that the high-fall is usually associated with
emphasis. It is not possible to tell whether the difference in perception is created by the
alignment, or the duration of the plateau, or a combination of the two.
The results from the ProSynth group demonstrate that the phenomenon of the plateau is
subject to influences from linguistic structure, and that the alignment of the end point (in
combination with that of the low tone) may affect perceptions of naturalness and might
possibly affect comprehension.
1.3.3 Neapolitan Italian
D’Imperio (2002a) explicitly addresses the question of what the tonal target really is
inside a plateau, an example of which is shown in Figure 1.9. Data are reported from
D’Imperio et al. (2000), who use the timing differences between statements and questions
in Neapolitan Italian as a test-bed to determine the perceived location of tonal targets. In
Neapolitan Italian both questions and (narrow-focus) statements are realised by rise-falls,
the peaks of which are aligned around 40 ms later in questions (D’Imperio 2002a: 103).
D’Imperio and House (1997) and D’Imperio et al. (2000) show that in perceptual
experiments where the peak of a rise-fall configuration is shifted later in time, subjects
identify earlier peaks as statements and later peaks as questions.
Figure 1.9 An example of a plateau in a H+L* accent in Neapolitan Italian (taken from D'Imperio
2002a: 101)
This situation was used to test what would happen if plateau stimuli were used instead of
peak stimuli. D’Imperio et al. (2000) conducted the same perceptual experiment again
but with a 45 ms plateau spreading rightwards from the peak. The argument is that if the
real tonal target is the beginning of the plateau then there should be no difference between
responses to plateau and peak stimuli (as the plateau is extended forwards in time). The
different stimuli are shown below in Figure 1.10. However, when the results from
plateau stimuli are compared to those from peak stimuli, it is clear that they are actually
very different. Specifically, plateau stimuli elicit more question responses at each timing
point. For example, at the second time step, plateau stimuli gained 75% question
responses, whilst peak stimuli gained only around 10%. This result was taken to indicate
that the perceived target was not located at the beginning of the plateau.
An alternative hypothesis was that the perceived target might be at the end of the plateau.
This hypothesis was tested by comparing the results from the plateau experiment to the
results from the peak stimulus corresponding to the plateau offset (that is, the peak
stimulus three time steps later than the peak stimulus timed at plateau onset). Here, the
differences between peak and plateau stimuli were much smaller. For example, at the
second time step, 82% of responses to the peak stimuli favoured the question interpretation,
compared to 75% for plateau stimuli. This result leads to the suggestion that the perceived tonal
target is located near to but not exactly (as the results were not identical) at the end of a
plateau.
Figure 1.10 Plateau stimuli and peak stimuli timed at plateau onset (time step 1) and offset
(time step 4) (adapted from D'Imperio 2002a: 104)
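The timing relationship between the plateau stimuli and the two sets of peak stimuli can be set out as a short sketch. It is a reconstruction under stated assumptions and not the stimulus-generation procedure of D’Imperio et al. (2000): the 15 ms step size is inferred here from the 45 ms plateau spanning three time steps, times are expressed relative to the peak of the first stimulus, and all names are hypothetical.

```python
PLATEAU_MS = 45.0  # plateau duration reported in the text
STEP_MS = 15.0     # assumed: 45 ms plateau divided by three time steps


def peak_time(step):
    """Peak time (ms) of the peak stimulus at a 1-based time step,
    relative to the peak of the stimulus at time step 1."""
    return (step - 1) * STEP_MS


def plateau_span(step):
    """Onset and offset (ms) of the plateau stimulus at a time step.
    The plateau starts where the peak would be and spreads rightwards."""
    onset = peak_time(step)
    return onset, onset + PLATEAU_MS


def offset_matched_peak_step(step):
    """Time step of the peak stimulus whose peak falls at the offset
    of the plateau stimulus at the given time step."""
    return step + round(PLATEAU_MS / STEP_MS)


onset, offset = plateau_span(1)
matched_step = offset_matched_peak_step(1)
```

On these assumptions the plateau at time step 1 spans 0 to 45 ms, and its offset coincides with the peak of the stimulus at time step 4, as in Figure 1.10. The onset hypothesis compares plateau responses at a given step with peak responses at the same step, whereas the offset hypothesis compares them with peak responses three steps later.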
1.3.4 Summary and discussion of previous work on plateaux
The plateau has been the subject of only four main investigations, all with different aims,
methods and conclusions. The Japanese plateaux of Pierrehumbert and Beckman (1988)
are unusual in that they are low tones that are in boundary position. In this case, the
plateau is seen as being an alternative realisation of a low tone with realisation
conditioned by prominence relations within the utterance. The Wichmann et al. (1999)
study of discourse variables in British English finds plateaux in paragraph initial
(prenuclear) position. It is again suggested that the plateau could be an alternative
realisation of a target alternating with a single high turning point with realisation
constrained by discourse factors. Alternatively, it is suggested that the end of the plateau
might be the speaker’s real target. House et al. (1999a and 1999b) find high plateaux in
nuclear position that covary with onset, coda and foot type. Again it is suggested that the
plateau may be an alternative realisation of a peak, constrained by structural factors, but it
is also suggested that the end of the plateau may be a more suitable target due to fewer
microprosodic influences. This fact is reflected in the perceptual work of Ogden et al.
(2000), who model differences in foot structure in relation to the end of the plateau (and
low tone) and find that there is an effect on perception of naturalness and possibly also on
processing. D’Imperio et al. (2000) and D’Imperio (2002a) are the only studies to
explicitly address the question of tonal targets in plateaux. In Neapolitan Italian it
appears that a point near to the end of the plateau is likely to be the perceived tonal target.
Thus, there are two basic views. One is that the whole plateau is an alternative realisation
of a simple turning point conditioned by various discourse, prominence or structural
factors. In this view the plateau is an intentional device on the part of the speaker to
signal something meaningful about the utterance by means of contrast to a simple peak.
The alternative view suggests that the end of the plateau is the speaker’s real target and is
in some sense equivalent to a simple peak. In these cases it is unclear why a speaker
might produce a plateau rather than a peak, but the plateau itself is not linguistically
significant, only the timing of the end point.
It is interesting that the beginning of the plateau has not been postulated as the speaker’s
intended target. As House et al. (1999b: 2346) state, the beginning of the plateau is
probably more prone to microprosodic effects than the end. In addition, it seems likely
that target location is influenced by perceptual constraints. In the high target in falling
accents the end of the plateau is equivalent to the beginning of the fall (especially in the
ProSynth work where it represents the end of the range of perceptual equality). It is
likely that this is a perceptually salient part of the contour for listeners, more so than the
beginning of the plateau, at which point they still have to wait to find out what will
happen next. The only information given by the beginning of the plateau is that the pitch
is no longer rising. At the end of the plateau, however, the listener knows that pitch is no
longer level and has begun to fall.
1.4 Aims, scope and structure of this dissertation
It seems that, despite the strong predictions of descriptions based on pitch targets and
turning points, the plateau is a fairly robust phenomenon, found for both high and low
tones, for both pitch accents and boundary tones, and in at least three different languages.
Nevertheless, we still know very little, and thus the main aim of this dissertation is to
extend the set of facts known about intonational plateaux. This dissertation will focus on
high targets in nuclear falls in British English, as these have been the most extensively
studied. In addition, English is the language for which most is known about peaks,
meaning that results obtained for plateaux can be compared to previous findings. The
dissertation will aim to answer a number of questions that arise from the presence of
plateaux in the contour and from the previous four studies in the literature.
For example, House et al. (1999a and 1999b) suggest that plateaux covary with linguistic
structure but this claim is based on studies of only one speaker and in relation to only
three linguistic variables. This dissertation will begin by studying the realisations of high
targets in the productions of five speakers of Standard Southern British English, and
additional linguistic structures will also be considered.
Although Wichmann et al. (1999) have considered discourse factors, other factors known
to affect peak alignment, such as non-linguistic variables like pitch span, have not yet
been considered. The third chapter will, therefore, consider how the plateau is affected
by some of these variables, and how the effects found are related to previous findings for
peak alignment. The results given in this third chapter allow us to choose between the
possibilities that the whole plateau or one or other of its end points is the speaker’s real
pitch target.
Little is known about listeners’ use of the plateau. Although Ogden et al. (2000) have
demonstrated that the alignment of the end of the plateau can affect naturalness
judgements, it is not known if alignment affects speech processing, as the results from
Ogden et al.’s (2000) second experiment were not significant. Chapter 4 considers the
possible use of plateau alignment in spoken word recognition.
Chapter 5 returns to the issue of why speakers produce plateaux rather than simple peaks.
This chapter considers the perceptual effects of having a plateau in the contour rather than
a peak and the physiological mechanisms behind the production of plateaux. Finally, a
reason for the occurrence of plateaux is presented and discussed in relation to the
previous studies of plateaux.
Chapter 6 summarises the work presented in the other chapters and draws general
conclusions.