Sequential and Simultaneous Grouping


Can schemata overrule primitive grouping cues in speech perception?

William J. Barry, Georg Meyer and Jacques Koreman


The topic of this talk is the separation of co-occurring VC-syllables and the way in which primitives which are important for simultaneous grouping play a role in sequential grouping.

An alternative title could be the question: "Can schemata overrule primitive grouping cues in speech perception?".

This paper is the result of a collaboration between Georg Meyer from the Dept. of Communication and Neuroscience of this university and Bill Barry and myself, from the Institute of Phonetics in Saarbrücken.

This paper looks at the relationship between linguistic representations and perceptual processing known from auditory scene analysis (ASA).

After a short introduction, I will briefly mention some well-known ASA experiments, namely:

• double-vowel experiments and

• work on the function of primitives to hold speech together when whole utterances are processed in noise (where this noise could be for example utterances by other speakers)

Then I'll discuss ASA from a more linguistic viewpoint and present 2 small experiments which only begin to explore the questions which are raised.

Finally, I'll present some very tentative conclusions.


Auditory scene analysis is concerned with the ability of the observer to distinguish different objects or rather, to group signal properties belonging to the same objects when other objects are present simultaneously. Analogous to a scene in visual perception, the acoustic signal is regarded as a scene, in which different events must be associated with different objects. This view implies a fundamental distinction between properties of the acoustic signal, which are called primitives, and the related auditory "objects" which are present in the scene. These are called schemata.


Two primitives have been investigated extensively in so-called double-vowel experiments. In these experiments, 2 vowels are presented simultaneously and it is the task of the listener to identify the two vowels. The two primitives are:

• On- and offset (time-domain primitive)

• F0 and harmonicity (frequency-domain primitive)


To illustrate the function of simultaneity of vowel onset and/or offset, I have drawn a stylised spectrogram with time on the horizontal axis and frequency on the vertical axis. The blue lines indicate harmonics of the first vowel, the dotted red lines indicate harmonics of a second vowel.

As you can see, there is an exact overlap in the harmonics of the two vowels, except for a time shift between the 2 vowels. Of course, such experiments can only be carried out with synthetic stimuli. It has been shown experimentally that listeners are better able to keep the 2 vowels apart if there is a shift in their on- or offsets. It is assumed that the simultaneity of events in the signal binds these events together, so that the listener hypothesises they belong to one and the same schema.


Just to illustrate the effect of F0 and harmonicity (since it is used in the second experiment I'll present), the next slide shows a stylised spectrogram with two vowels at different F0's. Since the vowels are "spoken" at different F0's, their harmonics, which are multiples of F0, are at different frequencies. The lowest blue line indicates F0 of the first vowel; all other blue lines, which are equidistant, are harmonics. The other vowel has a higher fundamental frequency (as indicated by the lowest red line, which is above the lowest blue line) and its harmonics have a wider spacing (being multiples of this relatively high F0). It has been shown experimentally that listeners can use the patterns of equidistant harmonics to relate them to two vowels. If two vowels are spoken at different F0's, it is easier for listeners to distinguish them.
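To make the harmonic spacing concrete, here is a minimal sketch that lists the first few harmonics of two vowels. The F0 values of 120 Hz and 140 Hz are the ones used in the second experiment; the function name and the number of harmonics shown are chosen purely for illustration.

```python
def harmonics(f0_hz, count):
    """Return the first `count` harmonic frequencies (integer multiples of F0)."""
    return [f0_hz * k for k in range(1, count + 1)]


# F0 values taken from the second experiment; 5 harmonics is an arbitrary choice.
low = harmonics(120, 5)
high = harmonics(140, 5)

# The distance between neighbouring harmonics equals F0 itself,
# so the vowel with the higher F0 has the wider spacing.
low_spacing = {b - a for a, b in zip(low, low[1:])}
high_spacing = {b - a for a, b in zip(high, high[1:])}

print(low, low_spacing)
print(high, high_spacing)
```

The two harmonic series only coincide at common multiples of the two F0's, which is what gives the listener two separable patterns.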


In work on whole utterances in automatic ASA (i.e. automatic speech recognition of signals from a mixed set of signals), these primitives have been applied effectively. Continuity of the melodic contour, for instance, (i.e. continuity of F0 and its harmonics) can be used to select related signal properties from a mixed signal.

Still, it is not very clear what exactly the primitives and schemata are. Continuity, for example, should not be confused with an uninterrupted melodic contour, since in a normal speech signal voiced, and therefore melodic, elements are interleaved with voiceless, and therefore inherently non-melodic elements. Continuity should therefore be considered to play a role at a higher level: there should be melodic continuity between syllables, in the same way as a melody in music implies a more abstract continuity in the sequence of musical notes.

The question of what continuity means has, to our knowledge, not been asked at the lower level of speech sound sequences. Simultaneous isolated vowels are not very typical of speech, since utterances normally consist of syllables with concatenated vowels and consonants, with continuous transitions between them. How simultaneous syllables are segregated, or how each vowel is perceptually linked to the correct consonant, is unclear. This paper is a first attempt to address this question. We have asked listeners to identify double VC-syllables.

The theoretical status of the terms primitive and schema in relation to linguistics, which is concerned with the same phenomena, is unclear.


Primitives may be comparable to what are called acoustic cues in phonetic theories of speech perception, since they trigger the perception of phonological distinctions between sounds.


What schemata the acoustic cues trigger is less clear. I don't think it has been made explicit in ASA what the schemata actually are. In phonetic theory, they could be syllables which are perceived on the basis of the acoustic cues....


or they could be phonological features like [nasal] or [alveolar], which are the smallest properties which can distinguish sounds. The experiments which are reported here serve to examine these alternatives.


So far, we have presented two alternative definitions of schemata.

The primitives (acoustic cues) present in the two synthesised VC-syllables are available in the signal simultaneously and trigger schemata. These schemata could be phonological features, which consequently combine to vowel and consonant schemata to trigger the syllable percept.

Please note that at the "primitive" level the VC-transitions for the two synthesised syllables are different.


Although the vowel-to-/l/ and vowel-to-/n/ transitions trigger the same [alveolar] place-of-articulation schema, the two transitions are different because they have different starting points as well as slightly different endpoints. The starting points depend on the formants of the vowel.

So the place primitives are different for the two VC-syllables, while they trigger the same place schema.


It is also possible that the schemata represent syllables, in which case no intermediate phonological features are hypothesised. The continuous acoustic cues in the vowel, the transition into the following consonant and in the consonant itself would trigger a syllable schema.

We shall show in the first experiment that different predictions fall out from the two perception schemes.


In the first experiment, which addresses the question whether and how acoustic continuity cues are used to link up speech segments, listeners were presented with pairs of simultaneous VC syllables, where the vowels were different and were selected from German /a:, e:, o:/ and the consonants, which were also different, were /l/ and /n/. All combinations of different vowels and consonants were presented.
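The stimulus set can be enumerated with a short sketch. It assumes that the order of the two syllables within a pair does not matter (they are simultaneous); the names used here are illustrative and are not taken from the original stimulus preparation.

```python
from itertools import combinations, permutations

VOWELS = ["a:", "e:", "o:"]
CONSONANTS = ["l", "n"]


def double_syllable_pairs():
    """List all pairs of VC-syllables with different vowels and different consonants.

    Assumption: a pair is unordered, so each vowel pair is taken once and
    the two consonants are assigned to it in both possible ways.
    """
    pairs = []
    for v1, v2 in combinations(VOWELS, 2):
        for c1, c2 in permutations(CONSONANTS, 2):
            pairs.append((v1 + c1, v2 + c2))
    return pairs


pairs = double_syllable_pairs()
for pair in pairs:
    print(pair)
print(len(pairs))
```

Under this assumption there are three vowel pairs and two ways of assigning /l/ and /n/ to each, giving six distinct double syllables.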

The subjects in the double-syllable experiment were 10 German listeners who were able to identify the individual synthetic VC-syllables with more than 95% accuracy. This ensures that the listeners had no problems with the sound quality of the synthetic stimuli.


The listeners' task was to identify the pair of VC-syllables. The following acoustic cues were available to them:

• the 40-ms VC place transition with formant trajectories reflecting the movement of the articulators from the vowel target to the consonant target and

• a nasality cue starting early in the vowel. This pre-nasalisation is caused by anticipation of the following /n/, for which the velum, which is a relatively large and inert articulator, must be lowered to allow air to escape through the nose.


First a quick word about the relative importance of these two acoustic cues from a phonetic viewpoint:

• it has been shown in many perception experiments in the 1950's and 60's that the vowel transitions are very important for the perception of the place of articulation of the following consonant (or in the case of CV: of the preceding consonant). In fact, the vowel transition seems to be more important even than the consonant itself.

• If we look at speech production, we notice that pre-nasalisation varies widely between languages and even speakers. English, for example, has strong pre-nasalisation, whereas Dutch has not. Given its variability, we should not expect it to play a very important role in comparison with the place cues.


To oppose the place transition and nasality cues to each other, we created a set of what we have called "inconsistent" stimuli. These are stimuli in which the non-nasalised vowel leads into /n/, while the nasalised vowel leads into /l/. The place transitions are correct, though, so that nasality and place transitions present conflicting cues to the listener.


We shall say that the VC-syllables are identified correctly if the continuity in the place cue is used to identify the pair of VC-syllables. In the case of "inconsistent" stimuli this means that the nasality cue must be ignored.

For consistent stimuli, both the place transitions and the nasality cue can be used by the listener to come to a correct identification of the pair of VC-syllables.
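The difference between the two stimulus types can be spelled out in a small sketch. The class and function names are hypothetical and only encode the definitions given above: the "correct" response follows the place transitions, whereas a listener who relied on nasality alone would link the nasalised vowel to /n/.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Syllable:
    vowel: str
    consonant: str   # consonant the place transition leads into
    nasalised: bool  # whether the vowel carries the nasality cue


def make_pair(vowel_l, vowel_n, consistent):
    """Build a double syllable: one vowel leads into /l/, the other into /n/.

    In a consistent stimulus the vowel before /n/ is nasalised;
    in an inconsistent stimulus the vowel before /l/ is nasalised instead.
    """
    return (
        Syllable(vowel_l, "l", nasalised=not consistent),
        Syllable(vowel_n, "n", nasalised=consistent),
    )


def link_by_place(pair):
    """The 'correct' response: each vowel goes with the consonant its transition points to."""
    return {s.vowel: s.consonant for s in pair}


def link_by_nasality(pair):
    """Hypothetical listener who follows nasality only: nasalised vowel goes with /n/."""
    return {s.vowel: ("n" if s.nasalised else "l") for s in pair}


for consistent in (True, False):
    pair = make_pair("a:", "e:", consistent)
    print(consistent, link_by_place(pair) == link_by_nasality(pair))
```

The two linking strategies agree for consistent stimuli and disagree for inconsistent ones, which is what allows the experiment to separate them.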


The first question we shall try to answer is whether listeners use the continuity in the nasality cue, or any continuity cues at all, to link the vowel up with a consonant.


If they do not, we should expect no difference between their reactions to consistent and inconsistent stimuli. The pairs of simultaneous VC-syllables should be identified equally well in the two conditions.


If they do, we should expect them to link the vowel up with the correct consonant more often for consistent than for inconsistent stimuli. After all, in the case of inconsistent stimuli the nasality cue counteracts the place cue and may lead the listener down the garden path....


The results show a significant difference in the identification rate of consistent and inconsistent VC-syllable pairs.


This shows that listeners do use the nasality cue. Whether they use it as a primitive continuation cue linking the pre-nasalised vowel to /n/, or whether they map the nasality cue onto a phonetic feature [nasal] and then use that feature to link the pre-nasalised vowel to /n/, cannot be decided on the basis of this experiment.

And, so as not to raise expectations, this will remain an open question for now, since the second experiment doesn't address the question either.


A second question we shall try to answer is related to the schemata which listeners use: are they phonetic features or syllables?


If the place cues do not immediately trigger a phonetic feature schema for [alveolar], listeners can use the difference in the vowel transitions to identify the right pair of VC-syllables. Please recall that the two VC-transitions are different. Therefore, given the identification rate of the vowels, the identification rates for the VC-syllables should be above chance level.


If the place cues do immediately trigger a phonetic feature schema for [alveolar], listeners should link the vowel equally often to /n/ as to /l/. The reason is that, since at the feature level /n/ and /l/ cannot be distinguished by the place schema, both being [alveolar], the difference in the vowel transitions is no longer available to them when they combine the vowel and consonant schemata into a syllable percept. Therefore, given the identification rate of the vowels, the identification rates for the VC-syllables should be at chance level.


The vowel pairs in the consistent stimuli are identified correctly for a little over 70% of the stimuli. If the listeners cannot use the different vowel transitions because they've been mapped onto the same [alveolar] schema, the vowel should be linked up with /l/ and /n/ equally often. The bar representing VC identifications should then be half the height of the correct vowel bar. As you can see, it is only slightly higher for the consistent stimuli (in yellow) and lower (due to the effects of the inconsistent nasality cue) in the case of the inconsistent stimuli.
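The chance-level prediction is simple arithmetic, sketched below. The value 0.70 stands in for the "little over 70%" vowel-pair score mentioned above and is only an approximation; the function name is hypothetical.

```python
def chance_vc_rate(vowel_pair_rate, n_linkings=2):
    """Expected rate of correct VC pairs if vowels are linked to consonants at random.

    With two vowels and two consonants there are two possible linkings,
    only one of which is correct, so the expected rate is half the
    vowel-pair identification rate.
    """
    return vowel_pair_rate / n_linkings


# 0.70 approximates the vowel-pair score reported for the consistent stimuli.
print(chance_vc_rate(0.70))
```

A VC identification score close to this value is what the feature-schema account predicts; a score well above it would point to listeners using the differing transitions.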


We conclude therefore that the place cue is not used by listeners to link the vowel up with the correct consonant, i.e. the consonant to which the place transition points. The most likely explanation in terms of ASA terminology is that the place primitives, which are different for the two VC-syllables, have triggered the same [alveolar] place-of-articulation schema, so that it can no longer be used by the listener to differentiate the two VC-syllables.


We can conclude from the first experiment that

• listeners do use acoustic continuity cues to link speech sounds, since in the case of inconsistent stimuli, the nasality cue leads listeners to link the vowel to the wrong consonant more often.

We have no evidence that nasality is mapped onto a schema immediately, nor do we have any counter-evidence showing that it acts as a primitive.

• the place-of-articulation cues which are present in the signal are mapped onto a schema immediately. This is how we explain that listeners cannot use this cue to link the vowels up with the correct consonant. At the level where they do this, the acoustically different place transitions have triggered the same [alveolar] place schema, so that the two consonants can no longer be distinguished.


In a second experiment, double VC-syllables with one VC at 120 Hz and the other at 140 Hz were played to the listener in addition to the same-F0 stimuli that we used in the first experiment.

As before, 10 German listeners who were able to identify more than 95% of the individual VC-syllables judged the double VC-syllables.


The place and nasality cues are available to the listeners, as before. In addition, they can use the F0 cue.


F0 has two functions:

1) it serves as a well-documented simultaneous grouping cue (cf. double vowel experiments)

2) it serves as a sequential grouping cue, since it helps to link the vowel and consonant which were spoken at the same F0 because it provides a "primitive" melody and harmonicity cue to the listeners.

Note that F0 cannot trigger a schema at the level of the stimuli offered for identification. It may trigger schemata at higher levels of prosodic-phonological structuring, though.


Since F0 in the different-F0 condition helps to perceptually segregate the two VC-syllables, correct identification of the VC pairs should be expected to increase.

The question here is how the F0 cue interacts with the nasality cue, however.


Possibly the F0 cue, which has been shown to be a very important perceptual cue, increases the VC identification rates, but is not so strong as to override nasality altogether. In that case, we should expect a significant difference between the consistent and inconsistent conditions even when the 2 VC-syllables are at different F0's.


If the F0 cue is so strong that it completely overrides the inconsistent nasality cue, we should expect the significant difference between the consistent and inconsistent stimuli in the same-F0 condition to disappear in the different-F0 condition.


As the graph shows, the percentage of correct VC identifications in the same-F0 condition is higher for the consistent than for the inconsistent stimuli. This difference is significant, as it was in the first experiment.

In the different-F0 condition, there is no significant difference between the two conditions.


The results therefore show that F0, which is known to be important for simultaneous grouping, is also very important for sequential grouping.


We conclude therefore that

• F0 is a strong continuation cue which links the vowel with the consonant and

• The F0 cue is so strong that it can override the nasality cue.

Finally, some conclusions. With so little work to base them on, they're all the more tentative, but I hope they present interesting hypotheses as a basis for further work.


• We could hypothesise that primitives trigger schemata immediately, if possible. It was shown in the first experiment that this is probably true for place of articulation. If you remember, the place primitives triggered the same place-of-articulation schema and therefore could not be used to link the vowel with the right consonant.

We've pointed out that for nasality, we have evidence neither for nor against its being mapped onto a schema.


We can set up two different hypotheses to explain the results that we found in the second experiment:

• the first possibility is that the F0 cue has a greater weight than the nasality cue.

Competing auditory cues are weighted, similar to the weighting of visual cues as suggested by Larry Maloney in the very first talk of this EMPG meeting.


• another possibility is that the F0 cue segregates the signal into two streams. The nasality cue cannot be used because it is in two different streams for the inconsistent stimuli.

These hypotheses will be the subject of further investigation.
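The first hypothesis, cue weighting, can be illustrated with a toy model. Everything in it is an assumption made for illustration: the weights are invented and not estimated from our data, and the weighted-sum decision rule is just one simple way of letting a heavier cue override a lighter one.

```python
def link_decision(cues, weights):
    """Weighted vote over cues.

    Each cue votes +1 (points to the correct consonant), -1 (points to the
    wrong consonant) or 0 (carries no linking information).
    """
    score = sum(weights[name] * vote for name, vote in cues.items())
    if score > 0:
        return "correct"
    if score < 0:
        return "wrong"
    return "guess"


# Hypothetical weights: F0 counts for more than nasality.
WEIGHTS = {"f0": 2.0, "nasality": 1.0}

# Inconsistent stimuli: nasality always points to the wrong consonant.
same_f0_inconsistent = {"f0": 0, "nasality": -1}   # F0 gives no linking cue
diff_f0_inconsistent = {"f0": +1, "nasality": -1}  # F0 points to the correct consonant

print(link_decision(same_f0_inconsistent, WEIGHTS))
print(link_decision(diff_f0_inconsistent, WEIGHTS))
```

With these invented weights the misleading nasality cue wins when the two syllables share an F0 and loses when they differ in F0, which mirrors the pattern in the second experiment. The stream-segregation hypothesis would predict the same outcome, so a sketch like this cannot decide between the two.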


