An exploratory study into the intentionality of disfluency production

Oliver William Stewart

0678004

MSc Psycholinguistics

The University of Edinburgh

2007

Abstract

Much research has been conducted into the ways that listeners make use of disfluencies in a speech stream. Conversely, there has been very little work conducted on the production of disfluencies. One particularly under-researched question is whether disfluencies are an epiphenomenon of speech production problems (e.g. Oomen & Postma, 2001) or whether they are used intentionally to signal upcoming delays or to comment on some facet of the speech stream (e.g. Clark & Fox Tree, 2002). Additionally, it has been suggested that disfluencies do not act as a homogeneous group and might perform different functions within an utterance (e.g. Bortfeld et al., 2001). We perform an exploratory study which begins to investigate these claims by manipulating both Performance (epiphenomenon) and Competence (intentional usage) related factors within a controlled experimental environment. The results of this experiment are indexed by disfluency type to allow between-type comparisons.

Twenty-four participants undertook the experiment and produced a total of 1344 spontaneously generated but roughly formulaic utterances for analysis. Our main finding, that Performance factors can increase the usage of certain types of disfluency, (a) expands on previous findings that Performance factors can be manipulated experimentally, and (b) suggests that disfluencies should be treated individually rather than as a homogeneous group. We find little support for the intentional use of any form of disfluency within this experiment.

In unprepared speech there are many different kinds of spoken phenomena which frequently get collated and headed by the term ‘disfluency’. Fox Tree defines disfluencies as “phenomena that interrupt the flow of speech and do not add propositional content to an utterance. This includes long pauses, repeated words or phrases, restarted sentences and the fillers uh and um” (1995, p. 709). While this definition may be considered incomplete (more recent studies, e.g. Schnadt & Corley, 2006, have included other phenomena such as prolongations), it clearly shows that many, often independent, phenomena are being considered as one homogeneous group. Furthermore, there have been numerous attempts to quantify the rate at which disfluencies occur in normal unprepared speech. Fox Tree (1995) averaged across several studies and claimed that approximately 6% of all words uttered are, or contain, some form of disfluency. Oviatt (1995) and Bortfeld, Leon, Bloom, Schober & Brennan (2001), by contrast, investigated the circumstances under which disfluency rates varied. Two findings which are particularly important for this study (and which will be expanded upon later) are that, according to Oviatt (1995), (a) people tend to be more disfluent in dialogues (5.50-8.83 disfluencies per 100 words) than in monologues (3.60 per 100 words) and (b) disfluency rates are also much higher in speech intended for a human partner (5.50-8.83 per 100 words) than in speech intended for a machine (0.78-1.87 per 100 words).
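The rates quoted here are normalised to 100 words. As a minimal sketch of that normalisation, the calculation might look as follows; the function name and the example counts are illustrative assumptions and are not data taken from Oviatt (1995) or from this study.

```python
def disfluencies_per_100_words(n_disfluencies: int, n_words: int) -> float:
    """Return the disfluency rate normalised to 100 words."""
    if n_words <= 0:
        raise ValueError("n_words must be positive")
    return 100.0 * n_disfluencies / n_words


# Hypothetical counts, chosen only to illustrate the calculation.
rate = disfluencies_per_100_words(n_disfluencies=9, n_words=250)
print(f"{rate:.2f} disfluencies per 100 words")
```

Normalising in this way is what allows speech samples of different lengths (for example dialogues and monologues) to be compared directly.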

As can be seen from the figures above, “spontaneous human speech is notoriously disfluent” (Brennan & Schober, 2001, p. 274) and this is, in itself, one factor which makes disfluency an interesting area of language to study. Disfluency was not always regarded in this way; many language processing models (e.g. Stolcke & Shriberg, 1996) assumed that disfluencies were extraneous to the message being conveyed and so were of no use to the listener at all. As a consequence of this belief, disfluencies were considered to be filtered out of the speech stream by the listener and disregarded entirely. This view has met with much controversy and has, more recently, been thoroughly discredited, mostly on the basis of findings which indicate that listeners make some use of disfluencies. Disfluencies cannot, therefore, be filtered out of the utterance before processing takes place. We will examine a sample of this (mostly behavioural) evidence shortly. First, however, we briefly examine a more syntactic line of reasoning which also runs counter to the concept of filtering.

Ferreira and colleagues (Bailey & Ferreira, 2003; Ferreira, Lau & Bailey, 2004; Ferreira & Bailey, 2004; Lau & Ferreira, 2005) have conducted much work examining the integration of disfluencies into a grammatical structure. This work raises the issue of incremental processing: it is generally considered that the comprehension system operates incrementally, processing input as it becomes available rather than waiting until the whole utterance is obtained. Ferreira and colleagues therefore argue “it cannot be true that the parser begins to build constituents only once input has been cleansed of disfluencies” (Ferreira & Bailey, 2004, p. 232). While this argument does not necessarily hold true for disfluencies such as hesitations or filled pauses (as it could be argued that these may be filtered out in ‘real time’), it does hold for repairs that retroactively alter the syntactic structure of an utterance. This is because fairly recent evidence (e.g. Christianson, Hollingworth, Halliwell & Ferreira, 2001; Ferreira et al., 2004) has shown that participants retain certain syntactic expectations of an utterance, even after a repair has taken place which should have removed them (regardless of whether the parser was filtering disfluencies in real time or after the whole utterance was collated).

It is difficult to see, then, how a filtering account could accommodate these syntactic findings. Evidence from other lines of research supports the position of Ferreira and colleagues. Below we present a sample of behavioural (and subsequently electrophysiological) evidence which supports the above arguments against filtering.

The first piece of behavioural evidence we present is Fox Tree (2001). Fox Tree conducted a reaction time experiment in which participants were visually presented with a target word and then played an audio recording containing that word. Some of the recordings contained a filled pause which took the form of either an Uh or an Um. The participant was instructed to press a button when they heard the target word. Fox Tree tentatively concludes that the filtering account is problematic, as she found a processing speed benefit when Uh was present compared both to when there was no filled pause and to when Um was present. This she attributed to an Uh in some way heightening the listener’s attention to an upcoming word or phrase.

Similarly, Brennan & Williams (1995) conducted a study in which participants were asked to rate on a seven point scale how reliable both their own and other people’s answers to twenty general knowledge questions were. These ratings were referred to as feeling of knowing (FOK) and feeling of another’s knowing (FOAK) respectively. Brennan & Williams divided the recorded answers into those that contained rising pitch, falling pitch, unfilled pauses (short = 1s and long = 5s) and Uh and Um. For the purpose of this investigation the most interesting finding is that in both the FOK and FOAK experiments, participants felt that those answers which contained a filled pause (Uh or Um) were less likely to be correct than those that did not. Therefore, listeners (in this offline study at least) consider a filled pause to indicate uncertainty, which is further evidence that listeners draw information from disfluencies.

In addition to the above findings, Arnold and colleagues (Arnold, Fagnano & Tanenhaus, 2003; Arnold, Tanenhaus, Altmann & Fagnano, 2004; Arnold & Tanenhaus, In Press) used eye tracking as a way of investigating the effects of disfluency on the “core language comprehension process” (Arnold et al., 2003, p. 34).

Participants were presented with a computer mediated visual display of a 5x5 grid containing four items: two phonological competitor items (e.g. camel and candle) and two distracter items. Participants were given instructions to move the items within the grid. The first instruction was intended to set up a discourse given status (i.e. both the items mentioned (put the X below the X) became part of the current discourse). The second instruction (which, depending on the condition, sometimes contained a disfluency in the form of a prolonged article followed by a filled pause) referred either to the discourse given item or to the other competitor. Arnold et al. examined participants’ eye movements during and immediately after the disfluency and found that when a disfluency was present in the utterance, participants tended to make more early fixations on the non-discourse given competitor item. Conversely, when there was no disfluency in the utterance, participants tended to make early fixations on the discourse given competitor item. Arnold & Tanenhaus (In Press) have further investigated the influence of disfluency on referent selection by conducting preliminary studies comparing fixations on concrete and abstract items in fluent and disfluent conditions. The findings of this study suggest that a disfluency in an utterance increases early fixations on both new discourse items and items with low codability. The explanation given by Arnold and colleagues for this effect is that disfluencies increase a listener’s expectancy that an upcoming referent will be difficult to integrate into the discourse. Thus when presented with a finite array of items, a disfluency would prompt the listener to expect, and fixate on, the most problematic item (in this instance integration difficulty could be measured either by discourse status or by codability). Whether the Expectancy Hypothesis can be generalised to spontaneously produced speech beyond that generated by this task (i.e. without a finite set of items which vary in terms of integration difficulty along only one or two dimensions) is still under investigation. However, these results are further evidence in support of the assertion that disfluencies should not be filtered and that we should be “considering disfluency as an informative aspect of the speech signal, rather than as noise” (Arnold et al., 2004, p. 581).

Having briefly summarised a sample of the behavioural evidence in support of informative disfluencies, we can conclude that the second, and currently most accepted, way of viewing disfluencies is not as errors which must be filtered from the speech stream, but as important parts of the speech stream which convey information to the listener. We turn now to electrophysiological evidence to examine some of the current theories which attempt to explain the behavioural benefits and effects that disfluencies appear to have on language processing.

We begin our exploration with Corley, MacGregor & Donaldson (In Press) who conducted Event Related Potential (ERP) research into disfluency and predictability. Each stimulus sentence was recorded twice, once with a disfluency (a Hesitation which included a prolongation followed by the filled pause er, e.g. thee er) and once without. Utterance final target words were then digitally edited into the utterance in place of pseudo-target words, which were used to reduce prosodic and phonotactic confounds. The edited-in target word was either highly predictable or highly unpredictable given the context provided by the utterance. By counterbalancing these two conditions (disfluency and predictability) and presenting the utterances auditorily to participants, Corley et al. were able to use ERP (“neural activity recorded at the scalp, time locked to the onset of a cognitive event of interest”, In Press) to investigate whether or not disfluencies had any observable neurological effect on the processing of predictable versus unpredictable words. Corley et al. specifically focused on the N400 effect (a spike in negativity at around 400ms after the onset of the cognitive event) which is thought to relate to the processing and integration of semantic items. They found that there was a reduction in the magnitude of the N400 spike when a disfluency preceded an unpredictable target word (as compared to a predictable target word). In addition to this finding, Corley et al. also conducted a secondary experiment in which they asked participants of the ERP study to undergo a recognition memory test. In this test participants pressed buttons to indicate whether the word presented on the screen had been present in the previous experiment. The main finding from this experiment was that participants were more likely to remember words which were preceded by a disfluency than those which were not. Corley et al. therefore conclude that not only do disfluencies have immediate on-line effects on language processing, they also have long term effects on how following words may be represented in memory. Corley et al. also have several suggestions as to how the underlying mechanism(s) generating these effects might operate. The two which are most influential for this study are:

Firstly, that certain disfluencies are, in effect, words which are processed normally but which listeners take as a comment on the status of the speaker’s production system (e.g. encountering difficulty coding an unpredictable referent) (Clark & Fox Tree, 2002).

Secondly, Corley et al. posit that there may be no particular relationship between the disfluency and the processing differences except that the disfluency constitutes a delay in input before a difficult item. This would, presumably, allow an incremental parser to ‘catch up’ with the input and be free of other processing demands when the item became available.

Another possible explanation is presented by Collard, Corley & MacGregor (Submitted). Collard et al. modified the paradigm used by Corley et al. above by adding oddball items. In this context, oddballs are words which have different acoustic properties to the rest of the utterance that they are in. This could mean, for example, having a different pitch or intensity. Outside of a linguistic context, oddball items have traditionally been used to address questions regarding the cognitive apportioning of listeners’ attention during an experiment. When an oddball (i.e. something unexpected or ‘out of the ordinary’) occurs there is a measurable P300 effect (a spike in positivity around 300ms after the event) as well as a mismatch negativity (MMN). In the study, Collard et al. altered the acoustic properties of existing items immediately before disfluencies in an attempt to differentiate between accounts that posit a linguistic mechanism for handling disfluency (e.g. Arnold et al., 2003; Clark & Fox Tree, 2002) and accounts that attribute disfluency handling to a mechanism which functions outside of the linguistic processing system. Collard et al. argue that there is no linguistic difference between the normal utterances and those containing oddballs. Therefore, if there is a difference between the oddball and non-oddball disfluent utterances, there must be an influence of some kind of extra-linguistic processing. By examining the P300 and MMN effects, Collard et al. found that certain disfluencies (here, as in Corley et al. (In Press), termed Hesitations) serve to focus the listener’s attention on the upcoming word. Thus, they conclude that there is at least some extra-linguistic processing occurring during the management of disfluencies. They do stress, however, that this finding does not disprove linguistic accounts and state that “a thorough account of disfluency should describe and account for effects on both attentional and linguistic processes” (Collard et al., Submitted).

Although all the evidence presented above strongly discredits the concept of filtering, we can see mixed support for the ideas that (a) the comprehension system has a special mechanism (or mechanisms) for handling disfluencies, and (b) listeners are particularly sensitive to, and draw specific information from, different types of disfluency.

One recent study by Corley, Akker & Hartsuiker (Submitted), which expands on previous work by Bailey & Ferreira (2003), casts doubt on both of these assumptions. Corley et al. conducted experiments in which filled pauses were replaced either by silence or by a non-speech sound such as a sine wave tone. The intent of this was to examine the delay hypothesis mentioned above. The stimuli consisted of a visual display of two pictures and an auditory instruction to press the button which corresponded to one of these pictures. Within the instruction was some form of delay (a filled pause, silent pause or sine wave tone) which occurred either immediately preceding the target word or earlier in the utterance. The main finding of this study was that any form of delay, regardless of content, produced similar sorts of processing benefits as have been detailed above (i.e. an increase in processing speed of the following word).

Assuming then that the temporal delay hypothesis (Corley et al., Submitted) is correct, a conflict arises between having an overarching non-linguistic account of disfluency handling and the evidence supporting differences between the disfluency types (e.g. Fox Tree, 2001). Any differences in processing attributed to different types of disfluency would most likely be seen as support for a linguistic account, as they would be primarily linguistic differences. Certainly, corpus studies and production literature have indicated that certain disfluencies seem to consistently appear in certain locations within utterances (Clark & Fox Tree, 2002; Fox Tree & Clark, 1997; Swerts, 1998) and that they appear to perform different functions (Schnadt & Corley, 2006). This would tend to imply that there is some kind of linguistic processing of disfluency occurring. The alternative is that these different forms of disfluency simply constitute different lengths of delay, and so variation in the information drawn from each disfluency type could occur without the requirement for additional disfluency processing mechanisms within the parser.

Whether the mechanisms by which disfluencies are handled are linguistic, non-linguistic or a mixture of the two, it is clear that on some level disfluencies such as filled pauses and prolongations are used by listeners to assist in their comprehension of a speaker’s utterance. Although there has been much research conducted into the comprehension side of disfluency, there is still very little work that sheds light on the speaker’s role in disfluency production. One question which is particularly relevant to both lines of research is:

Do speakers use disfluencies as signals to their listeners or are they simply an epiphenomenon generated during production difficulties?

This question appears to correspond directly to the distinction between linguistic and non-linguistic accounts of disfluency parsing. If speakers are sending signals to their listeners then it seems reasonable to assume that they would do so by using the specific form of disfluency which best reflects the signal they wish to send. As such it would be sensible to assume that listeners would not only be sensitive to particular forms of disfluency, as these convey differing information, but also have specific linguistic mechanisms within the parser to receive them. If, on the other hand, speakers do not intend to send signals via disfluencies (i.e. they are an epiphenomenon) it is unlikely that specific linguistic mechanisms exist to decipher them, and it seems probable that the listener would use other, non-linguistic, methods to derive the meaning.

As mentioned above, there has been comparatively little work conducted on the mechanisms and intentions behind disfluency production. It also seems that the majority of work that has been conducted in this area has centred on corpus analysis.

Bortfeld et al. (2001) conducted a corpus analysis to investigate the distribution of various forms of disfluency in a range of circumstances. This work was born from a desire to replicate findings from other research such as Oviatt (1995) and Shriberg (1996) in one single study which drew results from only one corpus. In this way Bortfeld et al. could recheck and collate all the existing findings in a way that made between-finding comparisons meaningful (as each finding would be from the same corpus and the coding protocols used would be consistent). Using Schober & Carstensen’s (2001) task oriented corpus, the study investigated factors thought to influence disfluency production: speaker’s age, task roles (director vs. matcher), difficulty of task (describing abstract shapes vs. photographs of children), familiarity of speaker and listener, and gender. The findings most pertinent to the current study are that Bortfeld et al. found distributional differences among the disfluency types in relation to task role and difficulty, particularly with regard to filled pauses. It was Bortfeld et al.’s expectation that directing a task would be more difficult and would therefore increase the number of disfluencies generated by the speaker (as compared to performing the matching role). This proved to be particularly true for filled pauses. There are two ways to interpret this finding. Firstly, the task of directing the interaction is simply more difficult and so the cognitive load placed upon the director is greater, reducing available resources for speech planning (assuming there is a centralised resource pool). In this account more filled pauses are produced as an epiphenomenon of planning stalls. Secondly, it may be the case that the speaker, struggling under the additional cognitive load of the task, is signalling upcoming delays to the listener by means of increased numbers of filled pauses. Bortfeld et al. appear to conclude that cognitive load could not solely account for the increase in filler rates between director and matcher. Therefore, although not explicitly stated, they appear to show evidence for some kind of intentional, disfluency based signalling on the part of the speaker.

In direct support of this finding is the corpus analysis work reported in Clark & Fox Tree (2002). Clark & Fox Tree looked specifically at filled pauses (Uh and Um) and the way that they may be represented by the speaker. They argue that Uh and Um are in fact words in English like any other and that they should be classified as interjections (like ah or oh). Clark & Fox Tree argue that filled pauses, like interjections, should be defined by their function (i.e. they signal a delay) and also that (based on corpus analysis) Uh signals shorter delays than Um. The basic meaning of ‘filled pause’ would therefore be: “used to announce the initiation of what is expected to be a minor, or major, delay in speaking” (2002, p. 86). In addition to the defined basic meaning, Clark & Fox Tree argue that, like all words, interjections have secondary, implied, meanings and that people can use filled pauses to transmit a variety of other, more interpersonal messages ranging from “speakers want to keep the floor” to “speakers are inviting their addressees to speak” (p. 90). Finally, Clark & Fox Tree argue that there are two types of messages that speakers try to transmit. The first are the primary messages which consist of the ‘official’ topic of conversation.

The second are collateral messages which comment on the speaker’s performance in the primary message. The account of disfluency production given by Clark & Fox Tree (2002) would, if considered in terms of the comprehension account to which it corresponds, be a highly linguistic account of disfluency management. This is unsurprising, however, as the lead author, at least, comes originally from a purely linguistic background. As such it is possible that, in an attempt to explain the processing of Uh and Um linguistically, Clark & Fox Tree have neglected to consider other possible non-linguistic accounts and have posited an account which, to some psycholinguists at least, seems overly complex and counterintuitive. O’Connell & Kowal (2005), for example, have argued against the theory that Uh and Um are to be considered and treated as normal English words. Clark & Fox Tree state that “To be an English word is to conform to the phonology, prosody, syntax, semantics and pragmatics of English words” (2002, p. 103) but to O’Connell & Kowal both the semantics of Uh and Um and their classification as interjections are questionable. In an analysis of a corpus of television interviews, O’Connell & Kowal found no evidence that Uh and Um signalled an upcoming delay: “in roughly 80% and 60% of the time, such a prediction would be false after uh and um, respectively, and uh and um are therefore poor perceptual cues to silent pauses for the listener” (2005, p. 562).

While, on the surface, this statement appears to run counter to almost all of the evidence given so far in this paper, it is not, on closer inspection, contradictory. O’Connell & Kowal argue that uh and um do not signal delay (i.e. they do not reliably occur before silent pauses) but this does not mean that they are not performing the function of accommodating a production problem within an utterance (i.e. the filled pause is the delay). Moreover, whether this accommodation constitutes a signal (either intentional or otherwise) to the listener is a separate issue. This evidence is, however, directly counter to Clark & Fox Tree’s position and calls into question the basic semantic meaning Clark & Fox Tree assign to the filled pauses, and thus their status as words.

In addition, O’Connell & Kowal highlight a methodological issue in Clark & Fox Tree (2002) which throws further doubt on the status of filled pauses. Clark & Fox Tree define a silent pause as having a duration of at least 0.5s but “Their recorded means for both uh (0.17) and u:h (0.34) are shorter than this minimum” (O’Connell & Kowal, 2005, p. 570). This implies that “they inappropriately included nonoccurrences… as contributors… to these means” (p. 570), which would reduce the reliability of their results and cast doubt on their conclusions. Finally, O’Connell & Kowal argue that interjections are segregated from the speech stream by prosodic changes (e.g. loudness or intonation) rather than through articulatory isolation (see also O’Connell, Kowal & Ageneau, 2005). This emphatic distinction, O’Connell & Kowal argue, does not correlate with the usage of Uh and Um which are “characteristically non-emphatic” (2005, p. 572).

On these grounds, it seems, there is reason to be cautious when considering Clark & Fox Tree’s strictly linguistic account of disfluency production and signalling. It would, however, be foolish to dismiss their findings entirely without first making some attempt to empirically verify or discredit them.

It is the purpose of this study to provide an empirical, non-corpus investigation of some of the claims detailed above. In particular, we investigate the claim that speakers use disfluencies to intentionally signal to their listeners that they are experiencing planning difficulties (and / or imply other things). We use the terminology of linguistic Performance and linguistic Competence to differentiate between the two perspectives. Linguistic Performance relates to the view that planning problems are accommodated in the speech stream by disfluencies and that speakers are entirely egocentric in their use of disfluent items (i.e. making no attempt to alert a listener to their cognitive status). Linguistic Competence refers to the theory that speakers are intentionally (although not necessarily consciously) using disfluencies to send messages to listeners concerning the state of their production system, potential upcoming delays, or more interpersonal factors such as their desire to remain uninterrupted.

This theoretical distinction is difficult to test experimentally and, as we can see from the findings discussed above, most work conducted in this area has examined the positioning and surroundings of disfluent items. However, in this paper we report an exploratory experiment which directly, empirically tests this distinction by manipulating the presence of a listener while speakers perform a spontaneous speech generation task.

The experiment is an expansion on work conducted by Oomen & Postma (2001) who used the Network Task (originally used by Levelt, 1983) to investigate the production of various different disfluencies and speech errors. The Network Task is a computer mediated task in which participants describe a changing visual scene. Participants are presented with sets of pictures connected by networks of paths, with up to three connections between any pair of pictures. Participants describe the path of a dot as it progresses through the network by identifying which of the paths the dot is taking and which object it is heading towards. This produces comparatively natural, but predictable, connected speech. In our experiment one condition sees participants direct a listener around each network in such a way that the listener can replicate (on a blank network) the exact route that the dot takes. In the second condition the speakers are told that they are describing the route the dot takes for a developing speech sound corpus. The first condition provides an opportunity for the speaker to produce Competence based disfluencies which signal upcoming delay to the listener, while the second condition acts as a control against which to compare disfluency rates.

One potential criticism of this Audience manipulation is that the speaker must be facing the screen while describing the dot’s route. This means that the speaker is not facing the listener (or has their view obscured by a computer monitor). It may be argued that the listener (who is not allowed to give feedback during the description) may be less than usually salient to the speaker. We address this issue by presenting a post-experiment questionnaire. Three of the questions on the questionnaire are designed to ask (in different ways) if the speaker was conscious of the listener during the task. We can, therefore, assess whether the listener was normally salient to the speaker during the task.

Returning to Oomen & Postma (2001): the authors manipulated the speed at which the dot travels through the network in an attempt to apply pressure to the speech production system (participants were instructed to alter their speech rate to ‘keep up’ with the dot). In the faster speech rate condition Oomen & Postma found a reliable difference in the number of errors and disfluencies produced (as compared to the slow condition) and so gave further support to the Speed / Accuracy trade-off theory (MacKay, 1982; Dell, 1986). Based on the concept that the faster a person speaks the greater the pressure applied to the speech production system, we use a similar speech rate manipulation to Oomen & Postma to affect the number of disfluencies produced as a result of linguistic Performance issues (i.e. as an epiphenomenon).

In addition, recent findings by Schnadt & Corley (2006) (also using the network task) indicate that altering the ‘difficulty’ of an item (based on lexical frequency and name agreement) causes a reliable increase in the amount of (certain) disfluencies produced. Following Schnadt & Corley’s example we also manipulate the

‘difficulty’ of certain items to increase disfluency rates. We do this for three reasons; the first is that it simply increases rates of disfluency (regardless of the mechanisms responsible) which, given the comparatively low disfluency rates usually generated in speech production tasks, boosts much needed statistical power and helps to avoid floor effects in the data. Secondly, it allows for a degree of comparability with existing work which enables us to confirm that both our experiment and our participants are acting as expected. Thirdly, and most importantly, having a proportion of difficult to name items in the task provides an opportunity to present the speaker with a very obvious cue to potential production problems. This very blatant cue might be what is needed to prompt a speaker into signalling an upcoming delay.

In this way, item difficulty would increase the production of both Performance and Competence based disfluencies.

So to summarise: We use the Network Task as a context in which we manipulate three variables, the presence of an Audience, the Speed at which the dot moves around the network and the Difficulty of the items to be named.

In addition, inspired by Bortfeld et al.’s statement that “perhaps some disfluencies serve an interpersonal coordination function, such as displaying a speaker’s intentional or metacognitive state to a partner, while others simply represent casualties of an overworked production system” (2001, p. 7, italics added), as well as by much debate in the literature, we index the results by disfluency type. This allows us to investigate the claims that different disfluencies are performing (possibly intentionally) different functions. It may even be the case that some disfluency types can be accredited to linguistic Performance while others are related to linguistic Competence.

As we turn to our predictions for this experiment, another of Bortfeld et al.’s statements becomes highly relevant: “With a more difficult task, speakers are more likely to have trouble and to display that trouble to an addressee, so the effects of cognitive load will not be independent of effects of interpersonal coordination (if, indeed, the latter are at work)” (2001, p. 20). With this in mind, when we examine the results of this study, we are not just looking for main effects of each of the manipulations but also additive effects and interactions between the manipulations. If, for example, there is no effect of Audience, we may decide to accredit any effect of Difficulty to linguistic Performance issues rather than Competence. Conversely, if there is an effect of Audience but no effect of Difficulty we might conclude that (some) disfluencies are a part of linguistic Competence but that they are only used to implicate more interpersonal messages and do not signal production delay.

As discussed above, we associate each variable with a different pattern of increase in disfluency.

Audience – An increase in disfluency rates related to this variable indicates the presence of Competence related disfluencies.

Speed – An increase in disfluency rates related to this variable can be accredited to Performance disfluencies.

Difficulty – An increase in disfluency rates related to this variable can be accredited to either Performance and / or Competence disfluencies depending on the effects observed in the other variables.

If claims of intentional signalling are true, we would expect to see a greater amount of disfluencies when a listener is present than when one is not. We would also expect to see a greater amount of disfluencies related to Difficulty when a listener is present as compared to when one is not. On the other hand, if claims of intentional signalling are false, we would expect to see no difference in disfluency rates whether an audience is present or not.

Method

The experiment was described as a communicative task in which one participant assumed the role of the director and the other the follower. Subsequently the follower (confederate) was removed and the director was instructed to repeat the task (with different materials). This second stage of the experiment was presented as distinct from the first and as a data collection session (for a developing speech sound corpus) rather than an experiment. Two of the three variables (Audience and Speed) were counterbalanced in a Latin Square design to avoid fatigue or learning effects and to reduce the salience of speed changes. Speed was set at two levels: the fast and the slow condition. These speeds (set to 25 and 35 in version 2.0 of the network task) equate to an approximate ‘total time to pass through a network’ of 30 and 45 seconds respectively. After the experiment, participants were presented with a questionnaire which assessed the validity of their data. The questionnaire included questions about their linguistic history as well as the salience of the listener.

Participants. Twenty four students (8 male and 16 female) from the University of Edinburgh participated in the experiment for financial reward. All were native British English speakers. None had spent much time immersed within a different culture or been overly exposed to another language (including American English), nor did they have any known visual or auditory difficulties.

Materials. In this experiment we used eight trial networks and two practice networks. Each network consisted of eight black and white pictures connected by up to three straight, curved or looping lines (Figure 1 provides an example). These lines represent the possible paths the dot can travel along. The pictures differ in location as well as path configuration in each network. In each experimental network, the dot started on one of the eight pictures and travelled through two of them twice. The description of the starting picture was discounted from the analysis as were the descriptions (and corresponding path descriptions) of those pictures which had already been named (second pass items). Pictures were selected from the IPNP (International Picture Naming Project) online database (a collation by Szekely et al. (2004) of several picture sets including Snodgrass & Vanderwart, 1980 and Abbate & La Chappelle, 1984) and digitally resized, where necessary, to fit the item slots in the network.


Figure 1. An example of a network used in the experiment. The example shows all available types of paths (straight, curved and looped) and their possible configurations. In this example the dot starts at the ghost and finishes at the panda. The order of items is: ghost (filler), trumpet (hard), bone (easy), swan (hard), king (easy), trumpet (second pass item), swan (second pass item), clamp (hard), sun (easy), panda (hard).

The pictures to be used in the experiment were selected, based on name agreement and lexical frequency values, from a collation of data from both the IPNP and the CELEX (Centre for Lexical Information) online databases (name agreement and lexical frequency respectively). This collation was performed by Michael Schnadt of the University of Edinburgh. We defined two levels of the Difficulty variable as easy and hard. Average name agreement and lexical frequency scores were 0.017 and 85.4 respectively for the easy condition and 1.530 and 3.643 for the hard condition (a lower value for name agreement indicates higher agreement). Filler items (used for the start of each network as well as for all the items in the practice networks) were words which were extremely ‘easy’ (on average 0.127 name agreement and 104.82 lexical frequency). We alternated between easy and hard items based on the order the items were seen in (dictated by the path the dot took) and kept the sequence alternate regardless of the difficulty of any second pass items. This meant that the overall challenge of each network stayed constant throughout. There were seven analysed descriptions in each network so, to avoid a disparity between easy and hard items, the three to four ratio was alternated across networks. Finally, a pilot study was conducted to assess which, if any, of the pictures selected were overly difficult to identify in terms of the picture itself rather than any lexical property of the item. Any items which pilot participants could not identify, or struggled to identify, were replaced with items of similar lexical properties.
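The alternation scheme can be illustrated with a minimal sketch. It assumes seven analysed items per network and eight experimental networks, as described above; the function name `difficulty_sequence` and the choice of which networks begin on an easy item are hypothetical and are not taken from the network task software.

```python
from collections import Counter

def difficulty_sequence(first, n_items=7):
    """Alternate easy/hard labels for the analysed items of one network."""
    other = "hard" if first == "easy" else "easy"
    return [first if i % 2 == 0 else other for i in range(n_items)]

# Hypothetical assignment: networks alternate between starting on an easy
# item (4 easy, 3 hard) and starting on a hard item (3 easy, 4 hard).
networks = [difficulty_sequence("easy" if k % 2 == 0 else "hard") for k in range(8)]

# Count easy and hard items over all eight networks.
totals = Counter(label for seq in networks for label in seq)
print(totals["easy"], totals["hard"])
```

Under these assumptions each network is uneven (three items of one level, four of the other) but the two levels balance out across the eight networks. The order listed for the Figure 1 network corresponds to a sequence that starts on a hard item.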

Procedure.

The participants were given a set of written instructions to read and were asked to confirm that they understood all that was asked of them and / or ask any questions they had before beginning. The instructions described the upcoming task, defined the sort of speech we wanted to elicit (i.e. complete utterances rather than minimalist descriptions), gave instructions as to how to operate the network task software and highlighted the kind of information the listener required (e.g. the starting location of the dot). The instructions also included a section defining the structure of the experiment in which the separate nature of the ‘director / follower experiment’ and the supposed corpus collection was emphasised. The instructions were given in written form to both provide consistency across participants and to highlight (in an official manner) the role of listener within the experiment (thus increasing the salience further). Participants were then seated in front of the computer screen at a distance that they felt comfortable with and went through the two practice networks. The participant’s speech was monitored by the experimenter during these two practice networks and any amendments to style or information content were instructed before the experimental trials began. The (confederate) listener was introduced by name and was seated behind the participant to reduce the potential for unintended feedback through facial expressions or gestures (the layout of the experimental room made the seating arrangement unremarkable). The listener was allowed to minimally interact with the participant between but not during networks. This was an attempt to increase the overall plausibility of the directing task and the salience of the listener while maintaining experimental control throughout the description. Participants described four networks to the listener who was then removed. Participants then described a further four networks. 
They were reminded that the second set of four networks were for a corpus of speech sounds and the experimenter left the booth and shut the door before they began. Ostensibly this was to increase the sound quality of the recordings by cutting ambient noise. In reality it was an attempt to increase the plausibility of the corpus recordings and highlight the lack of a listener.

Due to counterbalancing of the Audience variable, in half of the experiments the supposed speech sound corpus recordings were undertaken first and the directing task second. In these experiments participants were presented with a different set of instructions. The instructions were identical in all but the structure section, which informed the participant of the order of the tasks. In addition, the listener was not introduced until after the sound corpus recordings had been conducted; this again was intended to maintain listener salience.

After the experiment participants were asked to fill out a short questionnaire which queried their linguistic history and assessed their level of commitment to the deceptions within the study. In particular, three questions evaluated the salience of the listener during the directing task.

Transcription and Coding.

Utterances were transcribed in full and broken down into two component parts: the Path description and the Item description. The speech was, for the most part, in a fairly formulaic format (e.g. now it’s taking the left curved path towards the hammer), although this was deviated from in both content and format on occasion. The Path description included everything up until the last word before the Item which contained propositional content. In the example above the Path description would include now it’s taking the left curved path and the Item description would be towards the hammer. On the occasions where this format was deviated from, the utterance was transcribed as a whole but any disfluencies were accredited to the appropriate section. The Path and Item descriptions were also collapsed into one description labelled Complete.

Disfluency was coded perceptually by the first author and 20% of utterances were cross-coded by an independent second coder. The second coder was given instruction on what constituted each type of disfluency and a copy of the transcription protocol. Agreement differed slightly for each disfluency type but averaged 73%. In cases of disagreement, disfluent items were examined and discussed until an agreement was reached. Disfluent items were perceptually identified as either Prolongations, Uh, Um, Prolonged Uh, Prolonged Um, Hesitations or Repairs. Table 1 exemplifies the identification of different disfluency types. Where identifiable anomalies such as coughs or laughs occurred they were marked in the transcriptions but not analysed. Similarly, unidentified phenomena in the speech stream were marked with question marks but were not included in the analysis.
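The text does not state how coder agreement was calculated. As a minimal sketch, assuming simple percent agreement on the presence or absence of one disfluency type per utterance, the figure for a single type could be obtained as follows (the function name and the judgements are hypothetical):

```python
def percent_agreement(coder_a, coder_b):
    """Percentage of utterances on which two coders make the same judgement."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("coders must judge the same, non-empty set of utterances")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return 100.0 * matches / len(coder_a)

# Hypothetical judgements (True = disfluency type present) for five
# cross-coded utterances; these are not values from the study.
first_coder = [True, False, False, True, False]
second_coder = [True, False, True, True, False]
print(percent_agreement(first_coder, second_coder))
```

Averaging such per-type figures would give an overall value comparable to the one reported, although chance-corrected measures are a common alternative.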

Table 1

Examples from the data of each identified disfluency type

Disfluency Type    Example from data
Prolongation       left on [the:] bottom curve
Uh                 from there is takes the (uh) upper curved path towards
Um                 towards the (um) deer
Prolonged Uh       up to the (u:h) spoon
Prolonged Um       (u:m) starting at the cheese
Hesitations        to {} the wood
Repairs            the claw ^the foot^

Once coded, disfluency rates were quantified by a tally of instances of disfluencies rather than amounts. Each utterance may contain one or multiple instances of a certain disfluency type but this method simply generates a score based on the presence of a disfluency type not the amount of that type within an utterance. Therefore we generate a score which equates to the proportion of utterances in each condition which contain a certain type of disfluency. This quantification method enables a more reliable comparison between disfluency types (as compared to a simple numeric tally) as it is not confounded by the comparative likelihood of some disfluency types to reoccur within the same disruption.
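The difference between the two quantification methods can be shown with a minimal sketch. The function names and the coded utterances are hypothetical, chosen only to illustrate the presence-based score described above against a simple numeric tally.

```python
def presence_proportion(utterances, disfluency_type):
    """Proportion of utterances containing at least one instance of the type."""
    containing = sum(1 for codes in utterances if disfluency_type in codes)
    return containing / len(utterances)

def instance_count(utterances, disfluency_type):
    """Simple numeric tally of every instance, shown for comparison."""
    return sum(codes.count(disfluency_type) for codes in utterances)

# Hypothetical coded utterances from one condition: each inner list holds
# the disfluency codes found in one utterance.
condition = [
    ["Uh", "Uh", "Prolongation"],
    [],
    ["Prolongation"],
    ["Uh"],
]
print(presence_proportion(condition, "Uh"))  # two of four utterances contain an Uh
print(instance_count(condition, "Uh"))       # three Uh instances in total
```

The repeated Uh in the first utterance raises the numeric tally but leaves the presence-based score unchanged, which is the property the method relies on.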

The questionnaire included three questions which assessed the salience of the listener to the participants. These questions were in yes / no format making them easily quantifiable. Each participant received a score between 0 and 3 depending on their answers. The analysis threshold was set at 2 so any participant scoring less than 2 was discarded and their data replaced.
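The screening rule amounts to a simple count and comparison. In the sketch below the participant labels and answers are hypothetical, and each answer is assumed to be already coded as True when it indicates that the listener was salient.

```python
ANALYSIS_THRESHOLD = 2  # threshold stated in the text

def salience_score(answers):
    """Number of the three answers indicating that the listener was salient."""
    return sum(1 for answer in answers if answer)

def meets_threshold(answers, threshold=ANALYSIS_THRESHOLD):
    """True when the participant's data would be retained."""
    return salience_score(answers) >= threshold

# Hypothetical questionnaire answers, not data from the study.
responses = {
    "P01": (True, True, False),
    "P02": (False, True, False),
    "P03": (True, True, True),
}
retained = [pid for pid, answers in responses.items() if meets_threshold(answers)]
print(retained)
```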


Results

Four participants failed to meet the analysis threshold (2 or above) in the post-experiment questionnaire and so were rejected. Four replacement participants underwent the experiment and subsequently met or exceeded the threshold. Out of the twenty four included participants, 12 achieved a score of 2 and 12 achieved 3.

No differences between numbers of prolonged Uh or prolonged Um could be found in any conditions or in any description section. These were therefore collapsed into the corresponding filled pause type (i.e. Uh or Um). Uh and Um were analysed both individually and combined as Filled Pauses. No other changes were made to the disfluency types detailed previously. Utterances were analysed both as a whole (the Complete description) and as two separate description sections: the Item description (between the last content word before the target item and the end of the utterance), and the Path description (from utterance onset up to and including the last content word before the target item). Proportions of utterances containing disfluencies of each type were analysed with Difficulty (easy or hard), Speed (slow or fast), and Audience (present or absent) as within-subject factors. Table 2 shows the F values and significance levels for all significant and marginal results, indexed by disfluency type and divided by description section. Table 3 shows the mean disfluency rates by condition, indexed by disfluency type and divided by description type.
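When every within-subject factor has two levels, each main effect and interaction has one degree of freedom, and its F(1, n − 1) value equals the squared one-sample t statistic computed on a single contrast score per participant. The sketch below illustrates this for a main effect of Difficulty. The analysis software actually used is not stated in the text, and the function name `contrast_f` and the scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def contrast_f(contrast_scores):
    """F and degrees of freedom for a 1-df within-subject effect.

    Each score is one participant's contrast (here: rate for hard items
    minus rate for easy items, averaged over Speed and Audience).
    """
    n = len(contrast_scores)
    t = mean(contrast_scores) / (stdev(contrast_scores) / sqrt(n))
    return t * t, (1, n - 1)

# Hypothetical contrast scores for three participants (not study data).
hard_minus_easy = [1.0, 2.0, 3.0]
f_value, df = contrast_f(hard_minus_easy)
print(round(f_value, 3), df)
```

With 24 participants the same calculation gives the 1,23 degrees of freedom reported for every effect in Table 2.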

In the Complete description there was a highly significant main effect of Difficulty for Any Disfluency Type (an amalgamation of all other disfluency types), Prolongations and Hesitations (p < .001). In all cases disfluency rates increased in the hard condition, as compared to the easy condition. There was a marginal interaction between Audience and Difficulty for Uh indicating that participants were slightly more likely to produce an Uh when the listener was absent and the item was hard than in the present and easy conditions. Finally, there was a marginal effect of Speed for both Filled Pauses and Uh. In both cases participants produced slightly more disfluencies in the slow condition than in the fast condition.

In the Item description analysis, the same main effect of Difficulty was observed for Any Disfluency Type, Prolongations and Hesitations (statistically significant to p < .001). In addition, Repairs showed a significant effect of Difficulty (p < .05). In line with the effects above, participants produced more Repairs when the item was hard to name than when it was easy. There was also a marginal interaction between Speed and Difficulty for Hesitations: participants produced slightly more Hesitations in the slow and hard condition as compared to the fast and easy condition. Finally, there was a marginal Audience effect for Filled Pauses. Participants appear to produce slightly more Filled Pauses when the audience is absent. This may be due to an extremely low value (0.0) in an audience present condition of Um (audience present, slow, hard).

In the Path description analysis there was no consistent effect of Difficulty like those seen above. Speed was significant for Prolongations (p < .05) and marginal for Any Disfluency Type and Uh. In all cases, participants produced more disfluent items in the slow condition than in the fast condition. Finally, there was a significant three way interaction between Audience, Speed and Difficulty for Filled Pauses and Uh (p < .05), as well as a similar marginal interaction for Any Disfluency Type. In both cases, participants produced more disfluent items in the present, slow, hard condition than in the absent, fast, easy condition.

It should be noted that many of these effects and interactions are dependent on each other: the Complete description is an amalgamation of the Item and the Path descriptions, Any Disfluency Type is an amalgamation of all other disfluency types, and Filled Pauses is an amalgamation of the Uh and Um disfluency types. The consequences of this (in terms of the effects and interactions presented above) will be detailed in the Discussion section.
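One of these dependencies is visible directly in the reported means. In the Complete section of Table 3, each Filled Pauses cell equals the sum of the corresponding Uh and Um cells to within rounding. A minimal check, with values transcribed from Table 3 (variable names, and the tolerance chosen to allow for two-decimal rounding, are assumptions):

```python
# Complete description, cells in the order: audience present (fast easy,
# fast hard, slow easy, slow hard), then audience absent (same order).
uh = [0.33, 0.42, 0.71, 0.42, 0.29, 0.46, 0.50, 0.83]
um = [0.25, 0.58, 0.42, 0.46, 0.33, 0.38, 0.50, 0.46]
filled_pauses = [0.58, 1.00, 1.13, 0.88, 0.63, 0.83, 1.00, 1.29]

# Largest absolute difference between Filled Pauses and Uh + Um.
max_gap = max(abs(fp - (u + m)) for fp, u, m in zip(filled_pauses, uh, um))
print(max_gap < 0.015)
```

Because the combined measure is built from its components in this way, an effect on Uh alone can surface as an effect on Filled Pauses.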

To summarise: the main effect of the experiment is Difficulty, which is numerically consistent throughout the Complete and the Item descriptions but not in the Path description (as can be seen from Figure 2 below). There is little effect of Speed except in the Path description, and the only significant effects of Audience are in three way interactions with Speed and Difficulty in the Path description.

Table 2. Degrees of freedom, F and mean squared error values indexed by disfluency type for (a) the complete utterance; (b) the item description (between the last content word before the target item and the end of the utterance) and (c) the path description from utterance onset up to and including the last content word before the target item.

Description Type: Complete (by participants)

Disfluency Type       Effect                          df      F1            MSe
Any Disfluency Type   Difficulty                      1,23    19.559 ***    5.522
Prolongation          Difficulty                      1,23    16.549 ***    1.413
Filled Pauses         Speed                           1,23    3.647 m       1.285
Uh                    Speed                           1,23    3.450 m       0.799
Uh                    Audience x Difficulty           1,23    3.599 m       0.418
Hesitation            Difficulty                      1,23    13.800 ***    0.982

Description Type: Item (by participants)

Disfluency Type       Effect                          df      F1            MSe
Any Disfluency Type   Difficulty                      1,23    24.950 ***    4.572
Prolongation          Difficulty                      1,23    20.354 ***    1.401
Filled Pauses         Audience                        1,23    3.242 m       0.272
Hesitation            Difficulty                      1,23    13.905 ***    0.793
Hesitation            Speed x Difficulty              1,23    3.833 m       0.196
Repair                Difficulty                      1,23    5.406 *       0.163

Description Type: Path (by participants)

Disfluency Type       Effect                          df      F1            MSe
Any Disfluency Type   Speed                           1,23    3.016 m       3.046
Any Disfluency Type   Audience x Speed x Difficulty   1,23    3.930 m       1.717
Prolongation          Speed                           1,23    7.709 *       0.493
Filled Pauses         Audience x Speed x Difficulty   1,23    6.433 *       0.505
Uh                    Speed                           1,23    3.489 m       0.293
Uh                    Audience x Speed x Difficulty   1,23    4.490 *       0.227

m p < .10
* p < .05
** p < .01
*** p < .001

Table 3. Mean proportions of utterances that contain a disfluency for each experimental condition; indexed by disfluency type across (a) the complete utterance; (b) the item description (between the last content word before the target item and the end of the utterance) and (c) the path description from utterance onset up to and including the last content word before the target item.

Description Type: Complete

                      Audience Present            Audience Absent
                      Fast          Slow          Fast          Slow
Disfluency Type       Easy   Hard   Easy   Hard   Easy   Hard   Easy   Hard
Any Disfluency Type   2.45   4.25   3.46   4.42   2.83   4.29   3.00   4.88
Prolongations         0.96   1.79   1.50   2.08   1.13   1.79   1.13   1.83
Filled Pauses         0.58   1.00   1.13   0.88   0.63   0.83   1.00   1.29
Uh                    0.33   0.42   0.71   0.42   0.29   0.46   0.50   0.83
Um                    0.25   0.58   0.42   0.46   0.33   0.38   0.50   0.46
Hesitations           0.33   0.71   0.29   0.96   0.38   0.88   0.25   0.83
Repairs               0.67   0.75   0.54   0.50   0.71   0.79   0.63   0.92

Description Type: Item

                      Audience Present            Audience Absent
                      Fast          Slow          Fast          Slow
Disfluency Type       Easy   Hard   Easy   Hard   Easy   Hard   Easy   Hard
Any Disfluency Type   1.13   2.46   1.08   2.96   1.42   2.79   1.13   2.71
Prolongations         0.63   1.46   0.67   1.58   0.83   1.38   0.50   1.29
Filled Pauses         0.17   0.25   0.21   0.38   0.17   0.46   0.42   0.50
Uh                    0.08   0.13   0.21   0.21   0.08   0.29   0.21   0.33
Um                    0.08   0.13   0.00   0.17   0.08   0.17   0.21   0.17
Hesitations           0.29   0.58   0.21   0.88   0.29   0.71   0.17   0.71
Repairs               0.04   0.17   0.00   0.13   0.13   0.25   0.04   0.21

Description Type: Path

                      Audience Present            Audience Absent
                      Fast          Slow          Fast          Slow
Disfluency Type       Easy   Hard   Easy   Hard   Easy   Hard   Easy   Hard
Any Disfluency Type   1.42   1.79   2.38   1.46   1.42   1.50   1.88   2.17
Prolongations         0.33   0.33   0.83   0.50   0.29   0.42   0.63   0.54
Filled Pauses         0.42   0.75   0.92   0.50   0.46   0.38   0.58   0.79
Uh                    0.25   0.29   0.50   0.21   0.21   0.17   0.29   0.50
Um                    0.17   0.46   0.42   0.29   0.25   0.21   0.29   0.29
Hesitations           0.04   0.13   0.08   0.08   0.08   0.17   0.08   0.13
Repairs               0.63   0.58   0.54   0.38   0.58   0.54   0.58   0.71

[Figure 2: three bar charts titled “Difficulty. Complete description”, “Difficulty. Item Description” and “Difficulty. Path Description”. In each chart the horizontal axis is Disfluency Type (Any Disfluency Type, Prolongations, Filled Pauses, Uh, Um, Hesitations, Repairs), with separate bars for Easy and Hard items.]

Figure 2. Bar Charts showing a numerically consistent main effect of Difficulty across all disfluency types in Complete and Item descriptions, and a less consistent pattern in the Path description. Note, error bars are calculated by Standard Error, scales vary between graphs and data is collapsed across Speed and Audience.


Discussion

One of the general and overarching findings of this study is that manipulation of experimental variables can affect production rates of different forms of disfluency in a controlled, experimental environment. Before we begin to discuss the more specific findings of the experiment, however, we should address certain possible criticisms relating to the methodology of the study. Firstly, there was a gender disparity among the participants (8 male and 16 female). In their corpus analysis, Bortfeld et al. (2001) found that during directing tasks, men produced more filled pauses than women (see also Shriberg, 1996). In the experiment reported here, gender was not controlled with regard to task order (directing task / corpus collection) but a post-hoc analysis indicates that males were fairly well distributed between the two orders. In addition, all variables were within-subjects so gender differences should not have adversely affected the results. Secondly, Oviatt (1995) found that disfluency rates were much higher in speech intended for a human partner (5.50-8.83 per 100 words) than speech intended for a machine (0.78-1.87 per 100 words), and it could be argued that our Audience conditions correlate with the human / machine conditions that Oviatt discusses. We found no support for this argument: the mean proportions of utterances containing a disfluency in the audience present condition (14.58) and the audience absent condition (15.00) did not differ statistically. However, based on these figures, it could be further argued that our Audience manipulation was unsuccessful due to either a lack of audience salience or the nature of the directing task (being more like a monologue than a dialogue). The latter could be considered to bias the experiment towards eliciting Performance disfluencies rather than Competence disfluencies as it reduces the likelihood of the implied signals (such as floor holding or inviting a listener to speak) occurring.
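The two figures quoted above appear to correspond to Table 3: summing the four Complete-description cell means for Any Disfluency Type within each Audience condition reproduces them. A minimal check, with values transcribed from Table 3 and variable names chosen for illustration:

```python
# Any Disfluency Type, Complete description, cells in the order
# fast easy, fast hard, slow easy, slow hard (values from Table 3).
audience_present = [2.45, 4.25, 3.46, 4.42]
audience_absent = [2.83, 4.29, 3.00, 4.88]

present_total = round(sum(audience_present), 2)
absent_total = round(sum(audience_absent), 2)
print(present_total, absent_total)
```

On this reading the quoted values are totals across the four Speed by Difficulty cells rather than a single cell mean, which is an inference from the arithmetic and not something the text states.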

We argue that the post experiment questionnaire establishes that the listener was salient to all speakers during the task and that some even consciously undertook speaking strategies (such as speaking louder, more clearly or more descriptively). Therefore we argue that our experimental Audience manipulation was successful within its intended parameters (discussed below). With regard to the possibility of experimental bias towards Performance based disfluency, it is important to note that from the outset this study has been a very early, exploratory examination of this hitherto under researched theoretical question. We feel it is important, therefore, to maintain as high a level of experimental control as possible at this stage and to look to increasing the dialogical freedom of the task in subsequent experiments. As such (and as befits an empirical study) we have made all attempts to maintain that control. The introduction of a fully dialogical interaction, while being more ecologically valid, would undoubtedly have reduced the strength of this control and subsequently the confidence with which we are able to accredit our results. We feel that this study examines the intended question adequately and fairly within the parameters defined by its exploratory nature. Considering the automatic nature of many cognitive mechanisms (e.g. attentional orientation or self-monitoring), we feel that, while the presence of a listener may not incite the kind of implied uses of disfluency that Clark & Fox Tree (2002) define, it is enough to trigger the main, semantically based signals (i.e. to signal upcoming delay) if the linguistic Competence theories of disfluency production are correct.

As mentioned in the Results section, some of the effects we observed were driven either by (a) other disfluency types or by (b) other description sections. As we discuss the results in more detail this should be kept in mind and will be highlighted wherever appropriate.

Independent Effects. Difficulty was the manipulation which produced the most consistent and highly significant results. This effect was observable for Any Disfluency Type, Prolongations, Hesitations and (to a lesser level of significance) Repairs. It was only observable in the Complete and Item descriptions. Initially, it should be noted that the Complete description is an amalgamation of the Item and Path descriptions: if the observable effects are only present in one of these description sections, we can accredit any effects in the Complete description to that description section. Unsurprisingly there are no independent effects of Difficulty in the Path description: this is undoubtedly because participants have not yet encountered the difficult item whilst describing the Path. We can therefore accredit the effects observed in the Complete description to the Item description. In addition, Any Disfluency Type is an amalgamation of all other disfluency types and therefore represents the proportion of utterances that contain any type of disfluency. This allows us to get a general idea of the trend of disfluency production but should be considered differently to the other disfluency types detailed here. The Difficulty manipulation shows us that Prolongations, Hesitations and Repairs are all influenced (increased) by a harder to name item (difficulty based on name agreement and lexical frequency). It seems plausible that both Prolongations and Hesitations are, in effect, ‘buying time’ while participants retrieve and produce the lexical item they are using to label the picture. The increase in Repairs produced seems to reflect the recovery from errors relating to allocating item labels. An increase in the misallocation of item labels is consistent with the production of low name agreement and low frequency words. Interestingly, there is no independent effect of Difficulty on either of the Filled Pauses: Uh or Um. This might indicate either that Filled Pauses occur more often in fluid descriptive speech (i.e. the Path description) than in referent identification (i.e. the Item description) and so are not affected by the Difficulty manipulation, or that Uh and Um are not the primary signals of upcoming delay that linguistic Competence views would posit them to be.

The Speed manipulation produced only one independently significant effect and that was for Prolongations in the Path description. We might expect that the greater the pressure on the language production system, the more production stalls / problems there would be. If this is the case we would anticipate an increase in disfluency production to accommodate these difficulties. Counterintuitively, in the fast condition participants produced fewer Prolongations than in the slow condition. There are two possible explanations for this. The first is that the Speed / Accuracy trade off requires people to take less care in the pre-articulatory and post-articulatory monitoring of their speech when under greater time pressure. In this way people may produce fewer disfluencies because they are less concerned about having flowing, accurate speech than conveying the main message of the utterance adequately. The second explanation may be that rather than producing fewer disfluencies when in the fast condition, participants produce more disfluencies when in the slow condition. This way of looking at the effect allows the possibility that there is some influence of speech timing and / or speech rhythm occurring in the slow condition. It is possible that, as an artefact of the task, participants are finding themselves ahead of the dot in terms of their descriptions and are forced to use some kind of disfluent item to ‘fill time’ until the dot has caught up. This would be particularly relevant in cases where the dot is approaching a path choice as the speaker would be unable to anticipate the result of the choice and would be forced to wait until it was resolved before proceeding with the description. Alternatively, given the formulaic style of speech that many participants use in this task, it may be that participants are falling into a speech rhythm which includes a disfluent item. For example:

now it’s taking [the:] top curved path to the cheese
now it’s taking [the:] left curved path to the monkey
now it’s taking [the:] straight path to the chair

If this is the case, the observed Speed effect for prolongations may be confounded by the conscious adoption of speaker strategies.

The Speed manipulation is not entirely confined to Prolongations (although they were the only disfluency type to reach significance): there was also a marginal effect for the filled pause Uh (also in the Path description). This, combined with the Prolongation effect, generated a marginal Any Disfluency Type effect in the Path description and led to the marginal Speed effects for Uh and Filled Pauses observed in the Complete description. These marginal effects might indicate that our Speed manipulation was not strong enough to elicit differences to the extent that we were expecting (with relation to Oomen & Postma’s (2001) findings) and that a stronger manipulation might pick up effects for other disfluency types such as Um or Hesitations. We cannot, from this experiment alone, draw a robust conclusion as to (a) whether the observed Speed effects are a result of genuine production differences between the conditions or of a task related artefact, and (b) whether the Speed manipulation was powerful enough to elicit all possible effects for all disfluency types. However, if we were to conclude that the observed effects did reflect genuine production differences, we would accredit the effects found here to linguistic Performance rather than linguistic Competence. This is because they would relate entirely to the Speed / Accuracy trade off (i.e. more concern for the main message than the way it is delivered) and, as such, would not be concerned with informing listeners as to the status of the production system.

The Audience manipulation produced no independently significant effects but there was one effect which was marginal: Filled Pauses in the Item description. As mentioned in the Results section, participants produced more Filled Pauses when the audience was absent than when it was present. This effect is created by an extremely low (0.0) value for Um in an audience present condition. The question that is raised is whether this value is low for a specific reason (i.e. there is something special about the audience present, slow, hard condition which causes participants to avoid using Um), or whether it can simply be explained as a chance occurrence. Given that (a) there are no other effects for Um throughout the entire experiment, (b) there are no other independent effects for Audience (relating to Um or otherwise) within the experiment, (c) there are no other independent effects, or interactions, involving Audience in the Item description, and (d) the direction of the effect within the interactions involving Audience (in the Path description) is the opposite to the direction of the Um effect observed here, we feel confident drawing the conclusion that this is not a reliable effect and should be discounted as a chance occurrence. The lack of significant results for the Audience manipulation could be taken as a sign of a weak or inappropriate manipulation. However, as we have argued above, the post experiment questionnaire indicates that the listener was salient during the task and we feel that the task gives fair opportunity for Competence based disfluencies to arise in their semantic form.

Interactions. The two significant three way interactions were for Filled Pauses and Uh in the Path description. There was no similar effect for Um so we can accredit the majority of the Filled Pause effect to the Uh effect. In both cases there is an increase in Filled Pauses when the Audience is present, indicating that there may be some kind of Competence based disfluency production occurring. One other (marginal) result which warrants discussion is the Audience and Difficulty interaction for Uh in the Complete description. At first glance this appears to be generated from the three way interaction for Uh in the Path description. However, on closer inspection we can see that the direction of the disfluency increase for Audience is reversed (i.e. this interaction shows an increase in disfluency when the audience is absent). This was only a marginal result and may be another chance occurrence, but it may have an impact on the accreditation of disfluencies related to Difficulty to Performance or Competence and so is worth highlighting here.

To summarise the discussion above in the context of our predictions: we defined two variables which would relate independently to two different theories of disfluency production (i.e. Audience - Competence, Speed - Performance). We also defined a third variable (Difficulty) and correlated it with both Competence and Performance depending on how the other variables performed. We found no effects of Audience for any disfluency type despite listener salience ratings being sufficiently high. We also found little effect of Speed and concluded that the manipulation may have been too weak to produce the results expected. In addition, we raised concerns that task related artefacts or conscious speaker strategies might be accountable for the Speed effects observed. We did, however, find highly significant effects of Difficulty across Prolongations, Hesitations and Repairs which are corroborated by recent research by Schnadt & Corley (2006). In the introduction we stated that the way we accredited effects related to Difficulty would depend on the presence of other variables. However, neither Audience nor Speed was particularly revealing in terms of reliable effects so it is difficult to attribute these findings to a source. One possibility (supported by the Audience, Speed & Difficulty interactions for Filled Pauses and Uh) is that the effects observed for Difficulty are a result of participants actively responding to the very obvious cues to upcoming production problems that are provided by difficult items. If this is the case then the increase in disfluencies related to Difficulty should be defined as part of linguistic Competence. One problem with this view is that there were no independent effects relating to Difficulty for Uh anywhere in the experiment (there was a two way, Audience and Difficulty interaction but it was marginal and, as mentioned above, the increase in disfluency ran counter to the other findings so it may not be reliable). This means that the primary evidence for supporting a Competence theory is based on one three way interaction for Uh when the effects observed independently for Difficulty only manifested in Prolongations, Hesitations and Repairs. This disparity (between the disfluency types exhibiting the Difficulty effects) casts doubt on the appropriateness of using the Audience, Speed and Difficulty interactions as evidence to support the accreditation of the observed effects to linguistic Competence. Alternatively, we could accredit the effects found for Difficulty to Performance which would be more in line with (a) the findings of existing research (e.g. Schnadt & Corley, 2006) and (b) the assertions concerning the accommodation of planning problems made earlier in this paper.

We conclude that in this experiment, there was little or no influence of linguistic Competence on the production of any of the disfluency types examined here. There was also little effect of an increase of pressure applied to the speech production system. In both these cases there were marginal effects or interactions which indicate that a different experimental design (i.e. a more dialogical task) or an increase in power (i.e. a greater variation between the slow and fast conditions) might draw out greater effects. The main effect in the experiment was that of Difficulty and we accredit all effects for Prolongations, Hesitations and Repairs to linguistic Performance. We did not see significant effects for either of the Filled Pauses (Uh or Um) so, based on the findings of this experiment and despite feeling that our Audience manipulation was effective, we are unable to accredit Filled Pauses to either Performance or Competence with certainty.

General Discussion

Implications for disfluency mechanisms in comprehension.

As mentioned before, there seems to be a fairly tight correlation between linguistic and non-linguistic accounts of disfluency handling in comprehension, and the Competence and Performance distinction we have drawn here for disfluency production. For our observed effects we have accredited Prolongations, Hesitations and Repairs to linguistic Performance (i.e. they are produced as an epiphenomenon of speech production problems rather than through any intentional signalling). This corresponds to a non-linguistic account of disfluency comprehension as, if a speaker does not intend to send a message (within a disfluency), it seems unlikely that a listener would have specific mechanisms within the parser with which to decipher that message. It is more likely that the listener would make use of some non-linguistic mechanism such as delay (e.g. Corley et al., submitted) or attention orientation (e.g. Collard et al., submitted) to draw information from the disfluent item. Interestingly, in the experiment above, we found no effects which caused variation in the production of Uh or Um. We cannot, therefore, be sure that they are not Competence related forms of disfluency and that they do not correlate with linguistic based theories of disfluency comprehension (e.g. Clark & Fox Tree, 2002). Based on this uncertainty we do not attempt to decry linguistic theories of disfluency comprehension entirely. As Collard et al. have commented, it may be that a complete account of disfluency comprehension utilises both linguistic and non-linguistic methods to describe the parsing of disfluent utterances.

Experimental Development.

It has been mentioned that the above experiment is an early exploratory study into an under researched area. As such the experiment was designed with experimental control as a high priority. This meant the task chosen had to be easily controllable in terms of the level of interaction between participants. It was also mentioned that subsequent experiments might relax the experimental control requirements and increase the dialogical nature of the experiment in order to investigate the claims that some disfluencies have implied uses. These uses might only be elicited in a fully interactive task where dialogical features such as interruption or floor holding become important. Given that we have no irrefutable evidence that all disfluencies are Performance based only (as Uh and Um produced no observed variation related to any manipulation), it seems prudent to consider different tasks or variations to the network task that might allow further investigation into the production of Competence disfluencies. Initially, it would be sensible to increase the number of participants examined as this might increase statistical power enough to convert some of the marginal effects we observed into significant ones. More specifically we look to increasing the power of the Speed manipulation which showed weaker than expected effects. This could be done within the network task by increasing the variation between the fast and slow conditions (particularly speeding up the fast condition). Secondly and most importantly, we look to increasing the power and dialogical nature of the Audience manipulation. This could be done within a variation of the network task paradigm (maintaining comparability with this study on Speed and Difficulty) although it may require a re-write of the network task software.

The task could be made more communicative by, for example, having the participants alternate roles as director and follower and allowing feedback and discussion during each network. This would highlight both the presence and the role of the follower and allow for dialogical Competence disfluencies to occur. The audience absent condition could be presented as a data collection session as it was in this experiment. Great care would have to be taken when analysing the data in order to maintain comparability between the audience present and absent conditions (discounting disfluencies produced as a result of other-initiated repairs, for example). The audience manipulation could also be varied outside of the network task (although comparability with this study would be lost) by using an existing task such as the Map Task. It is difficult to see, however, how comparability between Audience present and absent conditions could be achieved in a task as collaborative as the Map Task. Possibly an alternative experimental way of distinguishing Performance and Competence disfluencies would be required.


Conclusions

The experiment conducted in this study has greatly progressed research into the question of intentionality in disfluency production. We have two main findings:

Firstly, disfluencies should not be considered as a homogeneous group as different types can be affected, to different degrees, by different circumstantial variations.

Secondly, in this initial examination, we find no evidence to support linguistic theories of disfluency production (i.e. we could not reliably accredit any variation in disfluency production to linguistic Competence). We were, however, equally unable to accredit Filled Pauses to either Performance or Competence so we look to further research concerning the production of this particular disfluency type before drawing any definitive conclusions.


References:

Abbate, M. S., & La Chappelle, N. B. (1984a). Pictures, please! A language supplement. Tucson, AZ: Communication Skill Builders.

Abbate, M. S., & La Chappelle, N. B. (1984b). Pictures, please! An articulation supplement. Tucson, AZ: Communication Skill Builders.

Arnold, J. E., & Tanenhaus, M. K. (in press). Disfluency effects in comprehension: how new information can become accessible. In Gibson, E., and Pearlmutter, N. (Eds) The processing and acquisition of reference. MIT Press.

Arnold, J. E., Fagnano, M., & Tanenhaus, M. K. (2003). Disfluencies signal theee, um, new information. Journal of Psycholinguistic Research, 32, 25-36.

Arnold, J. E., Tanenhaus, M. K., Altmann, R. J., & Fagnano, M. (2004). The old and thee, uh, new - Disfluency and reference resolution. Psychological Science, 15, 9, 578-582.

Bailey, K. G. D., & Ferreira, F. (2003). Disfluencies affect the parsing of garden-path sentences. Journal of Memory and Language, 49, 2, 183-200.

Bortfeld, H., Leon, S. D., Bloom, J. E., Schober, M. F., & Brennan, S. E. (2001). Disfluency rates in spontaneous speech: Effects of age, relationship, topic, role, and gender. Language and Speech, 44, 123-147.

Brennan, S. E., & Schober, M. F. (2001). How listeners compensate for disfluencies in spontaneous speech. Journal of Memory and Language, 44, 274-296.

Brennan, S. E., & Williams, M. (1995). The feeling of another's knowing: Prosody and filled pauses as cues to listeners about the metacognitive states of speakers. Journal of Memory and Language, 34, 383-398.

CELEX English database (Release E25) [On-line]. (1993). Available: http://www.mpi.nl/world/celex

Christianson, K., Hollingworth, A., Halliwell, J. F., & Ferreira, F. (2001). Thematic roles assigned along the garden path linger. Cognitive Psychology, 42, 4, 368-407.

Clark, H. H., & Fox Tree, J. E. (2002). Using uh and um in spontaneous speaking. Cognition, 84, 73-111.

Collard, P., Corley, M., & MacGregor, L. J. (2007). ERP evidence for attention orienting effects of hesitations in speech. Manuscript submitted for publication.

Corley, M., Akker, E., & Hartsuiker, R. J. (2007). Why um helps auditory word recognition: The temporal delay hypothesis. Manuscript submitted for publication.

Corley, M., MacGregor, L. J., & Donaldson, D. I. (in press). It's the way that you, er, say it: Hesitations in speech affect language comprehension. Cognition.

Dell, G. S. (1986). A spreading-activation theory of retrieval in sentence production. Psychological Review, 93, 3, 283-321.

Ferreira, F., & Bailey, K. G. D. (2004). Disfluencies and human language comprehension. Trends in Cognitive Sciences, 8, 231-237.

Ferreira, F., Lau, E. F., & Bailey, K. G. D. (2004). Disfluencies, language comprehension, and tree adjoining grammars. Cognitive Science, 28, 721-749.

Fox Tree, J. E. (1995). The effects of false starts and repetitions on the processing of subsequent words in spontaneous speech. Journal of Memory and Language, 34, 709-738.

Fox Tree, J. E. (2001). Listeners' uses of um and uh in speech comprehension. Memory and Cognition, 29, 320-326.

Fox Tree, J. E., & Clark, H. H. (1997). Pronouncing “the” as “thee” to signal problems in speaking. Cognition, 62, 151-167.

Lau, E. F., & Ferreira, F. (2005). Lingering effects of disfluent material on comprehension of garden path sentences. Language and Cognitive Processes, 20, 5, 633-666.

Levelt, W. J. M. (1983). Monitoring and self-repair in speech. Cognition, 14, 41-104.

MacKay, D. G. (1982). The problems of flexibility, fluency, and speed–accuracy trade-off in skilled behavior. Psychological Review, 89 , 483–506.

O’Connell, D. C., & Kowal, S. (2005). Uh and Um Revisited: Are They Interjections for Signaling Delay? Journal of Psycholinguistic Research, 31, 6, 555-576.

O’Connell, D. C., Kowal, S., & Ageneau, C. (2005). Interjections in interviews. Journal of Psycholinguistic Research, 34, 153-171.

Oomen, C. C. E., & Postma, A. (2001). Effects of time pressure on mechanisms of speech production and self-monitoring. Journal of Psycholinguistic Research, 30, 2, 163-184.

Oviatt, S. (1995). Predicting spoken disfluencies during human-computer interaction. Computer Speech and Language, 9, 19–35.

Schnadt, M. J., & Corley, M. (2006). The influence of lexical, conceptual and planning based factors on disfluency production. In Proceedings of the twenty-eighth meeting of the Cognitive Science Society.

Schober, M. F., & Carstensen, R. (2001). Do age and long-term relationship matter in conversations about unfamiliar things? Unpublished Corpus.

Shriberg, E. E. (1996). Disfluencies in SWITCHBOARD. In Proceedings of the International Conference on Spoken Language Processing, Addendum 11–14. Philadelphia, PA.

Snodgrass, J. G., & Vanderwart, M. (1980). A standardized set of 260 pictures: Norms for name agreement, image agreement, familiarity, and visual complexity. Journal of Experimental Psychology: Human Learning and Memory, 6, 174-215.

Stolcke, A., & Shriberg, E. (1996). Statistical language modelling for speech disfluencies. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 405-409. Atlanta, GA.

Swerts, M. (1998). Filled pauses as markers of discourse structure. Journal of Pragmatics, 30, 4, 485-496.

Szekely, A., Jacobsen, T., D'Amico, S., Devescovi, A., Andonova, E., Herron, D., et al. (2004). A new on-line resource for psycholinguistic studies. Journal of Memory and Language, 51, 2, 247-250.

International Picture Naming Project. [On-Line]. Available: http://crl.ucsd.edu/~aszekely/ipnp/query.html
