>> Will Lewis: We’ll go ahead and get started. I want to welcome Mari Ostendorf to give a talk here
today. I won’t go through the whole litany of her background. I do want to say that Mari is very close to
us here because she is close geographically just across the lake. It’s really nice to have her here to give a
talk today. She has a Ph.D. in Electrical Engineering, has been at UW actually since 1999, and does a lot
of work on Speech Processing. I know of a number of projects that you’ve worked on with folks over at
UW. Her stated interests are in dynamic and linguistically informed speech and language processing,
which of course is of interest to a number of folks here.
Mari and I, I actually remember, I’ve known Mari for about ten years. I remember distinctly the exact
moment I first met her. It happened to be in an interview across the table at the Faculty Club. It was a
grueling day. Fortunately, I had the opportunity to work with Mari intermittently over the next
couple of years in the development of the Comp Ling Program at UW, which was a distinct pleasure.
Without belaboring it any further Mari Ostendorf, Finding Information in disfluencies.
>> Mari Ostendorf: Okay, thank you. Alright, so I’m going to talk about how we really talk.
[laughter]
I, so I’m assuming, I know some of you actually know a fair amount about disfluencies. But I’m assuming
that some of you don’t. I’ll have a little bit of an intro here. This is a transcript from the Switchboard
Corpus, which is conversational speech. What I did here is that everything that is a disfluency I have put
in bold and crossed out; those are the classic disfluencies. Then the filled pauses, um and uh, are in
purple, to the extent they look like purple here. This is just illustrating different types of disfluencies and
the fact that they are fairly frequent.
The other thing that I want to, so I’m an Engineer, so I actually have practical reasons for thinking about
disfluencies even though I’ll give you a lot of studies that aren’t so engineering oriented in terms of what
we’re looking at. If you think about what computers would hear, if they were perfect, it’s everything that
comes out of somebody’s mouth; that’s what they recognize. There’s an acoustic signal there so the
recognizer will deal with it. The dot, dot, dots are there to indicate that we’ve got some pauses in there.
Then if you look at what people hear. People actually filter these things out. Many of those things, not
all but many of those things depending on how they’re said the listener won’t even notice. If you ask
them to transcribe the speech they will miss quite a lot of it. That is the challenge if we want our
computers to be like humans we have to figure out what to not pay attention to, okay.
But now, I’ll come back to that. The other thing I want to point out: a lot of people had this idea that
Switchboard is just really messy data, that it’s not real life because these people have no stakes in the
conversation. They’re getting paid some minimal amount to talk to a stranger for five minutes. Well,
okay, here’s a nice example. This is
Supreme Court oral argument. They are in fact more disfluent than the Switchboard Corpus.
You know disfluencies are real life. If you think about when are we disfluent? Well one of the reasons
we’re disfluent is when we have high cognitive load or emotion situations. The Supreme Court is
definitely a high cognitive load situation, okay.
Alright and then I wanted to point out here. This will be relevant to the particular approach that we’re
taking. This is one sentence, okay, from the point of view of how it’s transcribed. One of the things you
see in the Supreme Court oral arguments is that people, particularly lawyers, don’t want to give up the
floor. They make their sentences hard to be interrupted. This is going to pose problems for language
processing.
Alright, so disfluencies are common. Multiple studies have found disfluency rates of six percent or
more in human-human speech. That’s relatively high. You can’t avoid having to deal with it.
People have some control over their disfluency rate. But pretty much everyone is disfluent. Some
people are more disfluent than others. But pretty much everyone is disfluent. People aren’t usually
conscious of the disfluencies. But they do appear to use them both as speakers and listeners, and as you
are using with your robot.
Okay, so I like to argue you know traditionally in language processing people have been thinking about
disfluencies as noise. I would like to point out that they are also information. They’re noise. They’re
both, okay they’re noise because if you actually transcribed exactly what somebody said it’s hard to
read, okay. It degrades readability. The word fragments are difficult to handle in recognition. They
mess up the recognizer. The grammatical interruptions mess up the parser. If you translate them
everybody gets confused because the prosodic cues that tell the listener to ignore this aren’t there. For
all those reasons disfluencies are noise.
On the other hand, listeners use those disfluencies to understand the corrections, the attention
grabbers, the strategy changers. How they use particular filled pauses sends a message about turn
taking. Disfluencies indicate speaker confidence, and the disfluency rate reflects cognitive load and
anxiety.
It’s interesting from a human-computer interaction perspective to look at them, so detecting
disfluencies matters, from a practical perspective, for getting the intended word sequence, interpreting
the speaker’s cognitive state, and understanding the social context. As I will show, one little study that
we hope to build on suggests disfluencies can actually tell us about relations between people, power
relations.
Okay, so this has implications for spoken language technology. Human-computer interaction
particularly multi-party because you’re going to have more disfluencies that way, spoken document
processing, as well as medical diagnostics.
Okay, so that’s the introduction. Here’s what I’m going to try to talk about. What can we learn from
disfluencies? What about the speaker mental state or the social context? How do we detect them?
There’s not very much annotated data for disfluencies so how can I detect them for other types of
corpora?
I’m going to look at work on several corpora. I will call it speech-in-the-wild, borrowing from Liz
Shriberg. In this data we’re going to look at different communicative contexts, so high stakes, low
stakes, stuff like that, looking at automatic detection algorithms. A side effect of this, I think, is that
we’re learning about improving spoken language processing for different genres.
Okay, so I’m going to start out just going through some basics of disfluencies, telling you a little bit about
my data. How we do automatic detection and some studies we did with the different corpora analyzing
this data, and finally conclude.
Here are some basics. People have looked at disfluencies since the eighties at least. There have been
psycholinguistic studies that basically describe a lot of the things we are seeing now, even in a very small
amount of data. In this particular early study by Levelt they identify a bunch of different types of
disfluencies, including appropriateness disfluencies, where you want to change what you say, that’s
more strategic; error repairs, where you make a mistake; and a thing that’s called covert repairs, that’s a
particular level, well I’ll get into the cognitive levels. Covert repairs are when you say the, the, the, so
when you do repetitions. Then there’s a bunch of other repairs.
This was a very small study, but I think it’s pretty cool that this sort of stuff carries over to a lot of the
data that we’re looking at now. They propose that well-formed repairs have syntactic parallelism
between the reparandum and the repair.
The particular model that lots of people use, which builds on this and was described by Liz Shriberg, is
this notion that you have three parts. The reparandum, that’s the stuff I was crossing out. The
interregnum, that’s optional; that’s things like ums and uhs, and I means, and stuff like that. The repair,
that’s replacing the reparandum. In a restart the repair is not there. The interruption point is not
explicitly there; there’s not necessarily a clear signal of the interruption point except for the fact that
you’ve got a prosodic discontinuity.
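To make the three-part structure concrete, here is a minimal sketch of how a disfluency in this Shriberg-style model might be represented in code; the class and field names are illustrative, not from the talk or any existing toolkit.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Disfluency:
    """One disfluency in the three-part model (field names are illustrative)."""
    reparandum: List[str]                                   # the words being replaced, crossed out above
    interregnum: List[str] = field(default_factory=list)    # optional editing phrase: um, uh, I mean, ...
    repair: List[str] = field(default_factory=list)         # the replacement; empty for a restart

    @property
    def is_restart(self) -> bool:
        return not self.repair

# "to remove, uh I mean, to review": reparandum + interregnum + repair.
# The interruption point falls right after the reparandum; it has no word of
# its own, only (in the audio) a prosodic discontinuity.
d = Disfluency(reparandum=["to", "remove"],
               interregnum=["uh", "I", "mean"],
               repair=["to", "review"])
print(d.is_restart)   # False
```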
Here’s an example, the very first one from the first slide I showed you. This is the annotation of that
crossed-out version, where I have purple for the reparandum. Here’s my interregnum, so the person
is saying "or" to indicate their rephrasing here. Of course you can have disfluencies inside disfluencies,
so that’s what’s happening here. This second one is a repetition, so that’s basically illustrated. Yes?
>>: Why do you necessarily think that the stand they have equals not intended?
>> Mari Ostendorf: Huh?
>> Why do you think that the phrase the stand they have was unintended?
>> Mari Ostendorf: This is something that would probably be categorized as an appropriateness
disfluency. The person wants "the way they command respect" to replace "the stand they have." In
reading the broader context of this…
>>: You get it from the broader context, okay.
>> Mari Ostendorf: That, well actually you can get it from listening to it too.
>>: Yeah, yeah, processing, sure. But, right in terms of text there I don’t see any reason that that’s
considered a disfluency.
>> Mari Ostendorf: These transcriptions are all from Switchboard, and they were all based on audio.
The transcriptions are based on audio.
>>: Yeah [indiscernible]?
>> Mari Ostendorf: The way that I would get this from text alone, so ideally you want to do the
automatic detection with audio, which I’m not doing right now, okay, which I’ll explain. The way that I
would get it is the fact that there’s a repetition here. Not just the, the, the, yeah, okay. Anyway this is
the cleaned up version, okay.
Categorizing: there are three categories that have been used in a lot of the recent work based on this
simple surface-level labeling. A repetition is when the reparandum equals the repair. A restart is when
there’s no repair. A correction is when they’re not equal, alright. Earlier work by Levelt and Cutler had
finer-grained intention categories. I actually think this is the really interesting stuff in terms of
analyzing interactions. But at the moment most of the work is really aimed at the surface level; that’s
where we’re going to start. In fact most of the work is aimed only at finding the reparandum.
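A minimal sketch of how those three surface categories follow from the reparandum/repair comparison; the exact matching criteria used in the annotation may be more nuanced than this.

```python
from typing import List

def surface_type(reparandum: List[str], repair: List[str]) -> str:
    """Coarse surface label from the reparandum/repair comparison (sketch)."""
    if not repair:
        return "restart"        # the phrase is abandoned with no replacement
    if [w.lower() for w in reparandum] == [w.lower() for w in repair]:
        return "repetition"     # reparandum and repair are word-for-word the same
    return "correction"         # everything else: repairs, elaborations, mistakes

assert surface_type(["the"], ["the"]) == "repetition"
assert surface_type(["we"], ["you'd"]) == "correction"
assert surface_type(["so", "we", "want"], []) == "restart"
```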
These are just some examples. Repetitions, the, the, the, it’s, it’s, those are very frequent. Repairs, there
are a lot of different types of repairs: "I just, I", so the person is getting rid of the "just"; "we, you’d", so
there are strategic things; "so we want, in our area we want", so there are elaborations. There are also
mistakes; I don’t have an example here, but "we, you’d" would be a mistake. You can also get words
being wrong, lexical access mistakes, so there are a lot of different types. Restarts, this is an example,
and you can have them nested.
Okay, just to show here: the interregnum could be filled pauses, or things like "I mean", or discourse
markers like "well". Mostly people have been detecting these things and throwing them away. But one
reason you don’t want to just throw the reparandum away is that sometimes, not hugely often, the
word that you want, "my insurance" here, is in the reparandum but not in the repair. You actually need
to not throw that away.
Question?
>>: I was wondering on the [indiscernible] most people have been working on the surface result
because, I mean you said is that because the goal of the work is usually just to clean up transcripts.
That’s all that people care about.
>> Mari Ostendorf: Yes, yep, absolutely. And also it’s easier, right. One of the things that Liz Shriberg
pointed out in her thesis is that there are a lot of different viewpoints on how you categorize below the
surface. The surface stuff is pretty non-controversial.
Okay, alright, so what makes people disfluent? I’m only going to talk about, so there are the strategic
turn taking disfluencies. But what I’m going to focus on here is cognitive load and situational stress.
Cognitive load is when you have many things on your mind. People talk about disfluencies at different
levels. There’s the articulation disfluency, basically when your production and your planning get out of
sync; if your production is ahead of your planning you’re out of sync.
At the articulation level, that’s when you say the, the, the, when your production is ahead of your
planning. At the lexical access level sometimes you make slips like review, renew, or something like
that, lexical access errors with things that are similar sounding, or a wrong tense, or a versus an. Then
there’s the overall planning level. That’s where you make some of these appropriateness, strategic
changes.
With cognitive load you have more disfluencies because you have multiple things on your mind.
Situational stress: I’d say cognitive load seems to affect everything, while high stress, from the data I’ve
looked at, seems to produce more of the lower level disfluencies. The interesting thing is you kind of
have low stress and high stress. If you have really low stress, very casual conversations with people you
know, they’re more disfluent. If you have very high stress, high stakes situations, it’s more disfluent,
and it’s kind of lower in the middle.
All this comes with a caveat, based on the data I’ve looked at: there’s such vast variation in speaker
disfluency rates that individual speaker variation is a huge effect, and you need to look at a lot of data
to say anything.
>>: Why would the low stress situation increase disfluency? Is it that you just don’t care or you’re
thinking about other things when you’re engaged?
>>: Yeah you’re not using most of your brain.
>>: Okay.
>>: You don’t care.
>> Mari Ostendorf: No, you know, okay, so some of the very disfluent stuff that we’ve looked at, where
we had to totally change how we were annotating, was CallHome, the CallHome Corpus, or CallHome
and CallFriend. Because the person you’re talking to understands what you’re going to say, you don’t
finish your sentence.
>>: Oh.
>> Mari Ostendorf: You can be very sloppy because the person knows you very well. It’s a very
grounded conversation.
>>: I see, okay, so do you get then disfluencies that don’t in fact have the correction?
>> Mari Ostendorf: Right, right.
>>: Okay.
>> Mari Ostendorf: You actually get the word that finishes the sentence in the next sentence by the
other person.
>>: Yeah, yeah.
>> Mari Ostendorf: That’s very tricky data. Actually we see this in some data that was collected at UW
recently by Gina-Anne Levow and Richard Wright on negotiation, so when people are negotiating about
something. Also, if they have a shared visual world, they may refer to something that’s in the picture, in
the shared world that they both are looking at, but they don’t actually say it. All of those things cause
disfluencies as well.
>>: Is there any data that you looked at that shows sort of some consistent pattern across languages?
>> Mari Ostendorf: I’ve only looked at English. That’s pretty hard as it is. Okay…
>>: Do we get to hear only about native English speakers?
>> Mari Ostendorf: No, I am talking about the English speakers in corpora we have. Some, most of
them are native but not all.
>>: Non-native could cause some additional factors. [Indiscernible] your language may have a different
kind of structure.
>> Mari Ostendorf: Right, right, most of them are native but there’s no guarantee. Okay, so the
interesting question is: are these factors reflected in different disfluency types? I argue yes, by the way,
based on the data that I’ve had. But clearly it needs more analysis.
Okay, so I talked to you about the cognitive models already. Basically there’s the content planning, the
lexical access, and articulation. What’s happening is generation or production gets out of sync with
planning. You also have these strategy changes, which I think are really important, something you see a
lot in the Supreme Court data.
There’s a question of whether the different types of disfluencies reflect different types of problems,
content, lexical access, articulation, or different solutions to problems. It has been argued by, I think it
was Herb Clark that it’s different solutions to problems. I would say looking at this data it’s probably a
bit of both.
Here again are articulation disfluencies; you see these repetitions. Lexical access, okay, here is the
actual example, to remove, to review. A, an, that’s lexical access, because you will tend to use a in your
planning since that’s the most common thing. You see this a lot in languages that have gender with
their articles. Here you can tell this is a Supreme Court one, by a candidate, by a contributor. They’re
talking about two different parties and mixing up those parties; that’s a lexical access error.
Then the content planning, it’s a mix of things where you’re clarifying, your, yes?
>>: How do you know that that’s not a, by a candidate, by a contributor. How do you know that the
speaker is not purposefully saying something more about what he wants to say? Augmenting, it’s
parallelism. How do you know it’s not parallelism?
>> Mari Ostendorf: Well in this case they’re talking about campaign contributions. Generally the
candidate, they’re talking about, it’s a case about who can contribute and limits on contributions. In this
case, particular context it wouldn’t make sense for this to be the same thing, a clarification. But you
raise an excellent point that this is a very hard problem. It’s not an easy natural language problem. The
other thing, though, is that if you listen to it, the particular way somebody says it is different for those
two things. That’s a reason why we really need to get to the point of being able to include the audio.
Okay, so anyway I just wanted to point these out, because for me these give reasons why we want to
look at the endpoint and look at these more complex disfluencies: the more complex disfluencies are
telling us about the situation. In this case the lawyer says "where we are arguing, where it is our
position", so there’s something strategic going on here. I don’t know what it is because I don’t know
legal stuff. Here they’re expanding; they’re emphasizing, "right" to "full right". But one of the things
that you also see quite a lot, but only for lawyers and not for justices, is hedging.
[laughter]
Okay, distributional observations. A lot of work has gone on looking at different disfluencies. One of the
things that’s very interesting: in human-computer interaction in the old days there were very few
disfluencies. I think that’s because people are concentrating; it’s less informal, you don’t trust your
computer to understand you. Now that recognizers work better you can compare data that was
collected in the ATIS days with data that’s collected now, and it’s much more disfluent now. The more
human-like our systems become, the more disfluencies they’re going to have to encounter.
Men are more disfluent than women. Disfluencies are more frequent in initial positions; the argument
there is that’s where your cognitive load is highest from a planning perspective. Speakers are less
disfluent before a general audience than before people they are familiar with, people who are familiar
with the topic. I’m being pretty disfluent with you because I think that you know sort of what I’m talking
about. If I were giving this in front of a general audience, non-tech people, I would be less disfluent.
>>: What data is the source for men being more disfluent than women?
>> Mari Ostendorf: Tons of studies. Tons of studies, and it’s also consistent with the fact that men
stutter more than women.
>>: With relation to the first point, do you know if we’re less disfluent when we speak to like children or
babies, or some other thing that we think will have a harder time understanding us?
>> Mari Ostendorf: Based on the first point I would say yes but I’m not aware of the studies.
>>: Yes, so all the things you are talking about in terms of the familiarity of the audience. Actually there
was a big theory, right, in the nineties, the hyper-hypo theory. How does that fit into that pre-planning
and production sync we need, or the overall cognitive load?
>> Mari Ostendorf: Right, I’m sure it’s related; I’d have to think about it. Okay, there is a study that says
filled pause rates are less sensitive to task complexity and anxiety than other types of disfluencies.
That’s not consistent with our data. Again, you’ll see that one of the things that we have here is huge
variation among speakers. If you’re looking at a small number of speakers you could be concluding
something that doesn’t necessarily generalize.
Okay, alright, so disfluencies are prosodic interruptions; they’re fundamentally prosodic. I’m going to do
a horrible thing, as somebody who has worked on prosody for many years of my life, and not use
prosody in this work. Several studies show a higher F0: one of the ways you can tell there’s a disfluency
is you reset your pitch afterwards because you’re resetting the prosody. There’s sort of matching tonal
structure because you’re restarting the same thing. The repair structure also reflects the speaker’s
intention, so it produces a predictable prosody. There are lots of reasons why people are interested in
prosody. In the data that I have it’s a little bit hard to get reliable prosodic features, so I’m not using it.
You can do a lot with text, but long term that’s where it has to go.
Okay, so most prior studies have used either controlled speech, in psycholinguistic studies where they
want to figure out what’s causing disfluencies, some "move the purple square, move the yellow square"
types of things; or low stakes conversation such as Switchboard, and much of the work on disfluencies
has been on Switchboard; or human-computer interaction. In the old days that had fewer disfluencies;
nowadays it could be more interesting. But there’s not a lot of annotated data, unless you guys have it.
[laughter]
We are trying to use multiple corpora with varied stakes to get an idea of what’s going on here. We’ve
got, well let me tell you about it, and to develop algorithms which generalize. I have two telephone
conversation corpora. We have the Switchboard data which LDC has annotated. We augmented the
annotation for a small amount because we’re interested in different types. I have CallHome which is
family members.
Then I have two high stakes, goal-oriented sets of data. The US Supreme Court oral arguments, so
there’s more than fifty years’ worth of that. We’ve got one year of LDC-annotated careful transcripts
and a subset of that, a smaller number of careful transcripts done by students at UW with more detailed
disfluency markings.
Then we have the financial crisis hearings. These are pretty interesting. We have two hearings with
Blankfein: one when he was doing well and kind of a golden boy, and then later when he’s on the hot
seat and being blamed for doing bad things. They are very different. If I played you one audio clip, that
would be the one to play; it’s very interesting.
Okay, and then we have this ATAROS data that was collected at UW; it’s two-person negotiations with
controlled variation of stakes. Yep?
>>: On the telephone conversations do you also have annotated when two people are talking at the
same time?
>> Mari Ostendorf: Yep.
>>: Is that also a source like when somebody else butts in does that start disfluencies on the initial
speaker? I mean do you have any data on that? [inaudible]
>> Mari Ostendorf: I have not, actually; I could get data on that because we do have the timing for
Switchboard. We don’t have it for CallHome. I think we have it for the ATAROS data. We don’t have the
timing for these other corpora because the recordings are so noisy the forced alignments are garbage.
I need better recognition to do this. That’s why I’m not doing prosody right now.
>>: I would have thought that in the SCOTUS Corpus that there wouldn’t be any interruption.
>> Mari Ostendorf: Oh, there is and it causes disfluencies.
>>: Okay, that’s positive.
>> Mari Ostendorf: Yeah, so you, but it’s causing disfluencies in the sense that the lawyers interrupt, or
the justices interrupt the lawyers, and the lawyers start getting stressed out. That’s what anecdotally I
perceive from; I actually did a bit of the annotation myself because I wanted to understand it.
Okay, so we’re trying to vary stakes and yes?
>>: I’m sorry, the Supreme Court arguments, are these recorded and transcribed, or are they the official
"record" transcriptions?
>> Mari Ostendorf: That’s a great question and I’m going to talk specifically about that. They are
recorded. They are not high quality recordings, hence the challenges. They have been done differently
over the years, which is another challenge for speech recognition: nowadays it’s MP3, in the old days
it’s these big audio tapes. The transcriptions are not careful transcriptions. They are transcripts that
some legal scholar would be interested in. That’s going to complicate the study.
Okay, so I’m going too slow. For Switchboard, there are just a couple of things about the annotation
that are worth noting. If you’ve ever worked with the LDC data, they have a nested annotation
structure. We flattened that because it provides a better representation of these repetitions and we
want to explicitly model those.
To get this flattened representation we hand annotated a small amount, then we automatically
predicted the flattened representation. Because you have this structure it’s just a matter of taking these
things out; the F score is really high. You can do that automatically really easily.
For Switchboard, particularly when we do the speaker variability stuff, we have twenty speakers who
called a lot. The only way we can look at speaker variability is if we have a lot from one speaker. We can
look at it in the Supreme Court, and we can look at it in this data.
Okay, other stuff: we’ve hand-annotated a small amount of data for everything. Some of this data was
used for SCOTUS training. Most of the data is only used for testing, so it’s all cross-domain stuff that
we’re doing.
Okay, so some similarities across the different corpora. Word fragments, we find, are most often
associated with repetitions. This is counter to what has previously been reported in the literature; it’s
not so often associated with repairs. In the very informal stuff it’s often associated with restarts or
abandoned things. Most of the disfluencies involve function words and simple one- or two-word
disfluencies. Observations from prior work mostly hold, so higher rates sentence-initially, stuff like that.
But the thing that just blows you away is the large amount of inter- and intra-speaker variability; it’s
really over a continuum.
Some differences: the more high stakes SCOTUS and FCIC have more of these very long strategic types
of disfluencies. The relative rate of repetition is highest for SCOTUS and lowest for CallHome. There are
differences in the types of things we get as a function of these contexts; we need to tease things apart a
little more to understand it better.
The thing that really surprised me is that the statistics for CallHome and Switchboard are very different.
I think of them as conversational, informal, they should be similar, but no. Friends and family versus
strangers seems to make a big difference. A lot of this is the shared world effect. The other thing is I
discovered that the interregnum can play an interesting role. When lawyers are talking to justices they
can use "Your Honor" as a discourse marker; there’s a politeness thing. All of this is based on hand
annotated data.
Okay, now I’m going to talk about automatic detection. The computational model we’re using is really
quite straightforward. It’s been used by other people. We’re going to use a sequence model, in
particular a conditional random field. It’s basically like tagging, so we’re going to label each word in
terms of what part of the disfluency it is.
Now, previously there’s been a lot of work, including stuff I’ve been involved in, looking at parsing
models. I think this is really a way to go. Unfortunately, as I showed you with that SCOTUS sentence, our
parser can’t handle it yet. Until we start doing some adaptation or find a way to do a better job parsing,
that stuff is a little bit on hold. The features that have been used are word-based features or words and
prosody. As I say, prosody is actually important for detecting interruption points, but in this particular
work we’re not going to use it.
That’s what we’re doing. Here’s the disfluency detection model. We do a very simple begin-inside-outside
sort of model, but we add the interruption point, so we know the end of the disfluency, because
the interruption point really matters when you’ve got multiple disfluencies in a row; and we have an
"other". That’s the starting point. The next thing we do is repeat this for different disfluency types.
Then the next thing we do is add some states for the correction, okay.
Baseline features are totally standard stuff: part-of-speech tags, and other type indicators, filled pauses,
stuff like that; discourse markers, so you look at the word and the words around it; a word fragment
indicator, you wouldn’t necessarily have this in speech recognition, but we’re using it because we’re
trying to understand what’s going on; and pattern match features, so you’re trying to say, did I have the
same word, same phrase, or same part-of-speech sequence right in a row, okay.
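Here is an illustrative sketch of per-token features of this kind, written as a dictionary of the sort a CRF tagger consumes; the word lists, feature names, and window size are assumptions, and the POS tags are assumed to come from some external tagger.

```python
FILLED_PAUSES = {"um", "uh"}               # illustrative lists, not the ones from the talk
DISCOURSE_MARKERS = {"well", "so", "like", "i mean", "you know"}

def token_features(words, pos, i, max_dist=4):
    """Build the per-token feature dict a CRF tagger would consume (sketch)."""
    w = words[i].lower()
    window = [x.lower() for x in words[i + 1 : i + 1 + max_dist]]
    feats = {
        "word": w,
        "pos": pos[i],
        "prev_pos": pos[i - 1] if i > 0 else "BOS",
        "next_pos": pos[i + 1] if i + 1 < len(pos) else "EOS",
        "filled_pause": w in FILLED_PAUSES,
        "discourse_marker": w in DISCOURSE_MARKERS,
        "fragment": words[i].endswith("-"),   # word fragments are often transcribed with a trailing "-"
        # Pattern match: same word or same POS recurring shortly after this one,
        # a crude signal of reparandum/repair parallelism.
        "word_match": w in window,
        "pos_match": pos[i] in pos[i + 1 : i + 1 + max_dist],
    }
    return feats

# Example: features for the first "to" in "to remove uh to review the file"
words = "to remove uh to review the file".split()
pos = ["TO", "VB", "UH", "TO", "VB", "DT", "NN"]   # hand-assigned for illustration
print(token_features(words, pos, 0))
```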
>>: No acoustic features, no acoustic features.
>> Mari Ostendorf: No acoustic features because the time alignments are so bad. Okay, eventually I will
get there. But right now this study is limited in that way. But it makes the experiments run faster.
Okay, so the types of things we’re looking at: multi-domain learning. We have separate models of
repetitions and repairs, because the repetition model actually carries across domains really well and the
repairs are different, so that tends to hurt us in some ways and help us in others. One thing that other
people have not done is we’re explicitly detecting the correction, because we want to know about the
relation between the reparandum and the repair, not just throw it away, so it’s not just about text clean
up. Punctuation as a surrogate for prosody is something we’re using on the Supreme Court data, and
we’re looking at using semi-supervised learning because there’s so little data that’s transcribed.
Here are a couple of the things that we did; I’m just going to give you the final story, but just to tell you
how we got better results. One, we tried adding the SCOTUS data to the Switchboard data; Switchboard
is used for everything. The SCOTUS data helps SCOTUS but surprisingly it did not help with the FCIC.
I didn’t expect it to help with other things, but it didn’t even help with the FCIC.
Better leveraging of similarities: what it did help with is coming up with words that are common in
disfluencies. We looked at what’s in Switchboard and what’s in SCOTUS. We made this list of common
disfluency words and trained a language model based on that, to say okay, if I see this sort of common
pattern it’s likely to be a disfluency. There are some things that you see a lot, like so, so and the, the;
you can get those with pattern match really easily. But certain other types of things are common but
you wouldn’t necessarily get them as a pattern match.
Okay, and then the last thing we added was distance-based pattern match features: how far away is the
pattern match, because in certain corpora you have longer disfluencies than others. If you don’t
condition the pattern match on distance, the problem is the farther-away pattern matches are less
reliable. That’s the point of that.
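A small sketch of how the distance-based variant differs from a plain pattern match: the match fires a separate feature keyed by its distance, so the learner can weight near and far matches differently. The function name and window size are illustrative.

```python
def distance_match_features(words, i, max_dist=6):
    """Distance-keyed pattern match (sketch): fire one feature per distance at
    which the same word recurs, so the model can learn that far-away copies
    are weaker evidence of a disfluency."""
    w = words[i].lower()
    feats = {}
    for d in range(1, max_dist + 1):
        if i + d < len(words) and words[i + d].lower() == w:
            feats[f"word_match_dist_{d}"] = True
    return feats

# "is that the practice in, is the general practice": a far-away copy of "is"
print(distance_match_features("is that the practice in is the general practice".split(), 0))
# {'word_match_dist_5': True}
```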
Okay, so putting it all together here’s where we stand.
>>: Can I ask just for a second? The longer distance ones are occurring in what?
>> Mari Ostendorf: If you have, so you can get in the SCOTUS Corpus we can get five and six word
complete repetitions.
>>: I see.
>> Mari Ostendorf: In Switchboard that’s rare, okay.
>>: Okay.
>> Mari Ostendorf: They’d be one or two words.
>>: [inaudible], and in the more formal SCOTUS you’re seeing this kind of thing more than in the less
formal Switchboard. Am I correct in that?
>> Mari Ostendorf: Basically the idea is that if you say, okay, fire this feature if I have a pattern match
anywhere, the problem with that is it’s going to over-generate. By making the feature distance-dependent
you can put less weight on the far-away matches.
>>: Right, no I’m just trying to get the sense of where the longer distance…
>> Mari Ostendorf: Things happen, so they happen in the higher stake stuff.
>>: Okay, okay, that was your answer for that.
>> Mari Ostendorf: Yeah. Okay, so these particular results are only using Switchboard to recognize
other things. These are our best results, with a bunch of improvements. What you can see is that
obviously you do best on Switchboard. But the cool thing is Switchboard works not too horribly for
everything. The other thing that’s interesting is the reason why things work well. This is the number
that everybody reports: did I detect the reparandum, no matter what it is? Okay, if you decompose that
into repetitions and everything else, repetitions are very easy to detect, the, the, okay. SCOTUS has a lot
of repetitions. These other corpora have fewer repetitions, so hence Switchboard works really well for
SCOTUS. This number is kind of hiding that fact; it’s going to be biased by how many repetitions the
data has.
Okay, if you try to detect other things, that’s harder; not surprising. The thing that has not been done
before is actually detecting the correction, because if you just want to clean up, people don’t care about
the correction. But if you want to understand what’s going on, you care about the correction, and
that’s really hard; knowing the endpoint is hard. We don’t do as well on that as we do at just detecting
the reparandum.
Okay that gets that across. Okay, so here’s, yes?
>>: Well I don’t understand why [indiscernible] can be as low as eighty percent? What is it that’s
getting false positives?
>> Mari Ostendorf: Yeah, so for example "that, that" sometimes is a disfluency and sometimes is not a
disfluency, right.
>>: Is it really that frequent?
>>: Oh, yeah, yes.
>>: This is all on transcripts, right, not on ASR?
>> Mari Ostendorf: This is all on transcripts. All of those numbers would go down on ASR.
>>: Still surprised that it’s eighty percent, though.
>> Mari Ostendorf: You can see it’s even lower in CallHome; the more informal stuff is. But you can
hear the difference if you use the audio. If I had the audio I think this would be much higher, if I had the
audio and did a good job of extracting the prosodic cues, which is another thing.
Alright, okay, so this is new stuff that we’ve been doing. We’re unediting the SCOTUS transcripts. The
question was asked earlier about the transcriptions. When LDC does a careful transcript they get
everything the person said. When you look at what was transcribed in SCOTUS, sometimes they would
get the repetitions and sometimes they may leave them out. Sometimes they’ll put in commas to
indicate there’s something going on. They will put in dot, dot, dot; they’ll do that for repetitions, and for
um, um, um, um, those sorts of things. There are no ums and uhs anywhere in these fifty years, or at
least in the twenty years of SCOTUS transcripts that I’ve looked at.
The thing that’s frustrating is we’ve got all of this data, fifty years’ worth of data. I could look at
disfluencies across time, speakers across time; Roberts was a lawyer, and then he was a justice, and we
can look at the difference, actually we have. I could do all of these things if I could use all this data.
How can we get it when it’s not there? Well, here’s what we’re going to try to do. When I gave this talk
at Penn, Mark Liberman told me I was doing unediting; that’s what I’ve now called it. This is only based
on text. I’ve been talking to people about a couple of things we can do based on acoustics, which are
very exciting. Hopefully I can come back in a couple of months and tell you about them.
We’re looking at orthographic cues, particularly punctuation. What we did was automatically match the
training data to this format; we’re adjusting the training data. We have parallel versions of some of the
SCOTUS data, the careful and the official transcripts, and we learn where you would insert things and
take things away. We adjust the Switchboard data to map it to the SCOTUS style. Then, since we have
all this data that’s not carefully transcribed, we can do semi-supervised learning. The other thing that
made a difference is explicitly modeling repetitions separately from repairs, because the mismatch
between corpora is different for those two.
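To give a feel for what mapping careful training data toward the SCOTUS transcript style might involve, here is a toy, rule-based approximation; the actual system learns where to insert and delete material from parallel transcripts, whereas the rules and probabilities below are invented purely for illustration.

```python
import random

FILLED_PAUSES = {"um", "uh"}

def unedit_style(words, ellipsis_prob=0.5, seed=0):
    """Toy approximation of mapping a careful transcript toward the SCOTUS
    transcript style: drop all filled pauses, and either drop or mark immediate
    word repetitions, sometimes leaving "..." behind the way the court
    reporters do. The rules and probabilities here are invented."""
    rng = random.Random(seed)
    out = []
    for w in words:
        if w.lower() in FILLED_PAUSES:
            continue                         # no ums or uhs appear in the official transcripts
        if out and out[-1].lower() == w.lower():
            if rng.random() < ellipsis_prob:
                out.extend(["...", w])       # keep the repetition but mark it with dots
            # otherwise drop the repeated copy silently
            continue
        out.append(w)
    return out

print(unedit_style("well uh the the the statute is is clear".split()))
```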
Okay, so here’s what we got. We started out just using Switchboard. This is just doing the reparandum;
it’s not doing the correction, because that was harder to deal with in terms of the mapping. If you just
use Switchboard to find the reparandum, this is where we start, with the original Switchboard, okay. If I
add the careful SCOTUS data, so it’s mismatched, and then do the dot, dot, dot, blah, blah, blah, I get
forty-one; I go basically from forty to forty-one.
If I transform Switchboard using what I learned about the difference, it gets up to fifty-eight, pretty nice.
If I add self training that gives me a little bit more. If I use this different representation of the two
different disfluency types I get up to sixty-two point two. This is a really nice big jump.
The thing that kind of blows me away is that we’re doing better now than when we were using the
SCOTUS data that was all careful: we’re now at sixty-two point two versus forty-one point six. There are
multiple reasons for that. One is I’ve got more SCOTUS data, but you can see the self training isn’t
giving me that much. I need to figure out what’s going on to make this big difference.
The other thing that you can see here is that the thing that is the hardest in the original is the recall. If I
detect something, say a the, the that’s there, I can be pretty sure that I trust it, alright. But the recall is
what we mostly improve.
>>: I’m confused about the annotation. There’s the transcribed data that’s not annotated, then there’s
the LDC careful data, then there’s the unedited, right. This is being evaluated on unedited?
>> Mari Ostendorf: What we did was we took the careful transcript, so the test data has the disfluency
locations in the careful transcript aligned with the non-careful transcript. I know where they are even if
they don’t exist in the non-careful transcript.
>>: Right, but is the careful one, is that what LDC provided for one year or is that what you guys
provided?
>> Mari Ostendorf: It’s both.
>>: Okay, but the SCOTUS one that’s the fifty years of original stuff?
>> Mari Ostendorf: Right.
>>: Okay.
>> Mari Ostendorf: Right, so we took the careful transcript that’s been hand annotated for the
disfluencies, aligned the thing because they have different words, obviously. Do the alignment, transfer
the annotations so we know where the reparandum is, and that’s what we used for the target.
>>: Okay.
>>: You’re doing this only on the test data, or on the training data?
>> Mari Ostendorf: On the test, you do the transfer on the test data. We also do the transfer on a
separate small set of, because we don’t have a lot of this annotated, okay. Small set of training data
which is where we learn what the mapping between them is. Then apply that mapping to all of
Switchboard.
>>: All of Switchboard?
>> Mari Ostendorf: Right, so the thing we’re transforming is Switchboard. We’re adding punctuation
and taking away words from Switchboard.
>>: Oh, you’re saying making Switchboard look like SCOTUS, not the other way around?
>> Mari Ostendorf: Right, right.
>>: Okay.
>>: But in this setup, does that mean that the reason for the low original score is just mismatched
annotation standards rather than [inaudible]?
>> Mari Ostendorf: No the original is based on the matching careful transcription. It’s not a mismatch in
annotation standards.
>>: For the 41, are the training and the test completely comparable in terms of how they transcribe
and annotate the…
>> Mari Ostendorf: Right, so the transcription, well I mean different people who did it at LDC did this
you know decades ago. It’s not the same people but to the extent that you’re following the guidelines
it’s the same.
>>: Right.
>>: I still don’t quite understand the event that you’re measuring, that those precision and recall
numbers are on. Is the task that you have to put in the original disfluency?
>> Mari Ostendorf: That, find the words, you want to find the words that are in the reparandum.
>>: But you’re not…
>> Mari Ostendorf: Each word counts.
>>: You’re not allowed to insert a new, a disfluency at any point in text, right?
>> Mari Ostendorf: If you miss a word that’s in a reparandum, that’s hurting your recall. If you say
something is a disfluency when it’s not, that’s hurting your precision. It’s a word level measure. If your
disfluency has three words in it then that counts for three. It’s not a disfluency level number, it’s a word
level number.
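As a concrete illustration of that word-level scoring, here is a minimal sketch; the function and the toy example counts are illustrative, not taken from the talk.

```python
def word_prf(gold, pred):
    """Word-level precision/recall/F1 for reparandum detection: gold and pred
    are parallel lists of booleans, one per word (True = in a reparandum)."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall    = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# A three-word reparandum counts three times, as described above.
gold = [True, True, True, False, False, False]
pred = [True, True, False, False, True, False]   # one missed word, one false alarm
print(word_prf(gold, pred))   # approximately (0.667, 0.667, 0.667)
```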
>>: If a regular person were to do this annotation, would they get [indiscernible]?
>> Mari Ostendorf: Probably not.
>>: Probably not.
>> Mari Ostendorf: No because we’ve got, there’s disagreements, there’s disagreements for sure.
>>: [indiscernible], what’s [indiscernible], what’s at the weight level [indiscernible]?
>> Mari Ostendorf: You know I don’t know for the LDC data, so I don’t know. I mean it was done a long
time ago. I’m not sure there’s a paper on it, so I don’t know. I know that they went through a bunch of
iterations on it. I know that there was some discussions on simplifying it. For the later, so there’s two
phases of LDC annotation. For the later phases they dropped the recursive structure.
Okay, alright, so that’s that. I’ve gone too long, so I just want to give you a few data analysis things, with
a caveat: these results were obtained before the unediting, which just finished; it’s very recent stuff. All
of the SCOTUS stuff here is on twenty years’ worth of SCOTUS. At the time it was a slightly different
version of the detector, with higher precision, at about point eight, but much lower recall.
These numbers all need to be updated, but you can still see pretty interesting trends. This is just
showing disfluency rates; each X is, let’s see, I think each X is a speaker? No, a case or a conversation.
No, no, it’s a speaker, sorry. Basically what this shows, these are filled pauses at the bottom, then
repetitions, and other disfluencies. You can see speakers vary quite a lot in both corpora in terms of
how disfluent they are. Okay, that’s the main thing.
The other thing, if you look at it from this perspective, is that the intra-speaker variability is huge. This is
going across speakers and this is within speakers. There’s a huge amount of variability. If you were
trying to understand what’s going on and you were working with Scalia versus O’Connor, you would get
very different conclusions. It’s important that we have an understanding of speaker variability in order
to make any comments about the effect.
That’s why for the Blankfein stuff we have the before and after, and stuff like that. The main point is it’s
tricky because it varies hugely across people.
>>: Also the three types don’t correlate at all [indiscernible].
>> Mari Ostendorf: Oh, yes thank you, thank you. That was the other very important thing. I cannot
predict how disfluent somebody is going to be by just looking at their filled pauses. Filled pauses are
easy to detect. Repetitions are fairly easy to detect. I would like to be able to predict how disfluent
they’re going to be, from one of these I can’t. That’s a, thank you.
So, disfluency rate is roughly log-normal for a speaker, cross-type prediction is difficult, and controlling
for speaker is important. Just a couple of fun things: if you look at the stress effect, Blankfein has more
repetitions and more filled pauses in the higher stress hearing. In the weak versus strong stance task for
the ATAROS Corpus that we are looking at, the negotiation problem solving task, there are again higher
repetition rates in the strong stance case, okay, so the higher stakes case. There are also higher filled
pause rates, particularly for men. The stakes seem to cause people to be more disfluent, as you might
expect.
Now we tried looking at judges in cases with unanimous versus close votes. For the close cases we were
thinking, okay, what’s happening if they’re close and you’re on the losing side? The highly disfluent
speakers are more disfluent. The less disfluent speakers are less disfluent. I don’t know what to do
about that, okay. But again I have to revisit this with our new improved detection. Some highly
disfluent cases have nine-zero votes. Maybe they’re just being more informal; I don’t know what it is.
>>: Do you have the freedom of promise? I thought he literally said only a few tenths of…
[laughter]
>> Mari Ostendorf: You notice he wasn’t on the list.
>>: Yeah, yeah, probably.
>> Mari Ostendorf: Yes, no, no, actually.
>>: But this is a highly emotionally charged one too.
>> Mari Ostendorf: Yeah, so there is no thing, but in fact, did I have a thing about?
>>: Yeah…
>>: [indiscernible] at the bottom.
>> Mari Ostendorf: Yeah.
>>: I guess that’s the only case he actually said anything in, though.
[laughter]
>> Mari Ostendorf: Yes, for this particular one it’s a great thing to analyze. But I don’t have him in other
studies because there’s just not enough to normalize. But it’s very interesting the types of disfluencies
that are in that case because it tells you exactly what the hot buttons are.
Okay, so this is the Blankfein stuff. We have the early hearing and the late hearing; this is just giving you
details on the differences. The interesting thing here is if you look at where the disfluencies are
happening, they’re mostly happening sentence-initially, at the content planning stage. He’s thinking
more about what he’s going to say, for obvious reasons.
Okay, this one I love. The repairs reflect the power dynamics. The hedging repairs are mostly used by
lawyers, as I said. Here are some examples: changing "I think that" to "I don’t disagree with that except
to the extent that I think that."
[laughter]
This is a classic.
>>: Double standard.
>> Mari Ostendorf: Let’s see, this is another one I love: "so many to, or maybe not so many but many."
[laughter]
Politeness: the lawyers tend to be polite, the justices aren’t, so "I’m sorry" is the interregnum for
lawyers. The other thing we looked at, which is fascinating, is entrainment. You do not see it in the
justices. This is Scalia, who’s interesting because he’s so disfluent. These are different cases; if you look
at the repetition rate of the case versus the repetition rate of Scalia, there’s no trend here. If you look at
the lawyers, here’s the repetition rate of the case and the repetition rate of the lawyers: as the case gets
more disfluent the lawyers get more disfluent. If you look specifically at the lawyers who come in in the
second half, so you’ve got your first lawyer and your second lawyer, normally they start out not very
disfluent because they’re totally prepared. But the second lawyer starts out disfluent, almost at the rate
of the case.
>>: Is this excluding the lawyer in question itself?
>> Mari Ostendorf: What?
>>: Is the, in the left hand side graph because if the lawyer talks a lot [indiscernible]…
>> Mari Ostendorf: Right, so what’s happening to some extent is that if Scalia is talking a lot he’s driving
it, because the lawyers are patterning after the judges.
>>: It would also be interesting to see the trend, [indiscernible], whether the lawyer has entrained to
the judges because of this power dynamic.
>> Mari Ostendorf: That’s what this is showing, that the lawyer is entraining. But it’s not necessarily to
specific justices; I haven’t done anything with specific justices.
>>: [indiscernible]…
>> Mari Ostendorf: But remember this has a problem: I don’t have enough data yet to normalize; I can
normalize for the justices but not for the lawyers, right. Really it’s just broad statistics that we can look
at, at this point. But because lots of people are arguing and there’s not a ton of time to entrain to one
person, the trend is to the case overall. That’s all I’ve been able to do so far.
>>: But it’s also interesting to see locally. Who starts at this [indiscernible] when it happens?
>> Mari Ostendorf: I don’t think that we can do that without more speaker normalization.
>>: Yeah.
>> Mari Ostendorf: Yes, it would be really interesting. But I don’t trust my data enough yet.
Okay, so just some things: high engagement, more disfluencies. We talked about some other stuff.
Syntactic parallelism: there is this theory about well-formed repairs having syntactic parallelism. The
Switchboard annotation that was done for parsing incorporates all of that and represents these
unfinished constituents.
One question is, what’s going on in the data? This is just a very small analysis, very anecdotal. But what
you can see is that most of the work with parsing and disfluencies has been based on the same-phrase
case, okay. The same phrase happens a lot in Switchboard, almost half the time; clear, simple syntactic
parallelism happens a lot in Switchboard, but in SCOTUS not nearly so much.
>>: [indiscernible] identical?
>> Mari Ostendorf: No, no, no, no, same syntactic construct.
>>: Oh, construct, okay.
>> Mari Ostendorf: Okay, so what’s different here is that in SCOTUS, the higher stakes stuff, you have
more strategy things. You have expansions because they’re adding hedges or clarifying, appropriateness
stuff. You have a lot more function word differences. One of the things that syntactic parallelism
doesn’t account for is things like "it, the"; that’s not parallel, right. That sort of stuff happens a lot
because it’s a lexical access thing. High cognitive load, lexical access, all of this sort of stuff is happening
more. My point is just that the syntactic parallelism needs to be taken with a grain of salt. It’s
fundamentally there but not always beautiful.
The length of repairs: SCOTUS is longer. One of the things that’s very interesting is that sometimes you
have this "repair" that happens where you say something and the repair is like a whole sentence, okay.
"Is that the practice in", and then they insert something, and then "is the general practice". If you think
about it as a repair, this whole entire thing is in fact a repair. That is a nightmare for automatic
detection. In fact this is longer than it appears because I’ve collapsed all of the other disfluencies inside
it.
>>: How many instances like this are out there in the [indiscernible] of the data?
>> Mari Ostendorf: In SCOTUS?
>>: Yeah.
>> Mari Ostendorf: A non-trivial amount, in Switchboard not much.
>>: Obviously your model isn’t doing enough to capture the [indiscernible].
>> Mari Ostendorf: We can’t capture that. We can’t capture it. In fact the annotators don’t know what
to do with it because sometimes there’s like two sentences.
>>: [indiscernible]. Okay.
>>: [indiscernible] impact [indiscernible] think of the application of these kinds of things either for
understanding or for translation. For translation you probably just tested this as is. My understanding
probably you can talk about the content of it [indiscernible] as well, including both...
>> Mari Ostendorf: For translation it’s probably a "who cares" in terms of what you would want to do.
What we were talking about earlier this morning is that you might want to put in a dot, dot, dot where
that plus is, to let you know the person was thinking and something’s going on. But for translation I
would say it’s not a big deal. I think it’s…
>>: But you don’t want to break it into three separate sentences at this point.
>> Mari Ostendorf: Yeah, the thing where this becomes interesting is if you’re looking at any sort of
social analysis. Understanding what’s going on, strategy analysis then you want to know that there is
something here. It’s more of a discourse level thing.
>>: Like in Cortana we just delete the whole thing [indiscernible].
[laughter]
>> Mari Ostendorf: Yeah.
>>: In this case [indiscernible].
>> Mari Ostendorf: Okay, so just to finish up. The implications here: a noisy channel model works really
well for handling the easy types of disfluencies. Repetitions are frequent, so one of the things we’ve
started doing is detecting them first and dealing with them. That’s kind of useful for thinking about
incremental processing; it works really well with incremental processing. Being influenced by Mark
Johnson: PCFGs are not good for disfluencies. The non-trivial number of long repairs suggests you need
more sophisticated, delayed decision making.
Other implications: one of the reasons this stuff is interesting is for analyzing social interactions and
cognitive load. That’s one of the reasons I’m interested in actually detecting the corrections.
In conclusion, my main point is that disfluencies are not just noise. They’re interesting; they carry
information. My second big point is we really need to be looking at "speech in the wild" to see what
people really do. It is important to understand the variability, both within speaker and across speakers,
in order to handle disfluencies correctly and to improve our automatic detection. Speaker variability is
huge, so that potentially impacts earlier findings. It also makes it hard to figure out social information
and cognitive load information if you can’t normalize for the speaker. Then, with some context control,
if you can do speaker normalization, we can see pretty interesting effects of stress and power.
Some of the things I’m doing, or would like to do in the future: more on the semi-supervised cross-domain
learning. I would love to leverage parsing if I could get a parser to work on the SCOTUS data.
I’m interested in ways to learn the finer-grained disfluency categories, applying semantic analysis to
them. Once we have these expanded disfluency types, can we use them for speaker modeling or
analysis of cognitive and social factors? Lastly, we can apply this new way of looking at disfluencies in
terms of multiple types as a way to improve parsing, to do a better job of integrating disfluencies and
parsing.
That’s it, so sorry for going so long.
[applause]
>>: When they repair, like when you replace one word with another word, it seems like sometimes it's the type of thing where the words sounded similar, so they had sort of said the wrong thing and they were fixing what they said.
>> Mari Ostendorf: Right.
>>: But I'm wondering how often is it more that the concepts are close but they're totally different words? Like you've changed your mind about how you wanted to express it. Have you looked at the difference between those two things?
>> Mari Ostendorf: I’ve tried to do some looking at that. These are less frequent, so of the…
>>: Word repairs are?
>> Mari Ostendorf: The, yeah the word repairs. Of the lexical access repairs most of them involve
function words, okay.
>>: Alright.
>> Mari Ostendorf: One of the things I've been trying to do is automatically mine the data to find those specific things. We automatically mine them based on phonetic similarity. We also tried some things where we looked at stemming, trying to find syntactic similarity. We've done a little bit of mining the data for that. What we come up with hasn't been accurate enough to do fully automatically.
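As a rough illustration of that kind of mining, here is a sketch that scores how similar a replaced word is to its replacement, as a proxy for "sounded similar" versus "changed my mind." This is only an approximation: the real work would use pronunciations and a proper stemmer, whereas plain string similarity and a crude suffix-stripper stand in for them here.

    # Illustrative sketch: plain string similarity stands in for phonetic
    # similarity, and suffix stripping stands in for real stemming.
    from difflib import SequenceMatcher

    def crude_stem(word, suffixes=("ing", "ed", "es", "s")):
        for s in suffixes:
            if word.endswith(s) and len(word) > len(s) + 2:
                return word[: -len(s)]
        return word

    def repair_similarity(reparandum_word, repair_word):
        a, b = reparandum_word.lower(), repair_word.lower()
        return {
            "surface": SequenceMatcher(None, a, b).ratio(),
            "stemmed": SequenceMatcher(None, crude_stem(a), crude_stem(b)).ratio(),
            "same_stem": crude_stem(a) == crude_stem(b),
        }

    # e.g. a word-form fix versus a genuine change of content word
    print(repair_similarity("requirement", "requirements"))
    print(repair_similarity("house", "building"))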
>>: Because it’s mostly function words so you don’t have a lot of these types of, yeah.
>> Mari Ostendorf: Well, yeah, so a lot of it is function words. The interesting thing, and I'm not sure, is that I've talked about this with Ann Cutler, and she's the one who told me she thinks a lot of the function word things are lexical access. You know, like "it," "the"—there's a lexical access thing there. I don't know if that's a good example. But anyway, there's a bunch of function word things that are presumably lexical access issues. You'd have to do a much more controlled study than that; I mean, I can't tell. I've got the transcripts; I don't know what's in people's heads.
>>: Sure.
>> Mari Ostendorf: I can’t tell that stuff. I can just look at gross statistics.
>>: Is the implication of this, rather than the hyper [indiscernible] test that we usually think about, like translations of events—is it to actually get the disfluencies themselves so you can make analyses like "this person is probably subordinate," if you know what the conversation is? Is that the kind of thing?
>> Mari Ostendorf: That is one thing that I'm interested in. I'm actually interested in both types of things. There are a couple of reasons why it's relevant for translation: if you're going to throw something away, you may be throwing away something useful. Knowing the correction can tell you whether you're throwing away something useful.
That's one issue. The other issue is that by working on all these different data sources I've ended up improving my performance on Switchboard. The more we understand the types of things people do, the better a disfluency detector we get. It's relevant for that as well. But I'm arguing that, on top of that, it's relevant for understanding these social things.
>>: I'm just wondering more about what the application from that space is. Like is it that type of thing, just the example of the…
>> Mari Ostendorf: Understanding power relationships is the [indiscernible] we wanted; it tells you whether this discussion is going well. One of the things people are interested in is analyzing, if you have recordings of a negotiation, how it is going, those sorts of things.
>>: But in translation you could see, if there's a power dynamic, that it might play out differently in another language, and you might want to capture that. That would be a difficult thing to do. You know, like a certain [indiscernible] that might be expressed in a particular power situation that is clear from the disfluencies. I'm not saying that's the thing we'll do, but it's something you might want to do, I'm saying, okay.
>> Mari Ostendorf: Yeah, yeah.
>>: As you mentioned, you produce more disfluencies in a more informal environment. You said, like, in this talk here with a general audience you'd be speaking kind of more carefully. Do you think we sometimes use disfluencies as a way to signal to the audience, or the other person we're talking to, how we feel with them? Not just that it's subconscious but it's…
>> Mari Ostendorf: I think it’s mostly subconscious. I think some of the things that we do are going to
be conscious, the really long ones.
>>: Yeah, so I misspoke. I didn't mean that it's subconscious. I meant that it's not just an accident that we're making disfluencies, but that we might be trying to signal without knowing that we're doing it. But the signaling is there, I guess.
>> Mari Ostendorf: It has been argued that we do, that it's purposeful. I'm an Engineer, so I'm not going to go out on a limb; I'm not a cognitive scientist. But I will say it is certainly the case that people use them. Whether they use them intentionally, or whether they use them because they've just gotten used to "this sort of disfluency means X," I don't know. But definitely people use them, so speakers and listeners both use them.
>>: Do you know if there have been studies that have shown the effect on a listener if a person has different disfluencies? For instance, when the computer is talking to a person, if it injected disfluencies every once in a while, would that have the effect of making the person feel more comfortable? Do you know if that kind of thing has been [indiscernible]?
[laughter]
>> Mari Ostendorf: You know I don’t know if that is, you guys…
>>: I think that is the case for example [indiscernible].
[laughter]
>>: [indiscernible]
>> Mari Ostendorf: Yeah.
>>: Like I would be [indiscernible].
[laughter]
>> Mari Ostendorf: I’m, this is an advertisement for a talk.
[laughter]
>>: She wanted me to ask that.
[laughter]
>>: Do you think other [indiscernible] techniques like [indiscernible] and other [indiscernible] would be
better [indiscernible] long range repair which can get partial parsing out of it?
>> Mari Ostendorf: I've talked to Mark Steedman a bit about this, because when I did my sabbatical in Australia I was with Mark Johnson and Mark Steedman was visiting. We spent a bit of time talking about CCGs. His feeling, and we actually did a little bit of analysis, is that CCGs do a better job of meeting this syntactic parallelism; that the constituents you see people use are more compatible with CCG. That was the hypothesis.
It mostly seemed to be bearing out, but we didn't actually write anything up. It was just kind of anecdotal, trying to figure out what to do. We were looking at it because of my hypothesis about handling the repetitions. I think it's an interesting direction. We looked at it a little bit, but definitely no complete study.
>>: On the question about [indiscernible]. What are you missing to get something out [indiscernible]? Is it the [indiscernible] alignment, is it pitch, or is the lexical information getting us enough?
>> Mari Ostendorf: If you look at work that's been done on Switchboard, you get most of the bang for the buck from lexical features. A reason for that is that a vast number of the disfluencies in Switchboard are repetitions, right? That's going to dominate your detection measure. If we look at more complex disfluencies, I think it's potentially the case that prosody could give us a bigger win.
The problem is that with these more interesting corpora, most of them, with the exception of ATAROS which is recent, this SCOTUS and the FCIC data, the audio quality is not good. I've tried doing forced alignments on them with multiple aligners and it's just not good enough. It's not the kind of problem you fix with robust recognition and deep learning; it's stupid things like page turning totally messing up the forced alignment.
>>: The early model you talked about, the [indiscernible] model for planning and [indiscernible], looks like a very reasonable hierarchical model. Do people build any additional models based upon that kind of hierarchy, the [indiscernible] between the two [indiscernible], to explain the kind of differences you have with [indiscernible]?
>> Mari Ostendorf: There's no computational work doing that that I am aware of. That was my sabbatical, talking to cognitive scientists. There's not a lot of computational work.
>>: In your transcripts you have several things that aren't actually going to come out of an ASR system. You have ellipses, you have commas, capitalization. To what degree did you use those things in the systems that you built? Did they help you?
>> Mari Ostendorf: I used those for the SCOTUS transcripts, only doing lexical things. What I would do in hindsight is use them in forced alignment to look for filled pauses, if I did forced alignment with optional insertion of things. I've done some work, actually, that Lucy knows about, on taking oral readings and looking at where people make mistakes to try to understand reading level and difficulties. Those forced alignments were done by somebody who has a way of letting the aligner insert words, so you could allow it to find repetitions and things like that. If I did the forced alignment that way, I could use the dot, dot, dot and the punctuation to up the probability of those sorts of things. That's one thing that I'd like to do next.
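A hedged sketch of that idea follows: it shows how punctuation cues in a clean transcript could be expanded into an aligner-friendly reference with optional insertions. This is not the aligner described above; the slot representation, the simple tokenization, and the filled-pause list are assumptions for illustration only.

    # Illustrative sketch: turn a punctuated transcript into a list of slots
    # for an aligner that supports optional words. Each slot is a list of
    # alternatives, where "" means the slot may be skipped entirely.
    FILLED_PAUSES = ["um", "uh"]

    def expand_reference(transcript):
        slots = []
        for raw in transcript.split():
            word = raw.strip(".,?!").lower()
            if word:
                slots.append([""] + [word])  # optional extra copy absorbs a repetition
                slots.append([word])         # the word itself is required
            if raw.endswith("...") or raw.endswith(","):
                # ellipsis or comma in the clean transcript: likely pause
                # site, so allow an optional filled pause here
                slots.append([""] + FILLED_PAUSES)
        return slots

    print(expand_reference("Well, the... the requirement is clear."))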
>>: When you say repetition, like typically you find "the, the"—is a repetition an exact lexical match or is it a functional match? You're talking about "the, the"; in another language you have [indiscernible], okay.
>> Mari Ostendorf: I’m using it to mean exact match.
>>: Alright lexical match [indiscernible].
>> Mari Ostendorf: Exact lexical match, with the caveat that I include fragments. So if I had the example "a wreck-, a, a, requirement," all of that is trying to get out that final "a requirement." "A wreck" I'm counting even though it's just a fragment.
>>: Yeah.
>> Mari Ostendorf: But you can call that a match by doing a substring.
>>: That’s considered an adaptation?
>> Mari Ostendorf: I am considering it, and by that definition it is a repetition.
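One possible reading of that substring matching, as an illustrative sketch rather than the talk's actual definition (the "-" fragment marker is an assumed transcription convention): a reparandum counts as a repetition when its words, after dropping cut-off fragments, are a prefix of the repair.

    # Hedged sketch of the fragment/substring idea just described.
    def is_repetition(reparandum, repair):
        kept = [w.lower() for w in reparandum if not w.endswith("-")]
        repair = [w.lower() for w in repair]
        return len(kept) > 0 and repair[: len(kept)] == kept

    # "a wreck-, a, a, requirement" -> final fluent phrase is "a requirement"
    print(is_repetition(["a", "wreck-"], ["a", "requirement"]))  # True
    print(is_repetition(["a"], ["a", "requirement"]))            # True
    print(is_repetition(["the"], ["a", "requirement"]))          # False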
>>: Where this [indiscernible] case would be a case of lexical access, because as you're planning you come up with, "I'm going to say a different noun, now I have to modify." That would be lexical access.
>> Mari Ostendorf: That would be a lexical access one. That would count—so there's the surface form and then there are these more interesting categories. The surface form is just: are they the exact same string or not? That's repetition, repair, restart.
These other categories are things that I'm starting to work on, where I'm trying to use sources that will say, is this syntactically the same? Is this semantically the same? [indiscernible]—those things you can get with a part of speech match. They are thought to be associated with a different level than the repetition.
>> Will Lewis: Maybe we should break here. Thank you.
>> Mari Ostendorf: Okay, thanks.
[applause]