>> Will Lewis: We’ll go ahead and get started. I want to welcome Mari Ostendorf to give a talk here today. I won’t go through the whole litany of her background. I do want to say that Mari is very close to us here because she is close geographically just across the lake. It’s really nice to have her here to give a talk today. She has a Ph.D. in Electrical Engineering, has been at UW actually since 1999, and does a lot of work on Speech Processing. I know a number of projects that you’ve worked on with a number of folks over at UW. What she says here is interests are in dynamic and linguistically informed speech and language processing, which of course is of interest to a number of folks here. Mari and I, I actually remember, I’ve known Mari for about ten years. I remember distinctly the exact moment I first met her. It happened to be in an interview across the table at the Facility Club. It was a grueling day. Fortunately we, I had the opportunity to work with Mari intermittently over the next couple of years in the development of the Comp Ling Program at UW, which was a distinct pleasure. Without belaboring it any further Mari Ostendorf, Finding Information in disfluencies. >> Mari Ostendorf: Okay, thank you. Alright, so I’m going to talk about how we really talk. [laughter] I, so I’m assuming, I know some of you actually know a fair amount about disfluencies. But I’m assuming that some of you don’t. I’ll have a little bit of an intro here. This is a transcript from the Switchboard Corpus which is conversational speech. What I did here is everything that is a disfluency I have put in bold and crossed out. Classic, and then filled pauses, um and uh that’s filled pauses. Those are in purple to the extent they look like purple here. This is just illustrating different types of disfluencies and the fact that they are fairly frequent. The other thing that I want to, so I’m an Engineer, so I actually have practical reasons for thinking about disfluencies even though I’ll give you a lot of studies that aren’t so engineering oriented in terms of what we’re looking at. If you think about what computers would hear if they were perfect it’s everything that comes out of somebody’s mouth that’s what the recognize. There’s an acoustic signal there so the recognizer will deal with it. The dot, dot, dots are there to indicate that we’ve got some pauses in there. Then if you look at what people hear. People actually filter these things out. Many of those things, not all but many of those things depending on how they’re said the listener won’t even notice. If you ask them to transcribe the speech they will miss quite a lot of it. That is the challenge if we want our computers to be like humans we have to figure out what to not pay attention to, okay. But they, now, I’ll come back to that. The other thing I want to point out, a lot of people thought you know had this idea that Switchboard is just; it’s just really messy data, it’s not real life because it’s not really practical because these people have no stakes in the conversation. They’re getting paid some minimal amount to talk to a stranger for five minutes. Well, okay here’s a nice example. This is Supreme Court oral argument. They are in fact more disfluent than the Switchboard Corpus. You know disfluencies are real life. If you think about when are we disfluent? Well one of the reasons we’re disfluent is when we have high cognitive load or emotion situations. The Supreme Court is definitely a high cognitive load situation, okay. 
Alright and then I wanted to point out here. This will be relevant to the particular approach that we’re taking. This is one sentence, okay. From the point of view of how it’s transcribed and one of the things you see in the Supreme Court oral arguments, people, particular lawyers don’t want to give up the floor. They make their sentences; they make it hard to be interrupted. This is going to pose problems for language processing. Alright, so disfluencies are common. Multiple studies have found disfluency rates of six percent or more in human-human speech. That’s you know relatively high. You can’t avoid having to deal with it. People have some control over their disfluency rate. But pretty much everyone is disfluent. Some people are more disfluent than others. But pretty much everyone is disfluent. People aren’t usually conscious of the disfluencies. But they do appear to use them both as speakers and listeners, and as you are using with your robot. Okay, so I like to argue you know traditionally in language processing people have been thinking about disfluencies as noise. I would like to point out that they are also information. They’re noise. They’re both, okay they’re noise because if you actually transcribed exactly what somebody said it’s hard to read, okay. It degrades readability. The word fragments are difficult to handle in recognition. They mess up the recognizer. The grammatical interruptions mess up the parser. If you translate them everybody gets confused because the prosodic cues that tell the listener to ignore this aren’t there. For all those reasons disfluencies are noise. On the other hand listeners use those disfluencies to understand the corrections, the attention grabbers, the strategy changers. They use, how they use particular filled pauses is sending a message about turn taking. They indicate speaker confidence. It reflects, the disfluency rate reflects cognitive load and anxiety. It’s interesting from a human-computer interaction perspective to look at them, so detecting disfluencies impacts from a practical perspective getting the intended word sequence, interpreting the speaker cognitive state, and understanding the social context. As I will actually show some, one little study that we hope to build on is it actually is going to tell us about relations between people, power relations. Okay, so this has implications for spoken language technology. Human-computer interaction particularly multi-party because you’re going to have more disfluencies that way, spoken document processing, as well as medical diagnostics. Okay, so that’s the introduction. Here’s what I’m going to try to talk about. What can we learn from disfluencies? What about the speaker mental state or the social context? How do we detect them? There’s not very much annotated data for disfluencies so how can I detect them for other types of corpora? I’m going to look at detecting, look at work on several corpora. I will call it Speech-in-the-wild taking from, borrowing from Liz Shriberg. In this data we’re going to look at different communicative contexts, so high stakes, low stakes, stuff like that, looking at automatic detection algorithms. A side effect of this I think we’re learning about improving spoken language processing for different genres. Okay, so I’m going to start out just going through some basics of disfluencies, telling you a little bit about my data. How we do automatic detection and some studies we did with the different corpora analyzing this data, and finally conclude. 
Here’s some basics. People have looked at disfluencies since the eighties at least. There’s been psycholinguistic studies that basically describe a lot of the things we are seeing now, even in a very small amount of data. In this particular study by Levelt early on they identify a bunch of different types of disfluencies, including appropriateness disfluencies. You want to change what you say that’s more strategic, error repairs you make a mistake. A thing that’s called covert repairs that’s a particular level, well I’ll get into the cognitive levels. Covert repairs are when you say the, the, the, so when you do repetitions. Then there’s a bunch of other repairs. This is in a very small study but you look at this, I think it’s pretty cool that this sort of stuff carries over to a lot of the data that we’re looking at now. They propose that well-formed repairs have syntactic parallelism between the reparandum and the repair. The particular model that lots of people use that builds on this, described by Liz Shriberg, is this notion that you have three parts. The reparandum that’s the stuff I was crossing out. The interregnum that’s optional, that’s a, things like ums and uhs, and I means, and stuff like that. The repair, that’s replacing the reparandum. In a restart the repair is not there. The interruption point is not explicitly there. There’s no, not necessarily a clear signal of the interruption point except for the fact that you’ve got a prosodic discontinuity. Here’s an example of the very first in the form of the first step I showed you. This is that annotation of this crossed out version where I have purple for the reparandum. Here’s my interregnum, so the person is saying or to indicate their rephrasing here. Of course you can have disfluencies inside disfluencies, so that’s what’s happening here. This is a second; this is a repetition so that’s basically illustrated, yes? >>: Why do you necessarily think that the stand they have equals not intended? >> Mari Ostendorf: Huh? >>: Why do you think that the phrase the stand they have was unintended? >> Mari Ostendorf: They, this is something that would probably be categorized as an appropriateness disfluency. The person wants the way they command respect to replace the stand they have. In reading the broader context of this… >>: You get it from the broader context, okay. >> Mari Ostendorf: That, well actually you can get it from listening to it too. >>: Yeah, yeah, processing, sure. But, right in terms of text there I don’t see any reason that that’s considered a disfluency. >> Mari Ostendorf: All of these transcriptions were all, this is from the Switchboard. This was all based on audio. The transcriptions are based on audio. >>: Yeah [indiscernible]? >> Mari Ostendorf: The way that I would get this from text alone, so ideally you want to do the automatic detection with audio, which I’m not doing right now, okay, which I’ll explain. The way that I would get it is the fact that there’s a repetition here. Not just the, the, the, yeah, okay. Anyway this is the cleaned up version, okay. Categorizing, so there’s three categories that have been used in a lot of the recent work based on this simple surface-level labeling, a repetition is when the reparandum equals the repair. A restart is there’s no repair. The correction is when they’re not equal, alright. Earlier work by Levelt and Cutler had finer-grain intention categories. Eventually, I actually think this is the really interesting stuff in terms of analyzing interactions. 
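To make the structure just described concrete, here is a minimal illustrative sketch in Python of the Shriberg-style reparandum / interregnum / repair representation and the three surface categories (repetition, restart, correction). The class and example tokens are hypothetical scaffolding, not code from the talk.

```python
# A minimal sketch of the surface disfluency structure described in the talk:
# reparandum + optional interregnum + repair, with the three surface-level
# categories derived by comparing reparandum and repair.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Disfluency:
    reparandum: List[str]                                   # words the listener should discard
    interregnum: List[str] = field(default_factory=list)    # optional "um", "uh", "I mean", ...
    repair: List[str] = field(default_factory=list)         # empty for a restart

    def surface_type(self) -> str:
        """Surface-level category used in much of the recent detection work."""
        if not self.repair:
            return "restart"      # no repair: the utterance is abandoned
        if [w.lower() for w in self.reparandum] == [w.lower() for w in self.repair]:
            return "repetition"   # reparandum == repair
        return "correction"       # reparandum != repair

# Toy examples mirroring the talk:
print(Disfluency(["the"], [], ["the"]).surface_type())                        # repetition
print(Disfluency(["to", "remove"], ["uh"], ["to", "review"]).surface_type())  # correction
print(Disfluency(["so", "we"]).surface_type())                                # restart
```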
But at the moment most of the work is really aimed in this. That’s where we’re going to start. In fact most of the work is aimed only at finding the reparandum. This is just some examples, repetition the, the, the, it’s, it’s, those are very frequent. Repairs, there’s a lot of different types of repairs, I just I, so the person is getting rid of the just; we you’d, so there’s strategic things. There’s, so we want, in our area we want, so there are elaborations. There’s also mistakes, I don’t have an example here, well we you’d would be a mistake. You can also get words being wrong, lexical access mistakes, so there’s a lot of different types. Restarts this is an example and you can have them nested. Okay, just to show here the interregnum it could be filled pauses, it could be things like I mean, discourse markers like well. One of the reasons why you want to, one second, why you want to not just throw it away and why you want to, so mostly people have been detecting these things and throwing them away. But in fact sometimes not hugely often the word that you want, my insurance is in the reparandum but not in the repair. You actually need to not throw that away. Question? >>: I was wondering on the [indiscernible] most people have been working on the surface result because, I mean you said is that because the goal of the work is usually just to clean up transcripts. That’s all that people care about. >> Mari Ostendorf: Yes, yep, absolutely. I’m, and well also and it’s easier, right. That, so one of the things that Liz Shriberg pointed out in her thesis is people you know there’s a lot of different viewpoints on how you categorize below the surface. The surface stuff it’s pretty non-controversial. Okay, alright, so what makes people disfluent? I’m only going to talk about, so there are the strategic turn taking disfluencies. But what I’m going to focus on here is cognitive load and situational stress. Cognitive load, if you have many things on your mind, so people talk about disfluencies on different levels. There’s the articulation disfluency when, basically when your production and your planning get out of sync. If your production is ahead of your planning you’re out of sync. At the articulation level that’s when you say the, the, the, when your production is ahead of your planning. At the lexical access level sometimes you make things like review, renew or something like that. Lexical access things that are similar sounding or different wrong tense, or uh versus and. Then there’s the more planning, you know overall planning level. That’s where you make some of these appropriateness strategic changes. Cognitive load you have more disfluencies because you have multiple things on your mind. Situational stress, high stress will tend to, I’d say cognitive load seems to get everything. High stress from the data I’ve looked at seems to have more of the lower level disfluencies. The interesting thing is you kind of have low stress and high stress. If you have really low stress, very casual conversations with people you know they’re more disfluent. If you have very high stress situations, high stakes it’s more disfluent and it’s kind of lower in the middle. Based on the data I’ve looked at the problem, all this with a caveat, there’s such vast variation in speaker disfluency rates that that’s the you know speaker, individual speaker variation is a huge effect and you need to look at a lot of data to say anything. >>: Why would the low stress situation increase disfluency? 
Is it that you just don’t care or you’re thinking about other things when you’re engaged? >>: Yeah you’re not using most of your brain. >>: Okay. >>: You don’t care. >> Mari Ostendorf: No, you know that, okay so some of the very disfluent stuff that we’ve looked at. We had to totally change how we were annotating was CallHome, the CallHome Corpus, or CallHome CallFriend. Because you know the person you’re talking to understands what you’re going to say and you don’t finish your sentence. >>: Oh. >> Mari Ostendorf: You can be very sloppy because the person knows you very well. It’s a very grounded conversation. >>: I see, okay, so do you get then disfluencies that don’t in fact have the correction? >> Mari Ostendorf: Right, right. >>: Okay. >> Mari Ostendorf: You actually get the word that finishes the sentence in the next sentence by the other person. >>: Yeah, yeah. >> Mari Ostendorf: That’s very tricky data. Actually we see this in some data that was collected at UW recently by Gina-Anne Levow and Richard Wright in terms of the negotiation, so when negotiating about something. Also if you have a shared visual world they don’t, they may say something that’s in the picture of, in their shared world that they both are looking at, but they don’t actually say it. All of those things cause disfluencies as well. >>: Is there any data that you looked at that shows sort of some consistent pattern across languages? >> Mari Ostendorf: I’ve only looked at English. That’s pretty hard as it is. Okay… >>: We got to hear about, hear only about native English speakers? >> Mari Ostendorf: No, I am talking about the English speakers in corpora we have. Some, most of them are native but not all. >>: Not native could cause some additional factors. [Indiscernible] your language may have a different kind of structure. >> Mari Ostendorf: Right, right, most of them are native but there’s no guarantee. Okay, so the interesting question is are these factors reflected in different disfluency types? I argue yes by the way, by the data that I’ve had. But clearly it needs more analysis. Okay, so I talked to you about the cognitive models already. Basically there’s the content planning, the lexical access, and articulation. What’s happening is generation or production gets out of sync with planning. You and you also have these strategy changes which I think is, which is really important. Something you see a lot in the Supreme Court data. There’s a question of whether the different types of disfluencies reflect different types of problems, content, lexical access, articulation, or different solutions to problems. It has been argued by, I think it was Herb Clark that it’s different solutions to problems. I would say looking at this data it’s probably a bit of both. Here’s again articulation disfluencies. You see these repetitions. Lexical access, okay here is the actual example, to remove, to review. A, an that’s lexical access because you will tend to use a as your planning because that’s the most common thing. You see this a lot in languages that have gender with their articles. Here you can tell this is a Supreme Court one by a candidate, by a contributor. They’re talking about two different parties and mixing up those parties, that’s a lexical access. Then the content planning it’s a mix of, things where you’re clarifying, your, yes? >>: How do you know that that’s not a, by a candidate, by a contributor. How do you know that the speaker is not purposefully saying something more about what he wants to say? 
Augmenting, it’s parallelism. How do you know it’s not parallelism? >> Mari Ostendorf: Well in this case they’re talking about campaign contributions. Generally the candidate, they’re talking about, it’s a case about who can contribute and limits on contributions. In this case, particular context it wouldn’t make sense for this to be the same thing, a clarification. But you raise an excellent point that this is a very hard problem. It’s not an easy natural language problem. The other thing though is if you, again if you listen to it the particular way somebody says it is different for those two things. That’s a reason why we really need to get to the point of being able to include the audio. Okay, so anyway I just wanted to point these out. Because for me this is a reason why, these give reasons why we want to look at the endpoint and look at these more complex disfluencies, that the more complex disfluencies are telling us about the situation. In this case the lawyer says where we are arguing, where it is our position, so there’s something strategic going on here. I don’t know what it is because I don’t know legal stuff. Here they’re expanding. They’re emphasizing right to full right. But one of the things that you also see quite a lot but only for lawyers and not for justices is hedging. [laughter] Okay, distributional observations. These are some things that have been, a lot of people, this is, a lot of work has gone on to look at different disfluencies. One of the things that’s very interesting, in human-computer interactions in the old days there’s very few disfluencies. I think that’s because people are concentrating. You know it’s, you don’t have the less formal, you don’t know you don’t trust your computer to understand you. Now that recognizers work better you can look at data that’s collected in the ATIS days and data that’s collected now. It’s much more disfluent now. The more human-like our systems become the more disfluencies they’re going to have to encounter. Men are more disfluent than women. Disfluencies are more frequent in initial positions. The argument there is that’s where your cognitive load is highest from a planning perspective. Speakers are less disfluent before a general audience than before people who are familiar with the topic. I’m being pretty disfluent with you because I think that you know sort of what I’m talking about. If I was to give this in front of a, you know general audience, non-tech people I would be less disfluent. >>: What data is that from, that men are more disfluent than women? >> Mari Ostendorf: Tons of studies. Tons of studies and it’s also consistent with men stutter more than women. >>: With relation to the first point, do you know if we’re less disfluent when we speak to like children or babies, or some other thing that we think will have a harder time understanding us? >> Mari Ostendorf: Based on the first point I would say yes but I’m not aware of the studies. >>: Yes, so some, all the things you are talking about in terms of the familiarity of the audience. Actually there was a big theory, right, in the nineties about hyper, hypo theory. How does that fit into that pre-planning production that kind of sync we need, or the same all cognitive? >> Mari Ostendorf: Right, I’m sure it’s related I’d have to think about it. Okay, there is a study that says filled pause rates are less sensitive to task complexity and anxiety than other types of disfluencies. That’s not consistent with our data. 
Again, you’ll see that one of the things that we have here is huge variations among speakers. If you’re looking at a small number of speakers you could be concluding something that doesn’t necessarily generalize. Okay, alright, so disfluencies are prosodic interruptions. They’re fundamentally prosodic. I’m going to do a horrible thing, as somebody who has worked on prosody for many years of my life, and not use prosody in this work. There’s, so several studies show a higher F zero. One of the ways you can tell there’s a disfluency is you reset your pitch afterwards because you’re resetting the prosody. There’s sort of matching tonal structure because you’re restarting the same thing. There’s also the repair structure reflects the speaker intention to produce a predictable prosody. There’s lots of reasons why people are interested in prosody. In the data that I have it’s a little bit hard to get reliable prosodic features. I’m not using it. You can do a lot with text but long term that’s where it has to go. Okay, so most prior studies have used either controlled speech. In psycholinguistic studies they want to figure out what’s causing disfluencies. Some move the purple, move the yellow square types of things, or low stakes conversation such as Switchboard. Much of the work on disfluencies has been on Switchboard, or human-computer interaction. In the old days that has fewer disfluencies. Nowadays that could be more interesting. But there’s not a lot of annotated data unless you guys have it. [laughter] We are trying to use multiple corpora with varied stakes to get an idea of what’s going on here. We’ve got, well let me tell you about it, and to develop algorithms which generalize. I have two telephone conversation corpora. We have the Switchboard data which LDC has annotated. We augmented the annotation for a small amount because we’re interested in different types. I have CallHome which is family members. Then I have two high stakes, goal-oriented sets of data. The US Supreme Court oral arguments, so there’s more than fifty years’ worth of that. We’ve got one year of LDC-annotated careful transcripts and a subset of that. Like a smaller number of careful transcripts done by students at UW with careful disfluency, more detailed disfluency markings. Then we have financial crisis hearings. These are pretty interesting. This is the, we have two hearings with Blankfein. One when he was doing well and kind of a golden boy, and then later when he’s on the hot seat and being blamed for doing bad things. They are very different. If I played you one audio clip that would be the one to play it’s very interesting. Okay and then we have this ATAROS data that was collected at UW, it’s two-person negotiations with controlled variation of stakes. Yep? >>: On the telephone conversations do you also have annotated when two people are talking at the same time? >> Mari Ostendorf: Yep. >>: Is that also a source like when somebody else butts in does that start disfluencies on the initial speaker? I mean do you have any data on that? [inaudible] >> Mari Ostendorf: I have not actually; I could get data on that because we do have the time for Switchboard. We don’t have it for CallHome. We could get it for; I think we have it for the ATAROS data. We don’t have the timing for these guys because, so the reason I’m not using prosody is these things are so noisy the forced alignments are garbage. I need better recognition to do this. That’s why I’m not doing prosody right now. 
>>: I would have thought that in the SCOTUS Corpus that there wouldn’t be any interruption. >> Mari Ostendorf: Oh, there is and it causes disfluencies. >>: Okay, that’s positive. >> Mari Ostendorf: Yeah, so you, but it’s causing disfluencies in the sense that the lawyers interrupt, or the justices interrupt the lawyers, and the lawyers start getting stressed out. That’s what anecdotally I perceive from; I actually did a bit of the annotation myself because I wanted to understand it. Okay, so we’re trying to vary stakes and yes? >>: I’m sorry, the Supreme Court arguments these are recorded and transcribed or are they transcribed “records” transcriptions in total towards this? >> Mari Ostendorf: That’s a great question and I’m going to talk specifically about that. They are recorded. They are not high quality recordings hence the challenges. They have been done differently over the years. Another challenge to speech recognition now a days it’s in P3, old days it’s these big audio tapes. The transcriptions are not careful transcriptions. They are transcripts that some legal scholar would be interested in. That’s going to complicate the study. Okay, so I’m going too slow. I’m going, so the Switchboard we’ve got just a couple things about the annotation that’s worth noting. The typical LDC, if you’ve ever worked with the LDC data they have a nested annotation structure. We flattened that because it provides a better representation of these repetitions and we want to explicitly model those. Okay we’re going to, let’s see to get this flattened representation we hand annotated a small amount. Then we automatically predicted the flattened representation. Because you have this structure it’s just a matter of taking these things out, the F score is really high. You can do that automatically really easily. We have for Switchboard we particularly when we do the speaker variability stuff. We have twenty speakers who called a lot. The only way we can look at speaker variability if we have a lot from one speaker. We can look in the Supreme Court. We can look at it in this data. Okay, other stuff, we’ve hand-annotated a small amount of data for everything. That data, some of this data was used for SCOTUS training. Most of the data is only used for testing so it’s all cross domain stuff that we’re doing. Okay, so some similarities across the different corpora. Word fragments we find are most often associated with repetitions. This is counter to what has previously been reported in the literature. It’s not so often associated with repairs. In the very informal stuff it’s often associated with restarts or abandoned things. Most of the disfluencies involve function words and simple one-two word disfluencies. Observations from prior work mostly hold so high sentence-initial stuff like that. But the thing that just blows you away is the large amount of inter- and intra-speaker variability that’s really over a continuum. Some differences that more high stakes SCOTUS and FCIC have more of these very long strategic types of disfluencies. The relative rate of repetition is highest for SCOTUS and lowest for CallHome. There are differences in the types of things we get as functions of these contexts need to tease things apart a little more to understand it better. The thing that really surprised me is the statistics for CallHome and Switchboard are very different. I think of, their conversational, informal, they should be similar, no. Friend, family versus strangers seems to make a big difference. 
A lot of this is the shared world effect. The other thing is I discovered that the interregnum has, can play an interesting role. When lawyers are talking to justices they can use “Your Honor” as a discourse marker to say I’m making a, there’s a politeness thing. All of this is based on hand annotated data. Okay, now I’m going to talk about automatic detection. The computational model we’re using is really quite straightforward. It’s been used by other people. We’re going to use a sequence model. We’re going in particular conditional random field. It’s like; it’s basic like tagging so we’re going to label each word in terms of what part of the disfluency it is. Now previously there’s been a lot of work including stuff I’ve been involved in looking at parsing models. I think this is really a way to go. Unfortunately as I showed you that SCOTUS sentence our parser can’t handle it yet. Until we start doing some adaptation or find a way to do a better job parsing then that stuff is a little bit on hold. The features that have been used are word-based features or words and prosody. As I say prosody is actually important for detecting interruption points. But we’re, in this particular work we’re not going to use it. That’s what we’re doing. Here’s the disfluency detection model. We do a very simple begin, insideoutside sort of model. But we add the interruption point. We know the end of the disfluency because the interruption point actually really matters when you’ve got these multiple ones in a row, and we have an other. Then the next thing we do, so this is the starting point. Then the next thing we do is repeat this for different disfluency types. Then the next thing we do is add some states for the correction, okay. Baseline features are part-of-speech. These are totally standard stuff part of speech tags, and other type indicators, filled pause, stuff like that. Discourse markers, so you look at the word and words around it. Word fragment indicator you wouldn’t necessarily have this in speech recognition. But we’re using it because we’re trying to understand what’s going on. Pattern match features, so you’re trying to say did I have the same word, same phrase, or same part of speech sequence right in a row, okay. >>: No acoustic features, no acoustic features. >> Mari Ostendorf: No acoustic features because the time alignments are so bad. Okay, eventually I will get there. But right now this study is limited in that way. But it makes the experiments run faster. Okay, so the types of things we’re looking at, multi-domain learning. We’ve got leveraging, we have separate models of repetitions and repairs. Because the repetition model actually cuts across domain really well. The repairs is different, so that tends to help us in certain ways, hurts us in some ways and helps us in others. We have explicit, one thing that other people have not done we’re explicitly detecting the correction. Because we don’t want to, we want to know about the relation between the reparandum and the repair, not just throwing it, so it’s not just about text clean up. Punctuation is a surrogate for prosody is something we’re using on the Supreme Court data and looking at using semisupervised learning because there’s so little data that’s transcribed. One of the things, here’s a couple of the things that we did to, I’m just going to give you the final story, but just to tell you how we got better results. One, we tried adding the SCOTUS data, so the SCOTUS data, to the Switchboard. 
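The word-level tagging setup just described can be made concrete with a small feature extractor. In the sketch below, the label names (B, I, IP, C, O), the filler lists, and the distance bins are assumptions that only approximate what the talk describes; the feature dictionaries could feed any linear-chain CRF toolkit (for example sklearn-crfsuite), but this is not the actual system.

```python
# A minimal sketch (assumptions, not the actual system) of word-level features
# for a linear-chain CRF disfluency tagger. Labels roughly follow the talk:
# B (begin reparandum), I (inside reparandum), IP (interruption point),
# C (correction/repair word), O (other, fluent speech).
from typing import Dict, List, Tuple

FILLED_PAUSES = {"um", "uh"}
DISCOURSE_MARKERS = {"well", "like", "so"}   # crude single-word illustration

def token_features(words: List[str], pos: List[str], i: int) -> Dict[str, object]:
    w, p = words[i].lower(), pos[i]
    feats = {
        "word": w,
        "pos": p,
        "is_filled_pause": w in FILLED_PAUSES,
        "is_discourse_marker": w in DISCOURSE_MARKERS,
        "is_fragment": w.endswith("-"),               # fragments often transcribed with a dash
        "prev_word": words[i - 1].lower() if i > 0 else "<s>",
        "next_word": words[i + 1].lower() if i + 1 < len(words) else "</s>",
    }
    # Pattern-match features, binned by distance: does the same word or POS tag
    # reappear shortly after this one? (Distance bins help with the longer
    # repeats seen in SCOTUS without over-firing on Switchboard.)
    for dist in range(1, 7):
        j = i + dist
        if j < len(words):
            bin_id = min(dist, 4)
            if words[j].lower() == w:
                feats[f"word_match_dist_{bin_id}"] = True
            if pos[j] == p:
                feats[f"pos_match_dist_{bin_id}"] = True
    return feats

def featurize(sentence: List[Tuple[str, str]]) -> List[Dict[str, object]]:
    words = [w for w, _ in sentence]
    pos = [p for _, p in sentence]
    return [token_features(words, pos, i) for i in range(len(words))]

# Example: "I just I think ..." would be paired with a label sequence like
# ["B", "IP", "C", "O"] during CRF training.
print(featurize([("I", "PRP"), ("just", "RB"), ("I", "PRP"), ("think", "VBP")]))
```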
Switchboard is used for everything, so we tried adding the SCOTUS data. SCOTUS data helps SCOTUS but surprisingly it did not help with the FCIC. I didn’t expect it to help with other things but it didn’t help with the FCIC. Better leveraging similarities, we find the common, what it did help with is coming up with words that are common words for disfluencies. We looked at what’s in Switchboard, what’s in SCOTUS. We made this list of common words and disfluency. We train a language model based on that to say okay if I see this sort of common pattern it’s likely to be disfluency. There’s some things that you see a lot that are, so, so the, the, you can get that with pattern match really easily. But certain other types of things aren’t, are common but you wouldn’t’ necessarily get them as a pattern match. Okay and then the last thing we added was distance-based pattern match features. How far away is the pattern match because that, so certain corpora you have longer disfluencies than others. The pattern match if you don’t do distance it becomes, so the farther away pattern matches are less reliable. That’s the point of that. Okay, so putting it all together here’s where we stand. >>: Can I talk just for a second? That the longer distance ones are occurring in what? >> Mari Ostendorf: If you have, so you can get in the SCOTUS Corpus we can get five and six word complete repetitions. >>: I see. >> Mari Ostendorf: In Switchboard that’s rare, okay. >>: Okay. >> Mari Ostendorf: They’d be one or two words. >>: [inaudible], and the more formal SCOTUS you’re seeing this kind of thing than in the less formal Switchboard. Am I correct in that? >> Mari Ostendorf: Basically the idea is that if you allow the longer copies then as they get, if you allow, if you say okay pattern match if I have a pattern match anywhere fire this feature. The problem with that is that’s going to over generate, so by having it distance matched you can put less weight on it. >>: Right, no I’m just trying to get the sense of where the longer distance… >> Mari Ostendorf: Things happen, so they happen in the higher stake stuff. >>: Okay, okay, that was your answer for that. >> Mari Ostendorf: Yeah. Okay, so these particular results are only using Switchboard to recognize other things. These are our best results. It’s with a bunch of improvements. What you can see is obviously you do best on Switchboard. But the cool thing is Switchboard works not too horribly for everything. The other thing that’s interesting is mostly the reason why things work well, so this is the number that everybody reports. Did I detect the reparandum no matter what it is? Okay, if you decompose that into repetitions and everything else, repetitions are very easy to detect, the, the, okay. SCOTUS has a lot of repetitions. These guys have less repetitions, so hence Switchboard works really well for SCOTUS. This number is kind of hiding the fact, is going to be biased by how much you, the data has repetitions. Okay, the thing that’s hard if you try to detect other things that’s harder; not surprising. The thing that has not been done before is actually detecting the correction because if you just want to clean up people don’t care about the correction. But if you want to understand what’s going on, you care about the correction that’s really hard, knowing the endpoint is hard. We don’t do as well on that as we do at just detecting the reparandum. Okay that gets that across. Okay, so here’s, yes? 
>>: Well I don’t understand why [indiscernible] can be lowest eighty percent? Is it, what is it that’s getting false positives? >> Mari Ostendorf: Yeah, so for example that, that sometimes is a disfluency in sometimes is not a disfluency, right. >>: Is it really that frequent? >>: Oh, yeah, yes. >>: This is all on transcripts, right, not on ASR? >> Mari Ostendorf: This is all on transcripts. All of those numbers would go down on ASR. >>: Still surprised that it’s eighty percent, though. >> Mari Ostendorf: You can see it’s even lower in CallHome. The more informal stuff is but you can hear the difference if you use the audio. If this had, if I had the audio I think this would be much higher. If I had the audio and did a good job with extracting the prosodic cues which is another thing. Alright, okay, so this is new stuff that we’ve been doing. We’re unediting the SCOTUS transcripts. The question was asked earlier about transcriptions. What people transcribe by LDC when they’re doing a careful transcript they would get everything the person said. When you look at what they transcribed in SCOTUS sometimes they would get the repetitions. They may leave out a. Sometimes they’ll put in commas to indicate there’s something going on. They will put in dot, dot, dot when, they’ll do that for repetitions. They do that for um, um, um, um, those sorts of things. There are no ums and uhs anywhere in these fifty years, or in at least the twenty years that I’ve looked at of SCOTUS transcripts. The question is, so the thing that’s frustrating is we’ve got all of this data. We have fifty years worth of data. I can look at you know disfluencies across time, speakers across time, Roberts was a lawyer, and then he was justice. We can look at the difference, actually we have. I could do all of these things if I could use all this data. How can we get it when it’s not there? Well what we’re going to try to do is, what, so when I gave this talk at Penn. Mark Liberman told me I was doing unediting. That’s what I’ve now called it. We, this is only based on text. I’ve been talking to people about a couple of things we can do based on acoustics, which are very exciting. Hopefully I’ll you know can come back in a couple of months and tell you about them. We’re looking at orthographic cues particularly punctuation. But what we did was automatically matched the training data to this format. We’re adjusting the training data. We’re learning, so we have the careful, we have parallel versions of some of the SCOTUS data. We learn where you would insert things and take things away. We adjust the Switchboard data. We map to the SCOTUS data. We then can apply said, now we have all this data that’s not transcribed, so we can do semi-supervised learning. The other things that made a difference is explicitly modeling repetition separately from repairs because the corpus, the mismatch between corpuses is different for those two. Okay, so here’s what we got. We started out if you just use Switchboard. This is just doing the reparandum. This is not doing the correction because that was really harder to deal with in terms of the mapping. If you just do Switchboard finding the reparandum this is where we start with, the original Switchboard, okay. If I add the SCOTUS data, the careful SCOTUS data, so it’s mismatched and then doing the dot, dot, dot, blah, blah, blah, I get forty-one. I go from, basically forty to forty-one. 
If I transform the data to transform Switchboard using what I learned about the difference it gets up to fifty-eight, pretty nice. If I add self training that gives me a little bit more. If I use this different representation of the two different disfluencies I get up to sixty-two point two. This is a really nice big jump. The thing that kind of blows me away is that we’re doing better now than when we were with the SCOTUS data that was all careful. In terms of we’re now at sixty-two point two versus forty-one point six. Part of what’s going on, there’s multiple reasons for that. One is I’ve got more SCOTUS data because, but you can see the self training isn’t giving me that much. I need to kind of figure out what’s going on to make this big difference. But there’s also the other thing that you can see here is the thing that is the hardest in the original is the recall. You know if we detect something with, you know if I, so if I get a the, if the, the is there I can be pretty sure that I trust it, alright. But the recall is what we mostly improve. >>: I’m confused about the annotation on it. There’s the un-annotated, or there’s the transcribed data that’s not and then there’s the LDC careful, then there’s the unedited, right. This is being evaluated on unedited? >> Mari Ostendorf: What we did was we took the careful transcript, so the test data has the disfluency locations in the careful transcript aligned with the non-careful transcript. I know where they are even if they don’t exist in the non-careful transcript. >>: Right, but is the careful one, is that what LDC provided for one year or is that what you guys provided? >> Mari Ostendorf: It’s both. >>: Okay, but the SCOTUS one that’s the fifty years of original stuff? >> Mari Ostendorf: Right. >>: Okay. >> Mari Ostendorf: Right, so we took the careful transcript that’s been hand annotated for the disfluencies, aligned the thing because they have different words, obviously. Do the alignment, transfer the annotations so we know where the reparandum is, and that’s what we used for the target. >>: Okay. >>: You’re doing this only on the test data, or on the training data? >> Mari Ostendorf: On the test, you do the transfer on the test data. We also do the transfer on a separate small set of, because we don’t have a lot of this annotated, okay. Small set of training data which is where we learn what the mapping between them is. Then apply that mapping to all of Switchboard. >>: All of Switchboard? >> Mari Ostendorf: Right, so the thing we’re transforming is Switchboard. We’re adding punctuation and taking away words from Switchboard. >>: Oh, you’re saying making Switchboard look like SCOTUS, not the other way around? >> Mari Ostendorf: Right, right. >>: Okay. >>: But is this game then does that mean that its, is the reason for the low original score just because of mismatched annotation standards rather than [inaudible]? >> Mari Ostendorf: No the original is based on the matching careful transcription. It’s not a mismatch in annotation standards. >>: The 41 that does are, the training and the test are completely comparable in terms of, they’re, how they annotate both the transcribed and annotate the… >> Mari Ostendorf: Right, so the transcription, well I mean different people who did it at LDC did this you know decades ago. It’s not the same people but to the extent that you’re following the guidelines it’s the same. >>: Right. >>: I still don’t quite understand still, the event that you’re measuring that those precision recalls are on. 
The task is that you have to fill and you have to put in the original disfluency. >> Mari Ostendorf: That, find the words, you want to find the words that are in the reparandum. >>: But you’re not… >> Mari Ostendorf: Each word counts. >>: You’re not allowed to insert a new, a disfluency at any point in text, right? >> Mari Ostendorf: If you miss a word that’s in a reparandum that’s hurting your recall. If you get a word, if you say some things a disfluency when it’s not that’s hurting your precision. It’s a word level measure. If your disfluency has three words in it then that counts for three. It’s not a disfluency level number it’s a word level number. >>: If a person, regular person to do this annotation would he get [indiscernible]? >> Mari Ostendorf: Probably not. >>: Probably not. >> Mari Ostendorf: No because we’ve got, there’s disagreements, there’s disagreements for sure. >>: [indiscernible], what’s [indiscernible], what’s at the weight level [indiscernible]? >> Mari Ostendorf: You know I don’t know for the LDC data, so I don’t know. I mean it was done a long time ago. I’m not sure there’s a paper on it, so I don’t know. I know that they went through a bunch of iterations on it. I know that there was some discussions on simplifying it. For the later, so there’s two phases of LDC annotation. For the later phases they dropped the recursive structure. Okay, alright so that’s that. I want to, I’ve gone too long. I just want to give you a few data analysis things. This is looking, so I’m going, this is with a caveat, these results so the unediting just finished. You know it’s very recent stuff. All of the SCOTUS stuff here; this is on twenty years worth of SCOTUS. This based on; I should say this one, okay. What we’re doing, so we have much lower recall. Actually I think at the time the way we were, it was a slightly different version and we had higher precision. We had precision at about point eight, but much lower recall. These numbers all need to be updated. But you can still see pretty interesting trends. This is just showing disfluency rates, each X is a, let’s see I think each X is a speaker? No is a case or a conversation. No, no it’s a speaker, sorry. Basically what this shows, so these are filled pauses, the bottom is filled pauses, repetitions, and other disfluencies. You can see speakers varied quite a lot in both corpora in terms of these, how disfluent they are. Okay, that’s the main thing. The other thing, if you look at it from this perspective the intra-speaker variability compared to the, is huge. This is going across speakers and this is within speakers. The variability, so there’s a huge amount of variability. To try to understand what’s going on if you were working with Scalia versus O’Connor you would get very different conclusions. It’s important that we have an understanding of variability, speak variability in order to make any comments about the effect. Things that I talked about with, and that’s why we have this for the Blankfein stuff we have the before and after, and stuff like that. Main point is it’s tricky because people, it varies hugely. >>: Also the three types don’t correlate at all [indiscernible]. >> Mari Ostendorf: Oh, yes thank you, thank you. That was the other very important thing. I cannot predict how disfluent somebody is going to be by just looking at their filled pauses. Filled pauses are easy to detect. Repetitions are fairly easy to detect. I would like to be able to predict how disfluent they’re going to be, from one of these I can’t. 
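The word-level scoring described in this exchange is easy to pin down. The sketch below is illustrative only, under the assumption that both reference and system output are reduced to one in-reparandum flag per word; it is not the evaluation code used in this work.

```python
# Word-level precision/recall/F1 for reparandum detection: every word inside a
# reparandum counts individually, so a three-word reparandum contributes three
# reference words. (Illustrative sketch, not the talk's evaluation code.)
from typing import List, Tuple

def word_level_prf(gold: List[bool], pred: List[bool]) -> Tuple[float, float, float]:
    """gold/pred: one boolean per word, True if the word is inside a reparandum."""
    assert len(gold) == len(pred)
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum(p and not g for g, p in zip(gold, pred))
    fn = sum(g and not p for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# "I just I think": reference reparandum is "I just"; the system only finds "I".
gold = [True, True, False, False]
pred = [True, False, False, False]
print(word_level_prf(gold, pred))   # (1.0, 0.5, 0.666...): the miss hurts recall
```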
That’s a, thank you. There’s, so it’s roughly log-normal for a speaker, cross-type prediction is difficult, and controlling for speaker is important. Just a couple fun things, so if you look at the stress effect. Blankfein has more repetitions and more filled pauses at the, in the higher stress hearing. In the weak versus strong task for the ATAROS Corpus that we are looking at, the negotiation problem solving task, there’s again higher repetition rates in the strong stance case, okay, so more the higher stakes case. There’s also here, again also higher filled pause rates particularly for men. The stakes seem to cause people to be more disfluent as you might expect. Now we tried looking at judges in cases with unanimous versus close votes. For the close cases we were thinking okay what’s happening if they’re close and you’re on the losing side? The highly disfluent speakers are more disfluent. The less disfluent speakers are less disfluent. Don’t know what to do about that, okay. But again I have to revisit this with our new improved detection. Some highly disfluent cases have nine-zero votes. They’re just being more informal. I don’t know what it is. >>: Do you have the freedom of promise? I thought he literally said only a few tenths of… [laughter] >> Mari Ostendorf: You notice he wasn’t on the list. >>: Yeah, yeah, probably. >> Mari Ostendorf: Yes, no, no, actually. >>: But this is a highly emotionally charged one too. >> Mari Ostendorf: Yeah, so there is no thing, but in fact, did I have a thing about? >>: Yeah… >>: [indiscernible] at the bottom. >> Mari Ostendorf: Yeah. >>: I guess that’s the only case he actually said in the thing though. [laughter] >> Mari Ostendorf: Yes, for this particular one it’s a great thing to analyze. But I don’t have him in other studies because there’s just not enough to normalize. But it’s very interesting the types of disfluencies that are in that case because it tells you exactly what the hot buttons are. Okay, so this is the Blankfein stuff. You can have the early hearing and the late hearings. This is just giving you details on the differences. Where the interesting thing here if you look at the, where the disfluencies are happening they’re mostly happening sentence initially at the content planning stage. He’s thinking more about what he’s going to say for obvious reasons. Okay, this one I love. The repairs reflect the power dynamics. The hedging repairs are mostly used by lawyers, as I said. Here’s some examples, changing I think that to I don’t disagree with that except to the extent that I think that. [laughter] This is a classic. >>: Double standard. >> Mari Ostendorf: Let’s see, this is another one I love, so many to, or not maybe so many but many. [laughter] Politeness, the lawyers tend to be polite. The justices aren’t, so I’m sorry is the interregnum for lawyers. The other thing we looked at which is fascinating is entrainment. You do not see this. This is Scalia who’s interesting because he’s so disfluent. If you look at cases, these are different cases. If you look at the repetition rate of the case here versus the repetition rate of Scalia. It’s, you know there’s no trend here. If you look at the lawyers versus the, so here’s the repetition rate of the case and the repetition of the lawyers. As the case gets more disfluent the lawyers get more disfluent. If you look specifically at the case the lawyers who come in, in the second half, so you’ve got your first lawyer and your second lawyer. 
In the second half, you know they’ll, normally they start out very not disfluent because they’ve got, they’re totally prepared. The second lawyer starts out disfluent almost at the rate of the case. >>: Is this excluding the lawyer in question itself? >> Mari Ostendorf: What? >>: Is the, in the left hand side graph because if the lawyer talks a lot [indiscernible]… >> Mari Ostendorf: Right, so at the, and so Scalia can be, so what’s happening to some extent is if he’s talking a lot he’s driving, because they’re patterning, the lawyers are patterning after the judges. >>: It would also be interesting to see the trend, [indiscernible], if the lawyer would have entrained with the judges because of this power dynamic. >> Mari Ostendorf: That’s what this is showing. That the lawyer is entraining with, so they’re not, it’s not necessarily; I haven’t done anything with specific justices. >>: [indiscernible]… >> Mari Ostendorf: But I had, but, and remember this has a problem that I don’t have enough data yet to normalize for; I can normalize for the justices but not for the lawyers, right. Really it’s just broad statistics that we can look at, at this point. But what you can see because lots of people are arguing and there’s not a ton of time to entrain to one person. The trend is overall to the case overall. That’s all I’ve been able to do so far. >>: But it’s also interesting to see locally. Who starts at this [indiscernible] when it happens? >> Mari Ostendorf: I don’t think that we can do that without more speaker normalization. >>: Yeah. >> Mari Ostendorf: Yes it would be really interesting. But I don’t think, I don’t trust my data enough yet. Okay, so just some things, high engagement more disfluencies. We talked about some other stuff. Syntactic parallelism, one of the things, so there is this comment, this theory about the well-formed disfluencies having syntactic parallelism. The Switchboard annotation that was done for parsing incorporates all of that. You know and represents these unfinished constituents. One question is what’s going on in the data? We looked at, so this is just a very small analysis. It’s very anecdotal. But what you can see, so most of the work with parsing and disfluencies has been based on same phrase, okay. The same phrase happens a lot in the Switchboard. Almost half the syntactic parallelism, clear, simple syntactic parallelism happens a lot in Switchboard, in SCOTUS not nearly so much. >>: [indiscernible] identical? >> Mari Ostendorf: No, no, no, no, same syntactic construct. >>: Oh, construct, okay. >> Mari Ostendorf: Okay, so what’s different here is in SCOTUS the higher stakes stuff where you have more strategy things. You have expansions because they’re adding hedges or clarifying, appropriateness stuff. You have a lot more function word differences. One of the things that this doesn’t account for, the syntactic parallelism, it, the, that’s not parallel, right. That sort of stuff happens a lot, right, it’s, because it’s a lexical access thing. Those things happen, cognitive, high cognitive load, lexical access, all this sort of stuff is happening more. My point is just that the syntactic parallelism needs to be taken with a grain of salt. It’s fundamentally there but not always beautiful. The length of repairs, SCOTUS is longer. One of the things that’s very interesting is that sometimes you have this “repair” that happens. Let’s see if I can, so you say something and the repair is like a whole sentence, okay. 
Is that the practice in, and then they insert something, and then is the general practice. If you think about it as a repair this whole entire thing is in fact a repair. That is a nightmare for automatic detection. In fact this is longer than it appears because I’ve collapsed all of the other disfluencies inside it. >>: How many instances like this out there in the [indiscernible] end of the data? >> Mari Ostendorf: In SCOTUS? >>: Yeah. >> Mari Ostendorf: A non-trivial amount, in Switchboard not much. >>: Obviously your model isn’t doing enough to capture the [indiscernible]. >> Mari Ostendorf: We can’t capture that. We can’t capture it. In fact the annotators don’t know what to do with it because sometimes there’s like two sentences. >>: [indiscernible]. Okay. >>: [indiscernible] impact [indiscernible] think of the application of these kinds of things either for understanding or for translation. For translation you probably just tested this as is. My understanding probably you can talk about the content of it [indiscernible] as well, including both... >> Mari Ostendorf: The reason why this is, so for translation it’s probably a who cares what you would want to do. What we were talking about earlier this morning is you might want to put in a dot, dot, dot where that plus is to let you know the person is you know was thinking and some things going on. But for translation I would say it’s not a big deal. I think it’s… >>: But you don’t want to break it into three separate sentences at this point. >> Mari Ostendorf: Yeah, the thing where this becomes interesting is if you’re looking at any sort of social analysis. Understanding what’s going on, strategy analysis then you want to know that there is something here. It’s more of a discourse level thing. >>: Like in Cortana we just delete the whole thing [indiscernible]. [laughter] >> Mari Ostendorf: Yeah. >>: In this case [indiscernible]. >> Mari Ostendorf: Okay, so just to finish up. The implications here is, so you know looking at a noisy channel model it works really well for handling the easy types of disfluencies. Repetitions are frequent, so one of the things we’ve started doing is detect them first and deal with them. That makes, that’s kind of useful for thinking about incremental processing. It works really well with incremental processing. Being influenced by Mark Johnson, PCFGs are not good for disfluencies. The non-trivial number of long repairs suggests you need to have more sophisticated delay decision making. Other implications, I think there’s, one of the reasons this stuff is interesting is if looking at analyzing social interactions and cognitive load. That’s one of the reasons I’m interested in detecting actually the corrections. In conclusions I’m, my main point is disfluencies are not just noise. They’re interesting, they carry information. My second big point is we really need to be looking at the “speech in the wild” to see what people really do. It is important for understanding the variability both within speaker and across speaker in order to handle disfluencies correctly, and to improve our automatic detection. Speaker variability is huge, so that potentially impacts earlier findings. It also makes it hard to figure out social information and cognitive load information if you can’t norm the speaker. Then with some context control if you can do norming of the speaker we can see pretty interesting effects of stress and power. 
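One concrete way to read the "detect repetitions first" point in the conclusions above is a cheap pass that flags immediately repeated word sequences before any heavier model runs. The helper and example below are a hypothetical sketch of that idea, not the speaker's implementation, and they deliberately ignore the hard cases such as non-adjacent copies.

```python
# A minimal sketch of a "detect repetitions first" pass: flag word sequences
# that are exactly repeated by the words immediately following them, treating
# the first copy as a repetition reparandum. Hypothetical helper, not the
# system described in the talk.
from typing import List, Tuple

def find_repetitions(words: List[str], max_len: int = 6) -> List[Tuple[int, int]]:
    """Return (start, end) spans (end exclusive) of adjacent repeated copies."""
    spans = []
    i, n = 0, len(words)
    while i < n:
        matched = 0
        # Try the longest candidate copy first, down to a single word.
        for k in range(min(max_len, (n - i) // 2), 0, -1):
            if [w.lower() for w in words[i:i + k]] == [w.lower() for w in words[i + k:i + 2 * k]]:
                matched = k
                break
        if matched:
            spans.append((i, i + matched))   # first copy = reparandum
            i += matched                     # resume at the repair, so "the the the" yields two spans
        else:
            i += 1
    return spans

# SCOTUS-style longer repeat: "is that the is that the general practice"
print(find_repetitions("is that the is that the general practice".split()))  # [(0, 3)]
# Caveat: this over-fires on fluent repeats such as "that that", which is one
# reason the model described in the talk still relies on context features.
```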
Some of the things I’m doing for, looking for, or would like to do in the future is more on the semisupervised cross-domain learning. I would love to leverage parsing if I could get a parser to work on the SCOTUS data. I think it would be very interesting; I’m interested in ways to learn the finer grain disfluency categories, applying semantic analysis to them. Expanding the disfluency types for, once, if we have these expanded disfluency types can we use them for speaker modeling or analysis of cognitive and social factors? Lastly, we can apply this new way of looking at disfluencies in terms of multiple types as a way to improve parsing, to do a better job of integrating disfluencies and parsing. That’s it, so sorry for going so long. [applause] >>: When they repair like when you replace one word with another word. It seems like sometimes it’s the type of thing where the word sounded similar and, so they had sort of said the wrong thing and they were fixing what they said. >> Mari Ostendorf: Right. >>: But I’m wondering how often is it more that like the concepts are close but they’re totally different words? You kind of, like you’ve changed your mind about how you wanted to express. Is that, have you looked at the difference between those two things? >> Mari Ostendorf: I’ve tried to do some looking at that. These are less frequent, so of the… >>: Word repairs are? >> Mari Ostendorf: The, yeah the word repairs. Of the lexical access repairs most of them involve function words, okay. >>: Alright. >> Mari Ostendorf: I do have and you can, one of the things I’ve been trying to do is automatically mine the data to find those specific things. We automatically mine them to, based on phonetic similarity. Also we tried to do some you know stuff where we looked at stemming and trying to find syntactic similarity. We’ve done a little bit of mining the data for that. What we come up with hasn’t been accurate enough to do fully automatically. >>: Because it’s mostly function words so you don’t have a lot of these types of, yeah. >> Mari Ostendorf: Because, well, yeah, so a lot of it is function words. The interesting, and you know I’m not sure so I’ve talked with, to, about this with Ann Cutler and she’s the one that you know told me that she thinks the functional words are lexical access, a lot of the function word things. The, you know like it, the, you know there’s a lexical access thing. I don’t know if that’s a good example. But anyway there’s a bunch of function word things that are presumably lexical access issues. You’d have to do a much more controlled study than, I mean I can’t tell, I’ve got the transcripts I don’t know what’s in people’s head. >>: Sure. >> Mari Ostendorf: I can’t tell that stuff. I can just look at gross statistics. >>: Is the implication of this rather than the hyper [indiscernible] test that we usually think about, like translations of event. Is it to actually get the disfluencies themselves so you can make analysis like this person probably is subordinate, if you know what the conversation is, is that the kind of thing that? >> Mari Ostendorf: That is one thing that I’m interested in. I’m actually interested in both types of things. One of the, so there’s a couple reasons why it’s relevant for translation, if you’re going to throw something away you may be throwing away something useful. Knowing the correction can tell you whether you’re throwing something useful. That’s one issue. 
The other issue is that by working on all these different data sources I’ve ended up improving my performance on Switchboard. The more we understand the types of things people do, the better a disfluency detector we get. It’s relevant for that as well. But I’m arguing that on top of that it’s relevant for understanding these social things. >>: I’m just wondering more about what the application from that space would be. Like, is it the type of thing that, just the example of the… >> Mari Ostendorf: Understanding power relationships is the [indiscernible] we wanted; it tells you, is this discussion going well. One of the things people are interested in is analyzing how this is going to go if you have recordings of a negotiation, how is it going, those sorts of things. >>: But in translation you could see, if there’s a power dynamic, it might play out differently in another language, and you might want to capture that. That would be a difficult thing to do. You know, like a certain [indiscernible] that might be expressed in a particular power situation that is clear from the disfluencies. I’m not saying that’s the thing we’ll do, but it’s something you might want to do, I’m saying, okay. >> Mari Ostendorf: Yeah, yeah. >>: As you mentioned, you produce more disfluencies in a more informal environment. You said, like, in this talk here with a general audience you’d be speaking kind of more carefully. Do you think we sometimes use disfluencies as a way to signal to the audience how we feel about them? Not just the audience but the other person we’re talking to, more than just that it’s subconscious, but it’s… >> Mari Ostendorf: I think it’s mostly subconscious. I think some of the things that we do are going to be conscious, the really long ones. >>: Yeah, so I misspoke. I didn’t mean that it’s subconscious. I meant that it’s not just an accident that we’re making disfluencies, but that we might be trying to signal without knowing that we’re doing it. But the signaling is there, I guess. >> Mari Ostendorf: It has been argued that we do it purposely, that it’s purposeful. You know, I’m an Engineer, so I’m not going to go out on a limb here; I’m not a cognitive scientist. But I will say it is certainly the case that people use them. Whether they use them intentionally or whether they use them because they’ve just gotten used to this sort of “disfluencies mean X,” I don’t know. But definitely people use them, so speakers and listeners both use them. >>: Do you know if there have been studies that have shown the effect on a listener if a person has different disfluencies? For instance, when the computer is talking to a person, if it injected disfluencies every once in a while, would that have the effect of making the person feel more comfortable? Do you know if that kind of thing has been [indiscernible]? [laughter] >> Mari Ostendorf: You know, I don’t know if that is, you guys… >>: I think that is the case, for example [indiscernible]. [laughter] >>: [indiscernible] >> Mari Ostendorf: Yeah. >>: Like I would be [indiscernible]. [laughter] >> Mari Ostendorf: This is an advertisement for a talk. [laughter] >>: She wanted me to ask that. [laughter] >>: Do you think other [indiscernible] techniques like [indiscernible] and other [indiscernible] would be better [indiscernible] long-range repairs, which can get partial parsing out of it?
>> Mari Ostendorf: I’ve talked to Mark Steedman a bit about this. When I did my sabbatical in Australia I was with Mark Johnson, and Mark Steedman was visiting. We spent a bit of time talking about CCGs. His feeling, and we actually did a little bit of analysis, is that CCGs do a better job of meeting this syntactic parallelism, that the constituents you see people use are more compatible with CCG. That was the hypothesis. It mostly seemed to be bearing out, but we didn’t actually write anything up. It was just kind of anecdotal, trying to figure out what to do. We were looking at it because of my hypothesis about handling the repetitions. I think it’s an interesting direction. We looked at it a little bit, but definitely no complete study. >>: On the question about [indiscernible]. What are you missing to get something out [indiscernible]: is it one [indiscernible] alignment, was it pitch, or the lexical pitch, or is that getting us enough? >> Mari Ostendorf: If you look at work that’s been done on Switchboard, you get most of the bang for the buck from lexical features. A reason for that is that a vast number of the disfluencies in Switchboard are repetitions, right? That’s going to dominate your detection measure. If we look at more complex disfluencies, I think it’s potentially the case that prosody could give us a bigger win. The problem is that with these more interesting corpora, most of them, with the exception of ATAROS which is recent, this SCOTUS and the FCIC data, the audio quality is not good. I’ve tried doing forced alignments on them with multiple aligners and it’s just not good enough. I mean, it’s stupid things; it’s a robust recognition problem where you’d want to do the deep learning on it. It’s just stupid things like page turning that totally mess up the forced alignment. >>: The early model you talked about, the [indiscernible] model for planning and [indiscernible]. That looks like a very reasonable model, a hierarchy model. Do any people make additional models based upon that kind of hierarchy, the [indiscernible] between the two [indiscernible], to explain, you know, the kind of difference you have with [indiscernible]? >> Mari Ostendorf: There’s no computational work doing that that I am aware of. This was my sabbatical, talking to cognitive scientists. There’s not a lot of computational work. >>: In your transcripts you have several things that aren’t actually going to come out of an ASR system. You have ellipses, you have commas, capitalization. To what degree did you use those things in the systems that you built? Did they help you? >> Mari Ostendorf: I used those for the SCOTUS transcripts, where I was only doing lexical things. What I would do in hindsight is use those in forced alignment to look for filled pauses, if I did forced alignment with optional insertions. So, one of the things I’ve actually done some work on, which Lucy knows about, is taking oral readings and looking at where people make mistakes, to try to understand reading level and difficulties. There, the forced alignments were done by somebody who has a way of letting the aligner insert words, so you could allow it to find repetitions and things like that.
If I did the forced alignment that way, I could use the dot, dot, dot and the punctuation to raise the probability of those sorts of things. That’s one thing that I’d like to do next. >>: When you say repetition, typically you find “the, the.” Is repetition an exact lexical match or is it a functional match? You’re talking about “the, the”; in another language you have [indiscernible], okay. >> Mari Ostendorf: I’m using it to mean exact match. >>: Alright, lexical match [indiscernible]. >> Mari Ostendorf: Exact lexical match, with the caveat that I include fragments. I had the example “a wreck, a, a, requirement.” All of that is trying to get out that final “a requirement.” “A wreck” I’m counting even though it’s just a fragment. >>: Yeah. >> Mari Ostendorf: But you can call that a match by doing a substring. >>: That’s considered an adaptation? >> Mari Ostendorf: I am counting it, and by definition that is a repetition. >>: Whereas this [indiscernible] case would be a case of lexical access, because as you’re planning you’re coming up with, “I’m going to say a different noun. Now I have to modify.” That would be a lexical access. >> Mari Ostendorf: That would be a lexical access one. So there’s the surface form and then there are these more interesting categories. The surface form is just: are they the exact same string or not? That’s repetition, repair, restart. These other categories are things that I’m starting to work on, where I’m trying to use resources that will say, is this syntactically the same? Is this semantically the same? The part-of-speech match, you know, [indiscernible] those things you can get with a part-of-speech match. They are thought to be associated with a different level than the repetition. >> Will Lewis: Maybe we should break here. Thank you. >> Mari Ostendorf: Okay, thanks. [applause]
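[Editor's note: as a closing aside on the substring caveat in that last exchange, here is a small sketch of counting a cut-off fragment as part of a repetition when it is an orthographic prefix of the repaired word. The trailing-hyphen fragment convention and the function name are assumptions; a spoken fragment like “a wreck” for “a requirement” would need phonetic rather than orthographic matching, which is the caveat described above.]

# Minimal sketch (illustrative only): exact lexical match for repetitions,
# with a cut-off fragment allowed to match by orthographic prefix.
def is_repetition(reparandum_tokens, repair_tokens):
    if len(reparandum_tokens) != len(repair_tokens):
        return False
    for left, right in zip(reparandum_tokens, repair_tokens):
        if left.endswith("-"):                          # word fragment, e.g. "req-"
            if not right.startswith(left.rstrip("-")):
                return False
        elif left != right:
            return False
    return True

print(is_repetition(["a", "req-"], ["a", "requirement"]))    # True
print(is_repetition(["a", "wreck-"], ["a", "requirement"]))  # False: needs phonetic matching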