>> Will Lewis: Welcome everyone, this is the third Pacific NW-NLP. It looks like we have a record-breaking crowd this year; I know people are still trickling in. A few things I do want to mention. If you drove and parked, be sure to register your car. I’d hate for one of the highlights of today’s workshop to be someone having their car towed, which likely won’t happen, but just to be safe. The other thing: if you’re talking today in the next session, be sure that we get your slides loaded onto our server here or that you have your laptop checked. Please do that during the first break. For any of you talking later today, the same holds: we need to get your slides loaded. There’s also a release form that needs to be signed; it more or less says we’re going to be broadcasting your talk. We’ll also be posting the talks to the NW-NLP website hosted at Simon Fraser. If you don’t want your talk posted, that is also an option; you just have to let us know today. The AV folks are on the other side of this door here; you can check with them, or with anyone up here, during the break. All of you should have a folder that has the proceedings and everything in it. If you’re in the poster session you’ll notice there’s a number assigned to your poster. The poster rooms are down the hall, so just look for your number and go down to those rooms during the break. I’d ask you to go during the break to actually put your posters up: look for your number on the wall, and that’s where your poster should go. There are three people who can help you with the posters, though I think one isn’t here yet: there’s Chin in the back, look for him, he’ll be in one of the rooms, and Meg in the back also, she can help you too. If there are any logistical problems or whatever, they’re the people to help you with that. There’s also a map of our facility in the folder, if you’re looking for where the different rooms are; it’s a map of our conference area here. I think you’ll have it down within a few minutes, but we put a map in there just in case. There’s also a campus map, and I’ll bring that up later. You have all received lunch cards. Who said there’s no free lunch? Here’s your lunch card. [laughter] Thanks [indiscernible] for saying that, he’s the one that, okay. I’ll mention where to go for lunch and all that just before the lunch break. You have to glom on to a Microsoft employee to take you wherever it is you want to go; we’ll organize that just before lunch. Let’s see, what else do we have? There is Wi-Fi access. It’s open, just look for MSFTOpen, I think it is, or MSGuest. Oh, it’s up on the wall, okay. These guys took care of it already, I don’t need to worry about it. If you haven’t grabbed some food, there’s food behind me. Posters, talks, okay, we talked about that. Let’s see, I think I’ll go ahead and, was there anything else? I think I’ll turn it over to [indiscernible] to actually. Oh, Yashar’s doing that, okay. Thank you. Alex, oh you have it, okay, good, thanks. >> Alex Marin: Hello, good morning everyone. Can you hear me? I think my mic is working, right, okay, perfect. Welcome to NW-NLP two thousand fourteen. We’re very excited to have you here.
Several different groups actually put their hands together to organize this workshop: people from SFU, [indiscernible], and [indiscernible] who is not here, [indiscernible], people from UVC, and people from Microsoft. They really tried to organize a very good, [indiscernible] very good program. We tried to have different talks and different papers to raise interest. I’m very happy that many people registered and many people are excited to come here; we are also very excited to have you. Without further delay, I’ll go to the introduction. We have participants today from twenty-three institutes. That’s totally changed from the last time NW-NLP was organized; there is certainly a lot more interest that we can see in such a workshop. We are thankful to the people who made it here. Since it seems we want to get to know each other better, and which institute you are from, I am going to read the institutions. Please don’t be shy, raise your hand when I call your institution, so we know how many people from each part of this region, or from outside this region, came to the workshop today. I know we have a lot of people from UW, the University of Washington. How many people are here from UW? Wow, okay, shall I continue with the other institutions? [laughter] Great, a lot of people from UW. From Microsoft Research, which is part of our organizing team, okay, so far three. [laughter] But, four. >>: They’re trickling in. >> Alex Marin: Yeah, but I know that, well, [indiscernible] told me that the first thing you learn in teaching is that people don’t raise hands, so I’m expecting there are other people who saw it and didn’t raise their hand. But anyway, people from Microsoft come and go, so they will join us later. We have people from Expedia here, so, oh perfect, we have a good group of people there from Expedia. From Oregon Health and Science University, I know some people there from Oregon, great to have you here. From SFU Vancouver, perfect, a good number of people from SFU. From Nuance, I’m excited to see people from Nuance here, perfect. From Western Washington University, a good number also, so a lot of people from Washington came here today. From Amazon, people from Amazon, okay, famous Amazon people are here, so if you have any problem with your shopping or whatever you can ask them. From UBC, okay, we have a number of people from UBC. From Boeing, I was excited to see this morning actually, from Boeing. From Pacific Northwest National Lab, okay, people over there. From University of California Santa Cruz, nice, thanks for coming over actually. From the Allen Institute for AI, great, Peter Clark and colleagues are here. Okay, I finished this page, right, twenty-three institutions, so a lot of work. From Appen, people from Appen, okay, one person there, great. From Google, okay, from Google in Microsoft. [laughter] From Genome, yeah, one person from Genome in Vancouver made it. From Intel, okay, you have a good relationship with Microsoft, I know. [laughter] NWTA, not yet. Univera, okay, one person at the back. From USC, no, not yet. Point Inside, okay, we have one person, thank you for making it here. University of Buffalo, perfect, one person here. Okay, so we have different people from different institutions, twenty-three. I’m happy to see you all here; we’re all happy for that. I am going to give you a very brief introduction about how many papers we had and how many we selected. In total we had thirty-eight on-time submissions.
Out of these thirty-eight there were twenty long papers and eighteen abstracts or extended abstracts, which were new work. The way it was organized, the long papers are recent work that has been published elsewhere, at other NLP venues, while new work was submitted as an extended abstract. We had one unusual submission that I’m telling you about later. [laughter] Okay, now is later. The unusual submission was interesting: we received an automatically generated paper. Thanks to our program committee and reviewers, we rejected the paper. [laughter] It means we check all the papers; whoever submits can see their reviews. Actually, that made us kind of excited, because NW-NLP is growing so much that people are starting to recognize us and send automatic papers, probably to test whether they pass or not. So we had an unusual, kind of, let’s call it automatic paper submission. Yeah, so that’s all for now, and we have lots of talks, so let’s start. >>: I just wanted to say one thing before we go on to the actual program. There are three papers in the program today with Ben Taskar as a co-author. I just wanted to acknowledge that he recently, not so recently now, passed away at an early age. It’s unfortunate; we could have benefited from his long-term input to this community. I just didn’t want to let that pass, because this may be one of the last places where his name appears in a program. There might be other papers in the pipeline; I’m sure there will be. Of course he was so famous and so influential that there have been a lot of other places where his work has been acknowledged, but I just wanted to say something here. There’s a website for donations to his family as well; you can search for his name and it’s the second hit on most search engines. I would encourage you to go and visit it. If you’re a new student and you haven’t heard of Ben Taskar, you’re going to use his papers in your work, it’s almost guaranteed. I think we have an exciting day of papers and posters. I think the first one is in five minutes, or? >> Will Lewis: Eight minutes, yeah. >>: Yeah, so should we let people… >> Will Lewis: I think let people grab some more snacks and then we’ll start. >>: Then back here in eight minutes, so don’t run away, come back in eight minutes. [laughter] >> Will Lewis: Okay, so Lucy is going to be chairing the first session and she’ll be introducing the speakers. >> Lucy: Good morning everyone. Our first speaker is Kenton Lee. He’ll be presenting Context-dependent Semantic Parsing for Time Expressions, a paper that will be presented at ACL twenty fourteen. Let’s start, alright. >> Kenton Lee: Thank you. Good morning, I’m Kenton Lee, a grad student from the University of Washington. Today I’ll present my project on Context-dependent Semantic Parsing for Time Expressions. This is joint work with Yoav Artzi, Jesse Dodge, and Luke Zettlemoyer. To give you some motivation for the task, let’s look at an example document. We see here that Biden will attend a lunch with Democrats on Thursday while Obama will serve as the retreat’s closing act Friday. There’s a lot of temporal information that we want to extract from this text. In the long term we might want to extract all events from the text into a timeline.
For this project we focus on a specific part of this task, which is just extracting the time expressions from the documents. In this case we would want to find and understand time expressions such as next week, Thursday, Tuesday morning, etcetera. We can think of this as two tasks. The first task is detection, where we find all mentions of time expressions. In the second task we take these detected mentions and resolve each to a time value in a standardized format. To be concrete about the task, we can represent things as dates, durations, and approximate times. Notice that the last three examples, Thursday, two days earlier, and that week, are underspecified in isolation, and we will need to incorporate some kind of context to understand what they mean. By context I mean the text that is beyond the mention boundary. This is one of several challenges that we identified when we tried to do this task, and I’ll go through those challenges in detail. First, here’s an outline of the talk. I just described the task. Then I’ll talk about some challenges that we have to face and our approach to addressing each of them. Lastly I’ll show some results from experiments comparing our approach to existing state-of-the-art systems. As I previously mentioned, the first challenge is context-dependence. We see in this example the time expression Friday in two contexts. We can think of Friday in isolation as meaning the sequence of all Fridays. To choose the correct Friday out of all of these Fridays we need two pieces of information from context. First, we need to know when this was said or written. Second, we need to rely on contextual cues such as verb tense to determine whether we want to choose a Friday occurring after the document time or before the document time. For the second challenge, we see that time expressions are often compositional. Say in this example, a week after tomorrow, we need to first understand the meaning of tomorrow, then we have to understand how long a week is, and then we have to be able to combine these meanings to form the final meaning of the full phrase. The last challenge is related to the task of detection. A lot of phrases can be temporal or non-temporal depending on their context. In this example, two thousand is a time expression in the context of she was hired in two thousand, whereas it’s not in the context of they spent two thousand dollars. Next I’ll go through how we approach the task and how our system addresses each of these challenges. We decompose the task into two steps. First, we take the input documents and detect where all the time expressions occur. In the second step, resolution, we take each of these mentions and resolve it to a time value. A key component of this approach is that we define a temporal grammar that is used in both the detection step and the resolution step. Using this temporal grammar we can do semantic parsing, which gives us a formal meaning representation of time to work with. This allows us to solve the issue of compositionality in time expressions. Let’s look at how we define this temporal grammar. We choose to use a combinatory categorial grammar. We can think of this as a function from a phrase to a set of formal meaning representations in the form of logical expressions. We choose to use CCG because it’s a very well studied formalism, and we know of a lot of existing algorithms that we can reuse to solve this task. This is an example of a CCG parse.
Don’t worry about the details here. The takeaway of the slide is that at the very top we have the input phrase, and we have lexical entries that pair words with their meanings. We compose these meanings to derive the final logical expression at the bottom, which is the output. A nice property of the domain of time expressions is that the vocabulary is relatively closed-class, so it’s easy for us to manually design these lexical entries at the top. That’s exactly what we do for our temporal grammar: we engineer a lexicon. We include in this lexicon things like units of time, named times, and function words. That was how we define our temporal grammar. Now I’ll talk about how we use this grammar in both the detection and resolution steps. Let’s start with detection. To recap, in the detection component we are taking a document and detecting where all time expressions occur. The way we approach this is to parse all possible spans of text using our temporal grammar. Any span that belongs to this temporal grammar we consider as a candidate time expression. The way we designed the grammar, it has very high coverage at the cost of over-generation, so on top of this we have to use a linear classifier to filter out false positives. To illustrate this process, let’s look at four examples: twenty-four hours, we’ll have lunch, and two thousand in its respective contexts. The first thing we do is use our temporal grammar to give us a set of logical expressions. In the case where the set is empty, such as for we’ll have lunch, we assume that this is not a time expression. For the cases where the set is not empty, we use our linear classifier to make a final decision on whether or not to reject it as a time expression, which we do in the case of they spent two thousand dollars. For our linear classifier we use a logistic regression model with L1 regularization. For features we include syntactic features, lexical features, and indicators for tokens that are particularly discriminative, such as prepositions near the phrase like until, in, during, etcetera. That was how we use our temporal grammar to do detection. Now I’ll talk about how we also use it to do resolution. To recap, the task of resolution is to take detected mentions and resolve each of them in context to a particular time value. To resolve an input phrase we define a three-step process which we call a derivation. In the first step we use our temporal grammar to give us logical expressions that are independent of context. Then, in order to incorporate contextual information, we transform these logical expressions using what we call context-dependent operations. Lastly we have a deterministic process that resolves each of these to a specific time value. Because for every input phrase there can be multiple derivations, we reason over all of these steps using a log-linear model that allows us to score the derivations and choose the most likely one. We can go through an example of how we do this. This is the time expression Friday in the context of John arrived on Friday. The temporal grammar gives us the context-independent logical expression Friday, representing the sequence of all Fridays. We have to choose one particular Friday from this sequence. We apply one of three context-dependent operations, allowing us to choose the last Friday, this Friday, or the next Friday relative to some reference time.
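To make those three context-dependent operations concrete, here is a minimal sketch in Python. It is an editor's illustration rather than the authors' UWTime code: the date arithmetic is deliberately simplified, and the function name is an assumption.

```python
import datetime

FRIDAY = 4  # Monday is 0 in Python's weekday() convention

def resolve_friday(reference: datetime.date, operation: str) -> datetime.date:
    """Resolve the underspecified expression 'Friday' relative to a reference time.

    operation is one of 'last', 'this', or 'next', mirroring the three
    context-dependent operations described in the talk.
    """
    days_ahead = (FRIDAY - reference.weekday()) % 7  # days to the coming Friday
    if operation == "this":
        return reference + datetime.timedelta(days=days_ahead)
    if operation == "next":
        return reference + datetime.timedelta(days=days_ahead + 7)
    if operation == "last":
        days_back = (reference.weekday() - FRIDAY) % 7
        return reference - datetime.timedelta(days=days_back or 7)
    raise ValueError(f"unknown operation: {operation}")

# Example: "John arrived on Friday", hypothetical document date of Monday 2014-05-05.
doc_time = datetime.date(2014, 5, 5)
candidates = {op: resolve_friday(doc_time, op) for op in ("last", "this", "next")}
# A log-linear model over contextual features such as verb tense would then
# pick one candidate; the past-tense "arrived" favors the 'last' reading.
```

In the full system the reference time itself is also chosen, as described next.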
But these logical expressions are still underspecified: we have to decide where this reference time comes from. So we have a second step. We can replace this reference time with either the document time or another time expression that occurred earlier in the document and that we’ve already resolved. Each of these final logical expressions we can then resolve to produce a time value. With our log-linear model we choose the most likely interpretation, which here is the last Friday relative to the document time. As I previously mentioned, we use a log-linear model, and we do parameter estimation using AdaGrad. For features we include things like the part of speech of the governing verb, which gives us the corresponding tense. We also include the type of the reference time, which can be either the document time or a previous time, and lastly the type of the time expression, such as day of the week, day of the month, year, etcetera. I just gave an overview of our approach, which is to use the hand-engineered temporal grammar to detect and resolve time expressions. Now I’ll talk about some related work and some experiments we did to compare our approach with previous state-of-the-art systems. We mainly compare ourselves to HeidelTime, which is the state of the art for the full end-to-end task of both detection and resolution. Our approach is most similar to the SCFG used in ParsingTime, which used semantic parsing to resolve given time expressions. We evaluate the systems over two corpora: TempEval-3, which consists of newswire text, and WikiWars, which consists of history articles. The way we evaluate is, for detection, we consider a predicted mention to be a true positive if it overlaps with some gold mention; for the resolution task, the predicted time value has to be both correctly detected and correctly resolved. This gives us two sets of precision, recall, and F1 values. Here are the results over the two corpora. On the left we see F1 scores for detection and on the right F1 scores for resolution. We show an improvement across the board. For the end-to-end task we see an error reduction of twenty-one percent on TempEval. On the [indiscernible] WikiWars we see an error reduction of thirteen percent. An advantage of our system is that we can trade off between precision and recall because we produce confidence values. This is very useful for downstream applications that might want to rely on different points of this purple curve. Lastly, we wanted to know how important context was for our system. We ablated our system’s ability to refer to information beyond the mention boundary. We find that the degradation in performance is much more severe on WikiWars than on TempEval-3. This is because in news things generally occur near the document time, whereas in history articles we have long, complex narratives where we have to model context in order to properly resolve the time expressions. To sum up, we produced a state-of-the-art approach for extracting time expressions. We are the first to use semantic parsing to do detection, and we are able to jointly learn a semantic parser and how to use context to understand time expressions. For future work I hope to jointly model both times and events, as mentioned during the motivation. This is just a slide for advertisement: we are releasing this tool, which we call UWTime.
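As a rough illustration of the overlap-based evaluation just described, here is a minimal sketch in Python. This is an editor's illustration, not the official TempEval-3 scorer; mentions are represented as hypothetical (start, end) token offsets.

```python
def spans_overlap(a, b):
    """True if two (start, end) token spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]

def detection_prf(predicted, gold):
    """Precision, recall, and F1 for detection, where a predicted mention
    counts as a true positive if it overlaps some gold mention."""
    tp = sum(any(spans_overlap(p, g) for g in gold) for p in predicted)
    matched_gold = sum(any(spans_overlap(g, p) for p in predicted) for g in gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical example: two predictions, two gold mentions, one overlapping pair.
print(detection_prf(predicted=[(3, 5), (10, 11)], gold=[(3, 4), (20, 22)]))
```

For resolution, a prediction would additionally need the correct normalized time value.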
UWTime is implemented using the UW Semantic Parsing Framework. It will be available soon on my website, so look out for that. Thank you. [applause] Questions? Yes. >>: You had examples like somebody would arrive on Friday, okay. Do you try to handle things like, yesterday John said that Mary would arrive on Friday? I mean, how complex do you go in your analysis, how complex a context will you analyze? >> Kenton Lee: We use a very simple model: when we look backwards towards previously resolved mentions we just look at the previous one. In your example of… >>: Yesterday John said that Mary would arrive on Friday. >> Kenton Lee: Right, so we would look as far as yesterday and nothing before it. >>: Okay, so to determine when Friday is you do pay attention to yesterday? >> Kenton Lee: Yes we do. We anchor this Friday relative to yesterday as one of the choices during resolution. >>: Okay, so you look for all of your time expressions in the utterance and then try to determine how one is related to the other? >> Kenton Lee: We don’t look at all of them, we just look at the last one that was resolved. >>: Last resolved, okay. >> Kenton Lee: Potentially we could extend the model to have it look at more than just the last thing, but we started with this. >> Lucy: Other questions? >>: I was wondering how good the parser is at finding, do you have detection results where you can also take a statistical parser and then convert it into a CCG derivation? I was wondering how good it is at finding sort of the boundaries? >> Kenton Lee: I’m not sure this answers your question, but when we look at how high our coverage is, we see that our parser produces a correct match ninety-six percent of the time on a development set, so it’s very feasible for us to do detection using our parser. Does that answer the question? >>: Yeah. >>: You mentioned that you hand-code your initial data. Is there any way to speed up that process? I’m wondering if there are language resources out there that you could make use of to at least assist in that phase, like for example taking language data that’s marked up with TimeML or other time representation languages as a boost. >> Kenton Lee: Right, a logical next step for this project is definitely to automate this task of developing a lexicon, which would also be great for doing this in other languages where we don’t necessarily want to engineer multiple lexicons. >> Lucy: Additional questions? >>: I have a question about the complexity of these time references; you can kind of use the whole power of language to refer to time per se. One year after Picasso painted his [indiscernible], so where do you stop? >> Kenton Lee: We stop at events. So in your example, a year after some event, we can’t really handle that in our system; hopefully we will in some future work. But we are able to handle things that anchor on other time expressions, right. >>: Well, sometimes you have a combination, [indiscernible] which will harness not only the time reference resolution but also the [indiscernible] reference. For instance, you know, John arrived on that Friday. That Friday, yeah, which Friday? Do you handle those kinds of [indiscernible] as well? >> Kenton Lee: Right, our model of context is actually designed to handle these cases where you just talked about some Friday. When you refer to that Friday we’re able to look backwards at those previously mentioned time expressions. >>: I see. >>: You are marking temporal events, right.
Let’s say there is a sentence that says the conference is on this day [indiscernible]. In that case, do you first resolve Thursday and then resolve Friday? But for the day before the conference, would you go back and do a back [indiscernible]? >> Kenton Lee: We actually don’t build in that kind of reasoning. We don’t think about the conference at all; we just look at Thursday and Friday and what they mean independently. Yes? >> Lucy: Question? >>: [inaudible] I’m curious, what if your example is before next Friday rather than just Friday? Is that resolved to the Friday being referenced or [indiscernible]? >> Kenton Lee: In the case of before, that is actually part of the time expression itself. >>: Right. >> Kenton Lee: In the annotation we use you can actually mark a time expression with modifiers. That’s how we indicate that it’s before a particular time. >>: And how does it resolve to a particular point on a timeline? >> Kenton Lee: I think this is a limitation of the annotation we use; it doesn’t have a very good representation for the ambiguous cases, unfortunately. Yes? >> Lucy: One last question. >> Kenton Lee: Go ahead. >>: Adding on to what she said, when you say this will happen before that, so in the future from now and prior to the following Friday, do you get that range, or do you just get [indiscernible]? >> Kenton Lee: We have a reference template, so it’s before a particular, sorry, you’re saying the difference between… >>: I mean, in reality you’re saying that it will happen sometime between now and next Friday when you say this will happen before next Friday. Do you get a range of… >> Kenton Lee: I see. >>: Because it’s constrained on both sides, rather than just saying before Friday. >> Kenton Lee: We do not. We only know that it’s before that Friday. >> Lucy: Let’s thank the speaker again. [applause] Our next speaker is Yashar Mehdad. He’ll be speaking about Query-Based Abstractive Summarization of Conversations. The word abstractive is very interesting to me. This paper will be presented at ACL two thousand fourteen. >> Yashar Mehdad: We’re working on abstractive summarization of conversations. This talk will be about that, specifically about a query-based application of abstractive summarization. Basically, why are we interested in conversations? A lot of conversations are generated every day in our lives, every day on the internet. You can find many forums, many blogs, many social media websites where users generate content. The time when users were passive on the internet has passed; nowadays people go and generate content. You can see from these figures how things changed from two thousand six to two thousand twelve and how much data we are generating every day. If we go to two thousand fifteen, which is just around the corner next year, we’re going to have big, big data, about eight point five billion terabytes, that we have to deal with. What do we do if we want to go through that data? If we want to understand it, if we want to search it, if we want to really have a look at each part and see what it’s talking about? Of course, going through all this data and digging into each piece gives us a big problem of information overload. One of the ways we can approach this problem is summarization, automatic summarization. Given a conversation, we can have a system automatically summarize that conversation.
In automatic summarization we have different areas of work: extractive versus abstractive, and generic versus query-based. What do we mean by extractive and abstractive? Extractive summarization has been around for a long time: people extract some sentences from the original document, saying okay, these are the significant sentences in this document, collect them, and give them to the user, saying this is a summary of the document. But that’s not how a human does it. A human reads the document, understands it, and tries to generate a text, right, to write a text as a summary. That is what we want to do, and it’s called abstractive. Why? Because a lot of research has shown that users prefer abstractive summarization; it’s what a human does, so humans prefer abstractive summaries. What are generic and query-based? In the generic case you try to summarize a text and present the meaning of the whole text to the user. In the query-based case the summary is generated based on a question, a query, that the user asks. A conversation or document can talk about different things; when a user has a question they can ask it, and the summary will be generated based on that question. Our work is mainly on abstractive query-based summarization of conversations. It is kind of similar to the previous work of Wang et al., but with some differences that I’ll tell you about. Many query-based summarization systems since two thousand four have worked on the DUC query-based multi-document news summarization datasets, which were mainly news. As you know, news is very well structured, very well polished; it’s clean text, without the noise of real user-generated content or conversations on the internet. The queries there are complex questions; they are looking for specific information needs, and they are written by expert annotators. For example, one question or query could be: “How were the bombings of the US embassies in Kenya and Tanzania conducted?” That is a very specific information need. But when we want to ask for something, do we always know exactly what information we are looking for? When we’re dealing with conversations, do we always generate such complex questions? What kind of queries are we looking at, what kind of queries are we talking about? Let’s say that you are going through a series of reviews of a product and you want to know about a certain aspect like the camera and lens. This query can be only one word, or it can actually be a phrase, like for example, “the new model of an iPhone”, right. This is more of a phrase-based setting, which has not really been looked at much in the past. If you want to compare the two: complex queries have specific information needs, they need a precise formulation of a question or query, and they are very well structured, as in the previous datasets. Phrasal queries are less specific in terms of information need; at the same time they are used very much for exploring different texts or different summaries. They are less topically focused, so they can be more general, and they are less structured, with less context. In this work we are focusing, for the first time, on phrasal queries. We define a phrasal query as a concatenation of two or more keywords.
We believe that this is more relevant for the conversational data we are focusing on. Of course in this work we face many challenges. One is that we’re dealing with conversational data, and you all know what challenges we have there: we have noise, we have less structure because the text is not edited, just produced, and we have many acronyms, many problems. We have to deal with phrasal queries that don’t have much context; they are also less structured, so that limits our choices as well. We are trying to produce abstractive summaries, so we have to generate language; it’s not as easy as just extracting some sentences from the text or the conversation. And since this is new work we don’t have much annotated data; we have very limited annotated data, and that motivates us to go for a more unsupervised approach. Following the challenges, we have the following contributions. In this work we propose the first abstractive summarization system that is based on phrasal queries. We have an essentially unsupervised model. And our system works across different conversational domains; we are not focusing on only one. If you look at this framework at a glance, you can see our system has three different parts. The first part tries to do what an extractive system does: find the significant sentences and extract them. In the next step we filter those sentences, because there are many redundant ones and many that are not really significant or are less informative than others. The last step is the abstract generation, the language generation we talk about in abstractive summarization. Those are the three parts I am going to talk about in more detail. This is our framework; I’ll walk you through each phase one by one. For utterance extraction, the aim is to extract the utterances from conversations that are important for us. What do we have to fulfill? Two things: first of all, the sentence should capture the content of the whole conversation, and at the same time it should be related to the query posed by the user. For each of these we extract some terms. The first set we call signature terms; these show the important terms, the important topics, that are discussed in the whole conversation. We extract them using the log-likelihood ratio, with an associated weight; this has been shown to be effective in previous work. For the query terms, we extract the content words of each query and expand them using some knowledge; in this case we use WordNet synonym relations to expand our queries. So we have a set of words. Now we want to score each utterance based on the different terms that we have here. We have a query score, how relevant the sentence is to the query, and we have a score over signature terms. Then we combine them in a linear combination, using coefficients that can be tuned. Then we go to redundancy removal. For the first time, actually, we decided that we are not going to use MMR, the very popular method for redundancy removal; we are using semantic relations, entailment relations, in our framework. To recap entailment: entailment is a directional relation between two texts.
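As a rough illustration of the utterance scoring just described, before the entailment-based filtering, here is a minimal sketch in Python. It is an editor's illustration; the example log-likelihood-ratio weights, the expanded query set, and the coefficient lam are assumptions rather than values from the talk.

```python
def utterance_score(tokens, signature_weights, query_terms, lam=0.5):
    """Score an utterance as a tunable linear combination of a query-relevance
    score (overlap with the WordNet-expanded query terms) and a signature-term
    score (sum of log-likelihood-ratio weights of topic terms)."""
    n = max(len(tokens), 1)
    query_score = sum(tok in query_terms for tok in tokens) / n
    signature_score = sum(signature_weights.get(tok, 0.0) for tok in tokens) / n
    return lam * query_score + (1.0 - lam) * signature_score

# Hypothetical example with made-up LLR weights and an expanded phrasal query.
llr_weights = {"lens": 2.3, "camera": 1.9, "battery": 1.1}
expanded_query = {"camera", "lens", "photo", "picture"}  # e.g. via WordNet synonyms
print(utterance_score("the camera lens is great".split(), llr_weights, expanded_query))
```

The top-scoring utterances then move on to the entailment-based redundancy removal described next.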
We say one text entails another if, by reading the first text, you can infer the second one; you can say the second one is true based on the first one. I have one example here: the technological term known as GPS was incubated in the mind of Ivan Getting, and then, Ivan Getting invented the GPS. That’s a true case of entailment. How do we use entailment? We train an entailment model using features and a previous dataset, with an SVM classifier. Then how do we use it? Let’s say that we have utterances, right. We compose an entailment graph over the extracted utterances. We label the relation between each pair of utterances or sentences as unidirectional entailment, bidirectional entailment, or unknown. If two sentences entail each other in both directions they are bidirectional entailments; they are essentially semantically equivalent. If the entailment holds in only one direction, it means that, for example, sentence or utterance C is more informative than A and B. The others are unknown. We use that information to filter some of the utterances. How? We say that if we have semantically equivalent sentences, one of them can stay in; and if some sentences are more informative than others, the more informative sentences are more relevant to our summarization framework, so we keep them and prune the others. As you see, out of seven sentences, after running the entailment graph we end up with four. The filtered sentences bring us to the next stage, which is abstract generation. Abstract generation is composed of three parts. The first part is clustering. We have a list of utterances and we want to cluster them into different groups. We use lexical clustering, because that also helps us in the next stages. We use a simple clustering algorithm, K-means, with cosine similarity over tf-idf scores. Now we have a set of clusters, each cluster containing different sentences. In the next step, for each cluster, what do we want to do? We want to merge and fuse the sentences of each cluster and generate one sentence out of them. For that we propose using a version of the word graph model. Why a word graph? Why are we not using a specific natural language generation system or sophisticated syntactic approaches? Because we are dealing with noisy conversations, and we cannot go deep into syntactic and structural analysis. The word graph model was previously introduced by Filippova; we extended it into a new version with some modifications, using semantic relations between the words. We generate the graph for each cluster in this way: we have a start and an end for each utterance, and at each step we add a new sentence to the graph. If the nodes are the same word we merge them; if they are synonyms we merge them; if they are connected by hyper [indiscernible] we merge them using WordNet. Then we end up with a graph from start to end. Once we have the word graph for each cluster, our job is to find the best path getting from start to end. How do we do that? We go to path ranking. First we prune the paths with no verb, because they are definitely not grammatically correct for us. Then we score the rest on different criteria: query focus, readability using a language model, and the path weight; you can go through the details of the formula in our paper, or you can ask me later on. We add them up together.
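Here is a minimal sketch in Python of the path-ranking step just described. This is an editor's illustration, not the authors' implementation: the exact scoring formula is in the paper, and the verb check, lm_logprob, and path_weight below are simplified, hypothetical stand-ins.

```python
def best_path(paths, query_terms, lm_logprob, path_weight):
    """Pick the best word-graph path for one cluster.

    paths is a list of token lists. Paths with no verb are pruned first, then
    each remaining path is scored by query focus + language-model readability
    + path weight, and the top-scoring path becomes the cluster's abstract
    sentence."""
    def has_verb(path):
        # Placeholder heuristic; a real system would use a POS tagger here.
        return any(tok in {"is", "are", "was", "were", "be"} or tok.endswith(("ed", "ing"))
                   for tok in path)

    def query_focus(path):
        return sum(tok in query_terms for tok in path) / max(len(path), 1)

    candidates = [p for p in paths if has_verb(p)]
    if not candidates:
        return None
    scored = [(query_focus(p) + lm_logprob(p) + path_weight(p), p) for p in candidates]
    return max(scored, key=lambda s: s[0])[1]
```

Summing the three terms mirrors the additive combination of criteria described in the talk.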
We have a score for each path, and we pick the top path selected by these criteria as our abstract for each cluster. Each cluster produces one abstract sentence, and all the abstract sentences together form the generated summary. For the experiments, we experimented on different datasets. For chat logs we have some query-based summaries, so we use them for automatic evaluation. We also use meeting transcripts and email threads; for those we didn’t have any, so we had to go for manual studies. I am going to talk about the results very briefly. We can see that our abstractive summarization, by ROUGE-1 F-1 score, is actually outperforming all the other baselines, while on ROUGE-2 it does not perform that well. The main reason is that in the word graph generation we sometimes change some words, so in the bigram matching that score will be lower for the abstractive output. We can see that our first phase, utterance extraction, still outperformed other extractive models, and that Biased LexRank, a previous query-based summarization system, was not performing as well as our system for conversational data. We also ran a user study for our manual results. We found that the human judges preferred our system sixty to seventy percent of the time; our system outperformed the other baselines. Again, in the manual study the grammatical correctness was quite acceptable, sixty to seventy percent, while for the meetings it was lower, because the meeting transcripts coming from automatic transcription were noisy; only about fifty percent of the original meeting transcripts were correct themselves. In conclusion, I presented abstractive summarization using phrasal queries. The first phase is mainly a model for extraction; then we integrate semantics in the next phase; and we introduced a word graph model with a ranking strategy using minimal syntax. We got very promising results over various conversational datasets. For future work we are thinking of incorporating more conversational features like speaker information and speech acts, generating more coherent abstracts, and resolving some coreferences. I invite you to come to our posters as well and also look at a demo from our NLP group at UBC. Thank you very much. [applause] >> Lucy: Alex, could you come and set up? We’ll have one question. I know I have many questions, but we have the whole day to ask Yashar and his co-authors. But Pete, did you want to ask a question quickly? >>: Yeah, this is very interesting. >> Yashar Mehdad: Thank you. >>: [indiscernible] can see that your word graph ideal path is grammatical? >> Yashar Mehdad: Is… >>: The ideal path you define with the word graph, does it always produce a grammatical sentence? >> Yashar Mehdad: Not always. >>: Okay. >> Yashar Mehdad: That’s why we actually check the grammaticality. We included that in the results, so I can show you later; it depends on the dataset and it depends on the nature of the data. For example, for meeting transcripts there are more errors, but for others it’s less, so we could get seventy to eighty percent of the generated paths correct in terms of grammaticality. That’s very good for an abstractive summarization system, actually. >>: Was that a constraint, that it’s grammatical, by checking it [indiscernible]? >> Yashar Mehdad: Yeah, exactly. >>: Yeah.
>> Yashar Mehdad: Exactly, but we still check the grammaticality [indiscernible] through a language model. That is also a good filter, right, because in the path ranking you have the language model and the weights at the same time, so the weights also check how bigrams and trigrams are connected. That’s also a good grammaticality metric, let’s say. >> Lucy: Let’s thank the speaker again. [applause] >> Yashar Mehdad: Thank you very much. I will answer all other questions in the break. Thank you. >> Lucy: Yes, okay, so our last speaker of this session is Alex Marin, or am I pronouncing the last name right? >> Alex Marin: Yep. >> Lucy: Okay, talking about Domain Adaptation for Parsing in ASR. This work will be presented at ICAST. >> Alex Marin: Yeah, thank you. This is work with my advisor Mari Ostendorf at the University of Washington on domain adaptation for parsing and speech recognition. Speech recognition has been used in many applications over the years, from call center applications to, more recently, voice search and personal assistants such as Cortana, Siri, or Google Now. We’re working on a similar but not exactly the same kind of system: a speech-to-speech translation application with a [indiscernible] system to resolve ASR errors by interacting with the user, asking clarification questions whenever something is unclear to the system. In particular, we’re looking at correcting errors automatically when we can, so that we only ask about the errors that we cannot correct automatically. We’re focusing on out-of-vocabulary words in this talk, but we’ve looked at other kinds of errors as well. To get a sense of what kind of errors you might see, there are a couple of examples here. For example, here Litanfeeth is an OOV, but we see that the error region extends to the neighboring words. What we’d like to do is correct the incorrect word ‘at’ and then mark the rest of the error region as an OOV, so the system could ask to have it replaced with a different word or have it spelled if it’s a name, etcetera. Similarly, here reframe is an OOV. It is replaced by a different word, so we’d like to mark this as an out-of-vocabulary word and then remove the filled pause, because it doesn’t add any meaning to the sentence. The way we do this is by looking at the ASR output and using a classifier with confidence cues, but we also integrate information from parsing. In particular, what we want to do is model the syntactic anomalies in error regions. We do this by working with confusion networks, which is a different approach from what other people have done: people have used lattices and n-best lists before. We’re working with the full confusion network structure and adding error, in this case OOV, arcs on each slot of the confusion network. Thus we have to have a parser that can handle these arcs as well as the ASR insertions, or null arcs, which have to be introduced as part of the confusion network generation. What we’d like is that after we parse the entire confusion network structure we get a new one-best path through the network. Instead of having the incorrect word fame, we’d find that there’s an OOV region whose syntactic head is a verb, which allows us to ask a meaningful question about it. Also, instead of keeping the filled pause, we’d be able to remove it. The contributions of this work are twofold. First, we look at using domain adaptation to improve the impact of parsing in our error detection and correction strategies.
Second, we add additional features that capture the reliability of the parse to further improve the detection task. Why do we need domain adaptation? Previous work has looked a lot at parsing on the target domain, for example conversational telephone speech or broadcast news. We don’t really have that luxury here, because we don’t have treebank data available for our domain. There are treebanks in other conversational domains like Switchboard, but as you can see from these examples, the Transtac data, which is what we are working with, tends to be rather different from Switchboard. There are a lot fewer disfluencies in our data, whereas in Switchboard you have a lot of disfluencies, a lot of filled pauses, and the sentences tend to be a bit more rambly, and so on. We’re going to look at trying to adapt our parser, trained on Switchboard, to the target domain, which is the Transtac data. We’re going to do this in two ways. We use self-training to capture the vocabulary and sentence structure of the target domain, as well as to capture confusion network structural information such as dealing with null arcs. We also use a task-supervised approach for modeling the ASR errors. An overview of the system that we’re using is shown here; it is a three-stage process. We start with a [indiscernible] confusion network from the decoder. First we do a baseline error classification to annotate the confusion network with error, in this case OOV, arcs. This gives us essentially a prior on each slot as to whether that slot contains an error or not in the confusion network. The annotated confusion network is then fed into two different parsers: one that doesn’t know anything about errors, which is used for rescoring, and one which does know something about errors in ASR. The combination of those two parsers allows us to extract additional features, which are then used in a final round of error classification to give us the final OOV or error decisions. In this talk I’m going to focus on the parsing, so both the parser adaptation, done in two ways, and the extra features that we extract. To talk about parsing confusion networks in a bit more detail, we start with a standard factored parser model; we’re using the Stanford parser. We start with a probabilistic context-free grammar trained on a conversational telephone speech treebank. We generate k-best trees from the confusion network; the k-best trees are converted to dependencies and then rescored using the dependency model. This gives us a one-best tree over the entire confusion network. To the standard approach we have to add two sets of rules. We add rules for parsing null arcs in the confusion network: essentially, for each non-terminal in the grammar we add a couple of rules, and then one to generate the actual null arcs. The error model is only added to the error parser. Here we’re essentially adding an error category for each syntactic category, so each constituent in the grammar, and then we have a couple of rules to grow the error region and to actually do the generation. Looking at the adaptation methods, first self-training. The goal of the self-training approach is to adapt the vocabulary and sentence structure to the target domain as well as to deal with the null arcs better. We’re using a fairly standard approach: we iterate over the data multiple times, starting with the parser trained on just the raw treebank. We then parse all the unlabeled data.
We have both speech and text data. We add the most confident trees to the treebank so that we can retrain with more data in the next iteration. The threshold for what we consider most confident is tuned at each iteration separately on a development set. We stop whenever we no longer get any significant improvement; in practice that tends to be about three to four iterations, or until convergence. The second approach we use is weak task supervision. This is primarily geared towards adapting the scores for the error arc rules, which don’t appear in our treebanks at all, but we can optionally also use it to improve the scoring of the null arc rules; we’re going to have experiments that look at this as well. The approach is to augment the probabilistic context-free grammar with a log-linear model, which is only used to score the error rules and the null arc rules. As features we use the presence or absence of these rules in a derivation. All the features are local, and as the objective we use the word error rate of the training set. This is the task supervision we’re using: essentially, rescoring is an auxiliary task for training the parser. We could use other tasks, like the detection task, instead of rescoring, but we found that rescoring tends to work slightly better. The training is done using an averaged perceptron algorithm. Again, this converges fairly quickly, in about five to ten iterations at most, and we get pretty good results. Finally, to talk about the features that we extract from the parsers: what we’re trying to do is capture differences in the parse trees between the non-error model, the model that just parses confusion networks, and the model that parses confusion networks with error arcs. We have two types of features: dependency tuples, which compare the local structure in the two trees, and inside scores, which look at the reliability, or the confidence, of the non-error parser in error regions. Here we’re essentially getting two scores: one that looks at just the local tree around a particular slot that’s part of the error region on the error side, but of course not in an error region on the non-error side, and one for a larger tree that captures that slot as well as the boundary of the error region. For the experimental setup, we are using the BOLT speech-to-speech translation system from the SRI team. The data is a mixture of military and civil infrastructure domains. We’re focusing on the English side; we haven’t looked at Arabic. We have about sixteen hundred utterances of speech data, split sixty, twenty, twenty into speech train, dev, and eval sets. We also have about eighty thousand utterances of language model training data, which are used for self-training. The actual treebank that we’re using is drawn from conversational telephone speech, so Switchboard and Fisher; we have about twenty thousand utterances there. The ASR system we’re using is a hybrid deep neural network and Gaussian mixture model system. We’re using the DNN confusion networks and augmenting them with the one-best GMM output; this was the configuration that our collaborators at SRI found to work best. The vocabulary size of the ASR system is about thirty thousand words. We also use a confidence estimation process, also DNN-based; these confidences are used as features in the error classification tasks.
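A minimal sketch in Python of the self-training loop just described. This is an editor's illustration under the stated assumptions: train_parser, parse_with_confidence, and eval_dev are hypothetical stand-ins for the actual parser training, confusion-network parsing, and development-set evaluation.

```python
def self_train(treebank, unlabeled, train_parser, parse_with_confidence, eval_dev,
               thresholds=(0.9, 0.8, 0.7), min_gain=1e-3, max_iters=10):
    """Iteratively grow the treebank with the parser's most confident trees on
    unlabeled target-domain data (both speech and text), retraining each round.
    The confidence threshold is tuned per iteration on a development set, and
    training stops once the dev-set gain is no longer significant."""
    parser = train_parser(treebank)
    best_dev = eval_dev(parser)
    for _ in range(max_iters):
        best_round = None
        for tau in thresholds:  # tune the confidence threshold on dev
            confident = [(sent, tree)
                         for sent, tree, conf in parse_with_confidence(parser, unlabeled)
                         if conf >= tau]
            candidate = train_parser(treebank + confident)
            score = eval_dev(candidate)
            if best_round is None or score > best_round[0]:
                best_round = (score, candidate, confident)
        score, candidate, confident = best_round
        if score - best_dev < min_gain:  # no significant improvement: stop
            break
        parser, best_dev = candidate, score
        treebank = treebank + confident
    return parser
```

In the talk the process converged after roughly three to four iterations; the thresholds and stopping tolerance above are illustrative defaults only.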
Looking at results, first the rescoring task. We’re looking at the two adaptation approaches: the self-training configurations are listed vertically, and augmenting those with the task-supervised adaptation gives us the no-task-supervision and task-supervision columns. The baseline is the word error rate of the ASR one-best; for word error rate, lower is better. What we find is that the language model training data alone doesn’t give us an improvement in self-training, and adapting the null arcs with the log-linear, or CRF, model doesn’t actually give us a win over not doing that; from our analysis this is likely due to overtraining. But when we do the self-training with both the text data from the language model training and the speech data, we actually get a significant win over not doing any self-training but using the parser, as well as a slight but consistent win over the ASR baseline. Looking at the OOV detection task, we again have a similar configuration, the various self-training configurations horizontally, and we’re using two measures to evaluate our systems. We use the more standard F-score measures, where higher is better, as well as a modified version of the word error rate with each OOV region replaced by a single OOV token. We do this to get a sense of whether we generate OOV regions that are too large, because if we do, then we’re going to increase the word error rate by removing words that should be there. Here again, for the starred word error rate, lower is better. What we find is that in this case all the parser configurations improve over the baseline of not using any parsing, just the structural confusion network features. The best results are obtained, again, with self-training using both unlabeled sets. The final results on our internal evaluation set are shown here. We’re comparing the baseline without any parsing, the baseline parser with no self-training or adaptation, and the best parser with self-training. What we find is that we get an improvement from the parser on the OOV detection task without self-training, but not on the rescoring. When we do the adaptation we get a win on the rescoring task as well. Again, the win is much larger over the base parser, but still consistent over the one-best ASR. We also looked at another dataset, which comes from the actual evaluation of the BOLT data. Here we have about thirteen hundred sentences. We’re comparing only the baseline system, which is no parsing, against the best parsing system. We’re not doing any tuning on this set; we’re just looking at how well we do with the two systems that were used in the evaluation. What we find is that the OOV detection F-score is actually significantly lower. This is because the data they used in the evaluation has a lot fewer OOVs than our internal data; the system was slightly overtrained on the wrong set. But by using the parsing we get a fairly significant improvement, where we can actually recover a lot of the performance degradation. Again, the performance improvement on the rescoring task is larger and consistent with what we saw before. To conclude, domain adaptation for parsing gives us quite good gains on both the error detection task and on rescoring, where the parser acts as a language model.
The best results are obtained when we adapt the parser to match both the vocabulary and the structure of the target domain data and use the log-linear model only for scoring error rules. Future work is going to look at modeling different error types jointly within the parser: not just OOVs or names, but all of those different types of errors in a single model. We also want to expand the log-linear model used for scoring rules to use not just local features but also larger-context or global features in a sentence. Thank you. [applause] >> Lucy: Questions? >>: Yeah. >> Alex Marin: Brian. >>: In your early example you had a proper name OOV that was then recognized as multiple in-vocabulary tokens. >> Alex Marin: Right. >>: How exactly does this get labeled in your [indiscernible] in your confusion network? Is it sort of OOV, OOV, OOV, and then do you try to eventually consolidate it, or is it null, null, OOV, or? >> Alex Marin: So, I should have had another example for this, but let’s pretend that this is that confusion network, and let’s say that Litanfeeth was this word here, so ignoring the OOV arcs you’d have Litanfeeth here. If you align, sorry, if you align the confusion network that’s generated with the references, right, let’s say that you have Litanfeeth as a reference arc here, then you have a null arc and then another null arc on the following slots. This allows us to mark each of the slots as labeled with an OOV. What we’d like to do is capture all three of them, or four, however many there were, as OOV. But then when we look at the final one-best through the confusion network we’d want to merge all those into a single OOV slot that combines those three OOV things. >>: You do that… >> Alex Marin: We do that, that’s how we score it. The F-score numbers that we reported are at this region level, not the slot level. We have slot-level results; the results are essentially similar, but we think the region-level scoring makes more sense. Jim? >>: In the self-training you choose the confident parses to add to the data. Is that using the PCFG tree likelihoods or is it the DNN [indiscernible]? >> Alex Marin: That is using the PCFG tree likelihood. It’s not just the PCFG, because we’re also using the log-linear model to score rules sometimes, but it’s the combination of all those things; we’re using the inside score of the entire tree. Other questions? >> Lucy: Any other questions? >>: On the first slide of the results, maybe I misunderstood it, but it seems to indicate that the conditional random field isn’t [indiscernible] or? >> Alex Marin: Yes, so what we found is that when we use the log-linear model, the CRF model, to score the null arc rules we don’t get a win, but we do get a win from the self-training. Doing this additional adaptation on top of the self-training did not help. >>: Do you have an explanation? >> Alex Marin: We think it’s due to overtraining on those particular rules. What happens is that if we do the log-linear adaptation as well as the self-training, the null arc rules end up getting a disproportionately high weight in the model. They end up over-generating the null arcs in the parses, so the parses end up eating up words that should have been there, and we end up with a lot more deletions. There could be various ways to mitigate this. One way would be to not score just the null arc rules with the log-linear model but to score everything.
But that is where we want to go with [indiscernible] with global features, because then we would actually be able to train on a lot more data, not just the null arc rules. >> Lucy: Okay, so I think it’s time for a break. We’re reconvening at eleven twenty. Let’s thank the speaker again. [applause]