
>> Will Lewis: Welcome everyone this is the third Pacific NW-NLP. It looks like we have a record
breaking crowd this year. I know people are still trickling in. A few things I do want to mention. If you
drove and parked be sure to register your car. I’d hate to have one of the highlights of today’s workshop
being someone having their car towed, which likely won’t happen, but just to be safe. The other thing: if
you’re talking today in the next session, be sure that we get your slides loaded onto our server here or
that you have your laptop checked. Please do that during the first break. For any of you talking later
today, the same kind of thing holds overall: we need to get your slides loaded.
There’s also a release form that needs to be signed. Basically it more or less says we’re going to be
broadcasting your talk. We’ll also be posting the talks to the NW-NLP website hosted at Simon Fraser. If
you don’t want your talks posted that is also an option, just you have to let us know today. Check with,
there’s some folks, the AV guys are on the other side of this door here you can check with them or check
with anyone up here during the break.
All of you should have a folder that basically has the proceedings and everything in here. If you have a
poster session you’ll notice that there’s a number assigned with your poster. The poster rooms are
actually down the hall so just look for your number, go down to those rooms during the break. I’d ask
you to go during the break to actually put your posters up, look for your number on the wall, and that’s
where your posters should go. There are two people here that are, or actually three people, I think one
person isn’t here yet. Three people that can help you with the posters, there’s Chin in the back, look for
him he’ll be in one of the rooms, and Meg in the back also she can help you too, if there’s any logistical
problems or whatever they’re the people to help you with that.
There’s a map also of our facility here. If you’re looking for where you know where the different rooms
are and everything. This is a map of our conference area here. I think you’ll have this down within a few
minutes. But still we put a map in there just in case. There’s also a campus map and I’ll bring this up
later. You all have received lunch cards. These are, who said there’s no free lunch by the way here’s
your lunch card.
[laughter]
Thanks [indiscernible] for saying that, he’s the one that, okay. The, I’ll mention where to go for lunch
and all that just before the lunch break. You have to kind of glom on to a Microsoft employee to take
you wherever it is that you want to go. We’ll organize that just before lunch.
Let’s see what else do we have? There is Wi-Fi access. It’s open just look for MSFTOpen I think it is.
MSGuest, oh it’s up on the wall, okay. These guys took care of it already I don’t need to worry about it,
okay. If you haven’t grabbed some food there’s food behind me.
Let’s see posters, talks, okay we talk about that. Let’s see I think I’ll go ahead and, was there anything
else? I think I’ll turn it over to [indiscernible] to actually. Oh, Yashar’s doing that, okay. Thank you.
Alex, oh you have it, okay, good, thanks.
>> Alex Marin: Hello, good morning everyone. Can you hear me? I think my mic is working, right, okay
perfect. Welcome to NW-NLP two thousand fourteen. We’re very excited to have you here. Well,
different groups actually put their hands together to organize this workshop: people from SFU,
[indiscernible], and [indiscernible] who is not here, [indiscernible]. People from UVC, people from
Microsoft, they really tried to organize a very good, [indiscernible] very good program. We tried to
have different talks, different papers to raise interest. I’m very happy that so many people
registered, and many people are excited to come here. We are also very excited to have you.
Without further delay I go to the introduction. We have twenty participants today from twenty-three
institutes. That’s totally changed from the last time NW-NLP was organized. There is certainly a lot
more interest that we can see in such a workshop. We are thankful to the people who made it here.
Since it seems we want to know each other better, and which institute you are from, I am going to read the
institutions. Please don’t be shy, raise your hand when I call your institution, so we know how many people
from each part of this region, or from outside this region, are coming to the
workshop today.
I know we have a lot of people from UW University of Washington. How many people are here from
UW? Wow, okay, shall I continue with the other institutions?
[laughter]
Great, a lot of people from UW. From Microsoft Research, which is a part of our organizing team, okay
so far three.
[laughter]
But I, four.
>>: They’re trickling in.
>> Alex Marin: Yeah, but I know, well, [indiscernible] told me that the first
thing you learn in teaching is that people don’t raise their hands. So I expect there are some other people
who saw it but didn’t raise their hand. But anyway, people from Microsoft come and go so they will join
us later. We have people from Expedia here, so, oh perfect, we have a good group of people there from
Expedia. We have Oregon Health and Science University, I know some people there from Oregon,
great to have you here.
From SFU Vancouver, perfect a good number of people from SFU. From Nuance, I’m excited to see
people from Nuance here, perfect. From Western Washington University, good number also from, so a
lot of people from Washington came here today. From Amazon, people from Amazon, okay famous
Amazon people are here so if you have any problem with your shopping or whatever you can ask them.
From UBC, okay we have a number of people from UBC. From Boeing, I was excited to see this
morning actually people from Boeing. From Pacific Northwest National Lab, okay people over there.
From University of California Santa Cruz, nice, thanks for coming over actually. From the Allen Institute for
AI, great, Peter Clark and colleagues are here. Okay, I finished this page, right, twenty-three institutions so a
lot of work. From Appen, people from Appen, okay there’s one person there, great. From Google,
okay, from Google in Microsoft.
[laughter]
We have from Genome, yeah one person from Genome from Vancouver came to make it. From Intel,
okay you have a good relation with Microsoft I know.
[laughter]
NWTA, not yet, Univera, okay one person at the back, from USC, no not yet. Point Inside, okay we have
one person from, thank you for making it to here, University of Buffalo, perfect one person here.
Okay, so we have different people from different institutions, twenty-three. I’m happy to see you all
here. We’re all happy for that. I am going to give you a very brief introduction about how many papers
we had, how many we selected. In total we had thirty-eight on-time submissions. Out of these thirty-eight there were twenty long papers and eighteen abstracts or extended abstracts, which were new work.
The organization was such that the long papers are recent work that has been published elsewhere,
in other NLP venues, while new work has been submitted as extended abstracts. We had one unusual
submission that I will tell you about later.
[laughter]
Okay, now is later. Okay the unusual submission it was interesting we have received an automatically
generated paper. Thanks to our program committee and reviewers we rejected the paper.
[laughter]
It means we check all the papers; whoever submits can see their reviews. Actually that made us kind
of excited, because NW-NLP is growing so much that people have started recognizing us and sending some
automatic papers, probably to test us and see whether they pass or not. We had an unusual, kind of, let’s call it automatic
paper submission. Yeah, so that’s all for now, and we have lots of talks coming up first.
>>: I just wanted to say one thing before we go on to the actual program. There are three papers
actually in the program today with Ben Taskar as a co-author. I just wanted to acknowledge that he you
know he recently, not so recent but he passed away at an early age. It’s unfortunate that he you know
we could have benefited from his long term input to this community. I just didn’t want to let that pass
because it may be one of the last places where his name appears in a program. There might be other
papers in the pipeline; I’m sure there will be. There have been lots of, of course he was so famous and so
influential that there’s been a lot of other places where his work has been acknowledged. But I just
wanted to say something here. I would like you to visit his, there’s a website for donations to his family
as well. I would like to, you can search for his name and it’s the second hit on most search engines. I
would encourage you to go and visit this. If you haven’t heard of, if you’re a new student you haven’t
heard of Ben Taskar you’re going to use his papers in your work, it’s almost guaranteed.
I think we have an exciting day of papers and posters. I think that first one is in five minutes, or?
>> Will Lewis: Eight minutes, yeah.
>>: Yeah, so should we let people…
>> Will Lewis: I think let people grab some more snacks and then we’ll start.
>>: Then back here in eight minutes, so don’t run away, come back in eight minutes.
[laughter]
>> Will Lewis: Okay, so Lucy is going to be chairing the first session and she’ll be introducing the
speakers.
>> Lucy: Good morning everyone. Our first speaker is Kenton Lee. He’ll be presenting Context-dependent
Semantic Parsing for Time Expressions, a paper that will be presented at ACL twenty fourteen. Let’s start, alright.
>> Kenton Lee: Thank you. Good morning I’m Kenton Lee a Grad Student from the University of
Washington. Today I’ll present my project on Context-dependent Semantic Parsing for Time
Expressions. This is joint work with Yoav Artzi, Jesse Dodge, and Luke Zettlemoyer.
To give you some motivation for the task let’s look at an example document. We see here that Biden
will attend a lunch with Democrats on Thursday while Obama will serve as the retreat’s closing act
Friday. We see here that there’s a lot of temporal information that we want to extract from the text. In
the long term we might want to extract all events from the text into a timeline. For this project we focus
on a specific part of this task which is just extracting the time expressions from the documents. In this
case we would want to find and understand the time expressions such as next week, Thursday, Tuesday
morning, etcetera.
We can think of this as two tasks. The first task is detection where we find all mentions of the time
expressions. In the second task we take these detected mentions and we resolve each to a time value in
a standardized format. To be concrete about the task, we can represent things as dates, durations, and
approximate times.
Notice that in the last three examples Thursday, two days earlier, and that week. In isolation these are
underspecified and we will need to incorporate some kind of context to understand what they mean. By
context I mean the text that is beyond the mentioned boundary. This is one of several challenges that
we identify when we try to do this task. I’ll go through those challenges in detail.
First here’s an outline of my task, of the talk. I just described the task. Then I’ll talk about some
challenges that we’ll have to face and our approach for how to address each of these challenges. Lastly
I’ll show some results on experiments comparing our approach to existing state of the art systems.
As I previously mentioned the first challenge is Context-dependence. We see in this example the time
expression Friday within two contexts. We can think of Friday in isolation to mean a sequence of all
Fridays. To choose the correct Friday out of all of these Fridays we will have to know two pieces of
information from context. First is we need to know when this was said or written. Second we need to
rely on contextual cues such as the verb tense to determine whether we want to choose a Friday
occurring after the document time or before the document time.
For the second challenge we see that time expressions are often compositional. Say in this example of a
week after tomorrow we need to first understand the meaning of tomorrow. Then we have to
understand how long a week is. Then we have to be able to combine these meanings to form the final
meaning of the full phrase.
Lastly this challenge is related to the task of detection. We see a lot of phrases can be temporal or
non-temporal depending on their context. In this example two thousand is a time expression in the context of
she was hired in two thousand, whereas it’s not in the context of they spent two thousand dollars.
Next I’ll go through how we approach the task and how our system addresses each of these challenges.
We decompose this task into two steps. First is we take the input documents and we detect where all
the time expressions occur. The second step which is resolution we take each of these mentions and
resolve them to a time value. A key component of this approach is that we define a temporal grammar
that is used in both the detection step and the resolution step. Using this temporal grammar we can do
semantic parsing which will give us a formal meaning representation of time to work with. This will
allow us to solve the issue of compositionality in time expressions.
Let’s look at how we define this temporal grammar. We choose to use a combinatory categorial
grammar. We can think of this as a function from a phrase to a set of formal meaning representations in
the form of logical expressions. We choose to use CCG because it’s a very well studied formalism. We
know of a lot of existing algorithms that we can reuse to solve this task.
This is an example of CCG parse. Don’t worry about the details here. But the take away of the slide is
that at the very top we have the input phrase and we have lexical entries that pair words with their
meanings. We compose these meanings to retrieve the final logical expression at the bottom, which is
the output.
A nice property of the domain of time expressions is that the vocabulary is relatively closed-class. It’s
easy for us to manually design these lexical entries at the top. That’s exactly what we do for our
temporal grammar. We engineer a lexicon. We include in this lexicon things like units of time, named
times, and function words. That was how we define our temporal grammar. I’ll talk about how we use
this grammar in both the detection and resolution steps.
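To make the lexicon idea concrete, here is a minimal sketch in Python of what a hand-engineered temporal lexicon and one resolution helper could look like. The entries, the representation, and the resolve_named_day helper are illustrative stand-ins, not the actual UWTime grammar.

```python
from datetime import date, timedelta

# Illustrative lexicon: each entry pairs a phrase with a simple meaning.
# Units of time, named times, and function words, roughly as described above.
LEXICON = {
    # units of time: a duration in days (approximate for month/year)
    "day": 1, "week": 7, "month": 30, "year": 365,
    # named times: day-of-week index (Monday = 0)
    "monday": 0, "tuesday": 1, "wednesday": 2, "thursday": 3, "friday": 4,
    # function words map to composition operations (handled by the parser)
    "after": "SHIFT_FORWARD", "before": "SHIFT_BACKWARD",
}

def resolve_named_day(name: str, reference: date) -> date:
    """Return the last occurrence of a named weekday strictly before the reference date."""
    target = LEXICON[name.lower()]
    delta = (reference.weekday() - target) % 7 or 7
    return reference - timedelta(days=delta)

if __name__ == "__main__":
    # "Friday" anchored to a document written on a Tuesday
    print(resolve_named_day("friday", date(2014, 5, 6)))  # -> 2014-05-02
```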
Let’s start with detection. To recap we are taking a document and detecting where all time expressions
occur in the detection component, and the way we approach this is we parse all possible spans of text
using our temporal grammar. For any span that belongs to this temporal grammar we consider it as a
candidate time expression. The way we design the grammar, it has very high coverage at the cost of
overgeneration, so on top of this we have to use a linear classifier to filter out any false positives.
To illustrate this process let’s look at four examples: twenty-four hours, we’ll have lunch, and two
thousand in their respective contexts. The first thing we do is we use our temporal grammar to give us a
set of logical expressions. In the case where the set is empty such as in a case of we’ll have lunch we
assume that this is not a time expression. Then for the cases where the set is not empty we look at, we
use our linear classifier to make a final decision for whether or not to reject this as a time expression,
which we do in the case of they spent two thousand dollars.
For our linear classifier we use a logistic regression model with L1 regularization. For features we
include syntactic features, lexical features, and indicators for tokens that are particularly discriminative
such as ten prepositions near the phrase like until, in, during, etcetera. That was how we use our
temporal grammar to do detection.
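As a rough illustration of the kind of false-positive filter described above, here is a sketch using scikit-learn's L1-regularised logistic regression. The toy candidate spans and features are made up for illustration and are not the paper's feature set.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data: candidate spans the grammar parsed, with hand-made
# features (the real system uses syntactic/lexical features and nearby
# prepositions such as "until", "in", "during").
candidates = [
    {"span": "twenty-four hours", "prev_word": "for",   "has_digit_word": True},
    {"span": "two thousand",      "prev_word": "in",    "has_digit_word": True},
    {"span": "two thousand",      "prev_word": "spent", "has_digit_word": True},
    {"span": "friday",            "prev_word": "on",    "has_digit_word": False},
]
labels = [1, 1, 0, 1]  # 1 = real time expression, 0 = false positive

vec = DictVectorizer()
X = vec.fit_transform(candidates)

# L1-regularised logistic regression, as in the talk; C controls sparsity.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
clf.fit(X, labels)

test = {"span": "two thousand", "prev_word": "hired", "has_digit_word": True}
print(clf.predict_proba(vec.transform([test]))[0, 1])  # probability it is temporal
```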
Now I’ll talk about how we also use it to do resolution. To recap, the task of resolution is to take detected
mentions and resolve each of them in their contexts to particular time values.
To resolve an input phrase we define a three step process which we call a derivation. In the first step we
use our temporal grammar to give us logical expressions that are independent of context. Then in order
to incorporate contextual information we transform these logical expressions using what we call context
dependent operations. Lastly we have a deterministic process that resolves each of these to a specific
time value. Because for every input phrase there can be multiple derivations, we reason over all of
these steps using a log-linear model that will allow us to score the derivations and choose the most
likely one.
We can go through an example of how we do this. This is the time expression Friday in the context of
John arrived on Friday. The temporal grammar gives us the context-independent logical expression Friday,
representing the sequence of all Fridays. We have to choose one particular Friday from the sequence.
We apply one of three context-dependent operations allowing us to choose the last Friday, this
Friday, or next Friday relative to some reference time. But these logical expressions are still
underspecified; we still have to decide where this reference time is coming from.
We have a second step. We can replace this reference time with either the document time or another
time expression that has, that was occurring before in the document that we’ve already resolved. For
each of these final logical expressions we can resolve them to produce a time value. With our log-linear
model we can choose the most likely interpretation which is the last Friday relative to the document
time.
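The following sketch enumerates the kind of candidate derivations described above, crossing the three context-dependent operations with two possible reference times. The dates, the weekday_near helper, and the selection comment are illustrative assumptions, not the system's actual code.

```python
from datetime import date, timedelta

def weekday_near(reference: date, target: int, direction: str) -> date:
    """Pick the last/this/next occurrence of a weekday relative to a reference date."""
    offset = (target - reference.weekday()) % 7
    if direction == "next":
        return reference + timedelta(days=offset or 7)
    if direction == "last":
        return reference - timedelta(days=(-offset) % 7 or 7)
    return reference + timedelta(days=offset)          # "this" week

FRIDAY = 4
document_time = date(2014, 2, 7)      # hypothetical document creation date
previous_mention = date(2014, 2, 3)   # an earlier, already-resolved time

# Enumerate candidate derivations: operation x reference time.
candidates = [
    (op, ref_name, weekday_near(ref, FRIDAY, op))
    for op in ("last", "this", "next")
    for ref_name, ref in (("doc_time", document_time),
                          ("prev_mention", previous_mention))
]
for op, ref_name, value in candidates:
    print(f"{op:5s} Friday w.r.t. {ref_name:12s} -> {value}")
# A log-linear model over features (verb tense, reference type, ...) would
# score these candidates and keep the most likely one, e.g. "last Friday
# relative to the document time" for past-tense "John arrived on Friday".
```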
As I previously mentioned, for the log-linear model we do parameter estimation using AdaGrad. For
features we include things like the part of speech of the governing verb, which will give us the
corresponding tense. We also include the type of the reference time, which can either be the document
time or a previous time. Lastly we also include the type of the time expression such as the day of the
week, day of the month, year, etcetera. I just gave an overview of our approach, which is to use the
hand-engineered temporal grammar to detect and resolve time expressions.
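For readers unfamiliar with AdaGrad, here is a generic sketch of the per-feature update applied to a toy log-linear objective. The feature vectors and learning rate are invented for illustration and do not correspond to the system's real features.

```python
import numpy as np

def adagrad_step(theta, grad, accum, lr=0.1, eps=1e-8):
    """One AdaGrad update: per-feature learning rates shrink with the
    accumulated squared gradient, which suits sparse NLP features."""
    accum += grad ** 2
    theta -= lr * grad / (np.sqrt(accum) + eps)
    return theta, accum

# Toy log-linear training loop: push up the score of the gold derivation
# and push down the expected score under the model.
theta = np.zeros(3)
accum = np.zeros(3)
gold_features = np.array([1.0, 0.0, 1.0])        # e.g. (past tense, prev ref, day-of-week)
candidate_feats = np.array([[1.0, 0.0, 1.0],     # gold derivation
                            [0.0, 1.0, 1.0]])    # competing derivation

for _ in range(50):
    scores = candidate_feats @ theta
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    expected = probs @ candidate_feats
    grad = -(gold_features - expected)           # negative log-likelihood gradient
    theta, accum = adagrad_step(theta, grad, accum)

print(theta)  # weights now prefer the gold derivation's features
```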
Now we’ll talk about some related work and some experiments we do to compare our approach with
previous state of the art systems. We mainly compare ourselves to HeidelTime which is state of the art
in doing the full end to end task of both detection and resolution. Our approach is most similar to SCFG
in parsing time which used semantic parsing to resolve given time expressions. We evaluate the
systems over two corpora, one is TempEval-3 which consists of Newswire text and the other is WikiWars
which consists of history articles.
The way we evaluate this is for detection we consider a predicted mention to be a true positive if it
overlaps with some gold mention. For the resolution task the predicted time value has to be both
correctly detected and resolved. This gives us two sets of precision, recall, and F1 values.
Here are the results evaluating over the two corpora. On the left we see F1 scores for detection and on
the right we see F1 scores for resolution. We show an improvement across the board. For the end to
end tasks we see an improvement in TempEval of twenty-one percent, error reduction of twenty-one
percent. In the [indiscernible] WikiWars we see an error reduction of thirteen percent.
An advantage of our system is that we can trade off between precision and recall because we produce
confidence values. This is very useful for downstream applications who might want to rely on different
points of this purple curve.
Lastly we wanted to know how important context was for our system. We ablated our system’s
ability to refer to information beyond the mention boundary. We find that the degradation
in performance is much more severe in WikiWars compared to TempEval-3. This is because in news
things generally occur near the document time, whereas in history articles we have these long complex
narratives where we have to model context in order to properly resolve the time expressions.
To sum up, we produced an approach for extracting time expressions that is state of the art. We
are the first to use semantic parsing to do detection. We also are able to jointly learn a semantic parser
and how we can use context to understand time expressions. For future work I hope to be able to
jointly model both times and events as mentioned during the motivation.
This is just a slide for advertisement. We are releasing this tool which we called UWTime. This is
implemented using the UW Semantic Parsing Framework. It will be available soon on my website, so
look out for that.
Thank you.
[applause]
Questions? Yes.
>>: You had examples like somebody would arrive on Friday, okay. You tried to handle things like
yesterday John said that Mary would arrive on Friday. I mean how complex do you go in your analysis,
how complex of context will you analyze?
>> Kenton Lee: We use a very simple model: when we look backwards towards previously resolved
mentions we just look at the previous thing. In your example of…
>>: Yesterday John said that Mary would arrive on Friday.
>> Kenton Lee: Right, so we would look as far as yesterday and nothing before it.
>>: Okay, so to determine when Friday is you do pay attention to yesterday?
>> Kenton Lee: Yes we do. We anchor this Friday relative to yesterday as one of the choices during
resolution.
>>: Okay, so you look for all of your time expressions in the utterance and then try to relate, determine
how one is related to the other?
>> Kenton Lee: We don’t look at all of them we just look at the last one that was resolved.
>>: Last resolved, okay.
>> Kenton Lee: Potentially we could extend the model to make it, to have it look at more than just the
last thing. But we started with this.
>> Lucy: Other questions?
>>: I was wondering how good the parser is at finding, do you have detection results that you can also
take a statistical parser and then convert it into a CCG derivation? I was wondering how good is it at
finding sort of the boundaries?
>> Kenton Lee: Not sure this answers your question. But when we look at how high the coverage is we
can see that our parser produces a correct match ninety-six percent of the time on a development set.
We, it’s very feasible for us to do detection using our parser. Does that answer the question?
>>: Yeah.
>>: You mentioned that you hand code your initial data. Is there any way to speed up that process? I
mean I’m wondering if there are language resources out there that you could make use of to at least
assist in that phase. Like for example taking a language data that’s marked up with time ML or other
time representation languages as a boost.
>> Kenton Lee: Right, this is definitely a logical next step to this project is just to automate this task of
developing a lexicon, which would be great for also doing this in other languages where we don’t want
to necessarily engineer multiple lexicons.
>> Lucy: Additional questions?
>>: I have a question about the complexity of these time references, so you can kind of use the whole
power of language to refer to time per se. One year after Picasso painted his [indiscernible], so where do
you stop?
>> Kenton Lee: We stop at events. We don’t actually, so if in your example a year after some event we
can’t really handle this in our system. We hopefully will do some future work. But we are able to
handle things that anchor on other time expressions, right.
>>: Well, sometimes you have a combination of all [indiscernible] which will harness not only the time
reference, revolution but also the [indiscernible] reference. For instance you know John arrived at that
Friday. That Friday, yeah, which Friday? Do you handle those kinds of [indiscernible] as well?
>> Kenton Lee: Right, our model of context is actually designed to handle these cases where you just
talked about some Friday. When you refer to that Friday we’re able to look backwards at those
previously mentioned time expressions.
>>: I see.
>>: You are marking temporal events, right. Let’s say there is a sentence that says the conference is on
this day [indiscernible]. In which case, do you resolve, you first resolve Thursday and then you resolve
Friday. But the day before the conference would you go back and do a back [indiscernible]?
>> Kenton Lee: We actually don’t build this kind of reasoning. Like we don’t think about the conference
at all, so we just look at Thursday and Friday, and what they mean independently. Yes?
>> Lucy: Question?
>>: [inaudible] as use. I’m curious what your example is before next Friday or it’s just for Friday? Is that
resolved to the Friday being referenced or [indiscernible]?
>> Kenton Lee: In the case of before that is actually part of the time expression itself.
>>: Right.
>> Kenton Lee: In the annotation we use you can actually mark a time expression with modifiers. That’s
how we mark that it’s before a particular time.
>>: That’s, and they do this before or is it, does it, how does it resolve for a particular point on a time
line?
>> Kenton Lee: I think this is a limitation of the annotation we use. It doesn’t have a very good
representation for the ambiguous cases unfortunately. Yes?
>> Lucy: One last question.
>> Kenton Lee: Go ahead.
>>: Adding on to what she said when you said this will happen before that, so future from now and
prior to the following Friday you get that range through there, or you just get [indiscernible]?
>> Kenton Lee: We have a reference template, so it’s before a particular, so sorry you’re saying the
difference between…
>>: First there’s, I mean enacting reality you’re saying that sometime between now and next Friday you
say this will happen before next Friday. Do you get a rate of…
>> Kenton Lee: I see.
>>: Because stream on both sides are just saying before Friday.
>> Kenton Lee: We do not. We only know that it’s before that Friday.
>> Lucy: Let’s thank the speaker again.
[applause]
Our next speaker is Yashar Mehdad. He’ll be speaking about the Query Based Abstractive
Summarization of Conversations. The word abstractive is very interesting to me. This paper will be
presented at ACL two thousand fourteen.
>> Yashar Mehdad: We’re working on the Abstractive Summarization of Conversations. This talk will be
about that, specifically about a query-based application of abstractive summarization. Basically, why
are we interested in conversations? There are a lot of conversations generated every day in our life,
every day generated on the internet.
You can find many friends, many blogs, many social media websites that users generate. The time
when users were passive, passive users of the internet, has passed. Nowadays people go and generate
content. You can see based on these figures how things changed from two thousand six to two thousand
twelve and how much data we are generating every day.
If we go to two thousand fifteen, which is just around the corner next year, we’re going to have big, big data,
about eight point five billion terabytes, that we have to deal with. What do we do if we want to go through
that data? If we want to understand it, if we want to search it, if we want to really
have a look at each part and see what it’s talking about? Of course, going through all this data
and digging into each piece gives us a big problem of information overload.
One of the ways that we can approach or deal with such a problem is summarization,
automatic summarization. Given a conversation, we can have a system automatically summarize that
conversation. In automatic summarization actually we have different areas of work. We have extractive
versus abstractive. We have generic versus query-based.
What do we mean by extractive and abstractive? Well, extractive has been in the area of
summarization for a long time: people used to extract some sentences from the original document and say,
okay, these are the significant sentences in this document. We collect them, give them to the user,
and say okay, this is a summary of this document. But that’s not how a human does it;
a human reads the document, understands it, and then tries to generate a text, right, write a text as a
summary.
That is what we want to do, which is called abstractive. Why? Because a lot of research has actually
shown that users prefer abstractive summarization. Actually, that’s what a human does, so
humans prefer abstractive summarization. What are generic and query-based? Okay, in the generic one you
try to summarize a text and present the meaning of the whole text to the user. In
query-based, the summary is generated based on a question, based on a query that the user actually asks
about it. A conversation or document can talk about different things; when a user
has some questions they can ask, and the summary will be generated based on that question.
Our work will be mainly on abstractive query-based summarization of conversations. I’m going to talk
about work which is kind of similar to the previous work of Wang et al., but with some differences that I’ll
tell you about. Well, many query-based summarization systems since two thousand four started working on
the DUC query-based multi-document news summarization dataset, which was mainly news. As
you know, news is, you know, very well structured, very well polished. It’s kind of clean text, without
that much of the noise of real user-generated content or conversation on the internet.
They have some queries which are kind of complex questions. They are looking for specific information
needs. These questions are actually written by expert annotators. For example, one question or
query they can have is "How were the bombings of the US embassies in Kenya and Tanzania conducted?"
It’s a very specific information need. But when we want to ask for something, do we always
know exactly what information we are looking for? When we’re dealing with conversations, do we
always generate such complex questions?
What kind of queries are we looking at? What kind of queries are we talking about? Let’s say that you
are going through a series of reviews of a ParaDoc and you want to know about a certain aspect like the camera
and lens. This query can be only one word or can actually be a phrase, like for example "the new
model of an iPhone", right. This is more of a kind of phrase-based query, which has not really been looked at
much in the past.
Then if you want to compare these two: we have some specific information needs in complex queries.
These need a precise kind of formulation of a question or query, and they are very well
structured, as we had in the previous datasets. Phrasal queries, on the other hand, are less specific in terms
of information need. At the same time they are used very much for exploring different texts, you know, or
different summaries. They are less topically focused so they could be more general. They are less
structured; they have less context. In this work we are focusing for the first time on phrasal queries.
We define a phrasal query as a concatenation of two or more keywords. We believe that this is more
relevant for the conversational data we are focusing on.
Of course in this work we are facing many challenges. One of the challenges is that we’re dealing with
conversational data. You all know what challenges we have in conversational data: we have noise; we
have less structure because those texts are not edited, just produced; we have many acronyms, many
problems. We have to deal with phrasal queries that don’t have that much context. They are
also less structured, so that limits our choices as well. We are trying to produce abstractive summaries, so we
have to generate language. It’s not as easy as just extracting some sentences from the text or from
the conversation. Of course, since it’s new work we don’t have much annotated data. We have very
limited annotated data, so that motivates us to go for a more unsupervised kind of approach.
Following the challenges we have the following contributions. In this work we propose the first
abstractive summarization system that is based on phrasal queries. We have essentially a kind of
unsupervised model. Our system works across different conversational domains, and we are not
focusing only on one.
If you look at this framework at a glance, you can see our system has three different parts.
The first part of the system tries to do what an extractive system does: try to find the significant
sentences and extract them. Then we go to the next step, where we filter those sentences, because
there are many redundant ones, and many that are not really significant or are less informative than
others. We go to the last step which is the abstract generation, the language generation we’re talking
about in abstractive summarization. Those are the three parts that I am going to talk about in more detail.
This is our framework. I’ll walk you through each phase one by one. For utterance extraction the aim is
to extract the utterances from conversations that are, you know, important for us. What do we have to
fulfill? Two things: first of all, the sentence should carry the main content of the conversation; at the
same time, it should be related to the query posed by the user. For
each one we extract some terms. The first we call signature terms, which show the important terms
or the important topics that are discussed in the whole conversation. We try to extract them using a
log-likelihood ratio test with associated weights; this has actually been proved to be effective in previous
work. For the query terms we try to extract the content words of each query. We try to expand them
using some knowledge; in this case we use WordNet synonym relations to expand our queries.
We have a set of words. Now we want to score each utterance based on the different terms that we have
here. We have a query score, that is, how relevant the sentence is to the query, and a signature term
score. Then we can combine them using, you know, a linear combination with some coefficients that
can be tuned.
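A minimal sketch of this scoring step, assuming NLTK's WordNet interface for synonym expansion; the weighting, normalisation, and example terms here are illustrative and not the paper's exact formulation.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

def expand_query(terms):
    """Expand query content words with WordNet synonyms, as described above."""
    expanded = set(terms)
    for term in terms:
        for synset in wn.synsets(term):
            expanded.update(lemma.name().lower() for lemma in synset.lemmas())
    return expanded

def score_utterance(tokens, query_terms, signature_terms, lam=0.5):
    """Linear combination of query relevance and signature-term coverage."""
    tokens = set(t.lower() for t in tokens)
    q = len(tokens & query_terms) / max(len(query_terms), 1)
    s = len(tokens & signature_terms) / max(len(signature_terms), 1)
    return lam * q + (1.0 - lam) * s

query = expand_query(["camera", "lens"])
signature = {"iphone", "camera", "battery", "screen"}   # e.g. from a log-likelihood ratio test
utterance = "the new iphone camera takes great photos in low light".split()
print(score_utterance(utterance, query, signature))
```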
Then we go to redundancy removal. For the first time, actually, we thought that we are not going to
use MMR, you know, as a kind of very popular method for redundancy removal. We are using semantic
relations, or entailment relations, in our framework. To recap entailment relations: entailment is a
directional relation we have between two texts. We say one text entails another if, reading the first text,
you can infer the second one, or you can say the second one is true based on the first one. I have one
example here: the technological term known as GPS was incubated in the mind of Ivan Getting, and
then Ivan Getting invented the GPS, which is a true case of entailment. How do we use that entailment?
We try to train an entailment model using features and previous datasets with an SVM classifier.
Then how can we use that? Well, let’s say that we have utterances, right. We try to compose an
entailment graph over the extracted utterances. We try to label the relations between each pair of
utterances or sentences as unidirectional entailment, bidirectional entailment, or unknown. If two
sentences entail each other in both directions they are bidirectional entailments; they are kind of
semantically equivalent. If the entailment goes only one way it means that, for example, sentence
or utterance C is kind of more informative than A and B. Then the others could be unknown. We use that
information and then we filter some of the utterances.
How? We say that if we have semantically equivalent sentences, one of them can stay in; or if we have
some sentences that are more informative than others, then the more informative sentences are more
relevant to our summarization framework. We keep them and then prune the others. As you see, out of
seven sentences, after running the entailment graph we can have four sentences at the end.
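A toy sketch of the entailment-graph pruning idea; the entails stub here uses simple word containment in place of the trained SVM entailment classifier, so it only illustrates the graph logic, not the real model.

```python
from itertools import combinations

def entails(a: str, b: str) -> bool:
    """Stub for the trained SVM entailment classifier: here, plain word
    containment stands in for real textual-entailment features."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return wb <= wa

def prune_with_entailment(utterances):
    keep = set(range(len(utterances)))
    for i, j in combinations(range(len(utterances)), 2):
        fwd, bwd = entails(utterances[i], utterances[j]), entails(utterances[j], utterances[i])
        if fwd and bwd:          # bidirectional: semantically equivalent, keep one
            keep.discard(j)
        elif fwd:                # i is more informative than j
            keep.discard(j)
        elif bwd:                # j is more informative than i
            keep.discard(i)
        # unknown: keep both
    return [utterances[k] for k in sorted(keep)]

sentences = [
    "Ivan Getting invented the GPS system in the fifties",
    "Ivan Getting invented the GPS",
    "the meeting starts at noon",
]
print(prune_with_entailment(sentences))  # the less informative GPS sentence is dropped
```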
The filtered sentences bring us to the next stage, which is the abstract generation. The abstract
generation is actually composed of three parts. The first part is the clustering. We have a list of
utterances and we want to cluster the utterances into different groups. We want to use lexical clustering,
why, because that also helps us in the next stages. We use a simple clustering algorithm, K-means, with
cosine similarity over tf.idf scores to cluster. We now have a set of clusters, each cluster with different
sentences.
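A minimal sketch of this clustering step with scikit-learn; since tf-idf vectors are L2-normalised, Euclidean k-means on them approximates clustering by cosine similarity. The example utterances and cluster count are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

utterances = [
    "the new iphone camera is great",
    "i love the camera on this phone",
    "battery life could be much better",
    "the battery drains too fast",
]

# Tf-idf vectors are L2-normalised by default, so Euclidean k-means on them
# behaves like clustering by cosine similarity, as described in the talk.
X = TfidfVectorizer().fit_transform(utterances)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

for label, text in zip(km.labels_, utterances):
    print(label, text)
```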
For each cluster, in the next step, what do we want to do? We want to merge and fuse the sentences of
each cluster and generate one sentence out of them. To do that we propose using a version of the
word graph model. Why do we do that? Why a word graph? Why are we not going to use any specific
natural language generation or sophisticated syntactic approaches? Because we are dealing with noisy
conversations, and we really cannot go deep into syntactic and structural analysis.
The word graph model was previously introduced by Filippova. We extended a new version of that with
some modifications; we use some semantic relations between the words. We try to generate our
graph on each cluster in this way. We have a start and an end for each utterance. Then each time we add
a new sentence to the graph. If the nodes are the same we merge them. If they are synonyms we merge
them. If they are connected by hyper [indiscernible] relations we kind of merge them using that, using WordNet.
Then we can see that we have a graph from start to end.
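A simplified sketch of word-graph construction over one cluster; it merges only identical surface words and omits the WordNet-based synonym and hypernym merging, so it is an illustration of the data structure rather than the full model.

```python
from collections import defaultdict

def build_word_graph(sentences):
    """Build a word graph: shared surface words are merged into one node;
    WordNet-based synonym/hypernym merging is omitted in this sketch."""
    edges = defaultdict(set)
    for sent in sentences:
        tokens = ["<START>"] + sent.lower().split() + ["<END>"]
        for left, right in zip(tokens, tokens[1:]):
            edges[left].add(right)
    return edges

cluster = [
    "the camera takes great photos",
    "the camera takes sharp photos in low light",
]
graph = build_word_graph(cluster)
for node in sorted(graph):
    print(f"{node} -> {sorted(graph[node])}")
# Any <START> ... <END> path through the merged graph is a candidate fusion
# of the cluster's sentences.
```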
Then out of this graph, once we have the word graph on each cluster, our job is to find the best path
from start to end. How do we do that? We go to path ranking. First we prune the paths with no verb,
because they are definitely not grammatically correct for us. Then we score the remaining paths on
different criteria: query focus, readability using a language model, and the path weight; you can go to
the paper for details of the formula, or you can ask me later on. We add them up together, which gives
us the score for each path, and we pick the top path selected by these as our abstract for each cluster.
Each cluster produces one abstract sentence, and all abstract sentences together are the generated
summary.
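A toy sketch of the path-ranking step; the query-focus, readability, and path-weight terms below are simple stand-ins for the formulas in the paper, and the example paths and counts are invented.

```python
def rank_paths(paths, query_terms, verbs, bigram_counts):
    """Score candidate word-graph paths; all three terms are simple stand-ins
    for the query-focus, language-model, and path-weight scores in the paper."""
    scored = []
    for path in paths:
        words = set(path)
        if not words & verbs:                       # prune paths with no verb
            continue
        query_focus = len(words & query_terms)
        readability = sum(bigram_counts.get((a, b), 0)
                          for a, b in zip(path, path[1:]))
        path_weight = 1.0 / len(path)               # prefer shorter fusions
        scored.append((query_focus + readability + path_weight, path))
    return sorted(scored, reverse=True)

paths = [
    ["the", "camera", "takes", "great", "photos"],
    ["the", "camera", "great", "photos"],           # no verb: pruned
]
bigrams = {("camera", "takes"): 3, ("takes", "great"): 2, ("great", "photos"): 5}
best = rank_paths(paths, query_terms={"camera"}, verbs={"takes"}, bigram_counts=bigrams)
print(best[0][1])   # the top-ranked path becomes the cluster's abstract sentence
```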
For the experiments, we have experimented on different datasets. For chat logs we have some
query-based summaries, so we use them for our automatic evaluation. We also use meeting
transcripts and email threads; for those we didn’t have any, so we had to go for manual studies. I am
going to talk about the results very briefly. We can see that our abstractive summarization, using ROUGE-1
F-1 score, is actually outperforming all other baselines. ROUGE-2 actually does not perform that
well; the main reason is that in the word graph generation we sometimes change some words, so in
the bigram kind of matching that score will be lower for the abstractive output. We can see that our first
phase, which is utterance extraction, still outperformed other extractive models. Then we can see that
the previous Biased LexRank, which is a query-based summarization system, was not performing as well as
our system for conversational data.
We ran a user study for our manual results again. We find that users in the manual evaluation
preferred our system sixty to seventy percent of the time; our system outperformed the
other baseline. Again, in the manual study grammatical correctness was quite good, acceptable
grammatical correctness sixty to seventy percent, while for the meetings it was lower because the
meeting transcripts, coming from automatic transcription, were noisy; the original meeting
transcriptions themselves were only about fifty percent correct.
In conclusion, I presented abstractive summarization using phrasal queries. You know, the first phase was
mainly a kind of model for extraction. Then we integrated semantics for the next phase. Then we
introduced a word graph model with a ranking strategy using minimal syntax. We got very
promising results over various conversational datasets.
Then for future work we are thinking of incorporating more conversational features like speaker
information and speech acts, generating more coherent abstracts, as well as resolving some coreferences. I invite you to come to our posters as well and also look at a demo from our NLP group at UBC.
Thank you very much.
[applause]
>> Lucy: Alex could you come and setup. We’ll have one question. I know I have many questions but
you know we have the whole day to ask Yashar and his co-authors. But Pete did you want to ask a
question, quickly?
>>: Yeah, this is very interesting.
>> Yashar Mehdad: Thank you.
>>: [indiscernible] can see that your word graph ideal path is grammatical?
>> Yashar Mehdad: Is…
>>: The ideal path you defined with the word graph does it always produce a grammatical sentence?
>> Yashar Mehdad: Not always.
>>: Okay.
>> Yashar Mehdad: That’s why actually we check the grammaticality. We included that in the
results, so I can show you later; it depends on the dataset, and it depends on the nature of the data. For
example, for many transcripts there are more errors, but for others it’s less, so we could get seventy to
eighty percent of the generated paths correct in terms of grammatical correctness. That’s very good for
an abstractive summarization system actually.
>>: Was that a constraint that it’s grammatical by checking it [indiscernible]?
>> Yashar Mehdad: Yeah, exactly.
>>: Yeah.
>> Yashar Mehdad: Exactly, but still we check the grammatical [indiscernible] through a language
model. That is also a good filter, right, because in the path ranking you have the language model and the
weights at the same time, so the weights also check kind of how bigrams and trigrams are connected. That’s
also a good grammaticality metric, let’s say.
>> Lucy: Let’s thank the speaker again.
[applause]
>> Yashar Mehdad: Thank you very much. I will answer all other questions in the break. Thank you.
>> Lucy: Yes, okay, so our last speaker of this session is Alex Marin, or am I pronouncing the last name
right?
>> Alex Marin: Yep.
>> Lucy: Okay, talking about Domain Adaptation for Parsing in ASR. This work will be presented at
ICASSP.
>> Alex Marin: Yeah, thank you. This is work with my advisor Mari Ostendorf at the University of
Washington on Domain Adaptation for Parsing and Speech Recognition. Speech recognition has been
used in many applications over the years, from call center applications to, more recently, voice search and
personal assistants such as Cortana, or Siri, or Google Now.
We’re working on a similar but not exactly the same system. We’re working on a speech to speech
translation application with a [indiscernible] system to resolve ASR errors by interacting with a user,
asking clarification questions whenever something is unclear to the system. In particular we’re looking
at doing correction of errors automatically when we can, so that we only ask about the other errors that
we cannot correct automatically. In particular we’re focusing on out-of-vocabulary words in this talk.
But we’ve looked at other kinds of errors as well. To get a sense of what kind of errors you might see
there’s a couple of examples here. For example, here Litanfeeth is an OOV, but we see that the error
region extends to the neighboring words. What we’d like to do is correct at, instead of having the
incorrect word it, and then mark the rest of the error region as an OOV, so the system could ask to have
it replaced with a different word or have it spelled if it’s a name, etcetera.
Similarly here, reframe is an OOV. It is replaced by a different word, so we’d like to mark this as being an
out-of-vocabulary word and then remove this filled pause because it doesn’t add any meaning to the
sentence. The way we do this is by looking at the ASR output and using a classifier with confidence cues.
But we also integrate information from parsing.
In particular what we want to do is model the syntactic anomalies in error regions. We do this by
working with confusion networks. This is a different approach from what other people have done.
People have used lattices and n-best lists before. We’re working with a full confusion network structure
and adding error, or in this case OOV, arcs on each slot of the confusion network. Thus we have to
have a parser which can handle these arcs as well as the ASR insertions, or null arcs, which have to be
introduced as part of the confusion network generation.
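A minimal sketch of the confusion-network-with-OOV-arcs data structure being described; the arc weights, priors, and renormalisation scheme are illustrative assumptions, not the system's actual representation.

```python
from dataclasses import dataclass, field

@dataclass
class Arc:
    word: str          # surface word, "<eps>" for a null arc, "<OOV>" for the error arc
    prob: float        # posterior from the decoder, or prior from the error classifier

@dataclass
class Slot:
    arcs: list = field(default_factory=list)

    def add_oov_arc(self, prior: float):
        """Augment the slot with an OOV arc, as described above; the existing
        arcs are renormalised so the slot still sums to one."""
        scale = 1.0 - prior
        for arc in self.arcs:
            arc.prob *= scale
        self.arcs.append(Arc("<OOV>", prior))

# A tiny two-slot confusion network (weights are made up).
network = [
    Slot([Arc("leet", 0.6), Arc("lit", 0.3), Arc("<eps>", 0.1)]),
    Slot([Arc("an", 0.5), Arc("and", 0.4), Arc("<eps>", 0.1)]),
]
for slot, prior in zip(network, [0.7, 0.4]):   # priors from the baseline classifier
    slot.add_oov_arc(prior)

for i, slot in enumerate(network):
    print(i, [(a.word, round(a.prob, 2)) for a in slot.arcs])
```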
What we’d like to do is, after we parse the entire confusion network structure, get a new one-best path
through the network. Instead of having the incorrect word fame, we’d find that there’s an OOV region
whose syntactic head is a verb, which allows us to ask a meaningful question about it. Also, instead of
having the filled pause we’d be able to remove it. The contributions of this work are twofold. First we
look at using domain adaptation to improve the impact of parsing in our error detection and correction
strategies. But we also add additional features that capture the reliability of the parsing to further
improve the detection task.
Why do we need domain adaptation? Previous work has looked a lot at doing parsing on the target
domain. For example conversational telephone speech or broadcast news, we don’t really have that
luxury here because we don’t have Treebank data available for our domain. There are treebanks in
other conversational domains like Switchboard. But as you can see from these examples the Transtac
data which is what we are working with tends to be rather different from Switchboard. There are a lot
fewer disfluencies in our data, whereas in Switchboard you have a lot of disfluencies, you have a lot
of filled pauses. The sentences tend to be a bit more rambly and so on.
We’re going to look at trying to adapt our parser, train on Switchboard on to the target domain which is
the Transtac data. We’re going to do this in two ways. We’re going to use self-training to capture the
vocabulary and sentence structure of the target domain. As well as capture the confusion network
structural information such as dealing with null arcs. We’re also going to use a task supervised approach
for modeling the ASR errors.
An overview of the system that we’re using is shown here. We have a three stage process. We start
with a [indiscernible] confusion network from the decoder. First we do a baseline error classification to
annotate the confusion network with error, or in this case OOV, arcs. This gives us essentially a prior on
each slot as to whether that slot contains an error or not in the confusion network. The annotated
confusion network is then fed into two different parsers. One that doesn’t know anything about errors
which is used for rescoring and one which does know something about errors in ASR, and the
combination of those two parsers allows us to extract additional features, which are then used in a final
round of error classification to give us the final OOV or error decisions.
In this talk I’m going to focus on the parsing, so both the parsing adaptation, in two ways, as well
as the extra features that we’re going to extract. To talk about parsing confusion networks in a bit more
detail, we start with a standard factored parser model. We’re using the Stanford parser, starting with a
probabilistic context-free grammar trained on a conversational telephone speech treebank.
We’re generating k-best trees from the confusion network. The k-best trees are converted to a
dependency model and then rescored using these dependencies. This gives us a one-best tree over an
entire confusion network.
To the standard approach we have to add two sets of rules. We add the rules for parsing null arcs in the
confusion network, essentially for each non-terminal in the grammar. We add a couple of rules and
then one to generate the actual null arcs. The error model is only added to the error parser. Here we’re
essentially adding an error category for each syntactic category, so each constituent in the grammar.
Then we have a couple of rules to grow that error region and to actually do the generation.
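A schematic sketch of how a rule set could be augmented with null-arc and error rules per nonterminal, as described above; the rule encoding and placeholder weights are illustrative and do not reflect the Stanford parser's internals.

```python
def augment_grammar(rules, nonterminals, null_weight=-8.0, error_weight=-10.0):
    """Add null-arc and error rules for each nonterminal/constituent.
    Rules are (lhs, rhs, log_weight) triples; the weights here are placeholders
    that the task-supervised log-linear model would later rescore."""
    augmented = list(rules)
    for nt in nonterminals:
        # Null-arc handling: allow a constituent to absorb an epsilon arc.
        augmented.append((nt, (nt, "<eps>"), null_weight))
        # Error model: an ERR_ category shadows each syntactic category ...
        augmented.append((f"ERR_{nt}", ("<OOV>",), error_weight))
        # ... and can grow to eat neighbouring words in the error region.
        augmented.append((f"ERR_{nt}", (f"ERR_{nt}", "WORD"), error_weight))
    return augmented

base_rules = [("S", ("NP", "VP"), -1.2), ("VP", ("V", "NP"), -0.7)]
for rule in augment_grammar(base_rules, ["S", "NP", "VP"]):
    print(rule)
```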
Looking at the adaptation methods, so first self training. The goal of the self training approach is to
adapt the vocabulary and sentence structure to the target domain as well as to deal with the null arcs
better. We’re using a fairly standard approach. We’re iterating over the data multiple times, starting
with the parser trained on just the raw treebank. We then parse all the unlabeled data; we have both
speech and text data. We’re adding the most confident trees to the treebank so that we can retrain
with more data in the next iteration. The threshold for what we consider most confident is tuned at
each iteration separately on a development set. We stop whenever we don’t get any significant
improvement. In practice this tends to be about three to four iterations until convergence.
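A generic self-training loop in the spirit of this description; train_parser, parse_with_confidence, and dev_score are hypothetical callables standing in for the real training and evaluation machinery, and the thresholding here is a simplification.

```python
def self_train(treebank, unlabeled, train_parser, parse_with_confidence,
               dev_score, max_iters=4):
    """Generic self-training loop. The callables are hypothetical stand-ins
    for the real parser training, parsing, and development-set evaluation."""
    parser = train_parser(treebank)
    best = dev_score(parser)
    for _ in range(max_iters):
        confident_trees = []
        for sentence in unlabeled:
            tree, confidence = parse_with_confidence(parser, sentence)
            confident_trees.append((confidence, tree))
        # Keep the most confident half here; in the real system the threshold
        # is tuned on the development set at each iteration.
        confident_trees.sort(reverse=True, key=lambda x: x[0])
        threshold = confident_trees[len(confident_trees) // 2][0] if confident_trees else 0
        new_trees = [t for c, t in confident_trees if c >= threshold]
        parser = train_parser(treebank + new_trees)
        score = dev_score(parser)
        if score <= best:        # stop when there is no significant improvement
            break
        best = score
    return parser
```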
The second approach we use is weak task supervision. This is primarily geared towards adapting the
scores for the error arc rules, which don’t appear in our treebanks at all. But we can optionally also use it
to improve the scoring of the null arc rules; we’re going to have experiments that look at this also. The
approach is to augment the probabilistic context-free grammar with a log-linear
model, which is only used to score these error rules and the null arc rules.
What we do is we use as features the presence or absence of these rules in a derivation. All the features
are local, and as the objective we’re using the word error rate of the training set. This is the task supervision
that we’re using: essentially, rescoring is an auxiliary task for training the parser. We could use other
tasks like the detection task instead of rescoring; we found that rescoring tends to work slightly
better. The training is done using an averaged perceptron algorithm. Again this converges fairly quickly,
in about five to ten iterations at most, and we get pretty good results.
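A sketch of an averaged perceptron over rule-presence features with a WER-based oracle, roughly matching the description above; the kbest_derivations and wer callables and the derivation objects (each carrying a features dict) are hypothetical stand-ins.

```python
from collections import defaultdict

def averaged_perceptron(utterances, kbest_derivations, wer, iters=10):
    """Averaged perceptron over rule-presence features. `kbest_derivations(utt,
    weights)` and `wer(derivation, reference)` are hypothetical stand-ins; each
    derivation carries a `features` dict counting error/null-arc rules used."""
    weights = defaultdict(float)
    totals = defaultdict(float)
    steps = 0
    for _ in range(iters):
        for utt, reference in utterances:
            candidates = kbest_derivations(utt, weights)
            # Model's current best versus the oracle with lowest word error rate.
            predicted = max(candidates,
                            key=lambda d: sum(weights[f] * v for f, v in d.features.items()))
            oracle = min(candidates, key=lambda d: wer(d, reference))
            if predicted is not oracle:
                for f, v in oracle.features.items():
                    weights[f] += v
                for f, v in predicted.features.items():
                    weights[f] -= v
            # Accumulate for averaging (naive but correct averaging scheme).
            for f, v in weights.items():
                totals[f] += v
            steps += 1
    return {f: v / max(steps, 1) for f, v in totals.items()}
```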
Finally to talk about the features that we extract from the parser, what we’re trying to do is capture
differences in the parse trees from the non-error models, so the model that just parses confusion
networks, and the model that parses confusion networks with error arcs. We have two types of
features. We have dependency tuples which compare the local structure in the two trees. As well as
inside scores which look at the reliability of the or the confidence of the non-error parser in error
regions.
Here we’re looking at essentially getting two scores: one that looks at just the local tree around a
particular slot that’s part of the error region on the error side, but not in an error region on the non-error
side of course, as well as a larger tree that captures that slot as well as the boundary of the error
region.
To talk about the experimental setup we are using the BOLT speech-to-speech translation system from
the SRI Team. The data is a mixture of military and civil infrastructure domains. We’re focusing on the
English side. We haven’t looked at Arabic. We have about sixteen hundred utterances of the speech
data split sixty, twenty, twenty into a speech train-dev, and eval sets. We also have about eighty
thousand utterances of language model training data which are used for self-training.
The actually Treebank that we’re using is drawn from conversational telephone speech, so Switchboard
and Fisher. We have about twenty thousand utterances here. The ASR system we’re using is a hybrid
deep neural network and Gaussian mixture model system. We’re using the DNN confusion networks
with, and augment them with the one-best GMM system. This was the configuration that our
collaborators at SRI found to work the best. The vocabulary size of the ASR system is about thirty
thousand words. We also use a confidence estimation process also DNN-based. These are used as
features in the error classification tasks.
Looking at results, first the rescoring task: we’re looking at two adaptation approaches, with the self-training
configurations on the vertical axis, and augmenting that with the task-supervised adaptation
gives us the non-task-supervised and task-supervised columns. The baseline is the word error
rate of the ASR one-best. In word error rate, lower is better. What we find is that the language model
training data alone doesn’t give us an improvement in self-training. Adapting the null arcs with this
log-linear model or CRF model doesn’t actually give us a win over not doing that. This is likely due to
overtraining, from our analysis. But what we find is that when we do the self-training with both the text
data from the language model training as well as the speech data, we actually get a significant win over
not doing any self-training but using the parser, as well as a slight but consistent win over the ASR
baseline.
Looking at the OOV detection task we have again a similar configuration, with the various self-training
configurations horizontally, and we’re using two measures to evaluate our systems. We’re using the
more standard F-score measures, where higher is better, as well as a modified version of the word error
rate with OOV regions replaced by a single OOV token. We do this to get a sense of whether we generate
too-large OOV regions, because if we generate too-large OOV regions then we’re going to increase the
word error rate by removing words that should be there. Here again, in the starred word error rate,
lower is better.
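A small sketch of a modified metric along these lines: collapse each run of OOV tokens to a single token, then compute a standard edit-distance word error rate. The collapsing convention here is an assumption about the details, not the evaluation script used in the paper.

```python
def wer(hyp, ref):
    """Standard word error rate via edit distance."""
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1] / max(len(ref), 1)

def collapse_oov_regions(tokens):
    """Replace each run of OOV-marked tokens with a single <OOV> token,
    so over-long OOV regions are penalised as deletions of real words."""
    out = []
    for tok in tokens:
        if tok == "<OOV>" and out and out[-1] == "<OOV>":
            continue
        out.append(tok)
    return out

hyp = ["show", "me", "<OOV>", "<OOV>", "<OOV>", "please"]
ref = ["show", "me", "<OOV>", "the", "map", "please"]
print(wer(collapse_oov_regions(hyp), collapse_oov_regions(ref)))
```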
What we find is that in this case all the parser configurations improve over the baseline of not using any
parsing just doing the structural confusion network features. The best results are obtained again with
the self training using both the unlabeled sets. The final results on our internal evaluation system are
shown here. We’re comparing the baseline without any parsing, the baseline parsing with no self
training and adaptation, and the best parsing with self training.
What we find is that we get an improvement from the parser on the OOV detection task without self
training but not on the rescoring. But when we do this adaptation we get a win on the rescoring task as
well. Again the win is much higher over the base parser but again consistent over the one best ASR.
We also looked at another dataset which comes from the actual evaluation from the BOLT data. Here
we have about thirteen hundred sentences. We’re comparing only the baseline system which is no
parsing against the best parsing system. We’re not doing any tuning on this system. We’re just looking
at how well will we do with the two systems that are used in the evaluation. What we find is that the
OOV detection F-score is actually significantly lower. This is because the data that they used in the
evaluation has a lot fewer OOVs than our internal data. The system was slightly overtrained on the
wrong set. But by using the parsing we get a fairly significant improvement over that, where we can
actually recover a lot of the performance degradation. Again the performance improvement on
rescoring on the rescoring task is larger and consistent with what we saw before.
To conclude, domain adaptation for parsing gives us quite good gains on both the error detection task as
well as rescoring, where the parser acts as a language model. The best results are obtained when we
adapt the parser to match both the vocabulary and the structure of the target domain data and use the
log-linear model only for scoring error rules.
The future work is going to look at modeling different error types jointly within the parser. Not just
doing OOVs or names but doing all of those different types of errors as a single model. We also want to
expand the log-linear model for its scoring rules to not just use local features but also use larger context
or global features in a sentence.
Thank you.
[applause]
>> Lucy: Questions?
>>: Yeah.
>> Alex Marin: Brian.
>>: In your early example you had a proper name OOV that was then recognized as multiple in
vocabulary tokens.
>> Alex Marin: Right.
>>: How exactly does this get labeled in your [indiscernible] in your confusion network? Is it sort of
OOV OOV OOV and then do you try to eventually consolidate it or is it null, null, OOV or?
>> Alex Marin: What it turns up, so I should have had another example for this. But let’s pretend that
this is that confusion network. Let’s say that Litanfeeth was this word here. Then, so ignoring the OOV
arcs, so you’d have Litanfeeth here as. If you align, sorry, if you align the confusion network that’s
generated with the references, right. Let’s say that you have Litanfeeth as a reference arc here. Then
you have a null arc and then another null arc on the following slots. This allows us to mark each of the
slots as labeled with an OOV. What we’d like to do is capture all three of them, or four, however many
there were as OOV. But then when we look at the final one best through the confusion network we’d
want to merge all those into a single OOV slot that combines all those three OOV things.
>>: You do that…
>> Alex Marin: We do that, that’s how we score it. The F-score numbers that we reported are at this
region level not the slot level. We have slot level results, the results are essentially similar but we think
the region level scoring makes more sense. Jim?
>>: In the self training you choose the confident parses to add to the data. Is that using the PCFG tree
likelihoods or is it the DNN [indiscernible]?
>> Alex Marin: That is using the PCFG tree likelihood. It’s not just the PCFG, because we’re also using
the log-linear model to score rules sometimes. But it’s the combination of all those things. We’re using
the inside score of the entire tree. Other questions?
>> Lucy: Any other questions?
>>: That the first slide on the results maybe I misunderstood it. But it seems to indicate that the
conditional random field isn’t [indiscernible] or?
>> Alex Marin: Yes, so what we found is that when we use the log-linear model, the CRF model to score
the null arc rules we don’t get a win. But we do get a win from the self training. The self training, but
doing this additional adaptation did not help.
>>: Do you have an explanation?
>> Alex Marin: We think it’s due to overtraining on those particular rules. What happens is that those
null arc rules, if we do the log-linear adaptation as well as the self training, end up getting a
disproportionately high weight in the model. They end up overgenerating the null arcs in the parses, so the
parses end up eating up words that should have been there, and we end up with a lot more deletions.
There could be various ways to mitigate this. One way would be to not score just the null arc rules with
a log-linear model but to score everything. But this is where we want to go with [indiscernible] with
global features, because then we would actually be able to train on a ton more data, not just the null arc
rules.
>> Lucy: Okay, so I think it’s time for a break. We’re reconvening at eleven twenty. Let’s thank the
speaker again.
[applause]