>> Host: Hi, everyone. It's my pleasure this afternoon to introduce Chris Burges.
Even though we've been colleagues at MSR for a number of years, it wasn't until
I read his bio this morning that I realized he had started in theoretical physics.
Since there's been no mention of supersymmetry or the Higgs boson this
morning, I think it's fair to say he's moved a considerable distance since then.
Chris's work in machine learning has spanned everything from handwriting
recognition to audio fingerprinting, information retrieval, support vector machines,
neural networks, and a number of other techniques. Today he's going to be
talking to us about one of his current interests, which is semantic modeling of
language.
>> Chris Burges: All right. Thank you very much. Fantastic. So I'm wired,
thanks.
So I'd like to start by thanking the organizers -- can everybody hear me? Okay in
the back? Can you hear me? Up? Okay. Okay. I have a quiet voice, even with
amplification.
So I'd like to thank the organizers and the program chairs, in particular Will Lewis
and [inaudible]. I'm honored to be given the chance to talk to you folks today. I
was asked to be somewhat controversial and also technical, and I'm happy to try
to oblige. But technical and controversial are hard to do at the same time so I'm
going to split the talk into two. The first part will be more speculative and
perhaps thought-provoking, big picture, and the second will be some resources
we've been working on that will be publicly available. And that's another reason
I'm very happy to talk today is we've been working on sort of public service
datasets to help forward the field, we hope.
So I'd like to especially thank these folks. I'm using some of their slides and
information in this talk. I think most of them are here.
So let's start with AI. So artificial intelligence. The quest is as old as the hills, as
old as computer science. And unfortunately, it has a long history of high
expectations not being met. So people approach this with trepidation. And when
I tell people, sort of tongue in cheek a little, but not completely, that I want to
work on strong AI, I expect the kind of response like this [laughter] and have got
it. But what was unexpected and interesting was this one. And the why wasn't
so much why on earth would anybody want to solve this problem. It's a very
interesting problem to solve. It's more amidst all this wonderful progress from
deep learning and everything else, why attack such a hard problem? Why not
just, you know, do something equally impactful perhaps and simpler?
So this talk I hope contains my answer. If it doesn't, let me know at the end.
So let's take a look back at artificial intelligence and ask how far we've come. So
to paraphrase George Santayana, those who ignore history are doomed to repeat
it, so it's good to look at the history here. There was actually a page in Wikipedia
devoted to AI winters, and this is from that page. This is a list of events that
sparked an AI winter. The name winter denotes a cycle, so it looks like
something like this. It's interesting to me that a system that makes all the world's
information available to billions of people was not on the AI radar back then.
Namely, web search. I mean, it doesn't do all of this, but it's come a long way.
So let's just look at the first three of these winters. The failure of machine
translation. So MSR Translator now has over 100,000 companies using its
services, and it's widely used internally as well. And to give you a feel for the
growth, this is a graph of daily requests for translations from April 2009 to
August 2013. The numbers are proprietary so I can't show you the y axis, but
you can get an idea of the growth: a factor of 400 since the beginning. The gaps are
due to gaps in data collection, and the outlier spikes are often due to internal
experimentation or sometimes [inaudible].
Rick Rashid gave a demo in Beijing you might have seen -- it's there in case you
didn't -- in which his speech is recognized, translated, and regenerated in
Mandarin using a model of his own voice. So we've made some serious
progress on MT, but it's still far from perfect. We've got a long way to go, I think
everyone would agree.
The abandonment of connectionism is the second AI winter. Let's look at that.
So ImageNet is the image equivalent of WordNet: images are placed in an
ontology like [inaudible], and there are several ImageNet tasks.
This one is classification. So top five correct means that the correct class is in
the top five choices here. This is a hard task. A thousand classes. And deep
convolutional nets are doing pretty well. On the left, the true label is in pink, and
the scores the system gives are shown by this histogram. So it gets things
wrong, but understandably this looks like a -- actually, let's do this. More like a
dalmatian than a cherry perhaps right here. And it's hard to tell the difference
between an agaric and mushrooms. But anyway, we're doing well using all
tricks. Most of the tricks -- most of the technology is quite old. But faster
parallelized hardware and more data have been really crucial here.
The second task is even harder. 22,000 categories in ImageNet, 7 million test
images. It's much harder. The numbers here are accuracies, not error rates. So
it's inverted from the previous slide. And an informal test by Trishul Chilimbi
showed that humans probably get around 20 percent correct on the top choice
on this dataset.
So, for example -- and I should mention that Trishul has some ground-breaking
results. I can't tell you the actual numbers because it hasn't been published,
but -- on this dataset, beating these numbers, significantly.
So these four images were taken from the training set. Can anybody tell which is
the parliamentarian and which is the Pembroke Welsh Corgi? You don't get
extra points for saying these guys are not Corgis.
So it turns out this guy's a parliamentarian and that guy is the Pembroke Welsh
Corgi. So this is tough, a tough problem.
So the third area I went to was the frustration of speech understanding research.
NIST has been doing evaluations on speech since 1988, and this graph shows
those evaluations from that time until 2011. The y axis is word error rate, and in
the early years the domains were narrow, such as air travel planning, and
performance was on a par with the human capability. But open domain problems
are much harder. Things like meeting transcription, television news, captioning.
Switchboard is a benchmark, and in the late '90s there was good progress,
but for the past decade it plateaued, and then from 2010 to 2012 MSR made
significant improvements using deep convolutional networks. Again, fairly
old technology with some new tricks.
So what changed here? I think that there's three things that have changed, and
there's two really big things that have changed. The first is compute power.
From 1971 to 2011, transistor density increased by over a factor of
1000. We also have huge clusters.
So Amazon's Elastic Compute Cloud is estimated to have over half a million
servers. And the web gives us huge datasets; you can think of text on the
web as weakly labeled data. And it's true that algorithms have progressed a lot,
but I wouldn't say with orders of magnitude improvements in accuracy. I mean,
we've gone from sigmoid transfer functions and units to rectified linear transfer
functions, and there are lots of other cool tricks, but they're not -- I just don't think
you can point at them and say that they're responsible. These ideas are pretty
old.
So should we just continue down this path and forget strong AI for now? And for
many of us, the perfectly reasonable answer is yes. It's perfectly fine.
So let's take a look at AI, where we'd like to be, where we're at, and where is
machine learning at and kind of provide what we think we need.
So this is my blue sky slide. You may say there are several. But here are
problems we'd like to solve automatically. I'll just let you read them. These are
incredibly hard. Right? But it's tasks like this that make me excited about this
field, and it's what got me into it, and I hope the same is true for some of you.
So that's where we'd like to be in the end. Right? But it's very far away. It may
seem far-fetched.
So how far are we? Well, there are certainly challenges in NLP. I invented this
sentence here: I house table but them eaten never turnip boat, which I don't
think is grammatical, but certainly the technologies that users use to check
grammar are not very helpful in this situation. So it's largely unsolved.
Paraphrase detection is also largely unsolved. This is an example of paraphrase
succeeding, using a system that takes a sentence, translates it into
many languages, and translates it back into English. One of these sentences was
written by Conan Doyle and the other was generated automatically.
Anyone want to take a guess which is which?
>>: One is automatic.
>> Chris Burges: That's right. I think probably today you'd get better marks from
an English teacher with the first sentence because it's shorter and contains the
same information. The first one is the automatic one.
But paraphrase is a largely unsolved problem. This is just one example that
work, but there are many that don't.
And I think there's a fundamental reason, which is that we need world models to
inform our meaning representations, but we need our meaning representation to
properly [indiscernible] our models. So there's this cycle we need to break into.
And people are doing it. There's exciting progress.
For example, this little piece of text is from data that I'll be talking about in a
minute, and to solve it there's technology right now that we might be able to
adapt to do this by Luke and others. And it's progressing rapidly. So there's
certainly interesting progress and fascinating stuff as attested by this conference.
And another example, this is a cite from [indiscernible] work. So sequencing
technology provides rapidly growing genomics databases, and it needs systems
biology approach to interpret this data for discovering disease genes and drug
targets, and the bottleneck is genetic knowledge, especially pathways, which are
very complex, incredibly complex things to model. But they're in the text.
They're in these papers. So we'd like to automatically extract this. And this is an
example of a path we'd like to achieve given this input from some paper.
So, again, work is ongoing, and it's very exciting.
So how about new kinds of machine learning? Is the machine learning that we
have today pretty much enough?
Machine learning is largely a statistical modeling discipline currently. It's used to
learn from structured data with structured outputs, but I do think it's missing a few
key ingredients.
This is a typical bread and butter machine learning task. This is the first 800
threes in MNIST, the modified NIST benchmark set, which, actually, I had a hand
in helping create for a machine learning
[inaudible]. And it's just a classification task. And it's learning by example, which
is essentially what the speech and ImageNet results used that I just showed you.
So this is the typical thing that people do.
Another example, this is unsupervised learning. I don't know if you've seen this
result, but this was -- actually [indiscernible] was an intern here. Not when he did
this work.
They used a deep auto encoder to train images gathered from YouTube, and
then they took one of the units in the middle of the auto encoder and said which
inputs light up this unit the most, and they regenerated this picture of a cat. So
they generated a cat neuron automatically, which is kind of amazing. There's this
grandmother neuron idea of how the brain works.
But text just seems way, way harder. It's more recursive, fine structure. It's not
typically addressed by statistical learning models. It's hard to see how statistics
would solve this: The ball fell through the table because it was made of paper.
The ball fell through the table because it was made of lead. So in those two
cases it refers to something different. And you have to know that the table, if it
was made of paper, could have a ball fall through it, but if it's made of lead it
probably couldn't, to solve it. These are called Winograd schema sentences.
I'm coming back to this later.
So progress has certainly been made in machine learning on language. Here's
one intriguing example. This is called the skip gram model. There are certain
semantic properties encoded by vectors. So you learn the vector that predicts
the two previous and the two following words automatically in a ton of unlabeled
data. 30 billion words. And you get vectors. And it turns out that if you take the
vector from Madrid and subtract the Spain-ness from it and add the France-ness
to it and ask what's the closest vector to this one in the space I have, you get
Paris. So Paris is Madrid with Spain-ness removed and France-ness added,
which is amazing, right? I mean, good lord.
And then they extended it to phrases, what they call analogical test samples, and
they get pretty good accuracy. So an example being New York is to New York
Times as San Jose is to San Jose Mercury News. That's not easy to do. Seems
like a step in the right direction.
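As a rough illustration of the vector arithmetic described here, the following is a minimal Python sketch using gensim; the vector file name is a placeholder and pretrained word2vec-style vectors are assumed, not anything from the talk.

```python
# Minimal sketch of the vector-arithmetic analogy test described above.
# Assumes pretrained word2vec-style vectors in the standard text format;
# the file name "vectors.txt" is a placeholder, not from the talk.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("vectors.txt", binary=False)

# "Madrid" minus its Spain-ness plus France-ness: the nearest remaining
# vector should come out as "Paris" if the embedding behaves as described.
result = vectors.most_similar(positive=["Madrid", "France"],
                              negative=["Spain"], topn=1)
print(result)  # e.g. [('Paris', 0.76)]
```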
Now, you've got to be careful, though, thinking that we need to model meaning
when we don't necessarily have to. Statistics either is or are powerful depending
on whether that's plural or singular.
Is marble cake a food or a cake of marble? You can probably solve that
automatically, just by searching for patterns like "verb phrase, marble cake" and
"verb phrase, noun phrase" and counting when the noun phrase is a known food.
A more interesting example is this one: John needs his friend. John needs his
support. To whom does his refer? This is a paper by Bergsma and Lin. You just
look at a ton of unlabeled data, look for these patterns where this pronoun occurs
and this word and see what fraction of the time the gender agrees, the number
agrees, and it gives you evidence that in the pattern "noun phrase needs pronoun support,"
the pronoun likely does not refer to the noun phrase. So there's certainly a lot we can do
with pure statistics. Surprising things we can do.
This looks like a very tough anaphora resolution problem. But it seems to
me that such tricks won't be enough probably.
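A rough sketch of the pattern-counting idea behind the "John needs his support" example, not Bergsma and Lin's actual system; the corpus file named in the usage comment is hypothetical.

```python
# Rough sketch of the counting idea behind the "John needs his support"
# example -- not Bergsma and Lin's actual system. We scan unlabeled text
# for "<noun> needs <possessive> support" and tally how often each
# possessive shows up with each noun; a low agreement rate is evidence
# that the pronoun does not refer back to that noun phrase.
import re
from collections import Counter

PATTERN = re.compile(r"\b(\w+) needs (his|her|its|their) support\b",
                     re.IGNORECASE)

def agreement_counts(lines):
    """Count pattern hits, keyed by (noun, possessive pronoun)."""
    counts = Counter()
    for line in lines:
        for noun, pronoun in PATTERN.findall(line):
            counts[(noun.lower(), pronoun.lower())] += 1
    return counts

# Usage: feed in any large collection of raw sentences (hypothetical file).
# with open("corpus.txt") as f:
#     print(agreement_counts(f).most_common(10))
```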
Here's more examples like the ball fell through the table. It's called the Winograd
Schema. And the first sentence was proposed by Terry Winograd in 1972.
There are 141 examples of these things on this website. And it's hard to see how
statistics alone would answer these. Jane knocked on Susan's door but she did
not get an answer. So is that Jane or Susan that "she" refers to? And beyond
that, these are very specific problems. They're anaphor resolution problems. But
there's lots of other problems, of course, where you need world knowledge to
know the answer. So I don't think statistics alone or typical machine learning
approaches will solve this.
So I claim we need a lot more from machine learning. I'm only going to talk
about one of these things now, but it would be great to have interpretable
systems, systems that you can say if this makes an error, I understand why it
made an error. A giant six-layer neural network, it's very hard to say why it made
this particular error. There's a decision surface in some space.
Scalability. We'd like our systems to be -- to work in different domains and on
arbitrarily large datasets. We'd like them to be modular and composable for
debugging purposes.
People are thinking about this, but it's not front and center. And I think it will
need to be for us to really make progress.
So let's look at correctability. What do I mean by correctability? There are two
things, very closely related, that I mean. One is stability. So decision
surfaces are not stable on adding new training data. So here we've got this
decision surface, the green from the red. I add a green point right here and the
red point is now misclassified, right, now that I've retrained.
This is in stark contrast to human learning, which is very compartmentalized. So
this would be like when a child learns to ride a bicycle, he forgets how to brush
his teeth. Right? That just isn't how we learn. We're robust. We can learn
something new without forgetting something we already knew.
So we don't probably learn by moving giant decision surfaces around in some
high-dimensional space. We want to be able to correct errors without introducing
new errors.
Another way of looking at the same problem is separability, and that is the
separation of problems from subproblems. So now we have this same problem
and we want to solve the misclassified green point by adding training data, but
that's like when a child makes a mistake: we don't lock him in the library for a week
and ask him to update his parameters, right? There are very few bits transferred
between student and teacher. If you measure what they actually say, they say very
little. So there must be some rich shared world model that each actually has a
model of in order to have this communication occur. And we're just miles away
from this kind of modeling, I think, today. And machine learning, you know -- we
need to think about this, I think.
All right. So the nature of meaning is slippery.
Actually, I'm going to ask you to give me a 10-minute warning because my clock is
completely off.
And so I'm not going to try and define meaning because that's just a black hole.
It's been discussed since Aristotle's time.
It's much safer to use operational tests. And so what do we have for operational
tests of meaning? There's Turing's test. But it encourages the wrong thing.
[Indiscernible] humans, as Levesque pointed out. And participants must
necessarily lie. If you ask the computer about his mother or the person about
what operating system he runs, he has to lie [indiscernible]. And, actually, we
wouldn't mind knowing we're talking to a computer if it's much smarter than
anybody we could possibly be talking to, right? That's a good thing.
So how about the Winograd Schema? It's beautiful data, but it's expensive and
hard to create; generating this data takes real creativity.
This is the original sentence Terry Winograd proposed, but it only tests pronoun
and anaphora resolution. So question-answering is huge, and I think it's a
useful thing to view comprehension as simply question-answering in the following
sense. If any question you could ask can be answered by the system, then the
system comprehends the text.
And this we can measure. The one thing I don't like about this definition is that it
has humans in the loop. It's very hard to come up with definitions that involve
meaning that don't have humans because then you're defining meaning. So we
have humans in the loop just like Turing did.
All right. So I'm going on to a good clip, so I'm going to go to part two. That was
the speculative. If anybody has questions, by the way, you can jump in during
the talk if you like or hold them until the end. It's up to you.
So now we generated two datasets because there was a need for these datasets
to help, we think, researchers progress in semantic modeling. And, in fact,
progress is often tied to the availability of good benchmark test sets, everybody I
think would agree.
So what are the properties we require of these kinds of datasets? We'd like them
to be public domain. Copyright-free. Anybody can do anything they like with this
data. There has to be a very simple, clear, unambiguous measure of success.
Data collection is scalable. We'd like a big gap. So between humans and
machines, you know, [inaudible]. A good baseline. Open domain yet limited
scope. The problem with limited domain is you can solve the [indiscernible] data
set very, very nicely, but then you've just solved the [indiscernible] data set. We
want something that's more generalizable. And ideally we'd like a hardness
knob. When people get better at it, we'd like to make the problem harder without
a huge new collection of data. I think we've done this.
So the first dataset is called the Holmes Data. And it's joint work with Geoff
Zweig here at MSR. So this was an SAT-style sentence-completion test, and the
task is to select the correct word from five candidates.
We took the target sentences from five Sherlock Holmes novels from Project
Gutenberg and the training set we used for the N-gram models was also taken
from Project Gutenberg. We started with 540 documents. We whittled down to
522 to make sure that everything was copyright-free.
So then we used N-gram models to build decoys. And these are the novels we
drew the sentences from for the test data. There's huge variation in style across
the five novels which is why we settled on Sherlock Holmes. So it's limited scope
in that sense. It's just one writer. But he can write about anything.
So to generate the decoys we select the sentence at random, we select a focus
word with low frequency, we compute the tri-gram probability, and we save the 150
highest-scoring of these low-frequency candidates. If the original word wins, we reject
that because it's too easy. Unfortunately, this introduces a bias in the opposite
direction.
And then we score by the N-gram looking forward. And then we retain the top
30, and the human judges, which was actually my family and Geoff's family, pick
the best decoys, syntactically correct but semantically unlikely. And by
syntactically correct, we just mean agree in number, gender, tense, and that kind
of thing.
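A compressed sketch of that decoy recipe; `trigram_logprob` and `low_freq_words` are hypothetical stand-ins for the Project Gutenberg N-gram model and its low-frequency vocabulary, and the forward re-scoring and human curation steps are only hinted at in the comments.

```python
# Compressed sketch of the decoy-generation recipe described above.
# `trigram_logprob(context, word)` and `low_freq_words` are hypothetical
# stand-ins for the Project Gutenberg N-gram model and its vocabulary;
# the real pipeline also re-scored candidates looking forward and relied
# on human judges to pick the final decoys.
import random

def candidate_decoys(sentence_tokens, trigram_logprob, low_freq_words,
                     n_keep=150, n_final=30):
    # Pick a low-frequency focus word in the sentence to replace.
    focus_positions = [i for i, w in enumerate(sentence_tokens)
                       if w in low_freq_words]
    if not focus_positions:
        return None
    i = random.choice(focus_positions)
    context = tuple(sentence_tokens[max(0, i - 2):i])

    # Score every low-frequency word in that slot and keep the top n_keep.
    scored = sorted(low_freq_words,
                    key=lambda w: trigram_logprob(context, w),
                    reverse=True)[:n_keep]

    # If the original word wins outright, the item is too easy: reject it.
    if scored and scored[0] == sentence_tokens[i]:
        return None

    # (The real pipeline re-scored these with a forward-looking N-gram
    # and kept the top 30 for human judges to choose decoys from.)
    return i, scored[:n_final]
```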
So humans had to do some curation here. And here were the rules: the sentences
must be grammatically correct; the correct answer should be a significantly better fit
than the decoys -- the decoys could fit, but the correct answer has to be clearly better,
like in an SAT test; and no offensive sentences. These were written a century ago.
Here's an example that requires some thought. They all require some thought.
Was she his client, musings, discomfiture, choice, opportunity, his friend, or his
mistress? Well, friend and mistress are people, so it's probably client. So you
have to think like that.
Here's one that required world knowledge. Men have to be older than seven
years old. And so forth.
Again, this is a fairly labor-intensive thing.
Here are some samples. I'll just let you look at them.
I'll draw your attention to two. So with this question you need to know that fungi
don't run, but that's actually not true. You could have a tree with fungi running up
the side. It's just there. In this context it's true, so it's a great question.
Here you have to know that you don't stare rapidly. If you look at something -- if
you stare at something, you're doing it for an extended period of time. You would
glance rapidly. So this does require knowledge, we think, that is hard to model.
So here is how the various systems do. The human, Geoff's wife, was tested on
100 questions. The generating model didn't do as well as the N-grams because
there's a bias against it, as I just explained.
The smoothed N-gram models were built from the CMU toolkit, and the positional
combination was us being devil's advocate, us knowing exactly how this data
was created, do our worst to build a system to solve a problem by knowing the
tricks of the biases involved, and we got up to 43 percent. So we didn't succeed
in our goal of trying to make this completely impervious to knowing how it was
created, but it's still a far distant cry from human performance. And then latent
semantic analysis does slightly better.
And I'm happy to report that since then, there have been people publishing on
this dataset, getting quite good results. So Mnih uses a vector space model. He
predicts 10 words, five before and five after. And Mikolov uses the skip gram
model I just described and does pretty well on this. But it's still far from 100
percent, which is -- we think a human less harassed might get 100 percent of
this.
And the data is available at this url. So the urls I'm giving today contain the
papers, the background, the data, everything you need to play with this stuff.
All right. On to the stories data. This is the second dataset that I just want to tell
you guys about. This is joint work with Matt Richardson and Erin Renshaw, and,
again, there's a website where we will be collecting results.
So Holmes was a useful dataset, but it's not really scalable. There was a lot of work
to actually wind up with these 1,040 sentences. So we wanted something also
more general, because that's a particular kind of test. We want something that
tests comprehension.
So we went to comprehending stories with multiple choice questions. This is an
old problem. [Indiscernible] in 1972 worked on this kind of thing. The difference
really with this is the scalability of the data collection.
So reading comprehension seems like a good way to do this. It's how we test
people's understanding of text. Again, I'll just let you read this.
Progress is easy to measure with multiple choice questions.
One of the pushbacks we got when we sort of did this was maybe first grade is
too easy, guys. Come on. Is it? So this is a parents' guide to the first grade
curriculum. I don't know if you can read this, but -- let me just read it to you.
Describe, retell, and answer questions about key ideas and details. Explain the
differences between factual and fictional stories. That's a really hard problem to
automate. Begin reading complex text such as prose and poetry. These are
not machine-solved problems at all. So we do not think that first grade is too
easy. In fact, we think it's the opposite. If you've solved first grade, you've
probably gone a long way to solving everything.
So the MCTest data looks like this. Fictional stories, meaning you can't go to
Wikipedia to find the answer. The answer is in the story and only in the story.
Sam's birthday is tomorrow. When is Sam's birthday?
Multiple choice. We limited the vocabulary to what a typical seven-year-old
would know to try and limit the scope. We keep it open domain.
There are no copyright issues. This is generated using Mechanical Turk. So
you're free to use this data any way you like. We looked into using SAT and the
rest, but we couldn't do it.
And we wound up with 660 stories, and we've done this in such a way that we
think we could collect 10 times as much. If there's interest in this community on
this dataset, we will collect more and support it.
So we paid $2.50 a story set because they had to write a story and four
questions, and for each question, four answers, one of which is correct and three
are incorrect. And we got a wide variety of questions.
So M Turk has over a half million workers. They're typically more educated than
the U.S. population in general, and the majority of workers are female. Just
curious statistics.
This is an example story set from the first 160. And to solve this -- I put the
correct answer on top. That's not true in the data that's on the web. You need to
know that turning two years old happens on a birthday, for example. And there
are some extremely challenging NLP problems buried in this data, anaphora
resolution in particular, where a pronoun refers to something several sentences back.
So what we did was we did a little test run of 160 of these stories. We gathered
160 and then we curated them manually, and that led to ideas on how to do a
second run at a data collection that was more automated, so we know how to
scale.
And we also worried that people would write the ball was blue and then the
question was what color was the ball. It's too easy, right? So we asked that they
have at least two of the four questions require multiple sentences to answer. So
you have to go into the text in two places. This was actually very confusing to
the M Turkers, what we meant by this. We struggled with trying to explain what
we meant by this.
We did fix errors on the 160, and we found certain workers to avoid. So we
touched about two-thirds of the stories.
And this enabled us to improve the process to generate much more data. So
now we collected 500 more stories, and we automatically verified that the distracters
appear in the story. Otherwise it's too easy: if only the correct answer appears in
the story, you can spot it by matching -- unless the correct answer deliberately does
not appear in the story, as in "how many candies did Jimmy eat," where the answer
is not stated in the story but you should be able to work it out.
Creativity terms: we wanted to increase diversity, so we took 15 random nouns
from the vocabulary, presented them to the workers, and said use these nouns
if you like, but you don't have to.
We added a grammar test, which actually had a significant impact on the quality
of the stories.
Thank you very much.
And then we added a second Turk task to grade stories and questions.
So here's the grammar test. I came up with this, mostly. Half of these are
grammatically incorrect, and we required that the workers get at least 80 percent
correct to do the task. We found that requiring 90 percent correct would have
extended the data collection time from days to weeks, so 80 percent was what
we went for. I won't test you on the grammar test.
So here's the effect of the grammar test. We took ten stories randomly chosen
with and without the grammar test in place that were generated and then we
blindly graded each for quality. And you can see it made quite a difference. So
the quality measures do not attempt to fix the story; the grades were roughly:
it's bad but it's rescuable, has minor problems, has no problems.
And interestingly, somewhat randomly, it also reduced the number of stories
about animals [laughter] significantly. And the quality was just better.
>>: Does that include humans?
>> Chris Burges: Does not include humans.
So then we had each story set evaluated by 10 Mechanical Turk workers with the
following measures, and we also had them answer the questions. And what that
enabled us to do is to say if the M Turker couldn't answer the questions
themselves, then they probably shouldn't be trusted and -- I mean, perhaps the
story is terrible, but threw out the guys that just can't seem to answer the
questions themselves. So we removed the guys that were less than 80 percent
accuracy, about 4 percent of people, and then we just did a manual inspection.
Now, this is manual. So it's less scalable. But there wasn't much time needed.
One person, one day. And we actually corrected -- polished it up to make the
stories make sense or the questions and answers make sense for this small
subset of the data.
So we think we can get actually a factor of 10 more data pretty easily if we need
to.
So this is a note on the quality of the automated data collection. The grammar
test improves MC500 significantly. So here, this is without any grammar test,
and this is with. The editing process improves MC500 clarity and also the
number correct. And MC160 still grades better. So, still, we didn't get as
good as manually correcting the stories ourselves by doing this automated
Mechanical Turk, but it's a lot better than it was.
Then we wrote a simple baseline system. I won't go into details because of lack
of time, but it's basically a sliding window. You take the question, the answer,
build a set of words, slide it over the story, look for hits, do a TFIDF count, the
highest answer wins, which worked pretty well, and then we added that -- that
sort of ignores things outside of this moving window, so then we added a
distance component that measures, in the story, the distance from the
question words to the answer words. This is the whole thing. It's very simple.
And it returns the difference of the two scores.
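A simplified sketch of that sliding-window-plus-distance baseline, following the recipe as described in the talk rather than the exact published implementation; the inverse-count weighting used here is one plausible choice.

```python
# Simplified sketch of the sliding-window + distance baseline described
# above; the exact weighting in the published MCTest baseline may differ.
import math
from collections import Counter

def inverse_counts(story_tokens):
    counts = Counter(story_tokens)
    return {w: math.log(1.0 + 1.0 / counts[w]) for w in counts}

def window_score(story_tokens, target_words, ic):
    """Best sum of inverse-count weights over any window of |target| size."""
    size = max(1, len(target_words))
    best = 0.0
    for start in range(len(story_tokens)):
        window = story_tokens[start:start + size]
        best = max(best, sum(ic.get(w, 0.0) for w in window
                             if w in target_words))
    return best

def distance_penalty(story_tokens, q_words, a_words):
    """Smallest token distance between any question word and answer word."""
    q_pos = [i for i, w in enumerate(story_tokens) if w in q_words]
    a_pos = [i for i, w in enumerate(story_tokens) if w in a_words]
    if not q_pos or not a_pos:
        return 1.0
    d = min(abs(i - j) for i in q_pos for j in a_pos)
    return d / float(max(1, len(story_tokens) - 1))

def answer_question(story_tokens, question_tokens, answer_options):
    ic = inverse_counts(story_tokens)
    scores = []
    for answer_tokens in answer_options:
        target = set(question_tokens) | set(answer_tokens)
        sw = window_score(story_tokens, target, ic)
        dist = distance_penalty(story_tokens, set(question_tokens),
                                set(answer_tokens))
        scores.append(sw - dist)   # the "difference of scores" in the talk
    return max(range(len(scores)), key=scores.__getitem__)
```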
And this is how it does. It gets about 60 percent correct when random would be
25 percent correct. The MC160 is easier, as we said. Having a grammar test
and the other tests actually made the data harder.
So here single refers to questions that require only one sentence in the story to
answer. Multi requires multiple sentences. And these are the two datasets.
And, in fact, really for us everything's a test because only this was used to
actually build a model. So all of this is test and all of this is test.
And here W is the window and DS is the distance algorithm.
So one of the tests we did is we were worried that somebody would come
along -- we're still worried -- and solve the whole thing with some trick. So what
about viewing this as a textual entailment problem? Let's use an off-the-shelf
textual entailment system that says given text T and hypothesis H, does T entail
H?
So we can turn the questions and answers into statements like this that
[indiscernible] had for a different project, question and answer on the web, that
we could use to do this. And then we select the answer with the highest
likelihood of entailment using an off-the-shelf system, which actually turned out not
to do as well as the baseline by itself, although combined with the baseline it did
slightly better.
This may not be fair to the BIUTEE system. There are reasons it may have
failed. They may not be tuned for this task, question to statement conversion is
not perfect, and so forth.
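A schematic of the reformulation, with both the question-to-statement converter and the entailment scorer treated as assumed black boxes; BIUTEE, or any RTE system, would play the latter role.

```python
# Schematic of the entailment reformulation described above. Both
# `to_statement` (question + answer -> declarative hypothesis) and
# `entailment_score` (text, hypothesis -> likelihood of entailment) are
# assumed black boxes standing in for the actual tools used.
def answer_by_entailment(story_text, question, answers,
                         to_statement, entailment_score):
    hypotheses = [to_statement(question, a) for a in answers]
    scores = [entailment_score(story_text, h) for h in hypotheses]
    return max(range(len(answers)), key=scores.__getitem__)
```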
It is an interesting task, and we can get close, but the baseline is pretty solid
here. At least the first sophisticated thing we tried was not as good as the baseline.
So MCTest is a good resource, and I hope you use it. We will maintain this on the
web, not only the data itself, but also people's results on it. So if you want to
compare with somebody else's system, you're going to be able to get scores per
question, per answer to see how your system compares with theirs, and you can
do paired T tests and things like that on the data. So we want to keep pretty
much everything that people might want to use for that reason.
Okay. I still have five minutes, so I can quickly, then, show you a bit more work
we did. I will stop on time.
This is work I did with Erin and Andrzej labeling this dataset. We only labeled
MC160, but we've labeled it pretty exhaustively. And these labels will also be
made available on the web for anybody to use, of course.
So obviously labels are useful for machine learning, but they're also useful to
isolate parts of the system you want to work on and have everything else work
perfectly to see how that part is doing. And, similarly, it helps, for the same
reason, the test bed for experiment with feedback. So if the coref module and
the downstream module are switched on, everything else is just using labels, we
can see how the downstream module feeds back and how that works and all the
rest.
These are the labels we have now. We seeded the sets with SPLAT which is
MSR's toolkit and another one, another toolkit we have. We've even run coref
chains. And, now, I did a lot of this labeling. I learned a lot about labeling this
stuff. It was interesting to me.
We also have been playing with Wall Street Journal data. So in the Wall Street
Journal data, chief executive officer, even [indiscernible] nouns is tough. The
word executive is labeled as a noun about half the time and an adjective about
half the time in that -- in WSJ. He went downstairs. Downstairs is an adverb
according to the OED. The word "outside" can be a noun, adverb, preposition, or adjective.
And coref gets even more complicated. But we're going to try not to worry about
that since if we made a perfect coref system we'd be spending all the time just
doing the labeling.
But here are some of the labeling issues. You have a whole bunch of mentions
of Sam and then Peter and then the word "they." Well, that's a coref chain with
Peter and Sam, but which Peter and Sam, which tokens? So we just put the two
previous mentions in the chain.
And then you can have something like Sarah and Tim, which is referred to by
"they," but then Sarah by "she" and Tim by "he" so the mention and submentions.
Same with John's shirt. And then this is a really tough one. I don't know if
anybody can really get this. So Bob was Bob the Clown during the day and at
night he became Bob the axe murderer [laughter]. Bob the Clown was a nice
guy. Bob didn't know which Bob he preferred. Where do we put that Bob?
Tough.
And here's my favorite. This is just because I'd been looking at some number for
coref. My family is large, and tonight they are all coming to dinner. The same
token is singular here and plural here. So you have to deal with that kind of stuff.
So I'm going to wrap up because I'm fairly -- let's just do two minutes. I can do
this in two minutes. Okay.
So then we also labeled animacy. Again, animacy is tricky. Are dead opera
singers animate? Caruso is considered one of the world's best opera singers.
He. In that sentence it would be useful to treat him as animate even though he's
not alive, right? So we treated animate as at one time animate.
Collectives can be animate. The team went out for lunch. Cats like naps.
People like sunshine. And some things are just not clear. One boy was named
James. Well, James refers to a name which is not animate but also refers to
James himself, who is. So that's tricky.
And then we did a little bit of testing on the data I just showed you. How
ambiguous are the nouns? It's 8000 words. How hard could it be? Well, the
number of noun senses in WordNet for those nouns is about four for each
noun on average, and the word senses are here, and these don't include
these kinds of ambiguities, part of speech ambiguities. So there's definitely a
problem. This is a tough problem.
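For a sense of how such an ambiguity count can be reproduced, here is a quick check with NLTK's WordNet interface; the word list is illustrative, not the actual MCTest vocabulary.

```python
# Quick check of noun ambiguity with NLTK's WordNet interface; the word
# list is illustrative, not the actual MCTest vocabulary.
# Requires the WordNet corpus: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

words = ["ball", "table", "bank", "party", "dog"]
sense_counts = {w: len(wn.synsets(w, pos=wn.NOUN)) for w in words}
print(sense_counts)
print("average noun senses:", sum(sense_counts.values()) / len(sense_counts))
```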
Similarly, for verb meanings, how many verb meanings are there? So we
identified the verbs using the SPLAT part-of-speech tagger and then went to
Simple Wiktionary and asked how many verb senses each has -- it's a lot more if
you use the full Wiktionary. So mostly they had one, but there are things that have
35: blow up, explode, get bigger, a whistle blows, he blew the whistle -- these are
all different meanings of blow. We couldn't use the full dictionary because it just
goes over the [indiscernible]. Alligator is a verb, ash is a verb in the full Wiktionary,
and it's just going too far for us.
All right. Thank you very much.
[applause].
>> Host: We have some time for questions.
>> Chris Burges: I'll just point, so --
>>: You said humans don't forget things when they learn new things. My kids
did overgeneralization for a while, learning to speak. They would say went and
then later go for a little while before they went back to went.
>> Chris Burges: Yes. Fair point. But I think in general when you take an adult
human and they learn something new, it's hard to find examples like this. It's true
maybe for kids, but people are learning every day all the time while they live,
right? And we don't see this phenomenon very much. Certainly it's true in
children to a certain extent. That's a good point, yeah. So maybe -- I don't know
what that says about the brain. Maybe something interesting.
>>: I was just wondering about the fictional content in the stories, whether that's
kind of distracting from the task of building accurate models of the real world
when you have talking animals and --
>> Chris Burges: Yes. That's a good point. The ball was blue and the ball was
happy and all that -- we get stories like this. That's true. It makes it harder
because you can't reason the way you would about a ball in the real world, where
it's inanimate, yeah.
We really wanted this set to be self-contained in the sense you couldn't go to
Wikipedia and get the answer or the web and get the answer. So fiction kind of
solves that problem. And it does raise other problems. In fact, we would like to
add non-fiction. So we don't want to stop here. We just wanted to start with
something to work with. But we felt that the benefits outweighed the -- there are
problems. It makes it a harder problem for sure.
>>: So at the beginning of your talk you were kind of asking the question do we
have all the bag of tricks we're going to need for machine learning. And so my
question for your dataset is, is it the kind of problem that we haven't defined what
we need to know from the problem or about the problem to answer these so we
don't have machine learning techniques for an unknown problem or --
>> Chris Burges: Well, so that first part of the talk is just opinion. So I don't
know the answer to this question. It's a subtle question. So you're saying we
don't even know if what we have is sufficient or not. But we can look at
limitations of what we have. And there are some pretty strong limitations of
current approaches. So I think it's -- like, for example, interpretability, if I'm
talking to you and I clearly misunderstand something you say, you can detect
that and correct it with just a few bits of information. So you have some model,
you have some interpretation of my error and of what I understand. Teachers do
this all the time. But that kind of learning we tend not to have in machine learning
models. So it may be that we don't need any of this stuff. I'm told this, right?
There are sceptics that say you will not need any of this stuff, just use deep
learning and lots of data and see where it goes. And that's actually a perfectly
valid approach, I think.
But I think it's worth considering this question, because it might be true. So it
might be that adding -- coming up with a model which is interrogable, you can
ask it why it believes what it believes and understand it. It sounds rule-based,
but -- and do this in a scalable way, which rule-based systems tend not to be. If
we can solve these problems, I think -- interpretability I think is very powerful. If
you understand why the system is making the mistake it makes and you can
correct only that mistake, then that's progress. Whereas, if I have to just retrain, I
don't know what else is going to break. I don't know if that answers your
question. I sort of rambled on a bit. Does that help?
>>: Well, so I guess my particular question is, in this new data that you've
defined, in some sense we don't know what it's going to take to solve that
problem --
>> Chris Burges: Okay. So we have done informal tests on the new data of roughly
how many questions need world knowledge, like the kind of example I was giving,
the ball fell through the table because it was made of paper, that kind of thing.
We get about 30 percent. So we need something with world knowledge for sure
to really know that we're solving this. So that's not been done yet, right? We
don't have a reliable, scalable model.
>>: So one way in which this has a flavor of a task that might be gamed slightly
is that, as far as I understand it, in every case you actually do have an answer
available in the story. Have you thought about augmenting it with distracters
where a human would be able to recognize fairly straightforwardly that the
answer is not present?
>> Chris Burges: Great question. And in fact that's one of the ways we plan on
making the dataset harder if it turns out to be too easy. We just randomly
remove some of the correct answers and put in imposters in all four and you
don't know which is which. So we are thinking about how to make this data
harder as we collect it.
There's another way which is kind of interesting, and that is take this baseline,
run it in realtime as the Mechanical Turkers do their thing. If it's too easy, if the
baseline gets it, then have them do another one. Then you're going to make it
really tough because you know the baseline can't do it. So I think there are ways
to make the data arbitrarily hard. Certainly making -- a step you suggest is a kind
of intermediate step. But then there's just "write down the answer," right? And that's
even harder. But that's hard to measure too, because then you've got
paraphrase to solve.
>>: So related to that, since 70 percent of the stories were initially around
animals, it shows that people aren't very creative in coming up with the stories.
So how well do you think you could do without reading the story, just --
>> Chris Burges: It's interesting you ask that question. So we did a crazy
experiment. Let me finish with this crazy experiment. We took -- we were
intrigued by these vector results, right? People were getting amazing results
with vectors. So we threw away the story, we threw away the question -- this is
just the first crazy baseline you do -- we took the answers and we assigned a
random vector to each word. And -- what did we do? We found that we get 31
percent correct, so not the 25 percent you'd expect, by comparing -- yeah, just by
looking at the answers themselves.
So the answers themselves are biased. Right? They tend to be longer, the
correct answers, they tend to be more about baseball than cricket because
baseball is more popular. So there are definite biases you can pick up
automatically just by training on just the answers. And you have to be careful
about that kind of stuff.
>>: Question here. So it seems like there might be a slippery slope about being
disjunctions in the sentence or sentences requiring math computations. So how
did you curate to make sure that there wasn't something that said John had five
balls and Suzie took two, so how many balls were left --
>> Chris Burges: There are rather few of those. We did look at all the stories,
and the 160 we looked at very carefully, actually. But it's funny you should
mention that, because there's one story that was completely insane. It was about
people going to the fair, and there was a jar full of jellybeans, and you had to
guess how many there were, and somebody took 10, another person took five.
But it never said how many were in the jar to start with, and the question was
how many were left. Right? So it's an impossible question. So we just had to
ditch the story really because -- I mean, it was really -- so there are biases we
should worry about, but I think the fact that there are so many people -- and
animals are in the stories, but they're actually surprisingly varied. If you look at
the main topic, the main topic wasn't animals in these stories, it was all over the
map. There are birthdays and parties and parks and pets quite a bit, but there's
a surprising variety of topics there. Does that help answer your question?
>>: Yeah. And also about disjunctions, you could have set operations across
the whole story which might be difficult to compute and might need to invoke a
different sort of module that actually does the set computation.
>> Chris Burges: I think that's true. In fact, we're also attacking this data
ourselves. I haven't talked about that at all. But we would like to attack it from
that point of view of taking these little tiny things like find the events in the story,
find the entities, find the animate entities, do the coref and see how far we get
with just breaking it off in little pieces. So that's what we're attacking right now.
We're building a platform to do this.
>> Host: We have time for one more just so we have time to transition.
>>: So at the beginning you talked about the Turing test and how maybe the
computer should know that it's a computer instead of trying to --
>> Chris Burges: Right.
>>: And so maybe the world knowledge of computers is then in the computer
domain. So instead of using stories that talk about animals and gravity and
paper, what if the stories were about things and the computer like bits and bytes
and processes and operating systems.
>> Chris Burges: But how many files do you have in this directory? Stuff
like that. Yeah.
>>: I mean, would that change it? So how does that really connect --
>> Chris Burges: So you're saying -- that's an interesting question. So still
natural language, so it's still an interesting, hard task, but in a very limited domain
about what the computer knows about itself. What is your memory?
>>: And also following instructions. My personal project is about read me files
and following the instructions that are in read me files.
>> Chris Burges: Having the computer following instructions?
>>: Yes.
>> Chris Burges: Yeah, well, that's a wonderful thing to work on. And I think it's
pretty hard, too. Yes, that's an interesting point. Thank you.
>> Host: Thank you very much.
[applause]
>> Host: Okay. Let's start. We have two talks in the session. Each one will be
15 minutes followed by five minutes of questions. And the first talk, the title is
here, and it's going to be presented by Enamul Hoque.
>> Enamul Hoque: Thanks a lot. So I'm going to present a visual text analytic
system for exploring blog conversations. This work has been done in our NLP
group at UBC with my supervisor Giuseppe Carenini, and this work has been
accepted in the [indiscernible] conference to be held this year at [indiscernible].
So as you all know, in the last few years there has been this tremendous growth
of online conversations in social media. People talk to each other in different
domains such as blogs, forums and Twitter to express their thoughts and
feelings.
And so if you look at some statistics of a particular domain such as blogs, we
could see some very fascinating figures. So as you could see, there is more than
100 million blogs already there in the internet, and these figures keep rising
exponentially.
So now that we have so many blog conversations and other threaded
conversations on the internet, we might start wondering whether the current
interfaces are sufficient enough to support the user task and requirements. So
let's have a look at an example of a blog conversation from Daily Kos, which is a
political blog site. And this particular blog conversation started with talking about
Obamacare health policy, and then it also started talking about some other
related topics. And then other people started to make comments about this
particular article, and very quickly it turns into a very complex, long thread of
conversation.
So imagine a new user coming to read this conversation: seeing all these
comments in a large thread makes it almost impossible to go over each of the
comments, understand what this conversation is all about, and get insight into
the different opinions and [indiscernible] being expressed within the conversation.
So that leads to the information overload problem. And, as a result, what
happens is that the readers or participants of the conversation start to skip the
comments, they often write shorter responses, and eventually they leave
the discussion prematurely without fulfilling their information needs.
So the question is how can we address this problem better so that we can
support the user task in a more effective way. So there have been some
approaches in the information visualization community, and what they try to do
is visualize some of the metadata of the conversation; especially in the earlier
works, they try to visualize the threaded structure. The idea is that by
visualizing the threaded structure, one
could possibly navigate through the conversation in some better way.
But the question is, does it really tell you anything about the actual content of the
conversation? That's not the case. By seeing this visualization, a reader can
hardly understand what the conversation is actually about.
And so, quite naturally, researchers moved on to using some simple NLP in their
visualization research. So, for example, this work in early 2006 tries to use some very simple
TFIDF-based statistics over the timeline, trying to visualize the topic flow, the
term flow, over time. And, similarly, more recently the Tiara system
from IBM also tries to visualize a [indiscernible] topic model over time,
trying to show the topical evolution so that the user can get some insight
about how the conversations evolved over time.
But, still, the underlying NLP methods are either too simple so that they're more
error prone or they're not designed specifically for the conversation. So that
makes it difficult to get insight.
On the other hand, NLP approaches try to extract content information from the
conversation. So things like topic modeling try to assign topic labels to the
different topical clusters of the conversation. Sentiment analysis tries to mine
the sentiment information for different sentences in the conversation.
Similarly, we have summarization, which tries to pick the most important
sentences or generate abstract sentences.
We could have some other relations being extracted, such as [indiscernible]
or [indiscernible] relations, but sometimes this NLP output can be too complex
to be consumed by the user, and the question is whether the user really wants
to see all those different complex outputs of the system. Do they really want to
see the speech acts or the rhetorical relations?
So that's why we argue that we need to combine both NLP methods and InfoVis
together in a synergistic way so that both techniques can complement each other.
So the goal of this work is to combine NLP and InfoVis in a synergistic way, and
in particular to make this happen, we asked the following questions. For a
specific domain of conversation such as blogs, what NLP methods should we
apply? What metadata of the conversations are actually useful to the users for
performing their tasks? And how should this information eventually be
visualized so that it can actually be helpful to the user?
So to answer all these questions, we applied a human-centered approach to
design, which was proposed by Munzner for InfoVis applications. It's commonly
adopted in InfoVis, but there has not been much work in the area of visual text
analytics, where the idea is to combine both NLP and InfoVis together. So our
work is one of the earlier works trying to use this nested model to combine NLP
and InfoVis together.
So in particular, using this nested model, we try to use a systematic approach
that goes through a set of steps to eventually design a visual text analytics
system. First, taking a domain of conversations such as blogs, we characterize
the tasks that the users perform; then, based on that, we pick the data and tasks
that are useful to the user; then we mine those data from the conversation using
NLP methods; and, finally, we develop the interactive visualization system for
the conversation.
So in the next few minutes I'll go over each of these steps. So first to
characterize the domain of blogs we rely on the literature on different areas such
as computer mediated conversations, social media, human computer interactions
and information retrieval. The question that we ask when we review this
literature is trying to understand why and how people actually read blogs.
And by asking this question, we get results from the literature. For instance,
people read blogs because they want to seek more information, they want to
check some fact with traditional media, or they might want to keep track of
arguments and evidences, and maybe they just want to get fun out of it.
Similarly, we also analyze how they want to read blogs, and then using all this
information, we decide what tasks are usually performed by the blog readers and
what kind of data that we should use to support those tasks.
And so in the next step we came up with a set of tasks based on the domain
analysis that we have done in the previous [indiscernible]. So these are some of
the tasks that typically a user might want to perform when they read blogs. For
example, they might want to ask questions about what this conversation is about,
which topic is generating more discussion, how controversial was the
conversation overall, were there a lot of differences in opinion.
So once we have this list of questions, and each question represents a task,
the next step is to identify what data variables are actually involved in these
tasks. Some of them could come from the NLP methods. For instance,
topic modeling and opinion mining could help in some of those tasks, while also
we could use some other metadata of the conversation such as author, thread,
and comment.
So once we identify those data variables, the next step is to mine the data from
the conversation using NLP methods. Specifically, we use topic modeling
and sentiment analysis. In the case of topic modeling, we use a method that
has been developed specifically for dealing with conversational data. So in this
case to take advantage of the conversational structure, we take the information
of reply relationships between comments and then turn it into a graph called
fragment quotation graph, so where we're representing each node as a fragment
and the edge is representing the replying relationships between fragments.
The intuition behind representing this fragmentation graph is that if one comment
or a fragment is replying to another fragment, then there is a high probability that
the fragment that is commenting to its parent comment or fragment, talking about
the same talking. So basically from this we can actually inform the topic
modeling process.
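To make that concrete, here is a minimal sketch of building a fragment quotation graph from reply links. It is an illustrative toy, not the speakers' implementation: the comment fields are hypothetical, and it conservatively links every fragment of a reply to every fragment of its parent, whereas a real system would match quoted text to decide which parent fragment a reply actually addresses.

    # Hypothetical sketch: a fragment quotation graph built from reply relationships.
    from collections import defaultdict

    comments = [
        {"id": "c1", "parent": None, "fragments": ["f1a", "f1b"]},
        {"id": "c2", "parent": "c1",  "fragments": ["f2a"]},
        {"id": "c3", "parent": "c1",  "fragments": ["f3a", "f3b"]},
    ]

    def build_fragment_quotation_graph(comments):
        by_id = {c["id"]: c for c in comments}
        edges = defaultdict(set)  # fragment -> set of parent fragments it replies to
        for c in comments:
            if c["parent"] is None:
                continue
            parent = by_id[c["parent"]]
            # Without quotation matching, link each reply fragment to every
            # fragment of the parent comment it may be addressing.
            for frag in c["fragments"]:
                for parent_frag in parent["fragments"]:
                    edges[frag].add(parent_frag)
        return edges

    print(build_fragment_quotation_graph(comments))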
So later on, using this fragment quotation graph, we first do topic segmentation,
where we basically cluster the whole conversation into different topical clusters.
In this step we first apply lexical cohesion-based segmentation on each of the
paths of the conversation. We basically treat each path of the conversation as a
separate conversation, and then we run this lexical cohesion-based
segmentation, which is commonly used in other settings such as meeting
corpora, to get the topical segments.
And then once we have those segmentation decisions, we need to consolidate
them for the whole conversation, and for that purpose we use a graph-based
technique. First we represent the segmentation decisions as a graph, where
each node represents a sentence and each edge represents how many times
two sentences actually co-occur in the same segment while running this lexical
cohesion-based segmentation.
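A rough sketch of that consolidation step, under the assumption that each path has already been segmented independently (sentence ids and counts are made up):

    # Hypothetical sketch: sentence graph whose edge weights count how often two
    # sentences fell into the same segment across the per-path segmentations.
    from collections import defaultdict
    from itertools import combinations

    # Each inner list is one segment (a list of sentence ids) from one path.
    path_segmentations = [
        [["s1", "s2"], ["s3"]],          # segmentation of path 1
        [["s1", "s2", "s4"], ["s5"]],    # segmentation of path 2
    ]

    def build_sentence_graph(path_segmentations):
        weights = defaultdict(int)
        for segmentation in path_segmentations:
            for segment in segmentation:
                for u, v in combinations(sorted(segment), 2):
                    weights[(u, v)] += 1
        return weights

    print(build_sentence_graph(path_segmentations))
    # ('s1', 's2') gets weight 2 because the pair co-occurred in both paths.

This weighted graph is what the clustering step described next would operate on.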
Then a normalized-cut clustering technique clusters this graph into a set of
topical clusters, and the next step is to assign representative key phrases to
each topical cluster. We apply a syntactic filter that only picks nouns and
adjectives for each topical cluster, and then we use a co-rank method that,
again, takes advantage of the [indiscernible] structure by boosting keywords that
come from the leading sentences of each topic. The initial sentences of each
topic are more important than the rest of the sentences; that's the intuition
behind the ranking method.
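The syntactic filter and the "boost the leading sentences" intuition can be illustrated with a much simpler frequency-based ranking; this is a stand-in for, not a reproduction of, the co-rank method, and the sentences, POS tags, and boost value are invented:

    # Hypothetical sketch: keep only nouns/adjectives and boost words from the
    # leading sentence(s) of a topical cluster.
    from collections import Counter

    cluster_sentences = [
        [("budget", "NN"), ("cuts", "NNS"), ("are", "VBP"), ("controversial", "JJ")],
        [("the", "DT"), ("new", "JJ"), ("budget", "NN"), ("passed", "VBD")],
        [("people", "NNS"), ("reacted", "VBD"), ("angrily", "RB")],
    ]

    def rank_keyphrases(cluster_sentences, lead_sentences=1, lead_boost=2.0):
        scores = Counter()
        for i, sent in enumerate(cluster_sentences):
            weight = lead_boost if i < lead_sentences else 1.0
            for tok, pos in sent:
                if pos.startswith("NN") or pos.startswith("JJ"):  # syntactic filter
                    scores[tok.lower()] += weight
        return scores.most_common()

    print(rank_keyphrases(cluster_sentences))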
So at the end of this step we have a set of topical clusters, and each topical
cluster has been assigned a set of key phrases. We also did the sentiment
analysis using the SO-CAL system. It's a lexicon-based approach, where we use
this program to find the sentiment orientation and polarity of each sentence, and
then for each comment of the conversation we compute the polarity distribution,
that is, how many sentences actually fall into each of these polarity intervals.
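As a sketch of that last step only (SO-CAL itself is not reimplemented here, and the interval boundaries are illustrative assumptions), per-sentence polarity scores can be binned into a five-interval distribution per comment:

    # Hypothetical sketch: turn per-sentence polarity scores into a five-bin
    # polarity distribution for one comment.
    def polarity_distribution(sentence_scores,
                              bins=((-5, -2), (-2, -0.5), (-0.5, 0.5), (0.5, 2), (2, 5))):
        counts = [0] * len(bins)
        for s in sentence_scores:
            for i, (lo, hi) in enumerate(bins):
                if lo <= s < hi or (i == len(bins) - 1 and s == hi):
                    counts[i] += 1
                    break
        total = sum(counts) or 1
        return [c / total for c in counts]

    # One comment with four sentences:
    print(polarity_distribution([-3.1, -0.2, 0.0, 1.7]))
    # [0.25, 0.0, 0.5, 0.25, 0.0]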
So once we have done that, we design the visualization, and in this process we
actually start with paper prototyping. We do a lot of prototyping trying to
incorporate the principles that we want to promote in the [indiscernible]. For
instance, we want to support multi-faceted exploration based on different facets
such as topics, opinions, and authors, and we also want to use some lightweight
interactive features that help the user carry out these exploration activities.
So this is the final interface that we came up with; I'm going to show a demo
right now. In this interface, in the middle we have the thread overview, which
shows the parent-child relationships between comments and also shows the
sentiment analysis results. Arranged circularly around this thread overview we
have topics and authors, which support exploration using facets. So now I'm
going to show a quick demo of this interface.
So this is how the interface looks. The user can quickly see the structure of the
thread, look at the topics, and understand the evolution of topics over the
conversation. If they're interested in a particular topic, they can hover the mouse
over it, and that highlights the corresponding comments as well as the authors
related to that topic. And then if they're interested in a particular topic, they can
click on it, and that puts some bars on each of these comments.
So then the next step would be that they can go over each comment, and as a
result the system scrolls down to the actual comment on the right side, and they
can then quickly browse through each comment from the overview interface.
Similarly, they can also go through the detailed view, and that also mirrors the
corresponding operations in the overview. As you can see, as the user scrolls
down, the corresponding comment is also highlighted in the overview, which
gives them an idea of where they currently are and what topic they are currently
in.
So that's kind of how the interface looks. I'm going to wrap up with -- we recently
did an informal evaluation, and using this informal evaluation we are trying to
understand how people are actually using this interface and how it helps them.
In the future we're trying to incorporate an interactive feedback loop from the
user, so that they can give some interactive feedback to the actual topic model.
They can say that this topic is actually too broad, so I want to make it more
[indiscernible], versus these topics are too specific, so I have to merge them
together.
So our idea is to give more control back to the user so that they have more
control over the topic model and the text analysis system. That's all.
[applause].
>> Host: Questions?
>>: Sorry. When you did the human evaluation, perhaps if you could go back
one slide, so what were you asking the bloggers to evaluate? Did they have a
particular task that they needed to accomplish using your --
>> Enamul Hoque: Yes. The idea was, instead of giving some very specific task,
which might be more synthetic -- because these tasks are more exploratory in
nature and there's no specific aim the user has before they read the blogs -- we
just asked them to read the blog according to their own needs, reading the
comments that they like. Then after reading the conversation, they would try to
summarize it, and from there we also collected the interaction log data capturing
what kind of interface actions they made, and based on that we analyzed the
output of the evaluation.
>>: Did the bloggers use different ways of getting through this information or did
you see all of them using -- going from the topics --
>> Enamul Hoque: We identified two different strategies for exploration. Some of
the bloggers would use the visual overview; more specifically, they would click
on topics, go to the thread overview, and then click on specific comments and
read them. On the other hand, the other group of users used the topics less
frequently; they would mostly go over the detailed view, very quickly skimming
through the comments, but at the same time they would maintain a situational
awareness of what's going on in the overview. So they would coordinate
between these two views in a way that, while they're reading in the detailed view,
they would understand what topic they are currently reading and what position
they are currently at in the thread overview.
So that allowed them to read some of the comments that are buried near the
end of the conversation. What we identified is that, compared to traditional
systems, where people usually start reading the top few comments, totally get
lost at some point, and quit, using this interface they can find and reach some of
the interesting comments that were buried near the end. So I guess that's
making a difference in terms of where they explore the comments.
>> Host: We have time for one quick question.
>>: What is that bar underneath the user?
>> Enamul Hoque: These bars?
>>: Yes.
>> Enamul Hoque: This is just mirroring the same thing from here. It's
representing the same information for each comment. We define five different
polarity intervals, and it's just shown as a stacked bar. And this stacked bar is
the same as here; we just mirrored the corresponding stacked bar.
>>: Just for that comment?
>> Enamul Hoque: Yeah, just for that comment.
>>: Not whether the user is generally a negative person?
>> Enamul Hoque: Just for that comment. So this is for each comment.
>> Host: Let's thank the speaker again.
[applause]
>> Ramtin Mehdizadeh: Good afternoon, everyone. My name is Ramtin
Mehdizadeh, and I am a graduate student at [indiscernible] Lab. Today I'm going
to talk about graph propagation for paraphrasing out-of-vocabulary words in
SMT. I'm extending this work, which was done by my colleagues and presented
at ACL 2013, for my research.
One of the big challenges in statistical machine translation is out-of-vocabulary
words, which are missing in the training set. This problem becomes severe
when we have a small amount of training data or when the test set is in a
different domain than the training set. Even a noisy translation of OOVs helps
the reordering.
For example, consider this Spanish sentence. If we pass this sentence to an
SMT system trained on the [indiscernible] corpus, we will get a sentence similar
to this output. We show that there are some words here, like this word, that are
not named entities, dates, or numbers, but are still OOV.
We want to use a monolingual corpus on the source side to create a graph to
find paraphrases of OOVs for which we do have translations. Let's look at our
framework. First we use our training set to create a translation model, and using
this translation model we extract the OOVs from our test set. Using our
source-side graph, which I will explain later, we extract some translations for
these OOV words, and then we integrate these translations into our original
translation model.
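The OOV extraction step can be sketched very simply; the function and data here are illustrative assumptions, since the actual pipeline works over a full phrase table rather than a word set:

    # Hypothetical sketch: OOVs are test-set words with no entry on the source
    # side of the phrase table learned from the bilingual training data.
    def find_oovs(test_sentences, phrase_table_source_words):
        oovs = set()
        for sentence in test_sentences:
            for word in sentence.split():
                if word.lower() not in phrase_table_source_words:
                    oovs.add(word.lower())
        return oovs

    phrase_table_source_words = {"el", "gato", "come"}
    print(find_oovs(["el gato come pescado"], phrase_table_source_words))
    # {'pescado'}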
Each node in our graph is a phrase. We create our graph based on the
distributional hypothesis: we use context vectors to create the graph's nodes and
then use paraphrase relationships to connect these nodes in the graph. For
creating such a huge graph, we need to take advantage of Hadoop MapReduce,
and for obtaining translations for these OOVs we use graph propagation; our
graph propagation algorithm here is modified adsorption.
For each node in our graph we consider its distributional profile, and we create
this distributional profile based on context vectors over the whole corpus. We
need an association measure to show how related a word is to the other words
in the vocabulary. For example, consider co-occurrence frequency as the
association measure.
In this example, I want to show how we can create this distributional profile for a
token, which is shown as a red point here. We have some occurrences of this
word in our monolingual text; we consider a fixed-size window and count the
number of words in the neighborhood.
By using point-wise mutual information, we can find the correlation between the
neighborhood words and our node word. A positive value in this formulation
shows that the co-occurrence is higher than the expected value under the
independence assumption.
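A minimal sketch of that profile construction, assuming window-based co-occurrence counts and a rough PMI estimate (the text, window size, and normalization are illustrative, not the system's exact choices):

    # Hypothetical sketch: distributional profile of a target word using a
    # fixed-size window and positive PMI values as the association measure.
    import math
    from collections import Counter

    def distributional_profile(target, tokens, window=6):
        total = len(tokens)
        unigram = Counter(tokens)
        cooc = Counter()
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo, hi = max(0, i - window), min(total, i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    cooc[tokens[j]] += 1
        profile = {}
        for ctx, c in cooc.items():
            # PMI: log of observed co-occurrence over what independence predicts.
            pmi = math.log((c * total) / (unigram[target] * unigram[ctx]))
            if pmi > 0:  # keep only positive associations
                profile[ctx] = pmi
        return profile

    text = "the red car parked near the red house while the cat slept".split()
    print(distributional_profile("red", text, window=3))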
Now for each node we have a distributional profile in a high-dimensional space.
How can we measure the similarity between two nodes? Suppose we have just
two dimensions: dimension one is the word car and dimension two is the word
cat. For two nodes, park and tail, we can use the cosine coefficient to find the
similarity between these two words.
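A worked version of that two-dimensional example, with made-up association values along the "car" and "cat" dimensions:

    # Hypothetical sketch: cosine similarity between two distributional profiles.
    import math

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    park = [2.1, 0.3]   # strongly associated with "car", weakly with "cat"
    tail = [0.2, 1.8]   # strongly associated with "cat"
    print(cosine(park, tail))  # roughly 0.25: low similarity between "park" and "tail"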
In our graph we have some labeled nodes. The labels here correspond to
translations from the phrase table, and we also have some out-of-vocabulary
nodes. Previous work used just these two types of nodes to find translations for
OOVs. In our work, for the first time, we use other unlabeled nodes that did not
appear in our test set. These nodes can act as a bridge to transfer translations
to OOV words that are not connected directly to labeled nodes.
Here's the objective function of modified adsorption. The first term ensures that
for seed nodes we have a distribution over labels as similar as possible to the
original distribution, the initial values; the second term ensures that nodes that
are strongly connected should have similar distributions over labels; and the
third term is a regularizer. It injects prior belief into the graph, as you can see.
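For reference, the three terms just described roughly correspond to the standard modified adsorption objective (Talukdar and Crammer); the notation below is a paraphrase rather than the exact formulation from the slide:

\min_{\hat{Y}} \; \sum_{\ell} \Big[ \mu_1 \sum_{v \in S} \big(\hat{Y}_{v\ell} - Y_{v\ell}\big)^2 \;+\; \mu_2 \sum_{u,v} W_{uv}\, \big(\hat{Y}_{u\ell} - \hat{Y}_{v\ell}\big)^2 \;+\; \mu_3 \sum_{v} \big(\hat{Y}_{v\ell} - R_{v\ell}\big)^2 \Big]

Here Y holds the seed label distributions over the seed set S, W the edge weights, R the prior (regularization) distribution, and the mu's trade off the three terms.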
Now we want to add the new translations into our phrase table. We add a new
feature and set it to 1 for our original phrase-table translations. And -- excuse
me -- then we add our new translations by setting the other features to 1 and
putting the probability obtained from graph propagation into this new feature.
For our experiments, we select the French-English task. Using the French side
of [indiscernible] we construct our graph. We consider two different parts of the
text to check, first, whether our method works with a small amount of parallel
data, and we also consider this training set to see whether we can improve the
results when we have a domain shift.
For the test and dev sets we used WMT 2005, and we showed that if we remove
named entities, dates, and numbers, 3.2 percent of tokens from the
[indiscernible] are still OOVs in Europarl. And as you can see, this value is
higher here because we have a domain shift.
For evaluation, the first type we did was intrinsic evaluation. We don't have any
gold labels for OOVs, so we concatenate the dev set and test set to our training
set, run a word aligner, and then extract all the aligned words on the target side
as a gold standard and add them to a list. Since our gold standard is noisy here,
we use MRR as a natural choice to compare these two lists, and also recall as a
measure.
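A small sketch of the MRR computation over ranked candidate lists; the OOV key and the candidate words other than "approval" are invented for illustration:

    # Hypothetical sketch: Mean Reciprocal Rank of noisy gold translations within
    # the ranked candidate lists produced by graph propagation.
    def mean_reciprocal_rank(candidates_per_oov, gold_per_oov):
        rr_sum = 0.0
        for oov, candidates in candidates_per_oov.items():
            gold = gold_per_oov.get(oov, set())
            rank = next((i + 1 for i, c in enumerate(candidates) if c in gold), None)
            rr_sum += 1.0 / rank if rank else 0.0
        return rr_sum / max(len(candidates_per_oov), 1)

    candidates = {"oov_word_1": ["approval", "adoption", "consent"]}
    gold = {"oov_word_1": {"approval"}}
    print(mean_reciprocal_rank(candidates, gold))  # 1.0, since the gold word is ranked first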
This graph shows the impact of the size of the monolingual text, where 1x
corresponds to using 125k sentences from Europarl. As you can see, there is a
linear correlation between the logarithm of the text size and MRR and recall.
We also consider different types of graphs. The bipartite one just uses labeled
nodes and out-of-vocabulary nodes. The tripartite one [indiscernible] connects
out-of-vocabulary words to unlabeled nodes and then uses them as a bridge to
labeled nodes. And the full graph just connects all nodes together.
As you can see, there is a failure when you are using the full graph. There are
some reasons behind this: a paraphrase of a paraphrase of a paraphrase of an
OOV is not necessarily a good paraphrase for the OOV. That is the reason we
think the full graph is not working. We also consider two types of nodes: first we
use unigrams and then we select bigrams. As you can see, bigrams have a
better MRR.
Here's a real example of our work. For this OOV the gold standard we find is
"approval", and this is the candidate list generated by our graph propagation
method.
But how much can our method affect a real SMT system? For extrinsic results
we consider BLEU scores, and we showed that we can statistically significantly
improve the BLEU score by using this type of graph propagation.
We provide a new method to use monolingual text on the source side for SMT,
and we also showed that we get an improvement when we have a small amount
of bi-text or when we have a domain shift. For the future we want to use metric
learning for constructing the graph, and also locality-sensitive hashing and SMT
integration.
Thanks a lot for your attention.
[applause].
>>: Thank you for your talk. I have two questions, sort of related. The first one is
about the edges in your graph. Are they weighted or are they not? So when you
build a graph --
>> Ramtin Mehdizadeh: Connected or not?
>>: No, are they weighted?
>> Ramtin Mehdizadeh: Yeah, they are weighted. Their weight is based on how
much they can be paraphrased.
>>: Okay.
>> Ramtin Mehdizadeh: And we calculate this by using this coefficient
measure -- let me show you -- coefficient measure between two [indiscernible].
>>: And the second question I have is that at some point in your slides you
mentioned the fixed window that you choose to compute your similarity. And you
said that you fixed it to three, right?
>> Ramtin Mehdizadeh: No, the window size for finding the neighbors is not
fixed to three. We consider six in our experiment.
>>: And did you test it on different window sizes?
>> Ramtin Mehdizadeh: Yeah. Actually, we have the result for that.
>>: Okay.
>> Ramtin Mehdizadeh: I think. Let me -- these are the experiments on different
window sizes and graphs.
>>: Okay. And then you chose six for your final result, right? The result that you
show?
>> Ramtin Mehdizadeh: Here, no. In this experiment we selected four. Yeah.
But later we tried six.
>>: Okay. Thank you.
>> Host: We still have time for questions.
>>: [inaudible] what did you put in the cells? How do you fill the cells of the
vectors?
>> Ramtin Mehdizadeh: Okay. The similarity measure here -- we used point-wise
mutual information, and we used the distributional profile, if I get the question.
The question was that, or --
>>: So you used PMI?
>> Ramtin Mehdizadeh: Yeah, PMI. We experimented with different types of
similarity -- sorry, I should not go there. We experimented with different types of
association measures and similarities, and using point-wise mutual information
as the association measure and the cosine coefficient as the similarity gives the
best result.
>>: Do you have an upper limit on phrase length [inaudible]?
>> Ramtin Mehdizadeh: We tried unigrams and bigrams. Now we are trying to
extend it to longer phrases.
>>: [inaudible]
>> Ramtin Mehdizadeh: Yeah.
>>: How do you decide to expand during the graph construction? How do you
decide to keep expanding the neighbors?
>> Ramtin Mehdizadeh: In our experiments, we considered how much -- when
we expand this, we face some computational problems. So we just chose what
gives good results without needing too much computation.
>>: So what is the criterion you use to decide to keep expanding the graph?
>> Ramtin Mehdizadeh: Sorry? I didn't --
>>: What is the criterion used to expand the graph or these nodes, or stop
expanding, and so forth? Because you can't keep expanding the neighbors.
>> Ramtin Mehdizadeh: We actually consider the five nearest neighbors for each
node, because more than this was not practical -- the graph is going to be very
huge, and graph propagation is not very cheap.
>>: Thank you.
>> Host: Any more questions? We have only one minute left. Okay. If not, let's
thank both speakers.
[applause]