>> Will Lewis: It’s our pleasure to have Kyunghyun Cho here from New York University to give a talk.
Kyunghyun, sorry Kyunghyun, I have a hard time with his name, sorry.
>> Kyunghyun Cho: You said the last one, yes.
>> Will Lewis: The last one was better yeah. When I think about it I say it, right.
>> Kyunghyun Cho: Yeah.
>> Will Lewis: He’s an Assistant Professor at New York University. Last fall he actually got the Google
Faculty Award, which he’ll be using for doing more work on machine translation, of course.
>> Kyunghyun Cho: Yes.
>> Will Lewis: Which is great for the field. Before NYU he was a post-doctoral fellow researcher at the
University of Montreal under Professor Bengio. He received his PhD from Aalto University School of
Science in two thousand fourteen. That’s in Finland. His thesis was entitled Foundations and Advances
in Deep Learning. I give you Kyunghyun. His talk today is Future (Present?) of Machine Translation.
>> Kyunghyun Cho: Alright, thank you Will. Well, thanks for the invitation first. You know, the slide
just says Neural Machine Translation, because I realized while I was making my slides that, yeah,
the Future of Machine Translation sounds a bit too grand. Let’s, you know, tone it down a bit.
Because whenever I give a talk, about half of the people really love what I’m talking about and the
other half really hates it. I thought, okay, let’s go for a more conservative title.
Before we actually begin, one question that I get every time, actually from my collaborators, is: are you
a machine learning researcher, a machine translation researcher, or an artificial intelligence researcher?
I’m like, yeah, but I think they’re all the same thing at the end of the day. Then they start asking me,
you know, oh, why do you think so, and so on.
Before I talk about machine translation, let’s talk about why language is important for artificial
intelligence in general. A few months ago Yann posted on his Facebook, in reply to a comment by David
McAllester, that he’s not sure language is as important as we think. As I was reading the comment I was
like, no Yann, you brought me to NYU to do language research. You’re saying that it’s not important?
No, that’s not that great.
I began to ask myself, okay, why do we want to do natural language processing research, when, you
know, a lot of people are more excited about artificial intelligence, at least in my circle, right?
I started to think about it a bit more, thinking about building artificial intelligence agents. In order to
build an intelligent agent, that agent needs to have a direct perception of the surrounding world, right,
in order to know what’s going on around it. Also, at the same time, that agent lives, right. It survives
over time and then it gains individual experience.
Okay, that’s great. At this level I realized that what we can get is something like a tiger, or lions, or
elephants. They are born and then they live in environments, right. They perceive the environments
and then they age, gaining individual experience.
Now, what does natural language actually get us? Natural language first of all gets us distant and
collective perception. We don’t have to experience or observe everything ourselves, right. I read
books. I hear from people what’s going on. I don’t know, somewhere in Greece they tell me the
economy is falling down. I don’t have to go there and see the stores closing down there. I can actually
know by just listening or reading the news articles.
Also, at the same time, somehow everyone in this room knows about, I don’t know, Greek and Roman
history, although I’m pretty sure none of us experienced or perceived what happened two thousand or
three thousand years ago. We get historical experience. I think those two things that are enabled by
natural language are probably the most important for pushing the boundary of artificial intelligence
research toward actual human-level intelligence.
While I was thinking about this, some people started telling me that, well, you know, isn’t natural
language processing about to be crushed by, I don’t know, deep learning or something. I think Neil
Lawrence said that last year at the ICML Deep Learning workshop. I like Neil Lawrence by the way, a lot.
Then Wojciech Zaremba, who’s a PhD student at NYU, he’s graduating soon and joining OpenAI, said in
a Reddit Ask Me Anything that speech recognition and machine translation between any languages
should be fully solvable in a few years. Since you guys are working on speech recognition and machine
translation between many languages already, you probably disagree. By the way, I know Wojciech
personally and he’s a great guy. I like him a lot as well.
[laughter]
Then as soon as this was posted Yoav Goldberg he posted on his Twitter, oh, the arrogance.
[laughter]
Again I like Yoav and his work a lot.
[laughter]
I have a huge respect for him. In this issue I’m more on the Yoav side. I think there are a lot of things
that we need to do. Okay, back to the translation.
>>: You’re becoming old.
>> Kyunghyun Cho: Sorry?
>>: You’re becoming old.
>> Kyunghyun Cho: Oh, not that old, okay, not…
>>: This is the argument.
>> Kyunghyun Cho: Not Yoav’s level yet, okay.
[laughter]
Maybe in about twelve years, I don’t know, we’ll see. Anyway, back to today’s topic, Neural Machine
Translation. It was the summer of two thousand thirteen. I moved to Montreal. Yoshua Bengio told me,
hey, you know, you’ve been working on probabilistic graphical models and probabilistic neural nets. I
have three topics that you might find interesting. You can choose whatever you want and just do real
research. Because you don’t speak French you don’t even have to teach or anything.
[laughter]
The University of Montreal is a completely French-speaking university. Then the first topic was the
usual, let’s say, probabilistic interpretation of the denoising autoencoder. The second thing was
something else, again just a neural net. The third thing was machine translation.
I was kind of surprised, right. Yoshua Bengio had never done any research on machine translation. I
asked him, machine translation, really? You don’t know machine translation. I don’t know machine
translation. I don’t think anybody in this lab, where there were about forty people back then, knew
about machine translation.
Then Yoshua told me that, yeah, somehow I feel that machine translation can be done with a neural
net only, just like how it’s being done with speech recognition these days. I was like, alright, that sounds
about right, okay. I wanted to believe that, because that’s exactly the kind of topic that I wrote my
thesis on. Then I said, yeah, let’s try machine translation.
Then I started to look at the machine translation literature, you know, googling a lot about translation.
It turned out that machine translation as a problem itself is quite straightforward.
We have some kind of black box, the machine. Then we have a lot of data. Our goal is to use this data
to make this black box take as input a sentence in a source language, let’s say in this case English, and
spit out a translation. It turned out that that is machine translation. Okay, I was like, okay, great.
Then when you put this in a probabilistic way it gets even more elegant. What you want is for your
machine, the black box, to be able to model the conditional distribution over the translation given a
source sentence. Then you can rewrite it as a sum of the log of the translation model and the log of the
language model. The elegance comes from the fact that we use the limited parallel corpora to fit the
translation model, while for the language model we can use an essentially infinite amount of
monolingual corpora.
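A minimal way to write out the decomposition being described here, in the standard noisy-channel form (the notation is mine, not from the slides):

    \hat{y} = \arg\max_{y} \log p(y \mid x) = \arg\max_{y} \bigl[ \log p(x \mid y) + \log p(y) \bigr]

where x is the source sentence, y a candidate translation, p(x | y) the translation model fit on the limited parallel corpora, and p(y) the language model fit on essentially unlimited monolingual text.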
I was like, ah, this is great. This seems like a very elegant problem that we can do a lot of things about.
Then I started reading Philipp Koehn’s Statistical Machine Translation book. Up to chapter two it’s all
about this kind of elegance of statistical machine translation. I really loved it. Chapter three…
[laughter]
Well, I haven’t got to the fun part yet. But, okay, I think chapter three is basic probability and statistics;
I skipped over that. Then chapter four is where the IBM models are introduced. Chapter five is where
the phrase-based models are introduced. From there on I just noticed that in reality it’s actually pretty
messy. You know, at the end of the day nobody really cared about the conditional probability.
[laughter]
I was like, wait, that sounds a bit wrong here. Everyone was using a simple log-linear model. Of course
I work with neural nets, and the first thing we learn in a neural net course or in neural net research is
that the neural net is great because it can do non-linear regression or non-linear classification.
I was like, yeah, I don’t want to deal with the log-linear model too much here. Then there were a number
of feature functions. For the first, let’s say, five of them I could actually understand why you need them.
[laughter]
But from there on, up until, let’s say, some paper with two hundred sparse features, somewhere
between those two I got a bit stuck. Then, you know, I had to think. Okay, I can actually read those
papers, worth ten years of research. Or maybe what I need to do as a neural net researcher is to try to
see it as a completely new problem and try to approach it in our own way.
I started to look at the history a bit. It turned out that using a neural net for machine translation is not
new at all. In fact, already in two thousand six Holger Schwenk from France wrote a paper saying that
we can use a very gigantic feed-forward neural network as a language model, and then use it to
re-score the n-best list from the translation system.
The advantage: you don’t have to touch the existing statistical machine translation system at all. You
just download Moses, follow the instructions, and then you get the n-best list for each source sentence.
Then you use the neural net language model to re-score them, or re-rank them.
But of course the disadvantage is that there is no integration. The neural net language model you use
doesn’t even know what the source sentence is. That made me wonder, how does it even help when
you don’t know what the source sentence is? How can you actually re-rank the target translations?
We thought, okay, this one is a bit too naive. Then in two thousand fourteen Devlin et al., the lead
author is probably here, yes, okay.
[laughter]
They had a paper where they made a translation model using a neural net and then plugged it into the
existing statistical machine translation system. The advantage is that there’s almost no modification to
the existing MT system; of course there are a few. And the integration is now quite deep. You actually
plug the neural net into the existing system and let the decoder use the features or scores you get
from the neural net.
The disadvantage is, okay, increased complexity during decoding, because you now have another
component, and neural nets are generally very expensive to compute. The second question I had was,
how do we actually know that the neural net feature you get is linearly compatible with the other
features? At the end of the day, as I said earlier, the existing phrase-based model uses a log-linear
model, and it assumes that all the features we got are linearly compatible. But how can we actually
check? Do we actually know?
Of course in this paper the amazing improvement in the score kind of empirically shows that yes, it is
possible. But I still couldn’t convince myself that those are all linearly compatible. We decided, okay,
we are a neural net research group, we are neural net researchers. Let’s try to plug a gigantic neural
net between the source sentence and the target sentence without any other extra component.
An obvious advantage is that every single component of this network is tuned specifically toward
maximizing the translation quality. Unlike existing cases where, when you get a feature value, people
think that those features are good for translation quality, but that’s not necessarily guaranteed,
because you don’t tune the feature value itself, only the weight or coefficient in front of it.
Well, there are a lot of disadvantages. First of all, back then we didn’t know whether it was going to
work. That was a huge disadvantage we had. The second thing is that of course the training is now
extremely expensive. But this was okay. We are pretty patient people. Neural net researchers are
probably the most patient researchers in the machine learning community. We just wait a few weeks,
even using the state-of-the-art GPUs.
But the more serious issue I ran into when I started talking to machine translation researchers back
then was that I found a lot of people are, probably subconsciously I hope, allergic to the term neural
net. A lot of people were just, oh yeah, a neural net to do, I don’t know, translation. No, not a lot of
people loved the idea at the end of the day.
But anyway, I’m going to talk about this approach today. You know, I called it Neural Machine
Translation at some point. But of course Neural Machine Translation is just a subset of Statistical
Machine Translation with different mechanics. I actually prefer the name connectionist MT, except that
this term was never really picked up by other people.
The reason why I call it connectionist MT is that already in nineteen ninety-two Hutchins and Somers
talked about this in their textbook, An Introduction to Machine Translation. In one paragraph out of the
very thick book they wrote that, “The relevance of the connectionist model to natural language
processing is clear enough,” “as a psychologically real model of how humans understand and
communicate.” Let’s ignore that last part because I’m not sure about that. But it is clear they were
pointing at a possibility that people have been thinking about already from the early nineties.
It actually was tried by two independent groups in nineteen ninety-seven at the same time, both of
them from Spain, one group from Valencia, the other group from Alicante. They tried exactly the same
thing that [indiscernible] and others at Google tried in two thousand fourteen, already in nineteen
ninety-seven.
The first was called Recursive Hetero-Associative Memory, proposed by Forcada and Neco in nineteen
ninety-seven. I found this paper and, you know, this is figure two if I remember correctly. I was like, my
god, this looks exactly like what we’ve been doing. There’s a recurrent neural net encoder that’s going
to read the source sentence and summarize it into a vector. Then from that vector the decoder neural
net is going to spit out the translation one word at a time.
Similarly, Castano and Casacuberta, in the same year, from Valencia, proposed a model almost identical
to that. But the issue back then was that, “the size of the neural nets required for such applications
and consequently, the learning time can be prohibitive”. Now we know that nowadays that is not true.
It is still quite expensive, but we can actually manage it in, let’s say, two or three weeks.
Then this idea, yes, sir?
>>: In most of the papers that you wrote did you refer to this…
>> Kyunghyun Cho: We do once I found it, yeah we do.
>>: Okay.
>> Kyunghyun Cho: Forcada and Neco’s paper is slightly closer to these recent works. You know, that
paper was cited only three or four times when I discovered it. But now it’s been cited like thirty
times.
[laughter]
>>: Yes, [indiscernible] didn’t know what [indiscernible].
>> Kyunghyun Cho: Yeah, I think you know it’s a win-win situation.
>>: [indiscernible]
>> Kyunghyun Cho: I actually met Forcada at one conference. He was really you know friendly to me,
right.
[laughter]
I don’t think [indiscernible], okay.
>>: [indiscernible]
>> Kyunghyun Cho: Oh, okay, and in two thousand thirteen, which I didn’t really know back then, Nal
Kalchbrenner and Phil Blunsom, now from Oxford, had a paper where they proposed almost the same
thing, except the encoder part was a convolutional neural net. But one thing is that they just didn’t
push it enough, and their results were, let’s say, at best kind of lukewarm.
Let me tell you about what Ilya Sutskever et al. from Google and we have done in two thousand
fourteen. It’s a very simple model, right. It got really simple because we started from scratch. Let’s
forget whatever we know about machine translation. We’re going to just view it as a structured output
prediction problem with a variable length input, a variable length sequence input. We’re going to use a
recurrent neural net.
Before that we have to decide how we are going to represent a source sentence. We thought, okay,
what is the representation with the least amount of prior knowledge about the words or the
characters, or whatever the input is? That is the so-called one-of-K coding, or one-hot vectors. In that
case each word is coded as an all-zero vector except for one element, whose index corresponds to that
word in the vocabulary; we set it to one. The most important property is that every single word is
equally distant from every other word.
Now there is no knowledge built in; we just encode it like this. Then each one of them is going to be
projected into, fancy term, the continuous word space. Of course that term is too fancy. In fact it is
just a matrix multiplication from the left. Then we get a dense real-valued vector for each of the words.
Now we have a sequence of real-valued vectors. That sequence is read by the recurrent neural net.
Each time it’s going to read a new word and update its memory state based on the previous memory
state as well. It reads one word at a time until the end of the sentence. Then we’re going to call that
memory state of the recurrent neural net a summary vector, or the context vector, of the source
sentence.
Then the decoder is exactly the opposite, or the flipped version, of the encoder. Given that summary
vector, we’re going to update the decoder recurrent neural net’s memory state first, based on the
previous memory state, the previously decoded translation word, and the context vector, from which
we compute the word probability, the distribution over all the words in the vocabulary.
When we have a distribution what do we do? We either sample or take the most likely word. We do
that, so we got the next word. Then we do it recursively until we sample the end of sentence symbol.
That’s essentially, that was essentially it.
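A minimal sketch of the simple encoder-decoder just described, using a toy vocabulary and plain tanh recurrences in place of the gated units used in practice; all names and sizes here are illustrative assumptions, not the actual model:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_emb, d_hid = 10, 8, 16                      # toy vocabulary / layer sizes (illustrative)
BOS, EOS = 0, 1

E_src = rng.normal(0.0, 0.1, (V, d_emb))         # source embeddings: one-hot vector times E_src
E_tgt = rng.normal(0.0, 0.1, (V, d_emb))         # target embeddings
W_e, U_e = rng.normal(0.0, 0.1, (d_emb, d_hid)), rng.normal(0.0, 0.1, (d_hid, d_hid))
W_d, U_d = rng.normal(0.0, 0.1, (d_emb + d_hid, d_hid)), rng.normal(0.0, 0.1, (d_hid, d_hid))
W_o = rng.normal(0.0, 0.1, (d_hid, V))           # decoder state -> scores over the vocabulary

def encode(src_ids):
    """Read the source one word at a time; the last memory state is the summary vector."""
    h = np.zeros(d_hid)
    for w in src_ids:
        h = np.tanh(E_src[w] @ W_e + h @ U_e)
    return h

def greedy_decode(context, max_len=20):
    """Update the decoder state from the previous word and the summary, emit the most likely word."""
    h, y, out = context.copy(), BOS, []
    for _ in range(max_len):
        h = np.tanh(np.concatenate([E_tgt[y], context]) @ W_d + h @ U_d)
        p = np.exp(h @ W_o); p /= p.sum()         # distribution over all words in the vocabulary
        y = int(p.argmax())                       # take the most likely word (or sample from p)
        if y == EOS:
            break
        out.append(y)
    return out

print(greedy_decode(encode([3, 1, 4, 1, 5])))     # toy "sentence" of word indices
```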
Of course, we didn’t know how long it would take to train. We actually coded this whole thing up pretty
quickly. By September two thousand thirteen we had all the code there. Then we were pre-processing
the datasets and corpora. We started training the model and, you know, after a day or two it was not
doing anything. Of course later on we learned that, okay, we had to wait two weeks instead of two
days.
But then this model, at the end of the day, kind of started to work. Especially Ilya Sutskever and his
colleagues at Google made it work quite amazingly. It was doing actual English to French translation as
it is. We were surprised because we couldn’t make it work like that even if we waited like two weeks.
That’s why we had to resort to training on phrase pairs instead, and then using it as part of the existing
Moses system, essentially.
>>: When did they start doing this?
>> Kyunghyun Cho: Sorry?
>>: When did Sutskever, et al., start doing this?
>> Kyunghyun Cho: I think it’s the same time. We didn’t know about them doing it.
>>: Okay.
>> Kyunghyun Cho: I think Yoshua might have known. He probably told them that we are trying to do it.
That’s possibility. But Yoshua never told us that Google is doing it. But we really, at least I really didn’t
know that they were working on the same thing.
>>: That’s about two thousand and three, the summer?
>> Kyunghyun Cho: No that’s going to be two thousand fourteen now.
>>: [indiscernible], okay.
>> Kyunghyun Cho: Yeah.
>>: What’s the big difference between, for just the encoder/decoder model without attention, that you
guys fed in the source state every time and they fed it in as the first [indiscernible]. Is that the crucial
difference?
>> Kyunghyun Cho: No that’s not the crucial.
>>: Okay.
>> Kyunghyun Cho: That’s not the crucial. The crucial difference between them and us back then was
that their model was about a hundred times larger.
>>: Okay.
>> Kyunghyun Cho: Yeah, so that was the main difference. Then, you know, of course we learned
about that. Then we thought, okay, maybe we could implement that as well, like why not. We’re going
to parallelize the model and everything.
>>: It’s a hundred times larger in terms of number of parameters or along with data?
>> Kyunghyun Cho: [indiscernible].
>>: The number.
>> Kyunghyun Cho: Yeah, number of parameters. We’re using even the same data as well.
>>: Here you also have these thought vectors over here.
>> Kyunghyun Cho: Essentially, this can be, think of…
>>: Thought vector.
>> Kyunghyun Cho: Thought of as a thought vector.
>>: Okay.
>> Kyunghyun Cho: I don’t know I don’t really like that term you know.
[laughter]
>>: Are you going to speak to why beam search is a good idea? I see that that’s like your last bullet on
the slide.
>> Kyunghyun Cho: Ah, yes the beam search is a good idea because sampling is not a good idea. We
have a distribution…
>>: Those are the only two options like it’s…
>>: [indiscernible].
>> Kyunghyun Cho: Right.
>>: Right, but…
>> Kyunghyun Cho: Well, we can just do greedy search with a beam width of one, right? Our goal here is
not to actually get nice samples that are representative of the distribution. Rather, we are looking for
the maximum a posteriori sample. Beam search is good. If we could do anything better than that, that
would be even better. But so far the most naïve approach is to use beam search. Yes
[indiscernible]?
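Roughly, beam search on top of such a decoder looks like the sketch below; this is illustrative Python, not code from the papers discussed, and step_distribution is an assumed stand-in for one decoder step returning a word distribution and a new state:

```python
import math

def beam_search(step_distribution, init_state, bos, eos, beam_width=5, max_len=50):
    """Keep the beam_width highest log-probability partial translations at each step.

    step_distribution(prev_word, state) is assumed to return (probs, new_state).
    This approximates the maximum a posteriori translation, which is the goal here,
    unlike sampling from the distribution."""
    beams = [([], 0.0, bos, init_state)]               # (words so far, log prob, last word, state)
    finished = []
    for _ in range(max_len):
        candidates = []
        for words, logp, last, state in beams:
            probs, new_state = step_distribution(last, state)
            for w, p in enumerate(probs):
                if p <= 0.0:
                    continue
                cand = (words + [w], logp + math.log(p), w, new_state)
                (finished if w == eos else candidates).append(cand)
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    finished += beams                                   # fall back to unfinished hypotheses
    return max(finished, key=lambda c: c[1])[0]
```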
>>: You said the big difference was the number of parameters, hundred times more parameters.
>> Kyunghyun Cho: Yes.
>>: What is your sense in these models? What’s the true degree of freedom that you have? When you
go a hundred times larger, is it equivalent in some other model to growing like two times? Because it
seems like there are too many parameters in this model, so they’re not really free. They’re not freely
used. What is your sense?
>> Kyunghyun Cho: Okay, so well, there are actually multiple things in the question. Alright, so first of
all I think we are still working with way too small models, even with a hundred times more parameters.
That’s the first answer I have.
The second answer is that, empirically speaking, the more parameters, the easier it gets to train the
model or optimize the training cost function. This is not only me; a lot of people tend to agree that
when we put in more redundancy the training gets easier and easier, probably because of symmetry,
but yeah it’s…
>>: But is there some kind of quantitative measure there of degrees of freedom? Like maybe ten times
more you wouldn’t see any difference. A hundred times you see a difference. Then if you want to get
another measurable effect we need to go another hundred times.
>> Kyunghyun Cho: Right, so there is also an issue that the number of parameters alone is not a good
measure of how large the model is, or of the capacity of the network, right. There are both the number
of parameters and the amount of computation.
For instance, if we increase the depth instead of the width, then we can keep the number of
parameters the same but increase the amount of computation. If we put recurrency there, we keep the
number of parameters the same and we can control the amount of computation.
What we see is that in many cases we can keep the number of parameters exactly the same, but by
increasing the amount of computation we can get better performance on other things. We haven’t
tried it on this one. But in, let’s say, generative modeling of images, by doing the recursive processing
we get better and better. Of course the amount of computation grows almost linearly as we do it.
Yeah, very good question which I don’t have a good answer to, sorry about that. Hopefully I’ll get that
answered in about twenty years, but not today, not today.
>>: Twelve years, twelve years.
>> Kyunghyun Cho: Twelve, twelve sounds good, yes. One thing we noticed is that this is not really a
realistic model. This model here, the simple encoder/decoder model, is not a realistic model of how
translation is done.
Why is that so? I can try to give a lot of empirical results and numbers that we got. But I think it’s best
to answer this by quoting Professor Ray Mooney at the University of Texas, Austin, because it’s just not
a good idea to, “…cram the meaning of a whole… sentence into a single… vector!”
You can think of it yourself as well. Let’s say I’m going to throw a hundred-word sentence at you. Now I
ask you, okay, now translate it. But I’m going to show you the sentence for only five seconds. You’re
going to read it once. Then you won’t have access to it anymore, but you have to write the translation.
Nobody’s going to be able to do that well unless you’re very well trained to do so.
According to studies on human translation, for instance in the case of English to French translation,
what human translators, professional ones, do is translate in a much smaller translation unit, two to
three words at a time. They read the source sentence once. Then they start writing the translation, or
the target sentence, two or three words at a time, going back to the source sentence over and over.
We had one intern, a master’s student intern who’s now a PhD student in Montreal. He came up with
this brilliant idea of incorporating an alignment mechanism into the neural network. We called it an
alignment mechanism in our original paper. But somehow attention is always a fancier word than
alignment, so everyone is calling it an attention mechanism.
The idea is very simple. Instead of encoding the source sentence into a single vector, which is
unrealistic, we’re going to encode it as a set of vectors. Those vectors come from a bidirectional
recurrent neural net. The forward recurrent neural net is going to read left to right; the reversed
recurrent neural net reads from right to left. Then at each location, or each word, we’re going to
concatenate the hidden states of those two recurrent neural nets and call it an annotation vector.
What does this annotation vector represent? It represents the word, let’s say in this case growth, with
respect to the whole sentence. You can view it as a context dependent word representation, or word
vector. Now we have a set of vectors. The issue is that neural nets are not that great if you have a
variable sized input. We have to do something about it.
There comes the attention mechanism. At each time step in the decoder, first let’s see what the
decoder hidden state represents. The decoder hidden state represents what has been translated so far.
In this case the decoder hidden state here is going to represent the economic growth. It knows what
has been translated.
Given what has been translated so far, for each of these annotation vectors we’re going to score how
relevant it is. This attention mechanism, which is nothing but a neural net that gives out a single scalar,
is going to give the relevance score of each word vector for the next word in the translation, given
what has been translated so far.
Those scores we normalize to sum to one. That always gives us a nice interpretation as a probability.
Then based on those scores we take the weighted sum of the annotation vectors and use it as the time
dependent context vector.
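A minimal numpy sketch of that scoring and weighted-sum step; the small scoring network and all sizes here are illustrative assumptions, not the exact parameterization of the original model:

```python
import numpy as np

def attend(annotations, decoder_state, W_a, U_a, v_a):
    """annotations: (T_src, d_ann) vectors from the bidirectional encoder,
    decoder_state: (d_dec,) summary of what has been translated so far.
    Returns normalized relevance weights and the time-dependent context vector."""
    # A small neural net that outputs a single scalar score per source position.
    scores = np.tanh(annotations @ W_a + decoder_state @ U_a) @ v_a      # (T_src,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                             # normalize to sum to one
    context = weights @ annotations                                      # weighted sum of annotations
    return weights, context

# Toy usage with made-up sizes.
rng = np.random.default_rng(0)
T_src, d_ann, d_dec, d_att = 6, 32, 24, 20
ann = rng.normal(size=(T_src, d_ann))
s = rng.normal(size=d_dec)
W_a = rng.normal(size=(d_ann, d_att)); U_a = rng.normal(size=(d_dec, d_att)); v_a = rng.normal(size=d_att)
w, c = attend(ann, s, W_a, U_a, v_a)
print(w.round(3), c.shape)            # weights sum to one, context has shape (32,)
```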
Okay, let’s think about the very extreme case. At this point we need to attend to growth. Then this
attention mechanism is going to look and say, oh, I have translated La here, or the here. Then what is
the next one that I need to translate? It’s going to put a very high score on this growth and very low
scores on all the others.
Then it works as if the next time step is computed solely based on a single word in the source sentence.
You do it over and over again until the end-of-sentence [indiscernible] is sampled or selected. Yes
[indiscernible]?
>>: Why don’t you also have the final state of sentence? Like both this and the previous portion?
>> Kyunghyun Cho: Yes, so I didn’t put it in this graph because this figure gets super cluttered
eventually.
>>: Oh, I understand.
>> Kyunghyun Cho: We initialize the decoder’s recurrent hidden state with the last hidden state of
either the forward or the reverse recurrent neural net.
>>: You could also just have it as an input always…
>> Kyunghyun Cho: We can do that as well. It gets slightly slower.
>>: It would be fair to say that the thought vector approach, the [indiscernible], is a special case of this.
>> Kyunghyun Cho: Yeah.
>>: If you simply make all the weights to be zero except the last one.
>> Kyunghyun Cho: Exactly, yes.
>>: Yeah, okay.
>> Kyunghyun Cho: That’s true. It’s the same encoder/decoder. Just that, you know, we have some
conceptual issue drawing an encoder neural net that gives us a set of vectors instead of a single vector.
But, yeah, it’s the same thing. Yes?
>>: Did you consider taking into account the coverage, like which words you have translated so far,
and reduce the weight for them rather than…
>> Kyunghyun Cho: Right, so I think there is one paper submitted on that to some conference. I don’t
know how it worked out. The thing is that, okay, in principle, if you have the same [indiscernible] of data
and everything, this decoder should be able to consider that as well, right. Because the decoder knows
which of the annotation vectors have been translated, and the decoder hidden state is used to compute
the scores.
It should be possible. But of course we always have a finite amount of data, and our learning
algorithms are always very primitive, so it’s unclear whether that happens. In the case of image
caption generation, which we did with the very same model, what we did was regularize the attention
weights so that the attention weight matrix becomes doubly stochastic.
In this case, if you just let it run, we get a stochastic matrix, because for each target position the sum
over the source words is one. We can regularize it such that the other way around is also approximately
true; well, it’s not exactly, there is some constant. It turned out that that helps when the dataset we
considered was really small. But when the dataset got large enough we didn’t really see any
improvement there. Yes?
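For reference, the kind of penalty being alluded to, in the form used for the image caption model (my notation: \alpha_{t,i} is the attention weight on source position i at target step t):

    \mathcal{L}_{\text{reg}} = \lambda \sum_{i} \Bigl( 1 - \sum_{t} \alpha_{t,i} \Bigr)^{2}

The softmax already guarantees \sum_{i} \alpha_{t,i} = 1 for every target step t; the penalty softly encourages the sums over target steps to be close to one as well, as a regularizer rather than a hard constraint.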
>>: But if you’re translating from a language that has a lot of function words that don’t map into the
target language, that probably wouldn’t even be a good thing, right? Because there are lots of words
that you want to just completely ignore, right?
>> Kyunghyun Cho: Exactly, exactly, so that’s why you know we do not put it as a constraint but as
regularization. Yes?
>>: Over here you have three [indiscernible] Zp minus one, Ut minus one [indiscernible]?
>> Kyunghyun Cho: Yes.
>>: But the input here only two you’re missing…
>> Kyunghyun Cho: Oh, yeah, so…
>>: One of them is…
>> Kyunghyun Cho: Yeah, I should…
>>: Okay.
>> Kyunghyun Cho: Put here as well, yes.
>>: Okay, okay, good.
>> Kyunghyun Cho: Yeah, you know I used this figure for my job talk so I wanted to make it as pretty as
possible and not cluttered.
>>: How important it is to have all three…
>> Kyunghyun Cho: It turned out that this one is not too crucial. Why do we need the actual sample of
the word? It’s because the decoder’s previous hidden state gives the distribution only, right; there is
some uncertainty. Feeding in the previous word essentially resolves that uncertainty. But what happens
is that when we have enough data and we have trained long enough, the distribution itself becomes
very peaky, without much uncertainty. In that case Ut minus one is not too…
>>: Also there are other choices, right; you could have T minus one, T minus two, T minus three. Have
you explored all these kinds of mixtures?
>> Kyunghyun Cho: No, no, I’ll tell you why we haven’t. Okay, two things: we use the long short-term
memory units or the gated recurrent units, and those update gates or forget gates effectively learn to
do so by carrying over the information if it’s needed.
I don’t think that’s going to be necessary. But obviously if you do all those things there may be some
gain. But again, I’m at a university. We don’t always have enough GPUs. It’s difficult to do that kind of
exhaustive exploration, unfortunately.
>>: Do you know whether the Google people play around with that more, because they’re [indiscernible]…
>> Kyunghyun Cho: I’m pretty sure they do, yeah.
[laughter]
Probably they tried almost all of them, yeah.
>>: But these have the same three terms, right…
>> Kyunghyun Cho: Yeah, I think so.
>>: There’s some difference about whether the word being used in the target is wi or wi minus one,
right. There’s some variation here.
>> Kyunghyun Cho: Yeah, so last year at EMNLP the group from Stanford, [indiscernible], actually did a
quite, let’s say, exhaustive search over different types of attention, different types of
parameterization. They got an amazing result, but that amazing result was not that amazing.
I’m not sure whether those were really effective, or, since they tried it on English-German, maybe it
was specific to that language pair. It’s still difficult. I mean, this model was proposed only like a year
ago at best. I think we’re still at the stage of exploration.
Okay, so then we trained the model. We get all these pretty alignments. But of course I speak neither
French nor German, nor English well.
[laughter]
But I do believe those are really good ones. Since then we started using it for a lot of different language
pairs. On English to French WMT’14, this is where we started with the attention-based neural machine
translation. Ilya and others were able to actually get there without attention with a very large neural
translation model. We introduced a large target vocabulary, then replacing out-of-vocabulary words,
and the scores keep going up. You do all these things and you get to state-of-the-art results on
English-French translation. Then many MT researchers told me that, yeah, but English to French is kind
of a solved problem; you don’t want to play around in that field.
We thought, okay, we’ll go to English-German. We started from here, then we did the out-of-vocabulary
replacement and the very large target vocabulary extensions, adding all those things.
This is the paper from Stanford last year. Then you get something better than the phrase-based model.
Actually, I think I made a mistake: I think it was not Buck et al. but Barry Haddow and others, using the
syntactic, you know, phrase-based model.
Then on WMT’15 the improvement is slightly better. Essentially we’ve been taking these things little by
little and adding them into the neural machine translation. Eventually, if you add all those things, you
get something better. In general, in WMT’15 we participated in five language pairs, and it turned out
that they are kind of neck and neck. Some languages are better, some languages are worse.
One thing we learned is that neural-net-only MT has kind of caught up with phrase-based MT. If I
wanted to make this talk less interesting I could actually finish it here, just saying, okay, let’s push
further and beat the phrase-based MT. I could have finished it, right. But that’s not fun. And I even told
you already that I’m going to talk about the future of MT. I shouldn’t finish it here.
Then I started thinking about it a bit. This is not fun. I want to work on machine translation, and I
believe machine translation is far from being solved, because I’m not going to use, let’s say, Google
translate or Bing translate to translate my paper and then send it to my father, right.
No one is going to be able to read that translated version. Then I thought, should I even play this game?
I don’t think this is the real game actually. The real game seems to be this huge thing, and somehow
we’re playing in a very small corner of it.
>>: What is that airplane doing up there?
>>: Yeah…
>>: Yeah…
[laughter]
>> Kyunghyun Cho: There’s, okay, okay.
[laughter]
I think it’s the Arsenal or Chelsea Stadium, right. Airbus, yeah, so it’s Europe, so, yeah they have the
Airbus. But yeah that’s not the actual airplane flying there by the way, for your info, alright.
[laughter]
>>: That’s changing the game.
[laughter]
>> Kyunghyun Cho: Right, yeah, that’s changing it too much. That’s changing it too much. Then I started
thinking about it and looking at how the translation works. Say I am going to translate this movie
review with a translation system.
What I’m going to do is take the first sentence out. I’m going to do the word segmentation,
tokenization, or some kind of punctuation normalization, all those things, to get a sequence of words
or tokens. Then I push it into the machine translation system, which is going to give me a sequence of
words in a target language. That needs to be detokenized and then desegmented, and all those things,
to give me the actual sentence. Then I put it there and say, okay, first sentence translated.
Then I do it for the next sentence, and the next, and so on until the last sentence. Of course there are a
few things you can do along the way, let’s say resolve the coreferences. But usually it’s sentence-wise
translation.
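A rough sketch of that sentence-by-sentence pipeline; the helper names (split_sentences, tokenize, translate_sentence, detokenize) are hypothetical placeholders, not a real API:

```python
def translate_document(document, split_sentences, tokenize, translate_sentence, detokenize):
    """Sentence-wise translation: each source sentence is handled in isolation,
    which is exactly the limitation being pointed out here."""
    output = []
    for sentence in split_sentences(document):
        tokens = tokenize(sentence)               # segmentation / punctuation normalization
        target_tokens = translate_sentence(tokens)
        output.append(detokenize(target_tokens))  # undo the preprocessing on the target side
    return " ".join(output)
```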
What I felt was that, okay, there are three issues. We are doing it at the word level, in a sentence-wise
manner, and it’s always bilingual translation. Even if you want to do multilingual translation, usually
what you do is go through a pivot language.
I was like, yeah, this is not really fun, especially for a neural net. It’s not fun at all. We have actually
learned that word-level translation with the neural net is super not fun. We had to come up with all
those hacks. Of course we always justify them based on importance sampling ideas from MCMC, but
those are all hacks.
We thought, I decided, to actually tackle all these problems within the neural MT system. First let’s talk
about the word-level modeling. People have started thinking about going below the word, starting
from last year. Actually, people were thinking about it from a long time ago, but none of those
attempts were serious, I would say.
Almost simultaneously, Kim et al. from NYU and Ling et al. from CMU proposed to encode each word
using either a recurrent neural net or a convolutional neural net over the character sequence of each
word. They tried it on language modeling, parsing, POS tagging, which I don’t believe is a real task, but
anyway, all those things. They showed that it works quite well for certain languages.
Then the same group, Ling et al. from CMU, put up an arXiv paper about two months ago, to be
reviewed at ICLR, which was rejected I believe, proposing to use the same idea on the decoder side and
do character-level machine translation. I was like, okay, character-level neural machine translation,
that’s great. Except that if you really read the paper, it’s almost impossible to reproduce their results,
because they had to pre-train the character-level recurrent neural net, they had to pre-train the word-
level recurrent neural net, they had to do something else, and so on.
The experiments were, you know, slightly unconvincing. They tried it on English to Portuguese
translation. There is not even a baseline to compare how well they are doing, and it’s not the most
popular language pair, let’s say. Is there anybody from Brazil? Maybe I shouldn’t say that. Anyway, at
this point I still wouldn’t call it character-level machine translation; rather, it’s still word-level machine
translation.
This is the kind of progress-report figure that I drew. We start at the word level: given “I really enjoyed
this film”, you do the tokenization into words and then you do the translation. Then you can do some
kind of slightly cleverer segmentation. You can do morpheme segmentation, or you just use byte-pair
encoding to segment it into character n-grams. Then you get “I really enjoyed this film” in sub-word
units, and you do the usual translation. This was from Sennrich et al. last year. This works amazingly
well, by the way. If you’re training a neural machine translation system, get their code and do the
offline preprocessing. It’s just one preprocessing step and that’s it. It’s really nice. We always use BPE,
by the way.
Then Kim et al. and Ling et al., all these recent works at the character level, what they do is tokenize
first into a sequence of words, and then each token is read by a character-level neural net.
Now, I have to ask, why didn’t they go all the way to nothing, no preprocessing at all? Is there some
kind of problem with that? Probably. I also thought that there must be a problem. We have that
outdated prior that characters are not a nice unit of meaning, that we should go to something bigger
and then do something, right.
It turned out we can actually do that. We didn’t need any preprocessing. We are working on it now. Let
me show you some results. We decided, okay, there are the source and target sides. If we just do it on
both sides it’s difficult to narrow it down when there is a problem. So we decided the source side we’re
going to leave as BPE-based sub-words, but on the target side we’re going to just generate characters,
without any kind of segmentation boundary or anything.
Initially we thought that we would need something special. We spent like one or two weeks sitting
down together with Chung et al., with Xiao. Xiao was actually here, yeah.
>>: [indiscernible]
>> Kyunghyun Cho: Last summer, right, yes. We were sitting down at NYU, in a very crappy office, you
know, compared to here, thinking a lot and trying to make a nice recurrent neural net that can do the
segmentation implicitly on the fly and then generate. We implemented the stuff. But as a baseline we
decided to just put in a two-layer GRU-based recurrent neural net and let it generate the characters.
It turned out that the basic model just works. These are English to German, English to Finnish, English
to Russian, and English to Czech. This is the BPE-to-BPE, which you can view as the word-level modeling,
and this is BPE-to-characters, so the decoder side, the target side, is at the character level. They are
doing either better or comparably, always. Then we didn’t need anything, unfortunately…
>>: But that just works?
>> Kyunghyun Cho: Sorry?
>>: But that just works? There’s not, there’s, you didn’t do anything special to do it in a language model
or something like that?
>> Kyunghyun Cho: No, no, we are not even using the monolingual corpora. By ensembling these, the
numbers get even better, much better than the BPE-to-BPE, and it just works. We’re preparing a paper
for ACL. I’m just worried about what I’m going to say. I’m going to say that, I don’t know, character-
level neural machine translation on the target side just works. I don’t know if the reviewers are going
to buy that.
[laughter]
But turned out that really just works.
>>: Yeah, but basically at the character level you lose the power of the language model, right. Your
character-level language model is very weak compared to word level.
>> Kyunghyun Cho: That’s what I thought, right. That’s what everyone thinks.
>>: How, yeah.
>> Kyunghyun Cho: But turned out it just works.
>>: But, I know that in voice recognition people do the same thing. They lose quite a bit. But if you
use the character, like Chinese characters, then recognition is fine.
>> Kyunghyun Cho: Yeah.
>>: Because of the model, so I’m not sure, you know, because you think they use these LSTMs, that the
memory is short enough. You just memorize…
>> Kyunghyun Cho: Yes, I’m on that side clearly.
>>: Okay.
>> Kyunghyun Cho: I’m from that side so I’m like recurrent neural nets are actually great.
>>: But I mean you’re gaining on the much smaller you know vocabulary essentially, your target side so
all of your Softmax…
>>: You lose all the [indiscernible] power, right, because your character length is so weak.
>> Kyunghyun Cho: Yeah.
>>: But if you’ve got fifteen [indiscernible]…
>> Kyunghyun Cho: Okay, so the thing is that it has attention mechanism, right.
>>: [indiscernible]
>>: [indiscernible] language model, right.
>> Kyunghyun Cho: It has attention mechanism. Then we visualize it.
>>: I see.
>> Kyunghyun Cho: Then it actually aligns almost amazingly.
>>: Okay.
>> Kyunghyun Cho: In the sense that, so one example I saw is that the source side in English was
spa garden.
>>: Okay.
>> Kyunghyun Cho: That apparently translates to Kurpark, does anyone speak German? I think it’s
Kurpark or something. Kur is the spa and Park is the garden. It was generating k-u-r-p-a-r-k.
Then the alignment was perfect, let’s say: spa was aligned to the k-u-r with almost the same weights
and then to park.
>>: When you do the decoding, do you have to, you know, constrain it so it conforms to a dictionary? If
something, you know, generates similar characters that could get you…
>> Kyunghyun Cho: I know, so our prior says that you know we should do that, right.
>>: Yeah…
>> Kyunghyun Cho: I have that prior as well, yes.
[laughter]
I know, so I asked Chung [indiscernible], are you sure you’re not making any mistakes?
[laughter]
Are you sure you’re not cheating subconsciously or unconsciously or something like that? But wait,
okay, someone is raising his hand…
>>: But you kind of started answer my question already you know. You said you don’t want to just say it
just works.
>> Kyunghyun Cho: Yeah.
>>: Like you can show examples of the resulting embeddings. Like you can show that a certain
sequence of letters ends up with a state which is close to some other word that has the same meaning.
Then you can take a word add e-d to it and see what happens. I think from this…
>> Kyunghyun Cho: Right, so…
>>: Do you have examples like that? Have you looked at it?
>> Kyunghyun Cho: No, so the thing is, this actually makes me kind of regret that we didn’t start from
the source side. On the source side that would have been much easier to test, whereas on the target
side we are conditioned on the source sentence, so it’s slightly inconvenient to do a lot of analysis.
But at the same time it is, from the beginning, very difficult to do that in this case. We don’t have
boundaries, right. We don’t have explicit boundaries. It’s all a recurrent neural net. It’s all dependent on
every single character before. It’s difficult to just cut it down and say what that is.
>>: I’m pretty sure that when you look at the output there must be a lot of words which are not legal,
right. Because a few characters that could be wrong.
>> Kyunghyun Cho: I know. I agree, my prior also says so.
>>: Right.
>> Kyunghyun Cho: It turned out that, when we do the decoding using beam search, it rarely happens.
>>: Did you find that it generated new words that are valid, that were not in the training corpora?
>>: Yeah that’s…
>> Kyunghyun Cho: We are still working on it, so doing the analysis. Yeah?
>>: The question I have…
>> Kyunghyun Cho: Okay.
>>: Is about, let me get in here and find it.
>> Kyunghyun Cho: Okay.
>>: At the character level you’re operating at sub-word level which there is some elegance to, I mean
there’s some need to operate it at a less than word level when you’re doing MT.
>> Kyunghyun Cho: Okay.
>>: This is part of the problem with morphologically rich language which allows the alignments we
contend with…
>> Kyunghyun Cho: Right.
>>: Have a degree of morphological richness to them. We’re not capturing that sub-word information
which leads to bad translations in a given context.
>> Kyunghyun Cho: Right.
>>: There’s a certain elegance. It seems like you’re going too far though by going to the character level.
But maybe it’s a proxy for that. Do you have some sense of how accurate or how good the
morphological variants are? For instance, are you getting better word choice in the output with these
character-based models than you do with a word-based model?
>> Kyunghyun Cho: Yeah, so we have to do all those analyses. This is really the latest research. We got
all these numbers a week ago, after running the experiments for about, I don’t know, two months or
so. We’ve been waiting a lot. We’ll do all those analyses.
>>: Okay.
>> Kyunghyun Cho: Yeah.
>>: Is there a reason you prefer characters on the target versus the source?
>> Kyunghyun Cho: We just chose the target first. That was kind of an inconvenient choice, now that I
see it in hindsight. But this kind of thing actually makes me realize that we can probably do it on the
source side as well. I’m pretty sure the source and target sides have very different properties. We
wanted to make sure that we work on one, let’s say, [indiscernible] at a time, right. Usually people just
throw the whole huge thing at it and then say, hey, we solved it, right. I’m trying to avoid that in this
case.
>>: What is the…
>>: [indiscernible]
>>: What’s the vocabulary size actually? I think the character level would be very useful for smaller ones.
>> Kyunghyun Cho: In this case we are using a three-hundred-and-eighty-character vocabulary on the
target side.
>>: No, I mean the word vocabularies, if you just chose word-based.
>> Kyunghyun Cho: Oh, so if we just use, you know, really just the blank space after simple
tokenization, then for, like, Finnish it just goes, you know, through the roof.
>>: Would it be like sixty for most of the languages?
>> Kyunghyun Cho: For like I think German if we use million then we can cover like ninety point nine.
>>: No, I mean like for the character level isn’t it like sixty or whatever the number of characters?
>> Kyunghyun Cho: No, no there are all those like the weird symbols from web crawl, yeah.
>>: Oh, okay.
>> Kyunghyun Cho: It took like three hundred and eighty to cover ninety-nine point nine nine.
>>: But it’s still just like the, it’s still just the [indiscernible].
>> Kyunghyun Cho: Yeah.
>>: The implication of this is, I saw many papers dealing with the large vocabulary problem. With this it
seems that you don’t need those methods.
>> Kyunghyun Cho: Yeah, I know. We had a paper on that last year…
[laughter]
>>: Do you think, are you considering using convolutional nets there on the target, similar to [indiscernible]…
>> Kyunghyun Cho: On the source actually. One of the students at NYU is currently working on that for
translation, and indeed it does help on German to English translation. But the experiments are on a
way too incredibly small scale. We need to actually work on it more to make it more concrete.
>>: You do have language model for word to word, right?
>> Kyunghyun Cho: No.
[laughter]
No, we didn’t even touch the monolingual.
>>: No, so that [indiscernible] for the word to word, the [indiscernible] public source of the information
is [indiscernible] use it for that. That of course I can see that [indiscernible] each other but ideally…
>> Kyunghyun Cho: Oh, no, no, no actually, yeah.
>>: Language model.
>> Kyunghyun Cho: We tried to add a language model on top of that. But the improvement we get by
adding, let’s say, a recurrent neural net language model is not that large, actually.
>>: What is like the state-of-the-art for English-German?
>> Kyunghyun Cho: Actually let me go back.
>>: [indiscernible]
>> Kyunghyun Cho: The state-of-the-art as in like we got a list of anything.
>>: Was it like around twenty-five, right, or…
>> Kyunghyun Cho: Yeah, I think I had a number here. English to German…
>>: Okay.
>>: [indiscernible]
>>: Oh, I see.
>>: That’s amazing.
>>: Question on just wall time.
>> Kyunghyun Cho: Okay.
>>: You kind of have a couple effects going on, right. One is that your vocabulary size on the target side
gets much, much smaller, so you get more efficient. But do a lot more operations.
>> Kyunghyun Cho: Yeah.
>>: Because you’re doing each character, right…
>> Kyunghyun Cho: It’s about two point five times slower, yeah
>>: Okay.
>> Kyunghyun Cho: That’s an unfortunate side effect, you know, but as I said earlier we are very
patient people.
>>: What’s a small [indiscernible]? Your number is smaller with a character…
>> Kyunghyun Cho: Yeah, but the length is now about [indiscernible]…
>>: Oh, the length, I see…
>> Kyunghyun Cho: Five to seven times longer depending on languages. Yes?
>>: What’s the last language there, so I’m not a…
>> Kyunghyun Cho: Oh, Czech.
>>: Czech.
>>: Czech.
>> Kyunghyun Cho: Yeah, so English…
>>: Do you try to, do people try to do this kind of thing for Chinese?
>> Kyunghyun Cho: Yeah, actually one of the students is Chinese at NYU.
>>: Oh, okay.
>> Kyunghyun Cho: Chinese seems to be slightly easier to do, at least on the source side. When we tried
it last year with just characters it was already doing okay. Then we…
>>: Chinese is easy.
[laughter]
I mean for the character [indiscernible].
>> Kyunghyun Cho: Right, right.
>>: [indiscernible] reasonably strong, right.
>> Kyunghyun Cho: Right, yes.
>>: But [indiscernible] include it for English.
>> Kyunghyun Cho: Yeah, that’s true.
>>: You have to go to strokes for…
>>: No not by strokes…
>> Kyunghyun Cho: Yeah, so…
>>: Like just characters…
>> Kyunghyun Cho: Yeah.
>>: The speech people use the…
>> Kyunghyun Cho: No, we want to go to the stroke or the sub-character level. Here you go.
>>: Do you do strokes…
[laughter]
>>: No, no.
>>: He had a slide ready for it.
[laughter]
>> Kyunghyun Cho: Yes, yes, thank you.
[laughter]
>>: Good timing Anthony.
>> Kyunghyun Cho: That makes everyone wonder, what is the ultimate level, right? For images, thanks
to the convolutional neural net, we went all the way down to the pixel intensities, right. We work at the
pixel level in images. Nobody works at, I don’t know, applying SIFT or Gabor filters first; nobody does
that anymore.
In language, how far can we go down? Some say bytes, so there was a paper uploaded to arXiv from
Google saying that they do POS tagging based on the Unicode bytes. It turned out it works okay, good.
Then they trained one model for, was it thirty-two languages, and it worked. This is their model.
Then I started to wonder, is Unicode really the ultimate level? I don’t think so, because for instance in
Korean we have consonants and vowels, about forty-four of them. Then we get one syllable by
combining one consonant and one vowel, and optionally one more consonant. Like this, that’s the
first syllable of my first name.
Because of this combinatorial property we actually get more syllables, or characters, in the Unicode
space than words. It just doesn’t make any sense. Can we essentially decompose it into sub-character
level symbols and then work on that?
Chinese is similar: you can divide a character into a radical and then, you know, the remainder, and the
radicals are still Chinese characters. We are working on that now for document classification. The
vocabulary size shrinks by about, I don't know, threefold, but the length grows by about two times. I
think we can actually manage that.
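As a concrete illustration of the Korean case (a minimal sketch using Python's standard Unicode normalization; the exact preprocessing pipeline used in the work is not specified in the talk), canonical decomposition already splits each precomposed Hangul syllable into its consonant and vowel jamo:

```python
import unicodedata

def to_jamo(text):
    # NFD (canonical decomposition) maps each precomposed Hangul syllable
    # (U+AC00-U+D7A3) to its leading consonant, vowel, and optional trailing
    # consonant jamo, i.e. the sub-character symbols discussed above.
    return unicodedata.normalize("NFD", text)

word = "경현"          # two precomposed syllables, "Kyung" + "hyun"
jamo = to_jamo(word)
print(len(word), len(jamo))          # 2 syllables -> 6 jamo
print([hex(ord(c)) for c in jamo])   # the individual jamo code points
```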
>>: Are those…
>>: How do you…
>>: [indiscernible]…
>>: [indiscernible] decode in small pieces. Can they be all together?
>> Kyunghyun Cho: Sorry?
>>: I mean the order, so you get three [indiscernible], right. You put them together. How do you code
them?
>> Kyunghyun Cho: Oh, so for Korean the locations are all fixed. We can actually just, you know, put it
as a list.
>>: Oh, okay.
>> Kyunghyun Cho: Chinese is a bit, yeah is…
>>: Yeah it’s [indiscernible]…
>> Kyunghyun Cho: Problematic, yes. Yeah [indiscernible]?
>>: Those two words have the same meaning?
>> Kyunghyun Cho: This one is Kyung. This is hyun. It’s two different syllables.
>>: Okay.
>>: Then six.
>> Kyunghyun Cho: These don’t have any meaning by the way.
>>: Okay.
>>: Oh.
>> Kyunghyun Cho: It’s just a syllable. Alright, so actually this was only the first one, right. I still have
some time, right, or do I?
>>: Yeah, you’re okay.
>> Will Lewis: You’re fine, you have half an hour.
>> Kyunghyun Cho: The second thing: we’ve got to go beyond sentences, I think. Actually on this part I
don’t have any results for translation, but let me tell you about the language model first. We asked,
how much can we gain by having a larger context when we do language modeling? When we model the
probability of the current sentence, will it help to condition the language model on the previous n
sentences?

The answer, even without running any experiments, is yes, it’s not going to hurt. In the worst case
you’re going to have zero weights on the connections to the previous sentences, and then it will just do
the usual language modeling.
But now, if it helps, which seems to be true, then how will it help? Will it help by, I don't know,
capturing the author's writing style? Will it help by some magical thing? It turned out it actually helps by
giving the language model a narrower set of vocabulary to choose the word from. It helps the language
model narrow down the possible word set.
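In symbols, the conditioning being described is roughly the following (a sketch only; the exact parameterization is not given in the talk):

p(S_t \mid S_{t-n}, \ldots, S_{t-1}) = \prod_{i=1}^{|S_t|} p(w_i \mid w_{<i}, c(S_{t-n}, \ldots, S_{t-1}))

where S_t is the current sentence with words w_1, \ldots, w_{|S_t|} and c(\cdot) is some learned representation of the previous n sentences. If the weights feeding c into the network go to zero, this reduces to the usual sentence-level recurrent language model, which is the worst case just mentioned.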
We trained the model on a bunch of small corpora; that is the reason why this paper was rejected. We
looked at how the perplexity changes as we increased the number of context sentences. We see that
the adverbs, nouns, adjectives, you know, verbs, all those open-class words become more and more
predictable.
In other words, these larger contexts help us get a better sense of the topic of the current document.

Now, why is that important? I think the importance will show up in the dialog translation setting. By
knowing what kind of topics or themes the dialog is about, the translation model is going to have a
much easier time narrowing down, or putting more probability mass on, the likely words. That naturally
leads to a better translation, or let's say a more natural translation.
>>: What is…
>>: Why does the, sorry go ahead.
>>: What about the perplexity of the coordinating conjunctions and determiners going up?
>> Kyunghyun Cho: Yeah, that’s a good point, right…
>>: There’s something that’s going off the chart.
>> Kyunghyun Cho: Yeah, so that’s actually an artifact of showing a percentage, and that’s one of the
reasons it’s going off the chart. These functional words turned out to actually get slightly worse. But my
conjecture, which we are testing at the moment, is that it’s because we fixed the size of the model even
as we add more and more context sentences.

I think essentially the open-class words become more predictable, but the model has a fixed size, so a
fixed capacity, so it kind of sacrifices the functional words' perplexity, their predictability. But these
changes are actually really small; it only looks big because I plotted it as a proportion, so it's an artifact,
yeah, it's a very…
>>: The overall perplexity gets better the way you currently evaluate it?
>> Kyunghyun Cho: Yeah, overall it gets much better. Yes?
>>: I’m wondering how this context is actually expressed in these networks. If you’re looking further
and further back in time, or in the text, do you really need a very complex model of context? Or does it
actually reduce to just a bag of words?
>> Kyunghyun Cho: Yeah, that’s actually a super good point. Of course we love these attention
mechanisms, so what we initially tried was to say, okay, we're going to encode each sentence as a bag of
words. Then we're going to run a recurrent neural net on top of that sequence of bags of words, and
feed it to, let's say, the recurrent neural net language model. Then we add the attention model on top of
that.

The best choice is to run a bidirectional recurrent neural net over the context sentences and then do the
attention on top of that. That's best. But the difference is minimal compared to just having a bag of
words of every word that came before. That is fastest and, you know, close to the best model.
>>: You just sum all the, you average all the embeddings of all the previous words?
>> Kyunghyun Cho: Yeah, we learn everything. We learned everything from scratch.
>>: I know it’s all jointly learned. But in terms of the modeling it’s just the average of the embeddings
of all the previous words…
>> Kyunghyun Cho: Yeah, yeah, yeah, exactly.
>>: I thought you were using the attention method. You just use the sequence of word embeddings and then
the attention picks whatever the [indiscernible].
>> Kyunghyun Cho: Right, right, right, yeah. We can do that, yes, definitely. But it turned out that
simply just you know having a bag of every single previous word worked really well.
>>: But and balanced [indiscernible] you used every single or previous words but you still use attention?
>> Kyunghyun Cho: No.
>>: No.
>> Kyunghyun Cho: Not after that. It’s just going to be the sum of the word vectors…
>>: Even within the current sentence?
>> Kyunghyun Cho: No, over the previous sentences only. The current sentence is just [indiscernible].
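A minimal sketch of the setup just described, in PyTorch (illustrative only; the layer types, sizes, and names are assumptions, not the authors' code). The context is the mean of the embeddings of all words in the previous sentences, appended to every input of an ordinary recurrent language model:

```python
import torch
import torch.nn as nn

class ContextLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # each step sees the current word embedding plus a fixed context vector
        self.rnn = nn.LSTM(emb_dim * 2, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, current_ids, context_ids):
        # "bag of every previous word": mean of the embeddings of all words
        # in the previous n sentences, with no attention over the context
        ctx = self.emb(context_ids).mean(dim=1)             # (batch, emb_dim)
        cur = self.emb(current_ids)                         # (batch, T, emb_dim)
        ctx_rep = ctx.unsqueeze(1).expand(-1, cur.size(1), -1)
        h, _ = self.rnn(torch.cat([cur, ctx_rep], dim=-1))  # (batch, T, hid_dim)
        return self.out(h)                                  # next-word logits
```

If the model learns to ignore ctx, this falls back to a plain recurrent language model, which is the worst case mentioned earlier.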
>>: I mean the most obvious way is just to not clear your LSTM state every time, right?
>> Kyunghyun Cho: Yeah, yeah.
>>: Was that the base, like was that the first thing that you guys tried was just running it where you
know you used…
>> Kyunghyun Cho: No, we didn’t do that. We actually cut it down into sentences first, because we
wanted to use it with other existing MT systems and speech recognition to test it out.
>>: Right, cool.
>>: How does this…
>>: [indiscernible]…
>>: You said earlier that [indiscernible] language model doesn’t help…
>> Kyunghyun Cho: No it helps but the…
>>: They’re very small…
>> Kyunghyun Cho: The improvement is small. But that’s kind of understandable. We had this idea that
adding a language model is going to give us amazing help, right.
>>: Yeah.
>> Kyunghyun Cho: Especially since that has been the case with the machine translation systems we
have run it on.
>>: They also work?
>> Kyunghyun Cho: I think the reason is that the translation model’s decoder was just too weak.
>>: I see.
>> Kyunghyun Cho: You get a lot of gibberish if you don’t have any language model and you decode
straight out of the translation model. There, having a language model helps amazingly, because you can
filter out a lot of wrong things, whereas these neural net models actually do the language modeling
implicitly in the decoder recurrent neural net.
>>: [indiscernible]
>> Kyunghyun Cho: The improvement we get is not that dramatic.
>>: [indiscernible] is very small. Why do you spend all the effort to focus on [indiscernible] here?
>> Kyunghyun Cho: Oh, this. This is a step toward doing larger-context translation.
>>: Oh, I see, okay at the sentence level.
>>: How do you train this? What are your examples?
>> Kyunghyun Cho: These, previous n sentences.
>>: Okay.
>> Kyunghyun Cho: And current sentence. That’s going to be one example.
>>: How many sentence context are you using or that was in the…
>> Kyunghyun Cho: Oh, yeah we tried with zero to let’s say eight. Then after four everything kind of
like, yeah, plateaus.
Okay, so the last thing: I think neural machine translation enables us to do very natural multilingual
translation. Because it’s just a neural net, right. You plug in a lot of different input models, you move
things around, and then you plug in a lot of output models. Then you get multilingual translation.
Of course, then, why do we want to do that? That’s a very good question that I don’t have an answer to.
I even started reading the second-language learning literature for humans. That field is kind of split
fifty-fifty. Some say that learning more languages helps you get better language ability. Others say that
in fact learning a second language when you’re young deteriorates your mother tongue.
I was like, okay, which one is true? For humans we cannot really do controlled experiments in that case;
you have to raise a kid for like twenty years, and it’s not that easy. But for machines, especially for
neural nets, we can do it.
>>: When we do five different languages that’s probably…
>> Kyunghyun Cho: Yeah, exactly, teaching it is also very expensive as well.
[laughter]
Then, you know, people have actually thought about this already, from last year. There was a paper on
doing multilingual translation with the attention mechanism, presented at ACL two thousand fifteen by
[indiscernible] et al., from [indiscernible] or [indiscernible], [indiscernible] I think.
What they did was, because this attention or alignment looks like it’s very language-pair specific, right,
they said okay, we’re going to start from English and go to multiple languages, and we’re going to put in
a separate attention mechanism for each target language.
Then afterward, [indiscernible] et al. from Stanford and Google submitted a paper to ICLR on doing
multi-way, multilingual translation, and also multi-task learning, as they did parsing as well. But in that
case they removed the attention model and fell back to the basic encoder/decoder model, because that
becomes much easier, right. You just map anything into a vector and then decode it out from there.
But at the same time, my collaborators and I had started working on this fairly early, from March last
year or even before. The thing is that this attention or alignment model was the problematic one.
The first conceptual issue was: is this attention mechanism universal? Can we use a single attention
mechanism for different language pairs? That's what we want. At least that's what I wanted, because I
want this multilingual neural machine translation system's size, the number of parameters, to grow only
linearly with respect to the number of languages.

But as soon as this attention or alignment model becomes specific to a language pair, the number of
parameters grows quadratically. I just don't like that. Yeah?
>>: Can you, in this task, is this the task where you assume that you have N-way parallel sentences, the
same sentence?
>> Kyunghyun Cho: No.
>>: Okay.
>> Kyunghyun Cho: No, we want to, it would be great to have that assumption.
>>: Yeah.
>> Kyunghyun Cho: Except, you know, we don’t have too much of that, right. We just assume that we
have bilingual parallel corpora.
>>: Okay.
>> Kyunghyun Cho: Then I asked around quite a lot: do you think we can actually share one attention
mechanism for different languages? Every single one of them told me no, that doesn’t make any sense.
Here come, again, the priors that we have, right. My prior was also telling me that, yeah, that's probably
not going to work.
Then, how am I going to do it? But, you know, we are deep learning researchers; we are very stubborn
and patient, and able to train a gigantic model while drinking beer. We thought, let's do it, let's try that.
Then we were pretty frustrated for some time, because others were publishing papers on multilingual
translation by simply avoiding this problem.
Ah, I was like, oh, maybe we should have done that. But at the end of the day it turned out that it is
possible. We trained a single model, or it's kind of a single gigantic model where only a subset is going to
be used at a time. It has one, two, three, four, five, six, actually yeah, six recurrent neural net encoders
and six recurrent neural net decoders, and then a single attention mechanism.
Then we trained on every single parallel corpus from WMT fifteen. We had the parallel corpora going
from English to all the other languages and then all the other languages to English. Then we trained this
model for about three, four weeks. Yeah, it was very difficult.
Now at this level, what happens is that the model does not fit on a single GPU, so you just have to
parallelize the model over multiple GPUs.
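A rough structural sketch of the multi-way model being described (my own naming and simplifications in PyTorch, not the authors' code): one encoder and one decoder per language, with a single attention module shared by every language pair, so the parameter count grows roughly linearly with the number of languages rather than quadratically:

```python
import torch.nn as nn

class MultiWayNMT(nn.Module):
    def __init__(self, languages, vocab_sizes, emb=512, hid=1024):
        super().__init__()
        # one embedding + encoder and one embedding + decoder per language
        self.src_emb = nn.ModuleDict({l: nn.Embedding(vocab_sizes[l], emb) for l in languages})
        self.tgt_emb = nn.ModuleDict({l: nn.Embedding(vocab_sizes[l], emb) for l in languages})
        self.encoders = nn.ModuleDict(
            {l: nn.GRU(emb, hid, bidirectional=True, batch_first=True) for l in languages})
        self.decoders = nn.ModuleDict(
            {l: nn.GRUCell(emb + 2 * hid, hid) for l in languages})
        # the single attention mechanism shared across all source/target pairs:
        # it scores a decoder state against bidirectional encoder states
        self.shared_attn = nn.Sequential(
            nn.Linear(hid + 2 * hid, hid), nn.Tanh(), nn.Linear(hid, 1))

    def num_parameters(self):
        return sum(p.numel() for p in self.parameters())

# e.g. English plus the other WMT'15 languages; vocabulary sizes are placeholders
langs = ["en", "fr", "de", "cs", "ru", "fi"]
model = MultiWayNMT(langs, {l: 30000 for l in langs})
print(model.num_parameters())
```

Only the encoder of the current source language and the decoder of the current target language are active for any given sentence pair; the shared attention sits between them.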
>>: But [indiscernible], right?
>> Kyunghyun Cho: Yeah, everything. Not the monolingual…
>>: [indiscernible] or did you merge them together to have a unified, you know, internal representation
of all the languages and then do that?
>> Kyunghyun Cho: Yeah, so we don’t.
>>: I see.
>> Kyunghyun Cho: We don’t have one internal representation, because, let’s say, we have two sentences
in English and Czech. The lengths are different, so we get a different number of vectors for each
sentence. Now, what is unified is this mechanism.
>>: Okay.
>> Kyunghyun Cho: We do not try to find a common vector space. Instead we found a common
mechanism that connects across different modalities.
>>: But the attention only accounts for like point one percent of the parameters anyway. If everything
else is not shared, what’s the intuition that this is a good thing to share?
>> Kyunghyun Cho: Yeah, I’ve just been, you know, stubborn. Actually, this is just one step. The next
step we’re looking at is, once we train this model, can we actually go for zero-resource translation? In
that case sharing the attention becomes very important. We haven’t tested it yet. We are trying to test
it, but we lack GPUs and time.
>>: Zero resource means going from a language pair that you don’t have?
>> Kyunghyun Cho: Yeah, no, yeah, exactly.
>>: English to Finnish or something.
>> Kyunghyun Cho: Yeah, let’s say German to Russian directly without…
>>: German, yeah German to…
>> Kyunghyun Cho: Yeah, or anything.
>>: The reason why you don’t share the representation here is because the lengths are different?
>> Kyunghyun Cho: Yeah, we don’t know how to do it yet…
>>: [indiscernible] do it naturally [indiscernible] at the top. We just use the neural net [indiscernible]
then it cross over. Then HMM takes care of [indiscernible] length…
>> Kyunghyun Cho: We can do that but I’m not a big fan of HMM, sorry.
>>: Okay.
>> Kyunghyun Cho: But, okay. We trained this model on these, let’s say, ten language-pair directions.
They are doing okay, you know, comparable to having, let's say, ten single-pair translation models. We
looked at both the log likelihood and BLEU, and on certain language pairs and directions the
multilingual model is doing better, sometimes a bit worse. But one important thing is that in fact they
are doing comparably with a substantially smaller number of parameters, while sharing the single
attention model across the ten different language-pair directions.
That’s the important thing. But at the same time we thought there must be somewhere, you know, it
works slightly better. That case is when we have a very small, low-resource language among the many
language pairs. What we get is that this multilingual model does better than, let's say, adding in more
monolingual corpora to compensate for the lack of data. But the improvement, as you can see, is not
that great.
>>: Single, what single plus [indiscernible]?
>> Kyunghyun Cho: That’s the single model plus the deep fusion of a recurrent language model with the
translation model. We add in the same amount of monolingual English; well, this is English to Finnish,
so that must be a typo, it's the target-side monolingual corpora. That makes it comparable to the
multilingual model, and then we looked at how it works. We generally get better generalization with the
multilingual model. But with this kind of low-resource language translation I think we are at the very
beginning of tackling it.
>>: What are the two numbers?
>> Kyunghyun Cho: The first one is for the development set; second one is for the test set.
>>: Yes, so [indiscernible] have very different alignment, right?
>> Kyunghyun Cho: Yeah.
>>: What can you…
>> Kyunghyun Cho: I know.
>>: Maybe all the six languages that you talk about have single kind of [indiscernible]…
>> Kyunghyun Cho: Ah, no, no.
>>: Japanese there is probably the score…
>> Kyunghyun Cho: Yeah, but already because of Finnish everything [indiscernible].
>>: Okay.
[laughter]
>> Kyunghyun Cho: Yeah, and you know I still didn’t fill in the Finnish.
[laughter]
Alright, so the, I believe the new kind of territory of machine learning, oh okay.
>>: Sorry, one last question. You mentioned about zero resource. You’re talking about like
[indiscernible] to Finnish [indiscernible] that you don’t have?
>> Kyunghyun Cho: Yeah.
>>: What if you added a new language that you don’t have. Do you need some parallel data or can you
add a language with just monolingual data?
>> Kyunghyun Cho: That I do not have any answer to yet. But one possibility: recently from Edinburgh,
again Rico [indiscernible] and others there showed one easy way to incorporate a language model, or
monolingual corpora, into this neural machine translation model: you just translate the monolingual
corpora with another translation model.
You make a pseudo-bilingual corpus and mix it in with the original bilingual corpora, and apparently
that helps make the decoder's language model better. They show quite an improvement. When we have,
let's say, a new language, I think we can do something similar to that: we're going to go through another
language to get a few bilingual parallel corpora, and then just fine-tune it a little.
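A minimal sketch of the idea just described, often called back-translation (hypothetical helper names, not the Edinburgh implementation): a reverse-direction model translates target-side monolingual sentences back into the source language, and the resulting pseudo-parallel pairs are mixed with the real bilingual data before training continues:

```python
import random

def make_pseudo_parallel(monolingual_tgt, reverse_model):
    # reverse_model.translate is assumed to map a target-language sentence
    # back into the source language (a model trained in the reverse direction)
    return [(reverse_model.translate(t), t) for t in monolingual_tgt]

def mixed_training_data(bilingual_pairs, monolingual_tgt, reverse_model):
    # mix real and synthetic (source, target) pairs and shuffle them together
    data = list(bilingual_pairs) + make_pseudo_parallel(monolingual_tgt, reverse_model)
    random.shuffle(data)
    return data
```

The pseudo-pairs have noisy source sides but clean target sides, which is consistent with the point above that this mainly helps the decoder's language model.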
This is like the very latest work. We just got the acceptance notification yesterday from [indiscernible].
We are still trying to work on it. Yeah, this is like super new, yeah.
>>: Is the size of the attention model similar to the size of the usual [indiscernible] monolingual…
>> Kyunghyun Cho: Same size, same size. We couldn’t afford to actually do a lot of, let’s say,
hyper-parameter search. We just trained, let's say…
>>: Did you use dropout?
>> Kyunghyun Cho: Sorry?
>>: Did you use dropout?
>> Kyunghyun Cho: We didn’t use dropout.
>>: Do you think it will help because you need…
>> Kyunghyun Cho: Well, when we used dropout in the attention and on the target side, let's say the
output, it helped for the very small corpora. But in the WMT setting we didn't really see much
difference. Yes, [indiscernible]?
>>: It’s a high-level point or question. It’s related to the two extremes you’re talking about. One is
encoding an entire sentence, or maybe multiple sentences, into one vector, which is problematic because
you don't have structure in those vectors; you can't analyze it. Maybe it's embodying all kinds of things,
but you would like to have some structure there.
>> Kyunghyun Cho: Yeah.
>>: The other one is to mimic the one-dimensional structure of the sentence, with the alignment
deployed in a one-dimensional way. However, that one-dimensional structure of speech is an artifact of
our physical embodiment.
>> Kyunghyun Cho: Yeah, and universe.
>>: If you had mind-to-mind you wouldn’t go through it; you would just train the network directly mind
to mind, right. Then what is the real intermediate representation? Why is it fundamental? Should you
really be working on a one-dimensional representation? Or should that intermediate representation be
different?
>> Kyunghyun Cho: No, I think the reason for the one-dimensional thing is that it’s language. As you
said, language is not something that somebody designed from the beginning, right. It has evolved as
humans evolved, and it has adapted to, essentially, the universe, right. Time flows in one dimension, and
we have only a single vocal cord and, well, actually we have two ears, okay, that's slightly different, but
okay, right.
[laughter]
In language-related tasks, then, my belief, okay, there is no theoretical foundation there whatsoever, is
that this one-dimensional thing is very natural, because many of the intellectual activities we do are
based on language as well.
Of course, even when we are connected over the internet, where the bandwidth is amazing and we send
a lot of things, eventually when we consume the information transferred via the internet or whatever, it
becomes very one-dimensional, because that's the only way we can do it.
>>: [indiscernible]
>>: It’s a matter of dimension. That’s why…
>> Kyunghyun Cho: Yeah, it’s slightly less one-dimensional. But, you know, eventually we read it in…
>>: I see.
>>: It’s a bottleneck. That one dimension is a bottleneck.
>> Kyunghyun Cho: It is a bottleneck.
>>: But there is an expansion, though; in the mind it expands and occupies a different space, a different
geometry.
>> Kyunghyun Cho: Yeah.
>>: That’s why kids send pictures not text.
>>: Right, well...
[laughter]
>>: No, that is the reason.
>>: Also, I mean you have dyslexic people who can’t force themselves to do this. They still understand
things.
>> Kyunghyun Cho: Yeah, so what is the correct representation? In fact, should that representation
correspond to our understanding or intuition about representation? I would say that is a huge open
question that I cannot really answer easily now, but we can talk about it.
>>: You’re mapping surface forms here. You’re not mapping meanings at all. I mean we’re talking
about our representations. Clearly we’re not doing this kind of mapping.
>> Kyunghyun Cho: Maybe, maybe not.
[laughter]
>>: I would argue that we probably don’t even think completely linearly…
>> Kyunghyun Cho: But I will just finish with the, I have let’s say one and a half slides left. We, it turned
out that the very same model can be applied to other modalities as well. We tried the same thing.
It was the Christmas week in two thousand fourteen. It was actually, there was a snow storm in
Montreal, so nobody was able to go out. I was at home and then just decided that yeah with the very
same code I’m going to just cut out the encoder recurrent neural net and plug it into, plug the
convolutional neural net there. Then train this model to do the image caption generation instead.
Now, unlike many people who have done it by just getting one vector out of the image and then
decoding out the caption, because, you know, we love this attention mechanism, I decided to use the
last convolutional layer, which gives me a set of vectors that already preserves the spatial information.
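A rough sketch of attending over those convolutional feature vectors at each decoding step (my own simplified PyTorch naming, not the actual captioning code from the talk):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    def __init__(self, feat_dim=512, dec_dim=512):
        super().__init__()
        self.score = nn.Linear(feat_dim + dec_dim, 1)

    def forward(self, conv_feats, dec_state):
        # conv_feats: (batch, L, feat_dim), L spatial locations from the last conv layer
        # dec_state:  (batch, dec_dim), current decoder hidden state
        L = conv_feats.size(1)
        rep = dec_state.unsqueeze(1).expand(-1, L, -1)
        e = self.score(torch.cat([conv_feats, rep], dim=-1)).squeeze(-1)   # (batch, L)
        alpha = F.softmax(e, dim=-1)                  # attention over image locations
        context = (alpha.unsqueeze(-1) * conv_feats).sum(dim=1)  # (batch, feat_dim)
        return context, alpha   # alpha is what gets visualized on the frames later
```

The decoder then conditions each generated word on this context vector, in the same way the translation decoder conditions on attended source annotations.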
Then it worked, yes. Then I realized, wait, if that worked I can probably do even more. These are some
examples, but examples are examples, I'll just pass. We decided to do it on video: okay, let's feed the
video in and let the model describe the video.
Of course, video description doesn't work, I can tell you. It doesn't work, but it does something
interesting. These were the input frames, four of them, four of the roughly thirty frames we fed in. This
is the generated description, "someone is frying a fish in a pot"; I'm not sure whether there's an actual
fish or not, because the reference doesn't say. It was doing something, right.
These are the attention weights put on those frames. Then we thought, wait, speech recognition is
translation: you translate from speech to text. We applied that and it works. These are the attention
weights we get after training this model for some time.
There are a number of people at the University of Montreal who are pushing in this direction, using the
very same model we have and making variants of it for speech recognition. I'm not working on it. It
turned out there are a lot of other applications of the very same model with only a minimal set of
modifications.
I wrote a review paper last year, which turned out to be way too early. But it has an extensive list of the
applications possible with this model. Okay, thank you.
[applause]
>>: In the speech recognition example where you’re generating…
>>: I’m raising my hand.
>>: Sorry.
>> Kyunghyun Cho: Okay.
>>: When you say that the attention mechanism worked for caption generation, you mean worked as in
it outperforms something without attention but still using a convolutional input?
>> Kyunghyun Cho: Yeah, so, yes, yes. Compared to almost the same thing without attention, we get a
better score. Interestingly, that improvement was not captured by the automatic evaluation measures
such as BLEU, METEOR, or whatever…
>>: Right, right, you did it with human…
>> Kyunghyun Cho: But it showed in the human evaluation. We actually submitted this to the MS
COCO Challenge, so yes. We were like eighth, ninth, eleventh according to BLEU, CIDEr, all those
measures. We were like, oh no, we should have spent more time to make an even larger [indiscernible]
sample or something. Then after the human evaluation we went up to like second or third place.
I think it's because it can generate a much more natural, let's say, caption that is more tied to the image.
That's what I believe, but…
>>: Yeah, theoretically I agree with you. But we haven’t been able to see the same kind of
improvements as you guys.
>> Kyunghyun Cho: Oh, I see.
>>: Interesting, thank you.
>> Kyunghyun Cho: Alright, you were saying something?
>>: Oh, in the speech recognition are you generating characters or phonemes, or what? I, yeah, what?
>> Kyunghyun Cho: Here, when I was still kind of part of the speech team in Montreal, we were just
going for phonemes. But nowadays I think they are going for words or characters, yeah.
>>: But they have to have separate couple PC [indiscernible] in order to do that, right?
>> Kyunghyun Cho: No.
>>: No…
>> Kyunghyun Cho: No, that’s the thing they just put the very same model like the recurrent neural net
on top of that.
>>: I see.
>> Kyunghyun Cho: They are not better than, so the latest one so far is not better than CTC; we are on
par. Yeah, we are on par.
>>: [indiscernible]. Somehow we have to do something beyond [indiscernible].
>> Kyunghyun Cho: Yeah, so language model. Yeah, language model helps.
>>: Yeah.
>> Kyunghyun Cho: I think you know it helped here as well. Yes, [indiscernible]?
>>: Has anybody tried speech to speech, like you just like say something in a different person’s voice?
>> Kyunghyun Cho: Yeah, I think we can do that. It's just that there isn't much data we can easily access
in the public domain when it comes to speech-to-speech.
>>: Do you think about how to do things that can remove the requirement of having larger
[indiscernible] data?
>> Kyunghyun Cho: Yeah, that’s…
>>: [indiscernible]…
>> Kyunghyun Cho: That is exactly the reason why I’m really interested in this multilingual model. I
told you about the caption generation and video description generation, right. That was exactly the same
model that we used for translation. I want to plug the image in here, I want to plug the video in here, I
want to plug the speech in here. By having this kind of multi-modal, multi-task setup, I think that is the
way to go for tackling, I don't know, low-resource tasks.
That’s my view. But of course Yoshua and Yann disagree with me heavily on that point. But, you know,
disagreement is the fuel for innovation…
>>: They keep talking about how important it is to [indiscernible]. Do you have any idea whether what
they are talking about is maybe applicable to…
>> Kyunghyun Cho: Yeah, so I believe that unsupervised learning is going to be important as part of,
let's say, multi-task learning. The thing is that with unsupervised learning nobody knows how to do it,
right. That's the thing. Of course we usually view unsupervised learning as doing probabilistic modeling
of the input distribution. But nobody actually knows whether that is the right way to view it, right.
There can be completely different ways to view it.
One of the views that I personally like is unsupervised learning as predicting the motor control given the
observation: something changed, how did it change? Maybe that is the way. Or predicting the future;
that's Yann LeCun's view, it's all about predicting the future. Yoshua is more on the probabilistic
modeling of the data distribution, but who knows. Yeah?
>>: The attention model sort of seems like semi-supervised learning, do you think…
>> Kyunghyun Cho: Yeah, so it's not semi-supervised, but I prefer to call it weakly supervised, which is a
really weird term. But yeah, I think we should call it weakly supervised.
>>: [indiscernible] TED Talks, IWSLT…
>> Kyunghyun Cho: Oh, yeah, so that one we actually, yeah tried as well with, for this deep…
>>: How much better are you compared with Stanford? Stanford has a huge amount of…
>> Kyunghyun Cho: I know, so, okay, on that we haven't really tried seriously. We tried it for the deep
fusion, the language model paper. But the corpora there are tiny. Whatever we learned, let's say the
amazing results we get on those corpora, I don't think we can actually tell whether that is because the
model we used was good, or whether it's because of, you know, automatic regularization from not being
able to translate well. That might have been the case. I think those corpora are just too small to say too
much about it, yeah.
Okay, thank you.
>> Will Lewis: Thank you.
[applause]