>> Will Lewis: It's our pleasure to have Kyunghyun Cho here from New York University to give a talk. Kyunghyun, sorry Kyunghyun, I have a hard time with his name, sorry. >> Kyunghyun Cho: You said the last one, yes. >> Will Lewis: The last one was better, yeah. When I think about it I say it, right. >> Kyunghyun Cho: Yeah. >> Will Lewis: He's an Assistant Professor at New York University. Last fall he actually got the Google Faculty Award which he'll be using for doing more work on machine translation of course. >> Kyunghyun Cho: Yes. >> Will Lewis: Which is great for the field. Before NYU he was a post-doctoral researcher at the University of Montreal under Professor Bengio. He received his PhD from Aalto University School of Science in two thousand fourteen. That's in Finland. His thesis was entitled Foundations and Advances in Deep Learning. I give you Kyunghyun. His talk today is Future (Present?) of Machine Translation. >> Kyunghyun Cho: Alright, thank you Will. Well, thanks for the invitation first. You know, the slide just says Neural Machine Translation. Because I just realized while I was making my slides that, yeah, the Future of Machine Translation sounds a bit too grand. Let's, you know, tone it down a bit. Because whenever I give a talk, half of the people really love what I'm talking about. The other half really hates what I'm talking about. I thought okay, let's go for a more conservative title. Before we actually begin, one question that I get every time, actually from my collaborators, is, you know, are you a machine learning researcher, a machine translation researcher, or an artificial intelligence researcher? I'm like, yeah, but I think they're all the same thing at the end of the day. Then they start asking me, you know, oh, why do you think so, and so on. Before I talk about machine translation, let's talk about why language is important for artificial intelligence in general. A few months ago Yann posted on his Facebook, replying to a comment by David McAllester, that he's not sure language is as important as we think. As I was reading the comment I was like, no Yann, you brought me to NYU to do language research. You're saying that it's not important, no; that's not that great. I began to ask myself, okay, why do we want to do natural language processing research? When, you know, people are more excited about artificial intelligence, at least in my circle, right. I started to think about it a bit more. Then, you know, thinking about building artificial intelligence agents. In order to build an intelligent agent, that agent needs to have a direct perception of the surrounding world, right, in order to know about what's going on. Also at the same time that agent lives, right. It survives over time and then it gets the individual experience. Okay, that's great. At this level I realized that what we can get is something like a tiger, or lions, or elephants. They are born and then they live in environments, right. They perceive the environments and then they age and get individual experience. Now, what does natural language actually get us? Natural language first of all gets us the distant and collective perception. We don't have to experience or observe everything ourselves, right. I read the books. I hear from people, you know, what's going on. 
I don’t know somewhere in Greece they tell me about you know economy is falling down. I don’t have to go there and see the stores closing down there. I actually can know by just listening or reading the news articles. Also at the same time somehow everyone in this room knows about the, I don’t know, Greek, Roman history. Although I’m pretty sure none of us experienced or perceived what happened about two thousand years ago, or three thousand years ago. We get a historical experience. I think those two things that are enabled by the natural languages are probably the most important thing. To push the boundary of the artificial intelligence research toward the actual human level intelligence. I was thinking about this. Some people started telling me that well, you know, natural language processing isn’t it about to be crushed by the, I don’t know, deep learning or something. I think Neil Lawrence said that last year at the ICML Deep Learning workshop. I like Neil Lawrence by the way, a lot. Then Wojciech Zaremba who’s a PhD student at NYU, he’s graduating soon, and he joined OpenAI. On the reddit, ask me anything. He said that the speech recognition and machine translation between any languages should be fully solvable in a few years. I can see you guys are working on this to say, speech recognition and machine translation between any languages already. Then you probably disagree. By the way I know Wojciech personally and he’s a great guy. I like him a lot as well. [laughter] Then as soon as this was posted Yoav Goldberg he posted on his Twitter, oh, the arrogance. [laughter] Again I like Yoav and his work a lot. [laughter] I have a huge respect for him. In this issue I’m more on the Yoav side. I think there are a lot of things that we need to do. Okay, back to the translation. >>: You’re becoming old. >> Kyunghyun Cho: Sorry? >>: You’re becoming old. >> Kyunghyun Cho: Oh, not that old, okay, not… >>: This is the argument. >> Kyunghyun Cho: Not Yoav’s level yet, okay. [laughter] Maybe in about I think twelve years. I don’t know we’ll see. Anyway back to today’s topic, so Neural Machine Translation. It was two thousand and thirteen. It was the summer of two thousand and thirteen. I moved to Montreal. Yoshua Bengio told me, hey, you know you’ve been working on the probabilistic graphical models, probabilistic neural nets. Then I have three topics that you might find interesting. Then you can choose whatever you want. Then you can just do real research. Because you don’t speak French you don’t even have to teach or anything. [laughter] That was University Montreal completely French speaking university. Then the first one was the usual let’s say probabilistic interpretation of the Denoising Autoencoder. The second thing was something else, again, just a neural net. The third thing was Machine Translation. I was kind of surprised, right. Yoshua Bengio never has done any research on machine translation. I asked him machine translation, really, you don’t know machine translation. I don’t know machine translation. I don’t think anybody in this lab where there were about forty people back then, know about machine translation. Then Yoshua told me that, yeah somehow I feel that the machine translation can be done with neural net only. Just like how it’s being done in, with speech recognition these days. I was like, alright that sounds about right, okay. I hoped to believe that because that’s exactly the kind of topic that I wrote my thesis on. Then I said, yeah let’s try machine translation. 
Then I started to look at the machine translation literature. You know, googling a lot about the translation. Then you know it turned out that machine translation as a problem itself is quite, you know, straightforward. We have some kind of black box, so machine. Then we have a lot of data. Then our goal is to use this data to make this black box take as the input a sentence in a source language. Let's say in this case English. Then spit out a translation. It turned out that is the machine translation. Okay, I was like, okay great. Then when you put this into the probabilistic way it gets even more elegant. What you want is that you want your machine or the black box to be able to model the conditional distribution over the translation given a source sentence. Then you can rewrite it into a sum of the log of the translation model and the log of the language model. The elegance comes from the fact that we use limited parallel corpora to fit, tune the translation model, while for the language model we can use essentially an infinite amount of monolingual corpora. I was like, ah this is great. This seems like a very elegant problem that you know we can do a lot of things about. Then I started reading Philipp Koehn's Statistical Machine Translation book. Up to chapter two it's all about this kind of elegance of the statistical machine translation. I really loved it. Chapter three… [laughter] Well I haven't got into the fun part yet. But, okay, I think chapter three is the basic probabilities and statistics. I skipped over that. Then chapter four is where the IBM models are introduced. Chapter five is where the phrase-based models are introduced. Then from there on I just noticed that in reality it's actually pretty messy. You know like at the end of the day nobody really cared about the conditional probability. [laughter] I was like wait that sounds a bit wrong here. Then everyone was using a simple log-linear model. Then you know of course I work with the neural nets. Then the first thing we learn in the neural net course or the neural net research is that the neural net is great because it can do non-linear regression or non-linear classification. I was like, yeah, I don't want to handle the log-linear model too much here. Then there were a number of feature functions. For the first let's say five of them I could actually understand why you need it. [laughter] But from there on until let's say there was some paper with the two hundred sparse features. Between those two I got a bit stuck. Then you know I had to think. Then okay I can actually read those papers, worth ten years of research. Or maybe what I need to do as a neural net researcher is to try to see it as a completely new problem. Then you know try to approach it in our own way. I started to look up the history a bit. Then it turned out that using a neural net for machine translation is not new at all, in fact, already in two thousand six Holger Schwenk from France wrote a paper saying that, okay, we can use a very gigantic feed-forward neural network as a language model. Then use it to rescore the n-best list from the translation system. Advantage, you don't have to touch the existing statistical machine translation system at all. You don't have to do it. You just download Moses, follow the instructions, and then you get the n-best list for each source sentence. Then you're going to use the neural net language model to re-score them, or rerank them. 
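A worked form of the decomposition described above, in standard notation (the symbols f for the source sentence and e for the translation, and the LaTeX rendering, are editorial assumptions rather than anything shown in the talk): by Bayes' rule, and dropping the term that does not depend on the translation,

```latex
\hat{e} \;=\; \arg\max_{e}\, \log p(e \mid f)
        \;=\; \arg\max_{e}\, \Big[\, \underbrace{\log p(f \mid e)}_{\text{translation model (parallel data)}}
        \;+\; \underbrace{\log p(e)}_{\text{language model (monolingual data)}} \,\Big].
```

Schwenk's rescoring approach touches only the second term: a neural language model re-scores the n-best translations that the existing system has already produced.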
But of course the disadvantage is there is no integration. The neural net language model you use doesn't even know what the source sentence is. That makes, that made me wonder how does it even help when you don't know what the source sentence is? How can you actually re-rank the target, let's say, translations? You know we thought okay. This is a bit too, you know, this one is too naive. Then in two thousand fourteen Devlin et al, the lead author is probably here, yes, okay. [laughter] They had a paper where they made a new translation model using the neural net. Then plugged it into the existing statistical machine translation system. The advantage is that there's almost no modification to the existing MT system. Of course there are a few. Then the integration is now quite deep. You actually plug in the neural net into the existing system. Then let the decoder use the features or the scores you get from the neural net. Disadvantage is, okay, increased complexity during decoding, because you have, let's say, another component, and neural nets are generally very expensive to compute. Then the second thing, the second question I had was, how do we actually know that the neural net feature you get is linearly compatible with the other features? At the end of the day as I said earlier the existing phrase-based model uses the log-linear model. Then it assumes that all the features we got are linearly compatible. But how can we actually check? Do we actually know? Of course in this paper the amazing improvement in the score kind of empirically shows that yes it is possible. But I still couldn't, you know, convince myself that okay those are all linearly compatible. We decided that, okay, we are a neural net research group. We are neural net researchers. Let's try to plug in a gigantic neural net between the source sentence and the target sentence without any other extra component. An obvious advantage is that every single component of this network is tuned specifically toward maximizing the translation quality. Unlike existing cases where, when you get the feature value, you think, or the people think, that you know those features are good for the translation quality. But that's not necessarily guaranteed because you don't necessarily tune the feature value except for the weight or coefficient in front of it. Well there are a lot of disadvantages. First of all back then we didn't know whether it's going to work. That was a huge disadvantage we had. The second thing is that of course now the training is extremely, you know, expensive. But this was okay. We are pretty patient people. Neural net researchers are probably the most patient researchers in the machine learning community. We just wait a few weeks even using the state-of-the-art GPUs. But the most serious, the more serious issue I ran into when I started talking to the machine translation researchers back then was that, you know, I found that a lot of people are, probably subconsciously I hope, allergic to the term neural net. I found a lot of people just, you know, like, oh, yeah, neural net to do the, I don't know, translation. No, not a lot of people love the idea at the end of the day. But anyway, so I'm going to talk about this approach today. Then you know I called it Neural Machine Translation at some point. But of course Neural Machine Translation is just a subset of the Statistical Machine Translation with different mechanics. 
I actually prefer the name connectionist MT, except that this term never really got picked up by any other people. The reason why I call it connectionist MT is that already in nineteen ninety-two Hutchins and Somers talked about this in their textbook on Introduction to Machine Translation. They said in like one paragraph out of the very thick book they wrote that, "The relevance of the connectionist model to natural language processing is clear enough." "As a psychologically real model of how humans understand and communicate." Let's ignore this part because I'm not sure about that. But it is clear, it is, you know, a possibility that people have been thinking about already from the early nineties. Then it actually was tried by two independent groups in nineteen ninety-seven at the same time, both of them from Spain, one group from Valencia, the other group from Alicante. They actually tried exactly the same thing that the [indiscernible] and others at Google tried in two thousand fourteen, already in nineteen ninety-seven. The first one is called Recursive Hetero-Associative Memory, proposed by Forcada and Neco in nineteen ninety-seven. I found this paper and then, you know, this is the figure two if I remember correctly. I was like, my god this looks exactly like what we've been doing. There's a recurrent neural net encoder that's going to read the source sentence, summarize it into a vector. Then from that vector the decoder neural net is going to spit out one word at a time in the translation. Similarly, Castano and Casacuberta in the same year from Valencia proposed a similar model, almost, you know, identical to them. But the issue back then was that, "the size of the neural nets required for such applications and consequently, the learning time can be prohibitive". Now we know that, nowadays, that is not true. It is still quite expensive. But we can actually manage it in let's say two, three weeks. Then this idea, yes, sir? >>: In most of the papers that you wrote did you refer to this… >> Kyunghyun Cho: We do once I found it, yeah we do. >>: Okay. >> Kyunghyun Cho: Forcada and Neco's paper is slightly closer to these recent works. Then you know that paper was cited only three or four times when I discovered it. But now it's been cited like thirty times. [laughter] >>: Yes, [indiscernible] didn't know what [indiscernible]. >> Kyunghyun Cho: Yeah, I think you know it's a win-win situation. >>: [indiscernible] >> Kyunghyun Cho: I actually met Forcada at one conference. He was really, you know, friendly to me, right. [laughter] I don't think [indiscernible], okay. >>: [indiscernible] >> Kyunghyun Cho: Oh, okay, and in two thousand thirteen, which I didn't really know back then, from Oxford, Nal Kalchbrenner and Phil Blunsom had a paper where they proposed almost the same thing except the encoder part was the convolutional neural net. But one thing is that they just didn't push it enough. Then their results were let's say at best kind of lukewarm. Let me tell you about what Ilya Sutskever et al., from Google, and we have done in two thousand fourteen. It's a very simple model, right. It got really simple because we started from scratch. Let's forget about, you know, whatever we know about the machine translation. But we're going to just view it as the structured output prediction problem with a variable length input. It's a variable length sequence input. We're going to use a recurrent neural net. 
Before that we have to decide on how we are going to represent a source sentence. We thought, okay, what is the representation with the least amount of prior knowledge about the words or the characters, or whatever the input is. That is the so-called one-of-K coding, or the one-hot vectors. In that case each word is coded as an all-zero vector except for one element, whose index corresponds to the word in the vocabulary. We set it to one. Then the most important property is that every single word is an equal distance away from every other word. Now, you know, there is no knowledge about it. We just encode it like this. Then each one of them is going to be projected into, a fancy term, the continuous word space. Of course that term is too fancy. In fact it is just a matrix multiplication from the left. Then we get a dense real-valued vector for each of the words. Now we have a sequence of real-valued vectors. That sequence is read by the recurrent neural net. Each time it's going to read the new word. Then update its memory state based on the previous memory state as well. It reads one word at a time until the end of the sentence. Then we're going to call that memory state of the recurrent neural net a summary vector, or the context vector of the source sentence. Then the decoder is exactly the opposite or the flipped version of the encoder. Now, given that summary vector we're going to update the recurrent neural net's memory state first, based on the previous memory state, the previously decoded translation word, and the context vector. From that we compute the word probability, or the distribution over all the words in the vocabulary. When we have a distribution what do we do? We either sample or take the most likely word. We do that, so we got the next word. Then we do it recursively until we sample the end of sentence symbol. That's essentially, that was essentially it. Of course you know like we didn't know about how long it takes to train. We actually coded this whole thing up pretty quickly. Like it was September two thousand thirteen we had all the code there. Then we were pre-processing the dataset and corpora. We started training the model and you know like after a day or two it was not doing anything. Of course later on we learned that okay we had to wait two weeks instead of two days. But then this model at the end of the day kind of started to work. Especially Ilya Sutskever and his colleagues at Google made it work quite amazingly. It was doing actual English to French translation as it is. We were surprised because we couldn't make it work like that even if we waited like two weeks. That's why we had to resort to training on the phrase pairs instead. Then use it as part of the existing Moses essentially. >>: When did they start doing this? >> Kyunghyun Cho: Sorry? >>: When did Sutskever, et al., start doing this? >> Kyunghyun Cho: I think it's the same time. We didn't know about them doing it. >>: Okay. >> Kyunghyun Cho: I think Yoshua might have known. He probably told them that we are trying to do it. That's a possibility. But Yoshua never told us that Google is doing it. But we really, at least I really didn't know that they were working on the same thing. >>: That's about two thousand and thirteen, the summer? >> Kyunghyun Cho: No that's going to be two thousand fourteen now. >>: [indiscernible], okay. >> Kyunghyun Cho: Yeah. 
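A minimal sketch of the plain encoder-decoder just described, with one-hot inputs, an embedding matrix, a recurrent encoder that reads one word at a time, and a decoder that emits one word at a time. Every name, size, and the simple tanh recurrence are illustrative assumptions; the actual models used gated units, far larger dimensions, and trained weights rather than random ones.

```python
# Sketch only: a toy encoder-decoder with randomly initialized weights.
import numpy as np

V_src, V_tgt, d_emb, d_hid = 1000, 1000, 64, 128   # toy sizes (assumptions)
rng = np.random.default_rng(0)
p = lambda *shape: rng.normal(scale=0.1, size=shape)

E_src, E_tgt = p(V_src, d_emb), p(V_tgt, d_emb)    # embedding matrices
W_enc, U_enc = p(d_emb, d_hid), p(d_hid, d_hid)    # encoder recurrence
W_dec, U_dec, C_dec = p(d_emb, d_hid), p(d_hid, d_hid), p(d_hid, d_hid)
W_out = p(d_hid, V_tgt)                            # output projection

def encode(src_ids):
    """Read one word at a time, updating the memory state."""
    h = np.zeros(d_hid)
    for w in src_ids:
        x = E_src[w]            # one-of-K vector times E_src == picking one row
        h = np.tanh(x @ W_enc + h @ U_enc)
    return h                    # summary / context vector of the source sentence

def decode(context, eos_id=0, max_len=50):
    """Emit words until the end-of-sentence symbol."""
    h, y, out = context.copy(), eos_id, []              # start from the EOS symbol
    for _ in range(max_len):
        h = np.tanh(E_tgt[y] @ W_dec + h @ U_dec + context @ C_dec)
        logits = h @ W_out
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                             # distribution over the vocabulary
        y = int(probs.argmax())                          # take the most likely word (or sample)
        if y == eos_id:
            break
        out.append(y)
    return out

print(decode(encode([3, 17, 42, 0])))   # meaningless output with untrained weights
```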
>>: What's the big difference, for just the encoder/decoder model without attention, that you guys fed in the source state every time and they fed it in as the first [indiscernible]. Is that the crucial difference? >> Kyunghyun Cho: No that's not the crucial. >>: Okay. >> Kyunghyun Cho: That's not the crucial. The crucial difference between them and us back then was that their model was about a hundred times larger. >>: Okay. >> Kyunghyun Cho: Yeah, so that was the main difference. Then you know like of course we learned about that. Then we thought okay, maybe we could implement that as well, like why not. You know we're going to just make it into a, we're going to parallelize the model and everything. >>: It's a hundred times larger in terms of number of parameters or along with data? >> Kyunghyun Cho: [indiscernible]. >>: The number. >> Kyunghyun Cho: Yeah, number of parameters. We're using even the same data as well. >>: Here you also have these thought vectors over here. >> Kyunghyun Cho: Essentially, this can be, think of… >>: Thought vector. >> Kyunghyun Cho: Thought of as a thought vector. >>: Okay. >> Kyunghyun Cho: I don't know, I don't really like that term you know. [laughter] >>: Are you going to speak to why beam search is a good idea? I see that that's like your last bullet on the slide. >> Kyunghyun Cho: Ah, yes the beam search is a good idea because sampling is not a good idea. We have a distribution… >>: Those are the only two options like it's… >>: [indiscernible]. >> Kyunghyun Cho: Right. >>: Right, but… >> Kyunghyun Cho: Well we can just do the greedy search with a beam width of one, right? Our goal here is not to actually get nice samples that are going to be representative of the distribution. But rather we are looking for the maximum a posteriori sample. Beam search is good. If we could do anything better than that, that'll be even better. But so far the most naïve approach is to use the beam search. Yes [indiscernible]? >>: You said the big difference was the number of parameters, hundred times more parameters. >> Kyunghyun Cho: Yes. >>: What is your sense in these models? What's the true degree of freedom that you have? When you go a hundred times is it equivalent to, in some other models, growing like two times? Like how, because it seems like there are too many parameters in this model so they're not really free. They're not freely used. What is your sense? >> Kyunghyun Cho: Okay, so well there are actually multiple things in the question. Alright, so I think first of all we are still working with way too small models, even with a hundred times more parameters. That's the first answer I have. Then the second answer is that, the more parameters, empirically speaking, the easier it gets to train a model or optimize the training cost function. Then this is not only me but a lot of people tend to agree that when we put more redundancy in, then the training gets easier and easier. Probably because of symmetry, but yeah it's… >>: But is there some kind of quantitative measure there of degrees of freedom. Like maybe ten times more you wouldn't see any difference. Hundred times you see a difference. Then if you want to get another measurable effect we need to go another hundred times. >> Kyunghyun Cho: Right, so there is also an issue that the number of parameters alone is not a good measure of how large this model is. Or, you know, it's not a good measure of the capacity of the network, right. 
There is both the number of parameters and the amount of computation. For instance if we increase the depth instead of the width then what we do is that we can keep the number of the parameters the same but we can increase the amount of computation. If we put the recurrency there we keep the number of parameters the same but we can control the amount of computation. What we see is that in many cases we can keep the number of parameters exactly the same. But by increasing the amount of computation we can get better performance on other things. But not on, we haven't tried it on this one. But in let's say generative modeling of the images, by doing the recursive processing we get better and better. But of course the amount of computation grows almost linearly as we do. Yeah, very good question which I don't have a good answer to, sorry about that. Hopefully I'll get that answered in about twenty years, but not today, not today. >>: Twelve years, twelve years. >> Kyunghyun Cho: Twelve, twelve sounds good, yes. One thing we noticed is that it's not really a realistic model. This model here, like the simple encoder/decoder model, is not a realistic model of how translation is done. Why is that so? I can, you know, try to give a lot of empirical results and numbers that we got. But I think it's best to answer this by quoting Professor Ray Mooney at the University of Texas, Austin, because it's just not a good idea to, "…cram the meaning of a whole… sentence into a single… vector!" You can think of it yourself as well. Let's say I'm going to throw you a hundred-word sentence. Now I ask you to, okay, now translate it. But I'm going to show you the sentence for only five seconds. You're going to read it once. Then you won't have access to it but you have to write the translation, and nobody's going to be able to do that well unless you're very well trained to do so. What we do, and then according to the studies on human translation, for instance, in the case of English to French translation, what human translators, professional ones, do is they translate in a much smaller translation unit, two to three words at a time. They read the source sentence once. Then they start writing the translation or the target sentence two, three words at a time. Go back to the source sentence over and over. We had one intern, a master's student intern who's now a PhD student in Montreal. He came up with this brilliant idea of incorporating an alignment mechanism into the neural networks. We called it alignment mechanism in our original paper. But somehow attention is always a fancier word than alignment. Everyone is calling it attention mechanism. The idea is very simple. Instead of encoding the source sentence into a single vector, which is unrealistic, we're going to encode it as a set of vectors. That set of vectors comes from the bidirectional recurrent neural net. The forward recurrent neural net is going to read from left to right. The reversed recurrent neural net reads from right to left. Then at each location, or each word, we're going to concatenate the hidden states of those two recurrent neural nets and call it an annotation vector. What does this annotation vector represent? It represents the word, let's say in this case growth, with respect to the whole sentence. You can view it as a context dependent word representation or word vector. Then now we have a set of vectors. The issue is that the neural nets are not that great if you have a variable sized input. We have to do something about it. 
There comes the attention mechanism. At each time step in the decoder, first let's see what the decoder hidden state represents. The decoder hidden state represents what has been translated so far. In this case the decoder hidden state here is going to represent the economic growth. Then it knows what has been translated. Given what has been translated so far, for each of these annotation vectors we're going to score how relevant it is. This attention mechanism, which is nothing but a neural net that gives a single scalar, is going to give out the relevance score of each word vector for the next word in the translation, given what has been translated so far. Those scores we normalize them to sum to one. Then you know that always gives us a nice interpretation as a probability. Then based on that score we take the weighted sum of the annotation vectors. Then use it as the time dependent context vector. Okay, let's think about the very extreme case. At this point we need to attend to the growth. Then this attention mechanism is going to look at the, oh, I have translated La here, or the here. Then what is the next one that I need to translate? It's going to put a very high score on this growth and a very low score on all the others. Then it works as if the next time step is computed solely based on a single word in the source sentence. You do it over and over again until the end of the sentence [indiscernible] is sampled or selected. Yes [indiscernible]? >>: Why don't you also have the final state of sentence? Like both this and the previous portion? >> Kyunghyun Cho: Yes, so I didn't put it in this graph because you know like this figure gets super cluttered eventually. >>: Oh, I understand. >> Kyunghyun Cho: Is that, so we initialize the decoder's recurrent hidden state with the last hidden state of either the forward or the reverse recurrent neural nets. >>: You could also just have it as an input always… >> Kyunghyun Cho: We can do that as well. It gets slightly slower. >>: It would be fair to say that the thought vector approach in the [indiscernible] is a special case of this. >> Kyunghyun Cho: Yeah. >>: If you simply make all the weights to be zero except the last one. >> Kyunghyun Cho: Exactly, yes. >>: Yeah, okay. >> Kyunghyun Cho: That's true. It's the same encoder/decoder. Just that you know like we have some conceptual issue drawing an encoder neural net that gives us a set of vectors instead of a single vector. But, yeah it's the same thing. Yes? >>: Did you consider taking into account the coverage, like what are the words that you translated so far, and reduce the weight for them rather than… >> Kyunghyun Cho: Right, so I think there is one paper submitted on that to some conference. I don't know how it worked out. The thing is that, okay, in principle if you have the same [indiscernible] of data and everything this decoder should be able to, you know, consider that as well, right. Because the decoder knows which of the annotation vectors have been translated. Then the decoder hidden state is used to compute the scores. It should be possible. But of course you know we always have a finite amount of data. Then our learning algorithms are always very primitive. It's unclear whether that happens. In the case of image caption generation, which we did with the very same model, what we did was to regularize the attention weights. We regularize the attention weight matrix so that it's going to be doubly stochastic. 
In this case if you just let it run we get a stochastic matrix because for each target-side word the sum over the source words is one. Whereas we can regularize this such that the other way around is also true. Well, it's not exactly, there is some constant. Then you know it turned out that that helps when the dataset we considered was really small. But when the dataset got large enough then we didn't really see any improvement there. Yes? >>: But if you're translating from a language that has a lot of function words that don't map into the target language. That probably wouldn't even be a good thing, right. Because there's lots of words that you want to just completely ignore, right? >> Kyunghyun Cho: Exactly, exactly, so that's why you know we do not put it as a constraint but as regularization. Yes? >>: Over here you have three [indiscernible] Z t minus one, U t minus one [indiscernible]? >> Kyunghyun Cho: Yes. >>: But the input here only two, you're missing… >> Kyunghyun Cho: Oh, yeah, so… >>: One of them is… >> Kyunghyun Cho: Yeah, I should… >>: Okay. >> Kyunghyun Cho: Put it here as well, yes. >>: Okay, okay, good. >> Kyunghyun Cho: Yeah, you know I used this figure for my job talk so I wanted to make it as pretty as possible and not cluttered. >>: How important is it to have all three… >> Kyunghyun Cho: It turned out that this one is not too crucial. Because why do we need the actual sample of the word? It's that the decoder's previous hidden state gives the distribution only, right. There is some uncertainty. Then feeding in the previous word essentially, you know, resolves that uncertainty. But what happens is that when we have enough data and then we have trained it long enough, the distribution itself becomes very peaky without much uncertainty. In that case U t minus one is not too… >>: Also there are other choices, you can have T minus one, T minus two, T minus three. Have you explored all these kinds of mixtures? >> Kyunghyun Cho: No, no, I'll tell you why we haven't. Then, okay, two things: we use those long short-term memory units or the gated recurrent nets. Then those update gates or the forget gates effectively learn to do so by, you know, carrying over the information if it's needed. I don't think that's going to be necessary. But obviously if you do all those things there may be some gain. But again I'm at the university. We don't have enough GPUs always. It's difficult to do that kind of exhaustive exploration unfortunately. >>: Do you know, do Google's people play around more because they're [indiscernible]… >> Kyunghyun Cho: I'm pretty sure they do, yeah. [laughter] Probably they tried almost all of them, yeah. >>: But these have the same three terms, right… >> Kyunghyun Cho: Yeah, I think so. >>: There's some difference about whether the word being used in the target is wi or wi minus one, right. There's some variation here. >> Kyunghyun Cho: Yeah, so last year at EMNLP the group, [indiscernible] from Stanford, they actually did a quite, let's say, exhaustive search on the, okay, different types of the attention, different types of the parameterization. Then you know they got an amazing result. But that amazing result was not that amazing. I'm not sure whether, you know, those were really effective, or, they tried it on English-German, maybe it was specific for that language pair. It's difficult still. I mean this model was proposed only like a year ago at best. I think we're still at the stage of exploration. Okay, so then we trained the model. 
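A minimal sketch of the attention (alignment) step just described, given precomputed annotation vectors from a bidirectional encoder. The single-layer scoring network and all names and sizes are illustrative assumptions; in the real model everything here is learned jointly with the encoder and decoder.

```python
# Sketch only: scoring annotation vectors and building a time-dependent context vector.
import numpy as np

rng = np.random.default_rng(0)
T_src, d_ann, d_dec, d_att = 6, 8, 8, 8            # toy sizes (assumptions)
annotations = rng.normal(size=(T_src, d_ann))      # one annotation vector per source word
s_prev = rng.normal(size=d_dec)                    # decoder state: what has been translated so far
W_a, U_a, v_a = (rng.normal(size=s) for s in [(d_dec, d_att), (d_ann, d_att), (d_att,)])

def attend(s_prev, annotations):
    # one relevance score (a single scalar) per source position
    scores = np.tanh(s_prev @ W_a + annotations @ U_a) @ v_a        # shape (T_src,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                                            # normalize to sum to one
    context = alpha @ annotations                                   # weighted sum of annotation vectors
    return context, alpha

context, alpha = attend(s_prev, annotations)
print(alpha.round(3))   # the soft "alignment" over the source words for this time step
```

The doubly stochastic regularization mentioned for image captioning adds a penalty encouraging these weights, summed over all decoding steps for each source position, to also stay close to one, as a soft constraint rather than a hard one.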
We get all this pretty alignment. But of course I speak neither French nor German, nor English well. [laughter] But I do believe those are really good ones. Then you know like since then we started using it for a lot of different language pairs. Then on the English to French WMT' fourteen, this is where we started with the attention-based neural machine translation. And then, you know, Ilya and others were able to actually get there without attention with a very large neural translation model. We introduced a large target vocabulary, then replacing out-of-vocabulary words, going up and up. Then you do all these things. Then you know you get to the state-of-the-art results on the English-French translation. Then you know like many MT researchers told me that yeah, but English to French is a kind of solved problem. You know you don't want to play around on that field. We thought okay we'll go to the English-German. We start from here. We started from here, then we did the out-of-vocabulary replacement, the very large target vocabulary extensions. Then you know like adding all those. This is the paper from Stanford last year. Then you get something better than the phrase-based model. Actually, I think I made a mistake. I think it was not Buck et al. but Barry Haddow and others using the syntactic, you know, phrase-based model. Then on the WMT fifteen the improvement is slightly better. But essentially we've been taking what people have been using, little by little, and then adding it into the neural machine translation. Then eventually if you add all those things you get something better. Then in general in WMT fifteen we participated in five language pairs. Then it turned out that they are kind of like neck and neck. Some languages are better, some languages are worse. One thing we learned is that, you know, neural net only MT kind of caught up with the phrase-based MT. Then if I wanted to make this talk less interesting then I could actually finish it here. Just saying that okay let's push further and beat the phrase-based MT. I could have finished it, right. But that's not fun. Then I even told you already that you know I'm going to talk about the future of MT. I shouldn't finish it here. Then you know I started thinking about it a bit. Then this is not fun. I want to do work on machine translation. Then I believe the machine translation is far from being solved, because I'm not going to use let's say Google Translate or Bing Translator to translate my paper. Then send it to my father, right. No one is going to be able to read that translated one. Then I just thought, then what, should I play this game. I don't think this is the real game actually. The real game seems to be like this huge thing. Then somehow we're playing in a very small corner of the game. >>: What is that airplane doing up there? >>: Yeah… >>: Yeah… [laughter] >> Kyunghyun Cho: There's, okay, okay. [laughter] I think it's the Arsenal or Chelsea Stadium, right. Airbus, yeah, so it's Europe, so, yeah they have the Airbus. But yeah that's not an actual airplane flying there by the way, for your info, alright. [laughter] >>: That's changing the game. [laughter] >> Kyunghyun Cho: Right, yeah that's changing it too much. That's changing it too much. Then I started, you know, thinking about it. Then looking at how the translation works. I am going to translate this movie review with a translation system. What I'm going to do is I'm going to take the first sentence out. 
I'm going to do the word segmentation, tokenization, or some kind of punctuation normalization, all those things, to get a sequence of words or tokens. Then I push it into the machine translation system which is going to give me a sequence of words in a target language. That needs to be detokenized and then desegmented, and all those things, to give me the actual sentence. Then I put it there and then say, okay, first sentence translated. Then I'll do it for the next sentence, next, next, next, so on until the end, the last sentence. Of course there are a few things that you can do, let's say resolve the coreferences on the way. But you know usually it's the sentence-wise translation. What I felt was that okay there are three issues. We are doing it at the word-level in a sentence-wise manner. It's always the bilingual translation. Then even if you want to do it in multi-lingual translation usually what you do is you go through the pivot language. I was like yeah this is not really fun, especially for a neural net. It's not fun at all. We actually have learned that the word-level translation with the neural net is super not fun. We had to come up with all those hacks. Of course we always justify it based on the importance sampling ideas in MCMC. But those are all hacks. We thought, I decided to actually tackle all these problems based on the neural net MT system. First let's talk about the word-level modeling. People have started thinking about going below the word, starting from last year. Actually, you know, people were thinking about it from a long time ago. But none of those were serious I would say. Almost simultaneously Kim et al., from NYU, and Ling et al., from CMU, proposed to encode each word by using either a recurrent neural net or a convolutional neural net based on the character sequence of each word. Then they tried it on the language modeling, parsing, POS tagging, which I don't believe there is such a thing, but anyway, all those things. Then they showed that okay it works for certain language pairs, certain languages, quite well. Then the same group, the Ling et al., from CMU, just put up an arXiv paper about two months ago to be reviewed at ICLR, which was rejected I believe, to, okay, use the same idea on the decoder side and then do the character level machine translation. I was like, okay, character level neural machine translation, that's great. Except that if you really read the paper it's almost impossible to reproduce their results because they had to pre-train the character level recurrent neural net, they had to pre-train the word level recurrent neural net. They had to do something and so on and so on. The experiments were slightly, you know, not convincing. They tried it on the English to Portuguese translation. Yeah, there is not even a baseline to compare how well they are doing. It's not the most popular language let's say. Is there anybody from Brazil? Maybe I shouldn't say that. Anyway, and then you know like at this point it's still, I wouldn't call it a character level machine translation, rather it's still a word level machine translation. This is the kind of progress report figure that I drew. We start with the word level. Given this, "I really enjoyed this film," you know, you do the tokenization into the words and then you do the translation. Then you can do some kind of slightly clever segmentation. You can do the morpheme segmentation or you just use the byte-pair encoding to segment it into the character n-grams. Then you get I really enjoyed this film. 
Then you do the usual translation. This was from Sennrich et al., last year. This works amazingly well by the way. If you're just training a neural machine translation system, get their code, do the offline preprocessing. It's just one pre-processing and that's it. It's really nice. We always use BPE by the way. Then you know like Kim et al., and Ling et al., all this recent work on the character level. What they do is they're going to tokenize it first into the sequence of words. Then each token is going to be read by a character level neural net. Now, I have to ask why didn't they go to nothing, no preprocessing, first. Is there some kind of problem with that, probably? I also thought that there must be a problem. We have that prior about, you know, characters are not the nice unit of meaning. We should go into something more and then do something, right. Turned out we can actually do that. We didn't need any preprocessing. We are working on it now. Let me show you some results. We decided that okay there is the source and the target. Then you know if we just do it on both sides it's difficult to essentially, you know, narrow it down when there is a problem. We decided okay, the source side we're going to leave as BPE-based sub-words. But on the target side we're going to just generate characters without any kind of segmentation boundary or anything. Then initially we thought that, you know, we need something special. We spent like one or two weeks sitting down together with Chung et al., Xiao. Xiao was actually here, yeah. >>: [indiscernible] >> Kyunghyun Cho: Last summer, right, yes. We were sitting down at NYU, in the very crappy office, you know, compared to here. We were thinking a lot and, you know, trying to make a nice recurrent neural net that can do the segmentation implicitly on the fly, and then generate it. Then we implemented the stuff. But as a baseline we decided to just, you know, put a two-layer GRU based recurrent neural net there and let it generate the characters. Turned out that the basic model just works, so these are the English to German, English to Finnish, English to Russian, English to Czech. Then this is the BPE-to-BPE. You can view it as the word level modeling. BPE-to-characters, so the decoder side, the target side, is character level. They are doing either better or comparably, always. Then we didn't need anything, unfortunately… >>: But that just works? >> Kyunghyun Cho: Sorry? >>: But that just works? There's not, there's, you didn't do anything special to do it in a language model or something like that? >> Kyunghyun Cho: No, no, we are not even using the monolingual corpora. By ensembling these the numbers get even better, much better than the BPE-to-BPE, and it just works. We're preparing a paper for ACL. I'm just worried what I'm going to say. I'm going to say that the, I don't know, character level neural machine translation on the target side just works. I don't know if the reviewers are going to buy that. [laughter] But it turned out that it really just works. >>: Yeah, so but basically at the character level you lose the power of the language model, right. Your character level language model is very weak compared to word level. >> Kyunghyun Cho: That's what I thought, right. That's what everyone thinks. >>: How, yeah. >> Kyunghyun Cho: But turned out it just works. >>: But, I know that people do in voice recognition they do the same thing. They lose quite a bit. 
But if you use the character, like Chinese character then recognition is fine. >> Kyunghyun Cho: Yeah. >>: Because the model, so I'm not sure, you know, because you think they use these LSTMs, that the memory is short enough. You just memorize… >> Kyunghyun Cho: Yes, I'm on that side clearly. >>: Okay. >> Kyunghyun Cho: I'm from that side so I'm like recurrent neural nets are actually great. >>: But I mean you're gaining on the much smaller, you know, vocabulary essentially, your target side, so all of your Softmax… >>: You lose all the [indiscernible] power, right, because your character-level language model is so weak. >> Kyunghyun Cho: Yeah. >>: But if you've got fifteen [indiscernible]… >> Kyunghyun Cho: Okay, so the thing is that it has the attention mechanism, right. >>: [indiscernible] >>: [indiscernible] language model, right. >> Kyunghyun Cho: It has the attention mechanism. Then we visualize it. >>: I see. >> Kyunghyun Cho: Then it actually aligns almost amazingly. >>: Okay. >> Kyunghyun Cho: In a sense that, so one example I saw is that, on the source side in English, was spa garden. >>: Okay. >> Kyunghyun Cho: That apparently translates to kurspark, does anyone speak German? I think it's kurspark or something. Kurs is a spa and then park is a garden. It was generating k-u-r-s-p-a-r-k. Then the alignment was perfect, let's say, spa was aligned to the k-u-r-s with almost the same weights and then to parks. >>: Do you, when you do the decoding do you have to, you know, constrain the decoder so it conforms to a dictionary? If something, you know, if something similar characters that could get you… >> Kyunghyun Cho: I know, so our prior says that, you know, we should do that, right. >>: Yeah… >> Kyunghyun Cho: I have that prior as well, yes. [laughter] I know, so I asked Chung [indiscernible], are you sure you're not making any mistakes? [laughter] Are you sure you're not cheating subconsciously or unconsciously or something like that? But wait, okay, left spin like okay raising his hand… >>: But you kind of started answering my question already you know. You said you don't want to just say it just works. >> Kyunghyun Cho: Yeah. >>: Like you can show examples of the resulting embeddings. Like you can show that a certain sequence of letters ends up with a state which is close to some other word that has the same meaning. Then you can take a word, add e-d to it and see what happens. I think from this… >> Kyunghyun Cho: Right, so… >>: Do you have examples like that? Have you looked at it? >> Kyunghyun Cho: No, so the thing is that actually makes me kind of regret that we should have started from the source side. On the source side that would have been much easier to test. Whereas on the target side we are conditioned on the source sentence, so it's slightly, yeah. It's slightly inconvenient to do a lot of analysis. But at the same time that is from the beginning very difficult to do in this case. We don't have boundaries, right. We don't have explicit boundaries. It's all recurrent neural net. It's all dependent on every single character before. It's difficult to actually just cut it down and then say what that is. >>: I'm pretty sure that when you look at the output there must be a lot of words which are not legal, right. Because a few characters that could be wrong. >> Kyunghyun Cho: I know. I agree, my prior also says so. >>: Right. >> Kyunghyun Cho: Turned out that we do the decoding using beam search. It rarely happened. 
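A minimal sketch of beam-search decoding over a character vocabulary, as discussed here. The step function is a hypothetical stand-in for one decoder step returning a new decoder state and per-character log-probabilities; the beam width and the simple handling of finished hypotheses are illustrative assumptions.

```python
# Sketch only: beam search over characters; `step` is a hypothetical decoder step.
import heapq
import numpy as np

def beam_search(step, init_state, bos_id, eos_id, beam=5, max_len=200):
    # each hypothesis: (cumulative log-probability, generated character ids, decoder state)
    beams, finished = [(0.0, [bos_id], init_state)], []
    for _ in range(max_len):
        candidates = []
        for logp, chars, state in beams:
            new_state, log_probs = step(state, chars[-1])      # distribution over characters
            for c in np.argsort(log_probs)[-beam:]:            # best continuations of this hypothesis
                candidates.append((logp + float(log_probs[c]), chars + [int(c)], new_state))
        beams = heapq.nlargest(beam, candidates, key=lambda x: x[0])   # prune globally
        finished += [b for b in beams if b[1][-1] == eos_id]
        beams = [b for b in beams if b[1][-1] != eos_id]
        if not beams:
            break
    best = max(finished or beams, key=lambda x: x[0])
    return best[1]   # highest-scoring character sequence
```

With a beam width of one this reduces to the greedy search mentioned earlier.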
>>: Did you find that, did it generate new words that are valid that were not in the, they're not in the training corpora? >>: Yeah that's… >> Kyunghyun Cho: We are still working on it, so doing the analysis. Yeah? >>: The question I have… >> Kyunghyun Cho: Okay. >>: Is about, let me get in here and find it. >> Kyunghyun Cho: Okay. >>: At the character level you're operating at sub-word level, which there is some elegance to, I mean there's some need to operate at a less than word level when you're doing MT. >> Kyunghyun Cho: Okay. >>: This is part of the problem with morphologically rich languages, which a lot of the languages we contend with… >> Kyunghyun Cho: Right. >>: Have a degree of morphological richness to them. We're not capturing that sub-word information which leads to bad translations in a given context. >> Kyunghyun Cho: Right. >>: There's a certain elegance. It seems like you're going too far though by going to the character level. But maybe it's a proxy for that. Do you have some sense of how accurate or how good the morphological variants are? For instance are you getting better word choice in the output with these character based models than you do with a word based model? >> Kyunghyun Cho: Yeah, so we have to do all those analyses. This is like really the latest research. We got all these numbers a week ago after running the experiments for about, I don't know, two months or so. We've been waiting a lot. We'll do all those analyses. >>: Okay. >> Kyunghyun Cho: Yeah. >>: Is there a reason you prefer characters on the target versus the source? >> Kyunghyun Cho: We just chose target first. That's why I think that was kind of like an inconvenient choice, now, in hindsight. But this kind of thing actually makes me realize that we can probably do it on the source side as well. But I'm pretty sure the source and the target side have very different properties. We wanted to make sure that okay we work on one, let's say, [indiscernible] at a time, right. Usually people just throw in the whole huge thing and then, you know, say that hey, we solved it, right. I'm trying to avoid that in this case. >>: What is the… >>: [indiscernible] >>: What's the vocabulary size actually? I think the character would be very useful for smaller. >> Kyunghyun Cho: In this case though we are using a three hundred and eighty character vocabulary on the target side. >>: No I mean the word vocabularies, if you just chose word-based. >> Kyunghyun Cho: Oh, so if we just use, you know, really just the blank space after the simple tokenization, the, like, the Finnish one just, you know, goes through the roof. >>: Would it be like sixty for most of the languages? >> Kyunghyun Cho: For, like, I think German, if we use a million then we can cover like ninety point nine. >>: No, I mean like for the character level isn't it like sixty or whatever the number of characters? >> Kyunghyun Cho: No, no, there are all those like the weird symbols from the web crawl, yeah. >>: Oh, okay. >> Kyunghyun Cho: Like three hundred and eighty was covering ninety-nine point nine, nine. >>: But it's still just like the, it's still just the [indiscernible]. >> Kyunghyun Cho: Yeah. >>: The implication of this is, I saw many papers dealing with the large vocabulary of things. Then with this it seems that you don't need those methods. >> Kyunghyun Cho: Yeah, I know. 
We had the paper last year… [laughter] >>: Do you think, are you considering using convolutions there on the target similar to [indiscernible]… >> Kyunghyun Cho: On the source actually, source, one of the students at NYU is currently working on that for the translation, and it indeed does help on the German to English translation. But the experiments are way too small in number. Then you know we need to actually work on it more to make it more concrete. >>: You do have a language model for word to word, right? >> Kyunghyun Cho: No. [laughter] No, we didn't even touch the monolingual. >>: No, so that [indiscernible] for the word to word, the [indiscernible] public source of the information is [indiscernible] use it for that. That of course I can see that [indiscernible] each other but ideally… >> Kyunghyun Cho: Oh, no, no, no actually, yeah. >>: Language model. >> Kyunghyun Cho: We tried to add the language model on top of that to do it. But the improvement we get by adding the, let's say, recurrent neural net language model is not that large, actually. >>: What is like the state-of-the-art for English-German? >> Kyunghyun Cho: Actually let me go back. >>: [indiscernible] >> Kyunghyun Cho: The state-of-the-art as in like we got a list of anything. >>: Was it like around twenty-five, right, or… >> Kyunghyun Cho: Yeah, I think I had a number here. English to German… >>: Okay. >>: [indiscernible] >>: Oh, I see. >>: That's amazing. >>: Question on just wall time. >> Kyunghyun Cho: Okay. >>: You kind of have a couple of effects going on, right. One is that your vocabulary size on the target side gets much, much smaller, so you get more efficient. But you do a lot more operations. >> Kyunghyun Cho: Yeah. >>: Because you're doing each character, right… >> Kyunghyun Cho: It's about two point five times slower, yeah. >>: Okay. >> Kyunghyun Cho: That's an unfortunate side effect, you know, but as I said earlier we are very patient people. >>: What's a small [indiscernible]? Your number is smaller with a character… >> Kyunghyun Cho: Yeah, but the length is now about [indiscernible]… >>: Oh, the length, I see… >> Kyunghyun Cho: Five to seven times longer depending on the language. Yes? >>: What's the last language there, so I'm not a… >> Kyunghyun Cho: Oh, Czech. >>: Czech. >>: Czech. >> Kyunghyun Cho: Yeah, so English… >>: Do you try to, do people try to do this kind of thing for Chinese? >> Kyunghyun Cho: Yeah, actually one of the students at NYU is Chinese. >>: Oh, okay. >> Kyunghyun Cho: Chinese seems to be slightly easier to do, at least on the source side. When we tried it last year with just characters it was doing already okay. Then we… >>: Chinese is easy. [laughter] I mean for the character [indiscernible]. >> Kyunghyun Cho: Right, right. >>: [indiscernible] reasonably strong, right. >> Kyunghyun Cho: Right, yes. >>: But [indiscernible] include it for English. >> Kyunghyun Cho: Yeah, that's true. >>: You have to go to strokes for… >>: No not by strokes… >> Kyunghyun Cho: Yeah, so… >>: Like just characters… >> Kyunghyun Cho: Yeah. >>: The speech people use the… >> Kyunghyun Cho: No, we want to go to the stroke or the sub-character level. Here you go. >>: Do you do strokes… [laughter] >>: No, no. >>: He had a slide ready for it. [laughter] >> Kyunghyun Cho: Yes, yes, thank you. [laughter] >>: Good timing Anthony. >> Kyunghyun Cho: That makes everyone wonder what is the ultimate level, right? 
For images, thanks to the convolutional neural net we went all the way down to the pixel intensities, right. We work at the pixel level in the images. Nobody works at the, I don't know, applying the SIFT or the Gabor filter first, nobody does that anymore. In the language how far can we go down? Some say bytes, so there was a paper uploaded to arXiv from Google saying that, you know, they do the POS tagging based on the Unicode bytes. Then it turned out it works, okay, good. Then they trained one model for, was it thirty-two languages, and then it worked. This is their model. Then I started to wonder, is Unicode really the ultimate level? I don't think so because for instance in Korean we have like consonants, vowels, you know, like about forty-four of them. Then we get one syllable by combining one consonant and one vowel, and optionally one more consonant. Like this, that's the first syllable of my first name. Then because of this combinatorial property we actually get more syllables, or the characters in the Unicode space, than the words. Then you know it just doesn't make any sense. Can we essentially decompose it into the sub-character level symbols and then work on it? Chinese is similar, so you can divide it into the radical and then, you know, the remaining part. The radicals are still Chinese characters. If we do it, we are working on that now for the document classification, the vocabulary size shrinks by about, I don't know, three fold. But the length grows by twice. I think we can actually manage that. >>: Are those… >>: How do you… >>: [indiscernible]… >>: [indiscernible] decode in small pieces. Can they be all together? >> Kyunghyun Cho: Sorry? >>: I mean the order, so you get three [indiscernible], right. You put them together. How do you code them? >> Kyunghyun Cho: Oh, so in Korean actually the location is all fixed. We can actually just, you know, put it as a list. >>: Oh, okay. >> Kyunghyun Cho: Chinese is a bit, yeah, is… >>: Yeah it's [indiscernible]… >> Kyunghyun Cho: Problematic, yes. Yeah [indiscernible]? >>: Those two words have the same meaning? >> Kyunghyun Cho: This one is Kyung. This is ghyun. It's two different syllables. >>: Okay. >>: Then six. >> Kyunghyun Cho: These don't have any meaning by the way. >>: Okay. >>: Oh. >> Kyunghyun Cho: It's just syllables. Alright, so actually this was only the first one, right. I still have some time, right, or do I? >>: Yeah, you're okay. >> Will Lewis: You're fine, you have half an hour. >> Kyunghyun Cho: The second thing is we've got to go beyond sentences, I think. Actually on this part I don't have any results for the translation. But let me tell you about the language model first. We decided to see how much we can gain by having a larger context when we do the language modeling. When we model the probability of the current sentence, will it help to condition the language model on the previous n sentences? Then the answer, even without running any experiments, is that, yes, it's not going to hurt. In the worst case you're going to have a zero weight on the connections to the previous sentences. Then it will just do the usual language model. But now if it helps, which is kind of true, then how will it help? Will it help by, I don't know, getting the author's writing style? Will it help by some magical thing? Turned out it actually helps by giving the language model a narrower set of vocabularies to choose the word from. It helps the language model narrow down the possible word set. We trained the model on a bunch of small corpora. 
The small corpora are probably the reason why this paper was rejected. We looked at how the perplexity changes as we increased the number of context sentences. We see that the adverbs, nouns, adjectives, verbs, all those open-class words, become more and more predictable. In other words, the larger context helps us get a better sense of the topic of the current document. Now, why is this important? I think the importance will show up in the dialogue translation setting. By knowing what kind of topics or themes the dialogue is about, the translation model is going to have a much easier time narrowing down, or putting more probability mass on, the likely words. That naturally leads to a better, or let's say more natural, translation. >>: What is… >>: Why does the, sorry go ahead. >>: What if the perplexity of the coordinating conjunctions and determiners goes up? >> Kyunghyun Cho: Yeah, that's a good point, right… >>: There's something that's going off the chart. >> Kyunghyun Cho: Yeah, that's actually an artifact of showing a percentage, and that's one of the reasons it goes off the chart. These function words turned out to actually get slightly worse. But my conjecture, which we are testing at the moment, is that it's because we fixed the size of the model even as we add more and more context sentences. The open-class words become more predictable, but the model has a fixed size, so a fixed capacity, and it kind of sacrifices the predictability of the function words. But these changes are actually really small; they look big only because I plotted them as proportions. >>: The overall perplexity gets better the way you currently evaluate it? >> Kyunghyun Cho: Yeah, overall it gets much better. Yes? >>: I'm wondering how this context is actually expressed in these networks. If you're looking further and further back in the text, do you really need a very complex model of context? Or does it actually reduce to just a bag of words? >> Kyunghyun Cho: Yeah, that's actually a super good point. Of course we love the attention mechanism, so what we initially tried was to encode each sentence as a bag of words, run a recurrent neural net on top of that sequence of bag-of-words vectors, and then feed it to, let's say, a recurrent neural net language model, with an attention model on top of that. The best choice is to run a bidirectional recurrent neural net over the context sentences and do the attention on top of that. That's the best. But the difference is minimal compared to just having a bag of words of every word that came before. That is the fastest and close to the best model. >>: You just sum, you average the embeddings of all the previous words? >> Kyunghyun Cho: Yeah, we learn everything. We learned everything from scratch. >>: I know it's all jointly learned. But in terms of the modeling it's just the average of the embeddings of all the previous words… >> Kyunghyun Cho: Yeah, yeah, yeah, exactly. >>: I think you could use the attention method. You just use the sequence of word embeddings and then the attention picks whatever the [indiscernible]. >> Kyunghyun Cho: Right, right, right, yeah. We can do that, yes, definitely.
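To make the simplest variant being discussed here concrete, the following is a minimal sketch of a recurrent language model conditioned on an averaged bag-of-words vector of the previous sentences. The module, layer sizes, and wiring are my own illustrative assumptions, not the architecture from the paper under discussion.

```python
import torch
import torch.nn as nn

class LargerContextLM(nn.Module):
    """RNN language model conditioned on a bag-of-words summary of the previous
    sentences. Illustrative sketch only."""

    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # The context vector is concatenated to every input embedding.
        self.rnn = nn.GRU(emb_dim * 2, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_ids, current_ids):
        # context_ids: (batch, n_context_words) -- all words of the previous n sentences
        # current_ids: (batch, t) -- the current sentence, shifted for LM training
        context = self.embed(context_ids).mean(dim=1)          # bag of words: averaged embeddings
        x = self.embed(current_ids)                            # (batch, t, emb_dim)
        ctx = context.unsqueeze(1).expand(-1, x.size(1), -1)   # broadcast context over time
        h, _ = self.rnn(torch.cat([x, ctx], dim=-1))
        return self.out(h)                                     # next-word logits at each step
```

Training minimizes the usual next-word cross-entropy; with zero weights on the context pathway the model falls back to an ordinary RNN language model, which is the "it cannot hurt" argument made above.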
But it turned out that simply having a bag of every single previous word worked really well. >>: But [indiscernible] you used every single previous word, but you still use attention? >> Kyunghyun Cho: No. >>: No. >> Kyunghyun Cho: Not after that. It's just going to be the sum of the word vectors… >>: Even within the current sentence? >> Kyunghyun Cho: No, over the previous sentences only. The current sentence is just [indiscernible]. >>: I mean the most obvious way is just to not clear your LSTM state every time, right? >> Kyunghyun Cho: Yeah, yeah. >>: Was that the baseline, like was that the first thing that you guys tried, just running it where you used… >> Kyunghyun Cho: No, we didn't just leave the state uncleared. We cut the text into sentences first and did it that way, because we wanted to be able to use it with other existing MT and speech recognition systems to test it out. >>: Right, cool. >>: How does this… >>: [indiscernible]… >>: You said earlier that the [indiscernible] language model doesn't help… >> Kyunghyun Cho: No, it helps, but the… >>: They're very small… >> Kyunghyun Cho: The improvement is small. But that's kind of understandable. We had this idea that adding a language model is going to give us amazing help, right. >>: Yeah. >> Kyunghyun Cho: Especially because that has been the case with the machine translation systems we used to run it on. >>: They also work? >> Kyunghyun Cho: I think the reason is that the translation model's decoder there was just too weak. >>: I see. >> Kyunghyun Cho: You get a lot of gibberish if you don't have any language model and just decode from the translation model. There, having a language model helps amazingly, because you can filter out a lot of wrong things, whereas these neural net models already do the language modeling implicitly in the decoder recurrent neural net. >>: [indiscernible] >> Kyunghyun Cho: So the improvement we get is not that dramatic. >>: [indiscernible] is very small. Why do you spend all the effort to focus on [indiscernible] here? >> Kyunghyun Cho: Oh, this. This is a step toward doing larger-context translation. >>: Oh, I see, okay, at the sentence level. >>: How do you train this? What are your examples? >> Kyunghyun Cho: The previous n sentences. >>: Okay. >> Kyunghyun Cho: And the current sentence. That's going to be one example. >>: How many sentences of context are you using, or was that in the… >> Kyunghyun Cho: Oh, yeah, we tried from zero to, let's say, eight. After four everything kind of plateaus. Okay, so the last thing: I think neural machine translation enables us to do very natural multilingual translation, because it's just a neural net, right. You plug in a lot of different things, move them around, plug in a lot of output models, and you get multilingual translation. Of course, why do we want to do that? That's a very good question that I don't have an answer to. I even started reading the second-language acquisition literature for humans. That field is kind of split fifty-fifty. Some say that learning more languages helps you get better language ability. Others say that in fact learning a second language when you're young deteriorates your mother tongue. I was like, okay, which one is true? For humans we cannot really do the controlled experiments in that case. You have to raise a kid for like twenty years and it's not that easy.
But for machines, especially for neural nets, we can do it. >>: When you do five different languages that's probably… >> Kyunghyun Cho: Yeah, exactly, teaching it is also very expensive as well. [laughter] Actually, people have thought about this already since last year. There was a paper on doing multilingual translation with the attention mechanism presented at ACL two thousand fifteen by [indiscernible] et al., from [indiscernible] or [indiscernible], [indiscernible] I think. What they did was, because this attention or alignment model looks like it's very language-pair specific, right, they said okay, we're going to start from English and go to multiple languages, and we're going to put in a separate attention mechanism for each target language. Afterward, [indiscernible] et al., from Stanford and Google, submitted a paper to ICLR on doing multi-way, multilingual translation, and also multi-task, as they did parsing as well. But in that case they removed the attention model and fell back to the basic encoder/decoder model, because that becomes much easier, right. You just map anything into a vector and then decode it out from there. At the same time, my collaborators and I had started working on this fairly early, from March last year or even before. But the thing is that this attention or alignment model was the problematic one. The first conceptual issue was: is this attention mechanism universal? Can we use a single attention mechanism for different language pairs? That's what we want. At least that's what I wanted, because I want the size of this multilingual neural machine translation system, the number of parameters, to grow only linearly with respect to the number of languages. But as soon as this attention or alignment model becomes specific to the language pair, the number of parameters grows quadratically. I just don't like that. Yeah? >>: Can you, in this task, is this the task where you assume that you have N-way parallel sentences, the same sentence? >> Kyunghyun Cho: No. >>: Okay. >> Kyunghyun Cho: No, it would be great to have that assumption. >>: Yeah. >> Kyunghyun Cho: Except we don't have much of that, right. We just assume that we have bilingual parallel corpora. >>: Okay. >> Kyunghyun Cho: Then I asked around quite a lot: do you think we can actually share one attention mechanism across different languages? Every single person told me no, that doesn't make any sense. Here comes another one of those priors we have, right. My prior was also telling me that it's probably not going to work. Then how am I going to do it? But we are deep learning researchers; we are very stubborn and patient, and able to train a gigantic model while drinking beer. We thought, let's try that. We were pretty frustrated for some time, because others were publishing papers on multilingual translation by simply avoiding this problem. I was like, oh, maybe we should have done that. But at the end of the day it turned out that it is possible. We trained a single model, or it's kind of a single gigantic model where only a subset is used at a time. It has one, two, three, four, five, six, yes, six recurrent neural net encoders and six recurrent neural net decoders, and a single attention mechanism.
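Structurally, the idea is roughly the sketch below: one encoder and one decoder per language, with a single attention module shared by every source-target pair. The class names, layer sizes, and the simple scoring function are my own illustrative assumptions, not the actual architecture of the model described here.

```python
import torch
import torch.nn as nn

class SharedAttention(nn.Module):
    """One attention mechanism reused across every language pair."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim * 2, 1)

    def forward(self, dec_state, enc_states):
        # dec_state: (batch, dim), enc_states: (batch, src_len, dim)
        q = dec_state.unsqueeze(1).expand_as(enc_states)
        weights = torch.softmax(
            self.score(torch.cat([q, enc_states], dim=-1)).squeeze(-1), dim=-1)
        return (weights.unsqueeze(-1) * enc_states).sum(dim=1)   # context vector

class MultiWayNMT(nn.Module):
    """Per-language encoders/decoders around a single shared attention.
    Illustrative sketch only."""
    def __init__(self, vocab_sizes, dim=512):
        super().__init__()
        self.embeds = nn.ModuleDict({l: nn.Embedding(v, dim) for l, v in vocab_sizes.items()})
        self.encoders = nn.ModuleDict({l: nn.GRU(dim, dim, batch_first=True) for l in vocab_sizes})
        self.decoders = nn.ModuleDict({l: nn.GRUCell(dim * 2, dim) for l in vocab_sizes})
        self.outputs = nn.ModuleDict({l: nn.Linear(dim, v) for l, v in vocab_sizes.items()})
        self.attention = SharedAttention(dim)     # the only component shared by all pairs

    def forward(self, src_lang, tgt_lang, src_ids, tgt_ids):
        enc_states, _ = self.encoders[src_lang](self.embeds[src_lang](src_ids))
        h = enc_states.mean(dim=1)                # simple decoder initialization
        logits = []
        for t in range(tgt_ids.size(1)):          # teacher forcing over target words
            ctx = self.attention(h, enc_states)
            h = self.decoders[tgt_lang](
                torch.cat([self.embeds[tgt_lang](tgt_ids[:, t]), ctx], dim=-1), h)
            logits.append(self.outputs[tgt_lang](h))
        return torch.stack(logits, dim=1)
```

The design point mirrored here is the one argued for above: with pair-specific attention, ten translation directions would need ten attention modules, whereas in this layout adding a language adds only its encoder and decoder, so the parameter count grows linearly in the number of languages.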
Then we trained on every parallel corpus from WMT fifteen. We had the parallel corpora going from English to all the other languages and from all the other languages to English. We trained this model for about three, four weeks. Yeah, it was very difficult. At this scale the model does not fit on a single GPU, so you have to parallelize the model over multiple GPUs. >>: But [indiscernible], right? >> Kyunghyun Cho: Yeah, everything. Not the monolingual… >>: [indiscernible] or you merged them together to have a unified internal representation of all the languages and then do that? >> Kyunghyun Cho: Yeah, so we don't. >>: I see. >> Kyunghyun Cho: We don't have a unified internal representation, because, let's say we have two sentences in English and Czech. The lengths are different, so we get a different number of vectors for each sentence. What is unified is the mechanism. >>: Okay. >> Kyunghyun Cho: We do not try to find a common vector space. Instead we found a common mechanism that connects across the different modalities. >>: But the attention only accounts for like point one percent of the parameters anyway. If everything else is not shared, what's the intuition that this is a good point to share? >> Kyunghyun Cho: Yeah, I've just been stubborn. Actually, this is one step. The next step we're looking at is, once we train this model, can we actually go for zero-resource translation? In that case sharing the attention becomes very important. We haven't tested it yet. We are trying to test it, but we lack GPUs and time. >>: Zero resource means going between a language pair that you don't have? >> Kyunghyun Cho: Yeah, exactly. >>: English to Finnish or something. >> Kyunghyun Cho: Yeah, let's say German to Russian directly, without… >>: German, yeah, German to… >> Kyunghyun Cho: Yeah, or anything. >>: The reason why you don't share the representation here is because of the varying lengths? >> Kyunghyun Cho: Yeah, we don't know how to do it yet… >>: [indiscernible] do it naturally [indiscernible] at the top. We just use the neural net [indiscernible] then it crosses over. Then an HMM takes care of the [indiscernible] length… >> Kyunghyun Cho: We could do that, but I'm not a big fan of HMMs, sorry. >>: Okay. >> Kyunghyun Cho: But, okay. We trained this model on these, let's say, ten language pair directions, and they are doing okay, comparable to having, let's say, ten single-pair models. We looked at both the log-likelihood and BLEU, and on certain language pairs and directions the multilingual model does better, sometimes worse. But one important thing is that they are doing comparably with a substantially smaller number of parameters, while sharing a single attention model across the ten different language pair directions. That's the important thing. But at the same time, there must be somewhere, you know, it works slightly better. That case is when we have a very small, low-resource language among the many language pairs. What we see is that this multilingual model gets better than, let's say, adding in more monolingual corpora to compensate for the lack of data. But the improvement, as you can see, is not that great. >>: Single, what is single plus [indiscernible]? >> Kyunghyun Cho: That's the single-pair model plus the fusion of a recurrent language model into the translation model.
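For reference, the simplest "fusion" of a language model into a translation model that I can sketch is a log-linear combination of the two models' next-word scores at decoding time. The function below is my own illustration of that general idea, not the exact formulation of the fusion paper being referred to; the weight value is arbitrary and would be tuned on a development set.

```python
import torch

def fused_next_word_scores(tm_logits, lm_logits, beta=0.2):
    """Shallow-fusion-style scoring of the next target word.

    tm_logits, lm_logits: (batch, vocab) unnormalized scores from the translation
    model and the external language model for the next target word.
    beta: weight on the language model (illustrative value, normally tuned).
    """
    tm_logp = torch.log_softmax(tm_logits, dim=-1)
    lm_logp = torch.log_softmax(lm_logits, dim=-1)
    # Beam-search candidates would be ranked by this combined score.
    return tm_logp + beta * lm_logp
```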
We add in the same amount of monolingual, it says English, but this should be English to Finnish, so that must be a typo; it's the target-side monolingual corpus, which makes it comparable to the multilingual model, and then we looked at how it works. We generally get better generalization with the multilingual model. But with this kind of low-resource language translation I think we are at the very beginning of tackling it. >>: What are the two numbers? >> Kyunghyun Cho: The first one is for the development set; the second one is for the test set. >>: Yes, so [indiscernible] have very different alignment, right? >> Kyunghyun Cho: Yeah. >>: What can you… >> Kyunghyun Cho: I know. >>: Maybe all the six languages that you talk about have a single kind of [indiscernible]… >> Kyunghyun Cho: Ah, no, no. >>: Japanese there is probably the score… >> Kyunghyun Cho: Yeah, but already because of Finnish everything [indiscernible]. >>: Okay. [laughter] >> Kyunghyun Cho: Yeah, and you know I still didn't fill in the Finnish. [laughter] Alright, so I believe the new kind of territory of machine learning, oh okay. >>: Sorry, one last question. You mentioned zero resource. You're talking about, like, [indiscernible] to Finnish, a pair that you don't have? >> Kyunghyun Cho: Yeah. >>: What if you added a new language that you don't have? Do you need some parallel data, or can you add a language with just monolingual data? >> Kyunghyun Cho: That I do not have any answer to yet. But one possibility: recently from Edinburgh, again, Rico [indiscernible] and others there showed one easy way to incorporate a language model, or monolingual corpora, into this neural machine translation model. You just translate the monolingual corpus with another translation model. You make a pseudo-bilingual corpus, mix it in with the original bilingual corpus, and apparently that helps by making the decoder's implicit language model better. They show quite an improvement. When we have, let's say, a new language, I think we can do something similar: go through another language to get a few bilingual parallel sentences and then just fine-tune a little. This is the very latest work. We just got the acceptance notification yesterday from [indiscernible]. We are still working on it. Yeah, this is super new. >>: Is the size of the attention model similar to the size of the usual [indiscernible] monolingual… >> Kyunghyun Cho: Same size, same size. We couldn't afford to do a lot of, let's say, hyper-parameter search. We just trained, let's say… >>: Did you use dropout? >> Kyunghyun Cho: Sorry? >>: Did you use dropout? >> Kyunghyun Cho: We didn't use dropout. >>: Do you think it would help, because you need… >> Kyunghyun Cho: Well, when we used dropout on the attention and the target-side output, it helped for the very small corpora. But in the WMT setting we didn't really see much difference. Yes, [indiscernible]? >>: It's a high-level point or question. It's related to the two extremes you're talking about. One is encoding an entire sentence, or maybe multiple sentences, into one vector, which is problematic because you don't have structure in those vectors and you can't analyze them. Maybe it's embodying all kinds of things, but you would like to have some structure there. >> Kyunghyun Cho: Yeah. >>: The other one is to mimic the one-dimensional structure of the sentence.
But the alignment is deployed in a one-dimensional way. However, that one-dimensional structure of speech is an artifact of our physical embodiment. >> Kyunghyun Cho: Yeah, and of the universe. >>: If you had mind-to-mind communication you wouldn't go through it. You would just train the network directly, mind to mind, right. Then what is the real intermediate representation? Why is it fundamental? Should you really be working on a one-dimensional representation? Or should that intermediate representation be different? >> Kyunghyun Cho: No, I think the reason for the one-dimensional thing is that it's language. As you said, language is not something that somebody designed from the start, right. It has evolved together with humans, and it has adapted to, essentially, the universe, right. Time flows in one dimension, and we have only a single vocal cord and, well, actually we have two ears, okay, that's slightly different, but okay, right. [laughter] My belief, and there is no theoretical foundation here whatsoever, is that for language-related tasks this one-dimensional thing is very natural, because many of the intellectual activities we do are based on language as well. Of course, even when we use the internet, where the bandwidth is amazing and we send a lot of things, eventually, when we consume it, the information transferred via the internet or whatever becomes very one-dimensional, because that's the only way we can do it. >>: [indiscernible] >>: It's a matter of dimension. That's why… >> Kyunghyun Cho: Yeah, it's slightly less one-dimensional. But eventually we read it in… >>: I see. >>: It's a bottleneck. That one dimension is a bottleneck. >> Kyunghyun Cho: It is a bottleneck. >>: But there is an expansion, though; in the mind it expands and it occupies a different space, a different geometry. >> Kyunghyun Cho: Yeah. >>: That's why kids send pictures, not text. >>: Right, well... [laughter] >>: No, that is the reason. >>: Also, I mean, you have dyslexic people who can't force themselves to do this. They still understand things. >> Kyunghyun Cho: Yeah, so what is the correct representation? Should that representation in fact correspond to our understanding or intuition about representation? I would say that is a huge open question that I cannot really answer easily now, but we can talk about it. >>: You're mapping surface forms here. You're not mapping meanings at all. I mean, we're talking about our representations. Clearly we're not doing this kind of mapping. >> Kyunghyun Cho: Maybe, maybe not. [laughter] >>: I would argue that we probably don't even think completely linearly… >> Kyunghyun Cho: I will just finish up; I have, let's say, one and a half slides left. It turned out that the very same model can be applied to other modalities as well. We tried the same thing. It was Christmas week in two thousand fourteen. There was a snowstorm in Montreal, so nobody was able to go out. I was at home and just decided that, with the very same code, I'm going to cut out the encoder recurrent neural net and plug in a convolutional neural net there, and then train this model to do image caption generation instead. Now, many people have done this by just getting one vector out of the image and then decoding out the caption. But because, you know, we love this attention mechanism,
I decided to use the last convolutional layer, which gives me a set of vectors that already preserves the spatial information. And it worked, yes. Then I realized: wait, if that worked, I can probably push it even further. These are some examples, but examples are examples, so I'll just pass over them. We decided to do it on video: feed the video in and let the model describe it. Of course, video description doesn't really work, I can tell you; it doesn't work, but it does something interesting. These were the input frames, four of the roughly thirty frames we fed in, and this is the generated description: someone is frying a fish in a pot. Not sure whether there's an actual fish or not, because the reference doesn't say. It was doing something, right. These are the attention weights put on those frames. Then we thought: wait, speech recognition is translation. You translate from speech to text. We applied the model there and it works. This is the attention weights we get after training this model for some time. There are a number of people at the University of Montreal who are pushing in this direction, using the very same model we have and making variants of it for speech recognition. I'm not working on it myself. It turned out there are a lot of other applications of the very same model with a minimal set of modifications. I wrote a review paper last year, which turned out to be way too early, but it has an extensive list of the applications possible with this model. Okay, thank you. [applause] >>: In the speech recognition example were you generating… >>: I'm raising my hand. >>: Sorry. >> Kyunghyun Cho: Okay. >>: When you say that the attention mechanism worked for caption generation, you mean worked as in it outperforms something without attention but still using a convolutional input? >> Kyunghyun Cho: Yes, yes. Compared to almost the same thing without attention, we get a better score. Interestingly, that improvement was not captured by the automatic evaluation measures such as BLEU, METEOR, or whatever… >>: Right, right, you did it with human… >> Kyunghyun Cho: But it showed in the human evaluation. We actually submitted this to the MS COCO Challenge. We were like eighth, ninth, eleventh according to BLEU, CIDEr, all those measures. We were like, oh no, we should have spent more time to make an even larger [indiscernible] sample or something. Then after the human evaluation we went up to second or third place. I think it's because it can generate a much more natural caption, a caption that is more tied to the image. That's what I believe, but… >>: Yeah, theoretically I agree with you. But we haven't been able to see the same kind of improvements as you guys. >> Kyunghyun Cho: Oh, I see. >>: Interesting, thank you. >> Kyunghyun Cho: Alright, you were saying something? >>: Oh, in the speech recognition are you generating characters or phonemes, or what? >> Kyunghyun Cho: Back when I was still kind of part of the speech team in Montreal we were just going for phonemes. But nowadays I think they are going for words or characters, yeah. >>: But they have to have a separate [indiscernible] in order to do that, right? >> Kyunghyun Cho: No. >>: No… >> Kyunghyun Cho: No, that's the thing: they just put the very same model, the recurrent neural net, on top of that. >>: I see.
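Going back to the caption-generation setup described at the top of this exchange, attending over the grid of feature vectors from the last convolutional layer can be sketched roughly as below. The layer sizes, the state initialization, and the scoring function are my own assumptions for illustration, not the configuration of the model from the talk.

```python
import torch
import torch.nn as nn

class SpatialAttentionCaptioner(nn.Module):
    """Caption decoder attending over the spatial grid of convolutional features.
    Illustrative sketch only."""

    def __init__(self, vocab_size, feat_dim=512, hidden_dim=512, emb_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.init = nn.Linear(feat_dim, hidden_dim)          # initial decoder state
        self.score = nn.Linear(hidden_dim + feat_dim, 1)     # attention scorer
        self.cell = nn.GRUCell(emb_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):
        # feats: (batch, H*W, feat_dim) -- last conv layer, flattened over space
        # captions: (batch, T) -- previous words, teacher forcing during training
        h = torch.tanh(self.init(feats.mean(dim=1)))
        logits = []
        for t in range(captions.size(1)):
            q = h.unsqueeze(1).expand(-1, feats.size(1), -1)
            alpha = torch.softmax(
                self.score(torch.cat([q, feats], dim=-1)).squeeze(-1), dim=-1)
            ctx = (alpha.unsqueeze(-1) * feats).sum(dim=1)   # attended image context
            h = self.cell(torch.cat([self.embed(captions[:, t]), ctx], dim=-1), h)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                    # (batch, T, vocab) word logits
```

The same loop works for video or speech by swapping the feature grid for per-frame feature vectors, which is the "speech recognition is translation" point made above.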
>> Kyunghyun Cho: They are not better; the latest one so far is not better than CTC, we are on par. Yeah, we are on par. >>: [indiscernible]. Somehow we have to do something beyond [indiscernible]. >> Kyunghyun Cho: Yeah, a language model. Yeah, a language model helps. >>: Yeah. >> Kyunghyun Cho: I think it helped here as well. Yes, [indiscernible]? >>: Has anybody tried speech to speech, like you just say something in a different person's voice? >> Kyunghyun Cho: Yeah, I think we can do that. It's just that there isn't much data we can easily access in the public domain when it comes to speech to speech. >>: Do you think about how to do things that can remove the requirement of having large [indiscernible] data? >> Kyunghyun Cho: Yeah, that's… >>: [indiscernible]… >> Kyunghyun Cho: That is exactly the reason why I'm really interested in this multilingual model. I told you about the caption generation and video description generation, right. That was exactly the same model we used for translation. I want to plug in the image here, plug in the video here, plug in the speech here. By having this kind of multi-modal, multi-task setup, I think that is the way to go for tackling, let's say, low-resource tasks. That's my view. Of course Yoshua and Yann disagree with me heavily on that point. But, you know, disagreement is the fuel for innovation… >>: They keep talking about how important it is to [indiscernible]. Do you have any ideas about whether what they are talking about may be applicable to… >> Kyunghyun Cho: Yeah, I believe that unsupervised learning is going to be important as a part of, let's say, multi-task learning. The thing is that in unsupervised learning nobody knows how to do it, right. That's the thing. We usually view unsupervised learning as probabilistic modeling of the input distribution, but nobody actually knows whether that is the right way to view it, right. There can be completely different ways to view it. One of the views that I personally like is unsupervised learning as predicting the motor control given the observation: something changed, how did it change? Maybe that is the way. Or predicting the future; that's Yann LeCun's view, it's all about predicting the future. Yoshua is more on the probabilistic modeling of the data distribution. But who knows. Yeah? >>: The attention model sort of seems like semi-supervised learning, do you think… >> Kyunghyun Cho: Yeah, it's not semi-supervised, but I prefer to call it weakly supervised, which is a really weird term. But yeah, I think we should call it weakly supervised. >>: [indiscernible] TED Talks, IWSLT… >> Kyunghyun Cho: Oh, yeah, that one we tried as well, for this deep… >>: How much better do you do compared with Stanford? Stanford has a huge amount of… >> Kyunghyun Cho: I know. Okay, on that we haven't really tried seriously. We tried it for the fusion, the language model paper, but the corpora there are tiny. Whatever amazing results we get on those corpora, I don't think we can actually tell whether that is because the model we used was good, or because of, you know, some automatic regularization from not being able to translate. That might have been the case. I think those corpora are just too small to say much about it, yeah. Okay, thank you.
>> Will Lewis: Thank you. [applause]