>> Arul Menezes: So this is Barret Zoph. He is a graduating undergraduate at USC working with Kevin Knight. So he had -- he's been working on machine translation. He has two talks for us. One of them was an EMNLP paper last year that was a full oral presentation, and then the next one is a NAACL paper that he has submitted about neural translation. >> Barret Zoph: Great. Yeah, thank you for the introduction. Okay. Yeah. So today is kind of a two-part talk, but they both kind of have an MT flavor to them. So the first one is an EMNLP paper I had on how much information a human translator adds to the original. And then the next talk, or the next part of the talk, is on multi-source neural network machine translation. Okay. So the goal of this paper was we asked the question of how much information does a human translator add to an original text, and we tried to provide an information-theoretic bound through text compression. We also aim to kind of provide an alternative, unambiguous way to see how well we can identify and exploit patterns in bilingual text. And we would hope that this has kind of advantages over other traditional methods such as perplexity, which is very sensitive to how big your vocabulary size is, how you deal with unknown words and various things like that. And we also hope to bring ideas from the compression field into natural language processing and vice versa, because there's a lot of different things that the people in either field don't do that the others do. Okay. Yeah. So then if we have a source text, we're asking the question of how many more bits of information, given that source text, are needed to specify the target text, which is a translation of the source. Okay. So just a little background on text compression. So the main goal of text compression is to kind of exploit redundancy in human language to be able to store documents much more compactly. So essentially the more patterns you see and the better you understand the text, the better you can represent the text -- the better you can compress it. So, you know, a very trivial example would be imagining we just had a big file with one billion 7s. We could take this file, which would be very big, and we could presumably compress it to just a few bytes by writing a small piece of code which, once we run the executable, reproduces the original document. Okay. So just another example of this is, like, if our goal was to compress maybe the first million digits of pi, then we could also try to write a very short piece of code that can do this. And once again, you know, how small we can make the program would really express our understanding of this sequence. Okay. So yeah. So then bilingual text compression. So, you know, since text compression deals with exploiting redundancy in documents, a natural extension of this is to think about compressing bilingual documents, since, you know, ideally there would be even more redundancy. So, yeah, the following quote from Nevill and Bell kind of shows our motivation for this. You know, so they said: from an information theoretical point of view, accurately translated copies of the original text could be expected to contain almost no extra information if the original text is available. So in principle, it should be possible to store and transmit these texts with very little extra cost.
So this is kind of like a little bit overly optimistic, but, again, this kind of shows our motivation for approaching this problem. Yeah. So the quote does kind of show our motivation, but it's clearly not as trivial as that. As we can see, you know, here is a Chinese sentence and then 12 different translations of it. So clearly, you know, there is some information being added when someone does translate a sentence from one language to another. Okay. So, yeah. So then by exploiting statistical patterns in bilingual text, we want to be able to answer the question of how much information does a human translator add to the original. And there's many different ways you could try to approach getting a bound like this. And we go about getting this bound through bilingual text compression. And the scheme used for determining a valid entry is the same as in monolingual compression in the various benchmarks there are, like the [inaudible] and things like that, where a valid entry is an executable that prints out the original bilingual text byte for byte. So what exactly does this mean? It means that this executable should contain everything you need so that, once run, it can re-extract the text exactly byte for byte. Yeah. So, again, as before, the top line is the kind of rule that we need to follow. And then, yeah, so then, you know, any decompression code, dictionaries, or other resources being used to help compress the text must be embedded in the executable. So we can't make any assumptions about having access to the Internet or various things like that. Everything needs to be put into that executable. Okay. So let's just look at a visual diagram to see how this process works. So for monolingual compression, if we have file 1, we compress it, we get some file1.exe. And this file1.exe should contain everything needed so that when we decompress it, we get file 1 back byte for byte. So file1.exe -- we're going to try to make that as small as possible in monolingual compression. And then in bilingual compression, we have file 1 and file 2 being translations of each other. What we do is we are compressing file 2, but we are allowing ourselves to look at file 1 in the process. So then we get some executable file2.exe. And then when we decompress it while looking at file 1, we should be able to get back file 2 exactly byte for byte. And now we would think that, you know, since we have access to file 1, we should be able to compress this much more. And then, yeah, going back to the title of this paper, our goal was to answer how much information a human translator adds to the original. And we specify this as the size of file2.exe over the size of file1.exe. So, you know, imagine if file 2 was exactly the same as file 1, we would really need to specify nothing more, so essentially what would end up happening is it would be 0. So there would be no information added. But if file 2 was a seemingly just random file, then the compression sizes would be about the same, leading to a hundred percent. So we know it's probably not 0 or a hundred, but it should be somewhere in between. And we were trying to figure out where that lies -- is it, you know, 10 percent, 35 percent, 70 percent or somewhere along there. Okay. So in this paper, we use Spanish and English bitext. And what we're doing in this is we are looking at a Spanish translation and then trying to compress the English as much as we can. Okay.
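(As a rough illustration of the quantity just described, here is a minimal Python sketch; the file names and the helper function are hypothetical, not part of the talk.)

    import os

    def information_added(file1_exe, file2_exe):
        # Upper bound on the information a translator adds: the size of the
        # bilingual self-extracting executable (target given source) divided
        # by the size of the monolingual one.
        return os.path.getsize(file2_exe) / os.path.getsize(file1_exe)

    # If file 2 were identical to file 1 the ratio would approach 0;
    # if file 2 looked random given file 1 it would approach 1 (100 percent).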
So for the dataset we used in this paper, we use the Spanish/English Europarl corpus. The Spanish is left as UTF-8. And for the English side, we removed all accent marks and further eliminated all but the 95 printable ASCII characters. But the main goal in this is that we wanted to be able to compress the data completely as is, so this wasn't, you know, word tokenized, we didn't separate the period off the end of the word or anything like that. So this, again, goes back to one of the original points I made, that this could be a very good metric for evaluating how well you understand a text, because you don't have to do any kind of tokenization or anything along those lines. You're completely compressing it as is. So it's not sensitive to those kinds of things. And we don't, you know, remove any rare words or anything like that. We just leave the data as is. Okay. And then, yeah, here are some statistics about the data. Most important is probably the sizes, as the goal is to get that size as small as possible once compressed. Yeah, so the goal again is we're just going to try to make that as small as possible. Okay. Okay. So yeah. So a little bit about monolingual compression. So compression captures patterns in data, and so does language modeling. Yeah. So the goal in compression, as previously mentioned, is we seek an executable that's able to get as small as possible and is able to exactly print out the original text byte for byte. And in language modeling, we seek an executable that assigns low perplexity to some held-out dataset. So at first glance, these might seem like very different problems, but there's actually huge similarities between the two once the underlying algorithms are examined. Okay. So one method that kind of links the two together is one that can take a probability distribution and have it lead to a good compression. So if you have some text and if you have a good model, one way that you can convert this distribution into a good compression is through the very famous method of Huffman coding, which I'm sure many of you have heard of before. So for this let's imagine we have a five-word alphabet, and these are just each of its unigram probabilities. What we can do is we can sort them from least frequent to most frequent. And what we do is we simply start building a tree by merging the two smallest values and creating a new value here. And, again, this is an available value now, and then we just choose the two smallest again, choose the two smallest again, and then we do this again. And then we simply add 0s to all the left sides of the tree and 1s on the right sides of the tree. And then when we start at the top and we traverse the tree down to one of the leaves, we then get the code for this word. And then what ends up happening in this method is that more frequent words get shorter codes, which then leads to much better compression rates. So the most frequent word gets the shortest code and things like this. So then you can see that if you have a distribution that well captures the frequency of these characters in a document, you'll get a good compression rate. Okay. And then another quick note is that this code table must actually be stored in your executable, whatever you have.
Because when you go to decompress, you need to be able to know what bit sequences correspond to what words, so that you can extract the text exactly byte for byte. Okay. But Huffman coding certainly has some downsides, both theoretically and, you know, it's pretty easy to see with a quick example. So imagine, going back to this, that we only have two words in our alphabet, just A and B, but it's very skewed. So the probability of A is, you know, something like 0.999 and the probability of B is 0.001. Then going back to when we were constructing a Huffman tree, what would end up happening is we would still have to have two branches, and each one would get assigned a one-bit code. So it's very bad, and you don't really get to exploit heavily skewed distributions. Along with various other problems coming from that too. So arithmetic coding -- this method, arithmetic coding, gets rid of this restriction in a very beautiful kind of way. And the very nice part about arithmetic coding is that it allows you to take any model and convert your good predictions into a very good compression. So essentially what you want to do is just have the best model you can, and then you can immediately translate this into a very good compression. Okay. So let's kind of walk through exactly how this method works. And also feel free to ask questions during this if you need clarification on anything or things like that. Okay. So for arithmetic coding, let's see what we do. We produce these context-dependent probability intervals. And each time we observe a character, we move into that interval. Okay. And our working interval becomes smaller and smaller, but the better our prediction, the wider it stays. Okay. So let's say we're going to try to compress "the" followed by the underscore, and that our total vocabulary only has these six characters, and our model is going to be like a bigram character model. But of course the model could be anything. Just for this we're going to use a bigram character model. Okay. So then we get something that looks like this. So these are the context-dependent probability intervals. So how arithmetic coding works is the following. So at the start of when you're compressing, you'll have an interval that begins at 0 and ends at 1. And then we have our six different characters in our vocabulary. So we will have intervals for all of them. And then the size of these intervals corresponds to the probability that your model is giving at this time. Okay. So what ends up happening is that if we're compressing this example, we first -- we're going to condition on the start symbol because we're starting or whatever. And what ends up happening is that we are compressing T, so then we move into T's interval. So if this value here is 0.8 and this value up here is 0.95, we then move into this interval and expand it. So now .8 is here and .95 is here. Okay. So then next we compress H, where we now move into this interval. If this was .89 here, and this is .83 here. We move into this interval here. And then next we're going to compress E. We're going to move into the range for E. Here, here. And then these intervals keep getting smaller and smaller as you compress the document. And then a big chunk of the executable for this compression is going to be the smallest bit sequence to lie in this interval. Okay?
And the smallest bit sequence to lie in this interval corresponds to this number, which is 110111. And now an important thing to note about this is that if you have a really great model, you'll be able to allow these intervals to be much bigger. So if you could better predict T at the start, hopefully you could have, you know, maybe the top of T here and maybe the bottom of T here, and so forth at each point. And the nice thing is that if you have a better model that can better predict what character is coming next, then at the end you'll be left with a much larger range. And if you are left with a much larger range, you will be able to get a smaller bit sequence that is able to lie in that range. So it's kind of like a very beautiful method that allows you, if you have a really good model, to get better and better compression. And it's also nice because essentially what it allows you to do is you could have whatever model you want and then you can just plug it in to try to get good compression. Okay. And then so once you have this -- so, yeah, this would essentially be put into your executable along with the order of the characters here. And then what ends up happening is when you're decompressing, you would have this bit sequence and you would essentially see a 1 first, so you would be like, okay, you know that it's going to be in the upper half here, you'll see the next one here, and then you can keep narrowing down the interval. And then you can reconstruct what text it was. >>: So for that bit sequence 110111, some subsequence of those binary digits comes from each of these or exactly one? >> Barret Zoph: No, it can be -- it can definitely be more than one too. So when you're traversing it down, it's not always that you can make one character jump per bit. You can certainly narrow down the range, and when you're decompressing, you can figure out which characters were set there. >>: So is it fair to say there is essentially a Huffman coding associated with each of these intervals? >> Barret Zoph: Is there a Huffman encoding? Well, what would end up happening is that you would actually -- so this is shifting adaptively. So if you were to use like a Huffman encoding scheme, you would then need to store each of those individual ones. So no. So no. There's actually not always a Huffman encoding scheme that can do that, to answer your question. So what ends up happening is with Huffman encoding you get an entropy bound where, per character you produce, you can be between 0 and 1 bits above optimal, and arithmetic coding allows you to actually achieve optimal compression. So there's not always a Huffman encoding scheme that leads to arithmetic encoding. >>: I guess the question is what is the corresponding binary sequence on that number. >>: Yes. >>: I think it's 01 is the [inaudible] is that correct? >> Barret Zoph: Yeah. Exactly. Yes. So yes. So this would be, you know, like -- if you put 0. in front, this is the binary representation of this decimal number. >>: But if you -- so if you had a bunch of things that were -- had .99 probability and you had ten of them in a row, how could you incorporate that in one byte, let's say, or one bit? What would that -- that would be -- because, I mean, presumably, then it would want to like -- the [inaudible] would be like encode that [inaudible] distribution, have a bunch of those.
But with a higher probability, you'd want like -- you'd want it in like one bit, right? >> Barret Zoph: Yeah. >>: So what would -- how could you get multiple things in a single bit? >> Barret Zoph: Well, I mean, if you did have something where all of these were scrunched down and you could really go something like that big, it would just be one bit. But the only thing you would need to encode is the order of these characters, such that you could then re-extract it. That is all you need. So you could get something very small. >>: Perhaps, Barret, you might want to walk through the inverse process of decoding, because that might spell out what goes on. Like if I had that long sequence of As, then I would repeatedly rescale upwards. So I could take like one bit, that would be right in the middle then. That would encode a relatively long sequence of As, right? Because the first -- >>: But how does it know when to stop decoding? When there's a new sequence -- how does it know when to stop? >> Barret Zoph: Yeah, exactly. So there is an end-of -- there is a stop character. It's just not shown here, but it -- >>: [inaudible] how do you know how many bits to consume? >>: If you have a million -- if you have a -- yeah. If you have a million-character thing, it wouldn't be encoded as a single -- as like a single probability, like -- or would it? >> Barret Zoph: Well, if you -- so if you had a million characters, right, you would keep traversing this, you'd keep using this process, and then at the end you would be left with some -- >>: Oh, so it really would be a single thing -- >> Barret Zoph: Yeah. That is what your executable is. So you could have a -- like a 10-gigabyte executable. But, I mean, then the size of this is just this long bitstream. >>: Okay. >> Barret Zoph: And yeah. Okay. So hold on. Let me just go to the -- okay. And then, yeah, so then these interval widths are changing because the context is different and also because the model -- oh, yeah. >>: If you have only one bitstream, how susceptible are you to errors? If you have like one little error [inaudible] and one bit could flip, could you lose the whole text? >> Barret Zoph: Yeah, you could, if you don't use any kind of redundancy. Yeah. So it is very sensitive in this range. Within some margin of error, sometimes you can still get it. But those are errors. You can certainly lose the whole thing at a certain point by kind of getting off into a skewed range. You certainly can. Okay. And yeah. And then so maybe this will also clarify kind of how this is working. So also at the start we don't want to include any initial counts in the executable. So we start everything at uniform, for example. Then what ends up happening is that as we keep compressing, we will update the counts in the same way for the encoder and decoder. So at each point the compressor and decompressor have the exact same probability distributions. And this allows us to not have to store any big unigram counts, bigram counts or anything. We can just start it uniform. And then after we see each character, we update the count for that, and we just keep building the model as we compress and decompress, such that the encoder and decoder have the same probability distribution at each time step. >>: So the probability distribution is per file, it's not per language? >> Barret Zoph: Yeah, it's per file. And it's adaptive. So after we see a character, then, okay, we'll give count to that one.
And if we, for example, are using like a seven-gram context, we'll update the count for that seven-gram context, six-gram, five-gram, four-gram. So then, yeah, so in the beginning when we're compressing, we're not doing very well because we don't have a very good model. And then as we get farther along in the document, we do much better because we have a lot more counts. And then, yeah, so then the only other thing we need to do, since we're building up these models adaptively, is that we just need to store the total alphabet and the order of it, such that the compressor and decompressor know that. Because if you didn't know the order of these things, then it would be like, oh, well, we would know it was in the upper half, but we wouldn't even know what characters that corresponded to or things like that. Okay. So, yeah, then there's this method called prediction by partial match, which is probably the most famous text compression scheme, which is used a lot of times when you zip a file, if it detects it's like an English text file or something like that. And, yeah, how it works is actually pretty simple once you see this. So you can see that there's a really strong link between compression and language modeling. So if you have a good model, we can just feed it into this arithmetic coding scheme, and you'll get a very good compression. So how prediction by partial match works is you just have a nine-gram character model that is adaptively being built up as you compress and decompress the text. And it's being smoothed with Witten-Bell smoothing. That's all it is. So it's very -- it's just -- essentially just a language model. And, yeah, we were building up the counts for the language model as we compress and decompress the document, in the picture like before. So the English compressor. So what we're going to do is we're going to try to compress the English looking at the Spanish. So we're going to use the arithmetic coding scheme. And for the model we're going to now predict the probability of the Jth word given the Spanish translation and the previous English words seen so far in the English sentence. >>: Did you mean to say words here? Because everything previously you were doing characters. >> Barret Zoph: Yeah. So let's -- oh, yeah. Okay. So let's back up. Let's back up. So we were doing characters before, but we could have easily had that interval range over words. Right? There's nothing that specifies. There's just many more things there. But, yeah, so let's -- yeah, let's -- that's an important point. So let's just back up to words here. But these could also very well mean characters too. Yeah. So these could very well mean characters too. We're just kind of seeing everything we've seen on the English side so far, and then we get to look at the whole Spanish translation. Yeah. So, for example, the Spanish sentence, and then, yeah, if we're doing characters, "I should like to" -- and then yes, so we get to use this whole context and all of this to produce the next. >>: So just to follow on there, so instead of the 26 letters you will have the 10,000 words or hundred thousand words? >> Barret Zoph: Yeah. If you're using words, then you have, you know, your whole -- all the words that could possibly occur.
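(Here is a simplified Python sketch of the adaptive arithmetic-coding idea described above. It uses a plain unigram character model rather than PPM's smoothed nine-gram model, and floating-point intervals rather than the integer renormalization a real coder needs, so it is only an illustration of the scheme, not the actual system from the talk.)

    class AdaptiveModel:
        # Starts uniform over a fixed, shared alphabet; encoder and decoder
        # apply identical count updates, so no count tables are shipped.
        def __init__(self, alphabet):
            self.alphabet = sorted(alphabet)        # shared symbol order
            self.counts = {c: 1 for c in self.alphabet}

        def interval(self, symbol):
            # Return the probability sub-interval [low, high) for this symbol.
            total = sum(self.counts.values())
            low = 0.0
            for c in self.alphabet:
                width = self.counts[c] / total
                if c == symbol:
                    return low, low + width
                low += width
            raise KeyError(symbol)

        def update(self, symbol):
            self.counts[symbol] += 1

    def arithmetic_encode(text, alphabet):
        model = AdaptiveModel(alphabet)
        low, high = 0.0, 1.0
        for ch in text:                             # text ends with a stop symbol
            lo_frac, hi_frac = model.interval(ch)
            span = high - low
            low, high = low + span * lo_frac, low + span * hi_frac
            model.update(ch)                        # the decoder mirrors this update
        return low, high   # any number in [low, high) identifies the text

A better model leaves a wider final interval, and a wider interval can be named with fewer bits, which is the point being made above.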
>>: So were you playing with the concept of saying let's use the top 8,000, and with that we know we cover 90 percent or 95 percent of the text, and [inaudible] like oxymoron, used once every 10 million -- I don't care about this word, I take it out. And thereby reducing the number of words. And I assume, then, you would see an increase in the [inaudible] ratio. >> Barret Zoph: Yeah. So, yeah, you certainly do get that. But then you won't actually be able to -- when you run that executable, you won't re-extract the text byte for byte, right? Because then you won't know what words to put there. So you'll be left with some -- and our goal was that we're going to compress this document as is. We're not going to replace any less frequent words and stuff like that. So, yeah, for this we have to deal with everything. That's our -- >>: So you ship with your document the actual words of the document? >> Barret Zoph: Yeah. All the unique words. And that would be in that arithmetic coding. We ship all the words and we ship the order they're in. That's all we need. Okay. Okay. Yeah, but so for this bilingual model [inaudible] we'll do the word level first. So, yeah, so EJ is going to represent a word. So if we're trying to predict the Jth English word and we know it translates to the Spanish word -- some Spanish word, then we can probably make better predictions. This was kind of our intuition, and I think it makes a lot of sense. If we know what word it translates to, we can probably have a sharper distribution for what word that is going to be. So we initially thought that Viterbi alignments could be useful for the bilingual compressor. So what we were going to try to do is, for the text we're compressing, we were just going to run it through like an HMM alignment model and we were going to extract the Viterbi alignments, and then we were going to store these in the executable. So we were going to take a hit for the cost, because we were going to have to put these in for the compressor and decompressor to both have. Okay. So yeah. So when decompressing, we also need to give the decompressor the alignments. So therefore we also tried to compress the alignments to reduce the hit we were going to take. So we tried to store the alignments in two different formats: the absolute format, you know, where for this target index we store what source word it aligns to, and then the relative format, where just the relative offset is being stored. Okay. So then we looked at, you know, how compressible these alignments are -- can you actually kind of predict better than random what alignment is going to come based on the previous alignment? And you actually can, pretty well. So if we're looking at the relative alignments here, what's the probability of getting a plus-one alignment given you've just seen like a minus-two alignment? You can actually predict this stuff pretty well. So if you've just predicted a previous two, what's the probability now you're going to predict a three, in this case. And so we look at the sizes here of the uncompressed alignments just being stored in plain ASCII. We then try to encode them using like the Huffman encoding method. We also tried to compress them using the prediction by partial match method. And even when we compressed them down, we got a size of 12.4 megabytes, which was way too large for us to be using them. So we ended up not using them. We found that the 12.4 megabytes was way too large.
Because in the end we end up getting down to, like, the 30s -- 30-megabyte size for everything. So this was way too big. Even in our initial experiments. >>: But did you -- I mean, before you tried to compress the alignments, did you find whether they were useful at all in compressing [inaudible]? >> Barret Zoph: So we -- in like some preliminary experiments, yeah, we did find that it would help. So this actually leads to my next point here, which was that since these alignments were so big, we were just going to try to build up a probabilistic dictionary on the fly. So we were going to try to learn these alignments as we were compressing the text. So, yeah, we were going to try to use like a T table from one of the IBM or HMM alignment models to help us kind of get these alignments on the fly, so we wouldn't have to precompute anything. So we can kind of build this model from the -- yeah, the IBM model kind of gives us this probability distribution we want of the probability of the English sentence given the French with this, with the following [inaudible] from IBM model 2. So first we would choose a length for the English sentence, conditioned on the Spanish sentence length, and then for each target index we choose an alignment with some probability. And then for each of those, based on those alignments, we then choose an English word corresponding to what Spanish word it corresponds to in the alignment. And then, yeah, again, so instead of having to do, you know, the whole idea of running IBM model 1 for five iterations over the dataset and then running the HMM alignment model for five iterations over the dataset, what we were going to try to do is compute these alignments in a single pass over the data. So the idea would be that in the beginning we start with, you know, just uniform, and then as we get counts -- as we keep compressing the document, we'll keep building up better and better counts to kind of get better and better alignments as we go. And then, yeah, again, we build up these models exactly the same for the compressor and decompressor. So we can get those same arithmetic coding interval ranges. And since we cannot use the standard EM algorithm of doing, you know, multiple passes over the dataset, we use online EM, where we use kind of our own variant from Percy Liang's paper in 2009. So the probabilities are updated after each sentence pair. So we kind of are storing these large tables of expected counts. In the beginning, we have nothing in there. And then as we see the first sentence, we'll keep updating counts after each sentence pair. Yeah. So, again, the process here is: we start with some uniform translation model probabilities. We use EM to collect expected counts over a sentence pair. And then we update the probabilities after each sentence pair. So we'll keep getting -- >>: So on the previous slide it sort of implied that you were starting with the model 1 dictionary or not. >> Barret Zoph: No, I'm not. >>: Just put in this -- >> Barret Zoph: No, no, no. So yeah. So sorry. So sorry if this was confusing. So I'm saying in the previous approaches where you want to use like the HMM alignments, for example, you have to initialize with five iterations of IBM model 1, then run five iterations of the HMM alignment model. In this, we're just going to do a single pass with the HMM alignment model. So we're foregoing those previous four iterations and the five iterations of the IBM model 1. So this is completely online.
So we're starting with nothing. >>: What are you really trying to optimize here? Is it -- so there is some big competition that you want to do, and then you [inaudible] and you're trying to compress English [inaudible]. >> Barret Zoph: Our goal is -- >>: [inaudible] that is still the goal. >> Barret Zoph: Yes. Mm-hmm. >>: So why do you care about [inaudible] whether you do it in [inaudible]? >> Barret Zoph: Yes. Okay. So, for example, let's say -- why do we care, right? So maybe we had this English dataset we're trying to compress along with the Spanish. So we run five iterations of IBM model 1 and five iterations of IBM model 5 and get those alignments. >>: Right. >> Barret Zoph: But then the issue is that we would have to store these in the executable. >>: All the iterations? >> Barret Zoph: Whatever the final Viterbi alignments are. You have to store those in the [inaudible]. >>: Right. And that's going to be too big. >> Barret Zoph: Right. Because our goal is to make that executable as small as possible. >>: Okay. It depends largely on the -- on the size of the data that you have. If you had like 10 billion words of data, then -- >> Barret Zoph: Yeah, then it might -- then it certainly might do that. >>: But you have, what, 50 million words that you -- >> Barret Zoph: Yeah, I think it's -- yeah, it's roughly -- yeah. >>: So but even that, if you stored like -- because like the IBM alignments are sparse, or, yeah, word alignments are sparse, right, because most of them are going to be [inaudible]. So if we just stored the head of each -- you know, the top five [inaudible] the top 50,000 words, that would still be -- you could -- like I think from probably a megabyte you could get a lot of information, right? >> Barret Zoph: Yeah. I mean, certainly. That's not one of the things we really explored, but as we'll see later, we actually kind of get everything we need in the online approach. So after we kind of realized this was too -- so kind of my thought process when doing this was, okay, let me see if I can just use the full alignments. And then I found that was too big, so I was just thinking, let me see if I can do everything in an online setting, which we found out that we actually could. So, yeah, and then going back to here, we do have, yeah, about like 50 million word tokens. Okay? >>: And in the online setting you will eventually not have to solve the Viterbi alignment, you just [inaudible] because you -- >> Barret Zoph: Yeah. So we're building them up as the model is compressing. So in the beginning, our Viterbi alignments will be horrible because we've barely seen anything, but as we get, you know, like a quarter of the way through the dataset and we've built up a lot of counts from the sentence pairs we've been compressing previously, then we'll have good parameters at that point. >>: And what gets stored in the final file that gets -- >> Barret Zoph: Oh. So nothing, actually. >>: Nothing, okay. >> Barret Zoph: Nothing. Because we start -- so the compressor and decompressor start at uniform. Then after we've compressed and decompressed the first sentence, we'll have those counts then. And they'll have the same counts at each time step. Okay? Yeah. So then another interesting point is that, unlike in batch EM, we don't need to store separate count and probability tables.
We only store counts from EM, and then we just compute the probabilities we need whenever we need them. Okay? And, yeah, these count tables that I was just talking about -- so, yeah, we just store the counts, like for this, the counts for F, just like in normal EM, but we never store any probabilities. We just keep accumulating. After each sentence pair, we just keep accumulating into these counts. Okay? And so in batch EM, we typically initialize the T table with a few iterations of IBM model 1 and also sometimes model 2. So the issue was that -- so when you initialize the alignment model with IBM model 1 and IBM model 2, you kind of don't have to deal with this next issue, that in online EM the A table learns that everything should align to the null word, which was an issue for us -- yeah, which is an issue. But actually you can get around this very nicely by just heavily smoothing the A table. So the lambda-A parameter we set to be a hundred. And this kind of allows it to not make sharp decisions until we've kind of seen a lot more stuff. And this actually works really well and kind of gets rid of that issue, which is pretty nice. >>: [inaudible] that's tiny, though. You could pass the A table -- you can learn a really robust A table and then compress it into like a hundred bytes, right? >> Barret Zoph: No, yeah, that is actually true, but we actually -- we tried two different things. We actually tried storing it and then learning it. And we found that it does actually about the same, too. But, yeah, the A table is small. The big thing of course is we have this huge T table. But, yeah, you can get around both of these. You can just learn them on the fly pretty easily. And I think also another just high-level benefit of this is that if you actually wanted to train one of these alignment models, you actually don't need to run all these previous iterations of everything. You can just get it on the fly in one pass, which is something I'll talk about a little later. Okay. So what we ended up using for this adaptive bilingual model is we just used an HMM alignment model where, yeah, we would only do a single pass over the dataset. So, yeah, there's a couple different variants of the HMM alignment model. So for this one, what we do is we choose the English length with some probability epsilon. And then for each target word, we set the alignment to null with some probability P1. And we choose a non-null alignment with probability 1 minus P1 times some parameter for the relative offset of the previous alignment. And then based on that we choose a probability from the T table. Okay? And then also in compressing we must predict English incrementally, before seeing the whole English string. Okay? So we also must model when the English sentence stops, which is pretty straightforward from the model here. I just also wanted to mention that we do have to model the stop probability. >>: So effectively you interpolate all of the -- you -- because you don't actually know which word it aligns to, but you say -- so you have a probability distribution over all the things it aligns to and then you take the actual -- the translation [inaudible] interpolate them, weight it, and then it's -- then you get a single distribution for all the English words and that's what you -- >> Barret Zoph: Yeah.
Yeah, we kind of get this big -- you can think of it as kind of like a big lattice where you have each target word up here and then all the different possible source words. And then at each point the HMM alignment model gives us a distribution over each of those points. And then you can renormalize to get a valid distribution. >>: But that doesn't -- does it encode them separately, or does it encode them in a single just over English? Do you interpolate them, just do linear interpolation and then get a single distribution, or do you interpolate -- do you have basically two different [inaudible] would actually jump to and then encode what target word from that I should [inaudible] make a difference? >> Barret Zoph: No, we just -- we just -- I guess, I'm sorry, I'm not really understanding your question. >>: Because you could encode -- you could encode the probability distribution of which [inaudible] should I jump to and then encode given -- and then -- then you decompress that -- >>: -- deterministically -- >> Barret Zoph: Oh, I see what you're saying. We just do it -- we just do it at the same time. >>: Okay. >> Barret Zoph: Yeah. >>: You would recover the Viterbi alignment when you're decoding, or do you marginalize out the alignment? >> Barret Zoph: We just marginalize out the alignment. So we don't actually do that. >>: Okay. >> Barret Zoph: Okay? So then some training issues for this. So, again, we're not able to initialize the HMM alignment model with anything, so we do run into two issues. One is that EM sets the probability of aligning to null way too high. So, yeah, it kind of wants to align everything to null. And then also EM learns that the relative offset of zero is way too high, so you kind of get situations where, you know, if you've seen a 2 you just kind of keep predicting a 2; you get a 5, you keep predicting a 5, and things like that. So to solve this, it's, again, a pretty simple fix. What we do is we simply fix the probability of aligning to null, so we don't allow that to be a learnable parameter in the EM framework. And then we also heavily smooth the O table now instead of, you know, an A table. And we find that this fixes the issue. Okay. So then what we were going to do next is, okay, so now we've kind of got this adaptively trained HMM model -- how good is this model doing? So we wanted to inspect, you know, in just a general sense, the alignments we got from it. Okay? So, okay, so what's going on here? So we're comparing against two different things. We have like a silver standard and a gold standard. So for the silver standard, we're just simply comparing against a batch HMM alignment model that was run first with five iterations of IBM model 1, five iterations of IBM model 2, and then five iterations of the HMM alignment model. And then for the gold standard, we also have a set of 334 human-aligned sentences. And it's bidirectional. Okay. And then to be able to compare with this, we simply run the online HMM alignment model in both directions and then use like a [inaudible]-final type thing to merge them together. Okay. So then we can see a couple things here: the online EM reproduces 90 percent of the links from batch EM. So it actually does pretty well. And it also -- really surprisingly, it matches human alignments as well as batch does. So this online alignment model does really well actually, which is very cool for us. Okay.
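(A simplified Python sketch of the single-pass, online-EM-style count accumulation described above, shown only for an IBM-Model-1-like translation table; the actual system uses an HMM alignment model with a smoothed jump table and a fixed null probability, so this is just the flavor of the update, not the real model.)

    from collections import defaultdict

    t_counts = defaultdict(lambda: defaultdict(float))   # expected counts, start empty

    def t_prob(e, f, vocab_size):
        # Add-one smoothing so unseen pairs start out effectively uniform.
        total = sum(t_counts[f].values())
        return (t_counts[f][e] + 1.0) / (total + vocab_size)

    def online_em_update(english_words, spanish_words, vocab_size):
        # One E-step over a single sentence pair, folded into the counts
        # immediately instead of after a full pass over the corpus.
        for e in english_words:
            scores = [t_prob(e, f, vocab_size) for f in spanish_words]
            z = sum(scores)
            for f, s in zip(spanish_words, scores):
                t_counts[f][e] += s / z               # fractional count

Both the compressor and the decompressor run this same update after each sentence pair, so their tables stay identical without anything ever being stored in the executable.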
And then we also see that the alignments are better in the second half of the corpus than in the first half of the corpus, which would be kind of expected, as we're building up the counts as we go. And then another thing we tried was we wanted to say, hey, since we're building this model up in an online setting, maybe if we put, you know, the shorter, very easy sentences in the beginning, maybe this could help it hone in on better parameters in our online one-pass approach. So we do see that putting shorter sentences at the top did help online EM a little bit. And then we also got the alignments and fed them into like a Moses pipeline to see how they would match. And you can see that they're very comparable, which is pretty cool. Okay. And, yeah, so then just a quick note on this, because we now have this HMM alignment model that's expecting words. And since our goal was that we're going to compress the data as is, this of course runs into issues because now, you know, you have commas attached to words, you have periods attached to words and things of the like. So then we wanted to kind of come up with a tokenization scheme that we could uniquely reverse exactly. So we kind of came up with this thing. It's a pretty standard approach that I'll just go over briefly. So what we do is we identify words as subsequences of all lowercase, all uppercase, numbers, and then any other symbols. And what we do is we append the number of spaces following it. So, for example, this string is directly attached to the hyphen, so we get an @0 attached to it. This guy has a space after it, so it gets an @1 attached after it, and so forth. Then what we do is we remove all of the @1s, because we would be assuming that, you know, in a corpus most things have one space next to them. And this still makes it uniquely reversible. And then finally what we do is we move any suffix of an alphanumeric word to become part of the prefix of a nonalphanumeric word. So you get something like this. So here's an alphanumeric word here, and we simply just take this and we append it to the nonalphanumeric one in front. So the whole idea is that you would hope that, you know, common things like the period often have this kind of deal, where now the period will always have like an @-something -- period @-something. So you kind of get these more common tokens. So yeah. Okay. So, yeah, so under the previous tokenization scheme, we now ask our translation model to give us a probability distribution over possible next words, like, Jacob, you were saying. So now the translation model knows the entire source sentence and the target words seen so far. So what we do is, yeah, we kind of compute this prediction lattice for this HMM alignment model. That gives us a distribution over different source alignment positions. And also, what we found very helpful is that we weight each predicted word in the HMM alignment lattice with a bigram word model. We found this to be very helpful. Okay. And so before we were working in the word space, and then, you know, back in that arithmetic coding example it was characters, and we actually wanted to go back into characters again. So how do we kind of take our prediction from words into characters? It's pretty simple. We just combine the scores for words that share the same next character.
So imagine when we're compressing or decompressing we have seen the C so far and we have a distribution over these four words, for example. Then what we do is, okay, so these two share the same next character here, so we simply merge these two probabilities here, with this probability here and with this probability here, and then renormalize them to 1. So it's a pretty straightforward way to take this distribution over words into characters. Okay. Okay. So then what we also wanted to do was we also want to be able to interpolate this PPM prediction over characters with the HMM's character prediction. So we get something like this. So if we're predicting the next character, it's like a mixture model with the PPM prediction and this HMM alignment prediction. Okay? And what we do is we dynamically adjust this interpolation weight for each context. Okay? And we actually found this to work remarkably well in practice. So what we do is we take the max of PPM over the max of PPM plus the max of HMM. Okay. And now this max operator is the following. So PPM will have a distribution over all of its characters. And then max is simply the highest probability it assigns to any of those characters. So intuitively this is kind of like, you know, if one is very confident in its prediction, then, well, let's listen to that more. And then we did kind of put like a little exponentiation factor there. But it wasn't too sensitive to that. We just got a small little boost doing that. But overall this max works very well. It actually worked just as well as kind of training a logistic unit based on different contexts to kind of predict the weight. We found that this worked basically just as well and was much simpler. So this was very, very cool. Okay. So now for some of the results. So first what we did was we had the Spanish. We compress it. We get some Spanish executable. That should then be able to extract the Spanish byte for byte. And then what we looked at was just PPM, which was our monolingual compressor. We looked at how well this could just compress the Spanish. Okay. So what we see is that the uncompressed Spanish file is 324.9 megabytes. Using the Huffman encoding scheme it's 172.8 megabytes. And then using PPM it is 51.4 megabytes. So yeah. So then we get the compression rates here, and BPB is just the bits per byte for our compression. Another common thing is maybe to report bits per character, but for this example there could also be Unicode -- we wanted to deal with the cases where there was Unicode, which can be more than one byte per character. So bits per byte was just the easiest thing to do. Okay. Okay. So then now what we're doing is that we're going to be compressing the English while looking at the Spanish, to get this English executable that, when run, and it can look at the Spanish, can reproduce the English byte for byte. Okay. So, yeah, so then -- okay. So we see that the uncompressed English file here is 29.4 megabytes; after compressing it with Huffman encoding, we get 160.7 megabytes, and using PPM we get 48.5. And then using our bilingual method we get 35.0 megabytes. So we do certainly get a nice boost in compression using this HMM alignment model. >>: But don't you include the entire compressed Spanish? >> Barret Zoph: Yeah, we do actually. So that's the great -- that's actually the next slide. But this is just looking at the English. Yeah, we certainly do.
I completely agree with you. Yeah, going on to this, where this size is simply that 35.0 you've seen before, plus the size of the compressed Spanish, we get 86.4 megabytes. So we get a 15.2 percent compression rate improvement by looking at the Spanish. And then there was also related work on this where these guys actually have kind of a bilingual model, but they simply interleaved the Spanish and English words and then compressed them that way. And they only got a 7 percent improvement over their compression, and we got a 15.2 percent improvement. And then going back to the main title of the paper -- how much information does a human translator add over the original -- we provide an upper bound saying that a human translator adds at most 68.1 percent of the information that the original author produces, which is that 35.0, the size of the compressed English, over the size of the compressed Spanish. And something else we decided to do was we ran this kind of bilingual Shannon game that was able to give us a looser kind of upper bound. And we found that humans are actually only adding roughly 32 percent using this. So we're still a far way off from being able to match kind of what humans can do in this scenario. >>: I kind of hesitate to ask something because I want you to talk about your second part of the -- let you go to your second part, but it does feel like there's a mismatch here, right? Because we don't -- in the compression scheme we expect that the model knows absolutely nothing about the correspondence between Spanish and English when it starts. And in the Shannon game at the bottom, I assume you're not like letting people learn Spanish as they try to predict what [inaudible]. >> Barret Zoph: No, I mean, that is certainly -- >>: They have this huge amount of knowledge that's encoded already in their brain, and so it feels like these are two fundamentally different questions. Like one of them is how well can I compress bitexts and the other is what kind of information are the humans adding. And it feels like you've focused maybe more on the first question. Does that seem fair? >> Barret Zoph: No, I mean, that certainly does seem fair, but it's also -- it was kind of, you know, a first step towards trying to make something like this, where it's very hard to formalize a rigorous thing to answer that. But, no, I completely do agree with you, right? So humans certainly do have all of this other knowledge stored. I mean, yeah, it's really one of the fundamental issues here. Okay. And, you know, yeah, so then just concluding: we do obtain a quantitative bound. And we also do provide an alternative, completely unambiguous way to kind of identify how well we can exploit and identify various patterns in bilingual text. We think that, you know, compression is certainly a good way to do this, as it's certainly unambiguous. Okay. And then -- yeah. And then just some future and ongoing work that we're currently doing is certainly using better predictive translation modeling instead of the HMM alignment model -- that came out in like 1996, and these models have improved a lot since then. So ideally we'd exploit better bilingual modeling and various things like that. Okay. Yeah. Okay. So, yeah, so that was kind of the first part. I can take questions on that now if there are any, or we can just keep going -- >>: Short on time, so -- >> Barret Zoph: Okay. And then questions, okay. >>: Got the room until noon.
People may need to -- >>: I doubt there's anybody right after us, but I need to go at 12:00 sharp. >> Barret Zoph: Okay. All right. Sure. Okay. So then the second part of this is on multi-source neural translation. So this is joint work between me and Kevin Knight at ISI. Okay. So let's take a look at this. Okay. So the goal of this work was: can we build a neural network machine translation system that can exploit trilingual text? Because when you're building these systems, a lot of times you have trilingual text. You know, there have been other methods that have exploited having bilingual text in a lot of different languages, but can we build a joint model that can take in trilingual text? So, yeah, we try to model, you know, for example, the probability of an English sentence given a French and a German sentence, instead of the probability of an English sentence given a French sentence, like in a normal neural network machine translation system. And so we present two novel multi-source non-attention models along with one multi-source attention model, and we show very large Bleu improvements. So we get 4.8 Bleu points over a very strong single-source attention baseline that I think is currently the state of the art on a lot of the different WMT benchmarks, so it was very strong results by using trilingual data. And we also observed something interesting: the model does much, much better when the two source languages are more distant from each other. Okay. So just a little brief intro on multi-source machine translation. People have certainly worked on this before, starting probably with Franz Och, you know, and so Kay 2000 points out that if a document is translated once, it is likely to be translated again and again into other languages. So a certain way that this could come about is that a human does the first translation by hand and then turns the rest over to an MT system that now has access to this trilingual data. And ideally a translation system -- so the translation system will have two strings as input. And you can reduce the ambiguity of these two strings using a method called triangulation. This was Kay's term. And you can see in this example down here how having two source translations in different languages could certainly help disambiguate each other. So, for example, the English word bank could be much more easily translated into French, because bank, you know, could be like a river bank or an actual bank you go into or something like that. And if you have a second German input string that contains this word, which means river bank, for example, then you could certainly translate bank correctly because of this. Okay. So in the standard neural network machine translation setting, we have bilingual data -- so English and French -- to typically model the probability of an English sentence given a French sentence. And so we use an attention and a non-attention baseline. And for our non-attention neural network machine translation model baseline, we use kind of the stacked multilayer LSTM approach from Sutskever, et al. And just as a quick review on this kind of neural network machine translation approach using multilayer LSTMs: what ends up happening is that if we're translating this English string into this French sentence, what we have is we first feed in "the" at the first time step into this four-layer LSTM, where each of these boxes is an LSTM.
We then feed in "dog". And, again, then it's passing along, you know, this vector of real-valued hidden states that it's using to represent what's going on so far. We keep eating it up, and then we feed in some stop symbol to let it know that now it should start producing the target text. We keep going, keep going, until we hit some end-of-sentence symbol. And the model is trained jointly on sentence pairs using maximum likelihood estimation, and then at decoding time you simply use a greedy beam search to get out the target translation. So, a very simple greedy beam decoder. Much simpler than, you know, many of the other hard-to-write decoders in MT. Okay. So, yeah, then again each box here is an LSTM, although there's various different choices for this. Like you could use a GRU or the other boxes that can deal with vanishing gradients well. So then our idea is that, you know, how can we build a new model to be able to exploit the trilingual data? So we do something like this, where we now have two source encoders. So A B C is, for example, a sentence, and, for example, it may be English. Then we have I J K, which is a sentence in German. And we feed these jointly into the target decoder, where it then wants to produce W X Y Z in some third language. Okay. And this is the same as before, where these are LSTMs, these are LSTMs, these are LSTMs. Except now we have these black boxes, which in this paper are something we call a combiner box, whose job is to take the representations getting sent from each encoder and combine them in a nice way such that it's a more useful representation for the target decoder. Okay. Yeah, and then, once again, I'll -- yeah, like I said before. Okay. So then a quick aside on the LSTM block. I mean, essentially all the LSTM does is take in this thing called a hidden state and a cell state, which are two vectors of real numbers of the same dimension, and it just outputs again two new vectors of the same dimension. And the internals are certainly, you know, complicated. But essentially that's all it's really doing: it's taking in this thing called a hidden state and a cell state, and then it goes through this LSTM mechanism, and then you get out a hidden state and a cell state. And it's fully differentiable. And these were being shown as a single arrow before, just for brevity, but there's really two things being passed along. So if we look at this picture again, if we just look, for example, at the top layer here, there is one hidden state and cell state being passed to here. There is another hidden state and cell state being passed to here. And now the goal of the combiner block is to take in both these hidden states, both these cell states, and combine them together in some way so as to produce a single hidden and cell state to be sent to the target decoder. Okay. So, yeah, again, a combiner block is simply a function that takes in these two hidden states and two cell states at each layer and outputs a single hidden and cell state. And what we did was we tried a few different methods for combining them to see what would maybe work best, the first being what we called the basic combination method and the next being called the child-sum method. Okay. Okay. So for the basic method, what we do is we simply concatenate both the hidden states, which is what this semicolon represents here, and then we apply a linear transformation.
And then we send the output through some nonlinearity, which we chose to be tanh here. And for the cell values, we simply add them together. We tried various other things -- for example, doing a linear combination of the cells -- but we found that training diverges, because when you're training these neural MT systems, if the cell values get too big it leads to bad training. Okay. And we also tried a second method, which is certainly much more complicated, which we called the child-sum method. It's kind of an LSTM variant that combines the two hidden states and the two cells, and it was based off of the tree LSTM. So you can think of this as a tree that has two children, and now we combine them into one. It uses eight new matrices, which you learn -- the previous method just had one. And then you send everything through these internal dynamics right here; it's essentially an LSTM that takes in two cell values and has two different [inaudible] values. That's essentially what's going on here. Okay. So then for our single-source attention neural network machine translation model, we base ours off of the local attention model in Luong, et al., from EMNLP this year. Essentially the idea with these attention models is that when you're training these neural MT systems there's a huge bottleneck, because what you're asking the model to do is eat up the entire source sentence and then feed just one vector representation to the decoder -- that's pretty obviously a huge bottleneck. The idea here is that at each step on the target side we have access to the hidden states on the source side; at each time step we get to take a look at everything on the source side. Okay. So here's just a quick overview of how this local-p attention model from Luong et al. works, because we build on its internals too, so it's useful to understand it. At each time step on the decoder side you have this hidden state ht, and what happens next is you compute this scalar value pt -- pt = S · sigmoid(vp^T tanh(Wp ht)) -- where S is the source sentence length, so in this case it would be 6 (1, 2, 3, 4, 5, 6). It gets sent through a sigmoid, and vp and Wp are just learnable parameters that you learn along with your model. Okay. And note that since the sigmoid is between 0 and 1, pt will be between 0 and S, so you're predicting a position between 0 and the source sentence length. Okay. So now what we do is we look at the source hidden states in the range pt minus D to pt plus D. And note that in this scheme we're only looking at the top-level hidden states -- you can imagine you have your stacked LSTMs below this and we're just looking at the top layer here. So what we do now is compute an alignment score with each of these source hidden states. Right?
So what we end up doing is: we have a hidden state hs coming from each of these source positions -- just the top-level hidden state -- and we compute an alignment score, which will be between 0 and 1, for every single position here. Let's see how this is calculated. You have some align function here: you compute a score between your decoder hidden state ht and each of these source hidden states hs, with WA being a learnable parameter, and then you simply normalize these across all the positions you're looking at to get values between 0 and 1. And then you weight this by a Gaussian term centered at pt to favor positions close to where pt predicted, and this also keeps the model fully differentiable, so you don't have to use anything like reinforcement learning or trickier methods to train it -- everything is fully differentiable, which makes it really nice. So then you create this thing called ct, which is just a weighted average of all the hs: you have an alignment score for each position and you have its hidden state, so you simply take a weighted average of all of these to create this thing called a context vector. Okay. Then what you do is you take your new context vector and, once again, the original hidden state used to create it, you concatenate them, you apply a linear transformation -- again a learnable parameter of your model -- you pass it through a nonlinearity, and you get out a new hidden state representation that should help you more. As opposed to using ht before, now you're going to use this ht tilde. Okay? So that is the attention model from Luong, et al., at EMNLP that we were using, which currently gets state-of-the-art results on many WMT benchmarks. Okay. So now, going back to our non-attention multi-source model, let's see how we can add a multi-source attention model to this. What we do is look at just the top layer of each of the source encoders from the previous picture -- we can imagine we're just going to look at this, this, and this. Okay. So, again, we have our decoder hidden state here. And what we do now is predict two different locations to look, one for each source encoder, and they can be completely different locations. So let's say we're predicting to look at these three here and also at these three here. Then we simply get alignments for each of those locations like we did before, and we create two different context vectors, one per source language. We concatenate the two context vectors with the current hidden state, we apply a linear transformation, and then we send it through a tanh nonlinearity. Okay. And then, for the models we use in these experiments, just some stats: we train the models for 15 passes over the dataset. We use a very common learning-rate halving scheme. We use a hidden state size of 1,000, which is pretty common in the literature, and a minibatch size of 128. We replace all the non-top-50K source and target words with [inaudible] symbols. And we use dropout for all the models. Okay.
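To make the combiner block concrete, here is a minimal NumPy-style sketch of the two combination methods. The parameter names and the exact child-sum gate equations below are my own illustrative assumptions in the spirit of the tree LSTM, not necessarily the paper's exact parameterization; the basic method does match the description above (concatenate, one learned matrix, tanh, add the cells).

```python
import numpy as np

def tanh(x): return np.tanh(x)
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

H = 1000  # hidden/cell dimension (the talk uses a 1,000-unit hidden state)

# Basic combination: concatenate the two hidden states, apply one learned
# linear map plus a tanh; simply add the two cell states.
W_basic = np.random.randn(H, 2 * H) * 0.01   # the single learned matrix

def combine_basic(h1, c1, h2, c2):
    h = tanh(W_basic @ np.concatenate([h1, h2]))
    c = c1 + c2   # plain addition; a learned combination of cells made training diverge
    return h, c

# Child-sum combination: an LSTM-style gating over the two children,
# loosely following the tree-LSTM idea, with eight learned matrices.
Wi1, Wi2 = np.random.randn(H, H) * 0.01, np.random.randn(H, H) * 0.01
Wf1, Wf2 = np.random.randn(H, H) * 0.01, np.random.randn(H, H) * 0.01
Wo1, Wo2 = np.random.randn(H, H) * 0.01, np.random.randn(H, H) * 0.01
Wu1, Wu2 = np.random.randn(H, H) * 0.01, np.random.randn(H, H) * 0.01

def combine_child_sum(h1, c1, h2, c2):
    i  = sigmoid(Wi1 @ h1 + Wi2 @ h2)   # input gate
    f1 = sigmoid(Wf1 @ h1)              # one forget gate per child cell
    f2 = sigmoid(Wf2 @ h2)
    o  = sigmoid(Wo1 @ h1 + Wo2 @ h2)   # output gate
    u  = tanh(Wu1 @ h1 + Wu2 @ h2)      # candidate update
    c  = i * u + f1 * c1 + f2 * c2      # combine both children's cells
    h  = o * tanh(c)
    return h, c
```

Either function is applied independently at each of the stacked layers, taking the two encoders' states and handing a single hidden and cell state to the target decoder.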
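And here is a rough sketch of the attention step just described: single-source local-p attention in the style of Luong et al., extended to two encoders by concatenating the two context vectors with the decoder hidden state. The names Wp, vp, Wa, Wc and the Gaussian width D/2 follow the usual presentation of that model; treat this as an illustration under those assumptions rather than the exact implementation.

```python
import numpy as np

def tanh(x): return np.tanh(x)
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

H, D = 1000, 10  # hidden size and attention window half-width (D is a hyperparameter)

def local_p_context(h_t, src_top, Wp, vp, Wa):
    """One source encoder: predict p_t, score a window around it, return a context vector."""
    S = src_top.shape[0]                       # source length; src_top is (S, H) top-layer states
    p_t = S * sigmoid(vp @ tanh(Wp @ h_t))     # predicted source position in [0, S]
    lo, hi = int(max(0, p_t - D)), int(min(S, p_t + D + 1))
    window = src_top[lo:hi]                    # only top-layer states in [p_t - D, p_t + D]
    scores = np.array([h_t @ Wa @ h_s for h_s in window])
    align = softmax(scores)
    # Gaussian weighting that favors positions near p_t (keeps everything differentiable).
    positions = np.arange(lo, hi)
    align = align * np.exp(-((positions - p_t) ** 2) / (2 * (D / 2) ** 2))
    return align @ window                      # context vector: weighted average of window states

def multi_source_attention(h_t, src1_top, src2_top, params):
    """Two encoders: one context vector per source, concatenated with h_t."""
    c1 = local_p_context(h_t, src1_top, *params['enc1'])
    c2 = local_p_context(h_t, src2_top, *params['enc2'])
    h_tilde = tanh(params['Wc'] @ np.concatenate([c1, c2, h_t]))
    return h_tilde                             # used in place of h_t when predicting the next word
```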
And then also a quick note on parallelization. These neural network machine translation models can get very slow to train -- with a lot of layers, a large target vocabulary, et cetera. So for all the models, the multi-source and the non-multi-source ones, I parallelize them across GPUs: we have these nodes that have four GPUs, and I split the computation across them in my C++ implementation. I use a similar parallelization scheme to the one used in Sutskever, et al. With the multi-GPU implementation I get about a 2 to 2.5 times speedup. I could get more, but there are bottlenecks simply because there are more layers and things I need to run than I have GPUs, as you'll see in the next picture. And then the speeds here are for training the different models: about 2K target words per second for the multi-source model without attention, and about 1.5K target words per second with attention. Okay? So let's see how this parallelization scheme works. Ideally, in a world where you have more GPUs per node, you could have GPU 1, 2, 3, 4, 5, 6, something like this. But I don't, so I of course have to share GPUs across layers, and that leads to a bit of a bottleneck. But when you apply the multi-GPU parallelism you still get almost flawless speedup, which makes it really nice. And then here is the attention layer, which I was just explaining before, now represented as a single node. And I also have the softmax, which was being implicitly represented before. So how does this work? At the first time step, this guy computes its activation here. Then I send it asynchronously up to the second GPU so it can start computing, and this guy computes its activation here. You can keep going and get this nice cascading effect, which I found really helpful for speeding this model up. And then for the backward pass, this guy does its computation, and you get the same exact thing. And as you can imagine, for a model like this you get the same scheme -- I'm just now also running these two encoders in parallel for all of this. So that is how I parallelize these, and I get really large speedups by doing so. Okay. So, the dataset for the experiments. I used a trilingual subset of the WMT 2014 dataset, and for the languages we used English, French, and German. We did two different sets of experiments, one with the target language being English and another with the target language being German. And, yes, we can see we get about 60 million word tokens here. Okay. So then the results for when German is the target language. >>: Did you train on all 60 million? >> Barret Zoph: Yeah, I trained on everything. So this is the exact training data in use. Okay. So we look at this when German is the target. For the source side we have English and French. Here are the Bleu scores and perplexity on a test set. We see that English as the source does better. Then we try the non-attention models -- here we're using the basic combination method and the child-sum combination method -- and we get a 1.1 Bleu gain and a 1.0 Bleu gain. We then run this model with an attention mechanism, just a single-source attention mechanism, and we get a Bleu score of 17.6.
And then we used the same two combination methods but with the multi-source attention model, and here we get a 1.0 Bleu gain and here a 0.6 Bleu gain. Okay. And then we also inspected when English is the target language, so now we have German and French as the two source languages -- and these languages are much more different from each other. Here we see that our best single source is French to English, where we get a Bleu score of 21.0. We then use the basic and child-sum methods and we get a 2.2 Bleu gain and a 1.5 Bleu gain. Then we also wanted to check whether having two different source languages actually matters. So we fed in French and French, and you don't get anything by doing that. So it's not just about having more model parameters; actually having these two different source languages to look at does make a difference. Okay. Then we ran the single-source attention model and got a Bleu of 25.2. And then we used the attention model along with the two different combination methods, and for our best one we get up to a 4.8 Bleu point gain, which was pretty high. So the conclusion for this is just that we describe a multi-source neural MT model that gets up to 4.8 Bleu points over a very strong attention model baseline. And we also see that having the two source languages be more distant certainly helps and gives a larger Bleu gain. And then, yeah, are there any questions? [applause] >> Barret Zoph: Oh, yeah. >>: So there are two sources for the Bleu gain you're getting here: one is the encoder itself, the bilinguality, and one is the target language side, because you now have double the amount of data on the target. So did you consider how to compare -- >>: No, it's the same. >> Barret Zoph: It's a three-way aligned corpus. Yeah, it's a three-way aligned corpus. >>: The double source and single target? >>: Yeah, so it's the exact same sentences. >> Barret Zoph: Yeah, so we just get a trilingual dataset, so we have the same number of English, French, and German sentences. >>: So the same. >> Barret Zoph: Yeah. Okay. Any -- >>: Did you try any experiments where the trilingual corpus is, say, X and you have a 10X bilingual corpus? Do you have any [inaudible]. >> Barret Zoph: No, I didn't. I mean, that's the kind of stuff I'm actually doing right now. These things just take longer to train when you start stacking on a bunch of these pieces. But, yeah, that is certainly one of the things I'm doing now. >>: What are you -- like what are you planning to do [inaudible] exploiting -- >> Barret Zoph: So what I'm exploiting is just using significantly more languages. So I was actually going to go out to six different source encoders and then see how the different combinations of languages helped. And another thing is visualizing what they're actually doing. I didn't put this in the slides -- I probably should have -- but once you train this joint system, you can look at trilingual, you know, four-lingual alignments and see what they're looking at. It leads to pretty interesting things. That's certainly one of the things I'm interested in doing.
Another thing I'm interested in is being able to train a joint system that can project two sentences into the same space, or various things like that, which could be useful. >>: I do think it would be great to look at different language pairs. Because if you look at English, it really is kind of as if somebody took German and French and smashed them together, right? Because Anglo-Saxon was spoken by the people [inaudible]. So taking French and German and trying to translate into English is kind of the ideal case from a linguistics standpoint for having multi-source. But, I mean, the results are fantastic, right -- that kind of Bleu gain is really cool. So that's great. But from a parameter standpoint, though, didn't your child-sum method have like a million new [inaudible] parameters and you've got like 2 million segment pairs? So you only get one training instance for those child-sum parameters per sentence pair or sentence triple, right? >> Barret Zoph: Yeah, so that's exactly why we think it does worse, right -- it just overfits that operation. And especially because in a normal recurrent neural network the parameters are reused across time, whereas this is just being used once per training instance. >>: Yeah. I wonder whether some regularization or some different parameter shape or something might -- >> Barret Zoph: Yeah. Yeah, I mean, that's certainly another thing to try too. >>: Yeah. What is the impact on the size of the model? What is it costing you? >> Barret Zoph: What is this -- for speed? >>: No, in terms of the -- more loosely, the trained model sizes that you already [inaudible] -- is there any path of using a -- by using a single [inaudible] train the model sizes -- >>: 50 percent bigger, 3 instead of 2X. >> Barret Zoph: Yeah. >>: [inaudible]. >>: Model sizes are never really a big issue here. It's totally the speed, which is -- >> Barret Zoph: Yeah. You could fit all the stuff in the memory of the GPU pretty well. >>: I'm curious about your parallelization capabilities. >> Barret Zoph: Oh, yeah. Of course. >>: So I thought one of the issues with parallelizing this kind of computation across GPUs is that the [inaudible] communication [inaudible] computation. So did you do anything special? >> Barret Zoph: No. So okay, there's a caveat: you have to have GPUs that can do these direct memory access transfers, these DMA transfers. So how the transfers are being done is they're being done asynchronously. Essentially you have these streaming multiprocessors on your GPU that are running, and you also have hardware that is taking care of a memory transfer. And the way this works is that it's a direct GPU-to-GPU transfer -- the GPUs are all connected across this direct bus, so they're just communicating with each other. So the data that I'm sending, which is like this 1,000 by 128 matrix of floats or whatever, never has to reach the CPU. It's getting sent directly from GPU to GPU. So you're not having to deal with, okay, here's memory on the GPU, transfer it to the CPU, then send it to the other CPU or whatever, and then transfer it back to a GPU. It's done direct. So it's actually pretty fast. It's not a bottleneck. I mean, you'll get really flawless speedups doing things like this.
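For a sense of the handoff pattern being described, here is a toy PyTorch-flavored sketch of overlapping a non-blocking GPU-to-GPU copy with computation on the sending device. The actual system is a custom C++/CUDA implementation using DMA transfers; the layer sizes and two-device layout below are just assumptions for illustration.

```python
import torch

# Two layers of the stacked model live on different GPUs (assumes >= 2 CUDA devices
# with peer access enabled; only meant to show the cascading overlap pattern).
dev0, dev1 = torch.device('cuda:0'), torch.device('cuda:1')
layer0 = torch.nn.LSTMCell(1000, 1000).to(dev0)
layer1 = torch.nn.LSTMCell(1000, 1000).to(dev1)

x = torch.randn(128, 1000, device=dev0)        # minibatch of 128, hidden size 1,000
h0 = c0 = torch.zeros(128, 1000, device=dev0)
h1 = c1 = torch.zeros(128, 1000, device=dev1)

h0, c0 = layer0(x, (h0, c0))                   # GPU 0 computes its activation
h0_on_dev1 = h0.to(dev1, non_blocking=True)    # hand it to GPU 1 without blocking GPU 0
# GPU 0 is now free to start the next time step while GPU 1 consumes h0_on_dev1.
h1, c1 = layer1(h0_on_dev1, (h1, c1))          # GPU 1 computes the next layer
```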
So, for example, let's say I have a one-layer model with a 20,000-word target vocabulary -- just a single layer and a 20,000 target vocabulary, something like that. If you run that on one GPU, maybe you get something like 7K source-plus-target words per second. If you then split it across two GPUs, you get about 13K words per second. So you get almost flawless parallelism with it. It works really well, I found. >>: And so, of course, this wouldn't work when you went across -- if you try to go across multiple machines. >> Barret Zoph: I think you could probably get it to work with MPI if you played around with it. I haven't tried it, because we just have these four-GPU nodes. But I think you could probably get it to work. Another thing I tried that I think is actually worth the cost is, if you have a huge softmax -- say you want a softmax over a million different things -- you can split that matrix multiplication across multiple different computers using MPI, and then they all share the normalization constant and send it back. I actually did play around with something like that, and I got it to work decently well. I don't have that in here, as there are smarter methods for dealing with a large vocabulary. But I think you could certainly do that. If you played around with it, you could certainly parallelize it across different computing clusters. And you could also probably do some asynchronous model updates too, which other places do. You could do a combination of both for really fast training time. >>: Yeah, I mean, there's the sort of parameter-server paradigm for parallelization -- how does what you're doing compare to that? >> Barret Zoph: I think you could use both. You could certainly use both, right? So you have a cluster of a bunch of these different computing nodes, and each of them has a couple of GPUs. The couple of GPUs you have per node you use to compute the model like this, and then you share those across all of your cluster. So there's no reason they can't be used together. >>: So this parallelization scheme works well when you have stacked layers like this. Do you know how much the stacked layers are buying you in terms of model quality, in terms of prediction? >> Barret Zoph: So, okay, I've played around with this a little bit. I found that I pretty much don't get much more after six layers, with four being good. But you certainly get a huge gain going from one to four. So, for example, you'll get maybe a 3 to 4 Bleu boost by going from one layer to four layers. So four I found is actually pretty good. I've gotten a bit more up to six, but after that you start to get diminishing returns. And what's also nice is that if you have an attention layer here, it takes about the same amount of time as these, so you can just treat it as another layer -- or if you want to do something with characters down here, other things like that. So you get really nice parallelism doing this, and it trains the models much quicker. >>: [inaudible] bilingual, given French, you know, [inaudible] something else for English. So now what is happening at runtime? What kind of a sentence am I going to get? Am I going to get French? >>: You're getting both. >>: I'm going to get -- is that a requirement or -- >>: Yeah.
>> Barret Zoph: Mm-hmm. >>: Right? I mean -- >>: [inaudible]. >>: That's assuming that you have a French and a German sentence. >>: Yeah, yeah, I know. >>: Same thing at runtime. >>: I understand, so -- >>: Well, it's a pretty special case. >>: So what is at runtime? Like is it [inaudible]. >> Barret Zoph: It's -- >>: But -- okay. But let me -- let me note this -- [inaudible]. >> Barret Zoph: So -- certainly. I mean, there's been stuff showing that if you train a model on trilingual data like this but then at test time you only have one of the sources -- say you're going into English and the model takes French and German but you only have the French -- you could maybe use another MT system to translate that French into German, feed in both, and see what you get. And that's another thing, right? So I think there's no reason this method can't be used in conjunction with other methods, where you train with a bunch of bilingual data. Because I think that if you have trilingual data it helps to jointly train on it, but then you can certainly use it in conjunction with other methods. Yeah. >>: I think there's been work to sort of learn parameters across -- you know, to do multilingual translation but share parameters across multiple languages, right? I don't think they did it in speech, but I don't know. Has anyone done it in MT yet? >>: Well, that's a different project. But that's [inaudible] working on. So like if you wanted to train like a low [inaudible]. >> Barret Zoph: Yeah. Yeah. So you could certainly do that. I'm actually doing that now -- getting these things to work well with low-resource languages. So this is what I do. Let's say you want to do Uzbek to English but you barely have any data, right? What I'll do is I'll train a huge French-to-English model, say on the WMT corpus. Then I'll initialize the Uzbek-to-English model with that exact French-to-English model, but I'll only allow the English-side parameters to move within some L2 squared distance of where they started -- so I won't really allow those parameters to change -- and then I'll allow the source encoder parameters to change freely. And that alone just works very well. Like it gets like a couple points -- >>: When you say you initialize the Uzbek model with the French model, I mean, there has to be a correspondence between the vocabularies. >> Barret Zoph: Yeah, I mean, you could certainly do something smarter, right, but I don't even do that, and you get very good results. >>: So essentially the Uzbek words get assigned to some random French word. >> Barret Zoph: But I allow those parameters to change -- >>: But they can change. >> Barret Zoph: Yeah. Exactly. So I tried a bunch of different experiments fixing different parameters at different stages, and found that if you just allow the whole source side to be trained and you constrain the target side to not move more than some delta, it works really well. Like we're getting very close to beating our SPMT -- our best [inaudible] system on the [inaudible] language pairs.
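A small sketch of that transfer idea, under my own assumed names: initialize the low-resource (child) model from the high-resource (parent) model, let the source-encoder parameters move freely, and pull the target-side parameters back toward their parent values with an L2 penalty. This is one simple way to keep them "within some delta"; I don't know that it is exactly how the described system implements the constraint, and the 'decoder' prefix is a hypothetical naming convention.

```python
import torch

def transfer_init(child_model, parent_state_dict):
    """Copy all parent parameters into the child model (the child's source vocabulary
    simply reuses the parent's source embedding slots, however the words line up)."""
    child_model.load_state_dict(parent_state_dict)

def l2_anchor_penalty(child_model, parent_state_dict, weight=0.01):
    """Penalize target-side (decoder) parameters for drifting from the parent values;
    source-encoder parameters get no penalty and are free to change."""
    penalty = 0.0
    for name, p in child_model.named_parameters():
        if name.startswith('decoder'):                    # assumed naming convention
            anchor = parent_state_dict[name].to(p.device)
            penalty = penalty + ((p - anchor) ** 2).sum()
    return weight * penalty

# Training step (sketch): total loss = translation loss + anchor penalty, e.g.
# loss = nll_loss(child_model(src, tgt), tgt) + l2_anchor_penalty(child_model, parent_sd)
```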
And another thing I found helpful -- again, I didn't mention this -- is using character-level stuff when you have a small amount of data, because a big issue with these models is that they overfit the data like crazy, and a majority of the parameters are coming from the word embeddings. If you can get rid of those word embeddings and move to character embeddings, it's very beneficial. That's, again, other work. So if you do kind of a convolution operation over the characters, with different filter sizes looking for, you know, n-gram matches, it helps. So that's something else I've done. >>: [inaudible] like modeling like [inaudible] embeddings together or even the -- part of the recurrent layer? >> Barret Zoph: Yeah, I certainly do think that's interesting stuff. I haven't tried that -- I just tried doing simple stuff and it worked really well, so I haven't gotten to that yet. But, no, there are certainly much smarter approaches. I just found that to work really well. >>: Would you consider multilingual embeddings [inaudible]? >> Barret Zoph: Yeah. Yeah, I mean, that's certainly an interesting thing to try. >>: Thank you. >> Barret Zoph: Thanks. [applause]