
>> Arul Menezes: So this is Barret Zoph. He is a graduating undergraduate
at USC working with Kevin Knight. So he had -- he's been working on machine
translation. He has two talks for us. One of them is -- was an EMNLP paper
last year that was a full oral presentation, and then the next one is a NAACL
paper that he has submitted about neural translation.
>> Barret Zoph: Great. Yeah, thank you for the introduction. Okay. Yeah.
So today is kind of a two-part talk, but they both kind of have an MT flavor
to them. So the first one is an EMNLP paper I had on how much information does
a human translator add to the original. And then the next talk, or the next
part of the talk is on multi-source neural network machine translation.
Okay. So the goal of this paper was we asked the question of how much
information does a human translator add to an original text, and we tried to
provide an information theoretical bound through text compression.
So we also aim to kind of provide an alternative, unambiguous way to see how
well we can identify and exploit patterns in bilingual text. And we would
hope that this has -- this has kind of advantages over other traditional
methods such as perplexity where it's very sensitive to how big your
vocabulary size is, how you deal with unknown words and various things like
that.
And we also hope to bring ideas from the compression field into natural language processing and vice versa, because there are a lot of different things that people in either field don't do that the others do.
And, yeah. And then so the -- okay. Yeah. So then if we have a source text, then we're asking the question of how many more bits of information, given that source text, are needed to specify the target text, which is a translation of the source.
Okay. So just a little background on text compression. So the main goal of text compression is to kind of exploit redundancy in human language to be able to store documents much more compactly. So essentially the more patterns you see and the better you understand the text, the better you can represent the text -- the much better you can compress it.
So, you know, a very trivial example would be imagining we just had a big file with one billion 7s. We could take this file, which would be very big, and we can presumably compress it to just a few bytes by writing the following small code, which would then, once we run this executable, reproduce the original document.
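To make that concrete, here is a minimal sketch of the idea -- an illustration, not the actual code from the talk -- where the whole "compressed" artifact is just a tiny program whose output is the original file:

```python
import sys

# The entire compressed artifact is this short program; running it
# reproduces the original document of one billion '7' characters byte for byte.
N = 1_000_000_000      # length of the original file
CHUNK = 10_000_000     # write in chunks to keep memory use small
written = 0
while written < N:
    n = min(CHUNK, N - written)
    sys.stdout.write("7" * n)
    written += n
```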
Okay. So just another example of this is like if our goal was to compress
maybe the first million digits of pi, then we could also try to write a very
short piece of code that can do this. And once again, you know, how small we
can make the program would really express our understanding of this sequence.
Okay. So yeah. So then bilingual text compression. So, you know, since text compression deals with exploiting redundancy in documents, a natural extension of this is to think about compressing bilingual documents, since, you know, ideally there would be even more redundancy.
So, yeah, the following quote from Nevill and Bell kind of shows our
motivation for this. You know, so they said from an information theoretical
point of view, accurately translated copies of the original text could be
expected to contain almost no extra information if the original text is
available.
So in principle, it should be possible to store and transmit these texts with
very little extra cost. So this is kind of like a little bit overly
optimistic, but, again, this kind of shows our motivation for approaching
this problem.
Yeah. So as we can see, the quote does kind of show our motivation, but it's clearly not as trivial as that. Here we can see, you know, a Chinese sentence and then 12 different translations of it. So clearly, you know, there is some information being added when someone translates a sentence from one language to another.
Okay. So, yeah. So then by exploiting statistical patterns in bilingual
text, we want to be able to answer the question of how much information does
a human translator add to the original. And there's many different ways you
could try to approach getting a bound like this. And we go about getting
this bound through using bilingual text compression.
And the schema used for determining a valid entry is the same as in
monolingual compression in the various benchmarks there are,
like the
[inaudible] and things like that, where a valid entry is an executable that
prints out the original bilingual text byte for byte.
So what exactly does this mean? It means that this executable should contain everything you need so that, once run, it can reproduce the text exactly byte for byte.
Yeah. So, again, as before, the top line is the kind of rule that we need to follow. And then, yeah, so then, you know, any decompression code, dictionaries, or other resources being used to help compress the text must be embedded in the executable.
So we can't make any assumptions that they have access to the Internet or various things like that. Everything needs to be put into that executable.
Okay. So let's just look at like a visual diagram to see how this process
works. So for a monolingual compression, if we have file 1, we compress it,
we get some file1.exe. And this file1.exe should be able to contain
everything needed to when we decompress it, it'd get file 1 back byte for
byte. So file1.exe should -- we're going to try to make that as small as
possible in monolingual compression.
And then in bilingual compression, we have file 1, file 2 being translations
of each other. What we do is we are compressing file 2, but we are allowing
ourself to look at file 1 in the process. So then we get some executable
file2.exe. And then when we decompress it while looking at file 1 we should
be able to get back file 2 exactly byte for byte. And now we would think
that, you know, since we have access to file 1 that we should be able to
compress this much more.
And then, yeah, going back to the title of this paper, our goal was to answer
how much information a human translator adds to the original. And we specify
this as the size of file2.exe over the size of file1.exe. So, you know,
imagine if file 2 was exactly the same as file 1, we would really need to specify nothing more, so essentially what would end up happening is the ratio would be 0. So there would be no information added.
But if file 2 was a seemingly just random file, then the compression sizes
would be about the same, leading to a hundred percent. So we know it's
probably not 0 or a hundred, but it should be somewhere in between. And we
were trying to get -- figure out where that lies, so it, you know, 10
percent, 35 percent, 70 percent or somewhere along there. Okay.
So in this paper, we use Spanish and English bitext. And what we're doing in
this is we are looking at a Spanish translation and then trying to compress
the English as much as we can.
Okay. So for the dataset we used in this paper, so we use a Spanish/English
Europarl corpus. The Spanish is left as UTF-8. And for the English side, we
removed all accent marks and further eliminated all but 95 printable ASCII
characters.
But the main goal in this is that we wanted to be able to compress the data
completely as is so this wasn't, you know, word tokenized, we didn't separate
the period off the end of the word or anything like that.
So this is just a really -- this, again, goes back to one of the original points that I made, that this could be a very good metric for evaluating how well you understand a text, because you don't have to do any kind of tokenization or anything along those lines. You're completely compressing it as is. So it's not sensitive to those kinds of things. And we don't, you know, remove any rare words or anything like that. We just leave the data as is.
Okay. Here then, yeah, here are some statistics about the data. Most important is probably the sizes, as the goal then is going to be to get that size as small as possible once compressed. Yeah, so the goal again is we're just going to try to make that as small as possible. Okay.
Okay. So yeah. So a little bit about monolingual compression. So
compression captures patterns of data and also so does language modeling.
Yeah. So the goal in compression, as previously mentioned, is we seek an executable that is as small as possible and that is able to exactly print out the original text byte for byte.
And in language modeling, we seek an executable that assigns low perplexity
to some held-out dataset. So at first glance, these might seem like very
different problems, but there's actually huge similarities between the two
once underlying algorithms are examined.
Okay. So one method that kind of links the two together -- that can take a probability distribution and have it lead to a good compression -- so if you have some text and if you have a good model, one way that you can convert this distribution into a good compression is through the very famous method of Huffman coding, which I'm sure many of you have heard of before.
So for this let's imagine we have a five-word alphabet, and these are each of
its just unigram probabilities. What we can do is we can sort them from
least frequent to most frequent. And what we do is we simply start building
a tree by merging the two smallest values and then creating its value here.
And, again, this is an available value now and then we just choose the two
smallest again, choose the two smallest again, and then we do this again.
And then we simply add 0s to all the left sides of the tree and 1s on the
right side of the tree. And then when we start at the top and we traverse
down to the tree -- traverse the tree down to one of the leaves, we then get
the code for this word.
And then what ends up happening in this method is that more frequent words
get shorter codes, which then leads to much better compression rates. So the
most frequent word gets the shortest code and things like this. So then you
can see if you have a distribution that well captures the frequency of these
characters in a document that you'll get a good compression rate.
Okay. And then another just quick note is that this thing must actually be stored in your executable, whatever you have. Because when you go to decompress, you need to be able to know what bit sequences correspond to what words, so that you can extract the text exactly byte for byte.
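As a rough illustration of the tree-building procedure he just walked through -- not the code behind the talk -- here is a small Huffman sketch over a hypothetical five-word alphabet; the probabilities are made up, since the slide's actual numbers aren't in the transcript:

```python
import heapq
import itertools

def huffman_codes(freqs):
    """Repeatedly merge the two least probable nodes, prefixing '0' on the
    left branch and '1' on the right, until one tree remains."""
    tiebreak = itertools.count()   # keeps heapq from ever comparing dicts
    heap = [(p, next(tiebreak), {w: ""}) for w, p in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, codes1 = heapq.heappop(heap)   # two smallest probabilities
        p2, _, codes2 = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in codes1.items()}
        merged.update({w: "1" + c for w, c in codes2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

# Hypothetical unigram probabilities for a five-word alphabet.
freqs = {"the": 0.40, "of": 0.25, "to": 0.15, "in": 0.12, "cat": 0.08}
print(huffman_codes(freqs))   # the most frequent word gets the shortest code
```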
Okay. But Huffman coding certainly has some downsides, theoretically, and, you know, it's pretty easy to see with a quick example. So imagine that we -- going back to this -- only have two words in our alphabet, just A and B, but it's very skewed. So the probability of A is, you know, something like 0.999 and the probability of B is 0.001.
Then going back to when we were constructing a Huffman tree, what would end up happening is we would still have to have two branches and each one would get assigned a one-bit code. So it's very bad, and you don't really get to exploit heavily skewed distributions, along with other problems coming from that too.
So arithmetic coding, this method, arithmetic coding, gets rid of this
restriction in a very beautiful kind of way. And the very nice part about
arithmetic coding is that it allows you to take any model and it allows you
to convert your good predictions into a very good compression. So
essentially what you want to have is just have the best model you can, and
then you can immediately translate this into a very good compression.
Okay. So let's see -- let's kind of walk through exactly how this method
works. And also feel free to ask questions during this if there's any -- if
you need clarification on anything or things like that.
Okay. So for arithmetic coding, let's see what we do. We produce these
context-dependent probability intervals. And each time we observe a
character, we move into that interval. Okay. And our working interval
becomes smaller and smaller, but the better our prediction, the wider it
stays.
Okay. So let's say we're going to try to compress the underscore and that
our total vocabulary only has these six characters, and our model is going to
be like a bigram character model. But of course the model could be anything.
Just for this we're going to use a bigram character model.
Okay. So then we get something that looks like this. So these are the
context-dependent probability intervals. So how arithmetic coding works is
the following. So at the start of when you're compressing, you'll have an
interval that begins at a 0 and ends at a 1. And then we have our six
different characters in our vocabulary. So we will have intervals for all of
them. And then we also have -- and then the size of these intervals
corresponds to the probability that your model is giving at this time.
Okay. So what ends up happening is is that if we're compressing the
underscore, we first -- we're going to condition on the start symbol because
we're starting or whatever. And what ends up happening is that we are
compressing T, so then we move into T's interval. So if this value here is
0.8 and this value up here is 0.95, we then move into this interval and
expand it. So now .8 is here and .95 is here.
Okay. So then next we compress H, where then we now move into this interval.
If this was .89 here, and this is .83 here. We move into this interval here.
And then next we're going to compress E. We're going to move into the range
for E. Here, here. And then these intervals keep getting smaller and
smaller as you compress the document.
And then your -- a part of -- a big chunk of the executable then for this
compression would be -- it's going to be the smallest bit sequence to lie in
this interval. Okay? And the smallest bit sequence to lie in this interval
is corresponding to this number, which is 110111.
And now an important thing to note about this is that if you have a really
great model, you'll be able to allow this interval -- these intervals to be
much bigger. So if you could better predict T at the start, hopefully you
could have like, you know, maybe, T, just the top T is here and maybe like
the bottom of T is here and at each point, so forth.
And the nice thing is is that if you have a better model that can better
predict what character is coming next, then at the end you'll be left with a
much larger range. And then if you were left with a much larger range, you
will be able to get a smaller bit sequence that is able to lie into that
range. So it's kind of like a very beautiful method that allows you to, if
you have a really good model, then you can keep better and better
compression.
And it's also nice because essentially what it allows you to do is you could
have whatever model you want and then you can just plug it in to try to get
good compression. Okay. And then so once you have this -- so, yeah, so this
would be essentially put into your executable along with the order of the
characters here.
And then what ends up happening is when you're decompressing, you would have
this bit sequence and you would essentially see a 1 first, so you would be
like, okay, so you know that it's going to be in the upper half here, you'll
see the next one here, and then you can keep narrowing down the interval.
And then you can reconstruct what text it was.
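A toy sketch of the interval narrowing he is describing might look like the following; the six-character alphabet matches the example, but the context-dependent probabilities below are made up for illustration:

```python
def narrow(interval, probs, order, symbol):
    """Move into `symbol`'s sub-interval of the current working interval.
    `order` is the agreed-upon ordering of the alphabet, which the compressor
    and decompressor must share; `probs` is the model's current distribution."""
    low, high = interval
    width = high - low
    cum = 0.0
    for ch in order:
        if ch == symbol:
            return (low + cum * width, low + (cum + probs[ch]) * width)
        cum += probs[ch]
    raise KeyError(symbol)

order = ["t", "h", "e", "_", "a", "s"]
# Made-up context-dependent distributions for compressing "the_":
steps = [
    ("t", {"t": 0.15, "h": 0.05, "e": 0.10, "_": 0.20, "a": 0.30, "s": 0.20}),
    ("h", {"t": 0.05, "h": 0.50, "e": 0.15, "_": 0.10, "a": 0.10, "s": 0.10}),
    ("e", {"t": 0.05, "h": 0.05, "e": 0.60, "_": 0.10, "a": 0.10, "s": 0.10}),
    ("_", {"t": 0.05, "h": 0.05, "e": 0.05, "_": 0.70, "a": 0.05, "s": 0.10}),
]

interval = (0.0, 1.0)
for symbol, probs in steps:
    interval = narrow(interval, probs, order, symbol)
    print(symbol, interval)
# The compressed output is (roughly) the shortest bit string whose binary
# fraction lands inside the final interval; a sharper model leaves a wider
# interval and therefore needs fewer bits.
```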
>>: So for that bit sequence 110111, some subsequence of those binary digits
comes from each of these or exactly one?
>> Barret Zoph: No, it can be -- it can definitely be more than one too. So
when you're traversing it down, it's not always you can make one character
jump per bit. You can certainly narrow down the range, and when you're
decompressing, you can figure out which characters were set there.
>>: So is it fair to say there is essentially a Huffman coding associated
with each of each interval?
>> Barret Zoph: Is there a Huffman encoding? Well, what would end up
happening was is that you would actually -- so this is like shifting
adaptively. So if you were to use like a Huffman encoding scheme, you would
then need to like store each of those individual ones. So no. So no.
There's actually not always a Huffman encoding scheme that can do that, to
answer your question.
So what ends up happening is with Huffman encoding you get an entropy bound
with like per character you produce that can be between 0 and 1 bits above
optimal, and arithmetic coding allows you to actually achieve optimal
compression. So there's not always a Huffman encoding scheme that leads to
arithmetic encoding.
>>: I guess the question is what is the corresponding binary sequence on
that number.
>>: Yes.
>>: I think it's 01 is the [inaudible] is that correct?
>> Barret Zoph: Yeah. Exactly. Yes. So yes. So this would be, you know, like -- this is just a binary -- like if you put 0., this is the binary representation for this number in decimal.
>>: But if you -- so if you had a bunch of things that were -- had .99 probability and you had ten of them in a row, how could you incorporate that in one byte, let's say, or one bit? What would that -- that would be -- because, I mean, presumably, then it would want to like -- the [inaudible] would be like encode that [inaudible] distribution, have a bunch of those. But with a higher probability, you'd want like -- you'd want like in like one bit, right?
>> Barret Zoph: Yeah.
>>: So what would -- how could you get multiple things in a single bit?
>> Barret Zoph: Well, I mean, if you did have something that was like where all of these were scrunched down and you could really go something like that big, it would just be one bit. But the only thing you would need to encode is the order of these characters such that then you could reextract it. That is all you need. So you could get something very small.
>>: Perhaps, Barret, you might want to walk through the inverse process of decoding, because that might spell out what goes on. Like if I had that long sequence of As, then I would repeatedly rescale upwards. So I could take like one bit, that would be right in the middle then. That would encode a relatively long sequence of As, right? Because the first --
>>: But how does it know when to stop? How does it know when to stop decoding when the new sequence?
>> Barret Zoph: Yeah, exactly. So there is an end of -- there is a stop character. It's just not shown here, but it --
>>: [inaudible] how do you know how many bits to consume?
>>: If you have a million -- if you have a -- yeah. If you have a
million-character thing, it wouldn't be encoded as a single -- as like a
single probability, like -- or would it?
>> Barret Zoph: Well, if you -- so if you had a million characters, right, you would keep traversing this, you'd keep using this process, and then at the end you would be left with some --
>>: Oh, so it really would be a single thing --
>> Barret Zoph: Yeah. That is what your executable is. So you could have
a -- like a 10-gigabyte executable. But, I mean, then the size of this is
just this long bitstream.
>> Barret Zoph: Okay. And yeah. Okay. So hold on. Let me just go to the -- okay. And then, yeah, so then these interval widths are changing because the context is different and also because the model -- oh, yeah.
>>: If you had only one bitstream, how susceptible are you to errors? And
if you have like one little error [inaudible] and one you could fit, could
you lose the whole text?
>> Barret Zoph: Yeah, you could, if you don't use any kind of redundancy. Yeah. So it is very sensitive to this range. Within some margin of error, sometimes you can still get it. But those are errors. You can certainly lose the whole thing at a certain point by kind of getting off into a skewed range. You certainly can. Okay.
And yeah. And then so maybe this will also clarify kind of how this is working. So also at the start we don't want to include any initial counts in the executable. So we start everything at uniform, for example. Then what ends up happening is as we keep compressing, we will update counts in the same way for the encoder and decoder.
So at each point the compressor and decompressor have the same exact
probability distributions. And this allows us to not have to store any like
big unigram counts, bigram counts or anything. We can just start it uniform.
And then after we see each character, we update the count for that, and we
just keep building the model as you compress and decompress such that the
encoder and decoder can have the same probability distribution at each time
step.
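A minimal sketch of that symmetry, using a simple adaptive unigram model with add-one smoothing (both illustrative choices, not the model actually used):

```python
from collections import defaultdict

class AdaptiveUnigram:
    """Starts uniform over a known alphabet; the compressor and decompressor
    both apply the identical update after every symbol, so their
    distributions agree at every time step and no count tables are shipped."""

    def __init__(self, alphabet):
        self.alphabet = list(alphabet)
        self.counts = defaultdict(int)
        self.total = 0

    def prob(self, ch):
        # Add-one style smoothing keeps unseen characters at nonzero mass.
        return (self.counts[ch] + 1) / (self.total + len(self.alphabet))

    def update(self, ch):
        self.counts[ch] += 1
        self.total += 1

encoder_model = AdaptiveUnigram("abc_")
decoder_model = AdaptiveUnigram("abc_")
for ch in "abba_c":
    assert encoder_model.prob(ch) == decoder_model.prob(ch)  # always in sync
    encoder_model.update(ch)   # encoder updates after coding ch
    decoder_model.update(ch)   # decoder applies the same update after decoding ch
```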
>>: So the probability distribution is per file, not per language?
>> Barret Zoph: Yeah, it's per file. And it's adaptive. So after we see a character, then it will, okay, we'll give a count to that one. And if we, for example, are using like a seven-gram context, we'll update the count for that seven-gram context, six-gram, five-gram, four-gram. So then, yeah, so in the beginning when we're compressing, we're not doing very well because we don't have a very good model. And then as we get farther along in the document, we do much better because we have a lot more counts.
And then, yeah, and then -- so then the only other thing we need to do is
since we're building up these models adaptively is that we just need to store
the total alphabet and the order of these such that the compressor and
decompressor know that. Because if you didn't know the order of these
things, then it would be like, oh, well, we would know it was an upper half,
we wouldn't even know what characters that corresponded to or things like
that.
Okay. So, yeah, then there's this method called prediction by partial match,
which is probably the most famous text compression scheme which is used like
a lot of times when you zip a file, if it detects it's like an English text
file or something like that. And, yeah, how it works is actually pretty
simple once you see this. So you can see that there's a really strong link
between compression and language modeling. So if you have a good model, we
can just feed it into this arithmetic coding scheme, and you'll get a very
good compression.
So how prediction by partial match works is you just have a nine-gram character model that is adaptively being built up as you compress and decompress the text. And it's being smoothed with Witten-Bell smoothing. That's all it is. So it's very -- it's essentially just a language model. And, yeah, we're building up the counts for the language model as we compress and decompress the document, in the picture like before.
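Roughly, the kind of adaptive, Witten-Bell-smoothed character model he is describing might be sketched like this; the model order, alphabet, and backoff details below are simplifications, not the actual PPM implementation used in the paper:

```python
from collections import defaultdict

class WittenBellCharModel:
    """Adaptive character n-gram model with Witten-Bell-style backoff.
    Counts are built up online, identically on the compressor and
    decompressor sides, so nothing has to be shipped in the executable."""

    def __init__(self, alphabet, order=3):
        self.alphabet = list(alphabet)
        self.order = order
        self.counts = defaultdict(lambda: defaultdict(int))  # counts[history][char]

    def _history(self, context):
        return context[-(self.order - 1):] if self.order > 1 else ""

    def _prob(self, history, ch):
        # Back off to a shorter history; the empty history backs off to uniform.
        base = 1.0 / len(self.alphabet) if not history else self._prob(history[1:], ch)
        table = self.counts[history]
        total = sum(table.values())
        if total == 0:
            return base
        types = len(table)              # distinct characters seen after this history
        lam = total / (total + types)   # Witten-Bell interpolation weight
        return lam * (table.get(ch, 0) / total) + (1 - lam) * base

    def prob(self, context, ch):
        return self._prob(self._history(context), ch)

    def update(self, context, ch):
        h = self._history(context)
        for k in range(len(h) + 1):     # bump counts for every suffix history
            self.counts[h[k:]][ch] += 1

model = WittenBellCharModel("abcdefghijklmnopqrstuvwxyz _.", order=3)
context = ""
for ch in "the cat sat on the mat.":
    p = model.prob(context, ch)  # this probability is what drives the arithmetic coder
    model.update(context, ch)
    context += ch
```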
So the English compressor. So what we're going to do is we're going to try to compress the English looking at the Spanish. So we're going to use the arithmetic coding scheme. And for the model we're going to now predict the probability of the Jth word given the Spanish translation and the previous English words seen so far in the English sentence.
>>: Did you mean to say words here? Because everything previously you were doing characters.
>> Barret Zoph: Yeah. So let's -- oh, yeah. Okay. So let's back up.
Let's back up. So we were doing characters before, but we could have easily
had that interval range over words. Right? There's nothing that specifies.
There's just much -- many more things there. But, yeah, so let's -- yeah,
let's -- that's an important point. So let's just back up to words here.
But these could also very well mean characters too. Yeah. So these could
very well mean characters too. Just we're kind of seeing everything we've
seen on the English side so far, and then we get to look at the whole Spanish
translation.
Yeah. So, for example, the Spanish sentence, and then, yeah, if we're doing
characters, I should like to -- and then yes, so we get to use this whole
context and all of this to produce the next.
>>: So just to fall in there, so you will have -- as you will have the 26
letters you will have the 10,000 words or hundred thousand words?
>> Barret Zoph: Yeah. If you're using words, then you have, you know, your
whole -- all the words that could possibly occur.
>>: So were you playing with the concept of saying let's use the top 8,000, and with that we know we cover 90 percent or 95 percent of the text, [inaudible] like oxymoron, used once every 10 million, 7 -- I don't care about this word, I take it out, and thereby reducing the number of words. And I assume, then, you would see an increase in the [inaudible] ratio.
>> Barret Zoph: Yeah. So, yeah, you certainly do get that. But then you won't actually be able to -- when you run that executable, you won't reextract the text byte for byte, right? Because then you won't know what words to put there. So you'll be left with some -- and our goal was to -- we're going to compress this document as is. We're not going to replace any less frequent words and stuff like that. So, yeah, for this we have to deal with everything. That's our --
>>: So you ship with your document the actual words of the document?
>> Barret Zoph: Yeah. All the unique words. And that would be in that arithmetic coding. We ship all the words and the order they're in. That's all we need. Okay.
Okay. Yeah, but so for this bilingual model [inaudible] we'll get the word
level first. So, yeah, so EJ is going to represent a word. So if we're
trying to predict the Jth English word and we know it translates to the
Spanish word -- some Spanish word, then we can probably make better
predictions. This was kind of our intuition, and I think it makes a lot of
sense. If we know what word it translates to, we can probably have a sharper
distribution for what word that is going to be.
So we initially thought that Viterbi alignments could be useful for the
bilingual compressor. So what we were going to try to do, we were just going
to get the -- for the text we're compressing, we're just going to run it
through like an HMM alignment model and we were going to extract the Viterbi
alignments, and then we were going to store these in the executable. So
we're going to take a hit for the cost because we were going to have to put
these for the compressor and decompressor to both have.
Okay. So yeah. So when decompressing, we also need to give the decompressor
the alignments. So therefore we also tried to compress the alignments to
reduce the hit we were going to take. So we tried to store the alignments in
two different formats. So the absolute format, you know, we're -- so for
this target index what source word does it align to, and then just the
relative offset being stored.
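A small illustration of the two storage formats, using a made-up alignment rather than one from the paper:

```python
# Hypothetical alignment for one sentence pair: target position j -> source position a_j.
absolute = [0, 1, 1, 3, 4, 6]

# Relative format: store the offset from the previous alignment instead of the
# absolute source index. These offsets are heavily skewed toward small values
# (+1 in particular), which is what makes them predictable and compressible.
relative = [absolute[0]] + [absolute[j] - absolute[j - 1] for j in range(1, len(absolute))]
print(relative)               # [0, 1, 0, 2, 1, 2]

# The transformation is exactly reversible:
recovered, prev = [], 0
for k, off in enumerate(relative):
    prev = off if k == 0 else prev + off
    recovered.append(prev)
assert recovered == absolute
```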
Okay. So then we looked at, you know, how compressible are these alignments,
can you actually kind of predict better than random what alignment is going
to come based on the previous alignment. And you actually can pretty well.
So if we're looking at the relative alignments here, so what's the
probability of getting a plus one alignment, given you've just seen like a
minus two alignment. You can actually -- you can predict the stuff pretty
well actually. So if you've had -- if you've just predicted a previous two,
what's the probability now you're going to predict a three in this case.
And so we look at the sizes here of the uncompressed alignments just being
stored in the plain ASCII. We then use -- we then try to encode them using
like the Huffman encoding method. We also tried to compress them using the
prediction by partial match method.
And even when we compressed them down, we got a size of 12.4 megabytes which
was way too large for us to be using them. So we ended up not using them.
We found that the 12.4 megabytes was way too large. Because in the end we
end up getting down to -- in like the 30s, 30 megabyte size for everything.
So this was way too big. Even in our initial experiments.
>>: But did you -- I mean, before you tried to compress the alignments, did you find whether they were useful at all in compressing [inaudible]?
>> Barret Zoph: So we -- in like some preliminary experiments, yeah, we did
find that it would help. So this actually leads to my next point here, which
was that since these alignments were so big we were just going to try to
build up a probabilistic dictionary on the fly. So we were going to try to
learn these alignments as we were compressing the text.
So, yeah, so we were going to try to use like a T table from one of the IBM
or HMM alignment models to help us kind of get these alignments on the fly so
we wouldn't have to precompute anything.
So we can kind of build this model from the -- yeah, the IBM model kind of
gives us this probability distribution we want of the probability of the
English sentence given the French with this, with the following [inaudible]
from IBM model 2. So first we would choose a length for the English
sentence, conditioned on the Spanish sentence length, and then for each
target index we choose an alignment with some probability.
And then for each of those, based on those alignments we then choose an
English word corresponding to what Spanish word it correspond to in the
alignment.
And then, yeah, again, so instead of having to do, you know, like the whole idea of running IBM model 1 for five iterations over the dataset and then running the HMM alignment model for five iterations over the dataset, what we were going to try to do is compute these alignments in a single pass over the
data. So what the idea would be is that in the beginning we start with, you
know, just like uniform, and then as we get counts we'll -- as we keep
compressing the document, we'll keep building up better and better counts to
kind of get better and better alignments as we go.
And then, yeah, again, we build up these models exactly the same for the
compressor and decompressor. So we can get those same arithmetic coding
interval ranges. And since we cannot use the standard EM algorithm of doing,
you know, multiple passes over the dataset, we use online EM which is we use
kind of our own variant from Percy Liang's paper in 2009.
So the probabilities are updated after each sentence pair. So we kind of are storing these kind of large tables of expected counts. In the beginning, we have nothing in there. And then as we see the first sentence, we'll keep updating counts after each sentence pair.
Yeah. So, again, like the process here, so we start with some uniform translation model probabilities. We use EM to collect expected counts over a sentence pair. And then we update the probabilities after each sentence pair. So we'll keep getting --
>>: So on the previous slide it sort of implied that you were starting with the model 1 dictionary or not.
>> Barret Zoph: No, I'm not.
>>: Just put in this --
>> Barret Zoph: No, no, no. So yeah. So sorry. So sorry if this was
confusing. So I'm saying in the previous approaches where you want to use
like the HMM alignments, for example, you have to initialize with five
iterations of IBM model 1, then run five iterations of the HMM alignment
model.
In this, we're just going to do a single pass with the HMM alignment model.
So we're foregoing those previous four iterations and the five iterations of
the IBM model 1. So this is completely online. So we're starting with
nothing.
>>: What are you really trying to optimize here? Is it -- so there is some
big competition that you want to do, and then you [inaudible] and you're
trying to compress English [inaudible].
>> Barret Zoph: Our goal is --
>>: [inaudible] that is still the goal.
>> Barret Zoph: Yes.
>>: Mm-hmm. So why do you care about [inaudible] whether you do it in [inaudible]?
>> Barret Zoph: Yes. Okay. So, for example, let's say why do we care,
right? So maybe we had this, this English dataset we're trying to compress
along with the Spanish. So we run five iterations of IBM model 1 and five
iterations of IBM model 5 and get those alignments.
>>: Right.
>> Barret Zoph: But then the issue is that we would have to store these in the executable.
>>: All the iterations.
>> Barret Zoph: Whatever the final Viterbi alignments.
>>: Right.
>> Barret Zoph: And that's going to be too big.
>>: Okay. You have to store those in the [inaudible].
>> Barret Zoph: Right. Because our goal is to make that executable as small as possible.
>>: Okay.
>>: It depends largely on the -- on the size of the data that you have. If you had like 10 billion words of data, then --
>> Barret Zoph: Yeah, then it might -- then it certainly might do that.
>>: But you have, what, 50 million words that you --
>> Barret Zoph: Yeah, I think it's -- yeah, it's roughly -- yeah.
>>: So but even that, if you stored like -- because like the IBM alignments are sparse or, yeah, word alignments are sparse, right, because most of them are going to be [inaudible]. So if we just stored the head of each -- you know, the top five [inaudible] the top 50,000 words, that would still be -- you could -- like I think from probably a megabyte you could get a lot of information, right?
>> Barret Zoph: Yeah. I mean, certainly. That's not one of the things we really explored, but as we'll see later, we actually kind of get everything we need in the online approach. So after we kind of realized this was too -- so kind of the thought process of me when doing this was I was like, okay, let me see if I can just use the full alignments. And then I found that was too big, so I was just thinking let me see if I can do everything in an online setting, which we found out that we actually could.
So, yeah, and then going back to here, we do have, yeah, about like 50
million word tokens. Okay?
>>: And in the online setting you will eventually not have to solve for the Viterbi alignments, you just [inaudible] because you --
>> Barret Zoph: Yeah. So we're building them up in the -- so we're building
them up as the model is compressing. So in the beginning, our Viterbi
alignments will be horrible because we barely see anything, but as we get,
you know, like a quarter of the way through the dataset and we built up a lot
of counts from seeing examples and its pairs from previously what we've been
compressing, so then we'll have good parameters at that point.
>>: And what gets stored in the final file that gets --
>> Barret Zoph: Oh. So nothing, actually.
>>: Nothing, okay.
>> Barret Zoph: Nothing. Because we start up -- so the compressor and
decompressor start at uniform. Then after we've compressed and decompressed
the first sentence, we'll have those counts then. And they'll have the same
counts at each time step. Okay?
Yeah. So then another interesting point is that unlike in batch EM, we don't
need to store separate count and probability tables. We only store counts
from EM and then we just compute the probabilities we need whenever we need
them. Okay?
And, yeah, these count tables that I was just talking about, so, yeah, we
just store the counts like for this, the counts for F, just like in normal
EM, but we never store any probabilities. We just keep accumulating. After
each sentence pair, we just keep accumulating in these counts. Okay?
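A sketch of that bookkeeping, simplified down to an IBM Model 1-style t-table rather than the full HMM model; the smoothing constant and the sentence pairs below are made-up placeholders:

```python
from collections import defaultdict

class OnlineTTable:
    """Single-pass, online-EM style t-table: only expected counts are stored,
    and translation probabilities are renormalized from them on demand,
    so no separate probability table ever exists."""

    def __init__(self, smoothing=0.1):
        self.counts = defaultdict(lambda: defaultdict(float))  # counts[f][e]
        self.smoothing = smoothing

    def t(self, e, f):
        row = self.counts[f]
        total = sum(row.values())
        seen = max(len(row), 1)
        # Smoothed on-demand normalization; effectively uniform before any counts arrive.
        return (row.get(e, 0.0) + self.smoothing) / (total + self.smoothing * seen)

    def update(self, eng_words, spa_words):
        """E-step over one sentence pair, folded straight into the running counts."""
        for e in eng_words:
            weights = [self.t(e, f) for f in spa_words]
            z = sum(weights)
            for f, w in zip(spa_words, weights):
                self.counts[f][e] += w / z      # fractional (expected) count

table = OnlineTTable()
pairs = [("the house".split(), "la casa".split()),
         ("the dog".split(), "el perro".split()),
         ("the house is green".split(), "la casa es verde".split())]
for eng, spa in pairs:                  # counts sharpen as more pairs stream by
    table.update(eng, spa)
print(table.t("house", "casa"))
```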
And so in batch EM, we typically initialize the T table with a few iterations
of IBM model 1 and also sometimes model 2. So the issue was that -- so when
you kind of initialize the alignment model with IBM model 1 and IBM model 2
you kind of don't have to deal with this next issue that in online EM the A
table learns that everything should align to the null word, which was an
issue for us because -- yeah, which is an issue.
But actually you can get around this actually very nicely by just heavily
smoothing the A table. So the lambda A parameter we set to be a hundred.
And this kind of doesn't -- allows it to not make sharp decisions until we've
kind of seen a lot more stuff. And this actually works really well and this
kind of gets rid of that issue, which is pretty nice.
>>: [inaudible] that's tiny, though, you could pass the A table -- you can learn a really robust A table and then compress it into like a hundred bytes, right?
>> Barret Zoph: No, yeah, that is actually true, but we actually -- we tried
two different things. We actually tried storing it and then learning it.
And we actually found that it does actually about the same, too. But, yeah, the A table is small. The big thing of course is we have this huge T table. But, yeah, you can get around both of these. You can just learn them on the fly pretty easily.
And I think also another just high-level benefit of this is that if you
actually wanted to train out one of these alignment models, you actually
don't need to run all these previous iterations of everything. You can just
get it on the fly in one pass, which is something I'll talk about a little
later. Okay.
So what we ended up using for this adaptive bilingual model is we just used a
HMM alignment model where then, yeah, we would only do a single pass over the
dataset. So, yeah, there's a couple different variants of the HMM alignment
model. So for this one what we do is we choose the English length with some
probability epsilon. And then for each target word, we set the alignment to
null with some probability P1.
And we choose a non-null -- or we choose a non-null probability with 1 minus
P1 times the relative offset of the previous alignment. Like some parameter
for the relative offset of the previous alignment. And then based on that we
choose a probability from the T table. Okay?
And then also in compressing we must predict English incrementally before
seeing the whole English string. Okay? So we also must model when the
English sentence stops, which is pretty straightforward from the model here.
We just also just wanted to mention that we also do have to model in the stop
probability.
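Roughly, scoring one target word under the generative story he just described might look like the sketch below; the parameter names, dictionaries, and numbers are placeholders standing in for the learned tables, not values from the paper:

```python
def alignment_word_prob(a_j, a_prev, e_j, f, p_null, offset_table, t_table):
    """Either align to NULL with probability p_null, or pay (1 - p_null)
    times a relative-offset probability for the jump from the previous
    alignment, then multiply in the t-table probability of the English
    word given the chosen Spanish word."""
    if a_j is None:                                  # align to the NULL word
        return p_null * t_table.get((e_j, None), 1e-6)
    jump = a_j - (a_prev if a_prev is not None else 0)
    p_jump = (1.0 - p_null) * offset_table.get(jump, 1e-6)
    return p_jump * t_table.get((e_j, f[a_j]), 1e-6)

# Tiny made-up example:
f = ["la", "casa", "verde"]
offset_table = {-1: 0.1, 0: 0.2, 1: 0.5, 2: 0.2}    # heavily smoothed in practice
t_table = {("the", "la"): 0.6, ("house", "casa"): 0.8, ("green", "verde"): 0.7}
print(alignment_word_prob(1, 0, "house", f, p_null=0.2,
                          offset_table=offset_table, t_table=t_table))
```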
>>: So effectively you interpolate all of the -- you -- because you don't actually know which word it will answer but you say -- so you have a probability distribution of all the things it aligns to and then you take the actual -- the translation [inaudible] interpolate them, weight it, and then it's -- then you get a single distribution for all the English words and that's what you --
>> Barret Zoph: Yeah. Yeah, we kind of get this big -- you can think of it
as kind of like a big lattice where you have each target word up here and
then all the different possible source words. And then at each point the HMM
alignment model gives us a distribution for over each of those points. And
then you can renormalize to get a valid distribution.
>>: But that doesn't -- does it encode them separately, or does it encode
them in a single just over English? Do you interpolate them, just do linear
interpolation and then get a single distribution, or do you interpolate -- do
you have basically two different [inaudible] would actually jump to and then
encode what target word from that I should [inaudible] make a difference?
>> Barret Zoph: No, we just -- we just -- I guess, I'm sorry, I'm not really
understanding your question.
>>: Because you could encode -- you could encode the probability distribution of which [inaudible] should I jump to and then encode given -- and then -- then you decompress that --
>> Barret Zoph: Oh, I see what you're saying.
>>: -- deterministically --
>> Barret Zoph: We just do it -- we just do it at the same time.
>>: Okay.
>> Barret Zoph: Yeah.
>>: You would recover the Viterbi alignment when you're decoding, or do you
marginalize out the alignment?
>> Barret Zoph: We just marginalize out the alignment, so we don't actually do that.
>>: Okay.
>> Barret Zoph: Okay? So then some training issues for this. So, again,
we're not able to initialize the HMM alignment model with anything, so we do
run into two issues. One is that EM sets the probability of aligning to null
way too high. So, yeah, so it kind of wants to align everything to null.
And then also the EM learns that the relative offset of zero is way too high,
so you kind of get situations where, you know, if you've seen a 2 you just
kind of keep translating a 2; you get a 5, you keep translating a 5 and
things like that.
So to solve this, it's, again, a pretty simple fix. What we do is we simply
just fix the probability of aligning to null so we don't allow that to be a
learnable parameter in the EM framework. And then we also -- and then we
heavily smooth the O table now instead of, you know, an A table. And we find
that this fixes the issue.
Okay. So then what we were going to do next is, okay, so now we've kind of got this adaptively training HMM model, so we wanted to inspect, you know, how good is this model doing, you know, in just a general sense. So we wanted to look at the alignments we got from it. Okay?
So, okay, so what's going on here? So we're comparing against two different
things. We have silver alignment. We have like a silver standard and a gold
standard. So for the silver standard, we're just simply comparing against a
batch HMM alignment model that was run first with five iterations of IBM
model 1, five iterations of IBM model 2, and then five iterations of the HMM
alignment model.
And then we also have a set of -- for the gold standard, we also have 334
human-aligned sentences. And it's bidirectional. Okay. And then to be able
to compare with this, we simply run the HMM, the online HMM alignment model
in both directions and then use like a [inaudible] final type thing to merge
them together. Okay.
So then we can see a couple things here, is that so the online EM reproduces
90 percent of the links from batch EM. So it actually does pretty well. And
it also -- it also really surprisingly matches human alignments as well as
batch. So this online alignment model does really well actually, which is
very cool for us.
Okay. And then we also see that the alignments are better in the second half
of the corpus than in the first half of the corpus, which would be kind of
expected as we're building up the counts as we go.
And then another thing we tried was we wanted to say like, hey, so since
we're building this model up in an online setting, maybe if we put, you know,
the shorter very, you know, easy sentences in the beginning, maybe this could
help hone into better parameters in the kind of -- in our online one-pass
approach.
So we do see that putting shorter sentences at the top did help online EM a little bit. And then we also got the alignments and then fed them into like a Moses pipeline to see how they would match. And you could see that they -- they're very comparable, which is pretty cool. Okay.
And, yeah, so then just the quick note on this is because we now have this
HMM alignment model that's expecting words. And since our goal was that
we're going to compress the data as is, this of course runs into issues
because now, you know, you have the commas attached to words, you have
periods attached to words and things of the like.
So then we wanted to kind of come up with a tokenization scheme that we could uniquely reverse exactly. So we kind of came up with this thing. It's a pretty standard approach that I'll just go over briefly. So what we do is we identify words as subsequences of all lowercase, all uppercase, numbers, and then any other symbols.
And what we do is we append the number of spaces following it. So, for example, this string is directly attached to the hyphen, so it gets an @0 attached to it. This guy has a space after it, so it gets an @1 attached to it, and so forth. Then what we do is we remove all of the @1s, because we would be assuming that, you know, in a corpus most things have one space next to them. And this still makes it uniquely reversible.
And then finally what we do is we move any suffix of an alphanumeric word to become part of a prefix of a nonalphanumeric word. So you get something like this. So here's an alphanumeric word here that we simply just take and append to the nonalphanumeric one, if there's one in front.
So the whole idea is that you would hope that, you know, common things like
period often have this kind of deal where now we're like the period will
always have like an @ something, period @ something. So you kind of get
these more common tokens. So yeah. Okay.
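A simplified sketch of just the space-marker part of this scheme; the regex and the "@k" marker syntax are assumptions about the details, and the final suffix-merging step is left out:

```python
import re

# Split a raw line into runs of lowercase / uppercase / digits / other symbols / spaces,
# append "@k" where k is the number of spaces following the token, and drop the
# very common "@1" marker. The mapping stays exactly reversible.
TOKEN = re.compile(r"[a-z]+|[A-Z]+|[0-9]+|[^a-zA-Z0-9 ]+| +")

def tokenize(line):
    out, pieces, i = [], TOKEN.findall(line), 0
    while i < len(pieces):
        tok, spaces = pieces[i], 0
        if i + 1 < len(pieces) and pieces[i + 1].isspace():
            spaces = len(pieces[i + 1])
            i += 1
        out.append(tok if spaces == 1 else f"{tok}@{spaces}")
        i += 1
    return out

def detokenize(tokens):
    out = []
    for tok in tokens:
        word, sep, k = tok.rpartition("@")
        if sep and k.isdigit():
            out.append(word + " " * int(k))   # explicit space count
        else:
            out.append(tok + " ")             # the removed, implicit "@1"
    return "".join(out)

s = "state-of-the-art systems,  e.g. MT"
assert detokenize(tokenize(s)) == s
print(tokenize(s))
```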
So, yeah, so under the previous tokenization, under the previous tokenization
scheme we now ask our translation model to give us a probability distribution
over possible next words, like, Jacob, you were saying. So now the
translation model knows the entire source sentence and the target words seen
so far. So what we do is, yeah, we kind of compute this prediction lattice
for this HMM alignment model. That gives us a distribution over different
source alignment positions.
And also, what we found very helpful is that we weight each predicted word in the HMM alignment lattice with a bigram word model. We found this to be very helpful. Okay.
And so before we were working in the word space, and then, you know, back in
that arithmetic coding example it was characters, and we actually wanted to
go back into characters again, so how do we kind of take our prediction from
words into characters and we -- it's pretty simple. We just combine the
scores for words that share the same next character.
So imagine when we're compressing or decompressing we have seen the C so far
and we have a distribution over these four words, for example. Then what we
do is, okay, so A and A show here, so we simply merge these two probabilities
here with this probability here with this probability here and then
renormalize them to 1. So it's a pretty straightforward way to take this
distribution over words into characters. Okay.
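A small sketch of that collapsing step, with a made-up next-word distribution:

```python
from collections import defaultdict

def next_char_distribution(word_probs, prefix):
    """Collapse a distribution over whole words into a distribution over the
    next character: sum the scores of all words sharing the same character
    right after the already-seen prefix, then renormalize."""
    char_mass = defaultdict(float)
    for word, p in word_probs.items():
        if word.startswith(prefix) and len(word) > len(prefix):
            char_mass[word[len(prefix)]] += p
    z = sum(char_mass.values())
    return {ch: p / z for ch, p in char_mass.items()} if z > 0 else {}

# Hypothetical next-word distribution after having seen the character "c":
word_probs = {"cat": 0.4, "can": 0.3, "cone": 0.2, "dog": 0.1}
print(next_char_distribution(word_probs, "c"))
# -> roughly {'a': 0.78, 'o': 0.22}: "cat" and "can" merge onto the same next character
```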
Okay. So and then what we also wanted to do was we also want to be able to
interpolate this PPM prediction over characters with the HMM's character
prediction. So we get something like this. So if we're predicting the next
character, it's like a mixture model with the PPM prediction and this HMM
alignment prediction. Okay?
And what we do is we dynamically adjust this interpolation weight for each
context. Okay? And we actually found this to work remarkably well in
practice. So what we do is we take the max of PPM over the max of PPM plus the max of HMM. Okay. And now this max operator is the following. So PPM will have a distribution over all of its characters. And then max is simply the highest probability it assigns to any of those characters.
So ideally this kind of intuitively is like, you know, if one is very
confident in its prediction, then, well, let's listen to that more. And then
we did -- we did kind of put like a -- kind of like a little exponentiation
factor there. But it wasn't too sensitive to that. We just got a small
little boost doing that.
But overall like this max works very well. It actually worked just as well
as kind of training a logistic unit based on different contexts to kind of
predict. We found that this worked almost just -- basically just as well and
was much simpler. So this was very, very cool.
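The confidence-based mixing he describes might be sketched like this, with gamma standing in for the small exponentiation factor he mentions (its actual value isn't given in the talk):

```python
def mix_predictions(ppm_dist, hmm_dist, gamma=1.0):
    """Interpolate the PPM and HMM character distributions with a weight set
    by whichever model is more confident:
        lam = max(PPM)^gamma / (max(PPM)^gamma + max(HMM)^gamma)."""
    m_ppm = max(ppm_dist.values()) ** gamma
    m_hmm = max(hmm_dist.values()) ** gamma
    lam = m_ppm / (m_ppm + m_hmm)
    chars = set(ppm_dist) | set(hmm_dist)
    mixed = {c: lam * ppm_dist.get(c, 0.0) + (1 - lam) * hmm_dist.get(c, 0.0)
             for c in chars}
    z = sum(mixed.values())
    return {c: p / z for c, p in mixed.items()}

# Made-up distributions: the HMM is very confident here, so it gets more weight.
ppm = {"a": 0.40, "o": 0.35, "e": 0.25}
hmm = {"a": 0.90, "o": 0.05, "e": 0.05}
print(mix_predictions(ppm, hmm))
```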
Okay. So now for some of the results. So first what we did was we had the
Spanish. We compress it. We get some Spanish executable. That should then
be able to extract the Spanish byte for byte. And then what we looked at was
we just looked at PPM, which was our monolingual compressor. We looked at
how well could this just compress the Spanish. Okay.
So what we see is that the uncompressed Spanish file is 324.9 megabytes.
Using the Huffman encoding scheme it's 172.8 megabytes. And then using PPM
it is 51.4 megabytes. So yeah. So then we get the compression rates here,
and then BPB is just the bits per byte for our compression. And this could
also be -- so also another common thing is maybe to see bits per character.
But for this example, there could also be unicode. We wanted to deal with the cases where the characters were unicode, which can be more than one byte. So bits per byte was just the easiest thing to do.
Okay. Okay. So, then, then now what we're doing is is that we're going to
now be compressing the English while looking at the Spanish to get this
English executable that when run, and it can look at the Spanish, can
reproduce the English byte for byte. Okay.
So, yeah, so then -- okay. So we see that the uncompressed English file here is 294 megabytes. After compressing it with Huffman encoding, we get 160.7 megabytes. And using PPM we get 48.5. And then using our bilingual method we get 35.0 megabytes. So we do certainly get a nice boost in compression using this HMM alignment model.
>>: But don't you include the entire compressed Spanish?
>> Barret Zoph: Yeah, we do actually. So that's the great -- that's
actually the next slide. But this is just looking at the English. Yeah, we
certainly do. I completely agree with you. Yeah, going on to this, where this size is simply that 35.0 you've seen before, and then adding also the size of the compressed Spanish, we get an 86.4. So we get a 15.2 percent compression rate improvement from looking at the Spanish.
And then there was also related work on this where these guys, they didn't actually have kind of a bilingual model, but they simply interleaved the Spanish and English words and then compressed them that way. And they only get a 7 percent improvement over their compression, and we got a 15.2 percent improvement.
And then going back to the main title of the paper, how much information does a human translator add over the original, we provide an upper bound saying that a human translator adds at most 68.1 percent, which is 35.0, the size of the compressed English, over 51.4, the size of the compressed Spanish, of the information that the original author produces.
And what we also decided to do was we ran this kind of bilingual Shannon game that was able to give us a looser upper bound. And we found that actually humans are only adding roughly 32 percent using this. So we're still a long way off from being able to match kind of what humans can do in this scenario.
>>: I kind of hesitate to ask something because I want you to talk about
your second part of the -- let you go to your second part, but it does feel
like there's a mismatch here, right? Because we don't -- in the compression
scheme we expect that the model knows absolutely nothing about the
correspondence between Spanish and English when it starts. And in the
Shannon game at the bottom, I assume you're not like letting people learn
Spanish as they try to predict what [inaudible].
>> Barret Zoph:
No, I mean, that is certainly --
>>: They have this huge amount of knowledge that's encoded already in their
brain, and so it feels like these are two fundamentally different questions.
Like one of them is how well can I compress bitexts and the other is what
kind of information are the humans having. And it feels like you've focused
maybe more on the first question. Does that seem fair?
>> Barret Zoph: No, I mean, that certainly does seem fair, but it's also -- it was kind of -- it was kind of, you know, a first step towards trying to make something like this where it's very hard to kind of formalize like a rigorous thing to answer that.
But, no, I completely do agree with you, I mean, right? So humans certainly
do have all of this other knowledge stored. I mean, yeah, it's one of the
fundamental really issues here. Okay.
And that, you know, yeah, so then just concluding, we do obtain a
quantitative bound. And then we also do provide an alternative, completely
unambiguous way to kind of identify how well we can exploit and identify
various patterns of bilingual text. We think that, you know, compression is
certainly a good way to do this as it's certainly unambiguous. Okay.
And then -- yeah. And then just some future and ongoing work that we're currently doing is certainly using better predictive translation modeling instead of the HMM alignment model. That came out in like 1996, and these models have improved a lot since then. So ideally we'd exploit better bilingual modeling and various things like that. Okay.
Yeah. Okay. So, yeah, so that was kind of the first part. I can take
questions on that now if there are any, or we can just keep going --
>>: Short on time, so --
>> Barret Zoph: Okay. And then questions, okay.
>>: Got the room until noon. I doubt there's anybody right after us, but people may need to --
>>: I need to go at 12:00 sharp.
>> Barret Zoph: Okay. All right. Sure. Okay. So then the second part of
this is on multi-source neural translation. So this is joint work with me
and Kevin Knight at ISI. Okay. So let's take a look at this.
Okay. So the goal of this work was that can we build a neural network
machine translation system that can exploit trilingual text. Because when
you're building these systems, a lot of times you have trilingual text, you
know, there have been other methods that have exploited having bilingual text
in a lot of different languages but can we exploit one that -- can we build a
joint model that can take in trilingual text.
So, yeah, we try to model, you know, for example, the probability of an English sentence given a French and German sentence, instead of the probability of an English sentence given a French sentence like in a normal neural network machine translation system.
And so we present two novel multi-source non-attention models along with one multi-source attention model, and we show very large Bleu improvements. So we get 4.8 Bleu points over a very strong single-source attention baseline that was -- I think it's currently the state of the art on a lot of the different WMT benchmarks -- so these were very strong results from using trilingual data.
And we also observed something interesting, that the model does much, much
better when the two source languages are more distant from each other. Okay.
So just a little brief intro on multi-source machine translation. People
have certainly worked on this before, starting probably with Franz Och, you
know, and so Kay 2000 points out that if a document is translated once, it is
likely to be translated again and again into other languages. So like a
certain way that this could come about is that a human does the first
translation by hand and then turns the rest over to an MT system that now has
access to this trilingual data.
And ideally a translation system -- so the translation system will have two strings as input. And you can reduce the ambiguity of these two strings using a method called triangulation. This was Kay's term. And you can see in this example down here how having two source translations in different languages could certainly help disambiguate each other.
So, for example, the English word bank might not be so easily translated into French, for example, because bank, you know, could be like a river bank or an actual bank you go into or something like that. And if you have like a second German input string that contains this word, which means river bank, for example, then you could certainly correctly translate bank because of this. Okay.
So in the standard neural network machine translation setting, we have
bilingual data, which -- so English and French to typically model the
probability of an English sentence given a French sentence. And so we use an
attention and non-attention baseline. And so for our non-attention neural
network machine translation's model baseline, we use the -- kind of like the
stacked multilayer LSTM approach from Sutskever, et al.
And just as a quick review of this kind of -- this neural network machine translation approach using multilayer LSTMs, so what ends up happening is that if we're translating this English string into this French something, what we have is we first feed in "the" at the first time step into this four-layer LSTM, where each of these boxes is an LSTM. We then feed in "dog".
And, again, then it's passing its, you know -- this vector of real valued
hidden states that it's using to represent what's going on so far. We keep
eating it up, and then we feed in some stop symbol to let it know that now it
should start producing the target text.
We keep going, keep going, until we hit some end of sentence symbol. And
then once the -- and the model is trained jointly on sentence pairs using
maximum likelihood estimation, and then for -- at decoding time you simply
use a greedy beam search to get out the target translation. So very simple
greedy beam decoder. Much simpler than, you know, like many of the other
hard to write decoders in MT. Okay.
So, yeah, then again each box here is an LSTM, although there's various
different choices for this. Like you could use like a GRU or the other boxes
that can deal with vanishing gradients well.
So then our idea is that, you know, how can we build a new model to be able
to exploit the trilingual data. So we do something like this where we now
have two source encoders, so A B C, for example, is a sentence in English. Then we have I J K, which is a sentence in
German. And we feed these jointly into the target decoder where it then
wants to produce W X Y Z in some third language. Okay.
And this is the same as before where these are LSTMs. These are LSTMs.
These are LSTMs. Except now we have these black boxes, which in this paper are something we call a combiner block, whose job is to take the representations getting sent from each encoder and combine them in a nice way such that it's a more useful representation for the target decoder.
Okay.
Yeah, and then, once again, I'll -- yeah, like I said before.
Okay.
So then a quick aside on the LSTM block. I mean, essentially all the LSTM does is take in this thing called a hidden state and a cell state, which are two vectors of real numbers of the same dimension, and it just outputs again two new vectors of the same dimension.
And then the internals are certainly, you know, complicated. But essentially that's all it's really doing: it takes in a hidden state and a cell state, it goes through this LSTM mechanism, and you get out a new hidden state and cell state. And it's fully differentiable. And these were being shown as a single arrow before, just for brevity, but there are really two things being passed along.
So if we go look at this picture again, if we just look, for example, at the
top layer here, there is one hidden state and cell state being passed to
here. There is another hidden state and cell state being passed to here.
And now the goal of the combiner block is to take in both these hidden states and both these cell states and combine them together in some way so as to produce a single hidden and cell state to be sent to the target decoder. Okay.
So, yeah, again, so a combiner block is simply a function that takes in these
two hidden states and two cell states at each layer and it outputs a single
hidden and cell state.
And what we did was we tried a few different methods for combining them to
see what would maybe work best, the first being the -- what we called a basic
combination method and the next being called the child-sum method. Okay.
Okay. So for the basic method, what we do is we simply just concatenate both
the hidden states, which is what this semicolon represents here, and then we
apply a linear transformation. And then we send the output through some
nonlinearity which we chose to be tanh here.
And then for the cell values, we simply just add them together. We tried
various other things, like, for example, doing a linear combination, but we
found that the training diverges because when you're training these neural MT
systems, if these cell values get too big, like it leads to kind of bad
training.
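(As a rough sketch, with toy sizes and hypothetical names, the basic combination method described here looks something like this:)

    import numpy as np

    H = 4                                   # toy hidden size
    rng = np.random.default_rng(0)
    W_comb = rng.normal(size=(H, 2*H))      # the single learned matrix of the basic method

    def basic_combiner(h1, c1, h2, c2):
        # Concatenate the two hidden states, apply a linear map, squash with tanh;
        # simply add the two cell states (a learned combination of cells tended to diverge).
        h = np.tanh(W_comb @ np.concatenate([h1, h2]))
        c = c1 + c2
        return h, c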
Okay. And we also tried a second method, which is certainly much more
complicated. It's just the -- which we called the child-sum method. And
it's kind of like an LSTM variant that combines the two hidden states and the
cells, and it was based off of the tree LSTM. So you can imagine how you can
think of this as maybe a tree that has two children and now let's combine
them to one.
So, yeah, right here. So we essentially tried this and this method again.
It uses eight new matrices, which you learn. And the previous method just
had one right here. Yeah. And then, again, you know, you send it through
these internal dynamics, like right here. It's essentially an LSTM that now takes in two cell values and has two different [inaudible] values. And that's essentially what's going on here. Okay.
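(Here is a rough sketch of a child-sum-style combiner with eight learned matrices, in the spirit of the Tree-LSTM; the exact gating used in the paper may differ, and the names and initialization here are hypothetical.)

    import numpy as np

    H = 4
    rng = np.random.default_rng(0)
    # Eight learned matrices (hypothetical initialization), one per gate per child.
    W = {k: rng.normal(size=(H, H)) for k in
         ["i1", "i2", "f1", "f2", "o1", "o2", "u1", "u2"]}
    sigm = lambda x: 1.0 / (1.0 + np.exp(-x))

    def child_sum_combiner(h1, c1, h2, c2):
        # LSTM-like gating over the two children's hidden and cell states.
        i = sigm(W["i1"] @ h1 + W["i2"] @ h2)      # input gate
        f = sigm(W["f1"] @ h1 + W["f2"] @ h2)      # forget gate applied to the summed cells
        o = sigm(W["o1"] @ h1 + W["o2"] @ h2)      # output gate
        u = np.tanh(W["u1"] @ h1 + W["u2"] @ h2)   # candidate cell update
        c = i * u + f * (c1 + c2)
        h = o * np.tanh(c)
        return h, c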
So then for our neural network machine translation single source attention
model, we base ours off of the local attention model in Luong, et al., that was at EMNLP this year. Essentially the idea behind these attention models is that when you're training these neural MT systems there's a huge bottleneck, because essentially what you're asking the model to do is eat up this entire source sentence and then feed just one vector representation into the decoder, and that's pretty obviously a huge bottleneck.
The idea with attention is that at each step on the target side we have access to the hidden states on the source side. That's the idea behind this. So at each time step we get to kind of take a look at everything on the source side. Okay.
So then here's just a quick overview of how this local-p attention model from Luong et al. works, because we do build on its internals too, so it's useful to understand how it works.
So at each time step, you know, on the decoder side you have this hidden
state, and what happens next is you predict this value -- you calculate this
value PT, which is the following, okay, where this is the source sentence
length, right here. This is just a scalar value, so in this case it would be 6 -- 1, 2, 3, 4, 5, 6. It gets sent through a sigmoid. Then these two things are just learnable parameters that you learn along with your model.
Okay.
And so now note that, since the sigmoid is between 0 and 1, you'll be able to predict a position between 0 and the source sentence length. Okay.
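(In other words, something like the following, where v_p and W_p stand in for those two learnable parameters; sizes and initialization are toy values.)

    import numpy as np

    H = 4                                     # toy hidden size
    rng = np.random.default_rng(0)
    v_p, W_p = rng.normal(size=H), rng.normal(size=(H, H))   # the two learned parameters
    sigm = lambda x: 1.0 / (1.0 + np.exp(-x))

    def predict_position(h_t, S):
        # S is the source sentence length; the sigmoid keeps the result in (0, S).
        return S * sigm(v_p @ np.tanh(W_p @ h_t))

    print(predict_position(rng.normal(size=H), 6))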
So now what we do is we look at the source hidden states in the range pt minus D to pt plus D, where D is a window size. And note in this scheme we're only looking at the
top-level hidden state. So if you could imagine that you have your stacked
LSTMs below this and we're just looking at the top layer here.
So what we do now is we compute an alignment score with each of these source hidden states. Right? So what we end up doing is, if you consider that we have an HS coming from each of these source positions, which is just this top-level hidden state, we compute an alignment score, which will be between 0 and 1, for every single position here.
And let's see how this is calculated. You have this align function here, which essentially just computes a score between your hidden state here and each of these source hidden states, with WA being some learnable parameter. And then you simply normalize these across all the ones you're looking at to get values between 0 and 1.
And then you add this little extra term to favor positions close to where PT predicted, and this also makes the model fully differentiable, so you don't have to use anything like reinforcement learning or trickier methods to be able to train it, because everything is fully differentiable, which makes it really nice.
So then you create this thing called CT, which is just a weighted average of all the HS. So you have an alignment score for each of these positions, and you also have a hidden state for each, so you simply do a weighted average of all of these to create this thing called a context vector. Okay.
Then what you do is you take your new context vector and, once again, the original hidden state used to create this context vector. You concatenate them and you apply a linear transformation, which is, again, a learnable parameter for your model. You send it through a nonlinearity, and then you get out this new hidden state representation that should help you more. As opposed to using HT before, now you're going to use this HT tilde. Okay?
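(Putting those pieces together, here is a minimal sketch of one local-p attention step along those lines, including a Gaussian-style term favoring positions near pt; parameter names, the window size, and the initialization are hypothetical toy values rather than the settings from the paper.)

    import numpy as np

    H, D = 4, 2                               # toy hidden size and window half-width
    rng = np.random.default_rng(0)
    W_a = rng.normal(size=(H, H))             # alignment score parameter
    W_c = rng.normal(size=(H, 2*H))           # output combination parameter
    v_p, W_p = rng.normal(size=H), rng.normal(size=(H, H))
    sigm = lambda x: 1.0 / (1.0 + np.exp(-x))

    def local_attention_step(h_t, src_top):   # src_top: list of top-layer source hidden states
        S = len(src_top)
        p_t = S * sigm(v_p @ np.tanh(W_p @ h_t))            # predicted source position
        lo, hi = max(0, int(p_t) - D), min(S, int(p_t) + D + 1)
        window = np.stack(src_top[lo:hi])                   # only look near p_t
        scores = window @ (W_a @ h_t)                       # score between h_t and each source state
        align = np.exp(scores - scores.max()); align /= align.sum()
        positions = np.arange(lo, hi)
        align = align * np.exp(-((positions - p_t) ** 2) / (2 * (D / 2) ** 2))  # favor positions near p_t
        c_t = align @ window                                # context vector: weighted average
        return np.tanh(W_c @ np.concatenate([c_t, h_t]))    # h_t tilde, used in place of h_t

    src = [rng.normal(size=H) for _ in range(6)]
    print(local_attention_step(rng.normal(size=H), src))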
So that is the attention model from Luong, et al., that we were using from
EMNLP, which currently gets state-of-the-art results on many WMT benchmarks.
Okay.
So now going back to our original non-attention multi-source model, let's see
how we can add like a multi-source attention model into this. So what we do
is let's just now look at the top layer of each of these source -- each of
the source encoders from the previous picture. We can imagine that we're
just going to look at this, this, and this. Okay.
So, again, we have our hidden state here. And what we do now is we predict two different locations to look at, and they could be completely different locations. So we predict a place to look at in each of the source encoders here. So let's say we're predicting to look at these three here, and we're also predicting to look at these three here.
Then what we do is we simply get alignments for each of these in the
different locations like we did before. We create two different context
vectors. And then we simply concatenate each of the context vectors now from
each language because we're getting an additional one. We concatenate it
with the current hidden state. We apply a linear transformation and then
send it through some tanh nonlinearity. Okay.
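(So, as a small sketch with hypothetical names and toy sizes, the only change at the output is that the combination now takes two context vectors, one per source encoder:)

    import numpy as np

    H = 4
    rng = np.random.default_rng(0)
    W_c2 = rng.normal(size=(H, 3*H))   # combines two context vectors plus the decoder hidden state

    def multi_source_attention_output(c_src1, c_src2, h_t):
        # One context vector per source encoder, computed exactly as in the single-source
        # case; concatenate both with the current hidden state, project, squash with tanh.
        return np.tanh(W_c2 @ np.concatenate([c_src1, c_src2, h_t]))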
And then for the models that we use for these experiments, just some stats about them: we train the models for 15 passes over the dataset. We kind of use a very common learning rate halving scheme. We use a hidden state size of 1,000, which is pretty common in the literature, and a minibatch size of 128. We replace all the non-top-50K source and target words with [inaudible] symbols. And we use dropout for all the models. Okay.
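(Collected in one place, and with the learning-rate schedule spelled out as an assumption since only "a very common halving scheme" is mentioned:)

    # Rough training configuration as described above; the exact halving rule is an
    # assumption, and the [inaudible] replacement symbol is written here as "UNK".
    config = {
        "epochs": 15,                  # passes over the dataset
        "hidden_size": 1000,
        "minibatch_size": 128,
        "vocab_size": 50000,           # words outside the top 50K mapped to "UNK"
        "dropout": True,
        "lr_schedule": "halve the learning rate once dev perplexity stops improving",  # assumption
    }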
And then so also a note on parallelization. So these neural network machine translation models can get to be very slow to train. Well, I mean, not very slow, but they can certainly get slow, you know, with a lot of layers, large target vocabulary, et cetera.
So for all the models, the multi-source and non-multi-source models, I parallelize them across -- we have these nodes that have four GPUs, so I split the computation across them in my C++ implementation. And I use a parallelization scheme similar to the one used in Sutskever, et al.
And so with the multi-GPU implementation, I get about a 2 to 2.5 times speedup. I could get more, but there are bottlenecks simply because there are more layers and things I need to compute than I have GPUs, as you'll see in the next picture.
And then, yeah, the speeds here are for training the different models: about 2K target words per second for the multi-source model without attention and about 1.5K target words per second with attention. Okay?
So let's see how this parallelization scheme works. So ideally, in a world where you have more GPUs per node, you could have GPU 1, 2, 3, 4, 5, 6, something like this. But I don't, so I, of course, have to share these, and that leads to a bit of a bottleneck. But when you use the multi-GPU parallelism, you still get almost flawless speedup on this, which makes it really nice.
And then here is the attention layer which I was just explaining before,
which is now just being represented with a single node. And then I also have
the softmax which was being implicitly represented before.
So how does this work? So what you do is, at the first time step, this guy computes its activation here. Then I send it asynchronously up to the second GPU to start computing, and this guy computes its activation here. You can keep going and kind of get this nice cascading effect, which I found really helpful for speeding this model up.
And then, again, for like the backward pass, you have this guy compute its
activation, and then you get the -- you know, the same exact thing. And then
as you can imagine for a model like this, what ends up happening is you get
the same scheme. I'm just now running also these two in parallel for all of
this. So that is how I parallelize these, and I get really large speedups by
doing so.
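(Here is a toy sketch of that cascading, layer-per-GPU pipeline in Python, with each "GPU" modeled as a thread passing activations upward through a queue. The real system is a C++ implementation with asynchronous device-to-device copies, so this only illustrates the scheduling idea, and all names are hypothetical.)

    import threading, queue

    NUM_LAYERS, TIME_STEPS = 4, 6
    queues = [queue.Queue() for _ in range(NUM_LAYERS + 1)]

    def layer_worker(layer_idx):
        for t in range(TIME_STEPS):
            x = queues[layer_idx].get()            # wait for the activation from below
            y = f"layer{layer_idx}(t={t}, {x})"    # stand-in for the LSTM computation
            queues[layer_idx + 1].put(y)           # hand it upward asynchronously; keep going

    threads = [threading.Thread(target=layer_worker, args=(i,)) for i in range(NUM_LAYERS)]
    for th in threads:
        th.start()
    for t in range(TIME_STEPS):
        queues[0].put(f"x{t}")                     # feed the input words
    for th in threads:
        th.join()
    while not queues[-1].empty():
        print(queues[-1].get())                    # top-layer outputs, one per time step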
Okay, so for the dataset for the experiments, I used a trilingual subset of the WMT 2014 dataset. For the languages, we used English, French, and German. And we did two different sets of experiments, one with the target language being English and another with the target language being German. And, yes, we can see we get about 60 million word tokens here. Okay. So then the results for when German is the target language.
>>: Did you train on all 60 million?
>> Barret Zoph: Yeah. I trained on everything. So this is the exact
training data in use. Okay. So then, okay, we look at this when German is
the target. Okay. So then for the source side we have English and French.
So here are the Bleu scores and perplexity on some test set. We see that
English as the source does better.
Then we try -- so these are the non-attention models. Here we're just using
the basic combination method and the child-sum combination method. We see
that we get a 1.1 Bleu gain and then a 1.0 Bleu gain here. We then run this
model with an attention mechanism, just a single-source attention mechanism,
and we get a Bleu score of 17.6. And then we used the same two combination methods but now with the multi-source attention model. And so
here we get a 1.0 Bleu gain and here we get a 0.6 Bleu gain.
Okay. And then we also inspected when English is the target language, so now
we'll have German and French as the two source languages. So now these
languages are much more different from each other. And so here we see that
we get -- our best is going French to English, where we get a Bleu score of
21.0. And we then use the basic and child-sum method and we get a 2.2 Bleu
gain and 1.5 Bleu gain.
Then we also wanted to check, you know, whether just having two different source languages actually matters. So we fed in French and French, and you don't get anything doing that. So it's not just about having more model parameters; actually having these two different source languages to look at does make a difference. Okay.
Then we run the single-source attention model and we get a Bleu of 25.2. And then we use the attention model along with the two different combination methods, and for our best one we get up to a 4.8 Bleu point gain, which was pretty high.
And then -- yeah. So then the conclusion for this is just that we describe a
multi-source neural MT model that gets up to 4.8 Bleu points over a very
strong attention model baseline. And we also see that, you know, having the
two languages be more distant in the source certainly helps and gets a larger
Bleu gain.
And then, yeah, are there any questions?
[applause]
>> Barret Zoph: Oh, yeah.
>>: So there are two sources for like the -- in Bleu you're getting on the [inaudible] sources here, one is Z, like the encoder itself, which is bilinguality and one is the target one, the language one, because you have now double Z amount of the data on the target. So did you consider how to compare --
>>: No, it's the same.
>> Barret Zoph: It's a three-way aligned corpus. Yeah, it's a three-way aligned corpus.
>>: The double source and single target?
>>: Yeah, so it's the exact same sentences.
>> Barret Zoph: Yeah, so we just get like a trilingual dataset. So we have the same number of English, French, and German.
>>: So the same.
>> Barret Zoph: Yeah. Okay. Any --
>>: Did you try any experiments where you -- the trilingual corpus is, say,
X and you have 10X bilingual corpus? Do you have any [inaudible].
>> Barret Zoph: No, I didn't. I mean, that's kind of the stuff I'm actually doing right now. It just takes -- these things take longer to train when you start stacking on a bunch of these things. But, yeah, certainly that's one of the other things to try. That is one of the things I'm doing now.
>>: What are you -- like what are you planning to do [inaudible] exploiting --
>> Barret Zoph: So what I'm exploiting is just using
significantly more languages. So I was actually going to go out to six different source encoders and then see kind of how the different combinations of languages help.
And then another thing is visualizing what they're actually doing.
So I didn't put this in the slides. I probably should have. But once you
train this joint system, you can look at like trilingual, you know,
four-lingual alignments and kind of see what they're looking at. So it leads
to pretty interesting things. That's certainly one of the things I'm
interested in doing.
Another thing I'm interested in is being able to train a joint system that can maybe project two sentences into the same space, or various things like that, which could be useful.
>>: I do think it would be great to look at different language pairs.
Because if you look at English, like it really is kind of if somebody took
German and French and smashed them together, right? Because Anglo-Saxon was
spoken by the people [inaudible]. So like taking French and German and
trying to translate into English is kind of like the ideal case from a
linguistics standpoint of having multi-source. But, I mean, the results are
fantastic, right, like a point in Bleu gain is really cool. So that's great.
But from a parameter standpoint, though, didn't your child-sum method have
like a million new [inaudible] parameters and you've got like 2 million
segment pairs? So each -- you only get a training instance for those
child-sum parameters, like one training instance per sentence pair or
sentence triple, right?
>> Barret Zoph: Yeah, so that's exactly why we think it does worse, right, it just overfits that operation. And also especially because in a normal recurrent neural network the parameters are replicated across time, and here they're just being used for a single training instance.
>>: Yeah. I wonder whether some regularization or some different parameter shape or something might --
>> Barret Zoph: Yeah. Yeah, I mean, that's certainly another thing to try too.
>>: Yeah. What is the impact on the size of the model? What is this -- what is it costing you?
>> Barret Zoph: For speed?
>>: No, in terms of the -- more loosely the trained model sizes that you already [inaudible] is there any path of using a -- by using a single [inaudible] train the model sizes --
>>: 50 percent bigger, 3 instead of 2X.
>> Barret Zoph: Yeah.
>>: [inaudible].
>>: Model sizes are never really a big issue here. It's totally the speed, which is --
>> Barret Zoph: Yeah. You could fit all the stuff like in the memory of the GPU pretty well.
>>: I'm curious about your parallelization capabilities.
>> Barret Zoph: Oh, yeah. Of course.
>>: So I thought one of the issues with parallelizing this kind of
computation across GPUs is that the [inaudible] communication [inaudible]
computation. So did you do anything special?
>> Barret Zoph: No. So okay. So there's a caveat. So you have to have GPUs that are able to do these things called direct memory access, like these DMA transfers. So how these transfers are being done is they're being done asynchronously. So essentially what ends up happening is you have these, you know, streaming multiprocessors on your GPU that are running, and you also have stuff that is taking care of the memory transfer.
So the way this is working is that this is a direct GPU-to-GPU transfer. So these are connected across this direct bus, so these things are just communicating. So the data that I'm sending, which is like this 1,000 by 128 matrix of floats or whatever, is never having to reach the CPU. It's getting sent directly from GPU to GPU.
So you're not having to deal with, you know, okay, so here's memory on the
GPU, transfer to the CPU, then, you know, send it to the other CPU or
whatever and then transfer to the GPU. It's done direct. So it's actually
pretty fast. It's not a bottleneck.
I mean, you'll get really flawless speedups doing things like this. So, for example, let's say I have a one-layer, 20,000-vocabulary model, just a single layer with a 20,000 target vocabulary, something like this. If you then use this on one GPU, maybe you're going to get something like 7K source plus target words per second. If you then split that across two GPUs, you get about, you know, 13K words per second. So you almost get flawless parallelism with it. It works really well, I found.
>>: And so, of course, this wouldn't work like when you went across -- if
you try to go across multiple machines.
>> Barret Zoph: You know, you couldn't -- I think you could probably get it
to work for MPI if you played around with it. I haven't tried it because we
just have these four GPU node things. But I think you could probably get it
to work.
Another thing I also tried, that I think is actually worth the cost, is you can have something where you have a huge softmax -- so maybe you'd want to use a softmax over like a million different things -- and you could split that matrix multiplication across multiple different computers using MPI, and then they all share the normalization constant and send it back.
I actually did play around with something like that, and I got it to work
decently well. I don't have that in here as there are, you know, smarter
methods for dealing with a large vocabulary. But I think you could certainly
do that. I think if you played around with it, you could certainly
parallelize it across different computing clusters.
And you could also probably do some asynchronous model updates, too, which
other places do, too. You could do a combination of both for really fast
training time.
>>: Yeah, I mean, there's the sort of parameter server paradigm for parallelization, so -- how do you compare what you're doing to that?
>> Barret Zoph: I think you could use both. You could certainly use both.
Right? So you have a cluster of a bunch of these different, you know,
computing things, and each of them has a couple GPUs. So then the couple
GPUs you have on your -- like per node, you use to compute the model like
this, and then you share those across all of your clusters. So there's no
reason that they can't be used together certainly.
>>: So this parallelization scheme works well when you have stacked layers
like this. Do you know how much the stacked layers are buying you in terms of model quality, in terms of prediction?
>> Barret Zoph: So -- okay, so I've played around with this a little bit. So I found that I pretty much don't get much more after six layers, with four being good. But you certainly get a huge gain going from one to four. So, for example, you'll get maybe like a 3 to 4 Bleu boost by going from one layer to four layers. So, yeah, four I found is actually pretty good. I've gotten a bit more up to six, but after that you start to get diminishing returns.
And then what's also nice about these is that like if you have an attention
layer here, this takes about the same amount of time as these. So you can
just treat it as another layer. Or if you want to do like something with
characters down here, other things like that. So you get really nice
parallelism doing this, and it trains the models much quicker.
>>: [inaudible] bilingual, given French, you know, [inaudible] something
else for English. So now what is happening at runtime? What kind of a
sentence am I going to get? Am I going to get French?
>>: You're getting both.
>>: I'm going to get -- is that a requirement or --
>>: Yeah.
>> Barret Zoph: Mm-hmm.
>>: Right? I mean --
>>: [inaudible].
>>: That's assuming that you have a French and a German sentence.
>>: Yeah, yeah, I know.
>>: Same thing at runtime.
>>: I understand, so --
>>: Well, it's a pretty special case.
>> Barret Zoph: It's --
>>: So what is at runtime? Like is it [inaudible]. But -- okay. But let me -- let me note this -- [inaudible].
>> Barret Zoph: So -- certainly. I mean, certainly, there's been stuff showing that, for example, if you train a model on trilingual data like this but then at test time you're going into English and you trained with French and German but you only have a French sentence, you could then maybe use another MT system to translate that French into German, feed in both, and see what you get.
And that's another thing, right? So I think that there's no reason that this method can't be used in conjunction with other methods where, you know, you're training with a bunch of bilingual data. Because I think that if you have trilingual data it helps to jointly train it, but then you can certainly use it in conjunction with other methods. Yeah.
>>: I think there's been work to sort of learn parameters across -- you know, to do multilingual translation but share parameters across multiple languages, right? I don't think they did it in speech, but I don't know. Has anyone done it in MT yet?
>>: Well, that's a different project. But that's [inaudible] working on.
So like if you wanted to train like a low [inaudible].
>> Barret Zoph: Yeah. Yeah. So you could certainly do that. I've actually
done -- I'm actually doing that now. So getting these things to work well
with low resource languages, so this is what I do. So let's say, you know,
you want to do Uzbek to English but you barely have any data, right? So what
I'll do is I'll train a huge French English model, okay, like on maybe, you
know, the WMT corpus.
Then what I'll do is I'll initialize the Uzbek-to-English model with this exact French-to-English model, but I'll only allow it to tune the English parameters within some L2 squared distance of their initial values. So I won't allow those parameters to change much, and then I'll allow the source encoder parameters to change freely. And that alone just works very well. Like it gets like a couple points --
>>: When you say you initialize the Uzbek model with the French model, I
mean, there has to be a correspondence between the vocabulary.
>> Barret Zoph: No, yeah, I mean, you could probably do -- you could certainly do something smarter, right, but I don't even do that and you get very good results.
>>: So it just -- essentially the Uzbek words get assigned to some random
French word.
>> Barret Zoph: But I allow those parameters to change --
>>: But they can change.
>> Barret Zoph: Yeah. Exactly. And then I -- yeah. So I tried a bunch of different experiments fixing different parameters at different stages, and I found that if you just allow the whole source side to be trained and you constrain the target side to not move outside some delta, it works really well. Like we're getting very close to beating our SPMT baseline -- our best [inaudible] system on the [inaudible] language pairs.
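(A minimal sketch of that idea, assuming a hypothetical parameter naming scheme and penalty weight: warm-start from the parent model, update the source side normally, and pull the target-side parameters back toward their parent values with an L2 penalty.)

    import numpy as np

    def sgd_step(params, grads, parent_params, lr=0.1, l2_to_parent=1.0):
        # Target-side (e.g. English) parameters are penalized for drifting from the
        # parent model; source-side parameters get an ordinary update.
        new = {}
        for name, value in params.items():
            g = grads[name]
            if name.startswith("target_"):
                g = g + l2_to_parent * (value - parent_params[name])
            new[name] = value - lr * g
        return new

    rng = np.random.default_rng(0)
    parent = {"source_embed": rng.normal(size=3), "target_embed": rng.normal(size=3)}
    child = {k: v.copy() for k, v in parent.items()}   # warm start from the French-English parent
    grads = {k: rng.normal(size=3) for k in child}
    print(sgd_step(child, grads, parent))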
And also another thing that I found helpful, that, again, I didn't mention, is using character-level stuff for when you have a small amount of data, right, because a big issue with these models is that they overfit the data like crazy and a majority of the parameters are coming from the word embeddings. If you can get rid of these word embeddings and move to character embeddings, it's very beneficial. That's, again, another piece of work. So if you do kind of a convolution operation over the characters, with different filter sizes looking for, you know, like n-gram matches, it helps. So that's something else I've done.
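(For example, a rough sketch of that character-level convolution, with toy sizes and hypothetical filter widths: embed the characters, slide filters of a few widths over them like n-gram detectors, and max-pool to get a fixed-size word vector in place of a word embedding.)

    import numpy as np

    rng = np.random.default_rng(0)
    NUM_CHARS, CHAR_DIM = 30, 8
    char_embed = rng.normal(size=(NUM_CHARS, CHAR_DIM))
    filters = {w: rng.normal(size=(4, w * CHAR_DIM)) for w in (2, 3, 4)}  # 4 filters per width

    def char_word_embedding(char_ids):
        chars = char_embed[np.array(char_ids)]                 # (word_len, CHAR_DIM)
        feats = []
        for width, F in filters.items():
            windows = [chars[i:i + width].ravel()              # slide an n-gram-like window
                       for i in range(len(char_ids) - width + 1)]
            feats.append(np.max(np.stack(windows) @ F.T, axis=0))   # max-pool over positions
        return np.concatenate(feats)                           # word vector, no word embedding table

    print(char_word_embedding([3, 7, 1, 12, 5]).shape)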
>>: [inaudible] like modeling like [inaudible] embeddings together or even
the -- part of the recurrent layer?
>> Barret Zoph: Yeah, I certainly do think that's interesting stuff. I
haven't tried that. I just tried doing simple stuff and it just worked
really well. So I was like -- yeah, I haven't -- I haven't gotten to that
yet. But, no, that's certainly -- there's certainly much smarter approaches.
But I found that to work really well.
>>: Would you consider multilingual embeddings [inaudible]?
>> Barret Zoph: Yeah. Yeah, I mean, that's certainly an interesting thing to try.
>>: Thank you.
>> Barret Zoph: Thanks.
[applause]