
>>Chris Quirk: Welcome, everybody. It’s my pleasure to welcome Chris Dyer to come give a talk
today. So I first knew Chris when he was a graduate student at University of Maryland, where he
became sort of the heart and the soul of the machine translation team, as Philip Resnik
put it. He’s done a variety of interesting work there, and since then he’s gone on to do a post-doc
at CMU, that’s where he is right now, and so today he’ll be talking about some of his recent
research in machine translation.
Thanks so much.
>>Chris Dyer: Thanks, Chris. It’s really nice to be back in Seattle too. Actually before I went to
grad school, I worked at Amazon.com and I haven't been out here as often as I would have liked
since then, so you guys got the weather perfect.
Okay. So we’re going to talk about machine translation today, and so consider the problem of
translating this German sentence with the following cartoon semantics. Everybody’s interested in
semantics, mine involves cartoons, and the framework that I’m going to propose is one where we
generate a bunch of possible translations for the sentence. In fact, we might even generate all
possible English sentences, and then we're going to look at which ones violate certain well-formedness constraints.
So there are going to be basically three kinds of errors that we might be interested in identifying.
So the first are lexical errors, where we’ve incorrectly translated the lexical semantics of the
source language into the target language; we've mistranslated hund as cat.
Here, a second class of errors is what I'm referring to generally as configurational errors, and this is basically that we've screwed up the word order of the translation. So here all the words are right,
but still the semantics is wrong.
And then finally, we want to have well-formed hypotheses in the target language, so they should
be fluent. So this is a standard language-modeling problem, and so if we exclude all of these in
which we find errors, we’re left with a good, adequate translation of the source sentence.
And so, in general, I argue that there are three classes of constraints that we need to ensure
adherence to for good translation. So, to recap, lexical, so this is the problem of getting the
riverbank meaning of bank versus the financial institution meaning of bank, right?
Configurational, which is basically: is the semantic or syntactic relation preserved in the translation into the target language? And then, finally, is the output fluent, and this is the standard language
modeling problem.
So as I talk about various models of various problems in translation, I'm going to make
reference to these kinds of classes. This is just my claim about what you need to have a good
translation.
So the outline is I’m going to talk first about a framework that lets us encode constraints like this,
and reason about them, and then talk about two specific case studies where I instantiate this
model.
First, the problem of word alignment, where we have to, given a parallel text, find how things
correspond to each other, which is an important sub-problem in modern machine translation. And
then a complete end-to-end translation system where we actually produce translation outputs
given some foreign language input.
Then I’ll conclude with some discussion.
So I am working in a discriminative framework, and I’m referring to this as generate and test,
purely just for sort of pedantic or pedagogical reasons, which is that there are two parts in a
discriminative model.
There is a generation component which determines the set of candidates that are searched, so
what we’re writing here is we’ve got some input x, we generate a space of possible labels for that
x, y, and this may be finite, it may be huge, it may be compact, it may be just a list, whatever, at
the high level it doesn’t matter. And then we have a test component, and what that’s going to do
is for each y produced by our generation component we’re going to evaluate the candidate for
goodness, and so this is generate-and-test, or discriminative, modeling.
So the particular framework that I’m going to work in is going to assume that we’re going to have
a test function that compares source sentences, which I'll write as S, and translation
candidates T, which are going to be translations of that sentence, and we’re going to
parameterize this using a linear model with two components.
So the first is a feature vector function, h, and this is a vector of real-valued feature functions that I assume, at least for the start of this talk, to be engineered to encode domain knowledge about what is likely to distinguish good versus bad translations. So these are going to be features that encode our knowledge of lexical semantics, or cross-lingual lexical semantics.
You can think about features as hypotheses about what makes a translation good or bad. The
second part is a weight vector, w, and this is going to be learned automatically from data, so from example pairs of sources and targets, and each component is roughly going to indicate how predictive the corresponding feature is for the data at hand.
So the first part is going to be something we write as engineers based on our knowledge of a
problem, and the second part is going to be learned by fitting this function to the corpus of
example translations.
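[In symbols, a sketch of the linear test function just described; the notation here is assumed, not taken from the slides: an engineered feature vector h and a learned weight vector w.]

```latex
\mathrm{score}(S, T) \;=\; \mathbf{w}^{\top} \mathbf{h}(S, T) \;=\; \sum_{k} w_k \, h_k(S, T)
```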
So what do, for example, lexical features look like? So in this input-output pair, we’ve got man
biting a dog incorrectly translated as man biting a cat, and so how would we engineer a feature to
detect a violation like this?
So one possibility might be to say, well, if there’s some English word, or some word in the source
sentence that’s hund, and some word in the target sentence that’s cat, let’s fire this feature with
value one, otherwise it’s going to have a value of zero.
And indeed, this will have a value of one in this sentence and in the correct translation, man bites
dog, it will have a value of zero, so, if we learn a negative weight for it, we can say, well, this is a
bad translation.
Yeah?
>>: What is the distribution you use to train the problem?
>>Chris Dyer: I’m not saying anything about distributions right now. I’ll talk about the specific
learning criteria. I’ll get to that in a second.
So the problem, though, is a feature like this is inadequate because we might imagine a sentence
about dogs chasing cats, and in that case hund would correctly, in a correct translation, co-occur
with the word cat, in fact it’s probably likely to. So the way we might deal with this is by
introducing latent variables, and this is going to be some additional structure, in addition to the output variable and the input, that basically we can also engineer to have any kind of form that we want.
Here, I’m just drawing links between words in the input and words in the output. They could be
trees, they could be semantic structures, they could be anything, anything we want up to sort of
computational considerations, and we’re going to update our test function to now evaluate not
only the input and output, but also the alignment.
And if we have these alignment links from source words to target words, we can now reformulate
this feature that basically sums over all of these alignment links and says, if the source end of one
of these alignment links is the word hund, and the target end of one of these alignment links is the
word cat, then increment the value of the feature.
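[A minimal sketch of the two features being contrasted here, with illustrative names and a toy example; this is not the actual feature code from the system described in the talk.]

```python
# Sketch: the naive sentence-level co-occurrence indicator versus the
# alignment-aware count feature (names and data are illustrative only).

def hund_cat_cooccur(source, target):
    """Fires once if 'hund' occurs anywhere in the source and 'cat' in the target."""
    return 1.0 if "hund" in source and "cat" in target else 0.0

def hund_cat_aligned(source, target, alignment):
    """Counts alignment links whose source end is 'hund' and target end is 'cat'.

    `alignment` is a set of (i, j) pairs linking source[i] to target[j].
    """
    return float(sum(1 for i, j in alignment
                     if source[i] == "hund" and target[j] == "cat"))

# The mistranslation "man bites cat" fires both features.
src = ["mann", "beisst", "hund"]
tgt = ["man", "bites", "cat"]
links = {(0, 0), (1, 1), (2, 2)}
print(hund_cat_cooccur(src, tgt))        # 1.0
print(hund_cat_aligned(src, tgt, links))  # 1.0
```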
So every time hund aligns to the word cat it will go up, and it will be zero otherwise. And so, in this case, we can fire for this violation, but we can also conceivably have sentences where hund and cat happily coexist. So, to answer your question about the models that I'm using
here, I’m going to be working in a probabilistic framework, probability provides an elegant account
of latent variables. We can exploit general algorithms for inference and learning, things like expectation maximization, very easy stuff, and also in machine translation there are a lot of
generative models with a similar structure that are also probabilistic.
And so this means we can sort of compare directly to these generative baselines. So, if you have
more questions I’ll be happy to answer. So the first application of this is the problem of word
alignment. And so what this says is that very often we have large corpora of sentences that we
know to be translations of one another, and what we would like to do is determine which word in
each sentence means the same thing as which word in the other sentence.
And so basically we want to infer links between the two like this. This is a problem where we’re
going to be focusing just on lexical and configurational constraints, so since we are observing
both the source and the target, we are not going to worry about modeling the grammaticality of
the target, so we’re just going to figure out what’s a convincing explanation of the lexical
relationships in the two languages and their configuration.
So in particular I work on unsupervised alignment where we don’t assume that we have any
examples between the source and target languages given at training time, we just have big
corpora of source and target sentences. And so the way this generally works is that we’re going
to model translations, so we're going to model the sentence pairs, which are observed in our training corpus, and we're going to use a latent alignment variable, which is, of course,
unobserved, but we’re still going to be doing this to maximize the probability or some penalized
version of the probability of the observed training data.
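[A sketch of the training objective being described, in assumed notation: maximize the (possibly penalized) marginal likelihood of the observed sentence pairs, summing out the latent alignment.]

```latex
% s = source sentence, t = target sentence, a = latent alignment,
% R(w) = an optional penalty term (an L1 prior is mentioned later in the talk).
\max_{\mathbf{w}} \; \sum_{(s,\,t)} \log \sum_{a} p_{\mathbf{w}}(t, a \mid s) \;-\; R(\mathbf{w})
```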
In terms of work in machine translation, this has been there since the beginning. However, my work is different in that, rather than a generative model, I'm using a discriminative model to tackle this problem.
So I'm building on the framework proposed all the way at the beginning by Brown et al., which
assumes a finite state generation function and it basically, so in this case we’re going to condition
on the bottom string and generate the upper string. And the way this works is we go one at a
time and we pick a particular word in the source and then we generate each word conditionally independently of the others given the alignment. So we just proceed from left to right through
things.
Now, Brown et al. make a particular assumption, which is that the translation distributions for
source words are independent of one another, and this is a common assumption we make in
generative modeling. Well, at least until these [inaudible] methods have gotten more popular
recently.
It's quite bad, I argue, for word alignment. So in particular, what the model sees, rather than strings of words that have any sort of meaning, is just opaque integers that don't have any content. So in particular we're missing things like, well, look at these two sentences. If we know
nothing else but we observe that these two have identical surface forms, it’s probably a
reasonable hypothesis that they should be aligned, and if it turns out that the data doesn’t later
bear that out, if we see many then maybe we can disregard that hypothesis, but, it seems like
something we should be sensitive to.
We might also have matching prefixes, matching suffixes, because of the common ancestry that
German and English share, we actually have a lot of orthographic similarity for a number of
common words. So in this case lange and long have a common root, and sort of appear similar,
and then, finally, we also have lexical resources; we have bilingual dictionaries that have been put together for other reasons for many years, and we might just look up in one of these and see abschied can be translated as goodbye.
Now, generative models that work just on co-occurrence statistics are quite effective. I'm not saying
we disregard those or throw them out, I’m just saying we should have access to this information if
it helps us improve our models.
That’s what I’m advocating in this part of the talk. So what are the features I’m using? Well,
basically I’m starting with the word-word indicator feature. So that’s basically what the generative
model is. That’s the only thing it’s looking at. So I still have that in there, and then I also have
various co-occurrence scores that basically say, for any pair of words in the source and the target, how likely are they to co-occur, or how strong is their co-occurrence according to a variety of different measures.
A little more linguistically interestingly, I also include things that indicate links between, for example, nouns in the source, or parts of speech in the source, to parts of speech in the
target language. So we might, for example, hope to learn that nouns often translate as nouns, or
nouns don’t translate as modal verbs. Actually I should say here that I’m not using supervised
part-of-speech tags. Everything I’m doing here is completely unsupervised, but you can include
supervised part-of-speech tags as well.
Of course, there are identical word features, identical prefix features, and an orthographic similarity feature that hopefully lets you learn that words with similar surface forms are likely to be good alignments with one another.
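[An illustrative sketch of the kinds of lexical features just listed; the feature names, the prefix length, and the similarity measure are assumptions, not the system's actual templates.]

```python
# Sketch of per-word-pair lexical features: word-word indicator, identical
# surface form, matching prefix, orthographic similarity, and POS-pair links.

def lexical_features(src_word, tgt_word, src_pos=None, tgt_pos=None):
    feats = {}
    # Word-word indicator: roughly what the generative baseline models.
    feats[f"pair={src_word}|{tgt_word}"] = 1.0
    # Identical surface form across languages.
    if src_word == tgt_word:
        feats["identical_word"] = 1.0
    # Matching prefix, e.g. German 'lange' / English 'long' share a root.
    if src_word[:4] == tgt_word[:4]:
        feats["identical_prefix4"] = 1.0
    # Coarse orthographic similarity (Dice coefficient over character bigrams).
    bigrams = lambda w: {w[i:i + 2] for i in range(len(w) - 1)}
    b_s, b_t = bigrams(src_word), bigrams(tgt_word)
    if b_s and b_t:
        feats["orth_sim"] = 2 * len(b_s & b_t) / (len(b_s) + len(b_t))
    # Part-of-speech pair, e.g. hoping to learn that nouns align to nouns.
    if src_pos and tgt_pos:
        feats[f"pos_pair={src_pos}|{tgt_pos}"] = 1.0
    return feats

print(lexical_features("lange", "long", "ADJ", "ADJ"))
```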
So the second class of features that’s useful in modeling is what I’m calling configurational
features. So let's consider the problem of translating or aligning English to Urdu.
So these are both Indo-European languages, but they're typologically quite different. So the canonical word order in an English sentence is subject, auxiliary, verb, object, and then an end-of-sentence period, whereas in Urdu it's subject, object, verb, auxiliary, period. So if we instantiate
this with an example, we’ll see basically to get an alignment right we’re often going to need to see
a very large amount of reordering. We would expect to see that in our model, but in many parameterizations of alignment models it's hard to predict that things should be out of order. We're typically only capable of modeling that things should be in order, or maybe of not caring.
We’d like a model that could actually model that we expect to have them out of order. So how
are we going to do this?
So one way of viewing the alignment problem [inaudible] we can write the source sentence up
along top and the target sentence down along the side, and these represent alignment links
between the two languages. And what we’re going to do is we’re going to look at bigrams of the
words taken as we proceed through the source sentence as it’s being translated to the target
sentence.
And so, for example, if we look at the first two words in the target sentence, these correspond to
this [inaudible] and we can extract a feature that just is the bigram [inaudible] and so every time
that particular, these two words are translated in this order we will fire that bigram feature, and
then we just proceed through the sentence as follows extracting word pairs at each step.
And so we end up with things like will and then period as a common bigram, or hopefully a
common bigram. We would expect it to be common for English-Urdu translation, and of course
the fact that English translates into Urdu in this order is not a fact of the particular lexical items
that are used, it’s a fact of the syntax of the languages, so in addition we can have bigram
features over the parts of speech, and so hopefully this will capture some generalizations.
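[A minimal sketch of the path-bigram feature extraction just described; the template names, the toy alignment, and the POS tags are illustrative assumptions, not the actual implementation.]

```python
# Walk the target sentence left to right, record which source word each target
# word translates, and fire bigram features over consecutive source words (and
# their POS tags) in that translation order.
from collections import Counter

def path_bigram_features(src_words, src_tags, tgt_len, alignment):
    """`alignment[j]` is the source index translated to produce target word j."""
    feats = Counter()
    path = [alignment[j] for j in range(tgt_len) if j in alignment]
    for prev, curr in zip(path, path[1:]):
        feats[f"word_bigram={src_words[prev]}>{src_words[curr]}"] += 1.0
        feats[f"pos_bigram={src_tags[prev]}>{src_tags[curr]}"] += 1.0
    return feats

# Toy English-to-Urdu-style example: the auxiliary 'will' ends up being
# translated right before the sentence-final period, so 'will > .' fires.
src = ["he", "will", "visit", "dubai", "."]
tags = ["PRP", "MD", "VB", "NNP", "."]
align = {0: 0, 1: 3, 2: 2, 3: 1, 4: 4}   # target position -> source position
print(path_bigram_features(src, tags, 5, align))
```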
So -- oh, yeah?
>>: [inaudible]
>>Chris Dyer: No, these are just what’s the path taken, so in practice I actually have a number of
other configurational features, so the model can also learn, for example, that things in the
beginning of the sentence tend to align to things at the beginning of the sentence or not, and
those things conjoined with particular parts of speech, so there are many, many overlapping
features.
Yeah?
>>: So, this feature when it says on the source side [inaudible], you don’t care where these two
words are?
>>Chris Dyer: No, they could be anywhere on the source side. And in practice I actually do have
features that also conjoin this with the distance between them. Well, I don’t have the lexicalized
ones conjoined with the distance but I have these conjoined with the distance, so --
>>: [inaudible]
>>Chris Dyer: Yes. So it means that what the left and right side means here is that the order that
they’re being translated to produce this output is [inaudible] first and then dubais. So there is an
ordering there.
>>: But it means that the translations are adjacent and sequential in the [inaudible].
>>: Right, but on the source there's any number of gapping between --
>>Chris Dyer: Oh! And it could have been dubais sujut [phonetic], so it's not, this order isn't the
way, for example, we see dubais visit, so then we go backward in the sentence, dubais visit.
Okay, so I’ll say a little bit about learning in this model. So, as I said, I’m trying to maximize the
probability of just the translations, but I don’t know what the alignments are. So what I’m going to
do is I’m going to sum over all of the possible alignments, and the way I do this is I, so say we
have a given source sentence, [inaudible], I will use this finite state generation procedure to
generate a lattice of all possible translations and their alignments, which can be coded in a lattice
like this.
The second step is to intersect this lattice with the target language strings, so small house. So
what remains here is all of the different alignments that cogenerate this target sentence and only
this target sentence, where as this first lattice contains all possible translations and all possible
alignments.
>>: [inaudible]
>>Chris Dyer: So I’m considering you can translate either the first word or the second, right. So
it’s both all alignments and all sorts, so these are very, very large lattices the way I’m defining
them here.
>>: [inaudible]
>>Chris Dyer: No, it’s going to be very big. It’s still polynomial but there are a large number of
lattices, so the reason it is not exponential is because you can translate house twice. I have no
constraint on translating words exactly one time, so it’s not quite as bad as the traveling salesman
problem where you can only visit each city once.
As you point out, yes, it's very, very large and very slow, so the solution I take is very simple; it's just a coarse-to-fine approximation where, in that first lattice, rather than generating all possible
translations under the model, I have a simpler model, and if a translation is very low probability
under that simpler model I just never construct that edge.
And the simpler model that I use is just IBM model 1 and I find that if I use that I can prune down
to, or the inference time drops down to about one second per sentence, and the oracle error rate
that I get is less than 4% doing this, and there are probably ways to make this quite good. And I
should say you can’t actually get a perfect error rate because of the assumptions of the model, so
it’s not that I would be at zero without this assumption, I should have the gap there.
But, anyway, this is the approach that I take. It gets down to about one second per sentence pair,
which is reasonable, it’s comparable to, for example, the decoding time with something like the
IBM model 4, which is a widely used generative model.
So alignment is an interesting problem to evaluate. So it’s used, the reason we study it is it’s an
interesting problem; it lets you simplify some of the modeling assumptions that we worry about in
translation. Rather than having to predict translations, we just have to explain translations, so it’s
nice as sort of a test bed but it’s also useful.
So typically we use these word alignments to constrain grammar learning procedures where we
learn to translate, sorry, where we build grammars that are used in more sophisticated models
downstream. So if you have bad alignments in that first step, then you’re going to learn less good
alignments in your downstream phrase-based translation models.
So in this evaluation component, we’re going to look at two things. We’re going to look
intrinsically at the quality of the alignments and see are the alignments, do they match up with our
intuitions about what should be aligned, and then we’re going to use these alignments to replace
alignments produced by generative models and see how that compares in standard state of the
art translation models that build on word alignments.
So our baseline is going to be IBM model 4, which continues to be a state of the art generative
model. It has a very similar structure, although it is parameterized quite differently. And we're
going to look at a bunch of different language pairs. So the first thing we’re going to do is look at
a comparison of what the model produces.
This is, of course, a cherry-picked example but it does illustrate some difference in behavior
between the two that are important and interesting to note. So the first thing to see here is that,
so on the left is the IBM model 4 alignment, and you’ll notice that the word dislike here has been
aligned to sort of half of this Urdu sentence, and this is a well-known problem in word alignment.
One of your departed colleagues, in fact, wrote a paper about this a number of years ago. And
what happens is, baseline is a conditional generative model that’s trained to maximize likelihood,
and you can think about this as basically every source word, so we’re translating from English
into Urdu now, every source word has one unit of probability mass to give away. And when
you’re maximizing the probability of the target string, one way to gain that metric is to take rare
words, like dislike, and just use them to explain a whole bunch of the sentence.
Even though dislike isn’t a very good translation of the rest of the sentence, it’s a good way to
improve the likelihood of the training data. We see this happening in generative models. And on
the right my alignment doesn’t have this problem.
>>: Are you going to say a little bit more about your parameter estimation techniques?
>>Chris Dyer: Not really, no.
>>: This underlying IBM model has normalized emission and transition distributions, so --
>>Chris Dyer: Right. So --
>>: Assuming yours does not. You just have a series of features that fire and you get a
distribution over an alignment and source sentences.
>>Chris Dyer: Right. So everything is a globally normalized model, there are just features. I do
have an L1 prior on the parameters, so it encourages some sparsity. For some of the larger models, there's a potential feature space of upwards of 80,000,000 features, and then the active number of features usually ended up somewhere around 10 to 20 thousand.
So there are some dense, highly informative features in the model that we use that explain a lot
of the data. So there’s no way for this model to get the same abuse, because there is no
probability of dislike translating into any other word. There are just features that have to come
together. So it’s a global score rather than consisting of a bunch of local products that -- right.
>>: So we avoid some of these label biases.
>>Chris Dyer: So the second thing to notice is that we manage to, in our model, have this big
jump from the end of the Urdu sentence up to the middle of the English. So this is exactly what we expect when aligning an SOV language, like Urdu, to an SVO language like English. And this is actually
something that the baseline model is very poor at modeling. It basically says, well, model the
probability of jumping a certain distance.
And if you look at enough sentences, you basically always conclude that it’s a very bad idea to
jump very far away in alignment. That’s the best decision to make; there are probably not enough
parameters to model this effectively. Whereas our model, I’ll show some evidence in a second,
that we are managing to get models that say this kind of jump is actually expected. So that’s
good to see even though this is just a single example.
So to recall, let's think a little bit more about these features. So I introduced these path bigram
features that says what is the order through the English sentence that we’re taking when we’re
translating into Urdu. And so we can look at the features and see what’s been learned. And if we
look at the top 10 most highly weighted bigram features, we see that we’re seeing frequently that
a verb followed by an end-of-sentence punctuation character is highly weighted.
So these are just the top-ten most highly weighted features. And these are not highly weighted
English bigrams. If we look at the English corpus we’re not going to see "will" followed by an end
of sentence period very often. So these features are being learned not because they’re present
in the English, but because they are good explanations of the order of the Urdu output.
So this is some good evidence that the model is doing something sensible and not reproducing
what's in the input. So the second intrinsic evaluation we can do is to actually ask humans, given a set of translations, to draw lines between the words that are translations of each other.
And this is a surprisingly difficult problem even for native speakers, because it turns out that words don't typically translate exactly between languages. There's a lot of fuzziness around the edges. So typically you can get about 80% that's really easy to do, and then for the last 20% you have to write a 300-page style guide to get agreement on.
But we can still do this. For one of the language pairs I was looking at, Czech-to-English, I did have some data, about 500 sentences, and then we can evaluate the performance of our model by looking at the precision and recall of its predictions versus the gold standard alignment points.
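[In assumed notation, with A the predicted links and G the gold links (ignoring the sure/possible distinction some alignment evaluations make), the figures quoted next are one minus the F-measure:]

```latex
P = \frac{|A \cap G|}{|A|}, \qquad
R = \frac{|A \cap G|}{|G|}, \qquad
F = \frac{2PR}{P + R}, \qquad
\text{error rate} = 1 - F
```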
So if we look at our baseline, the error rate, so this is one minus the F-measure, the baseline is 23.4% and we drop to 20.5%. So it's a reasonable improvement. The second number that I'm
reporting here gets at the issue of garbage collection a little bit. So the garbage collection
phenomenon occurs with rare words in the source language. So words that occur one time are
likely to be victims of this garbage collection problem.
So what I did here was just compute the average number of words that each singleton type aligned
to. So in the generative baseline we see that it’s 2.7 words, and in the discriminative model it
drops to 1.6. So I’m not saying that the alignments are better necessarily, just that there are
fewer of them to these rare words, which is probably a good sign.
Yeah?
>>: [inaudible]
>>Chris Dyer: So garbage collection is this phenomenon here. So dislike turns out in this corpus
of English to be very rare, it only occurs once, and we see that it’s been aligned in the baseline
model to a lot of words. And this is a particular well-known problem with these generative
models.
>>: [inaudible]
>>Chris Dyer: No, this is the model prediction. This is the output. This pair of sentences was in
the training [inaudible].
>>: So [inaudible]
>>Chris Dyer: So there are various kinds of priors and regularizers you can put on these
generative models, which will also help address this issue. I’m just using the standard
formulation for them.
So at any rate, this just shows that this is weak evidence that the model is doing something a little
more sensible. The gold standard in alignment evaluation though is looking at the performance of
the alignments when they are used to construct a state of the art translation system.
What I’ve done here was compared three conditions. One, the baseline alignments, one the
discriminative alignments, the third condition where we combine the alignments from both the
generative baseline and our discriminative model and use both to extract translation rules in a
hierarchical phrase-based translation model that’s currently widely used.
And we evaluate them using three different standard MT evaluation metrics; the arrow indicates whether up is better or down is better. And we see that over the generative baseline we see relatively small but reliable improvements. And then, interestingly, when we combine the alignments from the generative and discriminative models, we see fairly substantial improvements. This is on a Czech-English translation set with one reference, and one BLEU point there, for example, is quite good.
>>: So for the baseline [inaudible]
>>Chris Dyer: Oh, yeah. So I didn't say this at the beginning, but yeah. So in the methodology what we're
learning is these are directional models. There is a source language and there is a target
language in the model. We condition on the source and predict the target. So we can train
models. We train one in the forward direction, one in the reverse direction, and then we
symmetrize the alignments produced by the two models. This is a standard; we just used a fairly
standard technique to do this. There are more interesting approaches that have been proposed
recently.
Model 4 is the same. It’s also a conditional model like this. We are doing the same kind of
symmetrization.
>>: [inaudible] is also a directional and --
>>Chris Dyer: Yes. So it's got the exact same kind of latent output space. So it predicts word-to-word, or single word on the source side aligned to potentially multiple words on the target side.
>>: I understand why IBM model 4 is directional, but [inaudible], why does it need to be
directional?
>>Chris Dyer: So in a discriminative model you can almost think of drawing an arrow from your
conditioning onto your predicting. And it’s normalized with regard to that. So often in
discriminative models you -- I mean, I could’ve jointly modeled it. It would have been -- the
inference problem was hard enough conditioning on the source sentence then trying to predict
both sides and sum over all possible English and French sentences, for example. That would’ve
been fun but a little terrifying.
So just to recap here, I’m just showing one slide of numbers on these translation results. We’ve
seen the exact same pattern of results in a number of diverse languages with different data sizes
and different language typologies and we’ve seen exactly these kinds of results. We’ve done
some fairly elaborate significance testing of these various outcomes. I think these results we can
say at this point are reliable.
>>: [inaudible]
>>Chris Dyer: So actually we put the two -- we just made the corpus twice as big and just used one set of
alignments for the first half and the other set of alignments for the second half.
>>: [inaudible]
>>Chris Dyer: So this is a trick I found is good, but one interesting thing is when you combine
multiple generative models. So if you combine model 4 [inaudible] for example, you don’t see
these improvements. You need some kind of diversity. This is an open problem really, what is it?
These rule [inaudible], I don’t know how productive it would be to work on this problem of saying
what is the right kind of alignment set to extract rules from.
But there is sort of a sense that having a diverse set of word alignments is good. So it seems like
what we’re getting are different kinds of word alignments. And indeed when we look at the kinds
of improvements that we see here, we’re actually doing better at translating fairly rare entities.
So the generative baseline, it appears, is systematically not having translations for a lot of rare
words because every time a rare word occurs in the source language, it tends to garbage collect.
Because of the standard rule extraction procedures used, it basically blocks extraction from that
sentence. So it can’t learn anything when you have that garbage collection.
And so what we're seeing here -- when I did some manual error analysis of these sentences -- is that fairly rare words that were out of vocabulary previously are now getting translated. So not only are these slight improvements, these are good improvements. We're translating high-value words -- I mean, all words are high-value, but most of these evaluations penalize getting an article incorrect and getting a content word incorrect to the same degree.
There is still of course an intuitive sense that you want to have the content words at least correct
in your translation.
>>: So there was a step from just a simple multinomial distribution to a feature distribution,
another step from a set of feature distributions to a globally normalized distribution over
[inaudible] alignments, right?
>>Chris Dyer: So I didn’t go through that middle step. I just went straight from -- my sort of
straw-man foil is this model 4, which is a bunch of multinomial distributions. You can reparameterize those --
>>: An MEMM instead of a CRF.
>>Chris Dyer: As an MEMM, or even just an HMM with a log-linear parameterization of each of the multinomials. So the Berkeley guys, Taylor, did this. I do have --
>>: [inaudible] intuition about what gain comes from which?
>>Chris Dyer: No, and this is something I'd like to explore a little bit. I include features that can't be formulated in the directed models because they would induce cycles, so the global normalization at least gives me freedom. One of the strengths of this model is it lets you formulate features based on sort of your intuitions. You don't have to change the structure of
the models.
So, if you look at what IBM had to do to go from model two to model three when they introduced fertility -- although I think fertility actually came first; they developed models one and two later and pretended they came up with model two first -- they had to completely change the structure of the models just to get a new kind of feature.
Here, while features may still change how difficult your inference is, the undirected, globally normalized nature of the model means that you can potentially include any features you want. So the idea here is to decouple your creativity from the structure of the model. Whether that's good or not, I don't know.
>>: I understand the [inaudible] between model 4 and your model. What about Excel, what
about previous [inaudible] --
>>Chris Dyer: Yeah, so the big difference between my work and the previous work is they
assume that there are alignment points given in the training data. So this is, this makes no
assumption that such exists, and that’s the main difference.
>>: [inaudible]
>>Chris Dyer: Exactly. So in standard discriminative frameworks, they’re discriminating the
alignment grid, I’m discriminating the target words and happen to have a latent variable. Okay.
All right. So just a little bit of ongoing work in this area. So I’ve been relying completely on lexical
features that -- I know about language. I actually have a Ph.D in linguistics, so I like to write down
my own features, but I'd like to know: can we learn representations of words that work well, for example? So one very promising approach, I think, is to make use of vector space lexical semantics. And the basic idea is that each word is embedded as a point in some high-dimensional, or actually some relatively low-dimensional, vector space. And points that are close to each
other in this space have similar meanings.
And so then basically what we’ve got is the problem of can we translate the vector representation
of the word hund into the vector representation of the word dog. And how might this look? The
answer is yes. There is at least a way to set the problem up that works quite nicely.
The basic idea is that we assume some embedding of the source and target words that comes
from somewhere. This has been an active area of research for many years, we’ll just reuse
those, and then we’re going to formulate translation as just matrix multiplication.
So we’re going to take the transpose of the source language word vector, multiply it by this
interaction matrix, which will produce a vector in the target language. And we will just compare
this target language vector to the words in our target language vocabulary, just with a dot product
for example, and this will give us some score which we can exponentiate and still have the
standard normalization formulation.
So if we do this, we end up with what I call the log-bilinear translation model, and basically what
we’ve done here is instead of taking a dot product between a learned weight vector and a feature
vector, we have this term, which is this bilinear product of the source language vector, this
translation matrix, and this target language vector. And the interesting thing about this model is
that we can train these vector embeddings from large, unlabeled, monolingual data in the source
and target language, which we often have copious amounts of.
The hope is that we can learn this interaction, this translation model matrix, from small amounts
of bilingual data, which is typically small and difficult to construct. Since we only learn d-by-d
parameters, it’s typically a much, much smaller space of variables that need to be learned than
we normally learn in translation. So the hope is that we can learn to reliably translate words in the source language into the target language that were never observed in the parallel corpus we're learning from, because we know something about their semantics based on their distributional profile, or whatever construction we've used.
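[A sketch of the model being described, in assumed notation: phi(s) and psi(t) are the pre-trained source and target word embeddings, and W is the learned d-by-d interaction matrix.]

```latex
p(t \mid s) \;=\; \frac{\exp\!\left( \phi(s)^{\top} W \, \psi(t) \right)}
                       {\sum_{t'} \exp\!\left( \phi(s)^{\top} W \, \psi(t') \right)}
```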
Yeah?
>>: I think this is really pretty but what happens when we have sense issues, right? And the
English word [inaudible] can be translated in German as either [inaudible] depending on whether
you know somebody or something, right? And it seems like if I were to learn a monolingual
distribution for [inaudible] those might land in different places, whereas the English word is going to land in a single place.
>>Chris Dyer: Right. So --
>>: How do you deal with these issues?
>>Chris Dyer: So the translation model is working on that low-dimensional space. So there
would be a part where know would have a single representation and there would be some kind of
activation in basically both parts of the target language vector. So there would be some
ambiguity. So basically you would still want to use a language model on the target side to
distinguish between things that are ambiguous as we always do in translation.
Yeah?
>>: How do you determine the dimension of the vectors and why [inaudible] language?
>>Chris Dyer: They don’t have to be. This is just where I started. I basically determined the
dimension because there was a data release a couple years ago from somebody who compared
a whole bunch of these and I could just download them and they were in 50 and 100 dimensions,
and so I used 50 and 100 dimensions. But you’re right. So I actually think a promising -- I think
this model isn’t going to work particularly well.
I think we need slightly richer representations over here, so I may add this as a term to the
current formulation that I’ve got so you can have both the energy term and the standard one. And
also there are several different ways to construct these low-dimensional embeddings of words,
and I think, in addition to learning this interaction matrix, we're probably going to want to take, say -- okay, here are the [inaudible] embeddings,
here are the [inaudible] embeddings, and we’re going to learn another matrix that projects those
down into a d-dimensional space that’s actually used for translation.
So some of these embeddings are, or here are some SVD embeddings, so we're going to get
several low-dimensional things and then we’re going to basically have something -- it’ll look a little
bit more like a multilayer neural network or one of these deep architectures. But I don't know how to set the structure. There are ways to learn the structure of these things. Again, inference is really
scary even when you don’t have the latent variables that I have.
So I don’t know how well that would work, but I have, as of yet, very few intuitions about what the
right architecture is, but I think it sounds promising.
Yes?
>>: [inaudible]
>>Chris Dyer: So right now I’m assuming these come from somewhere else. So I use a
standard off-the-shelf technique for learning low-dimensional embeddings of word meanings. So
there are established techniques -- you basically often do something like build a very large vector based on the contexts that a word occurs in. So you get every occurrence of the word hund in a large German corpus and say, what occurs to the left, to the right, and within five words? And you run an SVD on that, and that's one way of learning a low-dimensional embedding.
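[A minimal sketch of the recipe just described -- count context words within a five-word window and truncate with an SVD; the toy corpus, vocabulary handling, and lack of any weighting scheme are assumptions.]

```python
import numpy as np

def cooccurrence_matrix(sentences, vocab, window=5):
    """Count how often each vocabulary word occurs near each other word."""
    idx = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for sent in sentences:
        for i, w in enumerate(sent):
            if w not in idx:
                continue
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i and sent[j] in idx:
                    M[idx[w], idx[sent[j]]] += 1.0
    return M

def svd_embeddings(M, dim=50):
    """Truncated SVD of the co-occurrence matrix; rows are word vectors phi(w)."""
    U, S, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :dim] * S[:dim]

corpus = [["der", "hund", "bellt"], ["der", "hund", "schlaeft"]]
vocab = ["der", "hund", "bellt", "schlaeft"]
phi = svd_embeddings(cooccurrence_matrix(corpus, vocab), dim=2)
print(phi.shape)   # (4, 2)
```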
And then I just define that to be phi of hund. But I don't, at that point then, [inaudible] fixed when I learn W. And the reason is I want to be able to learn phi from a large amount of -- I don't want to just be restricted to what I've got in the parallel corpus to learn those meanings.
>>: So the way you generate this phi function, I assume, is a kind of generative model rather than a [inaudible].
>>Chris Dyer: No.
>>: [inaudible]
>>Chris Dyer: I haven’t thought too much about how to deal with OOVs here. The idea is that
you're going to be able to learn phi from such a large amount of data that it will be a much rarer problem than when you just learn translation models from just a small amount of data. So the idea is you'll get some embedding for a very, very large number of words, everything on the Internet, say.
>>: Do you have any idea how this framework has solved [inaudible]?
>>Chris Dyer: Oh, that’s a good question. I haven’t thought about that yet. That’s a really
interesting idea. So the question was like, this is for lexical semantics, how would you do this to
model something like [inaudible]. There’s some really cool work on -- yeah, I don’t know. That’s
an interesting idea. These have been applied very successfully to build very good language
models. So in translation modeling we're just conditioning on something else. It's the same
problem.
You had a question?
>>: [inaudible]
>>Chris Dyer: It’s hard to say. Since I’m modeling, since I’m evaluating this as an alignment
model, I don’t ever really do predictive inference in the model. I mean I have run it. It’s not a very
good translation model. It's not as bad as you might think it would be, perhaps, but it's still eight BLEU points off of a phrase-based system.
>>: [inaudible]
>>Chris Dyer: Yeah, it’s hard to say. There are no constraints on the features, though. So you
can definitely -- you can include context basically when you’re making a prediction, so you can
conjoin a word with a part of speech, or the word to the left or right, or whatever topic or something like
that. And I’ve tried those things. You can definitely improve the likelihood. Alignment plateaus
fairly quickly, unfortunately.
Yeah, yeah, yeah. Sorry. Okay. So quickly now, this part of the talk is much quicker. So we’re
going to talk now about -- so alignment is nice, but it’s not the whole enchilada. So now we’re
going to talk about actually doing translation and we’re going to focus on these configurational
constraints.
So one of the reasons why translation is difficult today is we don't have to translate things word-by-word like we were doing in the first part of the talk. We can, for example, translate this
sentence from Japanese, John apple eat into John ate an apple, by translating apple eat as a
unit. And this is a great model, this gives us
state-of-the-art performance and we got a good translation here, but if we rely just on this to get
the outputs, the generalization is very poor.
So what we want is the ability to do some kind of reordering, of course, in cases when we don’t
have memorized phrases that tell us how to put things together in the target language.
So what we want is something like this. In some cases we may have a phrase that does the
reordering for us, in some cases we may need to do the reordering ourselves. So this is why
reordering is a challenge. It’s also a challenge because it’s a really, really hard problem.
Searching permutations is about the hardest thing you can do in computer science. It’s pretty
easy to show that it reduces to a traveling salesman problem.
If you want to compute expectations over a bunch of permutations, as I probably do if you want to
learn, it’s even worse. You get into all these horrible counting problems. So basically what this
means is when you're modeling and doing inference, or when you're doing inference in models of permutations, like we have here, you have to work with some kind of approximation. And a popular one in MT is to search an exponential number of permutations using a context-free structure.
The basic idea is here you have some kind of tree structure and you consider swaps, local
swaps, of that tree structure. And so my contribution here is to constrain the reordering that’s
considered by a source language parse tree. The intuition is to think about parse trees as these
[inaudible] that let you permute the daughters of individual nodes. Here we've got this sentence, a dog chased John, which is, of course, an SVO sentence. We'd like to translate this into Japanese. So what we'd like to be able to do is know that we need to switch John and chased.
So we need to do something like this. Now, as I said, in the phrase-based world where we might
have learned that dogs chased john and can just use that whole unit all at once, we actually don’t
know if we want to reorder this necessarily. We don’t know until we try and translate it, in fact.
So what we’re going to do is we’re going to expand this tree.
So in my approach to reordering, we take all of the non-terminal productions, all of the rules in a tree that have multiple nodes on the right-hand side, and we consider all
permutations of them locally. In this case we have a forest now, which I’ve encoded up here in
this picture that considers both orderings of the verb. And now we’re going to translate the
sentences. We’re going to allow the model to translate any of these sentences as the possible
source.
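[A toy sketch of the reordering space being described -- permuting the daughters of each node of the source parse; the real system packs these into a forest rather than enumerating strings, and the tree encoding here is just for illustration.]

```python
from itertools import permutations

def reorderings(tree):
    """tree is either a word (str) or a (label, children) pair."""
    if isinstance(tree, str):
        return [[tree]]
    _, children = tree
    child_options = [reorderings(c) for c in children]
    results = []
    # Consider every local permutation of this node's daughters.
    for perm in permutations(range(len(children))):
        combos = [[]]
        for k in perm:
            combos = [prefix + option for prefix in combos
                      for option in child_options[k]]
        results.extend(combos)
    return results

# "a dog chased John": S -> NP VP, VP -> VBD NP
tree = ("S", [("NP", [("DT", ["a"]), ("NN", ["dog"])]),
              ("VP", [("VBD", ["chased"]), ("NP", [("NNP", ["John"])])])])
for sent in sorted(set(map(tuple, reorderings(tree)))):
    print(" ".join(sent))   # includes the Japanese-like "a dog John chased"
```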
So which sentence in this grammar that we’ve just constructed has the most
target-like word order? So if we had examples where somebody had written down the sort of
most English-like reordering of a Japanese sentence, we might be able to train a model like this
directly. In fact, some people have done things like this. So there was a nice paper from Google
last year that did this. But as I tend to do, I like to learn things like this as a latent variable.
So rather than assuming that I can observe this sort of English prime order when I’m translating
into Japanese, I assume that I have a model that says how likely is this English sentence to be
reordered into this English prime before it is to be translated? Then I can model this just by
saying how likely is, given some source sentence, the observed target string, summing over all possible reorderings.
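[In assumed notation, with s' ranging over the reorderings of the source sentence s licensed by the permutation forest:]

```latex
p(t \mid s) \;=\; \sum_{s'} p(s' \mid s) \, p(t \mid s')
```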
So because of the assumptions I’ve made, what I’m saying is that I’m going to encode the space
of reordering of the source language in this context-free structure which is derived from a source
language parse, I’m assuming the existence of a
source-language parser. And it turns out that you can intersect this context-free space with a
translation model represented as a finite-state transducer using a variant of Earley's algorithm.
It’s just a slight generalization. The exhaustive computation of this intersection takes place in
polynomial time and it turns out that, unlike the previous problem, where even though it was
polynomial time it was too expensive, exact inference is possible. It’s very, very fast. You don’t
have to do any pruning, which is nice.
So this lets us construct a context-free forest containing all possible translations and all possible
reorderings of a particular source sentence. So the model I’m using consists of just six coarse
phrase-based features. I’m not making any particular innovation there. Standard trigram
language model, and then very simple configurational features where I just fire a particular
indicator feature each time that just says what rule was used.
So in the forest there might be a rule that says VP rewrites as V NP, or its permutation, VP rewrites as NP V. There's nothing to say what the source-language word order is. There's no information that comes from a target language parse tree. We're just going to have features that say
what does it look like when a VP rewrites typically.
So obviously we can imagine constructing a lot more informative and interesting features here.
The experiments that I’m going to do here -- I’m going to look at three different cases. So the first
is Arabic English. And Arabic and English are actually not very difficult. Not a very difficult
translation pair when it comes to reordering. So there are some local differences. Adjectives
follow the nouns they modify. And the one sort of large-scale difference is that verbs are typically
the first thing in the sentence, and to translate into English you need to move that verb that
occurs at the beginning of the sentence somewhere to the middle of the English sentence.
The second set of experiments look at Chinese and English, and we’re going to look at a very
small corpus of training data and then a much larger one. And Chinese and English is a
language pair where word order matters a lot. So the local word order tends to be rather similar
to English. Adjectives come before the noun they modify, but some larger structural units are in
different places. Relative clauses and prepositional phrases come to the left in Chinese of the
things they modify, whereas in English they come after. And this is very important to remember, because we typically are going to have to have things like a noun and the CP that it modifies, or that's modified by a CP, memorized as a phrasal unit.
That’s just too much information. Or these are too many words to have memorized. So a good
translation model for Chinese needs to be able to move these large structures around sort of as
abstractions, not just by memorizing them.
Here is first a list of what happens when we train this model to explain the English output of a
Chinese input using the configurational features I talked about on top of this daughter node
permutation procedure that I introduced earlier. The two things that I’ve got bolded here indicate
that the model has learned to put prepositional phrases after the VPs they modify and to put CPs, which are what the Chinese Treebank calls relative clauses, after the NP they modify. And, importantly, these are the English word order, not the Chinese word order. So, again, we're seeing that these latent variable models can be used not just to reconstruct what's in some sense going on on the source side, but what's coming out, what's observed, as the target language.
And the rest of the features are sensible and are where Chinese and English do agree.
Okay. So we're going to look at two evaluation criteria for the translation model. First, the model size, the number of rules, and second the translation quality, just in terms of BLEU this time.
So the translation model size, compared to synchronous context-free grammar baselines, which
are current sort of state of the art in translation, the finite state translation model that I’m using in
this, in my proposed approach, is about an order of magnitude smaller. And that can be
obviously very important for a variety of reasons. And the way to think about this is that the
SCFG model in some sense is making the translation model do two things at once.
It’s making it come up with a model of reordering and it’s making it, of course, deal with lexical
translation. So in a sense this is the product of the FST and the CFG that we're using for reordering. And it's just, you might say, over-lexicalized.
Now, this benefit in model size -- does it come at a cost? We see a couple of things. For Arabic and English, the first line here, the model that I've proposed doesn't work quite as well as the synchronous context-free grammar baseline. It's down a little bit. But for Chinese and English, in both of the cases, the low-data condition and the large-scale data, we're already seeing that this model is out-performing the current state-of-the-art context-free grammar models, which are shown here.
And just with these very, very simple features that decide what order to put the Chinese in before translation. And so this is very promising: we don't really need these massive models that we often have been building to do as well at translation.
We can still rely on things that look more like standard phrase-based translation models. So
that’s nice to see. So for languages where you have large-scale reordering, this model seems to
work well. Okay. So I know this has gone long, but I’m going to conclude by saying if you like
any of this stuff, all of this stuff is available in some software I’ve released. It’s all open-sourced.
You can try it today.
And I'm going to conclude by saying -- okay. Generative models are great. They've really dominated unsupervised learning, but the assumption that processes and their parameters have to be isomorphic is, I think, holding us back. And in particular we often end up coming up with
computationally tractable approximations of what we think the phenomenon is that we’re trying to
model. So instead of generative stories, we have creation myths. And generate and test models
let us recover from that a little bit.
So this basically just says that we can separate the process of generating a set of candidates
from the evaluation of which of those are good. And, importantly, in models like this, when we’re
constructing a hypothesis about what is a good output, we don’t have to have a whole generative
or complete account of how some data got to be. Our model doesn’t have to be true. We just
have to declaratively state which variables correlate and how. We don't have to say why.
And we don’t have to have a complete account.
Say we want to model English morphology's relationship to tense: we don't have to have a complete account of temporal semantics and morphology to put this into a
model. We just need to say here are some features that look at tense, here are some features
that look at English morphology, and then if this is a good fit for the data, you’ll get a better fit.
And if not, the feature will be regularized out.
So we don’t have to know why things correlate. Basically the theories that we rely on can be
partial and descriptive. And that’s really where we are in language these days. We don’t have
complete accounts of the phenomena we're interested in modeling. Here we don't need to.
So when we're doing, especially, unsupervised learning, we need to be able to build in all the knowledge that we can. And feature engineering is a really, really powerful tool to do that. So I'll conclude with that and thank my many collaborators in this work. So thanks.
>>: Thank you.
[applause]
>>: We had a long discussion section in the middle, but if anybody has remaining questions.
>>: So how much of the feature-engineering do you think has to be language specific? I was
noticing you had a constraint blocking nouns from [inaudible] --
>>Chris Dyer: Oh, no, no, no. That was an example. So I just had indicators that fired when a
noun was hypothesized to have aligned to -- or any part of speech was hypothesized to link to any other part of speech. I just said I think part-of-speech alignments are useful signals for alignment.
And those were examples that I thought might be reasonable to assume.
>>: [inaudible]
>>Chris Dyer: Right. So the great thing about these models is that the features are just
hypotheses about the correlations and the data has the ultimate say in whether they’re useful or
not. So in different language pairs features will have very different weights.
So a good example of that were those path bigram features where in English to Urdu translation
we see auxiliaries followed by period as a high-weight feature that is very different than what you
see in Chinese. But, more broadly, your question about language pair specific, you can be
surprisingly agnostic about this stuff. I did have a few things that assumed that the two
languages shared a common script so that things like edit distance were meaningful. And since I
was working on things like Urdu, I did some kind of fake Romanization to make that assumption a
little more reasonable.
You don’t have to have those features, though. You can throw other things in. And to some
extent that's really what motivated this completely unsupervised learning of the representations: can we learn everything and not do much engineering at all?
>>: [inaudible] so we’re talking about situating this in a space of reordering Google’s paper?
>>Chris Dyer: Yes, yes, yes.
>>: That’s not what you’re doing, actually.
>>Chris Dyer: No, it’s not. So I’m -- rather than making a hard reordering decision, I’m saying
that we don’t actually necessarily want to make a hard reordering decision, because we may
have already memorized how to do the translation. And in some cases, for some noun-adjective
pairs, we may have it memorized; for others, we might not. And so rather than trying to make a
decision about whether to
pre-reorder or not, let’s just let the model make a decision when it comes to, you know, let’s let
the whole pipeline do the inference rather than making hard decisions at each point.
And then, further, rather than saying which reorderings are good for a particular language pair, let's just consider all of them within a certain class, sort of this ITG-style reordering where you can permute the daughters of particular nodes but nothing else. So this doesn't capture all kinds of reorderings. So, for example, certain processes do extraction: Chinese has wh-words in situ, whereas in English we take things like what and why and move them to the front.
This model can't capture that because that's more than just a permutation of a node's daughters.
>>: Then you are learning the weights for these reorderings?
>>Chris Dyer: So each rule here that we see, so you can think about this tree, which is an input,
gets transformed so each of these binary children gets transformed into two different nodes with
two different orders. Each order of, so VP now can be written as NP followed by VBD or VBD
followed by NP. Each of these will have a different weight, and right now the weight is just
determined by the order of the rule and what its left-hand side is.
So that is what I’m learning as a latent variable. So each S prime is a string of a particular
reordering hypothesis. I'm training the model to maximize this probability -- the probability of the target sentence, given the source sentence, summing over all possible reorderings.
So this is very similar to what I did in the first part of the talk where I’m modeling just the
probability of the observed string pair summing out this latent structure. So if we find that, for
example, switching verbs and their objects gives you a good account of the data and improves
the probability of the corpus, then we learn --
>>: So [inaudible] equation, then the left side is just based on those -- what is on the right side?
>>Chris Dyer: This is just a standard translation model.
>>: So, which is phrases, or?
>>Chris Dyer: Phrases. So phrases, that's the standard. I just threw the standard Philipp Koehn phrasal features in. And I should say when I trained these features I didn't have a language model. So this part, I added the language model in sort of a second phase and then trained the whole thing with MERT so that I got good --
>>: [inaudible] to the right has a bunch of [inaudible] and the part to the left is just like [inaudible]-level features?
>>Chris Dyer: Exactly. Yeah. So I collapsed all of this into a single top-level feature. Yeah, it
was a single top-level feature. Yeah, so the basic idea here is I didn’t want to have to assume
that somebody had annotated reordered sentences, for a couple of reasons. One is that's a really artificial task, and two, you know, in some cases we don't want to reorder because the translation model
knows how to translate something as a unit. So I did both.
>>: But the people are doing that, they actually like [inaudible].
>>Chris Dyer: Yeah, I guess so. I could’ve iterated it and then relearned the translation model.
It all seemed kind of artificial.
>>: But you haven't actually been able to compare against those systems --
>>Chris Dyer: You know, that's actually a good question. Nobody's ever brought that up. So I
have run an SCFG model with Philip’s Chinese pre-reordering rules, and I did this and I know
[inaudible] did this, and neither of us found that it helped on top of an SCFG system. So it helps
with Moses, but basically the results are the same.
And from what I remember, I don’t think there was ever any comparison of the pre-reordering
systems or with a phrase-based system to Hiero, so I don't know actually where that
would slide in, whether it would be here, here, or here in terms of the scores.
>>: Any hypothesis to why [inaudible]?
>>Chris Dyer: Yeah. First of all, Arabic parsing is terrible, and especially the span of the subject,
which is exactly what you need to get in order to get the reordering of the verb right, is just off.
And I actually don’t quite understand why it’s as hard as it is. Some people, who really, really
care about Arabic NLP, have been working for a very long time and the results are just, like, they're
bad, and much worse than something like determining prepositional phrases in Chinese, which is
what you have to target in Chinese.
The other thing is the trees are really, really flat, and so this permutation-based model -- I also introduced some constraints that if there are too many daughters of a node you only do some
local permutations rather than the whole factorial number, just for tractability reasons. And that
may be hurting. But it’s probably not that I’m leaving out good things, it means I’m probably just
not getting good estimates of the reordering parameters.
>>: You could, for example, consider other, you don’t have to constrain yourself because the
training doesn't care what permutations --
>>Chris Dyer: No. So the permutations -- so I do need to be able to compute -- I do need to be able to sum out, to compute certain expected values, efficiently, and the dynamic programming
algorithm that I was able to use assumed that there was a context-free structure on the input that
could be intersected with a finite-state representation of the translation model. I think many of the
transformations on trees still result in context-free structure so you could do more things, but as
you add more and more to that source grammar, that inference problem becomes harder
computing those expectations.
>>: [inaudible]
[laughter]
>>Chris Dyer: All right. Thanks. My apologies for going long, but I think the discussion was
good.
[applause]