>> Will Lewis: That's great. So it's clearly -- everybody really wants to be getting into the talks.
It's been great. We've seen people already going around the posters there. So I'd like to
welcome you, though, to the first -- what we've called the extended abstract session for SMT.
And probably the best person to have been hosting this would have been Anoop Sarkar, my
colleague from Simon Fraser. But since he's not here, you get stuck with me again. I know
Anoop and I always get mistaken for each other, so my name tag is here so you just don't get too
confused here.
So for the two presentations, the first one is going to be by Baskaran Sankaran. And he's going
to be presenting a paper on incremental decoding for phrase-based statistical machine translation.
So Baskaran, it's all yours.
>> Baskaran Sankaran: Thank you.
[applause]
>> Baskaran Sankaran: Good morning, everyone. Thanks for coming for this talk. This is a
joint work with Ajeet Grewal and Anoop Sarkar. And I'm going to talk about the idea of
incrementally decoding a sentence in a phrase-based statistical machine translation.
So just to give an outline of the talk, so I'll first motivate the problem before going to algorithmic
details of the incremental decoding, including some of the details about future cost computation
and delay pruning. And then I'll present the evaluation results and then give some concluding
remarks.
So the problem, it's -- it can be motivated easily. It's what I call TAYT, or Translations As You Type. What I essentially mean is I want to get the translations as the sentence is being input by the user, for every word, instead of waiting for the entire sentence as input before the translation is done. So I want the translations to be generated for every word.
So thinking about this, we can sort of think about two approaches, which are quite obvious. One is to re-decode every time a new word is added: the first time a new word comes in you decode for that word, and when you get the second word, you decode again from scratch.
A slightly trickier approach would be to reuse the partial translations that were generated earlier and just incrementally decode for the new word. My talk is going to focus on this. This is what we call incremental decoding.
If we think about the applications, it could be used in realtime chat or language learning. And
interestingly it's already available in Google Translator and possibly some of the other things as
well.
However, I want to just mention that though it's available in Google, it's still not clear what type of algorithm they use, and to the extent we are familiar with the literature, this is the first work that presents this idea.
So this is our objective. We want to get this incremental decoder implemented, using the partial decoder states, which we hope will be faster than re-decoding from scratch every time you get a new word.
And one of the very significant problems is the search error, because we haven't seen the entire sentence. We're just dealing with a partial sentence at every step, and there are issues in computing the cost. So how do we reduce or minimize the search error? That's one interesting research question here.
So before going into the incremental decoding this is a very brief overview of beam search,
regular beam search decoding as is done by Moses.
So given a sentence, a Spanish sentence like this, the beam search decoder translates an increasing number of source words at every step. At the first step it translates one word -- I mean, different single words in the source language -- and at the second step it extends the translations that were generated earlier and translates additional words. So this is how it is done.
And at every step it keeps track of two costs: one is the current cost and the other one is the future cost. The current cost is the cost of generating the translation so far -- let's say the first one, Mary: the cost of translating the word Maria as Mary is going to be the current cost. And the future cost is going to be the cost of translating the rest of the words in the sentence.
A bit more detail about the future cost. So the future cost is computed for all the phrases in the
sentence, and then it's aggregated over all possible sequences in the sentence using some
dynamic programming.
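As a rough sketch of how such a future-cost table can be built -- this is not the authors' code, and `phrase_cost` here is a hypothetical stand-in for the cheapest translation-model-plus-language-model estimate of a span -- the dynamic program looks something like this:

```python
def future_cost_table(n, phrase_cost, max_phrase_len=7):
    """Estimate the cost of translating every source span [i, j).

    phrase_cost(i, j) is assumed to return the cheapest known cost of
    covering words i..j-1 with a single phrase (negative log scale,
    lower is better), or infinity if no phrase applies.
    """
    INF = float("inf")
    cost = [[INF] * (n + 1) for _ in range(n + 1)]
    # Costs of spans covered by a single phrase.
    for i in range(n):
        for j in range(i + 1, min(i + max_phrase_len, n) + 1):
            cost[i][j] = phrase_cost(i, j)
    # Dynamic programming: a span may be cheaper as two smaller spans.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                if cost[i][k] + cost[k][j] < cost[i][j]:
                    cost[i][j] = cost[i][k] + cost[k][j]
    return cost
```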
Typically people do this as a preprocessing step, even before the decoding is started.
So these are some of the challenges in incremental decoding: how do we reuse the partial decoder states efficiently? The next issue is that the sentence is not complete yet, so how do we compute the future costs? Then we started posing the question of whether we need the future cost at all in order to decode efficiently.
And the other problem, as I mentioned, is the search errors, which of course are due to pruning. In a practical setup like realtime chat and other things, you really want to keep a smaller beam size, and in addition to that you have only partial future costs. So some of the candidate hypotheses that are generated by the translator could be pruned much earlier, leading to search errors. So this is, again, an important problem.
So, yeah, this is roughly the outline of the incremental decoding algorithm that we use. For every new word w_i that's added, we get the new phrases that could possibly be generated for this word and get the corresponding translation options.
It should be noted that we are not trying to get the phrases and translation options for the words that are already typed in; those are already there as part of the partial decoder states.
And then for this set of new phrases, we compute the partial future costs and do a limited update of the future costs in the partial decoder state.
And then we trace through all the bins beginning from the first one until the current one and in
this process we create new hypotheses and extend hypotheses from the previous bins. This is
just like the regular beam search. So I'm not going to go through the details there.
And then when it reaches the current bin, it just returns the best translation for that bin.
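A minimal sketch of that per-word loop, assuming hypothetical `state` and `phrase_table` objects -- these names are illustrative only, not the authors' actual implementation:

```python
def add_word(state, new_word, phrase_table):
    """One step of incremental decoding, as outlined above (sketch)."""
    state.source.append(new_word)
    pos = len(state.source) - 1
    # 1. Only phrases that involve the newly typed word are looked up.
    new_phrases = phrase_table.phrases_covering(state.source, pos)
    # 2. Limited update of the future costs stored in the decoder state.
    state.update_partial_future_costs(new_phrases)
    # 3. Trace through bins 1..current, extending hypotheses from earlier
    #    bins and creating new ones, just as in regular beam search.
    for bin_idx in range(1, len(state.source) + 1):
        state.extend_bin(bin_idx, new_phrases)
    # 4. Return the best translation in the current bin.
    return state.bins[len(state.source)].best().target
```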
So in action it works in this way. It gets the word "having" first, which is the first word, and it gets the corresponding translation options. It creates hypotheses in the first bin, corresponding to one word. Then it gets the second word, gets its translation options, creates a new bin for phrases of length two, and updates the earlier bins.
So the process goes on like this until it reaches the last bin when it will have generated the target
sentence. Here we don't -- we are not computing the full future cost. Instead we are just
computing the partial future cost. And we are just reestimating the partial future cost for all the
sequences and just updating the decoder state.
And we found that just doing this is much faster than computing the future cost again from scratch, as we'll show later in the results.
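A sketch of what that limited update could look like, continuing the future-cost sketch from earlier; the assumption is that the table is already sized for the full sentence, with unfilled entries set to infinity, so spans lying entirely inside the old prefix keep their cached values:

```python
def update_partial_future_costs(cost, phrase_cost, n_old, n_new, max_phrase_len=7):
    """Re-estimate future costs after the source grows from n_old to
    n_new words (sketch); only spans touching the new words are updated."""
    # New single-phrase spans: those whose right edge reaches a new word.
    for i in range(n_new):
        for j in range(max(i + 1, n_old + 1), min(i + max_phrase_len, n_new) + 1):
            cost[i][j] = phrase_cost(i, j)
    # Re-run the split recursion only for spans ending in the new region.
    for j in range(n_old + 1, n_new + 1):
        for i in range(j - 2, -1, -1):
            for k in range(i + 1, j):
                if cost[i][k] + cost[k][j] < cost[i][j]:
                    cost[i][j] = cost[i][k] + cost[k][j]
    return cost
```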
Yeah. So pruning is the interesting problem that I mentioned earlier. We have cube pruning implemented as part of the decoder, so the actual pruning happens in the cube pruning step. We have set the beam size and the cube pruning pop limit to be the same. So essentially cube pruning will just pop out as many translation hypotheses as specified, which is going to be a small number.
And when this happens, all the other hypotheses that have poorer scores than the popped-out hypotheses are going to be pruned out, or just ignored.
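A sketch of that pop-limit behaviour, under the assumption that candidates carry additive costs (lower is better); a full cube-pruning step would also push neighbours of each popped hypothesis, which is omitted here:

```python
import heapq

def fill_bin_with_pop_limit(candidates, pop_limit):
    """Pop at most `pop_limit` hypotheses into the bin; everything still
    left on the heap is silently discarded, which is where the search
    errors discussed here can creep in."""
    heap = [(cost, idx, hyp) for idx, (cost, hyp) in enumerate(candidates)]
    heapq.heapify(heap)
    kept = []
    while heap and len(kept) < pop_limit:
        cost, _, hyp = heapq.heappop(heap)
        kept.append(hyp)
        # A real cube-pruning step would push the neighbours of `hyp` here.
    return kept
```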
And as we'll see later, this pruning is risky. And this will get even worse with a smaller beam
size, and the issue is partially due to the lack of the availability of the full future cost.
We just make this key observation: the best candidate could have been pruned already. This is a key observation for us. So when we started looking for some solutions and started scratching our heads, we came across this. It's a story. Once upon a time there was a race between a tortoise and a hare. And we all know the rest of the story.
Looking at this, we realized, okay, possibly we can draw some analogy from this. And when we tried to do this, we found that we have a few hares running with a large number of tortoises. And what are hares? Hares are the hypotheses that are passed by the cube pruning step. And what are tortoises? Those are the poorer-scoring hypotheses -- I mean, poor scores compared to the hares, not necessarily very poor scores.
And these poor scores could just be at that particular decoding state, and things could improve as new words are added at a later time.
I would like to just stress this point: when we compare the tortoises and hares, the comparison is in terms of the translation score and not the speed.
So every story has a moral, including ours. So here we think that a candidate translation with a
poor score at this stage might beat a better candidate at a later stage when more words are added
in the sentence.
Or to put it in a different -- to put it differently, we don't want to discard the candidates, at least
not yet.
So with this understanding, we introduce this notion of delay pruning. So as the name suggests,
the goal is to delay the pruning and give the candidate hypotheses a chance to improve. This is
our goal. And at the same time we have the constraint of not letting the search space grow to an explosive size.
And we achieve this by a two-level strategy. At the first level we delay the pruning and we
selectively retain some hypotheses, some tortoise hypotheses for every bin. I'll explain how we
do that on the next slide. And at the second stage we prune the tortoises that do not improve
beyond a set limit.
So, the delay phase. In every bin we have different sets of hypotheses that cover different sets of source words, which we call the coverage vector. We first identify all the coverage vectors that are not represented in the bin at the end of the cube pruning step, and for each such unrepresented coverage vector we generate a small number of hypotheses, the tortoises. Typically we generate up to three tortoises for every coverage vector.
And then we compute the normalized language model score, and we use a threshold value for the normalized language model score and only retain those hypotheses that have a better normalized LM score than the threshold. We typically found values of minus 2 and minus 2.5 in log scale to work better, based on a small dev set.
And once we have identified all these tortoises, they are flagged as tortoises, and at each subsequent bin we expand them minimally.
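A sketch of that selection step, with made-up attribute names and scores treated as log probabilities (higher is better); the threshold of about minus 2 is the normalized LM cut-off mentioned above:

```python
def select_tortoises(hares, leftovers, lm_logprob, max_per_cov=3, threshold=-2.0):
    """Delay phase: for every coverage vector not represented among the
    hares that survived cube pruning, consider up to `max_per_cov` of the
    best remaining candidates and keep those whose length-normalized LM
    log score beats `threshold`."""
    represented = {h.coverage for h in hares}
    considered = {}
    tortoises = []
    for h in sorted(leftovers, key=lambda c: c.score, reverse=True):
        if h.coverage in represented:
            continue
        seen = considered.get(h.coverage, 0)
        if seen >= max_per_cov:
            continue
        considered[h.coverage] = seen + 1
        norm_lm = lm_logprob(h.target_words) / max(len(h.target_words), 1)
        if norm_lm > threshold:
            h.is_tortoise = True        # flagged; expanded only minimally
            h.bins_survived = 0
            tortoises.append(h)
    return tortoises
```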
And then coming to the prune step: giving the candidate hypotheses a chance to improve is a good thing, but at the same time we don't want to let it go on forever, in order to contain the search space.
So we prune the tortoises beyond what we call a race course limit. Race course limit is nothing
but the chance given to the tortoise hypothesis to improve and beat some of the hares and break
into the cube pruning step.
So we define this in terms of the number of bins that a tortoise has in order to break into the
beam. And we actually experimented with different values of the race course limit, from 3 to 7, and I'll show some interesting results later.
And one interesting -- or one important observation is that downstream some of the tortoises
might beat the hares. And when that happens, these tortoises are unflagged and merged into the
beam.
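A sketch of the prune step and the unflag-and-merge behaviour, with the same made-up attributes as above (higher score = better):

```python
def race_course_prune(beam, tortoises, race_course_limit=3):
    """Tortoises that now beat the worst hare are unflagged and merged
    into the beam; the rest survive only until they have spent more than
    `race_course_limit` bins without breaking in."""
    worst_hare = min((h.score for h in beam), default=float("-inf"))
    still_racing = []
    for t in tortoises:
        if t.score >= worst_hare:
            t.is_tortoise = False       # caught up: merge into the beam
            beam.append(t)
        else:
            t.bins_survived += 1
            if t.bins_survived <= race_course_limit:
                still_racing.append(t)  # keep racing, minimally expanded
    return still_racing
```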
Yeah. This is -- this slide highlights the differences between the incremental decoder and the
regular beam search decoder. So, number one, we use partial future costs. The future costs are
computed only for the new phrases and are updated in the decoder states. And, number two,
existing hypotheses are reused. And the third important thing is the introduction of the delay
pruning.
Now, moving on to the evaluation, we implemented an in-house decoder in Java, which includes cube pruning and lazy cube pruning. And we use seven standard phrase-based features. Our decoder supports both regular beam search and incremental decoding, so we can run it in either mode.
So we used French-English and English-French for evaluating this, trained on WMT-07; we used Moses for training and MERT for optimizing the weights.
So we did five different experiments. One is we wanted to benchmark our decoder against Moses in order to compare them in the regular beam search setting. Secondly, we wanted to use our decoder to compare incremental decoding with the idea of re-decoding from scratch and show that incremental decoding is better in terms of translation quality and speed. And the third thing is we also experimented with different race course limits.
Here the first two rows give the comparison of our decoder and Moses in the regular beam search setting. This result is for French-English, and the next page has English-French. We found our decoder to have a slightly higher score, which we believe indicates that it compares favorably to Moses. The last three lines compare the incremental decoder with re-decoding from scratch.
So as you can see, when we use the strategy of re-decoding from scratch, it has a Bleu of 26.96, but when we use the incremental decoder, the Bleu actually improves slightly. However, we believe that this will have search errors. We then tested the incremental decoder with the delay pruning, and we found the Bleu to increase significantly, to a level that is comparable to this. Just to mention, these three settings use a beam size of 3, whereas this one has a higher beam size. That's why you see the difference in the scores.
And the results are almost the same for English-French, except that Moses has a slight advantage here. And though Bleu indicates the advantage of using incremental decoding,
we wanted to quantify the search errors. So we tried to compute the mean squared error of the
incremental decoding with the regular decoding and see whether that indicates something.
So what we did is this. For the two settings of incremental decoding, without and with delay pruning, we got the scores of the top hypotheses, and we also got the scores of the top hypotheses using the regular beam search.
And then we computed the mean squared error between these sets -- beam search versus incremental without delay pruning, and beam search versus incremental with delay pruning -- and these are the numbers.
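That measure is straightforward to compute; a sketch, given the model scores of the 1-best outputs from each decoder over the test set:

```python
def search_error_mse(incremental_scores, beam_search_scores):
    """Mean squared difference between the model score of the best
    hypothesis found incrementally and the one found by regular beam
    search, averaged over the test sentences."""
    assert len(incremental_scores) == len(beam_search_scores)
    diffs = [(a - b) ** 2 for a, b in zip(incremental_scores, beam_search_scores)]
    return sum(diffs) / len(diffs)
```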
As you can see -- oh, sorry -- the search error here for the case with the delay pruning is significantly lower than the case without delay pruning.
And we also experimented with the speed of this. These are the setup and machine details that we used. We were interested in finding two things. One is the average time spent on decoding for every additional word, and the second factor that we were interested in is the average number of hypotheses that were generated for every bin.
And when we did this, we found that re-decoding from scratch is much slower, like by a factor of nine, when you compare it with the incremental decoding. However, when you compare the average number of hypotheses that were generated in a bin, the delay pruning resulted in a large number of hypotheses being generated. But despite having a large number of hypotheses, it could actually achieve a higher speedup than the re-decoding approach.
We also experimented with different race course limits. For values from 3 to 7, we see slight differences in the Bleu score in both French-English and English-French, but they are not statistically significant. However, what we also found is that the average time to decode also increases with the race course limit. This is obvious, because as you increase the race course limit, you increase the average number of hypotheses that are retained in every bin. As you can see, it shows an increasing trend here.
And we were trying to analyze why increasing the race course limit doesn't increase the Bleu.
And we believe that this is possibly due to the lack of long-distance reordering between English
and French. Between English and French, whatever reordering that we have is only local
reordering, like a noun and adjective being turned into an adjective and noun. And we could easily capture that with a race course limit of 3, so a higher race course limit of 7 doesn't actually help. That's what we found.
Just to conclude, we presented the incremental decoding algorithm, which we believe is more effective in the realtime scenario than re-decoding from scratch. And we introduced this idea of delay pruning, which we believe is a novel method that can adapt to the text while keeping a fixed, small beam size.
And we also believe that the delayed pruning could possibly be used in regular Viterbi decoding
in the regular beam search. And this could potentially be helpful when it's integrated with a cube
pruning step, though we haven't done that yet. That's it. Thanks.
[applause]
>> Will Lewis: Thanks, Baskaran. We have ten minutes for questions here. So I see we have a
few already. So go for it.
>> So you mentioned you allow limited expansion of the [inaudible]. Can you elaborate
[inaudible].
>> Baskaran Sankaran: Okay. We -- as I said, we select up to three tortoise hypotheses for
every unrepresented coverage vector. And for those three tortoises, we compute the
normalized LM score and then we again do a further selection by removing the tortoises that
have poor normalized LM score.
We compute the normalized language model score up to that point and use the threshold of
minus 2. Yeah. I think we use minus 2 in all our experiments to further limit the number of
tortoises.
>> Will Lewis: We have microphones going around, so just raise your hand and you will get
[inaudible] if you have a question. Hold your hand high.
>> Thanks, Baskaran, for the interesting talk. I wonder -- so are you using a fixed distortion
limit or reordering limit during decoding?
>> Baskaran Sankaran: That's right. We are using a fixed distortion limit.
>> A fixed. Do you think that it's possible to bound the amount of bins that you need to throw
away based on your distortion limit?
>> Baskaran Sankaran: The amount of hypotheses?
>> The amount of hypotheses that you need to recompute, given a fixed distortion limit. Do you
understand what I'm saying? So if we assume that every hypothesis has to jump to the end of the
source sentence at the end, then we can bound the number of bins that need to be recomputed.
Do you understand what I'm saying?
>> Baskaran Sankaran: Not so --
>> So if we think about it in a monotone -- have you thought about doing incremental decoding
in a monotone scenario where no reordering is allowed?
>> Baskaran Sankaran: Oh, okay, okay.
>> Then basically you can completely reuse all your previous hypotheses, right? Because future
cost estimates are actually not necessary. And so there's a generalization.
I was just wondering if you thought about trying to exploit your distortion limit to bound the
amount of hypotheses that would be affected by adding an additional word.
>> Baskaran Sankaran: No. But [inaudible] with the use of a fixed distortion limit, what we do is, for each of the hypotheses, before they are extended, we sort of look ahead into the future and see -- I mean, before extending this into different hypotheses -- whether this can be expanded further, because it may not be able to expand further due to the distortion limit. So we sort of do some pruning -- I wouldn't call it pruning, but, yeah, checking. And if we find that they cannot be expanded at a later point, we don't expand them. I mean, we don't create new hypotheses out of that.
>> Baskaran, trying to understand the contrast between your method between delay pruning and
partial future costs versus like from scratch every single time, I'm trying to understand if there
are factors involved in certain -- a threshold or a tipping point beyond which your method is advantageous over doing everything from scratch. And on the other side of which it's not
advantageous yet, so I'm suspecting sentence length might be one.
>> Baskaran Sankaran: That's right.
>> Does your race course limit -- does it have anything to do with sentence length?
>> Baskaran Sankaran: No, no. Yeah, but actually we found that it doesn't help us to store partial decoder states beyond a set endpoint, because they are never going to be reused again. So we try to limit the partial decoder states to between three words and 15 words, and beyond that they are just re-decoded for the rest of the words. So if you have 16 words, it would be just incrementally decoded for one, and if you have 17, the incremental decoding will start from word 16.
>> 16 [inaudible].
>> Baskaran Sankaran: Yeah, that's right. But for this -- for these experiments, we sort of stored
all the decoder state, all the partial decoder states. Yeah.
>> Okay. And the other factor, you made a brief allusion to that along the way, might be the
contrasting nature of dependencies in the sentence between the source language and the target
language.
And on the lighter side, one of the classic examples I came across as a student was dependencies
in a complex sentence like John saw Peter help Mary swim, where, you know, if you're
translating it into German, it is one of those push -- push onto stacks and pop off the stacks kind
of relation for the verbs. And in Dutch it's even more kind of perverted, it's kind of cross-serial
dependencies.
So I'm thinking that, you know, as you translate word by word from English to Dutch, let's say,
you -- it may be where you approach [inaudible] retaining partial work as far as possible might
prove to be superior.
>> Baskaran Sankaran: Yeah. That's an interesting question. And, yeah, that's one thing we haven't done in this work, like what will happen if there's a lot of reordering happening.
>> In the simplest of the terms, the ordering of the modifier and the modified, between
mathematics test and test the mathematic, it's a very, very simple kind of a thing.
>> Baskaran Sankaran: That's right. I understand that.
>> But between rules and verbs and in complex and embedded sentences and so on, it's a slightly higher magnitude of complexity. So it seemed that the contrast between languages might be
relevant.
>> Baskaran Sankaran: That's right. Yeah. But the thing is one thing that we could do then is to
play with the distortion limit and maybe have a higher distortion limit. The other thing would be
to have reordering models included in them. Right now our model doesn't use any of the
reordering model. So that might be one possible way to handle that.
>> Will Lewis: Great. Thanks. We have another question. We have time for one more
question. All the way over here. Josepe [phonetic].
>> So for people like myself who are not expert in machine translation, to understand your results, what is the range for Bleu and how good is .27?
>> Baskaran Sankaran: Okay. So, yeah, there are different variations in Bleu that you could
compute, like with casing and without casing. And all the numbers were on detokenized data
and then recased. So that's probably why you see slightly lower numbers. But they're comparable across our decoder and Moses.
But I guess for French-English, the systems at WMT, I think, if I'm right, had about a 30 Bleu score, 30 or 31. Yeah.
>> [inaudible]
>> Baskaran Sankaran: Yeah. Oh, yeah. That's -- yeah.
>> [inaudible]
>> Baskaran Sankaran: Max is 100.
>> Just because you don't match the human reference translations doesn't mean --
>> Baskaran Sankaran: Yeah, it doesn't mean [inaudible].
>> Will Lewis: So this gives us a discussion point for the break, also discuss measurements and
evaluation schemes. So I'd like to thank Baskaran one more time. Thank you very much.
[applause]
>> Baskaran Sankaran: Thank you.
>> Will Lewis: Okay. Next my pleasure and with -- oh. Wow. This worked wonderfully. This
is great. Thank you very much to the people that set this up. I'm pleased to be able to introduce
Ann Clifton who's going to be talking with us about morphology generation for statistical
machine translation. So thanks very much, Ann.
>> Ann Clifton: Hi. I'm Ann. And I'm going to be talking, as Fred said, about some work that
I've been doing with professor Anoop Sarkar at SFU on morphology generation for SMT.
And which one am I -- okay. So to give you an idea of what's motivating this task, I have this
little example taken from Turkish which shows how by packing a vast amount of morphology
onto a single stem they are able to convey something that takes the better part of an entire
sentence to convey in English.
So as you might imagine, this leads to some pretty interesting challenges for statistical machine translation; in particular, morphologically complex languages tend to exhibit a lot more data sparsity. In parallel texts, say, you can imagine that if you have a lexicon of productive stems and morphology, this is going to give rise to a greater overall vocabulary of surface forms, each of which will appear less frequently in any given text than, say, English or something like that.
So this also leads to a great deal of source-target asymmetry. An example that a lot of you might be familiar with is how a morphologically complex language with, say, a case system would use that case to express something that in English we would use a prepositional phrase for. We'd express it lexically.
So and then there's also just a general lack of resources for morphologically complex languages,
which just has to do with the fact that they're relatively unstudied in comparison to, say, English
or one of its related languages.
So there are a few major camps of the approaches that people have taken to introducing
morphological awareness into MT models. And one of these has to do with preprocessing and
using the morphological information in the translation model in order to increase source target
symmetry.
So this has been done for a number of language pairs. And some approaches tend to strip the
morphologically complex language of some of the morphology in order to increase symmetry.
Some artificially enrich the poorer language. And then some others retain segmentation aspects
in the translation itself.
It's worth noting here, though, that a majority of this type of work has been done on translation
from the complex language into English. And as you might imagine, that's a substantially
different task than going in the opposite direction.
So another major approach is to perform the translation on a stemmed version of the original text,
which is an easier task, and then to do the morphology generation in post-processing. And
people have gotten some good results with that, actually here at MSR, for some language pairs
such as English into Arabic, Russian and Japanese using maximum entropy Markov models.
And the idea here is that you can capture dependencies between the morphemes as opposed to
just words in a way that's difficult to do in, say, just a straightforward language model or
something.
So the last approach that I'd like to mention here is using factored translation models, which in
the event of an unseen word are able to back off to a more generalized form, say like a stem plus
morphological tags or some other piece of morphological information.
So we incorporated elements of these different approaches, and then we took as our jumping off
point the Europarl dataset. And so what you're seeing here is how the Bleu scores, the translation performance measurement, correlate with the vocabulary size.
And this is all for an approximately 1 million parallel sentence training set, but the different languages have vastly different vocabulary sizes. So you see over here, French and Spanish, how the Bleu scores are relatively high and the vocabulary sizes relatively small, and then way out here we have Finnish, which performed the most poorly out of the bunch and has far and away the biggest vocabulary size, at roughly five times that of English for the same parallel text.
And to take just a closer look at that here, you can -- this is split into translation direction. And
so you can see that on average translation into English does far better than translation from
English into another language. And the worst of these is Finnish. And that's with that same 1
million sentence parallel corpus and a 2,000 sentence development set on which the model
weights were tuned and then another held out 2,000 sentence test set.
And so naturally we decided to pick Finnish as our target language. So a bit about Finnish.
Here's an example up here of some of the types of morphological information that you can pack
onto a stem. In this case they manage to get in one word something that we need a whole
sentential question to express in English.
So it's a highly synthetic language displaying both fusional and agglutinative properties. And so
it's got consonant gradation, vowel harmony phenomena.
And so what this adds up to is while you have a great deal of morphology that you can pack onto
a single stem, it makes straightforward segmentation of this morphology pretty difficult.
So what we started out with was just training off a baseline word-based system and then also
doing a segmentation of the whole training and dev corpus and looking at the types of errors that
we were getting, with the idea here being that if we could devise a way of incorporating the
morphological awareness into our model in a way that could handle this type of complexity, then
we could perhaps come up with something that would be robust to a great deal of different types
of morphological complexity, but trying to stay minimally language specific here.
So a couple of things that we saw were words that were missed by the MT system had available
segmentations, and some common error types. I'm going to give a little more detail on this.
So this table up here shows that a lot of -- about half of the words from the original development
set reference text were missed in the translation output. And of the words that were missed, a
majority of them had available segmentation, so we wanted to make use of that. And then down
here these are just some examples of some of the types of mistakes that the MT system was
making wherein specifically it would get the stem right and the suffix was wrong.
And so a bit about the segmentation that we did. We actually performed two types of
segmentation. We did an unsupervised and a supervised version. For the supervised version we
used a finite state transducer-based morphological analyzer that was based off of a precompiled
lexicon from an annotated corpus.
And so, as you can imagine, that's going to have some coverage issues, although high accuracy.
And then the unsupervised segmentation that we used was from Morfessor which is not at all
linguistically motivated in its segmentation.
So we took the segmentations and for each of them we trained two different MT systems. We
did a full segmentation model for each wherein we retained the entire segmentation and we just
treated all of the morphemes as separate words, and we just inserted some word internal
morpheme boundary markers so as to be able to restitch the morphemes back together in the
output. And then we also did a stems-only model where we discarded all the suffixes and just
did translation on stems.
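A sketch of what those two preprocessing choices amount to; the "+" marker and the Finnish example are illustrative assumptions, since the talk does not name the actual boundary symbol:

```python
BOUNDARY = "+"   # hypothetical marker; the talk does not name the symbol

def segment_word(word, segmenter):
    """Full-segmentation model: morphemes become separate tokens, with a
    marker on each side of every word-internal boundary."""
    morphemes = segmenter(word)                   # e.g. ["talo", "ssa"]
    tokens = []
    for i, m in enumerate(morphemes):
        left = BOUNDARY if i > 0 else ""
        right = BOUNDARY if i < len(morphemes) - 1 else ""
        tokens.append(left + m + right)           # "talo+", "+ssa"
    return tokens

def restitch(tokens):
    """Join two output tokens only where the boundary markers are adjacent
    (one token ends with the marker and the next begins with it);
    otherwise the morpheme is left in place, as described in the Q&A."""
    words = []
    for tok in tokens:
        if words and words[-1].endswith(BOUNDARY) and tok.startswith(BOUNDARY):
            words[-1] = words[-1][:-1] + tok[1:]
        else:
            words.append(tok)
    # Stray, unmatched markers are simply stripped here (an assumption).
    return [w.strip(BOUNDARY) for w in words]
```

The stems-only model then roughly amounts to keeping only the stem morpheme of each word and discarding the suffixes before training.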
So here's the general overview of our system. So we started out by submitting the input to the
morphological analyzer of either type, and then we would take this analysis and train the MT
systems on that. So for the segmented systems, the output would look just the same as we saw
coming on the other side, and so therefore it would be quite a straightforward matter to just put
the morphemes back together again and be able to do evaluation against our original word-based
reference text.
For the stemmed output, it would be a little different because we would have to then figure out
how to generate the morphology. So what we did here is we introduced the stemmed MT output
to a conditional random field inflection tag generation model.
So this would take as input the stem together with the POS tag, and for that it would output a prediction of a morpheme tag set. And then we still aren't home-free yet because there's still the
matter of then coming up with the actual corresponding surface form, which is a nontrivial task.
So for this we used a language model to disambiguate between the possible surface forms that
can correlate to a stem and morpheme tag sequence.
And the idea here was that while the CRF model is trained on supervised data of which there's
less, our language model can take advantage of the relatively large amounts of unannotated
training data for the disambiguation purposes. And so, dare to dream, what we get on the other
side is the original surface form back.
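A sketch of that generation pipeline; every name here is hypothetical (the talk only specifies that a CRF predicts the tags and a 5-gram language model picks among the surface forms observed with that stem in the unannotated data), and the greedy left-to-right choice is a simplification:

```python
def generate_surface(stemmed_tokens, crf_tagger, surface_lexicon, lm):
    """crf_tagger: stem(+POS) sequence -> predicted morpheme tag sequence.
    surface_lexicon: (stem, tags) -> surface forms seen with that stem.
    lm: scores token sequences (a 5-gram model in the talk)."""
    tags = crf_tagger(stemmed_tokens)
    output = []
    for stem, tag in zip(stemmed_tokens, tags):
        candidates = surface_lexicon.get((stem, tag), [stem])
        # Keep the form the language model likes best in the context so far.
        best = max(candidates, key=lambda form: lm.score(output + [form]))
        output.append(best)
    return output
```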
So a bit more about conditional random fields. So the way this works is it selects the best tag
sequence based on the probabilities which are just the product of some binary features over the
current label, the previous label, and the observation sequence along with the model weights.
And this is of course with the usual normalization term, which is actually a global normalization term for the CRF.
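Written out, this is the standard linear-chain CRF; a sketch of the formula being described, with F the vector of binary feature functions and lambda the model weights:

$$
P(\mathbf{y} \mid \mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})} \exp\Big( \sum_{t=1}^{T} \boldsymbol{\lambda} \cdot \mathbf{F}(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x}) \;=\; \sum_{\mathbf{y}'} \exp\Big( \sum_{t=1}^{T} \boldsymbol{\lambda} \cdot \mathbf{F}(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)
$$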
And so F here is a vector of these binary features, and then this down here is an example of the
types of features that would be used in this model.
And so you can see how it can compare the current predictions case value to the previous one.
And so that gives you an idea of how this type of model might be able to capture agreement
phenomena.
So the motivation for going this route was that it could capture the long-distance dependencies
between morphemes. It's globally normalized rather than normalized on a per-state basis. And it
allows us to select our prediction sequence from the entire training data -- or all of the tag
sequences available in the training data as possible outputs.
And so we had hoped that this would make it more robust towards handling new surface forms that are the result of productive morphology and that haven't actually been seen before in the training data.
So we used CRFSGD, which is an implementation that's optimized with stochastic gradient
descent. We trained it on overlapping lexical and morphological features such as for the stems,
unigrams and bigrams of the previous and next stem and POS. And then for the prediction features we used case and number and features of the previous labels.
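A sketch of what such a feature set might look like per token; the exact templates used with CRFSGD are not given in the talk, so the names and combinations below are assumptions:

```python
def tagger_features(stems, pos_tags, t):
    """Overlapping lexical and morphological features for position t."""
    prev_stem = stems[t - 1] if t > 0 else "<s>"
    next_stem = stems[t + 1] if t + 1 < len(stems) else "</s>"
    prev_pos = pos_tags[t - 1] if t > 0 else "<s>"
    next_pos = pos_tags[t + 1] if t + 1 < len(pos_tags) else "</s>"
    return {
        "stem": stems[t],
        "pos": pos_tags[t],
        "prev_stem": prev_stem,
        "next_stem": next_stem,
        "prev_pos": prev_pos,
        "next_pos": next_pos,
        # Bigram combinations over neighbouring stems and POS tags.
        "stem_bigram": prev_stem + "_" + stems[t],
        "pos_bigram": pos_tags[t] + "_" + next_pos,
        # Dependencies on the previously predicted label (e.g. case and
        # number agreement) are handled by the CRF's transition features.
    }
```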
So the accuracy rate that we got was 79.18 percent which was definitely on par with similar tasks
that had been carried out on different language pairs for the same monolingual features. So we
decided that this seemed like a reasonable jumping-off point to introduce into our MT system to
see how it interacted with -- because this was, of course, on the clean data, so to see how it
interacted with the ostensibly noisier output of an MT system.
And so this here compares the output of all of our different systems. The first is just from the
original Europarl dataset. And so our baseline does significantly better than that, which we just
attribute to having a better newer decoder.
But so you can see that the best result that we got was for the unsupervised segmentation model
which retained the morphemes in decoding, and the supervised segmentation model did a bit
better than the baseline, but not to a statistically significant extent.
However, our supervised stem model with the CRF for morphology generation did quite poorly.
But still we introduced it as well to the supervised segmentation model just to see whether it
could kind of go through and catch any hanging stems that had been missed by the model. But
that didn't make any difference really just because that model didn't output enough hanging
stems that were missing morphology to make a difference.
So we include this here to evaluate the comparison on segmented data. And so this is before
restitching the morphemes back together. And to make a comparison, we actually took the
word-based baseline output and then segmented it.
And we include this over here in the last column. That's the Bleu scores without considering
unigrams. Because we wanted to make sure that any increase that we were seeing in scores
wasn't just due to getting more unigrams right, which there would be more of anyway.
And the other reason that we wanted to include this was because it seems like the Bleu
evaluation method -- it's funny, we were just talking about this -- is really not particularly apt for MT with morphologically complex languages in the first place, since you can have basically a sentential word with a whole lot of morphology on it, and if you get five of the morphemes right but one wrong, then you get zero credit from Bleu. So we just decided to go ahead and include that there.
So the conclusions that we draw from this are that morphological information does seem to be
helpful in improving translation. CRFs can be an accurate way of predicting the tag sequences, but we need a better way of constructing the surface form.
So to that end we're adding bilingual, syntactic and lexical features of the aligned corpora into
the CRF. We're also looking at moving the actual surface form prediction into the CRF. Also
looking at automatic classification methods to cluster the unsupervised segmentations so that we
can use that with the CRF so as to have a more general model.
And then we're also currently running some factored translation models to have access simultaneously to both the word and morphological knowledge, and using lattice-based
models. And then also discontinuous phrase-based models in order to be able to capture the long
distance dependencies between morphemes in the decoder itself and not just in post-processing.
So that wraps it up. Thank you.
[applause]
>> Will Lewis: Thank you very much, Ann. Very good presentation. So we have ten minutes
for questions. And you're closest to the mic.
>> Yes. Well, I don't know if I need to stand up. But could you go back to Slide 19, I think. So
could you explain again -- I think I just missed how you get from the predicted tags to the
surface form.
>> Ann Clifton: Yeah. So basically we looked -- we gathered all of the surface forms that a
stem can occur with in the training data. And -- which was actually 1.3 million corpus in the
unannotated data. And then we used a five-gram language model from the training data to pick
the most likely surface form for that context.
>> So on the second results that you showed where you leave it segmented and run Bleu, the
reference is segmented with a supervised model?
>> Ann Clifton: No, the reference is segmented for in the case of the supervised model by the
supervised model and in the case of the unsupervised by the unsupervised.
>> By Morfessor.
>> Ann Clifton: Yes.
>> I see. Okay. The one thing I would say about that is that -- well, I don't know how to do this,
but basically because you would have presumably fairly frequent suffix morphemes in there and
Bleu may or may not be penalizing them so much for -- you're getting some credit for getting it
whether or not it's necessarily used with the right root or something like that.
I actually don't know how many tokens you have in your utterance or things like that, but I
wonder if there would be some way of building in string edit distance into that or maybe Bleu
under various forms [inaudible] up to four grams or something like that would capture enough of
that. Do you have a sense of whether that would be worthwhile for something?
>> Ann Clifton: That was the idea of leaving out the unigrams. So that one was only measuring
two, three, and four grams, that score.
>> Yeah, this is a nice talk. First of all, just sort of one side comment, I don't know if you're
aware of the work that [inaudible] has done on Turkish speech recognition, which actually has
rather similar conclusions to what you're getting. But I do have one question. Maybe I just
missed something here.
When you talk about stems, I'm not sure exactly what you're referring to and how much you strip
off, but obviously in the language like Finnish or Turkish, the affixes can encode information
which is to varying degrees more or less contentful.
So in the case of a case affix, you might be -- expect to be able to reconstruct that from the fact
that you know that this is functioning as, say, [inaudible] case, for example, it's functioning as a
direction too or something. You could sort of figure that out.
In the case like if you have some sort of volitional marker on the verb, that's much harder to
reconstruct. So I'm just wondering if you've -- you've sort of -- it's an informational question,
what do you actually strip off, and, B, if you're stripping off a lot of stuff, maybe there's a
difference between the different classes of things you're actually removing and would therefore
have to reconstruct and if you've thought about that issue.
>> Ann Clifton: Yeah. We definitely thought about that. And the guiding principle that I was
keeping in mind is I was trying to keep from getting too language specific here. And so I was
using -- whether it was from the supervised or the unsupervised segmentation model, I was
trying to use that as just giving me a place to chop.
Because if I got too much into looking at the function of any like, say, a syntactic function or its
lexical correlation in another language of any particular morpheme, then I felt like I would be
making a more specialized tool than was really my goal.
>> Well, just one suggestion. I mean -- this is on? Yeah. Just one suggestion. You know, this
may obviously not work, but you could consider using cross-language information in that case, to see which things actually get expressed by a separate word in the English, so you would know -- the
auxiliary verb "would" would presumably correspond to a particular suffix in Finnish, whereas
something like the accusative case marker in Finnish would correspond in general to nothing in
English. So you might be able to use cross-language information to see which things might be
important.
>> Ann Clifton: So I'm not sure if this is what you're talking about, but I mentioned in the last
slide that we're currently adding in syntactic features, syntactic bilingual features, so using it -- yeah, okay.
>> In the case where you're restitching morphemes, since you're treating the morphemes -- the
affixes as individual words, they can be reordered and moved around, just generally go
wandering. So what do you do when you restitch things? Is there something special you're
doing in that box there to make sure that you end up with real words around not just [inaudible]
do together?
>> Ann Clifton: We're just -- the only morphemes that we're restitching are when there's a
context wherein you have two adjacent word internal morpheme boundary markers, and that's
the only time that we restitch something into words. So any other time, if we get it wrong, then
our score gets penalized.
>> So you just leave the morpheme alone in the output, you don't delete it?
>> Ann Clifton: No, we don't delete it.
>> I noticed that in your morphological tags you have a part of speech marker which typically
doesn't actually give rise to a morpheme. And unless you had a whole lot of part of speech class
ambiguity in Finnish, that information is going to be very redundant. It's just -- it's basically a
property of the stem.
And so I'm wondering if that might actually be artificially inflating your morphological tag score
without actually helping you in the output.
>> Ann Clifton: Well, that's possible. The reason that we did it that way and we left the part of
speech tag as a feature of the stem is because we couldn't afford to retain all of the inflectional
categories in the prediction. And so there are certain categories that apply to -- not just to nouns
but also to verbs.
And so by retaining the part of speech, it was able to kind of form a sort of complementary distribution between the tags that it would predict, where it could have been ambiguous otherwise.
So I'm not sure if that's artificially inflating the accuracy scores or whether it's just kind of sort of
a shortcut, more means of efficiency.
>> Will Lewis: Okay. I think we have time for one more question at the back here.
>> Well, it was answered by -- so if there's a really -- Bob, you haven't had a chance yet, so...
>> Yeah. I had a little trouble relating your results slide to this slide, and in particular I'm
interested in which of the results lines in your results tables correspond to this restitching the
segmented output.
>> Ann Clifton: This one?
>> Yeah. Yeah. Can you show me in the results slide?
>> Ann Clifton: Oh, yeah, yeah. That would be the one in bold.
>> Okay. So that's the one that performed the best?
>> Ann Clifton: Um-hmm.
>> Will Lewis: Quick follow-up here since you've been very patiently waiting.
>> Thank you. So what we found sometimes when we use unsupervised segmenters like
Morfessor or others like Paramor [phonetic] is that they tend to undersegment relative to
supervised models, ones where they've actually been built by hand transducers.
Do you have a sense of which one's oversegmenting, which one's undersegmenting, and if you chose a different operating point and asked the unsupervised one to sort of chop a little bit
more, do you think that would make a difference? Do you have an intuition about that?
>> Ann Clifton: My intuition about that is that with Morfessor, anyway, by controlling the
perplexity threshold, you can control the degree to which it segments. And I found actually that
when I had it segmenting more it would overfit to the dev set to a point where it performed far more
poorly. So we ended up going with quite a conservative segmentation.
>> Will Lewis: Great. Well, thanks very much. And thanks, everyone, for a very engaging
question session there.
[applause]