>> Will Lewis: Well welcome everyone. I'd like to introduce our guest speaker today Philipp
Koehn from the University of Edinburgh. He is a professor in the School of Informatics there
and basically chair of machine translation. He got his PhD in machine translation from USC. His
advisor was Kevin Knight. I consider him a leading light in the MT field. He is best known for
the open source decoder Moses which is widely used within the research community as well as
in commercial enterprises as well. It's basically kind of a go-to for a lot of research and
commercial work. Philipp directs a fairly large and productive MT team at the University of
Edinburgh and has graduated a number of students over the past few years, two of whom I
think are coming today. I don't see either one of them, but they said they were coming
[laughter], Michael and Avachek [phonetic]. They'll probably be here shortly. He has led or
leads or plays a principal role or a key role in a number of research projects within the EU, a
number of EU funded projects as well as two DARPA funded projects. Actually he and I were
talking at AMTA; I was asking about some of his recent awards, and those of you who have
actually applied for grants know that if you get one out of six funded you are doing really
well, and this year alone Philipp has managed to get three out of five, which is a phenomenal
record. The only drawback of that is you end up having all of the management responsibilities
and having three grants funded when you were hoping for one. So Philipp also manages the
statMT.org site, which a number of people in the MT field have actually gone to to get data or
information about MT. He also runs, with Chris [inaudible] and a number of other people, the
Workshop on Machine Translation, which is co-located with ACL or EMNLP every year.
Someone actually calls the WMT the Grammys of the MT world. I think you called it that
[laughter]. But anyway, we welcome the Grammys MT star here, Philipp Koehn.
>> Philipp Koehn: Okay. I don't have as many jokes as at the Grammys [laughter]. I want to
give a bit of a talk about work we have done in computer aided translation. So there's a bit of a
shift in our research where we finally look at who's actually using machine translation. I have
way too many slides, so I'm not going to get through the talk anyway; please interrupt me as
much as you can. I tried to give a version of this talk at the ISIM [phonetic] and we got through about
one third. I would consider that a success. So please ask me any questions about this work. As
Will already alluded to this was also the basis for some of the funding we asked for from the
European Union and so we have two research projects on this very topic. That's one of the
things where you would've hoped to have one research project, but now I have two, which
makes us -- no, not only on this particular topic, the other stuff too; this was just the start of it
[laughter]. We committed to building an open source toolkit for computer-aided translation, and
we worked with translation agencies, so this is something I've been pursuing for the last two or
three years and now it's definitely going to get going full steam ahead. So I'll talk a little bit
more about the work that we have already done and a little bit about what the plans are there.
Okay. So just to set that up in sort of a broader context, why are we doing machine translation?
So the majority of the research, at least in the States, has been on assimilation: the idea that
someone wants to have some kind of information that's only available in foreign text. So you do
a Microsoft Bing search for some technical term and you only get a Chinese webpage back, and
you want to know what's written in it. And the translation you get may be garbled and not
necessarily the best translation, but if you get the information out of it, you are happy. So this
is the scenario where the user is thought to be tolerant of less than perfect quality, and this has been
the focus of DARPA funded research, where the scenario is some analyst wants to find out about some
document that's written in Arabic or Farsi or [inaudible], and you can kind of see where that is
going. Another application is communication, where you maybe have a multilingual chat
application where people talk to each other and that has the advantage that if something gets
mistranslated you can always ask follow-up questions, so there's also some room for error in
terms of machine translation quality. This is somewhat connected with speech recognition
research. There's always this idea of a handheld device: if you travel abroad to China or Japan
and you have a handheld device, you can actually ask for directions and things like that. And
the final application is dissemination where you actually have a lot of text in your own language
and you want to actually have that published in other languages and then you are going to
basically push these translations onto unsuspecting civilians out there who don't know how it
was produced and they are not going to be happy if there are errors. This is kind of where the
action is in terms of money being spent on translation. Money is being spent on high-quality
application level translation and this is currently done pretty much by human translators. This
is not done by machine translation. So this is what I'll try to focus on, especially in the
European context, where the goal of machine translation is to deal with this situation: the
European Union has 23 official languages and they are going to have one more; they're going to
have 24 official languages because Croatia is going to join next year, and it's just not going to go
away. People in Denmark are not going to stop speaking Danish, and they will expect all of the
EU level laws that apply to them to be published in Danish, and so on. If you can't beat
human translators at producing publication quality translation, join them. I started this out
with the question of how we can actually help human translators, which kind of leads to the
question, what do they actually do? How do human translators actually do it? What are the
hard problems for them? Why don't they just write the translation; what stops them
and what slows them down? So building an MT tool then has to be geared to what
the biggest problems in human translation are. Okay. So I'll talk a bit about human translation,
about how we can then assist human translation, so we built a tool for human translators and
we're going to talk about this. And a user study, and the last two topics I'm not sure if I'm going
to get to. One is the extreme case of a human translator who doesn't know the source
language at all, so a monolingual translator, and at the end some work on integration with
translation memories, but I don't think I'll get to that. I'll start with the study. We wanted to
find out how human translators work so what we do is we just get a bunch of human
translators and observe them really closely. We did it at the university, and what do we do at a
university? We just hire students. So we did French to English because it's a language pair
where MT is pretty good and we could get access to quite a few people who speak French at the
University of Edinburgh, so there are French natives who studied in Edinburgh and then there
are; French is one of the languages that is somewhat popular to learn in England and Scotland,
so we also have English speakers who learned some high school French, or at least claim they know
French. We offered them money and they said yeah, I know French, and we said yeah, you're
hired. Okay. And that's actually good. Some of them didn't actually know French all that well,
but that's an interesting data point to how well do they do. So we have each of the students
translate news stories from French to English, about 40 sentences. It's a pretty easy task. It's
content that they are familiar with; it's just the news of the day. There's no
specialized terminology, so if you for instance translate Microsoft manuals then you need to
know what mouse click means in French and it's not necessarily a literal translation, so you
need to know all of this terminology, so we didn't have those kinds of problems. And we
logged exactly which key they typed at which point and what the translation looked like at that
point, so we had a really good view of how they produced the translations. So here's an
example of what it looks like, the data we get out of this. It's a bit of a challenge actually how to
visualize that and what information you want to get out of it, so I'll go over it a bit. So this is
the keystroke log. So the input, I think one character got lost in the logging here, so
it's a French sentence and this was the translation that was produced. The manufacturer has
delivered 97 planes during the first half, and this is the keystroke log. So it goes over the time
axis here, 0 seconds to 35 seconds. The height of these bars is how long the sentence was at
that point and the color of the bars indicates what kind of keystroke was done, so the black
ones are just a regular letter character being typed. The purple ones were when the
delete key was hit, and the grayish ones, which you may or may not be able to see, are cursor
movements. So what happened here? For 3 seconds nothing happened or at least nothing
observable happened. Then the person started to type happily along with a hesitation here up
to 10 seconds. Then 3 seconds of thinking, some reflection, deleted some characters, typed
something, moved the cursor around and then started typing again, deleting something again
and then just kept typing for quite a while. Then there was a break here and then at the end
apparently here the translation was pretty much done. The cursor was moved around; some
characters were deleted and some characters added. So this is one way to visualize the data.
You don't know what the translation looked like at any of these points; it's hard to actually
show that in this graph. In one of the research projects we are currently doing, we actually
have a replay mode where you can run the entire interaction and see exactly what happened at
each point, including eye tracking, so you can also see where the person looked at the screen.
But at the point of this study we didn't have that. We just have the key log, but we could
actually do the replay. We know which character was typed and what the translation looked
like at each point, so you could visualize that too.
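As an aside on how such a replay might work: the sketch below reconstructs the translation text at each logged event. The event format used here (time in milliseconds, event kind, cursor position, character) is invented for illustration; the actual logs from the study may be structured differently.

```python
# Sketch of replaying a keystroke log to reconstruct the translation text
# at each point in time. The event tuple (time_ms, kind, pos, char) is
# hypothetical; the real study logs may use a different schema.

def replay(events):
    """Yield (time_ms, text) after applying each logged event in order."""
    text = []
    for time_ms, kind, pos, char in events:
        if kind == "letter":          # a regular character typed at pos
            text.insert(pos, char)
        elif kind == "delete":        # delete key removes the char at pos
            if pos < len(text):
                del text[pos]
        # "cursor" events change no text, only the editing position
        yield time_ms, "".join(text)

log = [
    (3000, "letter", 0, "T"),
    (3200, "letter", 1, "h"),
    (3400, "letter", 2, "e"),
    (5000, "delete", 2, None),
]
states = list(replay(log))
```

From such a sequence of states you can recover both the bar heights (text length over time) and the keystroke colors of the visualization described above.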
>>: [inaudible] you mean first they didn't [inaudible]?
>> Philipp Koehn: Probably. The user could also use the mouse; this is just a web interface,
so you can just do anything, but yeah.
>>: So you don't know the mouse movement here?
>> Philipp Koehn: We don't know the mouse movement. We know, yeah, we don't know the
mouse movement.
>>: But you have it recorded. You just don't have it displayed in this graph.
>> Philipp Koehn: We don't have the mouse movement because this is a web interface. The
mouse movements don't really mean all that much. Well yeah, you could reposition this and
the cursor, so we don't know where the cursor position is at each time, but if a keystroke
happens we know how the translation looked before and afterwards so we can reconstruct
where the keystroke happened. So at each event we log what kind of key was pressed and
what the translation looked like at that time, but not where the cursor was on the input,
although we could maybe derive that; it follows from the keystrokes. And it's a [inaudible], so
you could actually do cut and paste, like control-C control-V, but I don't think people have done
that. Yeah?
>>: Thinking about that, do you think that you would have gotten different results if you had
used professional translators?
>> Philipp Koehn: Probably, yeah, we looked into that much more and one group we work with
in one of these European projects has been doing translation studies, translation process
studies is the official term, so they actually study translators and what they do and if there is a
difference, I think the most strikingly apparent one is how much of the input text you read
before you start translating. If you have a fresh translator, they are all trained to first read the
entire sentence and then start translating, and if you actually look at professional translators,
they don't do that anymore, because they've done it for five years. They've forgotten all of the
good lessons they learned in school. They just see the first three words and start typing. There
is definitely a difference in speed, obviously, and in the kinds of pauses. I'll have a bit more on
all of this, because we have different types of users in this study. Do
you have another question?
>>: If your goal was to look at dissemination and publication quality then how do you transition
[inaudible] why did you even bother doing the study with people who weren't professional
translators?
>> Philipp Koehn: Because we had them available partly and because we basically started
building a tool. I mean they are still bilingual speakers who can translate but it's not to the level
of professional…
>>: That's where things differ because bilingual isn't necessarily a good translator.
>> Philipp Koehn: No, no I'm not claiming that at all. We also have a bit of a broader focus than
just professional translator, so the other idea is -- I'm not sure if I'll get to the extreme case of
monolingual translator who doesn't know the source language and just wants to use a tool like
this to maybe better decipher the foreign document, but we also want to use this in kind of
volunteer translation communities so there are a bunch of communities on the web who
translate stuff for fun to produce content in their own language. There are quite a few groups in
China that translate news stories into Chinese from the BBC or the Guardian, not officially
sanctioned by the Guardian, and there is another website that we have started to try to
collaborate with, Global Voices, where people are like amateur journalists and translate this type
of material. So it's a bit broader than just professional translators. I think the point that is
consistent among them is that they all have the goal to produce correct translations that are at
least amateur level quality, as opposed to just raw MT. So you don't like the translation
[inaudible] either [laughter]. Okay, I'll
have another slide about translations, and then I am interested in your feedback on that. So
the question is who uses this. It's not just professional translators but also the [inaudible].
But where the money is, is obviously professional translators, and how we can help them; they
have higher standards as to what they expect. Okay. A bit more analysis. What can you
get out of this? You can observe that people type slower or faster, and they make pauses, and
that's kind of where the action is. So what are the pauses they make, and how many pauses do
they make? They make a pause at the beginning, when they read the sentence, and then they
might make a pause at the end, when they review the sentence and decide whether they like
the translation or not. And then there are all the different types of pauses in between, and it's
a bit hard to break that down. If it's a short pause, of maybe 2 to 6 seconds, it might just be,
I'm not entirely sure what the next word is; it's just a hesitation. If it's a medium pause, up to
a minute, they are really solving some problem, maybe rereading the source sentence, maybe
reading their translation. There is something a bit bigger that is causing them to pause that
long. And then there are pauses of longer than 60 seconds.
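The pause taxonomy just described, short hesitations of 2 to 6 seconds, medium pauses up to a minute, and long pauses over 60 seconds, is easy to turn into a classifier. A minimal sketch, using only the thresholds mentioned in the talk (the category names are mine):

```python
def classify_pause(seconds):
    """Bucket a gap between keystrokes by the study's thresholds."""
    if seconds < 2:
        return "typing"   # too short to count as a pause at all
    if seconds <= 6:
        return "short"    # hesitation over the next word
    if seconds <= 60:
        return "medium"   # solving a problem, rereading source or target
    return "long"         # big pauses, longer than 60 seconds
```

Applied over all gaps in a session, this gives the per-category time totals discussed later in the talk.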
>>: Do you have any information about what they did during the pauses?
>> Philipp Koehn: In this study we don't. We were very bothered by that, so the work we are
doing now also does eye tracking, so we at least know where they are looking on the screen.
That was kind of one of the things that left us puzzled after this: if there is a pause of like
2 minutes, what are they doing?
>>: [inaudible] [laughter].
>> Philipp Koehn: There's a good chance of that.
>>: Looking up a strange word.
>> Philipp Koehn: Yeah, looking up a strange word, I don't know; looking in a dictionary, I
mean, that's a big question and that's kind of yeah.
>>: [inaudible] translation, get some feedback on this one. So short pauses, like 2 to 6 seconds:
you take the suggestions from the translation memory, they might be fuzzy matches which
means basically that instead of retyping the whole sentence from scratch you can actually
recycle that suggestion and…
>> Philipp Koehn: So we didn't have translation memories here. This is just translation from
scratch without any [inaudible], but anyway.
>>: Sorry. The second thing, terminology check: basically there is a need to make sure that it
is translated consistently with the system glossaries or whatever requirements there are, and also to verify
consistency with previous translations of the same term. That can take just a few seconds to
check. And longer pauses: if you don't know a certain term, it can take minutes basically to make
sure that you are using the right thing. That is from my experience, of course.
>> Philipp Koehn: Yeah, since you're here maybe I should ask you, ask a translator. So what
actually takes the most time? What takes them the most time in translation?
>>: I'd say terminology.
>> Philipp Koehn: Terminology?
>>: So when you don't know exactly what term to use that takes a lot. Sometimes checking
grammar like comma, cases, maybe, you know, it depends on the language but there might be
some [inaudible] that require you to go into some resource to find whether you need a comma
here or not, so that's probably the two biggest pauses.
>> Philipp Koehn: Okay great.
>>: Named entities, checking, getting the name of your customer right.
>> Philipp Koehn: Okay. So that all kind of depends on what kind of tools you build and these
are all things that we could build…
>>: Do you have any information on where in the sentence they pause? Always at the
beginning of a new clause, or…
>> Philipp Koehn: We do have that information. I didn't analyze where the pauses happened
temporally, beyond the distinction I make between initial and final pauses, but we
have all of the data. Actually I've posted all of the data for this on the web, if you want to
dig into it deeper. It feels like massive amounts of data: you have every keystroke
that happened and when it happened, down to the microsecond. What kind of questions do
you have, and how can we -- we always want this one-number answer for everything,
and you have like millions of numbers, so how you distill it all is not entirely obvious.
>>: [inaudible] English speakers don't line up exactly. Sometimes [inaudible] because
[inaudible] of the sentence. [inaudible]
>> Philipp Koehn: Uh-huh. I guess sign language translation is very similar to
simultaneous translation, where you have to translate speech while it happens, and that's a bit
of a different scenario. People do those kinds of translations very differently: they don't change
word order much, because they have to kind of spit out the words as they come in and they
can't wait much. Here they have more time to think. They don't have to do it in real time.
They usually take more time to produce the translation than you would spend on just talking.
>>: [inaudible].
>> Philipp Koehn: If you didn't pay them, yeah. We paid them essentially by the work here; we
just gave them a flat amount of money for translating all these sentences.
>>: As a professional translator, the more you do the more you get paid so you have an
incentive.
>> Philipp Koehn: The incentive here is clearly to be fast also, so that makes sense. We don't
pay them by the hour. Okay. Here's a big table where time was spent. Okay that's a lot of
numbers. I will go gently over those, don't worry [laughter]. You don't have to memorize all of
them. Before I get to that: we have these different groups. We grouped the translators into two
groups, the ones that are native French and the ones that are native English. In a
professional translation scenario the standard thing is that you translate into your native language,
so you need to know the language you are translating into, and the language you are translating
out of is the one that you learned in school or wherever. So we have here the L1s, which are
the ones that are native French, so these are the ones that you wouldn't normally hire, and the
L2s are the native English. The total time is the time per word. Another standard way to
measure this is how many words per hour; those are just the inverse of each other. So on
average people translate between 500 and 1000 words per hour, which is relatively fast
by professional translation standards.
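As a quick aside, seconds per word and words per hour really are just inverses of each other, via the 3600 seconds in an hour:

```python
# Converting between the two productivity measures used in the talk.

def words_per_hour(seconds_per_word):
    """3600 seconds in an hour, divided by the time spent per word."""
    return 3600 / seconds_per_word

def seconds_per_word(wph):
    """The inverse conversion, from words per hour back to seconds per word."""
    return 3600 / wph
```

For instance, the 300 words per hour mentioned later in the discussion corresponds to 12 seconds per word.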
>>: How much?
>> Philipp Koehn: 500 to 1000 words per hour.
>>: That's a lot.
>>: That's a lot. [inaudible] translation memories or…
>>: It's 3 to 4 times as much as normal.
>> Philipp Koehn: Yeah, but it's also news, so it's not difficult. All of the terminology issues
brought up are not necessarily a problem, because there's no fixed terminology, like how to say
Secretary of State or…
>>: There's another criterion. In languages like German, words are not used as a measure of
productivity. They use lines, because words are combined [inaudible] [laughter].
>> Philipp Koehn: [inaudible] characters then.
>>: [inaudible] languages.
>> Philipp Koehn: Yeah. So this is the number of source words they translated, and then we
divide the total time by that. So you see some variance: the fastest one is 2.8 seconds per word, which
is really fast, more than 1000 words per hour, and the slowest one was 7.7 seconds. So
let me try to allocate the time to activities. We had these different kinds of pauses and
keystrokes, and if you are really strict about it, a keystroke doesn't take any observable time:
you press a key and that's it. It doesn't even take a second; it takes no time at all, it's just a point
in time. So the way we define this, [inaudible] keystroke, the second before and the second after
are part of the typing activity. So we break down all these typing activities, and only if
there is no activity for at least 2 seconds, and the time is not part of the ending interval of one
typing stretch or the beginning interval of the next, is the time in between part of a pause. So this is
how we then defined these intervals: there is a typing period, then a pausing
period, then a typing period, then a pausing period. Okay. And then based on
these intervals we can allocate how much time was spent on these activities.
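A rough sketch of this interval definition: each keystroke claims one second before and one second after it as typing, and any remaining gap of at least two seconds counts as a pause. The function below is illustrative only; the study's actual bookkeeping may differ in details.

```python
def typing_and_pauses(keystroke_times, pad=1.0, min_pause=2.0):
    """Split a session into typing and pause intervals (times in seconds).

    Each keystroke claims `pad` seconds before and after it as typing;
    gaps of at least `min_pause` between padded stretches are pauses.
    """
    typing, pauses = [], []
    start = end = None
    for t in sorted(keystroke_times):
        if start is None:                   # first keystroke of the session
            start, end = t - pad, t + pad
        elif t - pad - end >= min_pause:    # gap big enough to be a pause
            typing.append((start, end))
            pauses.append((end, t - pad))
            start, end = t - pad, t + pad
        else:                               # keystroke extends the typing stretch
            end = t + pad
    if start is not None:
        typing.append((start, end))
    return typing, pauses
```

Summing the pause interval lengths per category then yields a time-allocation breakdown like the big table.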
>>: And then you normalize the numbers [inaudible].
>> Philipp Koehn: Yeah. So you just measure all of these times, and then at the end you
normalize to make the numbers comparable. So if you take the total time spent by the first
translator, 3.3 seconds per word, this is how it breaks down, where they spent all
this time. So let's just go over that. Not much time was spent by these translators on the final
pause. Once they were done with the translation they were happy and they moved on. They
didn't spend a lot of time rereading it. They didn't spend much time on these short pauses.
The 2 to 6 second ones. And they also were not really all that different in how much time
they were typing, so there was not a big difference between slow typists and fast typists. The
big differences between the translators were in how much they paused: medium and long
pauses, especially the big pauses. There were some translators who didn't have any big
pauses at all; they never paused for more than 60 seconds. They just always kept typing. And
the second translator, probably the worst translator in terms of speed, spent 2 seconds per
word on big pauses alone. So this is where the good and the bad translators differ: how
much do you have to think about the whole thing?
>>: Do you have a corresponding table [inaudible] of the output?
>> Philipp Koehn: We'll get to quality, yeah, we'll get to that. There was a concern and yeah, so
I'll break this down a bit more in the next 10 or 20 minutes.
>>: The usual [inaudible] for translators is considered 150 to 300 words per hour so the second
translator is actually pretty much within those boundaries, so everybody else seems to be
translating much faster than normal translators [inaudible].
>> Philipp Koehn: Yeah, so 300 words per hour is 12 seconds per word, so these are all
[inaudible]. Again, the question is how good is the quality and how difficult is the task. I think
the task is not that hard. When I do this to a quality level to my satisfaction I get similar
speed, because it's news. There's not so much specialized vocabulary and so on. Okay, I think I'll
skip this one here. You can also have a graph like this where you track how much time was
spent on pauses of a certain length. With this formula here, you just accumulate: you add
longer and longer pauses, see how much total time was spent, and you get something out of it.
So this is a person who spends a lot of time; they're not really much slower if you just consider
short pauses in bulk, but they spend a lot of time in these long, long pauses. It's a pretty
colorful graph, but it's a bit of a challenge how to actually visualize this; the cutoffs, up to 6
seconds for one category and up to 60 seconds for another, are rather arbitrary and somewhat
questionable. Okay. I'll dig a little bit more into all of these numbers, but first let me go to the
main point of all of this work, which was how to build systems that assist human translators,
and then I'll dig more into how they help, how much time they spend, the quality of the
translations, and all that. Maybe a spoiler: the quality didn't differ too much -- no, I'll get back
to that. Quality is the issue and we'll get back to that. And the translators differ in quality; I
can say that. So we tried two different types of
assistance. One is sentence completion, which is kind of an auto-suggest facility: the
translator types in the translation and the tool makes suggestions. The next word should be
this, the next phrase should be this, and the user can just accept it, and it produces
that section. So it's one phrase at a time. What does phrase-based mean? If you know phrase-based
MT, that's what we mean: these are the short n-grams that are used by the phrase-based
model, so it's a reflection of how the phrase-based model produced the translations, and
it just spits out these phrases from the phrase-based model. Translation options is also
something very closely tied to the way the phrase-based MT system works, so it gives
suggestions for the single words and phrases and ranks them. I'll have a visualization of that,
and the third one is kind of the default. If you don't do anything smart about integrating MT,
you just give the human translators the MT translation in the beginning and say fix it up. This
is kind of creeping its way into the industry, where translators are increasingly confronted with,
instead of translating from scratch or from a translation memory, picking up MT output, and
they are not necessarily happy about this. This is actually a tool that is online; you can try it
yourself. So this was developed in Ruby on Rails and Ajax and all kinds of Web 2.0 and
MySQL and PHP -- no, this was written in Ruby, and its back end is a Moses machine
translation engine. Go to the website and try it out. The browser compliance is not as good as I
would've hoped at this point. It works best in Firefox on a Mac, but some of the formatting
things need work; I'll demo it a bit later. There are probably a few things I should fix
up. We are not going to develop this tool much further, because in the new research project we
collaborate with other groups who also have their own tools, and we decided to just start from
scratch and build new tools, so it is as it is for now. So this is how, on a very, very short
sentence, just a headline, this sentence prediction looks. You have the input sentence in green,
the headline of this news story. You have a text box, in orange; it's just a
regular HTML text area, and you can do whatever you want in the text area. And you have a
suggestion of what the next word should be, in red. It's not rocket science: it
comes up with Newman, and you can type and it's going to make the next suggestion. The
user accepts this by pressing tab, or they can just type in their own translation. If they are
typing in a different word, the tool kind of thinks about it again and makes a new suggestion.
So there was a project 10 years ago, TransType, done by Canadians at [inaudible] research
and some other groups in Europe, that came up with this originally, and there were
some people who kept this strand of research alive, although it never really made a huge
breakthrough, and we are now trying to revive it a bit. Okay, so how does this work? So
we first run the input sentence through the machine translation engine and we create the
search graph, so we have an entire search space that was explored by the machine translation
decoder, and we try to match what the user typed in against the search graph. You could also
just rerun the machine translation with this prefix, but that would be too slow; this is
something where you don't want to wait at all. If the user types in some characters you
really can't even wait a large fraction of a second, so it has to be really, really fast coming up
with new suggestions; there can't really be much of a wait period. If it takes a second, that's too
slow, so rerunning the MT engine -- we couldn't have done that with our kind of machine
translation engine back in the day. That's why we operate on the search graph, because working
with the search graph is much faster. So there are two criteria. We want to find the
minimum edit distance match to what the user typed in; we might not have what the user
typed in in our search graph, so we want to find something that has minimal string edit
distance. That's the number one criterion. If there are multiple paths with the same minimum string
edit distance, we take the highest scoring path, so the highest probability path. The search
graph is precomputed and stored in a database, the matching is done on a server, and the
browser makes these requests. Typically it takes less than a second, and usually it's
much faster than that. At some point I'm going to demo this. Let me just go through all
of these things that you're going to see on the screen before that. Okay, that was number one.
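To illustrate those two matching criteria, minimum string edit distance first, with model score as the tie-breaker, here is a naive sketch that matches the user's prefix against a flat list of scored candidate translations. The real system matches against the decoder's search graph directly, which is what makes it fast; this linear scan only shows the selection rule.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance by dynamic programming (one rolling row)."""
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1,        # deletion
                                   d[j - 1] + 1,    # insertion
                                   prev + (ca != cb))  # substitution/match
    return d[len(b)]

def best_completion(user_prefix, paths):
    """Pick the candidate whose prefix best matches what the user typed.

    `paths` is a list of (translation, model_score) pairs; minimum edit
    distance wins, and a higher model score breaks ties.
    """
    def key(item):
        text, score = item
        return (edit_distance(user_prefix, text[:len(user_prefix)]), -score)
    return min(paths, key=key)
```

On the real search graph the same criteria are applied with a graph search rather than by enumerating full paths, since the graph encodes exponentially many translations compactly.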
Number two was the translation options, so you have the same input sentence here, and
displayed are the top translations according to the model. It shows you word translations
and phrase translations mixed together, so the top line is the word translations and below that the
phrase translations. And the user can just click on these, so you don't have to retype them. You
just click on all of this and you build your whole translation [inaudible]. How does that work? It
also leans very heavily on the phrase-based MT system. There's a phrase translation table and
we score these phrase translations with not only translation probabilities, but also an estimate of
the language model cost for each of them. So it's a lot like the future cost estimation in phrase-based
MT, where we try to figure out what the easy parts and the hard parts of the sentence are, so your
search doesn't get lost doing only the easy parts first because they look most promising. So it's
very similar to the outside cost estimation. This is how the tool
looks. So let me try -- this is a really big screen, so this is actually going to work pretty nicely.
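[Editor's note: the future cost estimation just mentioned can be sketched as a small dynamic program, roughly as in phrase-based decoding. This is an illustrative fragment, not the tool's code; the option format is invented, and the language model component of each option's cost is left out for brevity.]

```python
import math

def future_costs(sentence, phrase_options):
    """Cheapest estimated cost to translate every source span [i, j).

    phrase_options: dict (i, j) -> list of (translation, probability).
    In the real estimate each option's cost would also include a language
    model estimate; here we use only -log(translation probability).
    """
    n = len(sentence)
    INF = float("inf")
    cost = [[INF] * (n + 1) for _ in range(n + 1)]
    # Seed each covered span with its cheapest phrase option.
    for (i, j), options in phrase_options.items():
        for _translation, prob in options:
            cost[i][j] = min(cost[i][j], -math.log(prob))
    # A span can also be translated as two adjacent smaller spans.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                cost[i][j] = min(cost[i][j], cost[i][k] + cost[k][j])
    return cost
```

During search, a hypothesis that has covered part of the sentence is scored with the future cost of the uncovered spans added in, so the easy parts of the sentence do not look deceptively attractive.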
So this is how it looks for a French sentence. It was very short but it actually fits perfectly on
the screen. On the left you have the source. In green is the sentence I am currently translating.
In the middle is just information, the raw MT output, and on the right is what you already
translated. Also, I just deleted the translation so you kind of build up this part here. It's
probably not the best way to structure this and we are doing it differently in the new tools we
built, but the main thing is down here. So you have the source sentence. You have the string
edit difference to the post editing and you have here the text box, so here it proposes that you
start with Sarkozy and you can just say Sarkozy at the meeting of fishermen angry. Okay, that's
not a great translation, but you kind of see how these things kind of came up. If you want to
actually do something different so maybe at the meeting this -- I don't know. Any other
suggestions? Anybody know French? At the meeting, maybe with angry and then hopefully it's
purple if it's fishermen. Oh yeah, it does. So you kind of see how it -- you don't like -- you are
not happy with the translation.
>>: [inaudible] at the meeting, just meeting [inaudible].
>> Philipp Koehn: Oh just meeting and then it should say angry fishermen and it doesn't
[laughter]. Maybe let's see a demo of when I click on something. I can click here on angry and
it pops in. It's somewhat smart about upper casing the first word in the sentence and just
adding commas and periods and all that, but it's not super perfect. Angry fishermen and then
it's done with it.
>>: [inaudible] stop correcting?
>> Philipp Koehn: Oh, so you can do whatever you want, but the prediction is set up so that you do
it left to right. This is a text area. If you want you can just type fishermen and then you can go
wherever you want and you can even cut and paste things. I mean you can do what you want.
So you can use the tool; it doesn't hinder you from using it any way you want. It doesn't
force you into any operation, but to get benefits from the prediction -- I don't know what
happens down here. It is of course a bit lost. Do you see the shaded out part? This is what the
machine translation system thinks you have currently translated, so it thinks you already
translated Sarkozy with that first scribbling stuff here.
>>: [inaudible] fisherman [inaudible] produces [inaudible]
>> Philipp Koehn: Yeah, [inaudible] with some idea about where [inaudible] paste this
[inaudible] where did this come from? And its best hypothesis is that it got Sarkozy wrong and
[inaudible] is the right word, so it's a substitution; so there is a string edit distance of one to the
search graph. There's no way around it -- you have to do something with that, so it's going to be at an
edit distance of one to the search graph, because you have a new word and one word in there, so it tries to
come up with the best explanation, which in this case probably is a substitution with Sarkozy.
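[Editor's note: the "best explanation" of the user's edits is essentially the classic Levenshtein alignment. A minimal illustrative version -- not the tool's actual implementation -- that labels each position as a match, substitution, insertion, or deletion:]

```python
def explain_edits(hyp, user):
    """Label, word by word, how the user's text differs from a hypothesis."""
    m, n = len(hyp), len(user)
    # d[i][j] = edit distance between hyp[:i] and user[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if hyp[i - 1] == user[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # match / substitution
                          d[i - 1][j] + 1,        # deletion from hypothesis
                          d[i][j - 1] + 1)        # insertion by the user
    # Walk back from the corner to recover the cheapest explanation.
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        sub = 0 if i > 0 and j > 0 and hyp[i - 1] == user[j - 1] else 1
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + sub:
            ops.append("match" if sub == 0 else "substitution")
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append("deletion")
            i -= 1
        else:
            ops.append("insertion")
            j -= 1
    return list(reversed(ops))
```

So if the hypothesis starts with "sarkozy" and the user types a different name, the cheapest explanation is a single substitution, which is exactly the case discussed above.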
Okay. So that was a lot of fun. Back to the talk. So the final thing is post editing in MT and
there's really not that much to say about it. You just get the MT output already pasted into
your text area and you can fiddle with that any way you want, and in this bluish area it gives you
kind of a visualization of the string edits, so you kind of see which words you deleted and which
words you inserted. So this sentence was corrected with a few things: the MT output had him
as an interpreter; an actor is probably a better word in English; there was also years in there;
and it mistranslated the title of the movie, because I guess in French it's just called
the Kid and not the Sundance Kid. You have to get the proper English title. So we have the
same set up here now, so we have 10 translators. Actually, the study I reported earlier was just
part of this study; it wasn't a separate study. We had them
translate these 40 sentences under different conditions, and one of them was without any
assistance. Same people, and we have five different conditions. The unassisted one is what I
originally talked about, where they didn't have any help at all. They just had the text area and
that's it, nothing else on the screen, just the source obviously. Then we tested
prediction, which is what I mostly demoed; making suggestions, the options where you can just
click on all these words; and both of these things together. And the last condition was post editing, so
they just had the output of the MT and they could fix it any way they wanted to. So they had
blocks of 40 sentences under each condition and each translated them, and we rotated things
around so that each text block was translated under all of the different conditions by different
translators. Obviously no translator translated the same block twice under different conditions
because they would already have known too much about it. So, are we concerned about
quality? We are mostly concerned here about speed. We want to have faster translators, but we
don't necessarily want to have worse translators, so we thought we were just going to do a very
simple quality assessment where we just ask judges afterwards: is that a right
translation or not? Because, you know, these were human translations; they should be 90%
right. It didn't turn out that way. So we just asked them: is that -- what's the
wording here? -- a fully fluent and meaning equivalent translation of the source?
Sounds like a straightforward question. And we showed it also in context, so if there was some
confusion about, you know, what a pronoun referred to and so on, you could figure it out from the
context. Okay. To our surprise we got about 50% correct. I'll show you a slide with one
sentence and we can argue about whether the judges were too harsh, which was my impression, or
whether all of our translators are really crap. Some of the same students and other students did the judging,
also nonprofessionals. Just to give you one example.
>>: So do you have the quality judged by the same guys that produced this?
>> Philipp Koehn: Yes.
>>: What difference [inaudible]?
>> Philipp Koehn: Some of them -- there was a mix of people but there was some overlap
between the two groups, yeah, so they didn't know who produced which translation. Maybe
they recognized their own translation, and maybe not.
>>: The groups could be reviewing their own translations?
>> Philipp Koehn: They could have reviewed -- yeah. So they were actually given -- let's see
how we actually did that. I think we showed them all 10 translations, like here, from the different
translators under different conditions, and they had to say about each one whether it was correct or wrong.
So there's a concern that if they see their own translation they say that's the one that's right and
everything else is rubbish. But in the end…
>>: The blind is judging the blind.
>> Philipp Koehn: Yeah, yeah. [laughter] I'm not going to go down that road. Anyway, so
here's a sentence. So this, maybe it's a somewhat hard French sentence. The MT system
came up with "without dismantle it has been concise and accurate." So "without dismantle" is
kind of the biggest struggle here, how to get that right. And these are the different
[inaudible] you see here who did it; again, the L1s were the ones that were French native and
the L2s are the ones that are English native, and these are the different systems. The first
striking observation, which is always stunning, is that even though these people were very much primed by the
MT system showing them translation options -- they had a very similar kind of
mindset on how they should translate the sentence -- they all came up with very different
translations. If you give 10 people a sentence to translate, they all come up with different
translations, even for short sentences like these. Even in this scenario, where two of them are
just post editing from the same source, and others were shown all these options and they
all were kind of steered towards certain vocabulary to use and all of that, they still all…
>>: What is good translation and all…
>> Philipp Koehn: Oh yeah, so my impression -- just to finish the description of this slide: the
first number, in green, is how many thought this was the correct translation and the red is how
many thought it was a wrong translation. So let's start with the third one: without fail
he has been concise and accurate.
>>: [inaudible]
>> Philipp Koehn: There's the first one that three people thought was not a good translation. I
don't know French, but just from the general context of the article I thought that was actually
not a bad translation. I mean if our MT system produced that, we would
be perfectly happy. Without getting flustered he showed himself to be concise and precise.
Everybody liked that one. Sometimes, yeah, these are human judges, so you have [inaudible]. He
showed himself concise and precise, and so this first weird bit was just lopped off [laughter]
and two people said yeah, that's all redundant; it doesn't mean anything.
>>: The native French speakers and -- how good was the English of the native French speakers?
>> Philipp Koehn: So they were university students at Edinburgh so the English was good
enough to attend university classes so it's not…
>>: Were they taking English lessons?
>> Philipp Koehn: No, they were just regular students at the university. They were not taking
English lessons.
>>: [inaudible] enrolling at the University [inaudible] at the same time.
>> Philipp Koehn: Yes, so here is the output so you can judge -- I mean some of these, I think
the L1s are the ones that were produced by the French native speakers, so there are problems
with the grammar. There's also always the question of how much effort they really put in. In post editing
you can always just say yeah, whatever. I'll make my money; I'll just say yes to everything. But here
are the different translations.
>>: [inaudible] each one predicts there, each translator produces their own translation. There
was no convergence. There was no two people producing the same…
>> Philipp Koehn: No. And that's absolutely, you know, absolutely standard. This is
absolutely typical translation behavior. I don't think you'll find any sentence -- maybe on a
very short sentence sometimes two people came up with the same translation, but then the
other eight came up with different translations.
>>: I think that's one of the areas where you might find a difference between professional
translators and amateurs. If you have people who are used to working for, say, Microsoft -- when
we did technical documentation we strove for standardization.
>> Philipp Koehn: Yes, I think for technical documents there are many
more guidelines as to how you formulate things and what tense you use, and there's
standardized terminology.
>>: Mostly yes, but we’re getting away from the very structured and dry language.
>> Philipp Koehn: But I also hear stories from someone who works at the European Parliament
or the European Commission with human translators, and we hear the story where they sort of
showed a translation to someone and he said this is all wrong and this should be changed,
and they said yeah, but that was your translation from yesterday [laughter].
>>: I still find this priming from the MT output surprising. The n-gram overlap with the
second clause there is pretty strong for a number of these, so clearly where the
MT got it right they tended to keep that. Whereas where the MT got that
first clause wrong, that's where you see the biggest divergences.
>> Philipp Koehn: Yeah. So some of them -- so "he has been concise and accurate":
especially if they did post editing, did they keep that? "He has been", so there is post editing,
"showed himself to be concise and accurate." I mean even there they changed it maybe more
than necessary. One thing I wanted to stress here: I didn't
think that these translations were all that bad, but we [inaudible] get all these numbers with an accuracy of
50%, and that's where they come from.
>>: [inaudible] did the student previously before doing that [inaudible]
>> Philipp Koehn: No, we were just saying try as best as you can. You get a fixed amount of
money to do this, and they were very well paid actually, and yeah, try to produce good
translations. We didn't say, use the tool as much as you can. We didn't say, you
know, when you post edit, don't just delete everything. And we will get into the behavior in a
bit more detail; people behave drastically differently, and some people just basically didn't use the
assistance we offered. They just always typed in the translation and completely ignored all of
the other options that were given to them.
>>: [inaudible]
>> Philipp Koehn: On each of these sentences, yeah, this is probably a pretty average sentence
here. Average time was three or four seconds per word, and this is roughly a 10 word sentence, so it
was done in less than a minute.
>>: [inaudible] variations between people [inaudible] unassisted and…
>> Philipp Koehn: I'll get to that. I'll get to that. Yes, so yes, that was the main point, you
know, how much faster they were with these things. So this is just, yeah, human
produced translations, and even that is judged harshly.
>>: [inaudible] where they do better in quality than the ones that do…
>> Philipp Koehn: Yeah, I'll get to all that, yeah. So that actually was the main point, so I'll get
to that now: quality and speed, with the assistance and without the assistance and all that.
This was just to stress the metric that came up, so don't yell at me like I only got 50% of
sentences right. Well, this is what that 50% means. Okay. This is kind of the one way to
summarize this: average speed over all translators and then broken down by the different
conditions. Unassisted, 4.4 seconds per word; with post editing, 2.7 seconds; given the options, 3.7;
with prediction, 3.2; and with both of these, 3.3. So they are all faster under all of these conditions over…
>>: So that is like really surprising, right? Because everyone hates post editing [laughter] so
[inaudible] you get from probably half [inaudible] [laughter] post editing is a pain in the neck
and slows them down and blah, blah, blah, and so before you put this slide up I would've
predicted that all of your other options would have been better in fact than post editing.
>> Philipp Koehn: Yes, and you wouldn't be the only one to think that, because we asked -- I'll
get to that at the end but I can say it now. We asked them afterwards, what did you think was
more helpful? What did you think made you more productive? And they didn't like post editing
[laughter]. They said, this is really crap, but if you actually look at the raw numbers,
oh…
[laughter] [multiple speakers] [inaudible]
>> Philipp Koehn: 400. So this is now, yeah -- how many words is this and how many
judgments are there? Yeah, I'll get back to you on that, yeah.
>>: There's another thing, you know, these are not professional translators. They are just
regular students. They would first of all not be that good in terms of quality even with
assistance. The [inaudible] would be much better unassisted [inaudible].
>>: I'm actually not even concerned with the quality. I totally agree with you on that. I'm just
looking at the speed.
>> Philipp Koehn: Quality, by the wayside [inaudible].
>>: Before you put the slide up I would have predicted that the speed would be much better
with the three sort of assisted options as compared to post edit.
>> Philipp Koehn: And that would've made us happier too.
>>: But you only did 40 sentences per person, right?
>> Philipp Koehn: Yeah.
>>: And so these people were not only trying to do this task, but learn a novel user interface
with a lot of complication, so are you going to address the learning curve?
>> Philipp Koehn: I'm going to show 10 more slides addressing exactly that question and what kind of
people they are and all of that, yeah. [laughter] So first of all, this is just another way to break
things down.
>>: Do it all at once? [laughter]
>> Philipp Koehn: Do it all at once [laughter] and then I'll have a bit more breakdown of this.
So this is kind of all-in-one. These are the 10 different people; green are the boxes where
they were faster and better with assistance, the red ones are where they were slower and worse,
and the white ones are where they improved in one and not in the other. Okay. There's a bunch of green here,
but let's just look at all of these people and break it down. So two people we would
characterize as slow translators; these were the slowest ones to begin with and they were
also not very good. So this person here, 10% of sentences correct, so they might actually not
know these languages all that well. So they improve drastically; if someone doesn't know
how to translate very well, either because they don't know the source language or the target
language very well, they are greatly helped by this tool. These are people who were not as slow
as the previous ones but still rather slow, and they definitely got faster but not necessarily
better; but they were around 50% to begin with, so that's not much change. So they were faster
instead of better. Two of them were fast translators to begin with and they got even faster and
better. We had four people we labeled as refusing. We had the keystroke log, so we knew what they
were doing, and they never pressed tab. They never accepted any of the predictions. They
never clicked on any of the options. They might've looked at them, we don't know that, but
they just didn't use the assistance. So it's not a total surprise that the systems didn't help them.
These are the people where, if you look at the logs, they were the same speed
with prediction, with the options, and unassisted, because they just didn't use
them. The only thing they couldn't get away from was the post editing, because when
they opened the sentence it was already in the text box. So at least they had to do something
about it. Maybe delete it in one big go, but if you look at the breakdown of the numbers they
didn't do that, at least not always. And all of the arrows point down, so they all got
faster. So even these people who were not totally convinced by all of this newfangled
technology, and also did not like the post editing all that much, got faster with post editing. The
quality got…
>>: One of them, two of them got better.
>> Philipp Koehn: And this is a champion translator here. This one had the highest percent right,
was almost the fastest -- one of the fastest for sure -- and the arrows go in the right
direction.
>>: And so who are these optional [inaudible] random order?
>> Philipp Koehn: They were given [inaudible] so they had worked on one block at a time.
>>: So post editing, [inaudible] translation by [inaudible], followed by post editing, followed by
because it may have gotten acclimatized to the task and could have…
>> Philipp Koehn: So they were not given any specific -- the way it was presented to
them in the tools, they were not forced to do it in the same order as we presented it to them, so
they had like a list of do this task or this task or this task. The order was kind of mixed up for
different people, but they did one condition all the way through, so they translated the entire
news story under one condition and then they went to the next news story, and that may be a
different condition or the same condition. So it wasn't completely randomized but -- okay. So
we have some more analysis on this, answering some of your questions. We could do this of course
because we have the keystroke log. So here's one person who did the sentence prediction,
and we have a new color, red. Red is pressing the tab key, accepting the prediction of the MT
system, and this is a somewhat representative way -- one way -- to use this prediction. So you
kind of do post editing, but you do it in a way that you produce the sentence as you
read it, and only when you then don't like something anymore -- so up to here it's exactly the
one-best translation of the MT system, but it was just kind of read through. At this point the
person was like, okay, something went wrong; I don't like this. Some deletions, adding a
letter, a lot of moving around here, like deleting and adding and so on. Then a pause, and then
the tool makes new suggestions and the person accepts them all, maybe even until
the end of the sentence, and then does a revision on that part as well. So it's kind of doing post
editing, but it's a bit more interactive, since the user controls what pops up in the text
area. And certainly this continuation here might be better than what was in the original MT
output, or definitely better suited. Okay. This is how much time is spent on these different
activities. This is now just the one user, averaged over everything, so what do we highlight here?
This is how much time they originally spent on typing, and this of course goes down when you
don't have to type everything anymore, especially in post editing where you have to type less, or
even in prediction. And how much time was spent on the other activities -- slightly less. So
the time-consuming part of translating is actually typing in the translation, and you
reduce that time a bit, but the biggest difference is that the time spent on pauses went down, quite a
lot. So yeah, it reduced a lot of these pauses, especially the big pauses; so this was one who
spent like 2 seconds on very long pauses and that just didn't happen afterwards anymore.
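[Editor's note: this kind of activity breakdown can be derived from the keystroke log by bucketing the gaps between logged events, roughly as in the sketch below. The event kinds and thresholds are invented for illustration; the study's actual pause bands included, for example, 6 to 60 seconds.]

```python
def time_breakdown(events, pause=2.0, big_pause=6.0):
    """Split total time into typing, assistance use, and pauses.

    events: time-sorted list of (timestamp in seconds, kind), where kind is
    "key" for typing and "tab"/"click" for using the assistance.
    The gap after each event is credited to that event's activity, unless
    it is long enough to count as a (big) pause.
    """
    buckets = {"typing": 0.0, "assist": 0.0, "pause": 0.0, "big pause": 0.0}
    for (t0, kind), (t1, _) in zip(events, events[1:]):
        gap = t1 - t0
        if gap >= big_pause:
            buckets["big pause"] += gap
        elif gap >= pause:
            buckets["pause"] += gap
        elif kind in ("tab", "click"):
            buckets["assist"] += gap
        else:
            buckets["typing"] += gap
    return buckets
```

Summing the buckets per condition gives exactly the kind of plot discussed here: how much of the translator's time went to typing, to accepting suggestions, and to short versus long pauses.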
Okay. Here is another one -- this person really didn't use the assistance all that much, so they
spent less time on tabbing and clicking, and I'm not sure if it shows that, but you can actually see
where the characters in the final translation came from. Did they come from any of these
prediction activities or the actual MT, or were they typed in? You can look at that too. And in
this case, when both of these options were given, this person actually never clicked and never pressed
tab, not a single time that we noticed. So only in post editing was the person forced to use the tool,
and they spent dramatically less time typing, not totally surprising I guess. They spent a bit more time on
these pauses -- you have to read chunks at a time, so this is pauses between 6 and 60 seconds;
you have to read maybe for 10 seconds, 15 seconds, something that shows up here. So
overall slightly faster for this person. Yeah, this is what I just alluded to. This is actually a
reflection of where the characters came from. If you just follow all of these
typing events and tabbing events and clicking events, you can actually track each character in
the final translation and figure out where this character came from. Was the
character typed in? Was the character generated by clicking on one of the options? Did it
come from the original MT, and so on? So in post editing this person changed 18% of the
characters, and that includes deleting something and typing exactly the same thing again. And
you can see where the other characters come from. The ones typed in were the highest here
and also [inaudible] low for the others. Okay. This is the second one. We know he didn't use
these options; 100% of the characters were typed in. Also, when these types of assistance were
given, he used them only partially. Yeah, this is the pause graph.
We can just ignore that. Here, this is the question you had: the learning curve. So
we didn't have a training phase, but they used it continuously for over 40 sentences which is
not a huge training period, so they may have spent maybe an hour or so. I should figure out the
math exactly, but they probably spent an hour or so on each of these conditions.
>>: That is fascinating, because you would think that the learning curve would be steeper for the
prediction and the options, because there is actually something there to learn -- like, hey,
I've got these options, these predictions. But the learning curve is actually steeper for post
editing, where you would think that there is the least to learn.
>>: Unless the [inaudible] overtime they have to do… [multiple speakers] [inaudible]
>>: No. You learn what kind of mistakes the system makes and then it's very easy to
[inaudible] for example, if the insert article [inaudible] very easy for you to [inaudible] [multiple
speakers]
>>: But in terms of the tool there is actually something there to learn in the tool. I mean you're
kind of learning what tab does and learning some suggestion and some stuff and some clicks.
>> Philipp Koehn: So that's clearly at the beginning. In the first five sentences there's a dramatic
drop.
>>: [inaudible] [multiple speakers]
>> Philipp Koehn: So what kind of drops out of here is that if you translate a story, maybe the
first sentence just takes longer because you're just not in the mindset of the story and it takes
longer to get into it, and then three sentences later vocabulary repeats itself. But these were
multiple stories and they had different lengths, so there's no middle bump where after 20
sentences the next story starts, because the stories all have different lengths.
>>: What about [inaudible] at the end there, or in the middle? Were people just getting bored
and spending lots of time [inaudible]
>>: [inaudible] new story.
>> Philipp Koehn: I don't know. I cut it off here because the shortest kind of aggregated story
length was 31, but it kind of keeps going -- so they translated 40 sentences under each condition, but
these might've been multiple stories and they were different stories, so it wasn't always exactly
40. It was sometimes 45 and sometimes 32, so apparently 32 was the absolute minimum block
of stories. So these curves go a little bit longer, but then I don't have 10 numbers anymore to be
[inaudible].
>>: Do they stay sort of flat or do they continue?
>> Philipp Koehn: That's a good question and we should do more work. We should probably
do a proper study of how this actually works, with a proper training phase and all of these
things. So this is just a glimpse. I think the one important thing I learned from this is
that the unassisted condition didn't really change all that much. Unassisted, yes, maybe at the
beginning of a story you are slower, so you drop from 6 seconds to 4 seconds, but then you kind
of stay there. While with the assistance, at the beginning you were not that much different, but
then it kind of keeps going down. So there is a training effect, and if you trained people more, maybe the
numbers would look more optimistic than what I presented. And yes, usually users would be
better trained and all. Okay. I already talked about this user feedback. We asked them a very
quick question here: in which of the five conditions did you think you were the most accurate?
And they liked all of our stuff; they didn't say unassisted anywhere there [laughter],
or your tool was just distracting me from the right answers. And: rank them on a
scale of 1 to 5 by what you thought was most helpful -- and here we didn't fare so well. And
that doesn't match the empirical results. So this is pretty clear -- the first one you could
still argue about, whether that means accurate, but here it's what did they think was
the most helpful. And at least when we define the goal of this study as
producing good translations fast, the answer would've been post editing, but if we ask them
what they think…
>>: [inaudible] communities and all of that the goal is to produce the maximum number of
translations.
>> Philipp Koehn: Isn't that the same thing?
>>: No. Because if people, if your tool is annoying enough that people drop out then…
>> Philipp Koehn: That's a good point. So that's a really, really good point. Just to
give you an actual fact in that space: if you do simultaneous translation from speech,
you do that for 15 minutes and then you have a break for 15 minutes or so, because it's just
too stressful, so you can't keep doing it. So maybe this post editing is just so stressful
that if you do it for half an hour you really, you know, are going to have to take a half hour
break, and this isn't something we measured, because they could stop at any time and we just
always measure time on a single sentence. If they just turned off the browser and went back an
hour later, we didn't know that. So that probably requires a bit more study on the cognitive
load and maybe what's annoying and so on. So…
>>: [inaudible] also what you presented was something that you are forced to edit right away,
which is far, I don't know, less user-friendly than if you actually get options so you can customize
the [inaudible].
>> Philipp Koehn: And that totally reflects my experience with it. It's just much more fun to
build the translation. Even if you just do this tabbing, you're actually doing post editing, but you have
full control over what pops up in your text area and you're not just confronted with this 30
word chunk that you have to go over. It's a very, very different mindset for doing this task. In one
you feel like you are creative; you build a sentence, you can weigh nuances. In the other
one you are just basically fixing somebody else's mess. And it's not even another person; it's a
machine's mess.
>>: You made a good point about [inaudible], so maybe which option works best varies
drastically by sentence length. Because for really long sentences post editing might be much harder
than for shorter sentences, because it's worse or it starts…
>> Philipp Koehn: You could figure that out. The prediction stuff works less and less well the
longer the sentence is, due to the technical problem of having to match the user prefix to the
search graph. If you have too many edits, then the search for finding the minimal edit path
actually gets too slow to be used.
>>: [inaudible].
>> Philipp Koehn: Yeah. So string edit distance isn't really the right metric to do this,
because it doesn't account properly for moves, but if you do anything else you have a
computational problem of matching -- if you use something like TER, what is the minimum TER
edit cost of a 20 word prefix to a big search graph?
>>: It's sort of orthogonal because you can incrementally compute these things and just -- I
mean if you look at [inaudible] words that [inaudible] and store [inaudible] programming
[inaudible] this stuff.
>> Philipp Koehn: Yeah. I haven't talked about the algorithms yet. It doesn't do
incremental dynamic programming, obviously, but it actually still -- for the phrase-based case it works. One thing
that I'm struggling with is doing the same thing for a tree-based model [inaudible] forest, where
you have to match the user input and the string edits against the forest. It's not a trivial
problem and it's still not as fast generally as I would want it to be. It takes -- this is
implemented in C so it has to be fast.
>>: [inaudible].
>> Philipp Koehn: Oh yeah, basically that's kind of what you have to do. But if you have any
views on that… I can show you where I am.
>>: [inaudible] the graph wouldn't be any faster really then [inaudible].
>> Philipp Koehn: Oh yeah, at some point really you should just re-decode the sentence
with forest decoding.
>>: [inaudible].
>> Philipp Koehn: Yeah, then you would actually, the graph matching would be easier. So for a
long sentence we should do that but that's kind of just from the machinery of setting those up
it gets trickier.
>>: Sorry if I misunderstood, but can you populate the prediction simply with the next word? Instead of showing the post-edited -- you don't compute the minimum path, you just show the next word that the machine would translate at that point, just take the post-edited output.
>>: Well, that is the suggestion.
>> Philipp Koehn: That's what the suggestion does.
>>: [inaudible] suggestion [inaudible]
>> Philipp Koehn: You look at the…
>>: But you are talking about finding the path which is, how do…
>>: How to know what the next word is, that's the answer.
>> Philipp Koehn: Let me do this here, meeting. So it has to match, Sarkozy meeting.
>>: I'm just thinking of getting the machine translation once and then just kind of, you know,
showing the next word so that you are effectively post editing but you are…
>>: So I think the issue is that if you look at the MT output here, it says Sarkozy at the meeting
of fishermen angry.
>>: Right.
>>: And so if I said Sarkozy at the meeting of angry then it would say, the next word would be
angry.
>>: Angry. Right.
>>: So if it were a monotone problem and not a reordering problem, then…
>>: You are right, sorry.
>> Philipp Koehn: Yeah, well anyway, yeah.
>>: [inaudible] make your other slide viewer on [inaudible] feedback. Do you distinguish whether [inaudible] options would be different if the students were native or non-native?
>> Philipp Koehn: It could be. It's all reported in the journal paper; we might have broken it down and maybe not. But then it's sort of a small sample if you get down to that level, because here it's 10 people, so you average over only five, so it's really only individuals, and any statistical significance flies out the window. Okay. We have 12 minutes left. I promised I'd get through the first one-third of the talk and I'm happy with that. So let me just summarize what we just discussed. People got faster under all of the conditions, and we especially reduced the big pauses; we reduced typing effort a little bit, and with post-editing for sure. The translations got slightly better, although I don't want to make any big claims about quality, because I'm not happy with the human judgments on this one. I blame the judges. Even the good translators got better with post-editing, and I think that's somewhat good news. Some of the good translators -- we had four refusing to just use a tool. That's the general problem: if you give people a tool and they are used to one way of working, why would they actually change the way they work? And they were the fastest, and to some degree also the best, with post-editing, but they didn't like it. So now there are two ways to go from here: one is to make our systems better so they catch up with post-editing, or to make post-editing more fun. We are trying both of these things. So there are various ways we can improve post-editing.
>>: Question: what is the goal? Because if the goal is to get a better way to assess the quality -- because in the end, in this experiment, what you got was not necessarily usable [inaudible] 50%. If that's acceptable, I [inaudible] how good the software [inaudible]
>> Philipp Koehn: I think that number is too low for performance success.
>>: I think that was being too harsh.
>> Philipp Koehn: Yes, way too harsh. That's my take; it was way too harsh. But it's always
easy to find a mistake in the translation I think.
>>: But you could change the question too.
>>: Or you could use a wonderful scale.
>>: [inaudible]
>> Philipp Koehn: But it still, yeah you always [inaudible]
>>: You can do a BLEU score against the other nine translations [laughter]
>> Philipp Koehn: So yeah, so I actually believe that they are going to say 90% is fine and here
just the person was too lazy or didn't pay attention or just didn't know the language well.
Otherwise why would they actually get anything wrong? They are human translators. They are
perfect. Isn't that our goal really, to be worse than humans? [laughter]
>>: Oh, we think you are really bad.
>> Philipp Koehn: Yeah, yeah.
>>: Like there was a misstep there where translator three was consistently horrible on every
single sentence.
>>: Yeah, out of four references he was consistent.
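The joking suggestion above, scoring each translator against the other translations, can be sketched as a toy sentence-level BLEU with multiple references. This is a simplification (real evaluations use smoothing and corpus-level statistics, e.g. via sacreBLEU); the function here is only illustrative:

```python
import math
from collections import Counter

def ngrams(words, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def multi_ref_bleu(hyp, refs, max_n=4):
    """Toy sentence-level BLEU against multiple references: clipped
    n-gram precision, geometric mean, closest-length brevity penalty."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        # clip each hypothesis n-gram by its maximum count in any reference
        max_ref = Counter()
        for ref in refs:
            for g, c in ngrams(ref, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        total = max(1, sum(hyp_counts.values()))
        log_prec += math.log(max(clipped, 1e-9) / total) / max_n
    # brevity penalty against the reference closest in length
    closest = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    bp = 1.0 if len(hyp) >= closest else math.exp(1 - closest / len(hyp))
    return bp * math.exp(log_prec)
```

As the conversation notes, this at best flags outliers like a consistently bad translator; it is not a substitute for human judgment.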
>>: Did you measure the [inaudible] post editing for different MT qualities? Like if you are
[inaudible] new translator?
>> Philipp Koehn: Just one system. Yeah, that's a variable and it does matter a lot. I know from just one of the translation agencies that we work with: they do post-editing of MT for French and for Spanish, but they don't do it for German, because the quality of their MT engine for German is not as good. So the quality definitely matters; there is one point on the quality curve of MT where it becomes good enough for post-editing, and we have reached that point for several language pairs, especially with restricted domains, so that is kind of good news for MT. There is real use: people in the industry post-edit MT and they get better results and they are faster and so on. But we haven't reached that point for all language pairs, and we don't necessarily have it for all of the domains that people want to translate.
>>: Do you think you would push for more MT quality and use regular post-editing, or is it better interfaces [inaudible]
>> Philipp Koehn: So we do both. Actually, like I said, we applied for too much funding, and in this case we got too much because [inaudible] [laughter]. One project is kind of run mostly by this translation agency whose view is: all of this funny stuff you put on the screen is just going to annoy translators. They're not going to -- they don't like this. They want at most post-editing of MT, that's all. They are used to translation memories. We give them one additional option, which is MT, and then maybe hide that it's MT [laughter], and that's all we are going to give them. We are not going to throw funny colored stuff on the screen. So the main focuses there are things like: how can you improve MT, like incremental training, those kinds of ideas? And maybe show some things with confidence measures, to not show the bad MT at all, or word-level confidence, highlighting words that are more likely wrong.
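The word-level confidence idea could be prototyped as something like the following, assuming the decoder exposes some per-word posterior probability (the function name, the 0.5 threshold, and the asterisk markup are all invented for illustration):

```python
def mark_uncertain(words, posteriors, threshold=0.5):
    """Wrap output words whose assumed posterior probability falls below
    a threshold, e.g. for highlighting likely errors in a post-editing UI."""
    return " ".join(f"*{w}*" if p < threshold else w
                    for w, p in zip(words, posteriors))

print(mark_uncertain(["Sarkozy", "meets", "angry", "fishermen"],
                     [0.92, 0.81, 0.34, 0.88]))
# prints: Sarkozy meets *angry* fishermen
```

In a real system the posteriors would come from the decoder's n-best list or word lattice rather than being supplied by hand.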
>>: [inaudible] translators is going to be completely different feedback, so that's why the agencies are [inaudible]. It's different from unassisted translation. Translators tend to read the sentence and they already have an idea about how to translate it. It takes only a couple of seconds to come up with something, compared to [inaudible]; they are actually not used to that exercise, doing that exercise over and over again. So if you compare that with MT, in most cases it's going to slow them down, because they have to revise somebody else's translation. But for human translation the translator is a [inaudible].
>> Philipp Koehn: So there certainly is some high bar for MT to pass, otherwise it's completely useless. If 50% of the sentences are just rubbish, then you slow down the translator so much looking at all of that rubbish; it has to be good enough that the vast majority of the sentences you show are helpful. Otherwise it's just a waste of time. I mean, there are studies around, people trying obviously different [inaudible] pairs and tasks, but the few ones that I know [inaudible] where people get 50% or 80% faster depending on language pair. These are MT engines that are geared towards their data: they have been trained on their data and used on just their data. So the numbers you actually get reported back from translators are maybe 20% to 50% faster. It's not like three or four times faster.
>>: [inaudible] human element apart from all the technology [inaudible]; there is potentially [inaudible] technology coming into what you have been doing manually forever, and some people will resist it. You have refuseniks [inaudible], so that also comes into it [inaudible]: apart from how good the technology is, how do you onboard the people who use it, how do you make them amenable to using the technology?
>>: About 10 years ago the same process happened to translators here when we started to use translation tools [inaudible] memory [inaudible] recycling tools, so many people had trouble actually getting on board with that. Now everybody more or less is on board with translating [inaudible] using those tools, and some people actually even provide suggestions.
>>: Many of them.
>>: Yes so that, addressing that even in packaging [laughter] [inaudible] intentions.
>> Philipp Koehn: Yeah. Translators are a very, very diverse group of people, so I don't know how much use this is. I know from the AMTA conference -- one of the keynote speeches was given there because previously they were co-located with the translators' organization. The translators learned the machine translation [inaudible]
>>: [inaudible] I think they were more open [inaudible]
>> Philipp Koehn: Yeah, so the ones who actually go to AMTA are probably open-minded [laughter]
[multiple speakers]
>> Philipp Koehn: [inaudible] conference I've been to only once, and there was a majority of translators and it was more hostile. [laughter] The most hostile audience I ever had.
>>: What would the next part of the talk be? I know you won't have time to get through it.
>> Philipp Koehn: Yeah, we are not going to get through it. I'm just going to quickly -- maybe that's worth going into.
>>: [inaudible] monolingual translation.
>>: Yeah, I was thinking [laughter]
>> Philipp Koehn: So let's do the monolingual translation. The idea is basically straightforward. If the MT system produces "the girl entered into room," you can figure out what it means and you can fix it, so you don't really even need to know the source language. It's just what makes sense in the target. And here is…
>>: Sometimes you, sorry, sometimes [laughter] especially statistical machine translation can produce something fluent but totally not related to the subject.
>> Philipp Koehn: So accuracy is -- yeah, so the quality bar is different now. As a quality metric you don't expect to match human translators or [inaudible] professional bilingual translators or someone like this. Okay. Here's the task. That's a pretty picture. This is how we set it up.
This is the one that he can't read, and I'll show you the one he can read. So this is the Arabic sentence. We did it again with people in Edinburgh, students that don't know any Arabic. Squiggly symbols. It probably helped a bit to see whether these were long words or short words, but that's about it. 2008 [inaudible] figure out what that was supposed to mean. So here is the part: can you figure out the translation from that? So step one is to just do post-editing of MT. Step two is you do this, and this was, yeah, not as much better as we would have hoped. There was like one story where it clearly helped people, but otherwise a bit of a mixed bag.
>>: In this case it seems like you have a lot of context in your head where you know what's going on.
>> Philipp Koehn: So this is just real-world knowledge. One big thing we found out is that it matters a lot. Oh, like that wasn't really obvious [laughter]. The more you know about something, the better you do. So there was one sports story about an American soccer player playing in the Copa América. If you know anything about soccer and you know this tournament and how it works and how they play, you did much, much better than someone who just, you know -- playing against Colombia, what does that mean? Is it a good thing or [inaudible] expected one?
>>: What is the scenario for a monolingual translator? For me that sounds like an oxymoron.
>> Philipp Koehn: It is a bit of an oxymoron. But I think there's actually a real one. I mean, there is this whole DARPA scenario of the analyst who wants to find out about some foreign text, so you can give them a machine translation or you can give them this, and maybe they figure out more about the text with this, so they actually…
>>: [inaudible] publication scenario, right? If you want to translate something into a new language and you have a bunch of speakers of that language who are not bilingual, then if they could fix it up to readable quality you can use that to reproduce…
>>: But then there's also the crisis scenario too. I mean the example in Haiti where you have to triage. You're not going to triage in Creole, because no one who's providing aid speaks the language, so you need to do triage in another language, so [inaudible] the language [inaudible]
>>: Can you really get it if you don't speak the native…
>> Philipp Koehn: Good question, so these are the numbers we got out of it. Yeah, I think
these are the highlights. So this is broken down for the different stories. So this is the bilingual
translator and there was huge variety also in the different translators. I should show that too.
So do I have that? This is how different people fared and some didn't do very well and some
did really well. This first one did really well. This was on Arabic stories and Chinese stories.
This is the bilingual translator. I think the reference here is this one really bad bilingual translator. So this was a test set: we had three bilingual translators and a reference, so we could pretend these were independently [inaudible] independently translated and you could score them. There was one bad bilingual translator, and three of the monolingual translators were as good as him. So there's a huge variety. Okay. So now, averaging over all these people, you might notice [inaudible] this is now averaged over stories and different translators. For this one, showing the options instead of just doing post-editing of the MT often helped, but for the other ones it didn't matter that much. There's one story there.
>>: The average across…
>> Philipp Koehn: Over people. These numbers here: these 10 monolingual translators were all the same kind of people, and they don't know the source language. They all translated the same stories, and these are averaged over three bilingual translators, which is the missed sentences, and the same metric, saying is that correct or not based on the reference [inaudible].
So numbers…
>>: So the real question is, is there no way to bias the machine translation system in a way to assist a monolingual post-editor? Because obviously right now they're doing much worse than [inaudible]
>> Philipp Koehn: Yeah, so there are some obvious things that kind of jump out. Names: if they are not translated right, there is just no way, I mean there's hardly any way for a human to catch that, to get it right.
>>: But even still you are having the [inaudible] significance in it [inaudible]
>> Philipp Koehn: That's a good argument. [inaudible] point [inaudible] [laughter].
>>: [inaudible] other direction. I'm just talking about [inaudible].
>> Philipp Koehn: Yeah, yeah, so it's not -- and it works. I can show this in the tool too if you want to play around with it [inaudible] no, I can't go back, or how does this work? Anyway, it's fun to do this. We did something on this very early on in [inaudible]. I think it was about 15 years ago. Just give them the output of MT or phrase tables: can you figure out what the stories are, just knowing what makes sense and what doesn't make sense? If you get five content words you can kind of puzzle it together. I mean you can -- you might be completely wrong [laughter].
>>: [inaudible].
>> Philipp Koehn: You might be completely wrong. You might miss the mark.
>>: No, no I mean [inaudible]. You would focus here on the pieces among different things so,
for example, pay very close attention to [inaudible] getting right. [inaudible] on the main
[inaudible]
[multiple speakers] [inaudible]
>>: You don't care about the what?
>>: You don't care about the [inaudible] so much because the monolingual readers can do the…
>>: As long as they can [inaudible] close enough themselves to be complete word salads
because otherwise [inaudible]
>> Philipp Koehn: Because here there is no language model. So if you just put this together you can figure out what [inaudible]. The human model is much better than the n-gram language model [inaudible]. [laughter]
>>: [inaudible] don't need a language model. You don't need a language model to produce this
if you have [inaudible]
>>: Just pay attention to the contextual [inaudible], because if you pick the wrong sense there's no way -- well, you need a lot of world knowledge to know that it's not the right [inaudible], so engage in word sense [inaudible]
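The "human language model" point can be made concrete with a toy version of the n-gram model being compared against. A minimal add-one-smoothed bigram model (the function name and the tiny corpus are invented for illustration):

```python
import math
from collections import Counter

def train_bigram_lm(corpus):
    """Train a tiny add-one-smoothed bigram language model and return a
    sentence log-probability function -- the kind of n-gram model that
    a human reader easily outperforms when puzzling a sentence together."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        vocab.update(toks)
        unigrams.update(toks[:-1])          # context counts
        bigrams.update(zip(toks, toks[1:]))
    V = len(vocab)

    def logprob(sent):
        toks = ["<s>"] + sent + ["</s>"]
        return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + V))
                   for a, b in zip(toks, toks[1:]))
    return logprob

lm = train_bigram_lm([["the", "girl", "entered", "the", "room"],
                      ["the", "boy", "entered", "the", "house"]])
# a fluent ordering scores higher than a scrambled one
assert lm(["the", "girl", "entered", "the", "room"]) > \
       lm(["room", "the", "entered", "girl", "the"])
```

Such a model only sees local word co-occurrence; it has no notion of the world knowledge or word sense judgments the discussion here turns on.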
>>: Can you model the human's background knowledge, their common sense knowledge? For this Arabic news article, is there technology to find a comparable U.S. English news article, have the translator read that, and now use the options?
>>: That's kind of like triage in a way.
>>: Because now you kind of [inaudible]
>>: That's kind of like priming with relative things.
>>: Yeah, because now you know that, you know, the kind of who does what to who and then
you instead of…
>> Philipp Koehn: Yeah, so the typical scenario is that the user knows the content. I mean he
cares about the content. Why else would they do this?
>>: [inaudible] scenario, but in other scenarios, in something -- if by finding the comparable
sources and having like the monolingual translator read those you can kind of control for how
much one translator knows versus the other.
>>: But in this case you're not really a translator. Where you are…
>>: No of course not. You're not even interested in post editing. You want to understand is
this something like at the DARPA conference, listening to signal [inaudible] most of it is
knowledge. Do I want to invest in a professional translator for this particular communication
that could be interesting or where I see something? You're not really embellishing the output.
You are really…
>> Philipp Koehn: I think it varies how much detail you are interested in. One question is could
I give this to a professional translator and if the answer is no, then the quality level doesn't have
to be all that high, but if you want to find out when the bomb is going to explode and all you
have is this hand written scribbled thing and the clock is ticking and [laughter] you want to
know more detail.
>>: You don't really care about the final translation. All you want is to understand the content, and this is beautiful because you can go through the content and get an idea of what it tells you, but you're not going to spend your time [inaudible].
>> Philipp Koehn: Yeah, and you get some sense of, I don't know if this is a good example, the uncertainty of certain things. There was something about a Muslim Brotherhood story in Egypt, about the group, where the word abortion came up, so it clearly didn't mean the general word understanding of what abortion is. It was kind of an abolition of the group, and you could figure it out from the context: the government and the abortion of the group, and yeah, that means they want to ban the group.
>>: [inaudible] says Hamas [inaudible] dictionary [inaudible].
>> Philipp Koehn: [inaudible] ranks these things. Yeah, it has a probabilistic way of ranking them; it's a probabilistic dictionary on a [inaudible] level, but it doesn't use the sentence context, so the way we do this it doesn't use the sentence context. You could do it in a way that uses sentence context, yeah.
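Using sentence context to rank the dictionary options, as suggested here, could look something like the following sketch: combine each option's lexical probability with a language-model score given the preceding word. Everything in it (the function, the scoring interface, the toy probabilities) is invented for illustration, loosely echoing the abortion/abolition example above:

```python
import math

def rerank_options(options, left_context, bigram_logprob):
    """Re-rank dictionary translation options using sentence context:
    lexical log-probability plus a bigram score against the word the
    monolingual translator has already typed."""
    prev = left_context[-1] if left_context else "<s>"
    scored = sorted(((math.log(p) + bigram_logprob(prev, w), w)
                     for w, p in options), reverse=True)
    return [w for _, w in scored]

# toy scores: the dictionary alone prefers "abortion", but the context
# "... ban the" pushes "abolition" to the top
options = [("abortion", 0.6), ("abolition", 0.4)]
def toy_lm(prev, word):
    return 0.0 if (prev, word) == ("the", "abolition") else -3.0

print(rerank_options(options, ["ban", "the"], toy_lm))
# prints: ['abolition', 'abortion']
```

A context-free probabilistic dictionary is the special case where the language-model term is dropped.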
>>: We should break. Fill out your…
>> Philipp Koehn: Yeah, we already lost part of the audience. [laughter] [applause]