
>> Michael Gamon: One of the reasons I was asked to actually introduce both of
these people -- both of these people. Both of the speakers today, sorry. I
can get away with saying that, I guess, because I actually have worked very
closely with both of them over the past few years. And there's also an
interesting bit of history between Fei and Hany. Both Fei and Hany are
ex-IBMers. They both worked on machine translation for IBM, and so they were in
the same group, although for much of the time, I guess you were in Cairo, is
that correct, and Fei was in New York. Although there was a time when they
were together in New York and so when Hany was coming to our team, he says, do
you know Fei Xia? I said Fei, I know Fei very well. And then it was like old
home week when you guys got to see each other again after a number of years.
So there's kind of an interesting twist here. This was not planned. We didn't
plan to actually have two ex-IBMers giving talks here. But it gets even more
interesting, because as you know, Hany works here at the machine translation
team here at MSR, and Fei's co-author, as it turns out, is also a Microsoft
employee. So we have a paper that is both being presented by two ex-IBMers and
two current Microsoft employees, although Fei is not a Microsoft employee. I
guess we'll have to work on changing that or something. But Hany started, I
guess it was four or five years ago now, and as I mentioned, he was at IBM
before. He worked on machine translation and information extraction. It was a
kind of combined group. I also happen to know Hany's advisor from Dublin City
University quite well, Andy Way. He and I are good friends too, so it's kind of
this other little bit in there as well. So unfortunately, I don't know what
year your Ph.D. was. He got a Ph.D. from DCU, so we'll just say that. And he
was -- it was a real pleasure to interview Hany a number of years ago. He
worked remotely for a year and was great actually having him come to the team.
He's exceptionally productive on the machine translation team and does a lot of
really cool stuff and it's been a pleasure, actually, working with him on some
of this cool stuff, like the text corrector and some of the other things and
this is something I didn't have the pleasure to work on, but I'll turn it over
to Hany. The talk today is graph-based semi-supervised learning of translation
models. Hany, the floor is yours.
>> Hany Hassan Awadalla: Thank you. Thank you all for being here on this sunny day. So this work is a paper that will appear in [indiscernible] this year, graph-based semi-supervised learning of translation models. The work was done with [indiscernible] last summer in cooperation with Christina and Chris. In this line of work, we're trying to leverage monolingual corpora. We all know from machine translation that we really need sentence-aligned corpora, we really need word-aligned corpora, to learn the translation rules. So in this work we are trying to challenge that assumption and measure how well we can learn translation rules for machine translation from monolingual corpora, [indiscernible] comparable corpora, and even from monolingual corpora that are not related at all. So why are we trying to do that? As you all know, a lot of the parallel corpora on the web now are not really parallel. Maybe they are comparable, but not parallel. The real [indiscernible] corpora are mostly machine translated already, which causes a lot of problems and a lot of noise. On the other hand, monolingual corpora are readily available in many domains and many areas, and they are produced on a regular basis for [indiscernible].
So if I am trying, for example, to train an English system for the medical domain, I can easily find monolingual data in both domains, but not that much parallel data or even comparable data. As a note from our own very large machine translation system, we usually utilize [indiscernible] shorter phrases, not longer phrases, and the reason for that is actually that we don't have enough data to learn the longer phrases, the two-, three-, four-, or five-grams. It's very sparse and we can't learn that. Even if the phrase tables in our machine translation system allow four or five words, we actually [indiscernible] one or two words at most, because in the first place we can't gather enough statistics to have reliable long-range or longer parallel phrases. On the other hand, for monolingual data, it is very easy to learn and to [indiscernible] in a [indiscernible] similar to a monolingual language model. So the problem we're trying to tackle here is handling the learning of translation rules for machine translation similar to the way we do monolingual language modeling, using [indiscernible] from monolingual corpora. In this work, we'll try to answer a few questions. One of them is very challenging: do we really need sentence-aligned or word-aligned corpora to learn translation rules or not? This is a very interesting point. The answer is no, if you are curious what the answer is right now, which is a little bit surprising. Second, how good would translation rules be if we learned them from monolingual data? Most of the work done before in that area tried to learn out-of-vocabulary items, just to compensate for [indiscernible], but learning real phrase-to-phrase translations, like bigrams to trigrams or four-grams to other four-grams, is not covered in that area at all. So we try to evaluate how good those translation rules are. Third, can we leverage monolingual corpora that are not comparable at all, that are not related to each other, to learn translation rules? We'll try to investigate that issue as well.
And the fourth one is a little bit tricky. It has been advocated in the machine translation community for a long time that we really just need very large language models: if we have a large enough language model, we will get good translations without any need for longer phrases. So we challenge that assumption as well, and we will see whether we really need the language model or we need longer phrases in our phrase table. The average phrase length used in translations in our [indiscernible] production system, for example, is 1.4 words, which means we are utilizing [indiscernible] during translation. So if we learn trigrams and four-grams that we really can use during decoding, would that help, would the language model interact with it, or is the language model enough on its own? We cover that issue as well. So we'll try to answer those few questions here. This is a very high-level overview of this kind of work. Simply, we start with a machine translation system, a phrase-based translation system as usual, and the assumption is that we'll have a very large graph. We will query this graph with any given n-gram from the source, like two or three words, asking for translations for n-grams never seen by [indiscernible] before, and those translations would be fed into the run time of the translation system to continue the usual translation process. So in this talk, we are focusing on how we construct this graph so that we can query it with any n-gram on the source side and get back n-grams on the target side. The work done on the run time of the machine translation system itself is minimal, so we are only talking about [indiscernible] processing here that can produce new [indiscernible] tables that can be used in any translation system. The translation system itself is, as usual, any vanilla phrase-based decoder.
Taking a bit of a dive into how this graph is constructed: we start with a translation model here, which we need from [indiscernible]; we can start with even a few hundred thousand sentences, just to have an idea of how these languages match each other, and we start with many sentences of monolingual corpora. This monolingual corpus is fed into the system, and we decide whether we know each phrase or don't know that phrase. A phrase here means an n-gram. So the definition of a phrase [indiscernible] is just an n-gram sequence. No linguistic information, just any n-gram sequence. A node is labeled if we know a translation for it from our translation table, or it is unlabeled, which means we don't know any translation for that [indiscernible], and then we need to add it to our graph, trying to start the graph propagation and to augment our translation models with those translations. So this is the [indiscernible] process where we try to enrich our phrase table with new entries based on some given source. Again, we will show how we construct this graph to query it for translations. So this is what our graph should look like. Again, these are two different graphs, one source-side graph and one target-side graph [indiscernible] of each other, and each one is constructed on monolingual data, not really related to each other at all. The main objective of this graph is to end up with a translation distribution for each node in the graph. Each node is a phrase, a bigram or trigram, for example, and the distribution is over target phrases with their probabilities. As you can see here, the dark nodes are known, meaning labeled, or we know translations for them, and the white ones are unknown and we don't know any translation for them. The main objective of the process is to propagate information through the graph to end up with distributions on those unknown nodes similar to the distributions on their neighbors, and these would be added to our translation table. So again, these are two different graphs, one constructed on the source side, one constructed on the target side. But as you can imagine, the definition of phrases here is very loose, so this graph can be very, very large if we consider all possible n-grams of a large monolingual corpus. So we have some restrictions on how we construct this graph, with some tricks to make sure it is still tractable even with high computational
resources. So we start constructing our graph from the parallel data that we have, and then we have some target nodes, which are phrases on the target side, and some source nodes, which are on the source side, and that is the only link between the source graph and the target graph. But the source graph itself is constructed from the source-side monolingual corpus, and the target graph likewise. If we restricted ourselves to those kinds of nodes in the graph, it would mean that when we need to translate phrases that were never seen in training before, we would have only candidates from our phrase table, which is very insufficient, because that means we could translate a phrase to a similar target phrase [indiscernible], but we couldn't translate to new phrases that were never seen on the target side. Again, if we are trying to translate a phrase that we never saw in our data before, we need to translate it to some target phrase in the graph, and a graph constructed only from the parallel corpus is very limited; it doesn't have enough variety to provide us with varied translations. So we need to expand this graph further to have more candidates for those target translations. The candidates can come from any initial candidates that we can provide, like bootstrapping from the same baseline system to get rough translations for those phrases and adding those to the graph as possibilities, or generating [indiscernible] variants for those phrases on the source side, trying to get similar phrases on the target side. Then we enrich the graph to have many, many nodes that can be translations of that source phrase. Again, if we didn't apply those restrictions, we'd end up with very large graphs on both sides, and it would still not be computationally efficient. For that, we apply those restrictions just to make sure that we have a limited space on the graph that we can propagate over. Any questions? Okay. So that is how we generate the candidates for our translation. Again, we have different sources for generating the
[indiscernible] candidates. First, assume that we're trying to translate a phrase A in the [indiscernible] that was never seen, but this phrase is similar to another source phrase [indiscernible]; we will show how this similarity is constructed, but every source phrase is similar to other source phrases. So a possible translation for it would be the translations of those similar phrases. That's one possibility. The other possibility is the other targets similar to that target phrase, so we keep expanding the graph to add those similar candidates as well. A third possibility is morphological similarity, where we generate, at the stem level, similar candidates for the target, and we keep expanding the graph to have a lot of [indiscernible]. We keep the top n-best candidates for the source nodes, and from here the main objective, again, is to propagate the probabilities and the distributions onto all the unknown nodes.
So with what I described so far, we only have the topology of the graph. We know what the nodes are and how they are connected, but we didn't define that connection yet, and we didn't define the [indiscernible] yet. This is a very open issue. The second thing is that we need to do the propagation once we define the [indiscernible]. So the first part, the graph construction: how are the [indiscernible] constructed? This is a very open question, because there are a lot of possibilities for how two phrases can be similar to each other. For example, you can have distributional similarity, you can have vector-space models, and you can have a lot of variety that can derive a similarity between two phrases. It's a very open research problem, and there's a lot of work to be done there, actually. Even in this work we are just touching the surface here: the node and edge similarities are very simple, [indiscernible] information based on the context of the phrase. For each phrase, we collect statistics from the monolingual corpus, the vector to the right and the vector to the left, back and forth on each side, and we construct a vector for each of those nodes, and the edge similarity, the pairwise similarity, is the [indiscernible] similarity between those vectors. We keep 500 neighbors for each node of the graph, so it's a very dense graph. But the edge similarity as defined here is just the pairwise mutual [indiscernible], which is very limited knowledge for now, since a phrase can have [indiscernible] information or semantic similarity; it can even have a syntactic signature on the gaps, on the borders, to derive some similarity. So there's a lot of work to be done here on deriving what the similarity between two phrases should be. Again, the simple solution here is just pairwise mutual information for the work we're presenting.
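(To make the graph construction concrete, here is a minimal sketch of that kind of pipeline. It is an illustration only, not the system from the talk: it assumes single-word left/right context features, positive PMI weighting, cosine similarity, and a brute-force top-k neighbor search, whereas the real system keeps 500 neighbors per node and distributes the computation over a large cluster.)

```python
from collections import Counter, defaultdict
from math import log, sqrt

def context_vectors(sentences, n=2):
    """PMI-weighted context vectors for every n-gram (phrase) in a tokenized
    monolingual corpus; context features here are just the single words to the
    immediate left and right of each occurrence (a simplification)."""
    phrase_feat = defaultdict(Counter)   # phrase -> context-feature counts
    feat_total = Counter()               # context-feature counts overall
    total = 0
    for toks in sentences:
        for i in range(len(toks) - n + 1):
            phrase = tuple(toks[i:i + n])
            feats = []
            if i > 0:
                feats.append(('L', toks[i - 1]))
            if i + n < len(toks):
                feats.append(('R', toks[i + n]))
            for f in feats:
                phrase_feat[phrase][f] += 1
                feat_total[f] += 1
                total += 1
    vectors = {}
    for phrase, feats in phrase_feat.items():
        norm = sum(feats.values())
        vectors[phrase] = {f: log(c * total / (norm * feat_total[f]))
                           for f, c in feats.items()
                           if c * total > norm * feat_total[f]}  # keep positive PMI
    return vectors

def cosine(u, v):
    num = sum(u[f] * v[f] for f in u if f in v)
    den = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def knn_graph(vectors, k=500):
    """Keep the k most similar phrases per node (brute force, O(V^2))."""
    graph = {}
    for p, vp in vectors.items():
        sims = sorted(((cosine(vp, vq), q) for q, vq in vectors.items() if q != p),
                      reverse=True)
        graph[p] = sims[:k]
    return graph
```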
Second, now that we have the edges, each with a weight, which is the combined similarity, we can propagate: propagate the knowledge over the graph so that the nodes end up with distributions. It means we need to propagate the distributions onto this source node from all possible connections, to end up with a distribution like this for every node. Then, querying that graph later, we can get phrase translations for any given phrase. The propagation problem is the easier part, actually; once you have a good, dense graph, it is easy to do the propagation. We have two options for propagation here. The simple one is label propagation, as usual. Here, the probability of a phrase being a translation of the source phrase at iteration T plus 1 depends on iteration T, and that probability is just a summation over the weights of the neighbors of that node. This has some limitations, the plain label propagation approach, because it only depends on one side of the graph. As you recall, our graph is a little bit complicated: we have two sides, a source side and a target side. Label propagation here would only account for the neighbors on the source side, but not the target side. For that, we prefer using structured label propagation. Structured label propagation is the same as label propagation, but, as you can see here, it has another term that propagates [indiscernible] for the target similarity. So while we are propagating on the same graph, we can have similarity on the source side and similarity on the target side in the same propagation [indiscernible]. So whenever we are propagating a translation, we are making sure that its source is similar and its target is similar, using that propagation mechanism. Structured propagation is very efficient. It only takes two iterations on that graph. This graph is composed of around 600 million nodes, and it takes only 15 minutes to propagate. So it is very efficient.
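(As a rough reference, the two update rules described here can be written as follows; this is a hedged reconstruction from the talk, not necessarily the paper's exact notation. N(u) is the source-side neighborhood of phrase u with edge weights w_{uv}, H(v) is neighbor v's candidate set of target phrases, and s(e, e') is the target-side similarity between candidates.)

```latex
% Plain label propagation: a source node u inherits its neighbors' distributions.
P_{t+1}(e \mid u) \;\propto\; \sum_{v \in N(u)} w_{uv}\, P_{t}(e \mid v)

% Structured label propagation: also reward target candidates e that are similar
% (on the target-side graph) to the candidates e' proposed by the neighbors.
P_{t+1}(e \mid u) \;\propto\; \sum_{v \in N(u)} w_{uv} \sum_{e' \in H(v)} s(e, e')\, P_{t}(e' \mid v)
```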
The computational [indiscernible] is in constructing the graph, because, again, we are using pairwise mutual information to construct the graph with 500 neighbors, so it is a very, very enumerative computation. But nowadays, with large clusters, it is easy just to [indiscernible] 500 machines to do that [indiscernible]. Still, it's computationally expensive. It is not cheap, as with any graph-based technique. So now, returning to that graph again, we ended up with all those nodes having distributions, which are translation probabilities, and we aggregate all of those to construct a new [indiscernible] table, which we can then use during translation.
For our evaluation, we used two systems, Arabic to English and Urdu to English. Arabic to English is not a [indiscernible] language pair, but we include it as a [indiscernible] language pair because it has a lot of corpora in real comparable scenarios, so we can measure how well we do when we have comparable corpora versus non-comparable corpora. Urdu to English is a very low-resource language pair; there are few parallel corpora out there for Urdu to English, and we examine more scenarios there. So that is the Arabic-English training data for the parallel data, and that is the Urdu-English, which is the [indiscernible] evaluation data. The comparable corpora are from the [indiscernible] corpus, Arabic and English, and those are really comparable data. It means that the [indiscernible] corpus [indiscernible], which are really similar to each other, and this will evaluate what we can get from similar data from the same source [indiscernible]. As you can see, the Urdu-English is longer, but we can see that they are talking about the same events, the same content. For Urdu-English, this is noisy parallel data. That means it is [indiscernible]; it can have machine-translated content, it can have very noisy content, but still this will give us an idea of how we can utilize this data [indiscernible] as well. And [indiscernible] here we are [indiscernible] Urdu and English monolingual corpora, not related to each other at all, just [indiscernible] monolingual corpora used for the [indiscernible] language models, so we
evaluate the [indiscernible] on those data as well. So, just an analysis of why we are not tackling the OOV issue, as most of the previous work trying to use [indiscernible] corpora did: it is really not an OOV issue. As we can see from these statistics, this is from the [indiscernible] test set; this is the number of sentences [indiscernible], and here is the number of bigrams in those sets. An unknown bigram means that I have never seen it in my phrase table before. So this means that almost 56 percent of the bigrams in this set were never seen in the phrase table, or not seen in my [indiscernible] corpus at all. So for more than 56 percent, we resort to [indiscernible] translation.
>>: Which language pair is this?
>> Hany Hassan Awadalla: This is Arabic-English. So it means that for that percentage we resort to [indiscernible] translation, and our [indiscernible] model depends on the language model to put the pieces back into order. For that, we are tackling the bigram issue here, trying to compensate for it. If we break this down into which of those bigrams are known and which are unknown, where unknown means out of the [indiscernible] vocabulary: known-known means both words are in the vocabulary, and in this case we resort to [indiscernible] translation. Known-unknown means one of them is known [indiscernible] and the other one is really [indiscernible]. Both unknown is very [indiscernible], only from those hundred [indiscernible]. So our real focus is on those bigrams whose words are known individually but were never observed together in our [indiscernible] corpus; if we can acquire translations for them, that can be more [indiscernible]. That is the assumption we had: should we attack this problem, and should we have bigrams and trigrams to compensate for that?
For the Arabic-English results, this is the baseline on this data. The evaluation [indiscernible] is MT06; this is MT08. The language model is built on something like 50 million sentences, I think. It's a small language model. Then we can see here the baseline, and that is when we use the graph propagation with unigrams only. This case means that is [indiscernible], and you can see it has almost no effect, [indiscernible] and a very small effect on the dev set. So we really want to handle the bigram case, and in that case, this is using label propagation, where we take into account the source similarity but not the target similarity. It helps a little bit, almost one [indiscernible] point or more here, and this is a [indiscernible] point here. And when we add the structured label propagation, where we take into consideration both the source and the target as propagation criteria, we get much more improvement on [indiscernible]. There's a bit of an issue here, [indiscernible] again, which is that, since we are adding bigrams as one piece, could this just be compensating for the effect of the language model? Because if we have a large enough language model, it can cover those bigrams, and here we have a small language model. So what if we have a very large language model: would it marginalize the effect, or would the gain add on top of it? So we switched to using our very large production language model, which is built on around 9 billion tokens, and here is the baseline with the large language model. It jumps about two BLEU points from here. Then we can run the system again, which is this system here, and you can see the improvement is even bigger than with the smaller system. So that means that even when using a very large language model built on nine billion tokens, our actual production language model, a very, very huge system, we still see the same improvement. That means we really get the improvement from having the phrase-to-phrase translations, not from the language model [indiscernible]. If we had not seen the same thing, we would not have been sure what the effect of adding those bigrams was, whether it was just sticking the words together and compensating for the language model effect or not. But as we can see here, it is really not the language model effect; it is the translation [indiscernible] effect.
Here we evaluated the comparable case, because the data used with the graph-based machine translation approach were the [indiscernible] comparable corpora, which are [indiscernible] from similar data, so we need to evaluate the other extreme case, when it is not such [indiscernible]. We did that with Urdu, actually. The [indiscernible] baseline uses the NIST data; we don't have any graph-based approach here, and that is the dev and test [indiscernible]. Then here, that is another baseline, adding the noisy [indiscernible] corpus, which is [indiscernible] from the web, and which gives a lot of improvement here. Those are 160 [indiscernible] and those are 450 [indiscernible] sentences, so it is quite a [indiscernible] amount of data. But as you can see, even just noisy data crawled from the web can give a good jump here. Then [indiscernible] we are trying to evaluate the effect of using that noisy parallel data as a comparable corpus. So we remove the sentence alignment, take [indiscernible] chunks of comparable corpus, feed that into our graph construction system, construct the graph with the source side from the noisy [indiscernible], and we try to measure how well we can learn [indiscernible]. So that is the baseline, and that is with the translation rules added from handling this data as a comparable corpus. If we [indiscernible] it as a parallel corpus, because it's [indiscernible] already, we won't get that much, but handling it as a comparable corpus we get this much, which is very close to what we get from the parallel data. This is a very important observation, because it means that we can learn from new [indiscernible] data without any word alignment. From these data, just the n-grams are extracted, we construct the graph using both of them, and we get translation rules that we add to the system. As you can see, we lose a little bit, but the prize is that this does not have to be [indiscernible] data; it's just similar data, and I find this one of the most important findings of this work, actually, that we can get relieved from needing truly parallel data and instead extract from non-parallel data. That is the [indiscernible] corner here, where we just have the monolingual corpora that we [indiscernible] use for the Urdu language model and for the English language model. We don't have any similarity between them; they're very different, nothing at all. And you still can get an improvement, this one compared to that one. So we still get one [indiscernible] to almost [indiscernible] between that one and that one, which means that we can even learn from monolingual corpora that are not related at all. So I find the Urdu-English case much more descriptive than the Arabic case, because it's a really low-resource language pair for which we don't have much data. That is the only publicly available data for that language pair, and that is what we have; though we have a lot of [indiscernible], that is all [indiscernible] you can get from the web for that language pair. So such a small language pair can benefit from these approaches a lot. I'm maybe out of time.
So I have a couple of slides here with a few examples from Arabic-English. Here, this is the bigram we are trying to translate, and it is not in the phrase table. The translation is a bigram, but there is a [indiscernible] difference here. Our baseline just drops it; it should be the U.S. [indiscernible] of the U.S. president here, and we dropped it, because that is very usual in phrase-based translation: when you have an [indiscernible], the system prefers to be brief and drop unknowns, and the language [indiscernible]. And this is our system, where we produce presidential envoy. As you can see, the [indiscernible] may not show any difference at all, because it's presidential, not president, and so on, but this is much better. This is a good example, and that is a bad example. In this bad example, we have this guy, there's some name, who said blah, blah, blah. The baseline is very brief: he said, because we don't know him, which is good. And then our system proposed another name, a very similar name. This is the drawback of the pairwise mutual similarity: both names will appear in the [indiscernible] in a very similar way, and we don't have any better features to model that here. So Abdullah and Abdalmahmood and Mike would be in the same context, right? It is the problem of the very simple features [indiscernible] here. For that, we are moving away from the pairwise mutual similarity to more sophisticated features to have better [indiscernible].
For the Urdu-English, this is the bigram here, [indiscernible] let me know, and it should be "I am hopeful." The baseline gives "this hope," and we propose "I am hopeful," which is much better. Again, these are a variety of other examples that we have, with where each one came from. Here, this is the Arabic for sending reinforcements. The baseline is "strong reinforcements," and we get "sending reinforcements." It means that it is one of the neighbors; it is already on the graph. From where? It is a neighbor of one of the [indiscernible], but we can pick it as one of the translations. This one is the same: it is OOV in the baseline, but we can get it. This is one of the neighbors again. And this one was "address," and the baseline gets the same translation, a different tense, but for [indiscernible] systems, okay. Then here is a very good one as well; both are good. The baseline is not bad, but we do better here. G means generated: it came from the morphological variant generation or from the baseline rough translation, but it wasn't one of the neighbors; we generated it based on some similarity. This is OOV; those are Urdu examples. "To defend him" is clearly morphologically generated, and we get "him to defend himself," which is much better. "While speaking," "in the." So we get a lot of those improvements that are really n-gram improvements, not OOV; even though OOV, the [indiscernible] OOV. We get a lot of new phrases that we really propose new translations for. So we hope that this promotes a new direction for machine translation, where we get [indiscernible] on getting more parallel data and maybe focus more on revisiting the graph-based techniques, which are very efficient but, due to the computational [indiscernible], didn't get into the game until later. But now, with cheap computational resources everywhere, I think it is time to revisit such approaches. As you can see, once you construct a good graph with [indiscernible], more work on how we construct the graph is the important thing there. As for our future direction here, we are actually switching to heterogeneous features for the phrase pairs. We can have syntactic similarity, for example [indiscernible] natured; we can have semantic similarities; we can have continuous vector representations for the phrases, and then we get relieved from that pairwise mutual information computation. So the phrases are just vectors, and this gives us a new dimension for how we can construct our graph. The features in that vector can be related to whatever you want to design there. The bottleneck is that you now need to learn the similarity in a better way. And this is still a future direction that we will try to tackle in this [indiscernible]. I think I'm done. So if you have any questions [indiscernible]. Thank you.
>>: So I'm lucky that I actually know how to speak [indiscernible]. But I noticed that you did very well on Urdu morphology, where -- so [indiscernible] means hopeful. Just [indiscernible] means hopeful. Hope, actually. So it's --
>> Hany Hassan Awadalla: What is [indiscernible]?
>>: [Indiscernible] is the morphology to change it from hope to hopeful. The morpheme.
>> Hany Hassan Awadalla: But it's two words, right?
>>: Yeah, [indiscernible] itself can exist on its own, so it's a root morpheme, but it always has to attach to another word. I've been impressed that you were able to go deeper into the morphology, and again on the next slide, the table, [indiscernible], the last line, which is [indiscernible] conversation, and "key," which is a morpheme on top of that. So "in the" is just the "key" part. So I can see where you got the improvement, where you managed to find the morpheme inside, translate, and then get a context-sensitive translation. But I don't [indiscernible] Arabic, where proper nouns and common nouns occur together, because people's names have common nouns in them. And we know this [indiscernible] means slave, but we also use it as a name: Abdullah, all of those. I wonder if your struggles are occurring because names have morphemes that have common-noun meanings as well, and maybe therein lies some of the trouble.
>> Hany Hassan Awadalla: So this is a very good question. The morphological candidates are generated using stem-based translation. We don't know any stems for either language, but Christina has an approach to learn the stems based on word alignment, so from those alignments we learn the similar stems. Then we generate stem-based translations, which are very, very rough translations. They can be good, they can be bad, but they can give you an idea, given the context, of whether we may make it or not. So that can work very well with linguistic units like prefixes and suffixes. For people's [indiscernible], this is part of a name, and it is actually [indiscernible]. It is just likely to collide with the actual word, because the actual word used that way is less frequent in the data than the person's name. But for the case of Urdu, where you have an actual prefix or suffix, I think that using word alignment it will be learned much more easily. Maybe that is the reason for that.
>>: You have a lot of good progress, even with this. I'm interested in what will come next and excited about what will come next.
>> Hany Hassan Awadalla: We're hopeful that we can learn those features in that representation, and we'll get a [indiscernible] of having [indiscernible], because they can still be similar to each other. So we hope that in the Urdu case, the features would learn that this piece is really related to that part, not to that part, and so on. Now we are doing it in a more [indiscernible] way, but we hope we can learn the features in a more systematic way.
>>: I'll use it. If it ever shows up in translator, I'll use it.
>> Hany Hassan Awadalla: You should use it now. Well, this work is still early research. It takes a lot of computational resources to compute that table and that extension. So even for Bing translator, we would have to generate it [indiscernible]; like, for example, we take a lot of data, generate new tables, and make them available [indiscernible]. It wouldn't be on the fly for now; once we have cheaper cluster [indiscernible] resources, it can be on the fly.
>>: Maybe next year.
>> Hany Hassan Awadalla: Maybe.
>>: In both of these cases, you go into English, which is morphologically poor. How do you expect your results to carry over to a more [indiscernible] language?
>> Hany Hassan Awadalla: Good question. We didn't try anything in the other direction. We should try that. That is interesting, but we didn't try it.
>>: There would be fewer cases where this would trigger, right? Because it is simpler and the vocabulary is smaller, so you'll have fewer unseen bigrams.
>> Hany Hassan Awadalla: You would not have the issue of morphological variants; it would not be as important, because you are going the other way. But even with very large systems for English to Arabic, we have the problem of not being able to generate phrases that are highly morphologically inflected. Like when you say in Arabic [indiscernible], for example, that's [indiscernible]. So we can't generate that [indiscernible] Arabic form, and we resort to simpler forms. So if you can catch that, it is a one-to-many or many-to-one [indiscernible], but [indiscernible] it's an interesting direction.
>>: So it seems also that it's not just that you're generating more [indiscernible], basically, for your phrase table; maybe the previous page has a couple of examples from your reordering model. So it's not just that you've got the drop-in replacement for phrases, but there's a much bigger set of interactions happening. So are you explicitly doing anything about that, or are you --
>> Hany Hassan Awadalla: [Indiscernible] hierarchical reordering, a hierarchical reordering model, but to --
>>: [Indiscernible] because, I mean, right now the graph propagation gives you what you extract your phrase tables from. So how are you generating your reordering model?
>> Hany Hassan Awadalla: The reordering model was trained on the small [indiscernible] data.
>>: Okay. So the reordering model is basically just given a subtly different set of inputs now; it's doing what it would have done anyway if it had had --
>> Hany Hassan Awadalla: Yeah, but since we have new phrases, we retune our parameters based on the new phrases. In this case, some of those parameters will be different than the others. So you change the parameters and that can change [indiscernible], but it's not much different in terms of reordering.
>>: It's not too much different, but it is interesting to see that these small, what you basically planned as being small drop-in replacements for, like, bigrams are propagating effects that are quite substantial.
>> Hany Hassan Awadalla: But it can affect that. The [indiscernible], for example, the phrases, all of those can be affected when you introduce similar phrases.
>>: Yeah, but such a small trigger is generating this.
>> Hany Hassan Awadalla: It's interesting. Thank you.
>> Michael Gamon: We're going to move on to the next talk. This is Fei Xia. I guess most of you know Fei. She was one of the founding faculty of the comp ling program at UW, starting in 2005. And actually, we've known each other for years. If you happen to know Fei, one thing you know about her is she's tireless. She's dedicated. And maybe relentless. These are all good qualities, actually. If you've ever worked on a paper with Fei, you know exactly what I'm talking about, or if you're a student of hers. Anyway, I talked about the IBM connection already, so I'll stop with that. She's currently an associate professor in the linguistics department at UW, so I'll hand it off to Fei at this point. I was going to say more, but --
>> Fei Xia: Thank you very much for the introduction.
So I want to say a few words about my co-author. Yan was actually still at the City [indiscernible] Hong Kong when he came to UW as a visiting student a few years ago, and we worked on this together. After he left, we continued the collaboration, and after he got his Ph.D., he went to Beijing and joined Microsoft over there. So that's a connection. So today I'm going to talk about domain adaptation, and I think we actually have several experts on domain adaptation in this room. The starting point is that when you train your system on some labeled data and then test it on a different domain, the result can be really bad. The goal of domain adaptation is to bridge that gap, and there has been a lot of work on this; I'm going to cover some of it quickly. So in this talk, I'm going to discuss a few recent studies we did on domain adaptation. I'm going to first talk about related work, and then we basically tried three different approaches, and I'm going to go over each one, and for each one I'm going to show you some results. For related work, one way to look at it, because there's so much work, is to look at the assumptions. For domain adaptation, you always assume you have a large amount of labeled data in a source domain, and the target domain is where the test data come from. Then you can either have no labeled data in the target domain, which is called the semi-supervised setting, or you can have a small amount of data, which is called the supervised setting. And you might have additional unlabeled data, either in the source domain or in the target domain. For the main approaches, if you look at the supervised setting, here I just list some of them; it's not supposed to be complete. For example, in model combination, you build a model for each domain and then give each model a different weight, depending on the similarity between the source domain and the target domain. Or you can give weights to instances, not to models. Or you can use so-called feature augmentation, where you make copies of the same feature: basically, for each feature, you make three copies, one for the source domain, one for the target domain, and one for the general domain. And the last one here is called structural correspondence learning. The idea is that we have different domains and we want to learn the relation between features in those domains. And there are more than these. For semi-supervised learning, we start with self-training and co-training; that's pretty standard. Then there has been a lot of work on training data selection, and actually two of the authors are here already. The idea there is that you want to select a subset of the training data that looks closer to the target domain, and then you train the model only on this reduced training set. And then we did something like the last one, where we actually use unlabeled data. The idea there is that you can use unlabeled data to add additional features, and I will give you more information about this soon. So now let's look at the first study we did.
Let's look at two existing methods: one is training data selection, and one is feature augmentation. There are pros and cons. For training data selection, the pro is that because you only use a subset of the data, you can speed up training, and this is really important for some applications, for example machine translation, where you have millions and millions of sentences. As for performance, in fact, when you use a subset of the training data, you can actually get better performance than using the whole dataset. The con is that for certain applications you might not have a huge amount of training data; then, when you select a subset, the unselected part is simply ignored, so for some applications you can actually get lower performance, and I'll show you an example. The second approach is feature augmentation. The pro here is that it's very easy to implement: you just make three copies of everything, and normally you get some improvement. The cons are that it requires labeled data in the target domain, so this is the supervised setting, and for certain applications you don't have labeled data in the target domain; also, because you duplicate features, you basically triple the number of features, and that can cause a problem. So how do we address these limitations? This is actually a very simple idea: combine both. The way you do it is that you use training data selection to look at your source-domain data and divide it into two subsets, one that you select and one that you don't. For the one you select, you say those data may be closer to the target domain; that's really why they were selected in the first place. Now you can just apply standard feature augmentation to those two pseudo-domains. The advantage is that, unlike training data selection, you can use the whole dataset, and for feature augmentation, you can pretend you have labeled data from the target domain. So it's a very simple idea, but we actually show some improvement.
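(As a concrete illustration of this combination, here is a minimal sketch assuming a Daumé-style feature augmentation, where each instance gets a shared "general" copy of its features plus a copy for its pseudo-domain; `select` and `featurize` are hypothetical stand-ins for a similarity-based selector and a feature extractor, not names from the paper.)

```python
def augment(features, domain):
    """Feature augmentation: every feature gets a shared 'general' copy and a
    domain-specific copy, so the learner can keep both shared and
    domain-specific weights (three feature spaces across the whole dataset)."""
    out = {}
    for name, value in features.items():
        out['general:' + name] = value
        out[domain + ':' + name] = value
    return out

def build_training_set(labeled_source, select, featurize):
    """Combine training-data selection with feature augmentation: sentences
    picked by `select` are treated as pseudo-target-domain data, the rest as
    pseudo-source-domain data, and nothing is thrown away."""
    data = []
    for sentence, label in labeled_source:
        domain = 'pseudo_tgt' if select(sentence) else 'pseudo_src'
        data.append((augment(featurize(sentence), domain), label))
    return data
```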
In this case, we look at two tasks: one is Chinese word segmentation, and one is POS tagging, and we use a standard baseline, CRF taggers. When you use training data selection, you always have to decide what similarity function you are going to use, and here we tried two; you can try different ones. One is based on entropy, and one is based on coverage. What I mean by coverage here is that for word segmentation, one major issue is OOV, [indiscernible] recovery of unknown words. So what we do is select the sentences from the source domain so that we cover as many [indiscernible]-grams in the test data as possible; that's what coverage means. I'm not going to go into the details; I'll just show you the results.
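(Here is a minimal sketch of one way the coverage criterion could be implemented, assuming a greedy strategy that repeatedly adds the training sentence covering the most not-yet-covered test n-grams; the selection procedure actually used in the study may differ.)

```python
from collections import Counter

def ngrams(tokens, n=2):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def coverage_select(train_sentences, test_sentences, budget, n=2):
    """Greedy coverage-based training-data selection: pick up to `budget`
    source-domain sentences, each time choosing the one that covers the most
    test-set n-grams that are still uncovered."""
    uncovered = set(g for sent in test_sentences for g in ngrams(sent, n))
    remaining = list(train_sentences)
    selected = []
    while remaining and len(selected) < budget and uncovered:
        best = max(remaining, key=lambda s: len(uncovered & set(ngrams(s, n))))
        selected.append(best)
        remaining.remove(best)
        uncovered -= set(ngrams(best, n))
    return selected
```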
So we tried this on the Chinese Treebank 7.0. It has more than one million words. The interesting part about this [indiscernible] is that it has five different genres: broadcast conversation, broadcast news, and so on, so forth. So we choose one genre as our test data, and then the training data comes from the other four. To show you the results here, you see some lines. The baseline is this one; that means you use the whole training dataset, so that's this line, and this case uses only training data selection. The X axis is the percentage of the training data you use for your training. You can imagine that when you use ten percent, the performance is actually worse, right? This bottom line here is when you use random selection to select ten percent; this AEG is the measure where we use entropy-based selection, and this blue line is where we used the coverage-based selection. The point here is that both selection methods [indiscernible] from random selection; that's not surprising. But when you use less than, say, 60 percent of the data, the performance is actually not better than the baseline, because for this dataset we don't have a huge amount of training data. Therefore, [indiscernible] any data actually hurts the performance. But if you go up a little bit more for training data selection, you can get something better than using the whole dataset. So that's when you use training data selection only. Now suppose you use both: in this case, you use training data selection and then you use feature augmentation. This is for POS tagging, so that's the baseline; the baseline is that you just use 100 percent of the data. And here the X axis, once again, is the percentage of data you select. But now, when you select ten percent, it is [indiscernible] 90 percent; it means you pretend the other 90 percent come from the source domain. So you still use the whole 100 percent, but you give them different domain labels. In that sense, it's like instance [indiscernible], something like that, but you do the training data selection first, then you divide the data into two sets, and then you use feature augmentation. So you can see you get kind of a nice improvement. The next one is for word segmentation. This is actually interesting in the sense that, once again, this one is the baseline; that means you use 100 percent of the data from the source domain. And here is where you use feature augmentation, but this line means that you duplicate every feature. When you duplicate every feature, the performance is actually not much better than the baseline, because you just have too many features, right? But this line is when you duplicate only certain kinds of features, for example only features that are not [indiscernible], so you increase the number of features a little bit, not too much. That way, you don't get hurt by this explosion in the number of features, and you can see you get some improvement. This is actually what we wanted to see initially: when you select, say, 10 percent or 20 percent, you get a big improvement, and when you select more, say 60 percent, it's actually not much better than your baseline, because it's no longer that different from how much you select from your training data. So that's the first experiment.
Now I'm going to move on to the second one. In the second study, we try something different. In this case, the setting is a semi-supervised setting, meaning you don't have labeled data in the target domain, which is the same as the previous one, but the idea here is: what if you have an unsupervised method? Sometimes an unsupervised method can complement the supervised method, right? The way we use it is basically this: you run unsupervised learning and get some results, and the results will not be reliable, but then you treat those decisions as features and you use your source-domain data, which is labeled, to learn how useful those features are. In this study, we also look at word segmentation, and for the unsupervised learning method, we use this DLG. I'll give you a very quick introduction to it. The method itself, the DLG, is actually not that important, in the sense that you can replace it with any method you want to use for unsupervised learning. But just to give you some idea of what's going on here: for word segmentation, as I mentioned, the main problem is really [indiscernible], right? You have all those unknown words in your test data, and one way you can measure this is to look at a certain string. So let me explain this measurement a bit. Suppose X is a corpus, and now you want to represent this corpus with some description. DL means description length, and it's really N, the size of the corpus, multiplied by the entropy of the corpus. What it means is that you have a vocabulary from this corpus, and you say, okay, I will just calculate the entropy for this corpus, and that's your description length. The DLG is really a function of a string. Here, S is a string, a string in your corpus. What you do is to say: I have my original description length for the corpus; now what if I replace this string with an ID? You say, okay, this string, I treat it as a word in my dictionary, and now, instead of having the whole string, I only need the ID to represent it. So you get a new corpus by replacing the string with the string ID, but you still need to add the string to your dictionary, so there will be an additional cost when you add something to your vocabulary, but [indiscernible], right, because now the description length will be shorter. If you don't get this part, it's okay, but the intuition is that the longer the string is and the more frequent the string is in your corpus, the higher the DLG score will be.
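(For reference, the quantities being described can be written as follows, in line with the usual description length gain formulation; the exact notation in the paper may differ. X is the corpus, V its symbol vocabulary, s a candidate string, r a new index symbol, and ⊕ concatenation.)

```latex
% Description length of a corpus X: its size times its empirical entropy.
DL(X) = |X| \cdot \hat{H}(X)
      = -|X| \sum_{c \in V} \frac{\mathrm{count}(c, X)}{|X|} \log_2 \frac{\mathrm{count}(c, X)}{|X|}

% Description length gain of a string s: the saving obtained by replacing every
% occurrence of s in X with the new symbol r and appending one copy of s as a
% dictionary entry.
DLG(s) = DL(X) - DL\big(X[r \rightarrow s] \oplus s\big)
```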
So therefore, you can imagine that if you want to do unsupervised word segmentation, what you can do is this: suppose U is a string I'm trying to segment; then I can try [indiscernible] ways to segment it, and for each segmentation I look at the strings in that segmentation, calculate their scores, and maybe apply a certain kind of weighting, but eventually I choose the segmentation with the highest score. Therefore, you basically prefer long strings and frequent strings, and you treat those as words. So that's unsupervised learning in this case. But the problem is that sometimes you have a phrase like, for example, Hong Kong; I mean, for Hong Kong you can argue whether it's one word or two words. But sometimes you have certain kinds of collocations, like "could you please," right? Those will not be a word, per se, but they are very frequent and they are fairly long, so if you use this method, you treat them as one word instead of multiple words. So unsupervised learning does not work that well on its own, but it's very good at finding those kinds of new words you have never seen before. So what we do next is to say, okay, now let's use that finding for supervised learning. Our baseline system is just a word segmenter. We treat this as a sequence labeling problem; that means for each character, you decide whether you want to add a break or not. And here,
B1 means the first character in a word, B2 means the second character in the word, and so on, so forth. So you can treat this as a POS-tagging-style problem, and we use a standard feature set; there's nothing strange about it. Now, when we add the DLG feature, what we do is this: let's calculate the DLG scores. You take the training data and the test data, and for the training data you ignore the word segmentation information and assume it's unlabeled. Now you get a union, and for every n-gram in this whole dataset (we want to keep them short, so there is a length constraint), you calculate the DLG score. Then, for each character in my training data or my test data, I want to form a new feature vector: this is the same feature vector as before, but now I'm going to add some new, additional features. Those features basically represent what the decision would be if I used unsupervised word segmentation here, and then you learn how useful those decisions are. For example, suppose the sentence is C1, C2, C3, C4, C5; you have five characters. Now suppose we form the feature vector for C3. What you do is look at all the n-grams that contain C3, and you collect their DLG scores, not from the sentence but from the whole corpus. So you get a DLG score, you take the log, you take the floor, and you get an integer here. The last column here is the tag for C3 in that string: if the word is only C3 itself, then the tag for it would be S, meaning it's a single-character word; and if the string is C2, C3, then the tag is going to be E, where E means the last character of a word, and so on, so forth. Basically, you collect this from your training and test data. Now, suppose C3 belongs to a word of length two. Then there are two possibilities, either C2, C3 or C3, C4; I look at those scores and make a decision, choosing the highest one. So if the highest score is 2 and the tag for it is E, I'm saying: if C3 belongs to a word of length two, then the decision based on unsupervised learning is that its tag will be E, and the strength of that will be two. So I create a new feature. You can try different ways to create the features, but the point is that you add these additional features on top of your existing features for supervised learning.
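(Here is a minimal sketch of how such a feature could be computed for one character position, assuming BIES-style tags and a precomputed table of DLG scores over the union of training and test text; the function and feature names, and the exact bucketing, are illustrative rather than the paper's.)

```python
from math import floor, log

def dlg_features(sentence, i, dlg_score, max_len=4):
    """For the character at position i, emit one feature per candidate word
    length L: the tag (B/M/E) the character would get inside the highest-DLG
    L-gram containing it, together with floor(log(DLG)) as a strength bucket.
    `dlg_score` maps character n-grams (strings) to DLG scores."""
    feats = []
    for length in range(2, max_len + 1):
        best_score, best_tag = None, None
        for start in range(max(0, i - length + 1),
                           min(i, len(sentence) - length) + 1):
            gram = sentence[start:start + length]
            score = dlg_score.get(gram, 0.0)
            if score > 0 and (best_score is None or score > best_score):
                best_score = score
                offset = i - start
                best_tag = 'B' if offset == 0 else ('E' if offset == length - 1 else 'M')
        if best_score is not None:
            bucket = floor(log(best_score)) if best_score >= 1 else 0
            feats.append('dlg:len=%d:tag=%s:strength=%d' % (length, best_tag, bucket))
    return feats
```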
Now, to show you the results: in this case, the tested genre is BC, broadcast conversation, and the training genre is either BC or something else. If it's BC, then this is training in the same domain, right? If you use the baseline here, that's 93.9; when you add the DLG score, although it's the same domain, you still get an improvement. Of the other four genres, the web is closest to BC in this domain adaptation setting, so if you use the baseline, it's this number, and if you add the DLG, you get some improvement. NW is newswire, I guess, something like that, and this is most different from broadcast, so you can see the baseline is much lower, but when you add the DLG, you actually get a bigger improvement. That means that when the domains are more different, you get a bigger boost from adding the unsupervised learning results. You can also combine segmentation with POS tagging, and I'll speed through this part, but when you do this, you can also add the DLG on top of that. This one is just to allow us to compare with existing work, so you can see those are some previous works. This work was published in 2012, so that was the latest result at that time. You can see this is the result for segmentation, and this is the result for joint segmentation and POS tagging; when you add the DLG, you always get some improvement, and eventually the result is about the best result at the time. So that's the second approach.
think this is actually interesting in the sense that here we are not talking
about two domains. We are talking about two languages. So the idea is to say
what if you have two languages that are closely related, right. What happened
is that very often, we know that operating corpora is very expensive. So we
have resources only for certain languages. But there are fewer resources for
ancient languages for obvious reason. First, there's no money there, right, so
nobody wants to work on that. Second, we don't have a lot of data, so on, so
forth. So the idea -- yeah, you have one speaker, no speaker. So the question
is can we actually improve the NLP system for ancient language when you have
resource for modern language. So we want to kind of take a look at what we
get. In this case, there has been some previous study so here I list some of
those. And most of those studies, based on spelling normalization. So, for
example, in middle English, you spell a word this way, but the [indiscernible]
empire, you spell that way. Here we want to do something entirely different.
So the languages we used are archaic Chinese and modern Chinese, and once again
we focus on word segmentation and POS tagging. The idea here is that you want
to find the properties shared by those two languages and then exploit that
information. Just to tell you how different old Chinese is: based on this book,
and of course there are different ways to divide Chinese, there are four eras
of Chinese. Here I give you the time periods. Basically, archaic Chinese was
spoken before 200 AD, and if you compare it with modern Chinese, the difference
is huge. As a result, modern Chinese speakers will not be able to
[indiscernible] archaic Chinese without training. For example, I
remember in middle school we actually had to learn how to understand those
texts, and that was my favorite subject at the time, because anything else I
learned in Chinese class was totally useless. That's actually a separate story
about the way they taught Chinese; it was awful. But anyway, that's the reason
I remember, right, that those [indiscernible] are really different. And what
we did is look at one book. This book is actually a collection of essays
written by this person and his, you know, retainers in that time period. The
interesting part about this book is that it has 21 chapters covering a wide
range of topics, so you can imagine they are from different domains. Therefore,
it gives you a very good picture of what people were interested in at that
time. So we have this book, and then
my colleagues created this corpus. They did the word segmentation and the POS
tagging for it. The way they did that is they started with the Chinese Penn
Treebank guidelines, because every time you create a corpus, you need
guidelines, right. They are based on the same principles, in the sense that
when you talk about words, there are actually different definitions of a word,
depending on what you are trying to do. So they follow the same principles,
but the decisions can be different. For example, take this word, this string:
the first character means country, the second character means family. In
modern Chinese, it's one word that just means country. But in old Chinese,
it's very often used as two words, so it's really country and family. It's an
interesting debate why country plus family may have become just country, but
that's a separate story. That's Chinese philosophy, right? Country is always
more important than family, which I kind of disagree with. This is just the
size of the corpus, right: the number of
characters, number of words, number of sentences, and average sentence length.
Now the question is: suppose you have resources for modern Chinese. Can you
use them to improve the performance? For the training data, we assume we have
some labeled data for this old Chinese corpus, but we also have a modern
Chinese corpus; here we used the Chinese Penn Treebank, version 7. The test
data comes from the old corpus, and we use five-fold cross-validation: four
folds for training and one fold for testing. You can imagine what kind of
baselines we have, right. We can use the in-domain data only, so in that case
it's the old Chinese only; we can use the modern Chinese data only; or we can
take the union. Those are the obvious baselines. For our approach, to give
away the ending right now: if you just use the union, you actually don't get
any improvement, as you can imagine. So we want to do something more clever
than that. We want to find properties shared by those two languages, represent
those properties as features, and then
build a system. The question is what properties are shared by those two
languages. So we did some study, right. This is just some statistics, but the
most important part is here, the average word length. You can see that for
modern Chinese, the average word length is 1.6; for old Chinese, it's 1.1.
That's actually consistent with our understanding of Chinese as a language: it
goes from a language of monosyllabic words to one with more multi-character
words, right. So you can see this is the kind of difference. You can imagine
that for word segmentation, if you want to get a word segmentation for old
Chinese, you can just add a break after every character and you will be 90
percent correct.
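Just to make that baseline concrete (a sketch of mine, not code from the talk): if every character is predicted to be its own word, a gold word is recovered exactly only when it has length one, so the baseline's word recall is just the fraction of single-character word tokens, which is around 86 percent for the old Chinese corpus.

    # Sketch: the "every character is a word" baseline.
    # gold_sentences is a list of sentences, each a list of word strings.

    def char_baseline_recall(gold_sentences):
        tokens = [w for sent in gold_sentences for w in sent]
        # a word is recovered exactly iff it is a single character
        return sum(1 for w in tokens if len(w) == 1) / len(tokens)

    print(char_baseline_recall([["学", "而", "时", "习", "之"], ["国家"]]))  # 5/6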
So that's one thing. And this slide just gives you a breakdown: you can see
that for old Chinese, 86 percent of word tokens have only one character, and so
on. For modern Chinese, roughly half have only one character and about half
have two, and then you have a tail of longer words. Now,
if you look at POS tagging, this is the percentage of word tokens with each tag
in the corpus. NN is noun, PU is punctuation, VV is verb, AD is adverb, and so
on. You can see that the first four categories are roughly [indiscernible],
right. That's not surprising. But I want to show you something else: there are
actually many POS tags that do not appear in old Chinese. Some of those are
due to content; you can imagine the texts are talking about different things,
so there are no URLs at the time and no foreign words, for obvious reasons. But
then there's something else. For example, BA: BA is the POS tag for the word
ba in the ba-construction, and it turns out this construction did not appear
until about 200 years after this book was written, so the construction was
simply not there at the time. Similarly, these POS tags are for DE. There is
this special Chinese character -- actually, there are three that are all called
DE -- and they have different POS tags because they behave very differently.
That's also a relatively new phenomenon, in the sense that they did not have
that usage until much, much later. Similarly for aspect markers; they are very
different. So you can see that if you just look at the POS tag distributions,
the two corpora look pretty different. Therefore, if you build your POS
tagging model from modern Chinese and use that for old Chinese, the result
could be
really bad. So is there anything they actually share? One thing they share is
the characters. By that I don't mean the set of characters itself, in the
sense of how many characters there are in Chinese -- nobody knows. I don't
think there's any single person in this world who really knows. A conservative
estimate would be something like 60,000 characters. But in normal daily
conversation, or even if you look at the Chinese Penn Treebank, the number of
distinct characters is only about 5,000, so if you know 8,000 characters,
you'll be fine. Still, very often I will see someone's name that I don't know
how to pronounce because I just don't recognize the character. I don't think
there is a single person in the world who knows all the characters, or even
how many characters there are. But the meaning of characters normally doesn't
change that much over time. So here is one character that means top or up. It
can appear in this word, which means top, so that's a localizer. It can also
appear in this one, which means climb up. And this one means Shanghai, which
is just a name. So you can imagine this character can appear in different
words, and those words will have different POS tags, but the meaning of the
character itself doesn't really change that much over time. And the POS tag is
sometimes [indiscernible] related to the meaning, because a character very
often is a word by itself -- of course, that's not always true. So what we use
is something we call a cTag, which is just the POS tag of a character.
What we do is say, okay, for each character there can be multiple cTags, and
you want to learn those tags. We actually don't have cTag labels in our
corpus, so we just make a very simple assumption: if a word is tagged LC, a
localizer, we say each character in it also has LC as its cTag. Then you just
go through your corpus and count the frequencies. So you can see, for this
character, how many times it is part of a localizer and so on; this column is
for modern Chinese and that one is for old Chinese. The frequencies are
different because the corpus sizes are different, but if you look at only the
top cTag, which is like the core meaning, that actually remains roughly the
same. So we are saying maybe that's something they actually share.
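As a rough sketch of that projection (my own illustration; the function names and example tags are not from the paper), the cTag counts could be collected like this:

    # Sketch: project word-level POS tags down to characters ("cTags")
    # and keep, for each character, its most frequent cTag.
    from collections import Counter, defaultdict

    def collect_ctags(tagged_corpus):
        """tagged_corpus: list of sentences, each a list of (word, pos) pairs."""
        counts = defaultdict(Counter)
        for sentence in tagged_corpus:
            for word, pos in sentence:
                for ch in word:          # every character inherits the word's POS
                    counts[ch][pos] += 1
        return counts

    def top_ctag(counts, ch):
        return counts[ch].most_common(1)[0][0] if counts[ch] else None

    # e.g. counted on modern Chinese (CTB) and used as a feature for old Chinese
    modern = [[("上面", "LC"), ("上边", "LC"), ("上海", "NR")]]
    ctags = collect_ctags(modern)
    print(top_ctag(ctags, "上"))   # the most frequent cTag for this character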
Another thing shared by the two languages is word formation patterns. For
example, suppose a word is a noun, it has two characters, and the cTag for each
character is [indiscernible]; then we write this as a pattern. Here is an
example: this character means politics and this one means place, and politics
plus place means government, right. So there are different kinds of word
formation patterns. This column is the raw count in each corpus and this is
the percentage. The percentages are not identical, but you can see similar
patterns in the two corpora, so maybe the patterns will also be useful. That
means the cTag can actually give some information about the POS tag of the
whole word. That's the idea.
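To make the pattern idea concrete (again a sketch of mine, with illustrative tags; ctag_of stands in for a character-to-cTag lookup such as the projection above):

    # Sketch: word formation patterns as sequences of character cTags.
    from collections import Counter

    def pattern_distribution(tagged_corpus, ctag_of):
        """tagged_corpus: list of sentences of (word, pos) pairs."""
        counts = Counter()
        for sentence in tagged_corpus:
            for word, pos in sentence:
                pattern = "+".join(ctag_of.get(ch, "UNK") for ch in word)
                counts[(pattern, pos)] += 1
        total = sum(counts.values())
        return {k: (c, c / total) for k, c in counts.items()}  # raw count and share

    # e.g. a two-character noun whose characters are both noun-like
    corpus = [[("政府", "NN")]]          # "politics" + "place" = "government"
    print(pattern_distribution(corpus, {"政": "NN", "府": "NN"}))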
Now, just to summarize: the two corpora actually differ a lot with respect to
word length distribution and POS tag distribution, but they share a lot of
information about cTags and word formation patterns. The hypothesis is that if
you just add CTB directly to your training data, very likely the performance
will not improve, but if you add cTags as features, maybe you will get some
improvement. Just to show you the results: once again, this is for word
segmentation and POS tagging. Here I just show the size of the training data
and test data, and this is only for the old Chinese; I did not include a number
for CTB. CTB, of course, is huge, like one million words, and we use CTB only
for training, as you can imagine. This just shows which tags we use when we do
word segmentation. We call those tags position tags to distinguish them from
the cTags: a cTag is the POS tag of a character, and a position tag just says
whether a character is at the beginning of a word, the end of a word, and so
forth. We use a standard CRF tagger for this, with our standard feature set,
but now we add the cTag features. The frequencies of those cTags come from the
training data only; otherwise it would be cheating. So these are the basic
features, which are just character unigrams and bigrams. This is a cTag
feature; the zero here means we look at the current character and take the
most frequent cTag for that character. And the DLG features, which I mentioned
before, are just additional features we used here.
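A minimal sketch of the kind of per-character feature templates being described (the template names and window are my own; this is not the exact feature set from the paper):

    # Sketch: per-character features for CRF-based word segmentation,
    # combining character n-grams, the most frequent cTag of the current
    # character (learned from training data only), and an unsupervised (DLG) tag.

    def char_features(chars, i, top_ctag, unsup_tag):
        """chars: list of characters; top_ctag/unsup_tag: per-position lookups."""
        prev = chars[i - 1] if i > 0 else "<S>"
        nxt = chars[i + 1] if i + 1 < len(chars) else "</S>"
        return {
            "C0": chars[i],                          # character unigram
            "C-1": prev,
            "C+1": nxt,
            "C-1C0": prev + chars[i],                # character bigram
            "CTAG0": top_ctag.get(chars[i], "UNK"),  # cTag feature at offset 0
            "DLG0": unsup_tag[i],                    # unsupervised position tag
        }

    sent = list("上海大学")
    print(char_features(sent, 0, {"上": "LC", "海": "NN"}, ["B", "E", "B", "E"]))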
Now let me show you the numbers; let's look at the first group here. Suppose
you use only the basic features -- BF means the basic features, the character
unigrams and bigrams. This row means you use the basic features from the old
Chinese corpus only; that's the baseline. This is when you use the data from
modern Chinese (CTB) only, and you can see this result is awful. It's worse
than the baseline where you just add a break after every character. And this
is the result when you take the union: not as awful, but still worse than the
baseline. So now we say:
the basic features we get only from the old Chinese corpus; what about the
cTags? When you use cTags from the old Chinese corpus, the result is actually
a little bit worse, though not much worse. The reason is that this corpus is
pretty small, and because we used a very simple way to get the cTags, they are
not very reliable, and so on. But if you get the cTags from the Chinese Penn
Treebank, the modern Chinese, you get some improvement, and it's certainly not
getting worse. And now if you add the DLG features, you get some more
improvement. You might say, oh, maybe the improvement is modest because you
already have, you know, a fair amount of
training data. What if you have much less training data from the old Chinese?
So for this experiment, we look at the training data for old Chinese and take
only a small percentage: instead of 100 percent, we take, say, ten percent, and
then compare the performance. This red line is when you only use the old
Chinese corpus, this one is when you add the cTag features from CTB, and this
is the line where you also add the DLG features. You can see that when you
only use ten percent of the training data from the old Chinese corpus, you
actually get a much bigger improvement, because now you really have very little
data. So the modern Chinese data, even though it's very different from old
Chinese, gives you a bigger boost. And
similarly, for POS tagging, we get basically the same kind of result. The
feature set is similar, except that we use word unigrams and bigrams, and then
you can add word affixes. For Chinese, it's hard to decide whether something
is an affix or not, so what we do is treat the prefix as just the first
character of a word and the suffix as just the last character of the word. The
cTag of an affix is then simply the cTag of that character.
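A rough sketch of those word-level templates (again my own names, not the paper's exact templates):

    # Sketch: word-level features for POS tagging with affix and cTag-of-affix
    # features, where the "prefix"/"suffix" are simply the first/last characters.

    def word_features(words, i, top_ctag):
        w = words[i]
        prev = words[i - 1] if i > 0 else "<S>"
        return {
            "W0": w,                                   # word unigram
            "W-1W0": prev + "_" + w,                   # word bigram
            "PREFIX": w[0],                            # first character
            "SUFFIX": w[-1],                           # last character
            "CTAG_PREFIX": top_ctag.get(w[0], "UNK"),  # cTag of first character
            "CTAG_SUFFIX": top_ctag.get(w[-1], "UNK"), # cTag of last character
        }

    print(word_features(["政府", "工作"], 0, {"政": "NN", "府": "NN"}))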
So it's the same kind of result. With the basic features, if you only use the
old Chinese corpus, that's the baseline. If you use modern Chinese, you get a
worse result, and when you use the union, it hurts, right. When you add the
cTags, you get some improvement, and if you use the cTags from modern Chinese,
you get [indiscernible] improvement. Once again, if you add the affix features,
you get some improvement. This is the same chart as before: when you have less
training data from old Chinese, you get a much bigger improvement, something
like four percent here, just from adding the data from modern Chinese. So to
summarize this part: some people would say, oh, those are really the same
language. We don't want to argue about that; I would say they are different
languages, because I cannot understand old Chinese without what I learned in
high school or middle school. They are actually very different with respect to
word length and POS tag distribution. As a result, if you add modern Chinese
data directly, it does not really help. But if you find what they really share
and then [indiscernible] that as features, you actually get some improvement,
especially when you have a very small amount of data from old Chinese. So to
just summarize, I introduced three different methods, and they look at the
problem from different angles. The first two use a semi-supervised setting,
which means you don't have labeled data from the target domain: you can either,
you know, combine the data, which is [indiscernible], or you can use
unsupervised learning and then use its output as features to train your
supervised method. The last one looks at two closely related languages, and I
think the key point there is that you really have to find out what they share,
because they can be very different. For future work, I'm sure there are other
features we can use, and we want to identify those. A more interesting
question is whether those features can be identified automatically, without
prior knowledge of the domains or languages; in the sense that if you give me
the domain and give me the data, can I figure out, without using my expert
knowledge, what features I should look at. And then we want to apply this to
other kinds of tasks. Thank you.
>>: It seems ancient Chinese and modern Chinese are in this unique relationship
in that one is a descendant of the other. A more interesting, more applicable
case, I think, would be sister languages that actually share a common ancestor.
Now, finding an ancestor where you actually have good data like this would be
difficult, though in some families you might be able to find that. So being
able to establish a relationship with that ancestor and then [indiscernible]
through that ancestor to some other modern language that has resources would be
interesting, because that is directly applicable to a [indiscernible].
>> Fei Xia: Yeah.
>>: For the problem of the ancestor, I guess, you know, you could get into a
reconstruction kind of thing too, where you basically -- and this has been done
a lot in linguistics, obviously -- reconstruct some common ancestor. If you
then work with that reconstructed ancestor and somehow use information from
that ancestor to help inform descendant languages, that would actually be very
interesting.
>> Fei Xia: Yeah, I'll just say there is this one, actually, [indiscernible].
There is actually a Middle English treebank done by Penn, and not only Middle
English; they also have middle -- maybe Middle German or Middle High German or
something like that. They have these middle -- not ancient, but middle --
stages for several languages, so I think that would be interesting. Suppose
you have Middle English and Middle French, I don't know.
>>: I think the problem there, okay, so you could use similar techniques here.
Again, the problem is what are the characteristics, what are the features that
you need to reconstruct the tagger or something for one of these ancestor
languages.
>> Fei Xia: Right.
>>: But in a way, what are the descendants of the modern -- excuse me, of
Middle High German or whatever it is. Yeah, Middle High German -- what are the
descendants? Whether it's one -- I guess there are multiple. It's not just
High German, but also [indiscernible] or something. But you obviously can go
further. What's nice about this is you're going very far back in time. I
mean, this is 2,000 years ago, which is actually -- it's phenomenal.
>> Fei Xia: Thank you.
>>: There's a language spoken largely in Kenya called Sheng, which is a slang
version of Swahili and English mixed together. They were going to call it
Dheng with a DH in there, but it's Sheng. And the corpus for Sheng comes in
the form of whatever is cool, you know: social media, advertising, pop songs.
But it's always changing, because it's slang and it's current. So if we had
to, like, take a snapshot of Sheng in time and use the established corpora for
Swahili and English to try and tag Sheng, forgetting the fact that it's always
changing, would this approach be useful, since Sheng is still considered
resource-poor right now [indiscernible] any particular snapshot of Sheng is a
resource [indiscernible].
>> Fei Xia: Yeah, I think the key challenge, and something I really want to
look into, is whether you can identify those shared properties automatically,
without knowing the language. The reason we looked into this is that, as a
Chinese speaker, I have the intuition that the character meanings do not really
change too much. But on the other hand, I'd like to see if there's any way to
find out what features are really shared. I imagine one could definitely run
some experiments to see this, and it also depends on what kind of change it is:
just spelling, or the alphabet, or the grammar? What is changing? For example,
what we did here -- actually, for parsing, I would imagine that word order
changes so much, but there could still be something that continues. What I'm
saying is that it's not [indiscernible] to find out what exactly is shared by
the languages. I think that's kind of --
>>: Multi-heritage languages, coming from two different language families,
like [indiscernible] children or, you know, things like that. You wouldn't
expect [indiscernible]. Now you have it. But what is one [indiscernible]
parent of the language [indiscernible] one parent is Arabic and one parent is
[indiscernible] what do we do now. So could we approach it even then and say
oh, look, we have [indiscernible] corpora or we have the [indiscernible]
corpora, we have [indiscernible] corpora, now we can get Urdu corpora.
>> Fei Xia: I think one thing I'm always fascinated by but never got time to
look at is language code switching. Usually when you code switch, you're only
talking about two languages, but when you have multiple languages, the question
is what is being switched, right. Is it a word, is it the syntax, what is
being switched? So when you have a language that comes from multiple parents,
which part do you get from parent one, which part from parent two, and what's
the interaction? I think those would be very interesting questions to look
into. But that's something I always feel, okay, I can look into when I have
free time. It's something I'm always fascinated by, yeah.
>>: I don't think anything in your methodology says that these have to be sort
of parent/child relationships. Do you have any [indiscernible] that do this on
very closely related languages, something like [indiscernible], which are not
quite mutually intelligible but really close?
>> Fei Xia: There has been a lot of work done for related language pairs. For
example, this Google treebank [indiscernible] -- I guess maybe you have heard
of that already. Basically, version one has six languages, and for version two
I think they have nine languages. What they did, you can imagine: you can
train on one language and then use the model, but do the delexicalized version,
meaning you only have the rules but no lexical items, and parse the other one.
Conceptually, if those languages are very similar, then the performance should
be better, and that's consistent with some findings they have. But definitely,
you don't have to -- I mean, the two languages do not have to be in this kind
of ancestor-descendant relation. They can just be related, and now the
question is, if you know they're related and certain things are shared, you can
use that. But certain things you know have changed already. For example, you
know one is SVO and one is SOV, so what do you do? Do you do pre-processing,
handle it in the model, or do it in post-processing? How do you incorporate
the knowledge you learned? I think that would be very interesting, in the
sense that you can do the first run knowing nothing about the languages, but
after the first run you actually know something about the language, and then
how do you incorporate the knowledge you just learned? It's like what
[indiscernible]: from the old data, you get some
information about the grammar, and you want the second run to do something
better, right, because you know the language already. But definitely, the
constraint is always time, not my interest. I'm interested in a lot of stuff,
but in the short term I might not be able to work on that.
>>: Actually, it's a really interesting case, because with the ODIN data, using
data from all the different languages, 1,200 languages, the alignment data
basically helped inform languages that didn't have much data. You could
actually use the alignment information from the other languages to help inform
the tools you were developing for the resource-poor language. So even if the
languages aren't related, there is information you can pull in if you have
enough data and enough information from -- enough signal.
>> Fei Xia: It's almost like nearest neighbors, right; instead of just the
nearest, you use all the neighbors, see the differences, and somehow take that
into account. So in theory, we don't really require any kind of relation
between the languages; it's just that the closer they are, the more likely you
are to get some improvement.
>>: More shared features.
>> Fei Xia: More shared features, right.
>>: To backtrack to a more technical question: in the first portion of the
technique, when you were doing training data selection, how were you actually
deciding which portion of the ten percent of the training data to use?
>> Fei Xia: Oh, which portion of ten percent? Like, you are saying, is that
ten percent or 20 percent -- how do you decide?
>>: Well, yeah. If you're taking ten percent or whatever percent, how do you
decide which?
>> Fei Xia: Oh, that's where you have to define the similarity function. What
you do is, you have training data and you have test data, and then you compare
how similar they are. For example, you can use entropy: build a language model
on one domain and then test it on the other one. There's a very good paper on
that. You can use any similarity function you want, but the idea is that you
look at each sentence in your source domain, compare it with the target domain
data, see how close they are, and then define what you mean by close.
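As a rough sketch of one such similarity function (my own illustration using a simple character-level unigram model, not the specific method discussed), you can score each source-domain sentence by its cross-entropy under a target-domain language model and keep the closest ones; in practice a stronger n-gram or neural language model would be used.

    # Sketch: cross-entropy-based training data selection.
    # Score source-domain sentences with a unigram LM estimated on a
    # target-domain sample; lower cross-entropy = more similar to the target.
    import math
    from collections import Counter

    def unigram_lm(sentences):
        counts = Counter(tok for s in sentences for tok in s)
        total = sum(counts.values())
        vocab = len(counts) + 1
        # add-one smoothing so unseen tokens get nonzero probability
        return lambda tok: (counts[tok] + 1) / (total + vocab)

    def cross_entropy(sentence, prob):
        return -sum(math.log2(prob(tok)) for tok in sentence) / max(len(sentence), 1)

    def select(source_sents, target_sents, fraction=0.1):
        prob = unigram_lm(target_sents)
        ranked = sorted(source_sents, key=lambda s: cross_entropy(s, prob))
        return ranked[: max(1, int(len(ranked) * fraction))]

    # toy example: keep the half of the "source" sentences closest to the target
    target = [list("今天天气很好"), list("天气不错")]
    source = [list("天气很好"), list("股票市场大跌")]
    print(select(source, target, fraction=0.5))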