>> So let me first of all welcome you all. This is the first time that we have the symposium in this
new building and a lot of things are more convenient here. You don't have to register anymore.
You can just walk in. And I hope you have a chance also to see a little bit of the -- at least
through the glass door see a little bit of the atrium and the nice area there.
This is, I just realized that this morning, but this is actually the fifth year that we do the symposia.
We just complete -- with this one we complete five years and 15 symposiums, which is -- yeah,
it's quite a lot. I'm pretty pleased that we went this long because I remember that still in the first
year we were actually not sure if we would go beyond one or two events and whether there would
be enough interest and everything. So it's kind nice to see that every time this fills so many
people.
So today we'll have two talks and then three demos. The talks are in sequence. The demos are
simultaneously. It's just walk through and talk to the people. The first talk is by Douglas Downey
and Faye will introduce him. And the second talk then after that by Chris Quirk and I'll give a little
bit more introduction for that.
>> Hello. So Doug Downey is going to be our speaker from the University of Washington, where he's a PhD
student in the CSE Department working with Professor Oren Etzioni. He plans to graduate this fall,
and after that he's going to be an Assistant Professor in the EECS Department
at Northwestern University. And his research interests are in the area of (inaudible), machine
learning and artificial intelligence.
Doug was part of the (inaudible) Project, a system that uses the information on the (inaudible) a
knowledge base, and his dissertation focused on probabilistic models to enable information
extraction at high precision. Doug is also a Microsoft Live Labs fellow and he did his internship at
Microsoft in 2006.
So let's welcome Doug to give the presentation.
(Applause)
>> Douglas Downey: Thank you. Yes. So I'll start by motivating this autonomous web scale
information extraction task and placing it in a broader research context.
So information extraction is the task of producing machine understandable data from text. So as
an example we see this phrase "Paris, the capital of France." We'd like to be able to conclude
that these objects, Paris and France, share this relation: Capital of. And because text is
sometimes wrong and we can't do extraction perfectly, we'd like to assign some probability of belief
in this fact. So in this case 0.85.
So why should you care about this problem? Well, there are actually a number of cutting-edge
applications that information extraction underlies. The one I'll talk about is improving search
engines.
So if we wanted to know the capital of France, we could just type that phrase into a search engine
and get the answer immediately. But there's a set of questions that are harder to answer using
existing search engines.
So here are some examples. Say I'm looking for a job and I want to know which nanotechnology
companies are hiring. To answer this question on the web today I'd have to first compile a list of
nanotechnology companies and then manually check each one to determine whether or not they
were offering jobs.
There are other questions like: Who's won an Oscar for playing a villain? It's difficult for the same
reason and that is because there's not a single page that contains the answers. So we can find
the answers on the web but that requires compiling information from multiple pages. And doing
that currently is a tedious and error-prone manual process. Information extraction aims to
automate that task so these questions, these more complex questions, can also be answered in
an instant with a search engine.
There are a couple of other motivations, key problems in artificial intelligence in the semantic
web. I won't say too much about them, but the point I want to make is that in order to deliver on
the promise of any of these applications we need information extraction that scales to the web
and that's a challenging problem. And this is what I'll be focusing on for the rest of the talk.
So why is this challenging? Well, you can appreciate the challenge if you contrast web
information extraction with what's been classically done in information extraction. So a classical
information extraction task looks something like this. We're given the text of say a seminar
announcement and our goal is to build an extractor that can take any of these announcements
and extract the values of certain predefined fields, like the date of this seminar is June 5th. It's in
Room 302 and so on.
There's a set of characteristics that makes this task tractable. One is it's small scale. We're
working with only a few megabytes of text typically. Further it's defined on a restricted domain.
So if I'm not working on seminar announcements, maybe I'm working on mergers and
acquisitions and news releases. But in any case it is a narrowly restricted domain of text that I'm
working with.
And then lastly I start with hand-labeled examples. Right? So in the seminar case, before I
build my extractor a human annotator goes through and identifies the text which corresponds to
fields I want to extract. And then I can use those examples to train my extractor, which makes
this task much easier than it otherwise would be.
Okay. So contrast that with web information extraction. All right. In web information extraction
we start with this messy and multifarious web and we want to extract structured data. So for that
nanotechnology companies question I need a table that looks something like this one. And of
course the web is not small scale, it's massive scale. And any extraction that we do needs to
solve a set of computational issues as a result.
Further, the web of course is open domain. This is what makes it so valuable is the fact that it
expresses information on a multitude of topics, but that's also what makes it difficult to perform
information extraction over it. So we need to not only extract this company's table, we need to
extract information on essentially every concept that we happen to see expressed on the web.
So that's very challenging.
In particular this implies that we can't count on having hand-labeled examples. You can
imagine labeling examples for some particular concepts, but given that the number of concepts
we are going to extract is so large, we can't count on having hand-labeled extractions for each
concept.
Okay. So that's the problem. Now I'll -- here is an outline for the rest of the talk. I'll give a little
more background on sort of how this web information extraction is done and then I'll get into more
technical material on estimating the probabilities that extractions are correct. And then if there's
time I'll also talk about this challenge in the long tail of sparsely mentioned facts.
Okay. So given the challenges I outlined you might wonder is this web information extraction
possible? It turns out that it is. And the key is to utilize what I'll call generic extraction patterns.
So this is an example, all right, the phrase "Cities such as Chicago," seems to indicate that
Chicago is a member of the class City, or more generally the extraction pattern "C such as X"
indicates that X is an element of the class C. This particular pattern is due to Hearst in '92. And
so in this case actually I should say these are generic in the sense that for whatever type of
information you want to extract you can exploit this pattern. So cities, films, nanotechnology
companies, etcetera, you can extract instances of those classes using this same pattern.
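To make the idea of a generic pattern concrete, here is a minimal, hedged sketch in Python; the regular expression, the example sentences, and the capitalized-word heuristic are invented for illustration, and real systems like the one described below use lightweight syntactic processing rather than a single regex.

```python
import re

# Illustrative only: apply the generic "C such as X" pattern with a regex
# to pull (class, instance) pairs out of simple sentences.
SUCH_AS = re.compile(r"\b([A-Za-z]+)\s+such\s+as\s+([A-Z][A-Za-z]+)")

sentences = [
    "We visited cities such as Chicago last spring.",
    "Classic films such as Casablanca still draw crowds.",
]

for sentence in sentences:
    match = SUCH_AS.search(sentence)
    if match:
        class_word, instance = match.group(1).lower(), match.group(2)
        print(f"{instance} is an element of the class '{class_word}'")
```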
Of course, as most of you are aware, natural language is not usually so simple so this sentence:
"We walked the tree-lined streets of the bustling metropolis that is Atlanta." You can imagine
trying to extract that Atlanta is a city from this sentence. To do so you'd have to solve at least a
couple of tough natural language processing problems.
And these are the types of problems like parsing and subclass discovery that natural language
processing has concerned itself with. And they're important problems and people should solve
them. But they sort of remain unsolved and the best solutions don't scale very well and so the
approach I'm going to advocate for this talk is to say well if we're working on the web who cares
about these complicated sentences. All right. Let's just go back and try to find more simple ones
and see what we can do with those.
You know if we go to a search engine we find we don't have to depend on this complicated
sentence to learn that Atlanta is a city, that there are plenty of simple ones out there, as well.
So, I should mention this extraction pattern notion is more general than just this Marti
Hearst pattern, right? "Bloomberg, Mayor of New York City," indicates that Bloomberg is
the Mayor of New York City. More generally we can use this construction "X, C of Y." And
there's a small set of these simple syntactic constructions you can use to actually build up a large
set of different facts. And I'll give a more concrete example of this in a second. But the
take-home message here is that when you're working on the web you have scale and
redundancy that makes a lot of facts easy to extract.
Okay. So I'll quickly show a demo of the Text Runner system, which perhaps some of you have
seen. But Text Runner is a system built by Mike Cafarella and Michele Banko at the University
of Washington that does this type of pattern-based extraction. So you can ask it
questions like: "What kills bacteria?" And the answers you get back look something like this. So
you can see Text Runner identifies essentially noun-verb-noun triples that share a relationship.
So these are all the things that share the relationship, kill, with the noun, bacteria. And so
antibiotics and benzoyl peroxide and so on are examples.
Then if you look at the sentences that support this fact you will see they all follow roughly this
extraction pattern kind of form. Although Text Runner is using a more general notion doing some
lightweight syntactic processing to identify these statements. But you can say benzoyl peroxide
kills the bacteria, etcetera. I encourage you if you haven't to play with Text Runner because it
really can answer a remarkable number of questions. It's pretty fun to play with.
So okay. So the catch, though, is that extraction patterns are imperfect. So if we apply our "X, C
of Y" pattern to this sentence, we will extract that Eric Johnson is the C.E.O. of Texas
Instruments and that's correct. However if we apply the same pattern to a different part of the
sentence we will extract that Texas Instruments is the Mayor of Dallas. All right. So that's an
error.
So there's a key problem here determining which of the extractions that these patterns produce
are actually correct. And what makes this hard is that we need to be able to do it without
hand-labeled examples and we need our solutions to scale to a corpus the size of the web. So
this is the task I'm going to focus on.
My PhD thesis, stated relatively free of jargon, says that we can do this if we leverage redundancy
in large text collections and probabilistic models, which I'll describe.
Okay. So now I'll get into this more technical topic. Estimating the probability that an extraction
is correct. I solve this problem using what I call the Urns Model of Redundancy.
So when we're talking about using redundancy to determine whether or not an extraction is
correct, there are two intuitions we can draw. One of these is repetition. So notice this phrase,
"Chicago and other Cities" occurs more frequently than the phrase "Illinois and other cities." And
so here we can just use the simple intuition that facts which are repeated more frequently are
more likely to be correct.
But you'll notice "Illinois and other cities" actually occurs a lot more frequently than you might
expect. Right? And that's because of sentences like "Springfield, Illinois and other cities." Right.
So if we look at a different pattern that doesn't have that particular failure mode we can do a
much better job of determining which of Chicago and Illinois is a real city.
So those are the intuitions. My goal is to formalize these things in such a way that we can
answer this key question. So given the term X and a set of sentences about a class, what's the
probability that the term X is a member of the class?
So I'm going to build up an answer to this question. I'll start by restricting to the case where we
only have a single pattern. So think of one pattern like "Cities such as X." And now I'll rewrite
this core question as follows. So if we see X K times in a set of N sentences, what can we conclude
from that information about the probability that X is an element of the class?
So now we'll start with a series of examples because this problem is a little more subtle than you
might think. And in particular previous techniques for combining evidence in information
extraction actually don't model it correctly.
And then I'll show my proposed solution to this problem, which does in fact take all the important
factors into account. So here's an example of 10 sentences that contain the phrase, "countries
such as." So N equals 10 here. And we can summarize the information that we've seen as
follows. So K is the number of times we saw each of these extractions. And I'm using green to
indicate correct extractions and red to indicate errors. So in this set of 10 there is one error.
Africa was misextracted as the name of a country.
So one way we might think about solving this problem is to say, well, let's treat each sentence as
an independent assertion that the extraction is correct and then all we need is for at least one of
those sentences to be in fact a correct extraction.
So if we use that approach we'll have this -- we'll be using a noisy "Or" model, which has been
employed in information extraction in the past. It assigns a probability equal to one minus the
probability that all of the extracted sentences are wrong. So in this case 1 - (1 - p)^k, where p is
the probability that a pattern yields a correct extraction. And in
our example, right, nine out of the 10 sentences or extractions are correct. And so P is equal to
0.9.
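Here is a minimal sketch of that noisy-OR calculation, assuming the p = 0.9 from the example above; the code is illustrative and not the speaker's implementation.

```python
# Noisy-OR: each of the k supporting sentences is treated as an independent
# assertion that is correct with probability p, so the extraction is wrong
# only if all k assertions are wrong.
def noisy_or(k: int, p: float = 0.9) -> float:
    return 1.0 - (1.0 - p) ** k

for k in [1, 2, 3, 5]:
    print(f"seen {k} time(s): P(correct) = {noisy_or(k):.4f}")
```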
Here are the probabilities the noisy "Or" assigns. The point I'd make about these is they're
actually quite good for this example. So given that we have to assign the same probability for
fixed K, these probabilities are actually close to optimal. But what I'm going to show with further
examples is that the noisy "Or" ignores critical aspects of this problem. So it doesn't include the
sample size and it also doesn't model the distribution of target elements, which I'll make more
clear in a second.
First let's talk about sample size. So if we wanted to find the countries in the world, we'd look at
say 50,000 sentences, and when we do that, we see the behavior of the extractions
changes radically. So at the top, United States, China, these very frequently mentioned countries
correctly get high probabilities from the noisy "Or." You can see the problem if you take a random
sample of the Singleton extractions. And so what you're seeing here is that overwhelmingly the
extractions that we see once in a set of 50,000 country extractions are not really countries.
Right? All of the real countries occur more than once. However, the noisy "Or" still assigns high
probability to these things. So that's a problem. And in general, the noisy "Or" ignores sample
size and so it becomes inaccurate as sample size varies.
Okay. So you might think that is easy to fix. Right? The problem is that it's not this value K that
matters. It's something like K over N. So the frequency with which something is extracted. And
so we can propose this frequency model, which instead of using K uses some factor of K over N.
And in fact we can solve this problem. So we still assign high probabilities to the correct
extractions, but we assign appropriately low probabilities to these singleton errors.
The problem is that this frequency model doesn't take into account any information about the
target set. So say we took -- we performed this experiment again keeping everything the same.
So n is fixed and this underlying probability P is fixed, but we are just extracting a different class.
So cities instead of countries. And when we do this, we see very different behavior.
So once again, well-known cities are at the top of this list, and get high probabilities. But when
you sample the Singleton extractions we find that actually the majority of these things are real
cities and this shouldn't be a huge shock. Right? There are hundreds of thousands of cities in
the world. And so some of these rarely mentioned cities are in fact real cities. Some of these rarely
mentioned extractions are just cities that don't get mentioned frequently on the web. Right?
However, the frequency model assigns low probabilities to these, which is a problem. So in
general we need to include information about the distribution of elements that we're extracting in
order to assign correct probabilities.
Okay. So here -- yes?
>> Question: (Inaudible) --
>> Douglas Downey: Sure.
>> Question: How did you sample the correct (inaudible) data that you are basing this to be on?
>> Douglas Downey: Yes. So yeah, this is actually -- I mean, the fact that four out of five of
these are correct is based on a sample of a thousand extractions from this set of
50,000 and so yeah, this is actually close to the real observed proportion. Yeah.
Okay. So here's the solution that does take these factors into account. It takes the form of a
combinatorial balls-and-Urns model. So this Urn is filled with labeled balls and the labels
represent anything we could extract for a particular class. Here is an Urn for the city class. And
some of these extractions are correct, some of them are errors and different labels can appear on
different numbers of balls in the Urn. Right? And then when we see one of these extraction
phrases, we model the text that appears in that phrase as being drawn from the Urn.
So to be slightly more formal, the Urn is defined by these parameters. We have our target set
and our error set and of course we don't get to observe these. But then we have these
distributions of the number of repetitions of different elements of the different sets. So going back
to our example, here is what this looks like. In the error case, U.K. appears twice, Utah appears
once. And so the distribution of error labels is two to one. The reason why I'm mentioning this is
it's these quantities that we're going to use to compute probabilities. And I'll also show, in a
moment, how we can estimate these distributions without any labeled data. So this is the key to
being able to compute probabilities without labeled data.
So if you recall, this is the question we're trying to answer: X appears K times; what's the
probability it's a member of the target class? In terms of the Urns model you rewrite this as "if
we see X K times in N draws from the Urn, what's the probability it came from the target
set?"
And if we have these distributions num C and num E, this is just a combinatorial calculation that we
can do. And so this is the expression. What I'd point out is that this expression actually includes
the factors that were left out of the noisy "or."
So you can see N appears in this expression as does this distribution num C.
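As a hedged sketch of that calculation, here is one way to write it down, using the common binomial approximation to drawing N balls from the Urn; the repetition counts below are made up for illustration, and in the real model num C and num E are Zipf-shaped and estimated from unlabeled data, as described next.

```python
# Approximate urns calculation: a label with r copies out of s total balls
# is seen k times in n draws with probability proportional to
# (r/s)^k * (1 - r/s)^(n-k). Sum that weight over target and error labels.
def urns_probability(k, n, num_C, num_E):
    s = sum(num_C) + sum(num_E)          # total balls in the urn
    def weight(r):
        return (r / s) ** k * (1 - r / s) ** (n - k)
    target = sum(weight(r) for r in num_C)
    error = sum(weight(r) for r in num_E)
    return target / (target + error)

num_C = [100, 40, 20, 10, 5, 2, 1]       # hypothetical counts of correct labels
num_E = [3, 2, 1, 1, 1]                  # hypothetical counts of error labels
print(urns_probability(k=2, n=50, num_C=num_C, num_E=num_E))
```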
Okay. So the key to being able to use this for web-scale information extraction is
that we can actually estimate these distributions without labeled data. So now I'll describe how
that's done. This is the point at which modeling with Urns actually pays off because there are
assumptions you can make that actually hold across all classes. And then you can use unlabeled
data to estimate the things that vary across classes.
So the first assumption is to assume these num functions are Zipf distributed, which is invariably at least
approximately true in practice. And then with a couple of other assumptions we can learn those
Zipfian parameters.
So here is a visual depiction of what's going on. Right. We had a target set Zipf distribution and
an error set Zipf distribution. And what we observe is a mixture of those two things. And they're
mixed by this extraction precision parameter P. Right? So we observe this gray line on the
bottom and want to infer what the green and red lines are in the top.
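As a small illustration of that mixture, and with all numbers invented, the observed frequency curve can be thought of as a p-weighted combination of a target Zipf curve and an error Zipf curve; this is not the estimation procedure itself, just the generative picture.

```python
import numpy as np

def zipf(num_labels, exponent):
    # Zipf-shaped frequency distribution over num_labels ranked labels.
    ranks = np.arange(1, num_labels + 1)
    weights = ranks ** (-exponent)
    return weights / weights.sum()

p = 0.9                              # pattern precision: fraction of draws from the target set
target = zipf(300, exponent=1.0)     # hypothetical target-set frequency curve
error = zipf(1000, exponent=1.5)     # hypothetical error-set frequency curve

n = 50_000                           # number of extraction sentences observed
expected_target_hits = p * n * target
expected_error_hits = (1 - p) * n * error
print("most common target label expected count:", round(expected_target_hits[0]))
print("most common error label expected count: ", round(expected_error_hits[0]))
```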
And what we're going to do is assume a couple of things. One is that this error set is actually the
same. Well, the error set is not the same, but the frequency distribution of the errors is the
same irrespective of what we're extracting. This tends to be approximately true in practice. Also
we're going to assume that we can get this pattern precision. So for any pattern, whether we're
extracting with "insects such as" or "cities such as," the probability that a given randomly drawn
sentence is correct is the same. This also tends to be approximately true in practice.
So once we have done that there is actually only one free parameter in the system and that's the
C distribution and what we've shown empirically is that if you use standard, unsupervised
techniques like expectation maximization, you can actually arrive at this C distribution from just
the unlabeled sample that you observe. So here is the payoff. Right? This is the case that the
noisy "or" frequency models couldn't handle. They didn't assign correct probabilities in the
country and city case, but Urns in fact can. So Urns can tell the difference between the country
and city distributions and it is on low probability in the country case and higher ones in the city
case.
Okay. It's not just that example. We did an experiment that showed that the probabilities
assigned by Urns were much more accurate than noisy "or" and also another technique which I
won't talk about called PMI, which was used in the KnowItAll system. And so these were -- lower
is better in this graph if you are looking closely. But these were large improvements in log
likelihood performance.
And so just briefly I mentioned earlier that you could use multiple extraction patterns to do better
classification of extractions and it turns out to be very convenient to implement this idea in the
Urns model. So we can just treat each pattern as its own Urn and then adopt a particular
correlation model between the Urns. So target label frequencies are correlated between the
Urns. So if we see Chicago a lot with one pattern, we should see it a lot with the other. Whereas
error labels can sometimes be uncorrelated. So if we see Illinois which has these radically
different frequencies, we assume, ah, it's one of these uncorrelated error labels and we can
detect that Illinois is an error.
Okay. I mentioned earlier that efficiency is important. I won't go into the details. But if you make
some continuous approximations you can actually get a closed form expression for that
probability computation, at least in terms of gamma function. So you can compute these things
quickly, which is important for scalability.
Okay. So I started a little bit late, but am I --
>> (Inaudible).
>> Douglas Downey: Okay. All right. This last piece is brief. But I'll go through a little bit of it.
So there's this issue with the long tail of sparsely mentioned extractions and I'm going to show
how you can use language models to help solve this. So what do I mean by the long tail? Well,
there's -- invariably the extractions we see are Zipf distributed in this way. So say we are
extracting mayors. We'll see the common mayors, Bloomberg in New York City, a lot. And
there's a small number of those that we can be relatively sure are correct. The difficult part is that
there is this long Zipf tail of extractions that we're unsure about. Now Urns can assign accurate
probabilities to these extractions, which is good, but it can't tell which of those are correct. So say it assigns
a probability of .5 to this Zipf tail; it can't tell which half of those are actually correct and which
aren't because they are all extracted a small number of times. Right?
So this is -- on this tail you'll see mayors of more obscure cities, like Dave Shaver actually is the
mayor of Pickerington. However, it's not the case that Ronald McDonald is the mayor of McDonald
Land. That was just also something we extracted, right? Because the mayor of McDonald Land is
of course Mayor McCheese, which, if you grew up in the United States, you know.
Okay. So the technique -- I'll just go over this briefly since I'm running short on time. The
strategy, though, is to build a model of how those common extractions occur in text. So we'll
assume those common extractions are correct, find out how they appear in text and then we'll
rank the sparse extractions by how well they fit that model.
And the key points in -- well, the insight that my work introduced is the idea that you can use
unsupervised language models to solve this problem. And that gives you two advantages. One
is these language models are precomputed. So you can scan the corpus once, build a language
model and then just query it when it comes time to assess sparse extractions. Also language
modelers have been concerned with sparsity for a long time and we can exploit their techniques
to do better assessment.
So the idea that we're using here is of course the distributional hypothesis, which in our case is the
idea that instances of the same relationship tend to appear in similar
contexts. So notice "David B. Shaver was elected as the mayor of Pickerington, Ohio." The
sentence doesn't follow a generic pattern, but it does happen to exactly match the text on a
different page that Mike Bloomberg and New York City participate in. So if we can use these
other mentions we can potentially do better assessment than we could with just extraction
patterns alone.
So I won't go into this in depth. The baseline approach would be to build what's called a context vector:
here we build a vector whose components are the unique contexts in the corpus, and the
counts are the number of times the term appears with each context. And then we can compute
similarity using dot products. That has some problems, in that those vectors are really large
and the intersections are sparse, which makes it both inefficient and sometimes inaccurate
because of the sparse comparisons.
And so the language modeling technique I used was to build an unsupervised hidden Markov
model and effectively compress these context vectors into small distributions. In this
case we compressed each word's distributional information into a 20-element vector of floating point
numbers and did comparisons with that, and that resulted in computation that was an order of
magnitude more efficient and also improved accuracy quite a bit. Okay.
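As a rough illustration of the comparison step only, here is a sketch that assumes the small P(latent state | word) vectors have already been produced by the unsupervised HMM; the vectors below are random stand-ins, and cosine similarity is just one simple choice of comparison.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_state_distribution(n_states=20):
    # Stand-in for P(latent state | word) from a trained unsupervised HMM.
    v = rng.random(n_states)
    return v / v.sum()

trusted = [random_state_distribution() for _ in range(5)]   # e.g. well-known mayors
candidate = random_state_distribution()                     # a sparse extraction

def similarity(p, q):
    # Cosine similarity between two compressed state distributions.
    return float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))

# Rank a sparse extraction by how closely it matches the trusted ones.
score = max(similarity(candidate, t) for t in trusted)
print(f"candidate's best similarity to trusted extractions: {score:.3f}")
```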
So I'll just conclude, you know, I think that we've successfully moved from information extraction
to web information extraction. And not just based on what I showed today, but there's a whole
body of work that's helped us move forward in this case. But there are some big problems that
are open so the techniques I described in particular just produce a list of facts. So we can tell
that Daly is the mayor of Chicago and we can tell that Catherine Zeta Jones starred in "Chicago."
But as far as whether or not those are the same Chicago, we really have no idea. So there are these
tough reasoning problems, like entity resolution and schema discovery, that are still out there
and are really crucial to being able to answer those questions like: "Which nanotechnology
companies are hiring" automatically.
And then lastly there are some improvements in accuracy and coverage that could be made and I
actually won't say too much, but I'm a firm believer in more sophisticated language models really
helping with information extraction.
All right. That's it. Thank you.
(applause)
>> We'll take some time for questions. Hold on. Okay. Yes.
>> Question: Wondering whether you feel that the redundancy of the web in English is
sufficiently large that it wouldn't get much additional benefit if you were to replicate the technique
with the web in other languages or whether distribution of fact and instance tokens in different
languages are so different that they would -- that additional languages would add substantially?
So do you have any thoughts on that?
>> Douglas Downey: So yeah, I think it's a good question. I'd be tempted to side with the latter,
that there are -- the distribution of facts is sort of skewed enough in different languages that you
could get a lot of different facts if you ported these techniques to another language. I'm just
guessing. I don't have the data of course.
The good news is that I think most of these techniques, the ones I described, can be ported to
other languages. You can build language models in other languages. The lightweight syntactic
processing that Text Runner does to identify extractions, I would expect that there are analogues
to those types of things in other languages, as well. So I think it's something that could be tried.
Yeah?
>> Question: -- (inaudible) opened the presentation with the second question was "who won the
Oscars -- who won Oscars for playing a villain."
>> Douglas Downey: Playing a villain; right.
>> Question: What subproblems would that decompose to in this framework?
>> Douglas Downey: Yeah, so I think you need to -- so if I were to decompose that problem, I
would want to find actors who had won Oscars and then the roles that they played in the movies
that they won Oscars for and whether or not that role was a villain role.
So you could imagine rewriting that query as X played Y in Z. X won an Oscar for Z and Y was a
villain. Right? If you could search for the actual text patterns, then you could probably do a good
job of answering the question. However, right, there is this -- the task that I just described of
going from the question to sort of what it means in terms of individual components like actors and
roles, that's some of the stuff that I mentioned at the end is really an open problem. It requires
that we know the schema of what are the relevant attributes of an actor. Right? And some of
them are the roles they played. And that's not something that we necessarily know how to get
just yet.
Yeah?
>> Question: When you were combining the error and the correct value for a single error you had
a value P.
>> Douglas Downey: Right.
>> Question: Is that just a parameter? A single number that you choose?
>> Douglas Downey: It is. Yes. It's a number that you sort of observe and in fact we found that
the number 0.9 was fairly reliable for the Marti Hearst patterns, so X such as Y -- Y is a member of
class X about 90% of the time.
>> Question: And you got that empirically by looking at the patterns?
>> Douglas Downey: That's right. Yeah?
>> Question: So I don't know if you have time, but quickly I was just wondering if you could back
up to that and just explain a little bit more what you were doing in (inaudible).
>> Douglas Downey: Okay. Yeah, thank you for asking. Yeah, it was very fast. Although, yes, I
can't -- I can't go into too much detail, but I can say that so this model here, right, is the latent
state distribution for a particular term. So, right, an HMM computes a probability distribution
of term given latent state. Right? And you can invert that using Bayes' Rule. And that is all that
this compressed vector is. So we go over the corpus and we learn an unsupervised HMM
where the parameters assign high probability to the observed data. So we just use expectation
maximization to build this unsupervised HMM. And one of the things we get out of that is the
probability of latent state given term and that's exactly what that compressed vector is. So I don't
know if that helps but that's a little more detail.
>> Question: So term is a word, so is your state trying to get at topical stuff or (inaudible)?
>> Douglas Downey: It's -- yeah, it's whatever you happen to learn. But it happens to be the
case that in practice words with similar latent state distributions tend to have similar meaning.
Yeah?
>> Question: (Inaudible) web contents and classic contents (inaudible) web content connected
with (inaudible) and have you maybe you could use this kind of data to solve the questions
(inaudible) that kind of the original one that shows that the role of (inaudible).
>> Douglas Downey: Okay. So you're saying using the web graph as an additional input -- right, to find facts. I think it's a great idea. The techniques I proposed use
running text. There's a lot more information on the web and the web graph is one of those. The
HTML structure of pages is another extremely informative thing. Like HTML tables have very rich
information that is not identified using the techniques I described. So yeah, I think there's a whole
additional direction that pulls those additional inputs into the computation.
>> Question: (Inaudible) --
>> Douglas Downey: Yeah. It's possible. I mean I guess I don't know too much about the state
of the XML version of pages being useful for extracting real knowledge. But I can see it
happening. And so, yeah, actually an important caveat to all this work is maybe we're
just sort of wasting our time until there really is a semantic version of every web page that's out
there. So people will just publish the semantic content of their page in XML and then everything
that I described is useless. So that it is something that could happen. Yeah.
All right. Thank you.
(applause)
>> I'm not sure I remember. It's been a while. So Chris has been in the Natural Language
Processing Group for eight years now, I think.
>> Chris Quirk: Just under seven years.
>> Yep. And has been working a lot in the area of statistical machine translation and when it
comes to statistical machine translation I'm -- you can ask him later, but I don't know what he has
not worked on. So I think it's, you know, instead of listing what he has worked on, just assume it's
a lot of different things. And today he's going to talk about models for comparable fragment
extraction for acquiring data for statistical machine translation.
>> Chris Quirk: Thank you.
>> Let's see, do we need to --
>> Chris Quirk: Do we need to push that thing? Oh, perfect. Oops. (Laughter)
Here we go. There we go. All right. Thanks so much, Michael. So today I'm going to be talking
about joint work with myself, Rob Vandero (inaudible), who's a researcher in Microsoft Research
India in Bangalore, and with Roman (inaudible), who is also here in Redmond.
So as Michael said, this talk is mostly about extracting fragments, fragments of parallel data from
comparable corpora. And the motivation I think is pretty clear. So all these data-driven machine
translation systems, including statistical machine translation systems are gated by the amount of
parallel data available. In fact one of the best ways to improve a statistical machine
translation system is to throw more data at it. That seems to be a consistent way to improve things.
And yet finding parallel data is a bit of a tricky feat. So there is a certain amount of data
available in major languages, especially from governmental sources like the European
Parliament. And they give you a surprising amount of general domain coverage. Yet it's not a
perfect cover of the language. So if we hope to find sources of translation information about
broader domains, about new language pairs, then I think in the future we are going to have to look
beyond simply parallel data on to more comparable sources.
By comparable I tend to refer to like news documents that are written about the same event, but
are not sentence by sentence perfect translations of one another. And I mean this is by no
means a new observation. So people in the past have tried to learn bilingual lexicons from
comparable data, have tried to extract whole sentence pairs from comparable data. That sounds
like a fantastic thing when you can find them, but they're rare. There's been some interesting
work recently on extracting names -- named entity translation pairs using a variety of signals. So
distributional similarity as well as temporal and orthographic similarity can be interesting cues for
this.
But I think one of the most promising approaches is that taken by Munteanu and
Marcu, which is to find arbitrary, subsentential fragment pairs within sentences. It
subsumes all the other categories. So if done effectively it can gather a huge
amount of parallel data.
As an example of the sorts of documents that we're hoping to work with, here is an
English-Spanish pair, about the untimely death of a linguistically gifted parrot. And yeah, so it's
kind of an interesting article, actually.
But so the first thing we can see is that there are indeed almost completely parallel sentences in
these documents, although they lie in different places and the whole documents are parallel. Like
all the information in the Spanish side is replicated in the English, except for that small phrase
including zero. So there is a certain level of parallel sentences. However, I went through and
highlighted all the information that's basically parallel. And we can see almost 50% of the words
are in common between these two articles. However, very few of them are whole sentence pairs.
So there's a lot of data, but we're going to have to use slightly different techniques in order to get
at it.
In particular with news articles, one really effective source of parallel data is looking at quotes.
Right? So direct quotes of people are almost always translated very literally. So that's great for
parallel data. But as you can see they are subsentential oftentimes.
Okay. So the first part of the process here I'll be describing is not anything particularly new. If
we're going to try to extract parallel data from document sources such as large news corpora,
then the first thing we need to do is find documents that are describing approximately the same
event. This portion is kind of focused on the news aspect. There may be other sources of
comparable documents that already have a base alignment and where this wouldn't be
necessary. Here we are not going to do anything new, this is following exactly what
Munteanu and Marcu did, and I think it's a very sensible approach.
So we require three resources. The first is a seed corpus, a seed parallel corpus that provides
some base translational equivalence knowledge and then a bank of target language documents
and a bank of source language documents that presumably describe some of the same events.
The first thing we're going to do is use standard word alignment techniques in order to build
models that give us a base notion of translational equivalence. Okay. Then we can take all the
documents in one language, I'll pick the source and build an inverted index so that we can quickly
look up a document by the word that it contains. Okay. Then we can march through each target
language document and find similar document pairs using cross-lingual information retrieval.
This isn't magic, just take each document, represented as a bag of words, find its likely
translations from the word alignment models. So you have a source language bag of words, and you
issue that as a query against the source language corpus.
So simple measures like BM25 are sufficient for finding documents that are actually
surprisingly similar. And this gives us a source of document pairs that are talking about basically
the same thing. They hopefully have some concept overlap. Furthermore, this talk isn't going to
look at documents as a whole. Instead it's going to look at promising sentence pairs within those
documents. So we do a simple filtration process, just find sentences within these document pairs
that contain some terminology in common and restrict our attention to those.
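Here is a much-simplified sketch of that retrieval step; the tiny lexicon and documents are invented, and a real system uses translation tables learned by word alignment and a ranking function such as BM25 rather than the raw overlap count used here.

```python
from collections import defaultdict

lexicon = {"loro": "parrot", "murio": "died", "alcalde": "mayor"}  # hypothetical best-translation table

source_docs = {
    "doc1": "the parrot died after a short illness",
    "doc2": "the mayor announced a new budget",
}

# Build an inverted index over the source-language documents.
index = defaultdict(set)
for doc_id, text in source_docs.items():
    for word in text.split():
        index[word].add(doc_id)

def retrieve(target_doc_words):
    # Translate the target-language bag of words, then score source docs
    # by how many translated query terms they contain.
    query = {lexicon[w] for w in target_doc_words if w in lexicon}
    scores = defaultdict(int)
    for term in query:
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(retrieve(["el", "loro", "murio", "ayer"]))   # doc1 ranks first
```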
So this is where the task is set up. Given millions, maybe tens of millions, maybe even hundreds
of millions of sentence pairs with some content in common, try to identify a hidden fragment
alignment between them. So going back to the parrot story, we can see a sequence of words like
this and the quoted segment is in common, as is this pepper bird told, but the remainder of the
information in the sentences is not in common. So This is exactly the sort of information we'd like
to extract.
So let me spend just a moment talking about Munteanu and Marcu's
approach. So it's motivated by an idea in signal processing. The idea is that we can again get
some notion of translational equivalence and give words a score between negative one and
positive one, which indicates the strength of belief that word has a corresponding translation on
the other side. Let's go to the graph, it's a bit more clear.
So along the X axis, we can see the words of some English sentence and underneath it we can
see a corresponding Romanian sentence. Munteanu is Romanian, so he's looking for
Romanian parallel data.
So now what we can do is take each English word in turn, find its most strongly associated
Romanian word and give it a score along the Y axis according to its strength of association. If a
word has no strong correspondence then it gets a score of negative one. This gives a very noisy
blue dotted signal that you can see. So they run a moving average filter across this signal to kind
of smooth it out and then search for segments of at least three words that are all positive. Okay.
So in particular this circled segment is something that seems to have a translation. All the words
in the English side that participate in such a fragment are assumed to have some correspondence.
They're set aside. Then we flip, do the exact same process based on the Romanian, pull out all
those words that have a parallel correspondence and we end up with some subset of the words
that act as a parallel fragment. It's a decent starting point and it leverages some of the
knowledge that we have. But there are down sides to this approach.
First of all, the mapping from probabilities to scores between negative one and positive one is
somewhat heuristic, and it seems like a more well-founded probabilistic model might be able to
better leverage the data that we have. Second of all, the alignments from English to Romanian and
Romanian to English are computed completely independently, as are the words that participate in
fragments. So an English word might be aligned to a stranded Romanian word. So we might pull
in an English word in one side's fragment and not pull in its correspondent on the other side, which is
a slightly strange thing to do. So we are not totally convinced that the fragments are parallel.
And finally there's no -- in this sentence there's only one fragment that we extract on the English
side, but sentences may contain multiple fragments. This approach just runs all the words
together and doesn't have any notion of what the subsentential fragment alignment was like. So
these are some things that we hope to improve on.
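For concreteness, here is a rough sketch of the moving-average filtering just described; the word scores are invented, and the window size and the three-word threshold are illustrative choices rather than the original settings.

```python
# Each word gets a score in [-1, 1] reflecting how strongly it associates
# with some word in the other sentence; smooth the noisy signal with a
# moving-average filter, then keep runs of at least three positive positions
# as candidate fragment words.
def moving_average(scores, window=3):
    half = window // 2
    out = []
    for i in range(len(scores)):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out

def positive_runs(smoothed, min_len=3):
    runs, start = [], None
    for i, v in enumerate(smoothed + [-1.0]):     # sentinel flushes the last run
        if v > 0 and start is None:
            start = i
        elif v <= 0 and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs

word_scores = [-1.0, 0.6, 0.9, 0.7, -1.0, -0.2, 0.8, 0.5, 0.9, 0.4, -1.0]
print(positive_runs(moving_average(word_scores)))
```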
Okay. So our approach instead was to build on purely probabilistic models and we ended up
doing something simple. The motivation here is that we think a fragment within a sentence pair
should be considered parallel if and only if the joint probability of those two word sequences is
greater than the product of the marginal probabilities. And likewise we can express this fact in
terms of conditional probabilities: the conditional probability of the target given the source is greater than its marginal
probability. To this end we'll present two generative models of comparable sentence pairs to try to
capture that. The hidden alignments behind these models will identify the fragment
correspondences. And we tried to address some limitations of prior methods, namely that the
selection of the alignment is independent in each direction, and we model the position in the
sentences, etcetera.
But we do follow Munteanu and Marcu's lead to evaluate in terms of the
end-to-end machine translation quality. That seems like the best way to identify how good the
information you are going to extract is.
Okay. So let me do a very brief review on word alignment techniques. This -- I'll be focusing on
HMM-like word alignments. So the idea is we are given an English sentence and we want to
predict the probability of a Spanish sentence. The HMM models this by saying there's a hidden
bit of information, which is of course the word correspondence between the two sentences.
So the English words form the state space that's used to generate the Spanish side. In particular
we march along left to right in the Spanish sentence. We'd pick a word or we'd pick a state that
generates the first Spanish word according to a jump distribution. So in this case "El" had no strong
correspondence in the English side, so it's generated according to a null distribution. "Fraude" was
generated by fraud and so on until we've generated the whole sentence.
So it's just -- the probability of this whole sentence and its alignment is the product of those
individual probabilities. And right, so the probability of landing in this state is conditioned only on
the distance from the prior state. So there is some bias towards making somewhat monotone
alignment through the sentence and furthermore, even in truly parallel data, we have a certain
amount of mismatch between the languages. For instance, in this sentence the Spanish used a
determiner where the English did not; it says "El fraude" instead of just "fraud." So these simple IBM
models and the HMM model learn a multinomial distribution over target words, given source words,
including null. So it can generate little function words like this as necessary in order to model the
sentence.
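As a toy sketch of how such an HMM scores a sentence and an alignment, here is one way to write it down; the jump model, the translation table, and all the numbers are invented and unnormalized, purely to show the shape of the computation.

```python
import math

# Score = product of jump probabilities (conditioned only on the distance
# from the previous aligned position) and lexical translation probabilities
# t(target | source), with a NULL source state for inserted words.
t_table = {
    ("fraud", "fraude"): 0.7,
    (None, "el"): 0.05,        # NULL-generated determiner
    ("is", "es"): 0.6,
}

def jump_prob(distance):
    # Simple bias toward monotone alignments: nearby jumps are more likely.
    return 0.5 * (0.5 ** abs(distance))

def alignment_log_prob(source, target, alignment):
    # alignment[j] = index into source (or None) generating target[j]
    logp, prev = 0.0, 0
    for j, a in enumerate(alignment):
        if a is not None:
            logp += math.log(jump_prob(a - prev))
            prev = a
            src_word = source[a]
        else:
            src_word = None
        logp += math.log(t_table.get((src_word, target[j]), 1e-6))
    return logp

print(alignment_log_prob(["fraud", "is"], ["el", "fraude", "es"], [None, 0, 1]))
```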
As we move from parallel data to comparable data, one simple way to account for words that
have no strong correspondence in the other language is to say they're generated by this null
state. However, that's not a good idea, because we hope the distribution of words given null will
model those things that are systematically inserted during translation. We don't want to learn that
"situation" and "actual" are systematically inserted during translation. That's the wrong sort of
information to give it.
So instead we're going to expand the base word alignment model a little bit and end up with
something that we call comparable model A. So I hope the math isn't too daunting. The top is
the true generative story, without any approximations, of how we might generate a
target sentence and an alignment given a source sentence.
So first we predict the length of the target side. Then we predict each state conditioned on all
prior information and then we pick the translation conditioned on that state and all other prior
information. In practice the standard HMM says that the probability of landing in state A sub J is
conditioned only on the prior state and the probability of generating word T sub J is conditioned
only on the source word that it was aligned to.
So comparable model A makes only one small extension, which is to add an additional state that
says we can generate words monolingually. And when we generate a word monolingually
instead of conditioning on the source word we use standard end ground language modeling
techniques. We generate it according to the context in which it occurred.
Okay. So graphically it looks something like this. We've added an additional row at the bottom,
this gen model that accounts for that probability. A bilingual word now has three pieces. We
predict that it's generated bilingually. We predict its jump from the prior state and we predict the
word itself in the context of the state that generated it.
Monolingual states only have two portions. We predict that it's monolingual and then we predict
the word in its trigram context. Okay.
Now those of you familiar with HMM techniques know that we can just throw a standard HMM at
this problem and find the most likely state sequence in a small polynomial time.
Right? So this is one means of finding parallel fragments using only n-gram language models
and -- sorry, HMMs.
Okay. So the one remaining thing is we're only left with a word alignment, when ideally what we
wanted was a fragment alignment. So one simple idea is just to again look at sequences in this
Spanish side that were all generated bilingually, find sequences of at least three words. Look at
the range of English words that they were generated from and place some simple filtering
conditions on that. So this is one simple model for extracting parallel data.
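Here is a hedged, much-simplified sketch of that idea: every target word is generated either bilingually from some source position (a jump plus a translation probability) or monolingually from a language model, a Viterbi pass recovers the most likely state sequence, and runs of bilingually generated target words become candidate fragments. The jump model, the unigram stand-in for the trigram language model, and all the numbers are invented simplifications, not the model from the talk.

```python
import math

MONO = -1  # marker state for "generated monolingually"

def viterbi_states(source, target, t_prob, lm_prob, p_mono=0.3):
    states = [MONO] + list(range(len(source)))

    def emit(state, word):
        return lm_prob(word) if state == MONO else t_prob(source[state], word)

    def trans(prev, cur):
        if cur == MONO:
            return p_mono
        dist = 0 if prev == MONO else abs(cur - prev)
        return (1 - p_mono) * 0.5 ** dist

    # Standard Viterbi over the small state space.
    best = {s: (math.log(emit(s, target[0])), [s]) for s in states}
    for word in target[1:]:
        new = {}
        for s in states:
            score, path = max(
                (best[p][0] + math.log(trans(p, s)), best[p][1]) for p in states
            )
            new[s] = (score + math.log(emit(s, word)), path + [s])
        best = new
    return max(best.values())[1]

def bilingual_fragments(states, min_len=3):
    # Collect maximal runs of bilingually generated target positions.
    fragments, start = [], None
    for j, s in enumerate(states + [MONO]):    # sentinel flushes the last run
        if s != MONO and start is None:
            start = j
        elif s == MONO and start is not None:
            if j - start >= min_len:
                fragments.append((start, j))
            start = None
    return fragments

source = ["the", "parrot", "died", "yesterday"]
target = ["segun", "informes", "el", "loro", "murio", "ayer"]
t = {("the", "el"): 0.5, ("parrot", "loro"): 0.6, ("died", "murio"): 0.6,
     ("yesterday", "ayer"): 0.5}
states = viterbi_states(source, target,
                        t_prob=lambda s, w: t.get((s, w), 1e-4),
                        lm_prob=lambda w: 0.01)
print(states)                      # which state generated each target word
print(bilingual_fragments(states)) # target spans generated bilingually
```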
Okay. So the nice thing about it is it's a relatively simple and clean extension of the HMM model
and it's really fast, easy to implement, something you can even use to extract fragments. But
there are limitations we'd like to address. So first of all the alignment metric -- method itself is
asymmetric, that is we're only looking at one direction of these correspondences, not both
directions. And secondly this is a Markov model. There's nothing to prevent us from revisiting
any given state again and again throughout the sentence. So a single source
language word can participate in multiple fragments, which is not ideal.
Third, there are several free parameters to tune. And fourth, we're looking at only a single
Viterbi alignment, which is okay, but it's not ideal for modeling more phrasal
translations, where finding the exact word-to-word correspondence is somewhat more difficult.
Okay so that leads into the second model that we explored, which is what we'll call generative
model B. This is a joint model of source and target fragments. So it's generating both of them
jointly as opposed to generating the source conditioned on the target or vice versa. And so the
idea -- the generative story behind this model is first we pick a number of fragments and then for
each fragment it generates some number of source words and some number of target words
together. It could generate only source words for a monolingual fragment or only target words, as
well.
Again, we'll use source and target n-gram language models to estimate our parameters and
start from HMM models of source-given target and target-given source.
Okay. So monolingual fragments are going to be scored only according to the language model
and bilingual fragments -- well, we have two ways of estimating their probability: conditional
distributions in both directions, from either of which we can get a joint. So since we're
interested in the highest precision fragments that we can possibly find, we're using the minimum
score out of these two. So that means that both conditional models have to agree that this is
parallel before we will accept it as a fragment. Graphically the model looks like this. As opposed
to predicting one word at a time, we are predicting whole blocks of the sentence at a time and
they are scored according to the formulas that are put on the previous screen. Okay. So parallel
fragments are scored according to the minimum of the two conditional estimates, and
monolingual fragments are scored only according to the language model. Okay. And we progress
this way through the whole sentence.
This search here is pretty easy. Since we don't condition on anything about the prior fragment
structure we only need to remember the one best fragment that covers the first J source words
and the first K target words. And then we can use a simple dynamic programming algorithm to find the best
sequence. It works okay. Unfortunately it's really slow. So those of you who have implemented
String Edit distance before will note a similarity between this and String Edit distance. We can do
like an insert, a delete or a copy, except the insert, delete and copy operations are of arbitrary
length. We're not just using one character at a time. We can use arbitrary string length. So that
expands the search computation from O(MN) to O(M^2 N^2), where M is the
length of the source sentence and N is the length of the target.
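Here is a schematic sketch of that dynamic program: like string edit distance, but with "insert", "delete", and "match" operations covering arbitrary-length spans, hence the O(M^2 N^2) table update. The scoring functions are stubs standing in for the language-model and min-of-two-conditionals fragment scores from the talk.

```python
def best_fragmentation(src, tgt, mono_src, mono_tgt, bilingual):
    M, N = len(src), len(tgt)
    NEG = float("-inf")
    best = [[NEG] * (N + 1) for _ in range(M + 1)]
    back = [[None] * (N + 1) for _ in range(M + 1)]
    best[0][0] = 0.0
    for j in range(M + 1):
        for k in range(N + 1):
            if best[j][k] == NEG:
                continue
            # Monolingual source fragment covering src[j:j2].
            for j2 in range(j + 1, M + 1):
                score = best[j][k] + mono_src(src[j:j2])
                if score > best[j2][k]:
                    best[j2][k], back[j2][k] = score, (j, k, "mono-src")
            # Monolingual target fragment covering tgt[k:k2].
            for k2 in range(k + 1, N + 1):
                score = best[j][k] + mono_tgt(tgt[k:k2])
                if score > best[j][k2]:
                    best[j][k2], back[j][k2] = score, (j, k, "mono-tgt")
            # Bilingual fragment covering src[j:j2] and tgt[k:k2].
            for j2 in range(j + 1, M + 1):
                for k2 in range(k + 1, N + 1):
                    score = best[j][k] + bilingual(src[j:j2], tgt[k:k2])
                    if score > best[j2][k2]:
                        best[j2][k2], back[j2][k2] = score, (j, k, "bilingual")
    return best[M][N], back

# Stub scorers: invented per-word log-probabilities, with bilingual spans
# allowed only when the two lengths are close.
mono = lambda span: -2.0 * len(span)
bi = lambda s, t: -1.0 * (len(s) + len(t)) if abs(len(s) - len(t)) <= 1 else float("-inf")
score, back = best_fragmentation(list("abc"), list("xyz"), mono, mono, bi)
print(score)
```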
And in addition there are these probability scores that need to be computed. So the language
model scores can be computed relatively cheaply, just by looking at the scores of each segment.
But the joint probabilities, marginalizing out across all the HMM alignments takes a little bit more
time. So to speed up the search process we introduce several optimizations. First we can note
that the score of a fragment that covers from J' to J and K' to K in the sentences is
bounded above by the probability of predicting that span of source words given the whole target
sentence. So as we add more information we only increase the probability of a segment.
Furthermore we can notice that the marginal -- we'll only use a bilingual span if it's more probable
than a monolingual span. So we can focus our attention on those spans for which this bound is
greater than the marginal probability. Okay. And this can be found relatively quickly by dropping
some transition probabilities or estimating them. We can cut it down to about a cubic time and
then next we can only score bilingual segments, where both upper bounds exceed the
monolingual probability. So this cuts down the number of places that we need to visit in the
search space. Then some bounds on fragment length and fragment ratio further cut the search
space. That series of optimizations made the procedure about 10 times faster, which is still not
quite enough given that we want to run this over tens of millions of sentences. So in the end we
also did a beam search. And the simple idea is, instead of keeping the best hypothesis
for each distinct number of source words and target words covered, we keep a stack
of hypotheses based on the total number of words covered. And at that point a future cost
estimate becomes necessary. The paper has more details on this, so I will leave the technical stuff to
that. Overall these optimizations produce about 100 X speed up over the baseline algorithm and
only affect model cost by less than 1%. So it's a pretty effective means of (inaudible).
Okay. So that's model B. It's a symmetric model, which is nice. Each word definitely participates
in exactly one fragment. We're marginalizing out across word alignments inside fragments and
there are no free parameters. So it sounds awfully nice. There are some limitations, though. The
fragment alignment must be monotone and may be subject to search errors, and still, even
with all these optimizations, it's slower than model A. So that's all I have to say about the models
themselves.
In terms of what we get out of them, well, I think there's two things we are interested in. One is
how much parallel data can we get from these comparable corpora. And the other question is
what sort of impact do they have on an end-to-end translation system? For our machine
translation system we just did something simple, using a phrase-based decoder, much like Pharaoh,
though it's an in-house version. And we didn't do anything unusual: standard methods of
estimating probabilities, with a trigram language model and a distortion penalty.
So our baseline was to train a machine translation system on only the parallel data and the
comparison systems append fragments to that parallel data and then we'll retrain the whole
system from scratch.
In terms of fragment extraction. For our seed parallel corpus we started with European
parliamentary data, which is about 15 million English words of parallel English-Spanish
data, and then for our fragments we looked at the Spanish and the English Gigaword corpora. So these
are both quite large. Spanish is not quite one gigaword, but it's 700 million words, good enough.
we compared three approaches.
So first was an in-house implementation of Munteanu and Marcu, which we
think is pretty close to what they implemented. And you will notice that it extracted almost 60 or
above 60 million English words, which is almost like 10% of the Spanish data. So it's a significant
fraction of the data that's actually being extracted. It's hard to evaluate this stuff in isolation.
Looking at it, it seems like it has some good parallel data, but it has a lot of noise. Model A
extracted significantly less data but it looked much cleaner, and model B was
somewhere in between. But the story is that even from just large bodies of news corpora
we can find a huge amount of somewhat parallel information, at least.
In terms of MT quality, we evaluate it on three test sets. The first was in-domain European
parliament test set and you'll notice that none of these methods succeeded in improving quality in
that domain. I don't think that's terribly surprising, right? We're not going to go out to
news corpora and figure out how to translate things in this domain. Although it is nice to
see that both of these models either didn't degrade or didn't significantly degrade the baseline
system, whereas adding in the noisy segments from Munteanu and Marcu did
seem to have a noticeable impact on translation quality in domain.
Along with the Europarl data there is also some news commentary data, which the organizers of these SMT workshops claim is out of domain; it's not actually very out of
domain. The out-of-vocabulary rate is still less than 1%. So we tried it out on this source and
found that all of the methods had a small, but positive impact on this translation material. Where
this data really seems to make a benefit, however, is on data that users actually want us to
translate. So we have a live translation website up now that Anthony talked about during the last
symposium. So we started gathering the sorts of requests that people want to translate and
using the fragments here makes a significant difference, like almost up to three BLEU points of
improvement against a baseline system. So the potential for impact is really quite large,
especially in out-of-domain scenarios.
Okay. So that's all I have to say about this. We presented two generative models for comparable
corpora. And these are clean methods for aligning nonparallel data.
And in terms of future work, I think there is a lot of additional sources that can easily be exploited.
For instance Wikipedia has a huge amount of parallel or comparable data and seems like it could
improve quality in a broad set of domains. I do think there's more work to be done in terms of
parameter optimization and bootstrapping, as well as exploring different model structures. And
finally it would be kind of interesting to apply some of these models to our parallel data to see if
they could do some sort of automatic cleaning, find the information that's not actually parallel
within them and see if the translation system would perform better.
So that's all I have. Thanks very much.
(Applause)
>> Yes?
>> Question: Very interesting. One thing I was surprised to see was how large the seed corpus was,
though. Has one looked at how performance falls off as that gets smaller?
>> Chris Quirk: I haven't done that. That's a really good question because we'd love to know
what kind of impact it could make on a language pair where there is only a tiny amount of
parallel data. No, no, sorry. I haven't done enough experimentation yet. Yes?
>> Question: Yes, (inaudible) -- they are not contiguous. I mean they have to be contiguous.
>> Chris Quirk: They do have to be contiguous, yeah, and if there's a gap in between you have to
extract two fragments instead.
>> Question: Right. So (inaudible) --
>> Chris Quirk: That makes the search space potentially larger, which kind of scares me. But it's
possible, you know. And maybe approximate methods are sufficient. You know, I think some of
these things -- like the fact that you can make it so much faster without having any major impact
on model costs makes me think that we're just trying too hard at this search stuff. We could get
away with something simpler. You know, maybe something like Gibbs sampling in order to look at
the state space quickly. But yeah, it's an interesting idea. Like do you want to chop up a large
fragment into two pieces because there was one little bothersome --
>> Question: (Inaudible) also short on (inaudible) you have (inaudible) --
>> Chris Quirk: Yeah. Yeah, that's true. Yes?
>> Question: When you say it is to be contiguous, you mean with the exception of those function
words that you were talking about that -- with the exception of those?
>> Chris Quirk: Exactly. Yeah. So anything that can be generated by a null, the HMM's null
model is allowed. Yeah?
>> Question: (Inaudible) try the English/Spanish right now or have you tried other ones, as well?
>> Chris Quirk: I've only tried English/Spanish so far. And yeah, I think -- as the -- as we just
talked about, it would be even more interesting to try it on lower density languages, but there's a
lot of demand for English/Spanish translation interestingly so that's been part of the reason we
focused on that language pair.
All right. Well, I think there's food back there and I should let you guys enjoy it. Thank you so
much for your time.
(Applause)