>> So let me first of all welcome you all. This is the first time that we have the symposium in this new building, and a lot of things are more convenient here. You don't have to register anymore; you can just walk in. And I hope you have a chance to see a little bit of the -- at least through the glass door see a little bit of the atrium and the nice area there. I just realized this morning that this is actually the fifth year that we've done the symposia. With this one we complete five years and 15 symposia, which is quite a lot. I'm pretty pleased that we went this long, because I remember that in the first year we were actually not sure whether we would go beyond one or two events and whether there would be enough interest. So it's kind of nice to see that every time this draws so many people. Today we'll have two talks and then three demos. The talks are in sequence; the demos run simultaneously -- you just walk through and talk to the people. The first talk is by Douglas Downey, and Faye will introduce him. The second talk after that is by Chris Quirk, and I'll give a little bit more introduction for that.

>> Hello. So Doug Downey is going to be our speaker from the University of Washington. He's a PhD student in the CSE Department working with Professor Oren Etzioni. He's going to graduate this fall, and after that he's going to be an Assistant Professor in the EECS Department at Northwestern University. His research interests are in the area of (inaudible), machine learning and artificial intelligence. Doug was part of the (inaudible) project, a system that uses the information on the (inaudible) as a knowledge base, and his dissertation focused on probabilistic models to enable information extraction at high precision. Doug is also a Microsoft Live Labs fellow, and he did his internship at Microsoft in 2006. So let's welcome Doug to give the presentation.

(Applause)

>> Douglas Downey: Thank you. Yes. So I'll start by motivating this autonomous web-scale information extraction task and placing it in a broader research context. Information extraction is the task of producing machine-understandable data from text. As an example, we see this phrase "Paris, the capital of France." We'd like to be able to conclude that these objects, Paris and France, share this relation: capital of. And because text is sometimes wrong and we can't do extraction perfectly, we'd like to assign some probability of belief in this fact -- in this case 0.85.

So why should you care about this problem? Well, there are actually a number of cutting-edge applications that information extraction underlies. The one I'll talk about is improving search engines. If we wanted to know the capital of France, we could just type that phrase into a search engine and get the answer immediately. But there's a set of questions that are harder to answer using existing search engines. Here are some examples. Say I'm looking for a job and I want to know which nanotechnology companies are hiring. To answer this question on the web today I'd have to first compile a list of nanotechnology companies and then manually check each one to determine whether or not it was offering jobs. Other questions, like "Who's won an Oscar for playing a villain?", are difficult for the same reason, and that is because there's not a single page that contains the answers.
So we can find the answers on the web, but that requires compiling information from multiple pages, and doing that currently is a tedious and error-prone manual process. Information extraction aims to automate that task so that these more complex questions can also be answered in an instant with a search engine. There are a couple of other motivations -- key problems in artificial intelligence and the semantic web. I won't say too much about them, but the point I want to make is that in order to deliver on the promise of any of these applications, we need information extraction that scales to the web, and that's a challenging problem. This is what I'll be focusing on for the rest of the talk.

So why is this challenging? Well, you can appreciate the challenge if you contrast web information extraction with what's been classically done in information extraction. A classical information extraction task looks something like this. We're given the text of, say, a seminar announcement, and our goal is to build an extractor that can take any of these announcements and extract the values of certain predefined fields, like the date of this seminar is June 5th, it's in Room 302, and so on. There's a set of characteristics that makes this task tractable. One is that it's small scale: we're working with only a few megabytes of text, typically. Further, it's defined on a restricted domain. If I'm not working on seminar announcements, maybe I'm working on mergers and acquisitions in news releases, but in any case it's a narrowly restricted domain of text that I'm working with. And lastly, I start with hand-labeled examples. Right? So in the seminar case, before I build my extractor, a human annotator goes through and identifies the text which corresponds to the fields I want to extract, and then I can use those examples to train my extractor, which makes this task much easier than it otherwise would be.

Okay. So contrast that with web information extraction. In web information extraction we start with this messy and multifarious web and we want to extract structured data. For that nanotechnology companies question I need a table that looks something like this one. And of course the web is not small scale, it's massive scale, and any extraction that we do needs to solve a set of computational issues as a result. Further, the web of course is open domain. This is what makes it so valuable -- the fact that it expresses information on a multitude of topics -- but that's also what makes it difficult to perform information extraction over. So we need to not only extract this companies table, we need to extract information on essentially every concept that we happen to see expressed on the web. That's very challenging. In particular, this implies that we can't count on having hand-labeled examples. You can imagine labeling examples for some particular concepts, but given that the number of concepts we are going to extract is so large, we can't count on having hand-labeled examples for each concept.

Okay. So that's the problem. Here is an outline for the rest of the talk. I'll give a little more background on how this web information extraction is done, and then I'll get into more technical material on estimating the probabilities that extractions are correct. And then, if there's time, I'll also talk about the challenge of the long tail of sparsely mentioned facts. Okay.
So given the challenges I outlined, you might wonder: is this web information extraction possible? It turns out that it is, and the key is to utilize what I'll call generic extraction patterns. As an example, the phrase "cities such as Chicago" seems to indicate that Chicago is a member of the class City, or more generally, the extraction pattern "C such as X" indicates that X is an element of the class C. This particular pattern is due to Hearst in '92. These are generic in the sense that for whatever type of information you want to extract, you can exploit this pattern. Cities, films, nanotechnology companies, etcetera -- you can extract instances of all those classes using this same pattern.

Of course, as most of you are aware, natural language is not usually so simple. Take this sentence: "We walked the tree-lined streets of the bustling metropolis that is Atlanta." You can imagine trying to extract that Atlanta is a city from this sentence. To do so you'd have to solve at least a couple of tough natural language processing problems. These are the types of problems, like parsing and subclass discovery, that natural language processing has concerned itself with. They're important problems and people should solve them. But they remain unsolved, and the best solutions don't scale very well, so the approach I'm going to advocate in this talk is to say: well, if we're working on the web, who cares about these complicated sentences? Let's just go and try to find simpler ones and see what we can do with those. If we go to a search engine, we find we don't have to depend on this complicated sentence to learn that Atlanta is a city -- there are plenty of simple ones out there as well.

I should mention that this extraction pattern notion is more general than just this Marti Hearst pattern. "Bloomberg, Mayor of New York City" indicates that Bloomberg is the Mayor of New York City. More generally we can use the construction "X, C of Y." And there's a small set of these simple syntactic constructions you can use to build up a large set of different facts. I'll give a more concrete example of this in a second. But the take-home message here is that when you're working on the web, you have scale and redundancy that make a lot of facts easy to extract.

Okay. So I'll quickly show a demo of the TextRunner system, which perhaps some of you have seen. TextRunner is a system built by Mike Cafarella and Michele Banko at the University of Washington that does this type of pattern-based extraction. You can ask it questions like "What kills bacteria?" and the answers you get back look something like this. You can see TextRunner identifies essentially noun-verb-noun triples that share a relationship. These are all the things that share the relationship "kill" with the noun "bacteria" -- antibiotics and benzoyl peroxide and so on are examples. Then if you look at the sentences that support a fact, you will see they all follow roughly this extraction-pattern kind of form, although TextRunner uses a more general notion, doing some lightweight syntactic processing to identify these statements. But you can see "benzoyl peroxide kills the bacteria," etcetera. I encourage you, if you haven't, to play with TextRunner, because it really can answer a remarkable number of questions. It's pretty fun to play with. So okay.
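As an editorial aside, here is a minimal sketch of how a generic pattern like Hearst's "C such as X" can be applied with a simple regular expression. The regex and the example sentence are illustrative assumptions, not the speaker's system, and as the talk emphasizes, the output of such patterns is noisy.

```python
import re

# A crude version of the Hearst pattern "C such as X": capture the class noun
# before "such as" and the capitalized instance that follows it.
HEARST_SUCH_AS = re.compile(
    r"\b([A-Za-z]+)\s+such\s+as\s+([A-Z][\w.-]*(?:\s+[A-Z][\w.-]*)*)"
)

def extract_instances(sentence):
    """Yield (class, instance) pairs suggested by the 'such as' pattern."""
    for match in HEARST_SUCH_AS.finditer(sentence):
        yield match.group(1).lower(), match.group(2)

if __name__ == "__main__":
    for pair in extract_instances("We visited cities such as Chicago last year."):
        print(pair)   # ('cities', 'Chicago')
```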
So the catch, though, is that extraction patterns are imperfect. If we apply our "X, C of Y" pattern to this sentence, we will extract that Eric Johnson is the C.E.O. of Texas Instruments, and that's correct. However, if we apply the same pattern to a different part of the sentence, we will extract that Texas Instruments is the Mayor of Dallas, and that's an error. So there's a key problem here: determining which of the extractions that these patterns produce are actually correct. What makes this hard is that we need to be able to do it without hand-labeled examples, and we need our solutions to scale to a corpus the size of the web. This is the task I'm going to focus on. My PhD thesis, stated relatively free of jargon, says that we can do this if we leverage redundancy in large text collections and probabilistic models, which I'll describe.

Okay. So now I'll get into this more technical topic: estimating the probability that an extraction is correct. I solve this problem using what I call the Urns model of redundancy. When we're talking about using redundancy to determine whether or not an extraction is correct, there are two intuitions we can draw on. One of these is repetition. Notice that the phrase "Chicago and other cities" occurs more frequently than the phrase "Illinois and other cities," so here we can use the simple intuition that facts which are repeated more frequently are more likely to be correct. But you'll notice that "Illinois and other cities" actually occurs a lot more frequently than you might expect, and that's because of sentences like "Springfield, Illinois and other cities." So if we look at a different pattern that doesn't have that particular failure mode, we can do a much better job of determining which of Chicago and Illinois is a real city.

So those are the intuitions. My goal is to formalize these things in such a way that we can answer this key question: given a term x and a set of sentences about a class, what's the probability that the term x is a member of the class? I'm going to build up an answer to this question. I'll start by restricting to the case where we only have a single pattern -- think of one pattern like "cities such as X." Now I'll rewrite this core question as follows: if we see x k times in a set of n sentences, what can we conclude from that information about the probability that x is an element of the class? We'll start with a series of examples, because this problem is a little more subtle than you might think, and in particular previous techniques for combining evidence in information extraction actually don't model it correctly. Then I'll show my proposed solution to this problem, which does in fact take all the important factors into account.

So here's an example of 10 sentences that contain the phrase "countries such as." So n equals 10 here, and we can summarize the information that we've seen as follows: k is the number of times we saw each of these extractions, and I'm using green to indicate correct extractions and red to indicate errors. In this set of 10 there is one error -- Africa was misextracted as the name of a country. So one way we might think about solving this problem is to say, well, let's treat each sentence as an independent assertion that the extraction is correct, and then all we need is for at least one of those sentences to be in fact a correct extraction.
If we use that approach, we'll be using a noisy-OR model, which has been employed in information extraction in the past. It assigns a probability equal to one minus the probability that all of the extracted sentences are wrong -- in this case 1 - (1 - p)^k, where p is the probability that a pattern yields a correct extraction. In our example, nine out of the 10 sentences, or extractions, are correct, and so p is equal to 0.9. Here are the probabilities the noisy-OR assigns. The point I'd make about these is that they're actually quite good for this example. Given that we have to assign the same probability for fixed k, these probabilities are actually close to optimal. But what I'm going to show with further examples is that the noisy-OR ignores critical aspects of this problem: it doesn't include the sample size, and it also doesn't model the distribution of target elements, which I'll make more clear in a second.

First let's talk about sample size. If we wanted to find the countries in the world, we'd look at, say, 50,000 sentences, and when we do that, we see the behavior of the extractions changes radically. At the top, United States, China -- these very often talked about countries -- correctly get high probabilities from the noisy-OR. You can see the problem if you take a random sample of the singleton extractions. What you're seeing here is that overwhelmingly the extractions that we see once in a set of 50,000 country extractions are not really countries. All of the real countries occur more than once. However, the noisy-OR still assigns high probability to these things. So that's a problem, and in general, the noisy-OR ignores sample size and so it becomes inaccurate as sample size varies.

Okay. So you might think that is easy to fix. Right? The problem is that it's not this value k that matters; it's something like k over n -- the frequency with which something is extracted. So we can propose this frequency model, which instead of using k uses some function of k over n. And in fact this solves that problem: we still assign high probabilities to the correct extractions, but we assign appropriately low probabilities to these singleton errors. The problem is that this frequency model doesn't take into account any information about the target set. Say we performed this experiment again keeping everything the same -- n is fixed and this underlying probability p is fixed -- but we're just extracting a different class: cities instead of countries. When we do this, we see very different behavior. Once again, well-known cities are at the top of this list and get high probabilities. But when you sample the singleton extractions, you find that actually the majority of these things are real cities, and this shouldn't be a huge shock. Right? There are hundreds of thousands of cities in the world, and so some of these rarely mentioned extractions are in fact real cities -- just cities that don't get mentioned frequently on the web. However, the frequency model assigns low probabilities to these, which is a problem. So in general we need to include information about the distribution of the elements that we're extracting in order to assign correct probabilities. Okay. So here -- yes?

>> Question: (Inaudible) --

>> Douglas Downey: Sure.

>> Question: How did you sample the correct (inaudible) data that you are basing this to be on?
>> Douglas Downey: Yes. So yeah, the fact that four out of five of these are correct is based on a sample of a thousand extractions from this set of 50,000, so this is actually close to the real observed proportion. Yeah.

Okay. So here's the solution that does take these factors into account. It takes the form of a combinatorial balls-and-urns model. The urn is filled with labeled balls, and the labels represent anything we could extract for a particular class. Here is an urn for the city class. Some of these extractions are correct, some of them are errors, and different labels can appear on different numbers of balls in the urn. Then, when we see one of these extraction phrases, we model the text that appears in that phrase as being drawn from the urn. To be slightly more formal, the urn is defined by these parameters. We have our target set and our error set -- and of course we don't get to observe these -- and then we have these distributions of the number of repetitions of different elements of the different sets. Going back to our example, here is what this looks like. In the error case, U.K. appears twice and Utah appears once, so the distribution of error labels is two to one. The reason I'm mentioning this is that it's these quantities that we're going to use to compute probabilities. And I'll also show in a moment how we can estimate these distributions without any labeled data -- this is the key to being able to compute probabilities without labeled data.

So if you recall, this is the question we're trying to answer: x appears k times; what's the probability it's a member of the target class? In terms of the Urns model, you rewrite this as: if we see x k times in n draws from the urn, what's the probability it came from the target set? If we have these distributions num_C and num_E, this is just a combinatorial calculation that we can do, and so this is the expression. What I'd point out is that this expression actually includes the factors that were left out of the noisy-OR: you can see n appears in this expression, as does the distribution num_C.

Okay. So the key to being able to use this for web-scale information extraction is that we can actually estimate these distributions without labeled data, and now I'll describe how that's done. This is the point at which modeling with urns actually pays off, because there are assumptions you can make that hold across all classes, and then you can use unlabeled data to estimate the things that vary across classes. The first assumption is that these num functions are Zipf distributed, which is invariably at least approximately true in practice. And then with a couple of other assumptions we can learn those Zipfian parameters. Here is a visual depiction of what's going on. We have a target set Zipf distribution and an error set Zipf distribution, and what we observe is a mixture of those two things, mixed by this extraction precision parameter p. So we observe the gray line on the bottom and want to infer what the green and red lines on the top are. And what we're going to do is assume a couple of things. One is that this error set is actually the same -- well, the error set is not the same, but the frequency distribution of the errors is the same irrespective of what we're extracting. This tends to be approximately true in practice. Also, we're going to assume that we can get this pattern precision: for any pattern, whether we're extracting with "insects such as" or "cities such as," the probability that a given randomly drawn sentence yields a correct extraction is the same. This also tends to be approximately true in practice.
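Since the noisy-OR formula and the Urns expression are shown on slides rather than spoken in full, here is a small editorial sketch of the kind of calculation involved. It assumes the n extractions are independent draws with replacement from the urn, and the toy counts are invented for illustration; it is not a transcription of the exact expression on the slide.

```python
def noisy_or(k, p=0.9):
    """Noisy-OR baseline: probability that at least one of k pattern matches is
    a correct extraction. Note that the sample size n never enters."""
    return 1.0 - (1.0 - p) ** k

def urns_probability(k, n, num_C, num_E):
    """P(x is in the target set C | its label was drawn k times in n draws).

    num_C and num_E list, for each target or error label, how many balls in
    the urn carry that label. Draws are treated as independent and with
    replacement, so binomial coefficients cancel between numerator and
    denominator.
    """
    s = float(sum(num_C) + sum(num_E))              # total balls in the urn
    weight = lambda r: (r / s) ** k * (1.0 - r / s) ** (n - k)
    target = sum(weight(r) for r in num_C)
    total = target + sum(weight(r) for r in num_E)
    return target / total if total > 0 else 0.0

# Toy urn: a few frequently repeated target labels, many rare error labels.
num_C = [60, 25, 10, 3, 1]
num_E = [2] + [1] * 20
for k in (1, 2, 5):
    print(k, noisy_or(k), round(urns_probability(k, n=200, num_C=num_C, num_E=num_E), 3))
```

Unlike the noisy-OR, the urn calculation changes with the sample size n and with the label distributions, which is exactly the behavior the talk argues for.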
Once we've done that, there is actually only one free parameter in the system, and that's the num_C distribution, and what we've shown empirically is that if you use standard unsupervised techniques like expectation maximization, you can actually arrive at this distribution from just the unlabeled sample that you observe.

So here is the payoff. This is the case that the noisy-OR and frequency models couldn't handle: they didn't assign correct probabilities in the country and city cases, but Urns in fact can. Urns can tell the difference between the country and city distributions, and it assigns low probabilities in the country case and higher ones in the city case. Okay. And it's not just that example. We did an experiment that showed that the probabilities assigned by Urns were much more accurate than noisy-OR and also another technique, which I won't talk about, called PMI, which was used in the KnowItAll system. These were -- lower is better in this graph, if you're looking closely -- but these were large improvements in log-likelihood performance.

And just briefly: I mentioned earlier that you could use multiple extraction patterns to do better classification of extractions, and it turns out to be very convenient to implement this idea in the Urns model. We can just treat each pattern as its own urn and then adopt a particular correlation model between the urns. Target label frequencies are correlated between the urns -- if we see Chicago a lot with one pattern, we should see it a lot with the other -- whereas error labels can sometimes be uncorrelated. So if we see Illinois, which has these radically different frequencies, we assume, ah, it's one of these uncorrelated error labels, and we can detect that Illinois is an error.

Okay. I mentioned earlier that efficiency is important. I won't go into the details, but if you make some continuous approximations, you can actually get a closed-form expression for that probability computation, at least in terms of gamma functions. So you can compute these things quickly, which is important for scalability.

Okay. So I started a little bit late, but am I --

>> (Inaudible).

>> Douglas Downey: Okay. All right. This last piece is brief, but I'll go through a little bit of it. There's this issue with the long tail of sparsely mentioned extractions, and I'm going to show how you can use language models to help solve it. So what do I mean by the long tail? Well, invariably the extractions we see are Zipf distributed in this way. Say we are extracting mayors. We'll see the common mayors -- Bloomberg and New York City -- a lot, and there's a small number of those that we can be relatively sure are correct. The difficult part is that there is this long Zipf tail of extractions that we're unsure about. Now, Urns can assign accurate probabilities to these extractions, which is good, but it can't tell which of them are right. Say it assigns a probability of .5 to this Zipf tail: it can't tell which half of those are correct and which aren't, because they're all extracted a small number of times. Right? On this tail you'll see mayors of more obscure cities -- Dave Shaver actually is the mayor of Pickerington. However, it's not the case that Ronald McDonald is the mayor of McDonald Land.
That was just something else we extracted, right? Because the mayor of McDonald Land is of course Mayor McCheese, which, if you grew up in the United States, you know.

Okay. So the technique -- and I'll just go over this briefly since I'm running short on time. The strategy is to build a model of how those common extractions occur in text. We'll assume those common extractions are correct, find out how they appear in text, and then we'll rank the sparse extractions by how well they fit that model. The insight that my work introduced is the idea that you can use unsupervised language models to solve this problem, and that gives you two advantages. One is that these language models are precomputed, so you can scan the corpus once, build a language model, and then just query it when it comes time to assess sparse extractions. Also, language modelers have been concerned with sparsity for a long time, and we can exploit their techniques to do better assessment.

The idea that we're using here is of course the distributional hypothesis, which in our case is the idea that instances of the same relationship tend to appear in similar contexts. So notice "David B. Shaver was elected as the mayor of Pickerington, Ohio." The sentence doesn't follow a generic pattern, but it does happen to exactly match text on a different page that Mike Bloomberg and New York City participate in. So if we can use these other mentions, we can potentially do better assessment than we could with extraction patterns alone.

I won't go into this in depth. The baseline approach would be to build what's called a context vector. Here we build a vector whose components are the unique contexts in the corpus, and the counts are the number of times each term appears with that context. Then we can compute similarity using dot products. That has some problems: those vectors are really large and the intersections are sparse, which makes it both inefficient and sometimes inaccurate because of the sparse comparisons. So the language modeling technique I used was to build an unsupervised hidden Markov model and effectively compress these context vectors into small distributions. In this case we compressed each word's distributional information into a vector of 20 floating point numbers and did comparisons with that, and that resulted in much more efficient computation -- an order of magnitude -- and also improved accuracy quite a bit.
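To make the contrast concrete, here is a small editorial sketch of the two representations just mentioned: sparse context-count vectors compared with cosine similarity, versus compressed per-word latent-state distributions compared with a cheap dense measure. The contexts, counts, number of states, and the particular similarity functions are illustrative assumptions, not details from the talk.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse context-count vectors (Counters)."""
    dot = sum(u[c] * v[c] for c in u if c in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Baseline representation: large, sparse context -> count vectors (toy values).
shaver = Counter({"was elected as the mayor of": 2, "lives in": 1})
bloomberg = Counter({"was elected as the mayor of": 40, "announced": 12, "said": 30})
print(cosine(shaver, bloomberg))

# Compressed representation: each word summarized by a small distribution over
# latent HMM states (5 states here instead of 20, purely for illustration).
def overlap(p, q):
    """Similarity of two dense state distributions via total variation."""
    return 1.0 - 0.5 * sum(abs(a - b) for a, b in zip(p, q))

print(overlap([0.1, 0.6, 0.1, 0.1, 0.1], [0.2, 0.5, 0.1, 0.1, 0.1]))
```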
Okay. So I'll just conclude. I think that we've successfully moved from information extraction to web information extraction, and not just based on what I showed today -- there's a whole body of work that's helped us move forward here. But there are some big problems that are open. The techniques I described just produce a list of facts. We can tell that Daley is the mayor of Chicago and we can tell that Catherine Zeta-Jones starred in "Chicago," but as far as whether or not those are the same Chicago, we really have no idea. So there are these tough reasoning problems, like entity resolution and schema discovery, that are still out there and are really crucial to being able to answer questions like "Which nanotechnology companies are hiring?" automatically. And then lastly, there are some improvements in accuracy and coverage that could be made. I won't say too much, but I'm a firm believer in more sophisticated language models really helping with information extraction. All right. That's it. Thank you.

(Applause)

>> We'll take some time for questions. Hold on. Okay. Yes.

>> Question: I'm wondering whether you feel that the redundancy of the web in English is sufficiently large that you wouldn't get much additional benefit if you were to replicate the technique with the web in other languages, or whether the distributions of fact and instance tokens in different languages are so different that additional languages would add substantially. Do you have any thoughts on that?

>> Douglas Downey: Yeah, I think it's a good question. I'd be tempted to side with the latter -- the distribution of facts is skewed enough in different languages that you could get a lot of different facts if you ported these techniques to another language. I'm just guessing; I don't have the data, of course. The good news is that I think most of the techniques I described can be ported to other languages. You can build language models in other languages. The lightweight syntactic processing that TextRunner does to identify extractions -- I would expect that there are analogues to those types of things in other languages as well. So I think it's something that could be tried. Yeah?

>> Question: (Inaudible) you opened the presentation with -- the second question was "Who won an Oscar for playing a villain?"

>> Douglas Downey: Playing a villain, right.

>> Question: What subproblems would that decompose into in this framework?

>> Douglas Downey: Yeah, so if I were to decompose that problem, I would want to find actors who had won Oscars, then the roles that they played in the movies that they won Oscars for, and whether or not that role was a villain role. So you could imagine rewriting that query as: X played Y in Z, X won an Oscar for Z, and Y was a villain. Right? If you could search for the actual text patterns, then you could probably do a good job of answering the question. However, the task that I just described, of going from the question to what it means in terms of individual components like actors and roles -- that's some of the stuff I mentioned at the end that is really an open problem. It requires that we know the schema of what the relevant attributes of an actor are -- and some of them are the roles they played -- and that's not something that we necessarily know how to get just yet. Yeah?

>> Question: When you were combining the error and the correct values for a single urn, you had a value p.

>> Douglas Downey: Right.

>> Question: Is that just a parameter? A single number that you choose?

>> Douglas Downey: It is. Yes. It's a number that you observe, and in fact we found that the number 0.9 was fairly reliable for the Marti Hearst patterns -- so with "X such as Y," Y is a member of class X about 90% of the time.

>> Question: And you got that empirically by looking at the patterns?

>> Douglas Downey: That's right. Yeah?

>> Question: I don't know if you have time, but quickly I was just wondering if you could back up to that and explain a little bit more what you were doing in (inaudible).

>> Douglas Downey: Okay. Yeah, thank you for asking. Yeah, it was very fast.
Although, yes, I can't go into too much detail, I can say that this model here is the latent state distribution for a particular term. An HMM computes a distribution -- the probability of a term given a latent state -- and you can invert that using Bayes' rule, and that is all that this compressed vector is. We go over the corpus and we learn an unsupervised HMM whose parameters assign high probability to the observed data -- we just use expectation maximization to build this unsupervised HMM. One of the things we get out of that is the probability of a latent state given a term, and that's exactly what that compressed vector is. So I don't know if that helps, but that's a little more detail.

>> Question: So a term is a word; is your state trying to get at topical stuff or (inaudible)?

>> Douglas Downey: It's whatever you happen to learn. But it happens to be the case that in practice words with similar latent state distributions tend to have similar meanings. Yeah?

>> Question: (Inaudible) web contents and classic contents (inaudible) web content connected with (inaudible) and maybe you could use this kind of data to solve the questions (inaudible) that kind of the original one that shows that the role of (inaudible).

>> Douglas Downey: Okay. So you're saying using the web graph as an additional input -- right, to find facts. I think it's a great idea. The techniques I proposed use running text. There's a lot more information on the web, and the web graph is one of those things. The HTML structure of pages is another extremely informative thing -- HTML tables have very rich information that is not identified using the techniques I described. So yeah, I think there's a whole additional direction that pulls those additional inputs into the computation.

>> Question: (Inaudible) --

>> Douglas Downey: Yeah, it's possible. I guess I don't know too much about the state of the XML version of pages being useful for extracting real knowledge, but I can see it happening. And actually an important caveat to all this work is that maybe we're just wasting our time until there really is a semantic version of every web page out there. People would just publish the semantic content of their page in XML, and then everything that I described is useless. That is something that could happen. Yeah. All right. Thank you.

(Applause)

>> I'm not sure I remember -- it's been a while. So Chris has been in the Natural Language Processing Group for eight years now, I think.

>> Chris Quirk: Just under seven years.

>> Yep. And he has been working a lot in the area of statistical machine translation, and when it comes to statistical machine translation -- you can ask him later, but I don't know what he has not worked on. So instead of listing what he has worked on, just assume it's a lot of different things. And today he's going to talk about models for comparable fragment extraction for acquiring data for statistical machine translation.

>> Chris Quirk: Thank you.

>> Let's see, do we need to --

>> Chris Quirk: Do we need to push that thing? Oh, perfect. Oops.

(Laughter)

Here we go. There we go. All right. Thanks so much, Michael. So today I'm going to be talking about joint work with Rob Vandero (inaudible), who's a researcher at Microsoft Research India in Bangalore, and with Roman (inaudible), who is also here in Redmond.
So as Michael said, this talk is mostly about extracting fragments -- fragments of parallel data -- from comparable corpora. The motivation, I think, is pretty clear. All these data-driven machine translation systems, including statistical machine translation systems, are gated by the amount of parallel data available. In fact, one of the best ways to improve a statistical machine translation system is to throw more data at it; that seems to be a consistent way to improve things. And yet finding parallel data is a bit of a tricky feat. There is a certain amount of data available in major languages, especially from governmental sources like the European Parliament, and they give you a surprising amount of general domain coverage. Yet it's not a perfect cover of the language. So if we hope to find sources of translation information about broader domains, about new language pairs, then I think in the future we are going to have to look beyond simply parallel data to more comparable sources. By comparable I tend to refer to things like news documents that are written about the same event but are not sentence-by-sentence perfect translations of one another.

And this is by no means a new observation. People in the past have tried to learn bilingual lexicons from comparable data, and have tried to extract whole sentence pairs from comparable data -- that sounds like a fantastic thing when you can find them, but they're rare. There's been some interesting work recently on extracting named entity translation pairs using a variety of signals: distributional similarity as well as temporal and orthographic similarity can be interesting cues for this. But I think one of the most promising approaches is the one taken by Munteanu and Marcu, which is to find arbitrary subsentential fragment pairs within sentences. It subsumes all the rest of these categories, so if done effectively it can gather a huge amount of parallel data.

As an example of the sorts of documents that we're hoping to work with, here is an English-Spanish pair about the untimely death of a linguistically gifted parrot. And yeah, it's kind of an interesting article, actually. The first thing we can see is that there are indeed almost completely parallel sentences in these documents, although they lie in different places and the documents as a whole are not sentence-by-sentence parallel. Almost all the information in the Spanish side is replicated in the English, except for that small phrase, including zero. So there is a certain level of parallel sentences. However, I went through and highlighted all the information that's basically parallel, and we can see almost 50% of the words are in common between these two articles. Yet very few of them are whole sentence pairs. So there's a lot of data, but we're going to have to use slightly different techniques in order to get at it. In particular with news articles, one really effective source of parallel data is quotes. Right? Direct quotes of people are almost always translated very literally, so that's great for parallel data. But as you can see, they are often subsentential.

Okay. So the first part of the process I'll be describing is not anything particularly new. If we're going to try to extract parallel data from document sources such as large news corpora, then the first thing we need to do is find documents that are describing approximately the same event. This portion is kind of focused on the news aspect.
There may be other sources of comparable documents that already have a base alignment, where this wouldn't be necessary. Here we're not going to do anything new; this follows exactly what Munteanu and Marcu did, and I think it's a very sensible approach. We require three resources. The first is a seed parallel corpus that provides some base translational equivalence knowledge, and then a bank of target language documents and a bank of source language documents that presumably describe some of the same events. The first thing we're going to do is use standard word alignment techniques to build models that give us a base notion of translational equivalence. Then we can take all the documents in one language -- I'll pick the source -- and build an inverted index so that we can quickly look up a document by the words it contains. Then we can march through each target language document and find similar document pairs using cross-lingual information retrieval. This isn't magic: just take each document, represented as a bag of words, and find its likely translations from the word alignment models. So you have a source language bag of words, and you issue that as a query against the source language corpus. Simple measures like BM25 are sufficient for finding documents that are actually surprisingly similar. This gives us a source of document pairs that are talking about basically the same thing -- they hopefully have some concept overlap.

Furthermore, this talk isn't going to look at documents as a whole. Instead it's going to look at promising sentence pairs within those documents. So we do a simple filtration process: find sentences within these document pairs that contain some terminology in common and restrict our attention to those. So this is how the task is set up: given millions, maybe tens of millions, maybe even hundreds of millions of sentence pairs with some content in common, try to identify a hidden fragment alignment between them. Going back to the parrot story, we can see a sequence of words like this, and the quoted segment is in common, as is "this pepper bird told," but the remainder of the information in the sentences is not in common. This is exactly the sort of information we'd like to extract.

So let me spend just a moment talking about Munteanu and Marcu's approach. It's motivated by an idea in signal processing. The idea is that we can again get some notion of translational equivalence and give words a score between negative one and positive one, which indicates the strength of belief that a word has a corresponding translation on the other side. Let's go to the graph; it's a bit more clear. Along the X axis we can see the words of some English sentence, and underneath it we can see a corresponding Romanian sentence. Jarglish (phonetic) is Romanian, so he's looking for Romanian parallel data. Now what we can do is take each English word in turn, find its most strongly associated Romanian word, and give it a score along the Y axis according to its strength of association. If a word has no strong correspondence, then it gets a score of negative one. This gives the very noisy blue dotted signal that you can see. So they run a moving average filter across this signal to smooth it out and then search for segments of at least three words that are all positive. Okay.
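As a rough illustration of the smoothing-and-thresholding step just described (an editorial sketch, not Munteanu and Marcu's code), here is one way to take per-word association scores in [-1, 1], apply a moving average, and return runs of at least three consecutive positive positions.

```python
def smooth(scores, window=3):
    """Moving average over a list of per-word association scores in [-1, 1]."""
    half = window // 2
    out = []
    for i in range(len(scores)):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out

def positive_runs(scores, min_len=3):
    """(start, end) indices of maximal runs of consecutive positive scores."""
    runs, start = [], None
    for i, s in enumerate(scores + [-1.0]):        # sentinel closes any open run
        if s > 0 and start is None:
            start = i
        elif s <= 0 and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs

word_scores = [-1.0, 0.4, 0.8, 0.9, 0.7, -0.2, -1.0, 0.6, 0.1, -1.0]
print(positive_runs(smooth(word_scores)))
```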
So in particular this circled segment is something that seems to have a translation. All the words on the English side that participate in such a fragment are assumed to have some correspondence; they're set aside. Then we flip, do the exact same process based on the Romanian, pull out all those words that have a parallel correspondence, and we end up with some subset of the words that act as a parallel fragment. It's a decent starting point and it leverages some of the knowledge that we have. But there are downsides to this approach. First of all, the mapping from probabilities to scores between negative one and positive one is somewhat heuristic, and it seems like a more well-founded probabilistic model might be able to better leverage the data that we have. Second, the alignments from English to Romanian and from Romanian to English are computed completely independently, as are the words that participate in fragments. So an English word might be aligned to a stranded Romanian word -- we might pull an English word into the fragment on one side and not pull in its correspondent on the other side, which is a slightly strange thing to do. So we are not totally convinced that the fragments are parallel. And finally, in this sentence there's only one fragment that we extract on the English side, but sentences may contain multiple fragments. This approach just runs all the words together and doesn't have any notion of what the subsentential fragment alignment was like. So these are some things that we hope to improve on.

Okay. So our approach instead was to build on purely probabilistic models, and we ended up doing something simple. The motivation here is that we think a fragment within a sentence pair should be considered parallel if and only if the joint probability of those two word sequences is greater than the product of the marginal probabilities. Likewise, we can express this fact in terms of conditional probabilities: the conditional probability of the target given the source is greater than its marginal probability. To this end we'll present two generative models of comparable sentence pairs that try to capture that. The hidden alignments behind these models will identify the fragment correspondences. And we tried to address some limitations of prior methods -- namely that the selection of the alignment was independent in each direction -- and we model the positions in the sentences, etcetera. But we do follow Munteanu and Marcu's lead in evaluating in terms of end-to-end machine translation quality. That seems like the best way to identify how good the information you're extracting is.

Okay. So let me do a very brief review of word alignment techniques. I'll be focusing on HMM-like word alignments. The idea is that we are given an English sentence and we want to predict the probability of a Spanish sentence. The HMM models this by saying there's a hidden bit of information, which is of course the word correspondence between the two sentences. So the English words form the state space that's used to generate the Spanish side. In particular, we march along left to right in the Spanish sentence. We pick a state that generates the first Spanish word according to a jump distribution. In this case "El" had no strong correspondence on the English side, so it's generated according to a null distribution. "Fraude" was generated by "fraud," and so on, until we've generated the whole sentence. So the probability of this whole sentence and its alignment is the product of those individual probabilities.
And the probability of landing in a state is conditioned only on the distance from the prior state, so there is some bias towards making a somewhat monotone alignment through the sentence. Furthermore, even in truly parallel data we have a certain amount of mismatch between the languages. For instance, in this sentence the Spanish uses a determiner where the English does not; it says "el fraude" instead of just "fraud." So these simple IBM models and the HMM model learn a multinomial distribution over target words given source words, including null, so the model can generate little function words like this as necessary in order to model the sentence.

As we move from parallel data to comparable data, one simple way to account for words that have no strong correspondence in the other language is to say they're generated by this null state. However, that's not a good idea, because we hope the distribution of words given null will model those things that are systematically inserted during translation. We don't want to learn that "situation" and "actual" are systematically inserted during translation; that's the wrong sort of information to give it. So instead we're going to expand the base word alignment model a little bit and end up with something that we call comparable model A.

So I hope the math isn't too daunting. The top is the true generative story, without any approximations, of how we might generate a target sentence and an alignment given a source sentence. First we predict the length of the target side. Then we predict each state conditioned on all prior information, and then we pick the translation conditioned on that state and all other prior information. In practice, the standard HMM says that the probability of landing in state a_j is conditioned only on the prior state, and the probability of generating word t_j is conditioned only on the source word that it was aligned to. Comparable model A makes only one small extension, which is to add an additional state that says we can generate words monolingually. And when we generate a word monolingually, instead of conditioning on the source word we use standard n-gram language modeling techniques: we generate it according to the context in which it occurred.

Okay. So graphically it looks something like this. We've added an additional row at the bottom, this "gen" model, that accounts for that probability. A bilingual word now has three pieces: we predict that it's generated bilingually, we predict its jump from the prior state, and we predict the word itself in the context of the state that generated it. Monolingual states only have two portions: we predict that it's monolingual and then we predict the word in its trigram context. Now, those of you familiar with HMM techniques know that we can just throw a standard HMM at this problem and find the most likely state sequence in a small polynomial time. Right? So this is one means of finding parallel fragments using only n-gram language models and HMMs.

Okay. So the one remaining thing is that we're only left with a word alignment, when ideally what we wanted was a fragment alignment. One simple idea is to again look at sequences on the Spanish side that were all generated bilingually, find sequences of at least three words, look at the range of English words that they were generated from, and place some simple filtering conditions on that. So this is one simple model for extracting parallel data. Okay.
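As an editorial aside, here is a deliberately simplified sketch of this fragment-labeling idea. It makes an independent decision per target word with a uniform position prior and a bigram context, whereas the model just described performs a joint HMM decode with a jump distribution and a trigram context; the names and parameters are illustrative, not the actual implementation.

```python
import math

def label_fragments(src, tgt, p_trans, lm, p_mono=0.3, min_len=3):
    """Label each target word as bilingually or monolingually generated, then
    return spans of at least min_len consecutive bilingual words.

    p_trans: dict (src_word, tgt_word) -> prob.  lm(prev_word, word) -> prob.
    """
    labels, prev = [], "<s>"
    for t in tgt:
        # Best bilingual explanation: most probable source word, uniform position prior.
        bi = max((p_trans.get((s, t), 1e-9) for s in src), default=1e-9) / max(len(src), 1)
        mono = max(lm(prev, t), 1e-9)
        labels.append(math.log(1 - p_mono) + math.log(bi) >
                      math.log(p_mono) + math.log(mono))
        prev = t
    spans, start = [], None
    for j, bilingual in enumerate(labels + [False]):   # sentinel closes open runs
        if bilingual and start is None:
            start = j
        elif not bilingual and start is not None:
            if j - start >= min_len:
                spans.append((start, j))
            start = None
    return spans
```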
So the nice thing about it is that it's a relatively simple and clean extension of the HMM model, and it's really fast and easy to implement -- something you can actually use to extract fragments. But there are limitations we'd like to address. First of all, the alignment method itself is asymmetric; that is, we're only looking at one direction of these correspondences, not both directions. Secondly, this is a Markov model: there's nothing to prevent us from revisiting "Dominican" or any random state again and again throughout the sentence, so a single source language word can participate in multiple fragments, which is not ideal. Third, there are several free parameters to tune. And fourth, we're looking at only a single Viterbi alignment, which is okay but not ideal for modeling more phrasal translations, where finding the exact word-to-word correspondence is somewhat more difficult.

Okay, so that leads into the second model we explored, which is what we'll call generative model B. This is a joint model of source and target fragments -- it generates both of them jointly, as opposed to generating the source conditioned on the target or vice versa. The generative story behind this model is: first we pick a number of fragments, and then each fragment generates some number of source words and some number of target words together. It could generate only source words, for a monolingual fragment, or only target words, as well. Again, we'll use source and target n-gram language models to estimate our parameters and start from HMM models of source-given-target and target-given-source. Monolingual fragments are scored only according to the language model, and for bilingual fragments we have two ways of estimating the probability -- we have the two marginal and conditional distributions, both of which we can get a joint from. Since we're interested in the highest precision fragments that we can possibly find, we use the minimum score out of these two. That means that both conditional models have to agree that a fragment is parallel before we will accept it.

Graphically the model looks like this. As opposed to predicting one word at a time, we are predicting whole blocks of the sentence at a time, and they are scored according to the formulas on the previous slide: parallel fragments are scored according to the minimum of the two conditional estimates, and monolingual fragments are scored only according to the language model. And we progress this way through the whole sentence. The search here is pretty easy. Since we don't condition on anything about the prior fragment structure, we only need to remember the one best fragmentation that covers the first j source words and the first k target words, and then we can use a simple dynamic programming algorithm to find the best sequence. It works okay. Unfortunately, it's really slow. Those of you who have implemented string edit distance before will note a similarity between this and string edit distance: we can do an insert, a delete, or a copy, except the insert, delete, and copy operations are of arbitrary length -- we're not just using one character at a time, we can use arbitrary string lengths. So that expands the search computation from order MN to order M squared N squared, where M is the length of the source sentence and N is the length of the target.
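For readers who want to see the shape of that search, here is a compact editorial sketch of the unpruned chart dynamic program just described. The fragment scoring function is left abstract, and none of the speedups discussed next (upper bounds, pruning, beam search) are included, so it visits the O(M^2 N^2) cells mentioned above.

```python
def best_segmentation(m, n, frag_score):
    """chart[j][k] holds the best score of any fragmentation covering the first
    j source words and the first k target words. frag_score(j0, j, k0, k)
    scores one fragment spanning source words j0..j-1 and target words
    k0..k-1; either side may be empty, giving a monolingual fragment.
    """
    NEG = float("-inf")
    chart = [[NEG] * (n + 1) for _ in range(m + 1)]
    back = [[None] * (n + 1) for _ in range(m + 1)]
    chart[0][0] = 0.0
    for j in range(m + 1):
        for k in range(n + 1):
            if chart[j][k] == NEG:
                continue
            for j2 in range(j, m + 1):
                for k2 in range(k, n + 1):
                    if j2 == j and k2 == k:
                        continue                    # each fragment must consume something
                    s = chart[j][k] + frag_score(j, j2, k, k2)
                    if s > chart[j2][k2]:
                        chart[j2][k2] = s
                        back[j2][k2] = (j, k)
    return chart[m][n], back                        # best score and backpointers
```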
In addition, there are these probability scores that need to be computed. The language model scores can be computed relatively cheaply, just by looking at the scores of each segment, but the joint probabilities, marginalizing out across all the HMM alignments, take a little bit more time. So to speed up the search process we introduced several optimizations. First, we can note that the score of a fragment that covers from j-prime to j and k-prime to k in the sentence is bounded above by predicting that span of source words given the whole target sentence -- as we add more information we only increase the probability of a segment. Furthermore, we'll only use a bilingual span if it's more probable than a monolingual span, so we can focus our attention on those spans for which this bound is greater than the marginal probability. This can be found relatively quickly by dropping some transition probabilities or estimating them, and we can cut it down to about cubic time. Next, we only score bilingual segments where both upper bounds exceed the monolingual probability, which cuts down the number of places we need to visit in the search space. Then some bounds on fragment length and fragment ratio further cut the search space. That series of optimizations made the procedure about 10 times faster, which is still not quite enough given that we want to run this over tens of millions of sentences. So in the end we also did a beam search. The simple idea is that instead of keeping the best hypothesis for each distinct number of source words and target words covered, we keep stacks of hypotheses based on the total number of words covered, and at that point a future cost estimate becomes necessary. The paper has more details on this, so I'll leave the technical stuff to that. Overall these optimizations produce about a 100x speedup over the baseline algorithm and only affect model cost by less than 1%. So it's a pretty effective means of (inaudible).

Okay. So that's model B. It's a symmetric model, which is nice. Each word definitely participates in exactly one fragment. We're marginalizing out across word alignments inside fragments, and there are no free parameters. So it sounds awfully nice. There are some limitations, though: the fragment alignment must be monotone, it may be subject to search errors, and still, even with all these optimizations, it's slower than model A.

So that's all I have to say about the models themselves. In terms of what we get out of them, I think there are two things we're interested in. One is how much parallel data we can get from these comparable corpora, and the other question is what sort of impact they have on an end-to-end translation system. For our machine translation system we just did something simple: we used a phrase-based decoder, much like Pharaoh, though ours is an in-house version. And we didn't do anything unusual -- standard methods of estimating probabilities, with a trigram language model and a distortion penalty. Our baseline was to train a machine translation system on only the parallel data, and the comparison systems append the extracted fragments to that parallel data and then retrain the whole system from scratch. In terms of fragment extraction: for our seed parallel corpus we started with European Parliament data -- about 15 million English words of parallel English-Spanish data -- and then for our fragments we looked at the Spanish and the English Gigaword corpora. So these are both quite large.
The Spanish is not quite one billion words -- it's 700 million -- but good enough. And we compared three approaches. The first was an in-house implementation of Munteanu and Marcu, which we think is pretty close to what they implemented. You'll notice that it extracted above 60 million English words, which is almost 10% of the Spanish data, so a significant fraction of the data is actually being extracted. It's hard to evaluate this stuff in isolation; looking at it, it seems to have some good parallel data, but it has a lot of noise. Model A extracted a significantly smaller amount of data, but it looked much cleaner, and model B was somewhere in between. So the story is that even from just large bodies of news corpora we can find a huge amount of at least somewhat parallel information.

In terms of MT quality, we evaluated on three test sets. The first was the in-domain European Parliament test set, and you'll notice that none of these methods succeeded in improving quality in that domain. I don't think that's terribly surprising, right? We're not going to go out to news corpora and figure out how to translate things in this domain, although it is nice to see that both of our models either didn't degrade or didn't significantly degrade the baseline system. Adding in the noisy segments from Munteanu and Marcu, however, did seem to have a noticeable impact on in-domain translation quality. Along with the Europarl data there is also some news commentary data, which the organizers of these SMT workshops claim is out of domain; it's not actually very out of domain -- the out-of-vocabulary rate is still less than 1%. So we tried it out on this source and found that all of the methods had a small but positive impact on this translation material. Where this data really seems to make a benefit, however, is on data that users actually want us to translate. We have a live translation website up now, which Anthony talked about during the last symposium, so we started gathering the sorts of requests that people want translated, and using the fragments here makes a significant difference -- almost up to three BLEU points of improvement against the baseline system. So the potential for impact is really quite large, especially in out-of-domain scenarios.

Okay. So that's all I have to say about this. We presented two generative models for comparable corpora, and these are clean methods for aligning nonparallel data. In terms of future work, I think there are a lot of additional sources that can easily be exploited. For instance, Wikipedia has a huge amount of parallel or comparable data, and it seems like it could improve quality in a broad set of domains. I do think there's more work to be done in terms of parameter optimization and bootstrapping, as well as exploring different model structures. And finally, it would be kind of interesting to apply some of these models to our parallel data to see if they could do some sort of automatic cleaning -- find the information that's not actually parallel within it and see if the translation system would perform better. So that's all I have. Thanks very much.

(Applause)

>> Yes?

>> Question: Very interesting. One thing I was surprised to see was how large the seed corpus was, though. Have you looked at how performance falls off as that gets smaller?

>> Chris Quirk: I haven't done that.
That's a really good question, because we'd love to know what kind of impact it could make on language pairs where there is only a tiny amount of parallel data. No, no, sorry -- I haven't done enough experimentation yet. Yes?

>> Question: Yes, (inaudible) -- they are not contiguous. I mean, do they have to be contiguous?

>> Chris Quirk: They do have to be contiguous, yeah, and if there's a gap in between you have to extract two fragments instead.

>> Question: Right. So (inaudible) --

>> Chris Quirk: That makes the search space potentially larger, which kind of scares me. But it's possible, you know. And maybe approximate methods are sufficient. I think some of these things -- like the fact that you can make it so much faster without having any major impact on model cost -- make me think that we're just trying too hard at this search stuff. We could get away with something simpler. You know, maybe something like sampling in order to look at the state space quickly. But yeah, it's an interesting idea. Like, do you want to chop up a large fragment into two pieces because there was one little bothersome --

>> Question: (Inaudible) also short on (inaudible) you have (inaudible) --

>> Chris Quirk: Yeah. Yeah, that's true. Yes?

>> Question: When you say it has to be contiguous, you mean with the exception of those function words that you were talking about -- with the exception of those?

>> Chris Quirk: Exactly. Yeah. So anything that can be generated by the HMM's null model is allowed. Yeah?

>> Question: (Inaudible) have you tried just English/Spanish right now, or have you tried other language pairs as well?

>> Chris Quirk: I've only tried English/Spanish so far. And yeah, as we just talked about, it would be even more interesting to try it on lower density languages, but there's a lot of demand for English/Spanish translation, interestingly, so that's been part of the reason we focused on that language pair. All right. Well, I think there's food back there and I should let you guys enjoy it. Thank you so much for your time.

(Applause)