>> Host: Hi, everyone. It's my pleasure this afternoon to introduce Chris Burges. Even though we've been colleagues at MSR for a number of years, it wasn't until I read his bio this morning that I realized he had started in theoretical physics. Since there's been no mention of supersymmetry or the Higgs boson this morning, I think it's fair to say he's moved a considerable distance since then. Chris's work in machine learning has spanned everything from handwriting recognition to audio fingerprinting, information retrieval, support vector machines, neural networks, and a number of other techniques. Today he's going to be talking to us about one of his current interests, which is semantic modeling of language. >> Chris Burges: All right. Thank you very much. Fantastic. So I'm wired, thanks. So I'd like to start by thanking the organizers -- can everybody hear me? Okay in the back? Can you hear me? Up? Okay. Okay. I have a quiet voice, even with amplification. So I'd like to thank the organizers and the program chairs, in particular Will Lewis and [inaudible]. I'm honored to be given the chance to talk to you folks today. I was asked to be somewhat controversial and also technical, and I'm happy to try to oblige. But technical and controversial are hard to do at the same time, so I'm going to split the talk into two. The first part will be more speculative and perhaps thought-provoking, big picture, and the second will be some resources we've been working on that will be publicly available. And another reason I'm very happy to talk today is that we've been working on sort of public service datasets to help forward the field, we hope. So I'd like to especially thank these folks. I'm using some of their slides and information in this talk. I think most of them are here. So let's start with AI. So artificial intelligence. The quest is as old as the hills, as old as computer science. And unfortunately, it has a long history of high expectations not being met. So people approach this with trepidation. And when I tell people, sort of tongue in cheek a little, but not completely, that I want to work on strong AI, I expect the kind of response like this [laughter], and I have gotten it. But what was unexpected and interesting was this one. And the why wasn't so much why on earth would anybody want to solve this problem -- it's a very interesting problem to solve. It was more: amidst all this wonderful progress from deep learning and everything else, why attack such a hard problem? Why not just, you know, do something equally impactful perhaps and simpler? So this talk I hope contains my answer. If it doesn't, let me know at the end. So let's take a look back at artificial intelligence and ask how far we've come. To paraphrase George Santayana, those who ignore history are doomed to repeat it, so it's good to look at the history here. There's actually a page in Wikipedia devoted to AI winters, and this is from that page. This is a list of events that sparked an AI winter. The name winter denotes a cycle, so it looks like something like this. It's interesting to me that a system that makes all the world's information available to billions of people was not on the AI radar back then. Namely, web search. I mean, it doesn't do all of this, but it's come a long way. So let's just look at the first three of these winters. The failure of machine translation. So MSR Translator now has over 100,000 companies using its services, and it's widely used internally as well. 
And to give you a feel for the growth, this is a graph of daily requests for translations from April 2009 to August 2013. The numbers are proprietary so I can't show you the y axis, but you can get an idea of the growth: a factor of 400 since the beginning. The gaps are due to gaps in data collection, and the outlier spikes are often due to internal experimentation or sometimes [inaudible]. Rick Rashid gave a demo in Beijing you might have seen -- it's there in case you didn't -- in which his speech is recognized, translated, and regenerated in Mandarin using a model of his own voice. So we've made some serious progress on MT, but it's still far from perfect. We've got a long way to go, I think everyone would agree. The abandonment of connectionism is the second AI winter. Let's look at that. So ImageNet is the image equivalent of WordNet. Images are placed in an ontology like [inaudible], and there are several ImageNet tasks. This one is classification. So top five correct means that the correct class is in the top five choices here. This is a hard task. A thousand classes. And deep convolutional nets are doing pretty well. On the left, the true label is in pink, and the scores the system gives are shown by this histogram. So it gets things wrong, but understandably this looks like a -- actually, let's do this. More like a dalmatian than a cherry perhaps right here. And it's hard to tell the difference between an agaric and mushrooms. But anyway, we're doing well using old tricks. Most of the tricks -- most of the technology is quite old. But faster parallelized hardware and more data have been really crucial here. The second task is even harder. 22,000 categories in ImageNet, 7 million test images. It's much harder. The numbers here are accuracies, not error rates. So it's inverted from the previous slide. And an informal test by Trishul Chilimbi showed that humans probably get around 20 percent correct on the top choice on this dataset. So, for example -- and I should mention that Trishul has some ground-breaking results. I can't tell you the actual numbers because it hasn't been published, but -- on this dataset, beating these numbers, significantly. So these four images were taken from the training set. Can anybody tell which is the parliamentarian and which is the Pembroke Welsh Corgi? You don't get extra points for saying these guys are not Corgis. So it turns out this guy's a parliamentarian and that guy is the Pembroke Welsh Corgi. So this is tough, a tough problem. So the third area I went to was the frustration of speech understanding research. NIST has been doing evaluations on speech since 1988, and this graph shows those evaluations from that time until 2011. The y axis is word error rate, and in the early years the domains were narrow, such as air travel planning, and performance was on a par with human capability. But open domain problems are much harder. Things like meeting transcription, television news, captioning. Switchboard is a benchmark, and in the late '90s there was good progress, but for the past decade it plateaued, and then from 2010 to 2012 MSR made significant improvements using deep convolutional networks. Again, fairly old technology with some new tricks. So what changed here? I think that there are three things that have changed, and there are two really big things that have changed. The first is compute power. From 1971 to 2011 transistor density increased by over a factor of 1000. We also have huge clusters. 
So Amazon's elastic compute cloud has -- is estimated to have over a half million servers. And the web -- we have huge datasets; you can think of text on the web as weakly labeled data. And it's true that algorithms have progressed a lot, but I wouldn't say with orders of magnitude improvements in accuracy. I mean, we've gone from sigmoid transfer functions and units to rectified linear transfer functions, and there are lots of other cool tricks, but they're not -- I just don't think you can point at them and say that they're responsible. These ideas are pretty old. So should we just continue down this path and forget strong AI for now? And for many of us, the perfectly reasonable answer is yes. It's perfectly fine. So let's take a look at AI -- where we'd like to be, where we're at, and where machine learning is at -- and kind of lay out what we think we need. So this is my blue sky slide. You may say there are several. But here are problems we'd like to solve automatically. I'll just let you read them. These are incredibly hard. Right? But it's tasks like this that make me excited about this field, and it's what got me into it, and I hope the same is true for some of you. So that's where we'd like to be in the end. Right? But it's very far away. It may seem far-fetched. So how far are we? Well, there are certainly challenges in NLP. I invented this sentence here: I house table but them eaten never turnip boat. I don't think that's grammatical, but the technologies that users use to check grammar are not very helpful in this situation. So it's largely unsolved. Paraphrase detection is also largely unsolved. This is an example of paraphrase succeeding, using Chris Quirk's system that takes a sentence, translates it into many languages, and translates it back into English. One of these sentences was written by Conan Doyle and the other was generated automatically. Anyone want to take a guess which is which? >>: One is automatic. >> Chris Burges: That's right. I think probably today you'd get better marks from an English teacher with the first sentence because it's shorter and contains the same information. The first is the automatic one. But paraphrase is a largely unsolved problem. This is just one example that works, but there are many that don't. And I think there's a fundamental reason, which is that we need world models to inform our meaning representations, but we need our meaning representations to properly [indiscernible] our models. So there's this cycle we need to break into. And people are doing it. There's exciting progress. For example, this little piece of text is from data that I'll be talking about in a minute, and to solve it there's technology right now, by Luke and others, that we might be able to adapt to do this. And it's progressing rapidly. So there's certainly interesting progress and fascinating stuff, as attested by this conference. And another example, this is a slide from [indiscernible]'s work. So sequencing technology provides rapidly growing genomics databases, and it needs a systems biology approach to interpret this data for discovering disease genes and drug targets, and the bottleneck is genetic knowledge, especially pathways, which are very complex, incredibly complex things to model. But they're in the text. They're in these papers. So we'd like to automatically extract this. And this is an example of a pathway we'd like to extract given this input from some paper. So, again, work is ongoing, and it's very exciting. 
So how about new kinds of machine learning? Is the machine learning that we have today pretty much enough? Machine learning is largely a statistical modeling discipline currently. It's used to learn from structured data with structured outputs, but I do think it's missing a few key ingredients. This is a typical bread and butter machine learning task. This is the first 800 threes in the MNIST benchmark set -- a modified NIST set that, actually, I had a hand in helping create for a machine learning [inaudible]. And it's just a classification task. And it's learning by example, which is essentially what the speech and ImageNet results I just showed you used. So this is the typical thing that people do. Another example, this is unsupervised learning. I don't know if you've seen this result, but this was -- actually [indiscernible] was an intern here. Not when he did this work. They used a deep autoencoder trained on images gathered from YouTube, and then they took one of the units in the middle of the autoencoder and asked which inputs light up this unit the most, and they regenerated this picture of a cat. So they generated a cat neuron automatically, which is kind of amazing. There's this grandmother neuron idea of how the brain works. But text just seems way, way harder. It has more recursive, fine structure. It's not typically addressed by statistical learning models. It's hard to see how statistics would solve this: The ball fell through the table because it was made of paper. The ball fell through the table because it was made of lead. So in those two cases "it" refers to something different. And you have to know that the table, if it was made of paper, could have a ball fall through it, but if it's made of lead it probably couldn't, to solve it. These are called Winograd schema sentences. I'm coming back to this later. So progress has certainly been made in machine learning on language. Here's one intriguing example. This is called the skip-gram model. There are certain semantic properties encoded by vectors. So you learn the vector that predicts the two previous and the two following words automatically from a ton of unlabeled data. 30 billion words. And you get vectors. And it turns out that if you take the vector for Madrid and subtract the Spain-ness from it and add the France-ness to it and ask what's the closest vector to this one in the space I have, you get Paris. So Paris is Madrid with Spain-ness removed and France-ness added, which is amazing, right? I mean, good lord. And then they extended it to phrases, what they call analogical test samples, and they get pretty good accuracy. So an example being New York is to New York Times as San Jose is to San Jose Mercury News. That's not easy to do. Seems like a step in the right direction. Now, you've got to be careful, though, thinking that we need to model meaning when we don't necessarily have to. Statistics either is or are powerful depending on whether that's plural or singular. Is marble cake a food or a cake of marble? You can probably solve that automatically just by searching for patterns like "verb phrase, marble cake" and "verb phrase, noun phrase" and counting the cases where the noun phrase is a known food. A more interesting example is this one: John needs his friend. John needs his support. To whom does "his" refer? This is a paper by Bergsma and Lin. 
You just look at a ton of unlabeled data, look for these patterns where this pronoun occurs with this word, and see what fraction of the time the gender agrees and the number agrees, and it gives you evidence that in the pattern "noun phrase needs pronoun's support," the pronoun likely does not refer to the noun phrase. So there's certainly a lot we can do with pure statistics. Surprising things we can do. This looks like a very tough anaphora resolution problem. But it seems to me that such tricks won't be enough probably. Here's more examples like the ball fell through the table. It's called the Winograd schema. And the first sentence was proposed by Terry Winograd in 1972. There are 141 examples of these things on this website. And it's hard to see how statistics alone would answer these. Jane knocked on Susan's door but she did not get an answer. So is that Jane or Susan that "she" refers to? And beyond that, these are very specific problems. They're anaphora resolution problems. But there's lots of other problems, of course, where you need world knowledge to know the answer. So I don't think statistics alone or typical machine learning approaches will solve this. So I claim we need a lot more from machine learning. I'm only going to talk about one of these things now, but it would be great to have interpretable systems -- systems where you can say, if this makes an error, I understand why it made the error. With a giant six-layer neural network, it's very hard to say why it made this particular error. There's a decision surface in some space. Scalability. We'd like our systems to be -- to work in different domains and on arbitrarily large datasets. We'd like them to be modular and composable for debugging purposes. People are thinking about this, but it's not front and center. And I think it will need to be for us to really make progress. So let's look at correctability. What do I mean by correctability? There are two things, very closely related, that I mean. One is stability. So decision surfaces are not stable on adding new training data. So here we've got this decision surface separating the green from the red. I add a green point right here and now the red point is misclassified. Right? Now that I've retrained. This is in stark contrast to human learning, which is very compartmentalized. So this would be like when a child learns to ride a bicycle, he forgets how to brush his teeth. Right? It just isn't how we learn. We're robust. We can learn something new without forgetting something we already knew. So we probably don't learn by moving giant decision surfaces around in some high-dimensional space. We want to be able to correct errors without introducing new errors. Another way of looking at the same problem is separability, and that is the separation of problems from subproblems. So now we have this same problem and we want to solve the misclassified green point by adding training data, but that's like -- when a child makes a mistake, we don't lock him in the library for a week and ask him to update his parameters, right? There are very few bits transferred from student to teacher. If you measure what they actually say, they say very little. So there must be some rich shared world model that each actually has a model of in order to have this communication occur. And we're just miles away from this kind of modeling, I think, today. And machine learning, you know -- we need to think about this, I think. All right. So the nature of meaning is slippery. 
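As a concrete, purely illustrative way to check that stability point, one could do something like the following sketch. It assumes scikit-learn and a synthetic dataset -- it is not anything from the talk -- and simply retrains after adding one point and counts how many previously correct predictions flip:

    # Illustrative only: how much does one extra training point perturb a decision surface?
    import numpy as np
    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.3, random_state=0)
    X_train, y_train, X_test, y_test = X[:150], y[:150], X[150:], y[150:]

    clf_before = SVC(kernel="rbf", gamma=2.0).fit(X_train, y_train)
    pred_before = clf_before.predict(X_test)

    # Add a single new training point (the "green point") and retrain from scratch.
    x_new, y_new = np.array([[0.5, 0.25]]), np.array([1])
    clf_after = SVC(kernel="rbf", gamma=2.0).fit(
        np.vstack([X_train, x_new]), np.concatenate([y_train, y_new]))
    pred_after = clf_after.predict(X_test)

    was_correct = pred_before == y_test
    flipped = was_correct & (pred_after != pred_before)
    print(flipped.sum(), "previously correct test points changed label after adding one point")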
Actually, I'm going to ask you to give me a 10-minute warning because my clock is completely off. And so I'm not going to try and define meaning because that's just a black hole. It's been discussed since Aristotle's time. It's much safer to use operational tests. And so what do we have for operational tests of meaning? There's Turing's test. But it encourages the wrong thing -- deceiving humans, as Levesque pointed out. And participants must necessarily lie. If you ask the computer about his mother or the person about what operating system he runs, he has to lie [indiscernible]. And, actually, we wouldn't mind knowing we're talking to a computer if it's much smarter than anybody we could possibly be talking to, right? That's a good thing. So how about the Winograd schema? It's beautiful data, but it's expensive and hard to create. It takes quite a bit of creativity to generate this data. This is the original sentence Terry Winograd proposed, but it only tests pronoun and anaphora resolution. So question-answering is huge, and I think it's a useful thing to view comprehension as simply question-answering in the following sense. If any question you could ask can be answered by the system, then the system comprehends the text. And this we can measure. The one thing I don't like about this definition is that it has humans in the loop. It's very hard to come up with definitions that involve meaning that don't have humans, because then you're defining meaning. So we have humans in the loop just like Turing did. All right. So I'm going at a good clip, so I'm going to go on to part two. That was the speculative part. If anybody has questions, by the way, you can jump in during the talk if you like or hold them until the end. It's up to you. So now we generated two datasets because there was a need for these datasets to help, we think, researchers progress in semantic modeling. And, in fact, progress is often tied to the availability of good benchmark test sets, everybody I think would agree. So what are the properties we require of these kinds of datasets? We'd like them to be public domain. Copyright-free. Anybody can do anything they like with this data. There has to be a very simple, clear, unambiguous measure of success. Data collection is scalable. We'd like a big gap between humans and machines, you know, [inaudible]. A good baseline. Open domain yet limited scope. The problem with limited domain is you can solve the [indiscernible] data set very, very nicely, but then you've just solved the [indiscernible] data set. We want something that's more generalizable. And ideally we'd like a hardness knob. When people get better at it, we'd like to make the problem harder without a huge new collection of data. I think we've done this. So the first dataset is called the Holmes data. And it's joint work with Geoff Zweig here at MSR. So this was an SAT-style sentence-completion test, and the task is to select the correct word from five candidates. We took the target sentences from five Sherlock Holmes novels from Project Gutenberg, and the training set we used for the N-gram models was also taken from Project Gutenberg. We started with 540 documents and whittled that down to 522 to make sure that everything was copyright-free. So then we used N-gram models to build decoys. And these are the novels we drew the sentences from for the test data. There's huge variation in style across novels generally, which is why we settled on Sherlock Holmes. So it's limited scope in that sense. It's just one writer. 
But he can write about anything. So to generate the decoys we select a sentence at random, we select a focus word with low frequency, we compute the trigram probability, and we save the 150 highest-scoring of these low-frequency candidate words. If the original word wins, we reject that because it's too easy. Unfortunately, this introduces a bias in the opposite direction. And then we score by the N-gram looking forward. And then we retain the top 30, and the human judges, which were actually my family and Geoff's family, picked the best decoys: syntactically correct but semantically unlikely. And by syntactically correct, we just mean agree in number, gender, tense, and that kind of thing. So humans had to do some curation here. And here were the rules. They must be grammatically correct; the correct answer should be a significantly better fit than the decoys -- the decoys could fit, but the correct answer has to be clearly better. It's like an SAT test. No offensive sentences. These were written a century ago. Here's an example that requires some thought. They all require some thought. Was she his [client / musings / discomfiture / choice / opportunity], his friend, or his mistress? Well, friend and mistress are people, so it's probably client. So you have to think like that. Here's one that required world knowledge: you have to know that men are older than seven years old. And so forth. Again, this is a fairly labor-intensive thing. Here are some samples. I'll just let you look at them. I'll draw your attention to two. So with this question you need to know that fungi don't run, but that's actually not true. You could have a tree with fungi running up the side. It's just there. In this context it's true, so it's a great question. Here you have to know that you don't stare rapidly. If you look at something -- if you stare at something, you're doing it for an extended period of time. You would glance rapidly. So this does require knowledge, we think, that is hard to model. So here is how the various systems do. The human, Geoff's wife, was tested on 100 questions. The generating model didn't do as well as the N-grams because there's a bias against it, as I just explained. The smoothed N-gram models were built from the CMU toolkit, and the positional combination was us being devil's advocate -- us knowing exactly how this data was created, doing our worst to build a system to solve the problem by knowing the tricks and the biases involved -- and we got up to 43 percent. So we didn't succeed in our goal of trying to make this completely impervious to knowing how it was created, but it's still a far cry from human performance. And then latent semantic analysis does slightly better. And I'm happy to report that since then, there have been people publishing on this dataset, getting quite good results. So Mnih uses a vector space model. He predicts 10 words, five before and five after. And Mikolov uses the skip-gram model I just described and does pretty well on this. But it's still far from 100 percent, which is -- we think a less harassed human might get 100 percent on this. And the data is available at this URL. So the URLs I'm giving today contain the papers, the background, the data, everything you need to play with this stuff. All right. On to the stories data. This is the second dataset that I just want to tell you guys about. This is joint work with Matt Richardson and Erin Renshaw, and, again, there's a website where we will be collecting results. So the Holmes data was a useful dataset, but it's not really scalable. 
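To make that decoy-generation step a bit more concrete, here is a rough sketch of the backward (two-preceding-words) trigram scoring just described. The smoothing, function names, and data structures are invented for illustration -- this is not the actual pipeline -- and the forward-scoring pass and human curation that follow are omitted:

    import math

    def trigram_logprob(tri_counts, bi_counts, w1, w2, w3, vocab_size, alpha=1.0):
        # Add-alpha smoothed log P(w3 | w1, w2) from raw n-gram counts.
        num = tri_counts.get((w1, w2, w3), 0) + alpha
        den = bi_counts.get((w1, w2), 0) + alpha * vocab_size
        return math.log(num / den)

    def candidate_decoys(sentence, focus_idx, low_freq_words, tri_counts, bi_counts, vocab_size, keep=150):
        # Score low-frequency candidates for the focus slot given the two preceding
        # words, and keep the `keep` best. If the original word outscores every
        # candidate, the item would be too easy, so reject the sentence.
        w1, w2 = sentence[focus_idx - 2], sentence[focus_idx - 1]
        scored = sorted(
            ((trigram_logprob(tri_counts, bi_counts, w1, w2, c, vocab_size), c)
             for c in low_freq_words),
            reverse=True)[:keep]
        orig_score = trigram_logprob(tri_counts, bi_counts, w1, w2, sentence[focus_idx], vocab_size)
        if scored and orig_score > scored[0][0]:
            return None
        return [word for _, word in scored]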
There was a lot of work to actually wind up with these 1,040 Holmes sentences. So we wanted something also more general, because that's a particular kind of test. We want something that tests comprehension. So we went to comprehending stories with multiple choice questions. This is an old problem. [Indiscernible] in 1972 worked on this kind of thing. The difference really with this is the scalability of the data collection. So reading comprehension seems like a good way to do this. It's how we test people's understanding of text. Again, I'll just let you read this. Progress is easy to measure with multiple choice questions. One of the pushbacks we got when we sort of did this was maybe first grade is too easy, guys. Come on. Is it? So this is a parents' guide to the first grade curriculum. I don't know if you can read this, but -- let me just read it to you. Describe, retell, and answer questions about key ideas and details. Explain the differences between factual and fictional stories. That's a really hard problem to automate. Begin reading complex text such as prose and poetry. These are not machine-solved problems at all. So we do not think that first grade is too easy. In fact, we think it's the opposite. If you've solved first grade, you've probably gone a long way to solving everything. So the MCTest data looks like this. Fictional stories, meaning you can't go to Wikipedia to find the answer. The answer is in the story and only in the story. Sam's birthday is tomorrow. When is Sam's birthday? Multiple choice. We limited the vocabulary to what a typical seven-year-old would know to try and limit the scope. We keep it open domain. There are no copyright issues. This is generated using Mechanical Turk. So you're free to use this data any way you like. We looked into using SAT and the rest, but we couldn't do it. And we wound up with 660 stories, and we've done this in such a way that we think we could collect 10 times as much. If there's interest in this community in this dataset, we will collect more and support it. So we paid $2.50 a story set, because they had to write a story and four questions, and for each question, four answers, one of which is correct and three are incorrect. And we got a wide variety of questions. So Mechanical Turk has over a half million workers. They're typically more educated than the U.S. population in general, and the majority of workers are female. Just curious statistics. This is an example story set from the first 160. And to solve this -- I put the correct answer on top. That's not true in the data that's on the web. You need to know that turning two years old happens on a birthday, for example. And there are some extremely challenging NLP problems buried in this data, anaphora resolution in particular, where a pronoun refers to something several sentences back. So what we did was we did a little test run of 160 of these stories. We gathered 160 and then we curated them manually, and that led to ideas on how to do a second run of data collection that was more automated, so we know how to scale. And we also worried that people would write the ball was blue and then the question was what color was the ball. It's too easy, right? So we asked that at least two of the four questions require multiple sentences to answer. So you have to go into the text in two places. This was actually very confusing to the Mechanical Turk workers, what we meant by this. We struggled with trying to explain what we meant by this. We did fix errors on the 160, and we found certain workers to avoid. 
So we touched about two-thirds of the stories. And this enabled us to improve the process to generate much more data. So now we collected 500 more stories, and we automatically verified that the distracters appear in the story -- otherwise it's too easy, if only the correct answer appears in the story -- unless the correct answer itself does not appear in the story, because the question might be something like how many candies did Jimmy eat, where the answer is not stated in the story but you should be able to work it out. Creativity prompts: we wanted to increase diversity, so we took 15 random nouns from the vocabulary, presented them to the workers, and said use these nouns if you like, but you don't have to. We added a grammar test, which actually had a significant impact on the quality of the stories. Thank you very much. And then we added a second Turk task to grade stories and questions. So here's the grammar test. I came up with this, mostly. Half of these are grammatically incorrect, and we required that the workers get at least 80 percent correct to do the task. We found that requiring 90 percent correct would have extended the data collection time from days to weeks, so 80 percent was what we went for. I won't test you on the grammar test. So here's the effect of the grammar test. We took ten randomly chosen stories generated with the grammar test in place and ten generated without it, and then we blindly graded each for quality. And you can see it made quite a difference. The quality measures -- we do not attempt to fix the story -- were: it's bad; it's bad but rescuable; it has minor problems; it has no problems. And interestingly, somewhat randomly, it also reduced the number of stories about animals [laughter] significantly. And the quality was just better. >>: Does that include humans? >> Chris Burges: Does not include humans. So then we had each story set evaluated by 10 Mechanical Turk workers with the following measures, and we also had them answer the questions. And what that enabled us to do is to say that if the Turk worker couldn't answer the questions themselves, then they probably shouldn't be trusted -- I mean, perhaps the story is terrible -- but we threw out the workers who just couldn't seem to answer the questions themselves. So we removed the workers with less than 80 percent accuracy, about 4 percent of people, and then we just did a manual inspection. Now, this is manual. So it's less scalable. But there wasn't much time needed. One person, one day. And we actually corrected -- polished it up to make the stories make sense, or the questions and answers make sense, for this small subset of the data. So we think we can actually get a factor of 10 more data pretty easily if we need to. So this is a note on the quality of the automated data collection. The grammar test improves MC500 significantly. So here, this is without any grammar test, and this is with. The editing process improves MC500 clarity and also the number correct. And MC160 still has better grammar. So, still, we didn't get results as good as manually correcting the stories ourselves with this automated Mechanical Turk process, but it's a lot better than it was. Then we wrote a simple baseline system. I won't go into details because of lack of time, but it's basically a sliding window. 
You take the question and the answer, build a set of words, slide it over the story, look for hits, do a TF-IDF weighted count, and the highest-scoring answer wins, which worked pretty well. But that sort of ignores things outside of the moving window, so then we added a distance score based on the distance in the story from the question words to the answer words. This is the whole thing. It's very simple. And it returns this difference of scores. And this is how it does. It gets about 60 percent correct when random would be 25 percent correct. The MC160 is easier, as we said. Having a grammar test and the other tests actually made the data harder. So here single refers to questions that require only one sentence in the story to answer. Multi requires multiple sentences. And these are the two datasets. And, in fact, really for us everything's a test, because only this was used to actually build a model. So all of this is test and all of this is test. And here W is the window and DS is the distance algorithm. So one of the tests we did is we were worried that somebody would come along -- we're still worried -- and solve the whole thing with some trick. So what about viewing this as a textual entailment problem? Let's use an off-the-shelf textual entailment system that says given text T and hypothesis H, does T entail H? So we can turn the questions and answers into statements like this, using a tool that [indiscernible] had for a different project, question answering on the web, that we could use to do this. And then we select the answer with the highest likelihood of entailment using an off-the-shelf system, which actually turned out not to do as well as the baseline by itself, although combined with the baseline it did slightly better. This may not be fair to the system, BIUTEE. There are reasons it may have failed. It may not be tuned for this task, the question-to-statement conversion is not perfect, and so forth. It's an interesting approach that may get close, but the baseline is pretty solid here. At least the first thing we tried, the first sophisticated thing we tried, was not as good as the baseline. So MCTest is a good resource, and I hope you use it. We will maintain this on the web, not only the data itself, but also people's results on it. So if you want to compare with somebody else's system, you're going to be able to get scores per question, per answer, to see how your system compares with theirs, and you can do paired t-tests and things like that on the data. So we want to keep pretty much everything that people might want to use, for that reason. Okay. I still have five minutes, so I can quickly, then, show you a bit more work we did. I will stop on time. This is work I did with Erin and Andrzej labeling this dataset. We only labeled MC160, but we've labeled it pretty exhaustively. And these labels will also be made available on the web for anybody to use, of course. So obviously labels are useful for machine learning, but they're also useful to isolate parts of the system you want to work on and have everything else work perfectly, to see how that part is doing. And, similarly, it provides, for the same reason, a test bed for experimenting with feedback. So if the coref module and the downstream module are switched on, and everything else is just using labels, we can see how the downstream module feeds back and how that works and all the rest. These are the labels we have now. We seeded the sets with SPLAT, which is MSR's toolkit, and another toolkit we have. 
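Going back for a moment to the sliding-window baseline described above, the core of it can be sketched in a few lines. This is a rough re-creation, not the authors' code: it uses simple whitespace tokenization, inverse-count weights as a stand-in for the TF-IDF weighting, and omits the distance component:

    import math
    from collections import Counter

    def tokenize(text):
        return text.lower().split()

    def inverse_count_weights(story_tokens):
        # Rarer story words count for more; a rough stand-in for TF-IDF weighting.
        counts = Counter(story_tokens)
        return {w: math.log(1.0 + 1.0 / c) for w, c in counts.items()}

    def window_score(story_tokens, target_words, weights):
        # Slide a window the size of the target word set over the story and return
        # the best total weight of matched target words.
        size = max(1, len(target_words))
        best = 0.0
        for i in range(len(story_tokens)):
            window = set(story_tokens[i:i + size])
            best = max(best, sum(weights.get(w, 0.0) for w in window & target_words))
        return best

    def predict(story, question, answers):
        # Pick the answer whose words, combined with the question words, best match
        # some window of the story.
        story_tokens = tokenize(story)
        weights = inverse_count_weights(story_tokens)
        def score(answer):
            return window_score(story_tokens, set(tokenize(question)) | set(tokenize(answer)), weights)
        return max(range(len(answers)), key=lambda i: score(answers[i]))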
For the labels, we've even run coref chains. And, now, I did a lot of this labeling. I learned a lot about labeling this stuff. It was interesting to me. We also have been playing with Wall Street Journal data. So in the Wall Street Journal data -- chief executive officer -- even tagging nouns is tough. The word executive is labeled as a noun about half the time and an adjective about half the time in that -- in WSJ. He went downstairs. Downstairs is an adverb according to the OED. And outside can be a noun, adverb, preposition or adjective. And coref gets even more complicated. But we're going to try not to worry about that, since if we tried to make a perfect coref system we'd be spending all our time just doing the labeling. But here are some of the labeling issues. You have a whole bunch of mentions of Sam and then Peter and then the word "they." Well, that's a coref chain with Peter and Sam, but which Peter and Sam, which tokens? So we just put the two previous mentions in the chain. And then you can have something like Sarah and Tim, which is referred to by "they," but then Sarah by "she" and Tim by "he," so there are mentions and submentions. Same with John's shirt. And then this is a really tough one. I don't know if anybody can really get this. So Bob was Bob the Clown during the day and at night he became Bob the axe murderer [laughter]. Bob the Clown was a nice guy. Bob didn't know which Bob he preferred. Where do we put that Bob? Tough. And here's my favorite. This is just because I'd been looking at number agreement for coref. My family is large, and tonight they are all coming to dinner. The same token is singular here and plural here. So you have to deal with that kind of stuff. So I'm going to wrap up because I'm fairly -- let's just do two minutes. I can do this in two minutes. Okay. So then we also labeled animacy. Again, animacy is tricky. Are dead opera singers animate? Caruso is considered one of the world's best opera singers. He. In that sentence it would be useful to treat him as animate even though he's not alive, right? So we treated animate as at one time animate. Collectives can be animate. The team went out for lunch. Cats like naps. People like sunshine. And some things are just not clear. One boy was named James. Well, James refers to a name, which is not animate, but also refers to James himself, who is. So that's tricky. And then we did a little bit of testing on the data I just showed you. How ambiguous are the nouns? It's 8,000 words. How hard could it be? Well, the number of noun senses in WordNet for those nouns is about four per noun on average, and the word senses are shown here, and these don't include those kinds of part-of-speech ambiguities. So there's definitely a problem. This is a tough problem. Similarly, for verb meanings, how many verb meanings are there? So we identified the verbs using the SPLAT part-of-speech tagger and then we went to Simple Wiktionary and asked how many verb senses there are -- this is a lot more if you use the full Wiktionary. So mostly they had one, but there are things that have 35. So blow up, explode, get bigger, whistle blows, he blew the whistle. These are all different meanings of blow. We couldn't use the full Wiktionary because it just goes over the [indiscernible]. Alligator is a verb. Ash is a verb in full Wiktionary, and that's just going too far for us. All right. Thank you very much. [applause]. >> Host: We have some time for questions. 
>> Chris Burges: I'll just point, so -- >>: You said humans don't forget things when they learn new things. My kids did overgeneralization for a while, learning to speak. They would say went and then later go for a little while before they went back to went. >> Chris Burges: Yes. Fair point. But I think in general when you take an adult human and they learn something new, it's hard to find examples like this. It's true maybe for kids, but people are learning every day all the time while they live, right? And we don't see this phenomenon very much. Certainly it's true in children to a certain extent. That's a good point, yeah. So maybe -- I don't know what that says about the brain. Maybe something interesting. >>: I was just wondering about the fictional content in the stories, whether that's kind of distracting from the task of building accurate models of the real world when you have talking animals and -- >> Chris Burges: Yes. That's a good point. The ball was blue and the ball was happy and all that -- we get stories like this. That's true. It makes it harder because you can't reason about the ball the way you would in the real world, where it's inanimate, yeah. We really wanted this set to be self-contained in the sense that you couldn't go to Wikipedia and get the answer, or the web and get the answer. So fiction kind of solves that problem. And it does raise other problems. In fact, we would like to add non-fiction. So we don't want to stop here. We just wanted to start with something to work with. But we felt that the benefits outweighed the -- there are problems. It makes it a harder problem for sure. >>: So at the beginning of your talk you were kind of asking the question do we have all the bag of tricks we're going to need for machine learning. And so my question for your dataset is, is it the kind of problem that we haven't defined what we need to know from the problem or about the problem to answer these, so we don't have machine learning techniques for an unknown problem or -- >> Chris Burges: Well, so that first part of the talk is just opinion. So I don't know the answer to this question. It's a subtle question. So you're saying we don't even know if what we have is sufficient or not. But we can look at limitations of what we have. And there are some pretty strong limitations of current approaches. So I think it's -- like, for example, interpretability: if I'm talking to you and I clearly misunderstand something you say, you can detect that and correct it with just a few bits of information. So you have some model, you have some interpretation of my error and of what I understand. Teachers do this all the time. But that kind of learning we tend not to have in machine learning models. So it may be that we don't need any of this stuff. I'm told this, right? There are sceptics that say you will not need any of this stuff, just use deep learning and lots of data and see where it goes. And that's actually a perfectly valid approach, I think. But I think it's worth considering this question, because it might be true. So it might be that adding -- coming up with a model which is interrogable, where you can ask it why it believes what it believes and understand it. It sounds rule-based, but -- and do this in a scalable way, which rule-based systems tend not to be. If we can solve these problems, I think -- interpretability I think is very powerful. If you understand why the system is making the mistake it makes and you can correct only that mistake, then that's progress. 
Whereas, if I have to just retrain, I don't know what else is going to break. I don't know if that answers your question. I sort of rambled on a bit. Does that help? >>: Well, so I guess my particular question is, in this new data that you've defined, in some sense we don't know what it's going to take to solve that problem -- >> Chris Burges: Okay. So we have done informal tests on the new data of roughly how many questions need world knowledge like the kind of example I was giving, the ball fell through the table because it was made of paper, that kind of thing. We get about 30 percent. So we need something with world knowledge for sure to really know that we're solving this. So that's not been done yet, right? We don't have a reliable, scalable model. >>: So one way in which this has a flavor of a task that might be gamed slightly is that, as far as I understand it, in every case you actually do have an answer available in the story. Have you thought about augmenting it with distracters where a human would be able to recognize fairly straightforwardly that the answer is not present? >> Chris Burges: Great question. And in fact that's one of the ways we plan on making the dataset harder if it turns out to be too easy. We just randomly remove some of the correct answers and put in imposters in all four and you don't know which is which. So we are thinking about how to make this data harder as we collect it. There's another way which is kind of interesting, and that is to take this baseline and run it in real time as the Mechanical Turkers do their thing. If it's too easy, if the baseline gets it, then have them do another one. Then you're going to make it really tough because you know the baseline can't do it. So I think there are ways to make the data arbitrarily hard. Certainly making -- the step you suggest is a kind of intermediate step. But then there's just write down the answer, right? And that's even harder. But that's hard to measure too, because then you've got paraphrase to solve. >>: So related to that, since 70 percent of the stories were initially around animals, it shows that people aren't very creative in coming up with the stories. So how well do you think you could do without reading the story, just -- >> Chris Burges: It's interesting you ask that question. So we did a crazy experiment. Let me finish with this crazy experiment. We took -- we were intrigued by these vector results, right? People were getting amazing results with vectors. So we threw away the story, we threw away the question -- this is just the first crazy baseline you do -- we took the answers and we assigned a random vector to each word. And -- what did we do? We found that we get 31 percent correct, so not the 25 percent you'd expect, by comparing -- yeah, just by looking at the answers themselves. So the answers themselves are biased. Right? They tend to be longer, the correct answers, and they tend to be more about baseball than cricket, because baseball is more popular. So there are definite biases you can pick up automatically just by training on just the answers. And you have to be careful about that kind of stuff. >>: Question here. So it seems like there might be a slippery slope about there being disjunctions in the sentences or sentences requiring math computations. So how did you curate to make sure that there wasn't something that said John had five balls and Suzie took two, so how many balls were left -- >> Chris Burges: There are rather few of those. 
We did look at all the stories, and the 160 we looked at very carefully, actually. But it's funny you should mention that, because there's one story that was completely insane. It was about people going to the fair, and there was a jar full of jellybeans, and you had to guess how many there were, and somebody took 10, another person took five. But it never said how many were in the jar to start with, and the question was how many were left. Right? So it's an impossible question. So we just had to ditch the story really because -- I mean, it was really -- so there are biases we should worry about, but -- there are a lot of people and animals in the stories, but they're actually surprisingly varied. If you look at the main topic, the main topic wasn't animals in these stories; it was all over the map. There are birthdays and parties and parks and pets quite a bit, but there's a surprising variety of topics there. Does that help answer your question? >>: Yeah. And also about disjunctions, you could have set operations across the whole story which might be difficult to compute and might need to invoke a different sort of module that actually does the set computation. >> Chris Burges: I think that's true. In fact, we're also attacking this data ourselves. I haven't talked about that at all. But we would like to attack it from that point of view of taking these little tiny things like find the events in the story, find the entities, find the animate entities, do the coref, and see how far we get with just breaking it off in little pieces. So that's what we're attacking right now. We're building a platform to do this. >> Host: We have time for one more just so we have time to transition. >>: So at the beginning you talked about the Turing test and how maybe the computer should know that it's a computer instead of trying to -- >> Chris Burges: Right. >>: And so maybe the world knowledge of computers is then in the computer domain. So instead of using stories that talk about animals and gravity and paper, what if the stories were about things the computer knows about, like bits and bytes and processes and operating systems. >> Chris Burges: But how many files do you have in this directory? Stuff like that. Yeah. >>: I mean, would that change it? So how does that really connect -- >> Chris Burges: So you're saying -- that's an interesting question. So it's still natural language, so it's still an interesting, hard task, but in a very limited domain about what the computer knows about itself. What is your memory? >>: And also following instructions. My personal project is about readme files and following the instructions that are in readme files. >> Chris Burges: Having the computer follow instructions? >>: Yes. >> Chris Burges: Yeah, well, that's a wonderful thing to work on. And I think it's pretty hard, too. Yes, that's an interesting point. Thank you. >> Host: Thank you very much. [applause] >> Host: Okay. Let's start. We have two talks in this session. Each one will be 15 minutes followed by five minutes of questions. And the first talk, the title is here, and it's going to be presented by Enamul Hoque. >> Enamul Hoque: Thanks a lot. So I'm going to present a visual text analytic system for exploring blog conversations. This work has been done in our NLP group at UBC with my supervisor Giuseppe Carenini, and it has been accepted at the [indiscernible] conference to be held this year at [indiscernible]. 
So as you all know, in the last few years there has been tremendous growth of online conversations in social media. People talk to each other in different domains such as blogs, forums, and Twitter to express their thoughts and feelings. And if you look at some statistics for a particular domain such as blogs, we can see some very fascinating figures. So as you can see, there are more than 100 million blogs already on the internet, and these figures keep rising exponentially. So now that we have so many blog conversations and other threaded conversations on the internet, we might start wondering whether the current interfaces are sufficient to support the user tasks and requirements. So let's have a look at an example of a blog conversation from Daily Kos, which is a political blog site. This particular blog conversation started by talking about Obamacare health policy, and then it also started talking about some other related topics. And then other people started to make comments about this particular article, and very quickly it turned into a very complex, long thread of conversation. So imagine a new user coming to read this conversation. Having all these comments in a large thread makes it almost impossible to go over each of the comments, trying to understand and get insight into what this conversation is all about and what different opinions and [indiscernible] are being expressed within the conversation. So that leads to the information overload problem. And, as a result, what happens is that the readers or participants of the conversation start to skip the comments, they often generate shorter responses, and eventually they leave the discussion prematurely without fulfilling their information needs. So the question is how can we address this problem better, so that we can support the user tasks in a more effective way. There have been some approaches in the information visualization community where they try to visualize some of the metadata of the conversation; especially in the earlier works, they try to visualize the threaded structure. The idea is that by visualizing the threaded structure, one could possibly navigate through the conversation in some better way. But the question is, does it really tell anything about the actual content of the conversation? That's not the case. So by seeing this visualization, a reader can hardly ever understand what the actual content of the conversation is all about. And so they moved on to using some simple NLP in their visualization research. So, for example, this work in early 2006 tries to use some very simple TF-IDF-based statistics over the timeline, trying to visualize the flow of topics and terms over time. And, similarly, more recently the TIARA system from IBM also tries to visualize a [indiscernible] topic model over time, trying to show the topical evolution so that the user can kind of get some insight about how the conversation evolved over time. But, still, the underlying NLP methods are either too simple, so that they're more error prone, or they're not designed specifically for conversation. So that makes it difficult to get insight. On the other hand, NLP approaches are trying to extract different kinds of content information from the conversation. 
So things like topic modeling try to assign topic labels to the different topical clusters of the conversation. Sentiment analysis tries to mine the sentiment information for different sentences in the conversation. Similarly, we have summarization, which tries to pick the most important sentences or generate abstract sentences. We could have some other relations being extracted, such as [indiscernible] or [indiscernible] relations, but sometimes this NLP output can be too complex to be consumed by the user, and the question is whether the user really wants to see all those different complex outputs of the system. Do they really want to see the speech acts, or do they really want to see the rhetorical relations? So that's why we argue that we need to combine NLP methods and InfoVis together in a synergistic way, so that both techniques can complement each other. So the goal of this work is to combine NLP and InfoVis in a synergistic way, and in particular, to make this happen, we asked the following questions. For a specific domain of conversation such as blogs, which NLP methods should we apply? Then, what metadata of the conversations are actually useful to the user for performing their tasks? And how should this information eventually be visualized so that it can actually be helpful to the user? To answer all these questions, we applied a human-centered design approach, which was proposed by Munzner for InfoVis applications. It's commonly adopted in InfoVis, but there has not been much work in the area of visual text analytics, where the idea is to combine NLP and InfoVis together. So our work is one of the earlier works trying to use this nested model to combine NLP and InfoVis together. So in particular, using this nested model, what we are trying to do is use a systematic approach, going through a set of steps to eventually design a visual text analytics system. So first, taking a domain of conversations such as blogs, we characterize the tasks that the users perform, and then based on that, we pick the data and tasks that are useful to the user, and then we mine those data from the conversation using NLP methods. And, finally, we develop the interactive visualization system for the conversation. So in the next few minutes I'll go over each of these steps. So first, to characterize the domain of blogs, we rely on the literature in different areas such as computer-mediated conversation, social media, human-computer interaction, and information retrieval. The question that we ask when we review this literature is why and how people actually read blogs. And by asking this question, we get results from the literature. For instance, people read blogs because they want to seek more information, they want to check some facts against traditional media, or they might want to keep track of arguments and evidence, and maybe they just want to get some fun out of it. Similarly, we also analyze how they want to read blogs, and then using all this information, we decide what tasks are usually performed by blog readers and what kind of data we should use to support those tasks. And so in the next step we came up with a set of tasks based on the domain analysis that we did in the previous step. So these are some of the tasks that a user might typically want to perform when they read blogs. 
For example, they might want to ask questions about what this conversation is about, which topic is generating more discussion, how controversial the conversation was overall, and whether there were a lot of differences in opinion. So once we have this list of questions, and each question represents a task, the next step is to identify what data variables are actually involved in these tasks. So to do so, we identify a set of data variables that are involved with these tasks. Some of them could come from NLP methods. For instance, topic modeling and opinion mining could help with some of those tasks, while we could also use some other metadata of the conversation such as author, thread, and comment. So once we identify those data variables, the next step is to mine the data from the conversation using NLP methods. Specifically, we use topic modeling and sentiment analysis. In the case of topic modeling, we use a method that has been developed specifically for dealing with conversational data. So in this case, to take advantage of the conversational structure, we take the reply relationships between comments and turn them into a graph called the fragment quotation graph, where each node represents a fragment and each edge represents a replying relationship between fragments. The intuition behind this fragment quotation graph is that if one fragment is replying to another fragment, then there is a high probability that the replying fragment is talking about the same topic as its parent. So basically from this we can inform the topic modeling process. So later on, using this fragment quotation graph, we first do topic segmentation, where we basically cluster the whole conversation into different topical clusters. In this step we first apply lexical cohesion-based segmentation on each of the paths of the conversation. So we basically treat each path of the conversation as a separate conversation. And then we run this lexical cohesion-based segmentation, which is commonly used in meeting and other corpora, to get the topical segments. And then once we have those segmentation decisions, we need to consolidate them for the whole conversation, and for that purpose we use a graph-based technique. So first we represent the segmentation decisions as a graph, where each node represents a sentence and each edge represents how many times two sentences actually co-occur in the same segment while running this lexical cohesion segmentation. Then a normalized-cut clustering technique clusters the sentences into a set of topical clusters, and the next step is just to assign representative key phrases to each topical cluster. So we apply a syntactic filter that only picks nouns and adjectives for each topical cluster, and then we use a co-rank method that, again, takes advantage of the [indiscernible] structure by giving more boost to those key words that come from the leading sentences of each topic. So the initial sentences of each topic are more important than the rest of the sentences. That's kind of the intuition behind the ranking method. So at the end of this step we have a set of topical clusters, and each topical cluster has been assigned a set of key phrases. And then we also did the sentiment analysis using the SO-CAL system. 
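To make the fragment quotation graph a little more concrete, here is a simplified sketch of building the graph and enumerating the paths that get segmented separately. The networkx library and the data layout are assumed for illustration -- this is not the authors' implementation -- and the cohesion-based segmentation and normalized-cut clustering that follow are omitted:

    import networkx as nx

    def build_fragment_graph(fragments, reply_pairs):
        # fragments: {frag_id: text}; reply_pairs: (parent_id, child_id) tuples meaning
        # the child fragment replies to (or quotes) the parent fragment.
        g = nx.DiGraph()
        g.add_nodes_from(fragments)
        g.add_edges_from(reply_pairs)
        return g

    def conversation_paths(g):
        # Each root-to-leaf path through the reply structure is treated as a separate
        # mini-conversation, to be segmented on its own by lexical cohesion.
        roots = [n for n in g if g.in_degree(n) == 0]
        leaves = [n for n in g if g.out_degree(n) == 0]
        paths = []
        for r in roots:
            for l in leaves:
                if r == l:
                    paths.append([r])
                else:
                    paths.extend(nx.all_simple_paths(g, r, l))
        return paths

Each path's sentences would then go through the cohesion-based segmentation and the normalized-cut clustering described above.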
So SO-CAL is a lexicon-based approach, where we use the program to find the sentiment orientation and polarity of each sentence, and then for each comment of the conversation we compute the polarity distribution, that is, how many sentences fall into each of the polarity intervals. Once we have done that, we design the visualization, and in this process we actually start with paper prototyping. We do a lot of prototyping, trying to build in the principles that we want to promote in the [indiscernible]. For instance, we want to promote multi-faceted exploration based on different facets such as topics, opinions, and authors, and we also want to use some lightweight interactive features that can help the user with these exploration activities. So this is the final interface that we came up with; I'm going to show a demo right now. In this interface, in the middle, we have the thread overview, which shows the parent-child relationships between comments and also shows the sentiment analysis results. Arranged circularly around this thread overview we have the topics and authors, which support exploration using facets. So now I'm going to show a quick demo of this interface. This is what the interface looks like. The user can quickly see the structure of the thread, they can quickly look at the topics and understand the evolution of topics over the conversation, and if they're interested in a particular topic they can hover the mouse over it, which highlights the corresponding comments as well as the authors related to that topic. Then if they're interested in a particular topic, they can click on it, and that puts bars on each of the related comments. The next step would be to go over each comment, and as a result the system scrolls down to the actual comment on the right side, so they can quickly browse through each comment from the overview interface. Similarly, they can also go through the detailed view, and that also mirrors the corresponding operations in the overview. As you can see, as the user scrolls down, the corresponding comment is also highlighted in the overview, which gives them an idea of where they currently are and what topic they are currently in. So that's how the interface looks. I'm going to wrap up. We recently did an informal evaluation, and using this informal evaluation we are trying to understand how people actually use this interface and how it helps them. In the future we want to incorporate an interactive feedback loop from the user, so that they can give feedback to the actual topic model; they can say that this topic is actually too broad, so I want to make it more [indiscernible], versus these topics are too specific, so I want to merge them together. Our idea is to give more control back to the user, so that they have more control over the topic model and the text analysis system. That's all. [applause]. >> Host: Questions? >>: Sorry. When you did the human evaluation, perhaps if you could go back one slide, what were you asking the bloggers to evaluate? Did they have a particular task that they needed to accomplish using your -- >> Enamul Hoque: Yes.
The idea was that instead of giving some very specific task, which might be more synthetic, because these tasks are more exploratory in nature and there is no specific aim the user has before they read the blog, we just asked them to read the blog according to their own needs, reading the comments that they liked, and then after reading the conversation they would try to summarize it. From there we also collected interaction log data recording what kinds of interface actions they made, and based on that we analyzed the output of the evaluation. >>: Did the bloggers use different ways of getting through this information or did you see all of them using -- going from the topics -- >> Enamul Hoque: We identified two different strategies for exploration. Some of the bloggers would use the visual overview. More specifically, they would click on topics and then go to the thread overview and click on specific comments and read those particular comments. On the other hand, the other group of users used the topics less frequently and would mostly go over the detailed view, very quickly skimming through the comments, but at the same time they would maintain a situational awareness of what was going on in the overview. So they would coordinate between the two views in such a way that, while they were reading in the detailed view, they would understand what topic they were currently reading and what position they were currently at in the thread overview. That allowed them to read some of the comments that were buried near the end of the conversation. In this way, we saw that compared to a traditional system, where usually people start reading the top few comments and then totally get lost and quit at some point, using this interface they can find and reach some of the interesting comments that were buried near the end. So I guess that's making a difference in terms of where they explore the comments. >> Host: We have time for one quick question. >>: What is that bar underneath the user? >> Enamul Hoque: These bars? >>: Yes. >> Enamul Hoque: This is just mirroring the same thing from here. It represents the same information for each comment. We define five different polarity intervals, and it's shown as a stacked bar. This stacked bar is the same as the one here; we just mirrored the corresponding stacked bar. >>: Just for that comment? >> Enamul Hoque: Yeah, just for that comment. >>: Not whether the user is generally a negative person? >> Enamul Hoque: Just for that comment. So this is for each comment. >> Host: Let's thank the speaker again. [applause] >> Ramtin Mehdizadeh: Good afternoon, everyone. My name is Ramtin Mehdizadeh, and I am a graduate student at [indiscernible] Lab. Today I'm going to talk about graph propagation for paraphrasing out-of-vocabulary words in SMT. I'm extending this work for my research; it was done by my colleagues and presented at ACL 2013. One of the big challenges in statistical machine translation is out-of-vocabulary words, words that are missing from the training set. This problem becomes severe when we have a small amount of training data or when the test set is in a different domain than the training set. Even a noisy translation of OOVs helps the reordering. For example, consider this Spanish sentence.
If we pass this sentence to an SMT system trained on the [indiscernible] corpus, we will get a sentence similar to this output. We can see that there are some words here, like this word, that are not named entities or dates or numbers, but are still OOV. We want to use a monolingual corpus on the source side to create a graph and find, for each OOV, paraphrases for which we do have translations. Let's look at our framework. First we use our training set to create a translation model, and by using this translation model we extract the OOVs from our test set. Using our source-side graph, which I will explain later, we extract some translations for these OOV words, and then we integrate these translations into our original translation model. Each node in our graph is a phrase. We create the graph based on the distributional hypothesis: we use context vectors to create the graph's nodes, and then we use paraphrase relationships to connect these nodes in the graph. For creating such a huge graph, we need to take advantage of Hadoop MapReduce, and for obtaining translations for these OOVs we use graph propagation; our graph propagation algorithm here is modified adsorption. For each node in our graph we consider its distributional profile, and we create this distributional profile from context vectors over the whole corpus. We need an association measure that shows how related a word is to the other words in the vocabulary. For example, consider co-occurrence frequency as the association measure. This example shows how we can create the distributional profile for a token, which is shown as a red point here. For each occurrence of this word in our monolingual text, we consider a fixed-size window and count the words in the neighborhood. By using point-wise mutual information, we can find the correlation between the neighborhood words and our node word; a positive value in this formulation shows co-occurrence above the value expected under an independence assumption. Now each node has a distributional profile in a high-dimensional space. How can we measure the similarity between two nodes? Suppose we have just two dimensions: dimension one is the word car and dimension two is the word cat. For two nodes, park and tail, we can use the cosine coefficient to find the similarity between these two words. In our graph we have some labeled nodes; the labels here correspond to translations from the phrase table. We also have some out-of-vocabulary nodes. Previous work used just these two types of nodes to find translations for OOVs. In our work, for the first time, we also use other unlabeled nodes which did not appear in our test set. These nodes can act as a bridge to transfer translations to OOV words that are not connected directly to labeled nodes. Here is the objective function of modified adsorption. The first term ensures that for seed nodes we keep a distribution over labels as similar as possible to the original distribution, the initial values; the second term ensures that nodes that are strongly connected have similar distributions over labels; and the third term is a regularizer that injects a prior belief into the graph, as you can see. Now we want to add the new translations into our phrase table. We introduce a new feature and set it to 1 for the original translations in the phrase table.
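As an aside, here is a small, self-contained Python sketch of the distributional profiles just described: context counts within a fixed-size window, weighted by point-wise mutual information, and compared with the cosine coefficient. The real system builds these profiles over a large monolingual corpus with Hadoop MapReduce; this in-memory version, with its default window size and the choice to keep only positive PMI values, is illustrative rather than a reconstruction of the authors' code.

import math
from collections import Counter, defaultdict

def build_profiles(sentences, window=6):
    # Context vector ("distributional profile") for every token type:
    # count context words within a fixed-size window, then weight each
    # count by PMI. Keeping only positive PMI values is an assumption here.
    cooc = defaultdict(Counter)   # target word -> context word counts
    word_count = Counter()
    total_pairs = 0
    for sent in sentences:
        toks = sent.lower().split()
        word_count.update(toks)
        for i, w in enumerate(toks):
            lo, hi = max(0, i - window), min(len(toks), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    cooc[w][toks[j]] += 1
                    total_pairs += 1
    total_words = sum(word_count.values())
    profiles = {}
    for w, ctx in cooc.items():
        vec = {}
        for c, n in ctx.items():
            pmi = math.log((n / total_pairs) /
                           ((word_count[w] / total_words) * (word_count[c] / total_words)))
            if pmi > 0:
                vec[c] = pmi
        profiles[w] = vec
    return profiles

def cosine(u, v):
    # Cosine coefficient between two sparse profiles.
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

Something like cosine(profiles["park"], profiles["tail"]) would then play the role of the similarity that weights an edge between two phrase nodes in the paraphrase graph.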
And -- excuse me -- then we add our new translations by setting the other features to 1 and setting this new feature to the probability obtained from graph propagation. For our experiments, we selected the French-English task. Using the French side of [indiscernible] we construct our graph. We consider two different portions of the text, first to check whether our method works with a small amount of parallel data, and we also consider this training set to see whether we can improve the results when there is a domain shift. For the test and dev sets we used WMT 2005, and we showed that, even if we remove named entities, dates, and numbers, 3.2 percent of the tokens from the [indiscernible] are still OOVs with respect to Europarl. And as you can see, this value is higher here because we have a domain shift. For evaluation, the first type was intrinsic evaluation. We don't have any gold labels for OOVs, so we concatenate the dev set and test set to our training set, run a word aligner, and then extract all the aligned words on the target side as a gold standard and add them to a list. Since our gold standard is noisy, we use MRR as a natural choice to compare these two lists, and also recall as a measure. This graph shows the impact of the size of the monolingual text, where 1x corresponds to using 125k sentences from Europarl. As you can see, there is a linear correlation between the logarithm of the text size and MRR and recall. We also considered different types of graphs. The bipartite one just uses labeled nodes and out-of-vocabulary nodes. The tripartite one connects out-of-vocabulary words to unlabeled nodes and then uses them as a bridge to labeled nodes. And the full graph just connects all nodes together. As you can see, there is a drop when using the full graph. There are some reasons behind this: we can say that a paraphrase of a paraphrase of a paraphrase of an OOV is not necessarily a good paraphrase of the OOV. So this is the reason we think the full graph does not work. We also considered two types of nodes. First we used unigrams and then we used bigrams; as you can see, bigrams have a better MRR. Here's a real example of our work. For this OOV, the gold standard is approval, and this is the candidate list generated by our graph propagation method. But how much can our method affect a real SMT system? For the extrinsic results we considered BLEU scores, and we showed that we can improve the BLEU score statistically significantly by using this type of graph propagation. So we provide a new method to use monolingual text on the source side for SMT, and we also showed that we get an improvement when we have a small amount of bi-text or when we have domain adaptation. For the future, we want to use metric learning for constructing the graph, and also locality sensitive hashing and SMT integration. Thanks a lot for your attention. [applause]. >>: Thank you for your talk. I have two questions, sort of related. The first one is about the edges in your graph. Are they weighted or not? So when you build the graph -- >> Ramtin Mehdizadeh: Connected or not? >>: No, are they weighted? >> Ramtin Mehdizadeh: Yeah, they are weighted. The weight is based on how likely they are to be paraphrases of each other. >>: Okay. >> Ramtin Mehdizadeh: And we calculate this by using this coefficient measure -- let me show you -- the cosine coefficient measure between two [indiscernible].
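For reference, here is a minimal Python sketch of the two intrinsic measures mentioned above: mean reciprocal rank and recall of the ranked candidate lists against the noisy, alignment-derived gold lists. The data structures and the particular reading of "recall" (at least one gold item retrieved per OOV) are assumptions for illustration, not the authors' definitions.

def mrr(ranked_candidates, gold):
    # ranked_candidates: {oov: [candidates, best first]}; gold: {oov: set of gold items}.
    # Mean over OOVs of 1/rank of the first gold item found (0 if none found).
    scores = []
    for oov, cands in ranked_candidates.items():
        score = 0.0
        for rank, c in enumerate(cands, start=1):
            if c in gold.get(oov, set()):
                score = 1.0 / rank
                break
        scores.append(score)
    return sum(scores) / len(scores) if scores else 0.0

def recall(ranked_candidates, gold):
    # Fraction of OOVs for which at least one gold item appears in the candidate list.
    hits = sum(1 for oov, cands in ranked_candidates.items()
               if gold.get(oov, set()) & set(cands))
    return hits / len(ranked_candidates) if ranked_candidates else 0.0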
>>: And the second question I have is that at some point in your slides you mentioned the fixed window that you use to compute your similarity. And you said that you fixed it to three, right? >> Ramtin Mehdizadeh: No, the window size for finding the neighbors is not fixed to three. We used six in our experiments. >>: And did you test different window sizes? >> Ramtin Mehdizadeh: Yeah. Actually, we have the results for that. >>: Okay. >> Ramtin Mehdizadeh: I think. Let me -- these are the experiments on different window sizes and graphs. >>: Okay. And then you chose six for your final result, right? The result that you showed? >> Ramtin Mehdizadeh: Here, no. In this experiment we selected four. Yeah. But later we tried six. >>: Okay. Thank you. >> Host: We still have time for questions. >>: [inaudible] what did you put in the cells? How do you fill the cells of the vectors? >> Ramtin Mehdizadeh: Okay. The similarity measure here -- we used point-wise mutual information, and we used the distributional profile, if I understood the question. The question was that, or -- >>: So you used PMI? >> Ramtin Mehdizadeh: Yeah, PMI. We experimented with different types of similarity -- sorry, I should not go there. We experimented with different types of association measures and similarities, and using point-wise mutual information as the association measure and the cosine coefficient as the similarity gave the best results. >>: Do you have longer phrases [inaudible]? >> Ramtin Mehdizadeh: We tried unigrams and bigrams. Now we are trying to extend it to longer phrases. >>: [inaudible] >> Ramtin Mehdizadeh: Yeah. >>: How do you decide to expand during the graph construction? How do you decide whether to keep expanding the neighbors? >> Ramtin Mehdizadeh: In our experiments, we considered how much -- when we expand this, we face some computational problems. So we just chose what gives good results without needing too much computation. >>: So what is the criterion you use to decide to keep expanding the graph? >> Ramtin Mehdizadeh: Sorry? I didn't -- >>: What is the criterion used to expand the graph or these nodes, or stop expanding, and so forth? Because you can't keep expanding the neighbors. >> Ramtin Mehdizadeh: We actually consider the five nearest neighbors for each node, because more than that was not practical; the graph would become very huge, and the graph propagation is not very cheap. >>: Thank you. >> Host: Any more questions? We have only one minute left. Okay. If not, let's thank both speakers. [applause]