>>: It's an honor today to have Dafna Shahaf with us. Dafna is finishing up
her Ph.D. at Carnegie Mellon University, working with Carlos Guestrin and
[indiscernible]. She did her Bachelor's work at Tel Aviv University, her
Master's at UIUC, and she has many interests and directions, but ended up
focusing in her dissertation work on methods to help us deal with large
amounts of information, to visualize it and understand it, going beyond lists
of search results, for example, and navigating the web and deriving information
from the web to richer stories, fabrics and maps.
And she has won best paper awards, including KDD 2010, has been a Microsoft
Research fellow, so we're proud to have our label on her forehead, I guess,
and she is also a Siebel scholar. So I'll just turn it over to Dafna, who is
talking about Trains of Thought: Generating Information Maps.
>> Dafna Shahaf: Thank you. Okay. First of all, can you all hear me? In the
back? Excellent. So my name is Dafna Shahaf. I'm at Carnegie Mellon, and
I'm indeed going to talk to you about my recent research project, which is
called Trains of Thought: Generating Information Maps.
So what's this project about? Let me start with my new favorite quote: the
abundance of books is a distraction. This was said by Seneca, who lived in the
first century.
Now, a lot of things have changed since the first century, but Seneca's
[indiscernible] has only gotten worse. And you've all seen the numbers. Here
are just some of them. First is the Google estimate of the number of books
out there, and I have no idea how they got that specific resolution.
Also, the number of blogs skyrockets. And even if you just look at scientific
publications, PubMed has 19 million papers and is adding one by the minute.
Scopus has twice as many.
Now, I was browsing the internet to look for figures to use in the slide, one
of those exponential growth figures. You've all seen them. I came across this
one, and I just had to share it with you. Okay. So this is a paper from the
'80s. The X axis is a timeline. The solid line is the number of papers about
some topic that they found interesting. But what I really like about this
figure is the dashed line. The dashed line is what they call innovative
papers. I think what they're trying to say here, politely, is that the number
of papers grows exponentially; the number of papers worth reading, not so much.
Okay.
So hopefully you're convinced that there is a lot of data out there.
So, suppose you want to learn some complex topic. It might be news, like
covering the financial crisis in Europe. Or it might be some research area
that you want to start looking into. So where do we go? Most people I know,
their answer would be: we go to a search engine, right? We all love search
engines. Search engines are really great at retrieving nuggets of knowledge,
but they don't often show you how those, what, 30 million results fit together,
what's the big picture.
And there have been some systems in the past trying to summarize and visualize
complex topics, for example NewsJunkie, which came out of here. But usually
they try to construct a story line or a timeline, and I'm going to claim that
this style of summarization works only for really simple stories that are
linear by nature.
Real stories are not linear. They spaghetti into branches and side stories
and intertwining narratives. If you just think about research: if I had to
come up with a picture of what research is like, it wouldn't be a line. It
would be more like something like this. Come on, I'm sure you all know the
feeling, right? One day I'm going to make this screen turn on, and then I'm
going to get my dissertation.
Anyway, so you're dealing with this type of messy tangle. So what do you do?
Let me show you the inspiration, the holy grail. How many of you have seen
issue maps before? Okay. I hadn't seen them until not too long ago. So this
is an issue map. It's a set of seven posters. I first stumbled upon them in a
corridor at CMU, and this one specifically charts the big old [indiscernible]
debate: can computers think? Okay. Let me just zoom in, and zoom in some more.
You see, each node in this graph is an argument, like machines can have
emotions, and you can follow a path and see this argument is supported by that
argument, or it's disputed by that argument. And you're supposed to just sit
there and read it beginning to end and understand the big picture.
When I first stumbled upon it, I was fascinated. It's a topic I liked earlier,
but reading it made it all fit nicely in my brain, okay. Then I started
reading a bit about how they made it, and it turned out they needed about 20
man-years to generate them. So the next question was, hey, can we actually
build it [indiscernible]? Just have one issue map for every query you want.
Wouldn't that be nice?
So I don't know how to [indiscernible], but in this talk I'm going to talk
about the first few steps we took in this direction. And this system is
called Metro Maps. Okay. And the idea is, again, you start with a query, like
Greece debt crisis, and you get what looks like a metro map, where each line
tells a coherent story and different lines are about different aspects, and
you see how they intersect and overlap.
Okay. So this is a very, very simplified financial-crisis-in-Europe map. You
can see the blue line tells the story of how Greece's status was reduced to
junk and they had to come up with the austerity plan to get the bailout. The
red line is how they had all those protests and strikes because of those
austerity plans, and you see how both lines intersect at an article about
austerity plans.
Clear?

>>: Sort of?

>> Dafna Shahaf: Great. So how do we do that? Yes?

>>: Is this simplistic? For example, protests?
>> Dafna Shahaf: This is just for you. Each node actually corresponds to an
article. I just couldn't fit the titles, but you'll see the real map later in
the talk, okay? Each node here is an entire article. Yes?
>>: So is the linear structure alignment, or is it just a set up?

>> Dafna Shahaf: Is the linear structure of what?

>>: The blue line goes to [indiscernible] generally, but not in that order.

>> Dafna Shahaf: It's chronological.

>>: Okay. What about the red line?

>> Dafna Shahaf: The red line is supposed to be slanted, but --
I need to ask the Greek people. Okay. Anything else? Again, you'll see a map
later and it's all going to be much nicer.
So how do we do this? Now, maps are complex creatures, so we start with a
simple problem of how do you construct a single metro line? What makes a good
line? Then we will move to maps for news, and finally I'll tell you how to
adapt it to the scientific domain.
Okay. So I'll start with lines and we tackled this in a KDD 2010 paper called
Connect the Dots, where we made our life even simpler by assuming that we know
the end points. Okay. So here's the situation. You want to know about
financial crisis, this time in the U.S., so you pick two articles: your start
and your goal. And the idea is that you vaguely remember it had something to
do with the housing crisis and the bailout.
Okay. So this is the input of the system. The output is a smooth chain of
articles that bridges the gap between them. For example, the output might
look like this. So you have a chain telling you that people borrowed money
from the bank to pay for houses. And the mortgage crisis begins to spiral
because banks rely on debt too much. Investors want the Congress to react.
The bailout plan starts rolling, and finally, bailout. This is the type of
output we're looking for. Fair enough?
Okay. So how do you go about finding such a good chain? When I ask people
this, almost always their reaction is: just the shortest path. It's not a real
problem, right? Just build a graph, add a node for each article and edges
based on your favorite similarity metric, and just find a shortest path between
them, or a bottleneck path, or your favorite.
So why isn't it good enough? Let me show what happens when you actually do
this. So we tried to combine those two articles, one about the Monica Lewinsky
story and another about the Florida election recount. Everybody familiar with
those stories? Remember? Sorry, the data is a bit old.
So let me show you what happens when you try to connect those two with
shortest path. And let me just show the important parts here. You don't need
to read it. The important part is that this chain is rather erratic, okay. It
goes from Monica Lewinsky to Microsoft, to Palestinians, to Florida. Doesn't
make any sense.

But if you look at each transition in context, it starts to make sense.
Because the first two documents are about trials. They share a lot of
vocabulary words: judges and lawyers and juror terminology.
The next two are about Microsoft. By the way, it's not for you; I've been
doing those slides for the last two years. It's in the paper from 2010. So
the point is that what you get is this stream-of-consciousness effect, where
each transition makes sense out of context, but the overall effect is --
there's no global [indiscernible] throughout it.
Okay. Now what would you like the chain to look like? Same two articles.
Ideally, the chain would look something like this. Clinton admits the story.
He's about to be impeached. He's impeached. He's acquitted. Al Gore starts
his campaign and tries to break away from Clinton, because it's a messy thing.
Election draws near and finally, election and recount.
So hopefully you all agree that this chain is better. But why? And it looks
like what we're looking for, really, is what we call coherence, okay. This
chain is more coherent than the other. So that's the property we're after.
But, of course, that's just a different problem. Now instead of looking for a
good chain, I'm looking for a coherent chain.
How do you define coherence? Let me just give you an overview of this talk
and my work in general. A huge chunk of my work is just formulating, crafting
objective functions. Okay. Formulating all those terms I've been using. What
does coherence even mean?
So after you find an objective you're happy with, you need to come up with an
algorithm to optimize it, to find good chains. And finally, I need to convince
you that it works. Okay?
So, crafting an objective. I decided, in order to see why this second chain
was better than the first one, to look into word patterns. What you see here,
this is the shortest path chain, and the bars correspond to whether the word
on the left appears in the article above it. Okay. So Clinton appears at the
beginning and the end.
And you see this stair-like behavior here? Okay. This means that the topic
changes with every transition. Every two documents are related because of a
different set of words. Now, compare it to the second chain, and you see that
everything is smoother and nicer. The transitions, everything is longer.
Clinton is there beginning to end. Lewinsky is there almost everywhere, and Al
Gore starts showing up and keeps going for a while. Everything is smoother
and nicer and consistent.
So we decided to use this as our intuition when we define coherence. Step
back, quick list of desiderata: coherent chains are going to have strong
transitions -- we're not giving that up -- but also something global running
throughout. Okay. So let's start with strong transitions. How do you do
that?
And perhaps the most naive way to start thinking about it is to just say,
well, for every transition, just count the number of words that the two
articles share. That is, a sum over words w of an indicator function: does
word w appear in both document d_i and d_{i+1}? Yeah, that's the face I was
making when I first did this.

So if you want to score the coherence of the entire chain, you take the
minimum, right, because the chain is only as strong as its weakest link.
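Written down, the naive objective she is describing would be something like
this (notation mine, reconstructed from the spoken description):

    \mathrm{Coherence}(d_1, \dots, d_n) \;=\; \min_{i=1,\dots,n-1} \; \sum_{w} \mathbf{1}\big[\, w \in d_i \cap d_{i+1} \,\big]

Score each transition by the number of words the two articles share, and score
the chain by its weakest transition.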
Now, this really doesn't work. And why? Just take a look at this indicator
function, and I'm going to claim that it's way, way too coarse. It completely
ignores the importance of different words, based on [indiscernible] level and
transition. And also, it misses words. Because, again, if the document has
the words judge and jury but not lawyer, then lawyer is really implicitly
there.
Okay. So we had to replace it with something softer, which we call the
influence of document d_i on document d_{i+1} with respect to word w. What do
I mean by that? Just intuitively, you want to think of the influence as high
if the two documents are related and w plays an important role in what makes
them related, even if it doesn't appear in either of them. You with me? I
think I lost some of you.
>>: [inaudible].

>> Dafna Shahaf: What's the what?

>>: [inaudible].

>> Dafna Shahaf: Yes, I will get to this, okay, in the next slide. So, just
to tell you, there have been a lot of influence notions in the literature, but
they usually assume that there are some edges, and -- yeah.
>>: [indiscernible].

>> Dafna Shahaf: Became what?

>>: Symmetric.

>> Dafna Shahaf: No, not symmetric.

>>: But now it is [indiscernible].

>> Dafna Shahaf: Yeah.

>>: Because it's [indiscernible].

>> Dafna Shahaf: Oh, but I only compute it in chronological order. So for
d_i, I will not -- I will not even compute influence if it's after d_{i+1}.

>>: I see.

>> Dafna Shahaf: All chains go forward in time.

>>: Okay.
>> Dafna Shahaf: Okay. The way we computed it, not [indiscernible], but I'm
not going to go into this. Yes?

>>: [indiscernible] document, there's no order in the document?

>> Dafna Shahaf: In the document, no. A document is a unit, okay? Okay. So
what was I saying? Oh, that we don't have a single edge in our data set. So
we had to come up with our own notion of influence, which I'm not going to get
into, but it basically uses word co-occurrence, okay, to achieve those two
properties. Yeah?
>>: So when you say the W times [indiscernible] --

>> Dafna Shahaf: Yeah, sure. So what I was really doing, I was constructing a
bipartite graph between words and documents: you connect a word to a document
if it appears in it, and put some [indiscernible] on the edges. And I was
looking at, when you want to go from d_i to d_{i+1}, you zigzag on this
bipartite graph -- how often do you need to go through w? Okay. Enough? If
you want, I can talk about it for a whole lot more. I knew I should have kept
backup slides.
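To make that zigzag picture concrete, here is a minimal Python sketch of one
way to read it: estimate, by simulating random walks on the word-document
bipartite graph, how often a walk from one article to the next passes through
a given word. The unweighted edges and the Monte Carlo estimate are my
simplifications; the actual computation in the paper is different, and she
skips it here on purpose.

    import random
    from collections import defaultdict

    def build_bipartite(docs):
        """docs: dict doc_id -> iterable of words. Nodes are ('doc', id) and
        ('word', w); a word is linked to every document it appears in."""
        adj = defaultdict(list)
        for d, words in docs.items():
            for w in set(words):
                adj[('doc', d)].append(('word', w))
                adj[('word', w)].append(('doc', d))
        return adj

    def influence(adj, d_i, d_j, w, n_walks=5000, max_steps=20):
        """One reading of 'how often do you need to go through w': among
        random walks that make it from d_i to d_j, the fraction that
        visited the word w along the way."""
        reached = through_w = 0
        for _ in range(n_walks):
            node, saw_w = ('doc', d_i), False
            for _ in range(max_steps):
                node = random.choice(adj[node])
                if node == ('word', w):
                    saw_w = True
                elif node == ('doc', d_j):
                    reached += 1
                    through_w += saw_w
                    break
        return through_w / reached if reached else 0.0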
Anyway, what [indiscernible] said was correct. Okay. We have strong
transitions, but remember what I started out telling you: you have to have
this global thing, smooth and nice transitions. And the thing is, with our
[indiscernible] objective, [indiscernible] chains and shortest path chains can
score really well. But the important thing to notice is that they need a
whole lot more words, you know, to get that score.
Good chains can usually be represented by a much smaller number of words.
Okay. Let's play a game. Suppose I tell you that I allow you to choose only
three segments, okay? This is a segment. And I'm going to pretend that these
are the only words that appear in the documents, and then compute the score.
Okay. Like, these are the only words. What do you do? So for the chain on
the right, you can go with, for example, Lewinsky, impeachment and Gore and
still get a really good score. For the chain on the left, however, there are
no three segments that will give you a good score, precisely because each
transition uses a different set of words.
>>: Do you really mean segments rather than words? The fact that Lewinsky is
missing for one means you can't --

>> Dafna Shahaf: I do mean segments, because you might have, like, documents
one and two related because of a word, and also three and four, but not two
and three, so you have this zigzagging pattern. So we tried actually both
ways; this works better.
So instead of, like we did before, taking all words into account, now we only
take active words into account -- the segments that we picked -- and we turn
this into an optimization problem: which segments would you pick to get the
best score?

And we have some constraints on these activations, because we want to simulate
the behavior of good chains. Like, you can't choose too many words. You
can't choose too many words per transition. Or, like you just said, you can't
have words zigzagging, turning on and off and on and off.
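One plausible way to write that down (again, notation mine; what matters for
the next step is that the constraints can be expressed linearly):

    \mathrm{Coherence}(d_1, \dots, d_n) \;=\; \max_{\text{activations } a_{w,i}} \; \min_{i} \; \sum_{w} \mathrm{infl}(d_i, d_{i+1} \mid w)\, a_{w,i}

subject to a small budget on the number of active words, overall and per
transition, and to each word's active transitions forming one contiguous
segment -- no zigzagging.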
Okay. You with me? Okay. So we finally have a coherence definition. I told
you it would be a big chunk. Now, regarding the algorithm, how to actually
find a good chain: the bad news is that it's NP-hard. You all guessed it.
The good news is that if you don't care about having binary activations -- if
you're okay with, like, choosing 0.5 of this word over here -- then it has a
very natural formulation as a linear program. A linear program, an LP. Yes?
>>: Speaking of words, suppose the word is Clinton, but the article is really
about Hillary. That's one thing. And another thing is, suppose both articles
are about Hillary, but Hillary is Clinton, and here's how and why.

>> Dafna Shahaf: I would like to see --

>>: And the context is clear whose wife she is.
>> Dafna Shahaf: Yeah, you're talking about NLP problems, and there are people
who spend their entire research career on this disambiguation thing. So part
of the point of this paper was to see how far you can get using just the most
basic features. Just words. See if it's good enough before you start
throwing in the big cannon, like what you said about disambiguation and
coreference and wives. And just words can do quite well. We also tried it
with some more interesting features, like topic models. But hey, if it works.
Okay? Does that answer your question? Good.
So we have an LP and we have a rounding scheme, and I'm just going to tell you
that we have some approximation guarantees: in expectation we can control the
length of the chain we want to get, and [indiscernible], we can tell
essentially how close we are to the optimal chain, okay?
Next thing, I need to convince you that it works. So the sad part: we can't
do the standard things. We don't have ground truth. We don't have a gold
standard. I can't throw up those nice precision-recall curves. So we had to
do user studies: let people use our chains and competitor chains and see what
they like best.
So these are our competitors. We have shortest paths, after I spent ten
minutes or so bashing them. We have Google News Timeline, and we have a
system called Event Threading, or [indiscernible]. And just to give you an
idea, this is one of the chains we showed the users. We're trying to connect
the O.J. Simpson trial to the verdict. And this is the chain: Simpson
strategy, there are several killers. Book deal controversy. April
transcripts. And something completely unrelated, about the Tandoori murder
case. This is one chain we got from Google News Timeline.
Second chain, same articles. Issue of racism erupts in the Simpson trial.
L.A. police have some racial tensions. More about L.A. police, and finally,
lawyers trying to use it in order to get an acquittal, and the verdict. Okay.
So this was one of our chains.
Now, we've run several user studies. I'm just going to talk to you about one
of them. The data is The New York Times -- this is an old one, from 1995 to
2003. We had 18 users. We chose five prominent news stories, like the O.J.
Simpson trial. And we showed them two chains, generated by different methods,
double blind. And the first thing we asked them is just which one is more
coherent, because we wanted to see if we captured their notion of coherence.

Also, despite the fact that we're not directly optimizing for it, we also
asked which one is more redundant and which one is more relevant. And let me
show you the results. The Y axis here is the fraction of times [indiscernible]
preferred to the other. People could say two things are the same, so it
doesn't have to sum to 100 percent.
And the first thing is coherence. The only conclusion we really get out of it
is that we're doing better, which is the entire point of this paper, okay. So
we were happy.
Now, things look a bit less good when we look at redundancy. But then we
looked at relevance, and it all started to make sense. If you think about it,
there's a very clear trade-off between relevance and redundancy. If you want
to remove redundancy, [indiscernible] random articles and your relevance
drops. Or, if you want full relevance, just stay really close to your input
articles, and then [indiscernible]. You can even see it in the chain I'm
showing you, right, because the Tandoori murder case is definitely not
redundant, but it's also not relevant. We think this is what happens here:
that we pay for relevance with some redundancy, okay. Again, this is just to
give you a flavor of what questions you can ask about those chains. Yeah?
>>: So you looked into New York Times only? Because that may bias --

>> Dafna Shahaf: Oh, it definitely does.

>>: Articles, they use their favorite words, so that now Google looks --

>> Dafna Shahaf: So you can actually restrict it to The New York Times. You
can restrict it to New York Times.

>>: That's what you did?

>> Dafna Shahaf: I think that's what I did. It was two years ago, but yeah,
it was two years ago.

>>: Because otherwise, it may be uneven.
>> Dafna Shahaf: By the way, what you said was the key. Because we are just
using words and because different writers do tend to use their own words, you
can actually see that sometimes it prefers chains by the same writer.
>>: The same article, the same vein, the same issue from Wall Street Journal
may have nothing to do with it.
>> Dafna Shahaf: Yeah. I actually don't have anything other than the New York
Times, but it would be fun to play with the Wall Street Journal. Yeah?
>>: How did you choose to use --

>> Dafna Shahaf: The five news stories?

>>: Yes.

>> Dafna Shahaf: I think we went to one of those websites, what is the top
news story of the year or something like this, and we picked the top two for
every year or something.
So one thing I really like about chains: they allow some interesting forms of
interaction. For example, the O.J. Simpson trial -- there are so many ways to
connect those two end points and tell a coherent story. So what we did is we
added an interaction mechanism, where users were shown a tag cloud. They
could say, give me more about this word, or less about that word. And you can
do this with online learning or [indiscernible], but just to give you an idea
of what it looks like in practice.
So this user got a chain focusing on the verdict, okay? And they say I don't
care about the verdict. Give me more about the racial aspect. Then they got a
chain very similar to what you saw earlier about racial issues and L.A. police.
They could say give me more about blood and glove and then they get a chain
about DNA expert and fiber evidence. So there's a lot of playing room here.
So hopefully I've convinced you I know how to construct good lines. But like
I said, lines are not good enough. Again, O.J. Simpson: there are so many
different aspects to be covered. So next we switch to maps. And just as a
quick reminder, this creature is a map. Lines are coherent, and different
lines focus on different aspects, so where they overlap, they intersect.
Now, let me just define it semi-formally. So a map is just a graph, G, and a
set of paths, and all you need to know is that the nodes correspond to news
articles and the edges are the underlying edges of the paths. So the graph is
just the union of all of the paths.
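In code, the structure amounts to something like this little sketch (encoding
each line as a chronological list of article ids is my choice, not anything
from the talk):

    from dataclasses import dataclass

    @dataclass
    class MetroMap:
        """A map is a set of paths (metro lines) plus the graph G they
        induce; the graph is just the union of the lines."""
        lines: list  # each line: a chronological list of article ids

        def nodes(self):
            return {a for line in self.lines for a in line}

        def edges(self):
            return {(ln[i], ln[i + 1])
                    for ln in self.lines for i in range(len(ln) - 1)}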
Now, how do you define a good map? Well, the first property I gave you for
free, right? It's coherence. Every line should tell a coherent story. But
is it good enough? Can I just return the top three coherent stories in the
data set and call it a map? So let me show you what happens when you actually
do this. This is the map we got for the query Clinton -- again, old data set,
so Hillary was not around. The first line is about Clinton's visit to
Belfast. And then you have two more lines about Clinton's relationship with
some religious leaders.
Now, just taking it [indiscernible], what's wrong with these maps? And the
thing is, there are two things wrong here. First of all, I don't know how to
say it, but those are not really important stories, okay. There's so much to
be said about Clinton's presidency, and his visit to Belfast is not one of
them. Also, there's nothing to go against redundancy, right? Those two
religion lines are pretty much the same. And yeah, they're both coherent, but
they don't give me anything that the other one did not.
So there's importance, and then redundancy. In other words, the challenge is
really to balance this coherence with what we call coverage. Okay. So lines
should be coherent, but they should also be about topics that the user cares
about, and as many of them as you can. You with me? Yes?
So how do we do that? We tackled a very similar problem in a KDD '09 paper
called Turning Down the Noise in the Blogosphere, where the idea was to just
find a small set of articles that are both diverse and important. So this is
a tag cloud about everything that happened on January 17, 2009. The size of a
word corresponds to its frequency. You can see that Obama was very frequent --
this was Obama's inauguration. And also the Israel-Gaza conflict. And New
York, because the airplane landed on the Hudson river.
So the idea was to pick articles that are about important stories. And just a
one-slide summary of how we do this: all we need to know is that the documents
cover concepts. Concepts, you can think about them as words. For example,
the document covers some of Obama, some of Washington, some of U.S. You throw
in the orange one, you completely cover New York and add some coverage to the
U.S. At some point, you start looking at documents that cover some other
things. All you need to know is that we use this coverage notion; it
addresses both of the problems that we had, importance and redundancy.
>>: So what does it mean, that you covered New York?
>> Dafna Shahaf: Oh, it means that when your algorithm looks for another
document to increase coverage, it's not going to pick something about New
York, because it's not getting any additional gain from it.
>>: But I still don't understand what coverage is. I mean, the blue and the
orange both contain references to New York. So --
>> Dafna Shahaf: So it means that New York played -- was important in this
document. Okay. So this document was about something, in this case the
Hudson river, so they mention New York quite often.

So in other words, you're not going to -- actually, let's go with Obama, it's
an easier example. You had an article about the inauguration. It covered
Obama somewhat. It covered the inauguration somewhat. You keep on picking
articles that cover those two things, and you have this diminishing returns
property. At some point, you just stop adding.
>>: What is it that's diminishing? I don't understand.

>> Dafna Shahaf: Oh, so yeah, that's because I'm hiding it up my sleeve. So
the notion of coverage is a function that has diminishing returns. So I don't
have the formula here, but it's basically -- do you want me to go into the
formula? I can do that.

>>: I want you to just say what's the idea. Because at the moment, I don't
understand. Some documents mention Obama and New York. Some mention others.
How do you know --

>> Dafna Shahaf: Okay. For each document, we figure out what the important
words are, just standard NLP stuff. Then what you do is you say, well, this
document covered Obama, say, a third. Then you add another document. And you
don't want -- if this one also covered him a third, you don't want to just
keep adding Obama and Obama on and on and on.

So what you do is you turn it into a probabilistic max coverage problem, where
each document flips a coin and with some probability covers the concept Obama.
So when you have more documents about Obama, the probability that at least one
of them covers it gets closer to one, and then you don't get any additional
coverage from a new document, and you need to go into the Gaza-Israel stories.
You're still not convinced.
>>: Let's discuss it later.
>> Dafna Shahaf: All I want to tell you is that we have a coverage notion --
it's not part of the main line of work here -- that helps us figure out if a
set of documents is about important things and is also not redundant. Fair
enough?
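For the curious, here is a minimal sketch of the probabilistic coverage idea
from that exchange, assuming we are given, for each document, how much it is
"about" each concept (the doc_concepts and concept_weights inputs are my
stand-ins for what the system actually estimates):

    def covers(doc_concepts, selected, concept):
        """Each selected document d covers the concept with probability
        doc_concepts[d][concept] ('this document covered Obama a third'),
        so a set of documents covers it with 1 - prod(1 - p_d). Each extra
        Obama article adds less than the one before: diminishing returns."""
        miss = 1.0
        for d in selected:
            miss *= 1.0 - doc_concepts[d].get(concept, 0.0)
        return 1.0 - miss

    def total_coverage(doc_concepts, selected, concept_weights):
        # Weight concepts by importance, e.g. how prominent they were that day.
        return sum(weight * covers(doc_concepts, selected, c)
                   for c, weight in concept_weights.items())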
And this is what happens when you incorporate coverage, when you look for a
map that's both high coverage and coherent. Okay. So this is about Greece
again, and you have a line about strikes, a line about Germany, and a line
about the IMF. Now, what's wrong with this map? Come on, you're all thinking
it. Yes, precisely: they're not connected. And it's especially annoying
because we have this article about Germany and the IMF. For crying out loud,
at least those two should have intersected.
Our last property is connectivity. If two lines are connected, then I want to
know about it, okay? And there are multiple ways to formalize connectivity.
We experimented with users, and it seems like the only thing they care about
is: those two lines, I know they're related, but the map doesn't show it to
me. They didn't seem to care how the lines were connected -- whether it was
at the beginning or the end, through one article or multiple articles. Just:
are they connected or not? So we just went with the really simple objective
of counting the number of lines that intersect.
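As a sketch, that objective is literally a pair count (lines encoded as lists
of article ids, as in the map sketch earlier):

    from itertools import combinations

    def connectivity(lines):
        # One point for every pair of metro lines sharing at least one article.
        return sum(1 for a, b in combinations(lines, 2) if set(a) & set(b))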
So now we have objectives for coherence, coverage and connectivity -- our
three C's, I guess. And how do you turn them into one big objective function?
The idea is that it's really a game of tradeoffs. Like I told you, if you
maximize coherence, you get all those Clinton/Belfast stories with low
coverage. And if you try to maximize connectivity, then again you're going to
get those lines that are almost the same -- they're definitely connected, but
they're about the same thing, so again coverage drops. If you try to maximize
coverage, your connectivity drops, and so on. So here are your properties.
How would you combine them? Let's start with coherence.
Now, hopefully I've convinced you that we're not after maximizing coherence.
We don't necessarily want the most coherent chains that we have. Rather, it's
really a constraint: you only want the chains to be coherent, right, to be
above some threshold.
Now, we're left with coverage and connectivity. We really want to maximize
both, but think about it. If I tell you, here's a map, here are chains that
you care about, but I don't tell you how they're connected -- versus, here's a
map, here are chains that you don't care about, but I'll show you how they're
connected -- what do you prefer?
So hopefully you agree that coverage is more important than connectivity. So
this is our primary objective, and connectivity is our secondary. So the way
you would write it down: let kappa be the maximal coverage you can achieve
with coherent maps, and you try to find a map that's coherent and that
maximizes connectivity, given that coverage is already maximized.
You look skeptical. So the only problem is, this generates disconnected maps,
because coverage is a set function. So really, there's no reason ever to use
the same article in two different lines, because you don't get any extra
coverage from it.

Okay. So we had to introduce some slack. We're willing to sacrifice an
[indiscernible] fraction of the coverage if it tells us something about the
connectivity.
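Written down, the two-level objective she is describing might look like this
(tau is the coherence threshold and epsilon the slack; notation mine):

    \kappa \;=\; \max_{M \,:\, \text{every line of } M \text{ is } \tau\text{-coherent}} \mathrm{Cover}(M)

    \max_{M} \;\mathrm{Conn}(M) \quad \text{s.t.} \quad \text{every line of } M \text{ is } \tau\text{-coherent}, \;\; \mathrm{Cover}(M) \,\ge\, (1 - \varepsilon)\,\kappa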
So that's the map objective. Now, let me just give you a very high-level
overview of how to get good maps. Okay. So we start from a set of documents.
Next thing: remember I told you that coherence is a constraint, that we only
care about coherent chains? So we need to find a way to represent all the
candidate chains to be used in the map.

So what we do is encode all coherent chains as a graph, which we call the
coherence graph. Basically, each node here corresponds to a short coherent
chain, and edges between nodes correspond to [indiscernible] that still
remain coherent. It's a transitive property, so really, each path in this
graph is a coherent chain, okay?
Next thing you do is try to find a set of high-coverage chains in this graph,
okay -- so they're also coherent. Now, if you think about finding a path in
this graph, you're really looking for a path that maximizes some function of
the nodes visited, right? And somebody already solved this problem for us.
It's called orienteering, and it's a hard problem. But luckily, again, our
coverage notion is submodular -- that's what I was trying to tell you earlier
about diminishing returns, which didn't completely work. So we can use this
algorithm for submodular [indiscernible]. It's a nice little greedy that
gives us [indiscernible]. It's recursive and it has some approximation
guarantees.
So we know how to find a set of high-coverage chains in this graph that are
also coherent. Final step: we just have a local search step that tries to
increase connectivity without sacrificing coverage.
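Here is a rough sketch of those last two steps under the definitions above.
The greedy selection stands in for the recursive submodular-orienteering
algorithm she mentions, and every argument is an assumed input: candidate
paths from the coherence graph, plus the coverage and connectivity functions
sketched earlier.

    def pick_lines(candidates, k, coverage_fn):
        """Greedily pick k coherent candidate chains by marginal coverage --
        the standard heuristic for a submodular objective like this one."""
        chosen = []
        for _ in range(k):
            chosen.append(max(candidates,
                              key=lambda c: coverage_fn(chosen + [c])))
        return chosen

    def add_connectivity(lines, candidates, coverage_fn, connectivity_fn,
                         eps=0.05):
        """Local search: swap a line for a candidate whenever connectivity
        improves and coverage stays within (1 - eps) of the starting value."""
        kappa = coverage_fn(lines)
        improved = True
        while improved:
            improved = False
            for i in range(len(lines)):
                for c in candidates:
                    trial = lines[:i] + [c] + lines[i + 1:]
                    if (coverage_fn(trial) >= (1 - eps) * kappa and
                            connectivity_fn(trial) > connectivity_fn(lines)):
                        lines, improved = trial, True
        return lines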
Now, the perfect time -- yes?
>>: Sometimes you may have some sources that are reliable news sources, and
some sources may not be reliable. For some problems you may care about
reliability, but for other problems you may not. And sometimes you need to
make a trade-off between coherence, coverage, reliability. Can this approach
do those kinds of tradeoffs?
>> Dafna Shahaf: So here we just said it's The New York Times, we trust
everything they have to say. Which, yeah, has some limitations. We actually
came across this problem in the previous paper, where we were dealing with
blogs. So yeah, there were some really bad things we had to filter. But
here, it's nice -- it's The New York Times. We're going to come back to trust
when we talk about scientific papers. Yeah?
>>: So the whole process of formulating objectives and constraints seems
mostly creative, in the sense that you sort of look at the output and see
what's wrong. [indiscernible] feedback or problem pair-wise, or is that
outside of scope?

>> Dafna Shahaf: How can I quantify it, I mean, other than the user studies
again?

>>: So coming up with the coherence [indiscernible] and so on, it seems like
you sort of look at it and you say, what's wrong with this. So it seems it's
largely sort of looking at it, and it's a very subjective --

>> Dafna Shahaf: Banging my head against the wall for a few months coming up
with a notion that I like is a pretty good summary.
>>: So [indiscernible] users and then use the clicks or use pair wise?
Otherwise, [indiscernible].
>> Dafna Shahaf: So yeah. So one thing might be giving users a chain and
asking them: coherent, not coherent? If not coherent, is there a small change
you can make to make it more coherent? And you'll see through some -- I don't
know if it's going to work, but just sift through some local gradient that you
can follow. Like, what would make this more coherent, or is it just beyond
repair?
But for now, actually, that's pretty much what I've been doing so far: just
trying to come up with objectives. The thing is, there are too many possible
chains out there to do the standard machine learning -- just give me a couple
chains that are good and a couple that are bad; it's not going to work. So
maybe feedback on a finer level. Does that sort of answer it? I need to
think about it some more.
Okay. So this is the algorithm. Now, just let me show you an example of what
a map looks like. This is the real Greece map, not the simplified version you
saw earlier. And you see there's a line about the deficit-cutting plan --
they have to make cuts, they're rated junk, and so on. Next you have a line
about strikes and riots, next you have a line about Germany, and finally a
tiny line about the IMF coming out at the end.
So this is what the maps look like and this is really sorted chronologically.
Next, how do you know that those maps are any good? So again, a very
high-level overview of the user study here. We again had the New York Times
data set, this time slightly newer -- 18,000 articles or so. And we tried to
see what maps are good for, really.
So the first thing we tested is what we call micro-knowledge. Okay. So, just
using maps as information retrieval tools: suppose the user has some questions
in mind, like who's the prime minister of Greece -- is the map any good for
helping them locate the answer faster? And it did show some improvement, but
it was minor; compared to the competitors, we didn't really do a lot better.
Definitely not statistically significant. And like some people told us in the
study: if I wanted to know the answer to this question, I would just search
for it. There's really no need to go looking at a map.
So the second thing we tried is what we called macro-knowledge: seeing if the
map can help people understand the big picture. And how do you test this? So
we decided the way to see if somebody really understands a story is to see if
they can explain it to somebody else. Just think about the last time you
TA'd. So we asked people to look at the map, or look at the competitors, and
give a one-paragraph summary of the story.
And then we threw the paragraphs on Mechanical Turk and asked: here are two
paragraphs, one of them generated by map users, the other by competitor users.
Which one tells a more coherent and complete version of the story?
Okay. And here are the results. For the Greece debt crisis, 72% of the
Turkers preferred our map paragraphs. It looked less good for the Haiti
earthquake, where we only got 59%. Then I actually had to go look at the
paragraphs. And although the map did have a lot of other aspects of the
story -- like, what was it, some kidnapped orphans, and some temporary
immigration laws established in the U.S. to help -- the users that summarized
those paragraphs just followed the main story line, okay: earthquake, lots of
damage, distributing aid, and so on.
So our conclusion for now, the bottom line, is that maps are useful for those
macro-summaries, as tools to understand the big picture -- especially for
stories that are complex, versus stories that you can feel have a single
dominant story line. Yes?
>>: Are you going to do more on this study? I was going to ask a question,
but I can wait.

>> Dafna Shahaf: No, I'm going to switch to the next.

>>: So what were the competitors here?
>> Dafna Shahaf: So there was Google news again, just Greece debt crisis or
whatever they wanted to type and just read the first, I think, up to five pages
or so. And second was TBT that I talked about earlier.
>>: And were Greece and Haiti the only two --
>> Dafna Shahaf: Greece, Haiti and Chile -- the miners trapped underground.
Again, a pretty small-scale study. There's a limited number of undergrads
that can be convinced to come for pizza.
>>: So since it's just these three, did you look at the maps that were
generated for Haiti and Chile and see if they sort of were as high quality as
the Greece map?
>> Dafna Shahaf: Actually, again, I think they were good quality, but people
did not care about the side stories as much. Seems like when you talk about an
earthquake, there's just the main story line and everything else is
distractions. While in Greece, they somehow liked other lines more.
>>: So you assume it's more because of the topic, not because of the quality
of the map?
>> Dafna Shahaf: Yeah. At least that's what I think. Okay. So we know how
to construct maps, I hope. Now, how do you adapt it to science? And
[indiscernible] you're supposed to stop and ask: wait, why even bother
adapting it? These techniques should still work, right, for scientific
papers -- why have you been changing them? And you'd actually be right.
Those things work out of the box.

The real nice thing is that science just gives us all this wonderful
additional structure, in the form of the citation graph. And maybe we can do
something smarter with all this extra information.
So let me just walk you through how I would modify the maps. Okay. So this
is a quick summary of what you saw so far. We have three objectives:
coherence, coverage, connectivity. And this is what we did in the news
domain. Now, let's just go one by one, and I'll tell you how I would modify
each for scientific papers. So coherence first. Hopefully you remember this
slide; this used to be our coherence objective. And if you take a close look
at it, there are really two main ideas going on. One is computing the
influence of words per transition, and the other is choosing a small set of
words that capture a story well.
And the second thing still works, but how about computing influence? Remember
how I told you there are lots of influence notions in the literature, but we
can't use them because we don't have edges? Well, now we do have edges. We
have all those people citing each other and really telling us who influenced
them. So maybe we can use that.
So we're going to change influence. And the idea is that we want to capture
the way ideas travel in the scientific literature. So when you write a paper,
your ideas are influenced by your previous work, by the papers you cite,
hopefully with some novelty involved. We use this notion of influence from
Beyond Keyword Search, from KDD '11, and the idea, again briefly, is that for
each word, you construct a graph. Nodes are papers. And edges mean either
citation or same authors.

And [indiscernible] where the idea came from, because for each edge you come
up with, what's the chance that this word -- the one the graph is constructed
for -- came from Q or from R, or it's something novel. Then you use this
graph: in Beyond Keyword Search, they define direct influence, which is just
the probability that paper P2 got this idea directly from P1, okay. Maybe
through a bunch of other papers in the middle, but it originated at P1.
And when we use this notion of influence, just plug it into our coherence
notion, it really limits the type of chains we can hope to look for, right,
because it will only give you chains of papers that directly influence each
other, that build on top of each other -- usually from the same research
group. So it's not as interesting.

Therefore, we replace it with this notion of ancestor influence. We don't
care if P1 directly influenced P2, as long as they both got it from a common
source, okay, from a common ancestor. And this gives us some nicer chains.
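A minimal sketch of one reading of ancestor influence: score a pair of papers
by the best common ancestor that directly influenced both of them on that
word. The max-min combination is my assumption, and direct_infl stands in for
the Beyond Keyword Search-style direct-influence score, which is not
implemented here.

    def ancestor_influence(direct_infl, papers, p1, p2, w):
        """How strongly are p1 and p2 related through word w via some
        common ancestor a? direct_infl(a, p, w) is an assumed input."""
        return max((min(direct_infl(a, p1, w), direct_infl(a, p2, w))
                    for a in papers), default=0.0)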
By the way, if there are [indiscernible] people in the audience, I could use
some help coming up with better algorithms for this. Okay. So that's
influence. How would you change coverage? Now, all I really wanted you to
know about coverage is that our original notion just covered concepts:
documents cover concepts.
Well, that's not really good enough in the scientific domain, because really,
words are not enough. Think about those two papers: SVM in Oracle Database
versus support vector machines in a relational database. They have very
similar content. Okay. This is their content, and you see SVM, data,
database, performance, efficiency. But they had very different impact.
Again, if you look here, this is the cloud of the papers citing them, okay.
So if a paper cited you, we put its authors and venues in this cloud. And
what you should see here is, first of all, the paper on the left affected more
authors and venues, just because there are more words here. Also, despite the
fact they're solving the same problem, they're barely related -- there's very
little intersection. I think there's only a single paper citing both of them.
Okay. So we decided that in the scientific domain, instead of covering words,
we want to cover the papers themselves. So a paper will cover the papers that
it had a big impact on. If you think about it, what I'm saying is that a
high-coverage map is a small set of documents that together had impact on a
large chunk of the corpus.

And some people might think that descendants are counterintuitive, right?
Because how can a paper cover future contributions -- like, how can number
theory papers cover [indiscernible]? But when you think about it, really,
looking at ancestors only gives you some idea of the context where the paper
was written, while if you look at the descendants, you can really get the gist
of what the contribution was.

Okay. So we're covering papers instead of concepts.
Last thing: connectivity. Previously, we just counted the number of lines
that intersect. And it can work. This is a detail of the map about support
vector machines, and you can see there is a line here about large-scale SVMs
and a line about multiclass SVMs, and they both intersect at a paper about
large-scale multiclass SVMs. So sometimes this notion does work. But more
often, it doesn't.
Because really, in scientific papers, there's a rich palette of interaction
possibilities. You might cite a paper for many reasons. And again, coherence
works against us all the time. Because, see the blue line here? It's a
coherent chain about linear classifiers, perceptrons, SVMs and kernel SVMs.
And the orange chain is about SVM applications to vision: facial detection,
facial recognition. And there's not a single paper that can comfortably fit
in both chains, right? You can't really get them to intersect and remain
coherent.
But they're clearly related, right? All those vision papers cite the theory
papers. You with me? So we decided to reward chains not just for direct
intersection, but also for having high impact on one another.
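A sketch of how that relaxed connectivity might look, with impact(p, q) as an
assumed pairwise influence score (for example, the ancestor-influence sketch
above) and the threshold as a free knob:

    from itertools import combinations

    def science_connectivity(lines, impact, threshold):
        """Two lines count as connected if they share a paper, or if some
        paper on one line had high enough impact on a paper on the other."""
        def related(a, b):
            return bool(set(a) & set(b)) or any(
                impact(p, q) >= threshold for p in a for q in b)
        return sum(1 for a, b in combinations(lines, 2) if related(a, b))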
So just to show you what this results in, this is a map about reinforcement
learning.
Let me zoom in. The first line is about MDPs, POMDPs, something called EMDPs.
And you can see how it affected a line about coordination and cooperation in
multiagent systems. You can see here that this paper cites this one; they say
that POMDPs extend to [indiscernible]. And you can also see how the MDP line
affected this line about robotic arm movement.

At the other end of the map, there's a line about the exploration-exploitation
dilemma and bandit problems, and you can see how it interacts with this line
about analysis and bounds of reinforcement learning.
>>: So where is the text in these speech bubbles from?

>> Dafna Shahaf: Oh, so this is actually a direct citation, not just some
couple-of-levels impact. This is the text around the citation, [indiscernible]
with the limitations of PDF text extraction.
Okay. So we know how to adapt maps to the scientific literature. One last
thing, and this is a user study; I'm actually going into a little bit more
detail. How do you evaluate those? Evaluating maps for science is really
tricky. First of all, you can't do double blind -- there's no way, the output
is just too unique. Which means you need to get a group of users, ask them
all the same questions, let half of them play with maps, the other half with
competitors.

This means you have to find a research domain the entire group can
understand -- they need to read those papers -- and also they must not be
experts in it in advance. So we chose reinforcement learning, and we
constructed maps over the ACM corpus, about 35,000 papers.
Let me just tell you how the user study went. So we had people stepping into
my office, and I told them to pretend they're a first-year grad student, all
excited about doing a project in reinforcement learning. They step into the
professor's office going, yes, teach me everything you know about
reinforcement learning. And the professor gives them a survey paper to read.

Now, the last survey paper I know of in reinforcement learning that is
actually fitted for a first-year grad was written in 1996. So their task was
really to update it: to find some more recent research directions and some
relevant papers to fit into this new survey. And they're given 40 minutes,
which is not a whole lot of time -- just to simulate a quick first pass over
the data.
They could use Google Scholar, or they could use our maps and Google Scholar.
They're given no instructions whatsoever; they just stumble upon this thing.
And we also have two baselines, which are the map itself and the Wikipedia
entry.
We took snapshots of their progress and we recorded their browsing history.
Next thing we did, first of all, we ended up with 30 participants. We had to
get rid of four of them that didn't quite understand the task and wrote me
really nice essays about reinforcement learning. And then we took the papers
that all of the people mentioned. We combined them into one really long list
and we sent this to a judge who is an expert in the area.
And the judge had to, for every paper -- they didn't know where it came
from -- tell me relevant, irrelevant or seminal. Also, since the users put
every paper under some research direction category, it had labels, and the
judge also told me whether the label is good or bad, okay?
So, precision. All we could get out of it were the blue lines -- the green
lines -- and we're doing better both in the score of the papers and the score
of the labels, okay? If you want to know about the baselines, Wikipedia did
quite poorly, really. First of all, it had 15 citations and only four of them
qualified for what we were looking for, meaning research papers written after
1996. Now, out of those four, only two were deemed relevant. Although, in
Wikipedia's defense, a lot of those references were books that could have been
useful for our hypothetical first-year grad student.
Now, the map -- it is a bit harder to compare the map, just because there were
more papers, but just to give you the flavor: there were 45 papers. Seven of
them were deemed seminal and another 21 relevant. Interestingly enough, many
of the irrelevant papers seemed like they were used to bridge between two
relevant papers, just to form a chain.
Also, it was somewhat concerning that the map has all those seminal papers and
users didn't quite see all of them. They didn't list all of them. So there's
definitely some research going into this area of how to show people what's
important in the map. Actually, I guess, how to know what's important in the
map first.
And the last thing is recall. Because it's really nice that they get papers
that are relevant, but are they completely overlooking some really important
research direction? So we composed a list of the top ten areas of
reinforcement learning, and we just computed the fraction of areas that each
user found, and again we're [indiscernible] Google Scholar users alone.
And at the end of the experiment, we asked people to just tell us what they
thought. So, to summarize: they thought that the maps were helpful in
noticing directions that they didn't know about, and a useful way to get a
basic idea of what science is up to. And a lot of their negative comments can
be chalked up to my [indiscernible] skills, frankly -- things like, the legend
is confusing, or it's hard to understand from the paper title alone.
Okay. Just to remind you where we're headed: the direction is still the same.
We're trying to build issue maps automatically for every query you have in
mind. And one thing I think could really add a lot to issue maps is this
interactive component, okay -- this personalization. There are so many ways
you can interact with this, right? You can zoom into something you care
about, or zoom out. Or maybe, like the chains: I want to know more about
Germany's role in the debt crisis -- just increase the importance of that
word.
Another thing I was playing with recently was having a map that reflects what
you already know, your background. Because if I search for reinforcement
learning, versus some expert, we're really looking for different things. So
maybe just give the map as input, or [indiscernible] text file: hey, these are
the papers I know about, can you use that in the query somehow? I think it
could really make it much more useful.
Okay. One more thing. How am I doing on time? Hm? Do I have time for a
one-minute demo?

>>: Absolutely.
>> Dafna Shahaf: Just to show you -- and I don't have internet access here
for some reason, so you're going to have to survive with what I did at the
hotel. This is our site currently. You see, there's a map. Okay. I can do
both. And you can click on an article and you can read the article -- like I
said, not so much Greek skills -- and wait. Yeah. Anyway, that's what I
wanted to show you. We have a website. We hope to launch it soon, after we
finish fighting all the HTML5 kinks, and then I guess I'm going to get a lot
more data and see what works and what doesn't work.
Okay. Conclusions. A huge chunk was, like I said, just formulating those
metrics, just coming up with good objectives: what's coherence, what's
coverage, how do you measure connectivity. And then coming up with efficient
methods to actually compute them, with some theoretical guarantees.
We have some user studies that highlight the potential of these methods, and
the website is on the way, hopefully soon. Now, if there's one thing I want
you to take out of this talk, it's probably this one, okay? Search engines
are great, but sometimes you need more than that. Sometimes you have more
complex information needs, and then hopefully you're going to use the maps.

Thank you. Now, if you have any questions.
>>: Did you consider [indiscernible] doing, like, automatic evaluations?
Your mention of surveys seems like an obvious one, where you can take the
papers from before the survey and see if your recall covers basically what's
cited in the survey, by generating from the papers beforehand -- and even, in
the maps, separating into the sections prevalent in the survey.

>> Dafna Shahaf: I lost you at some point.

>>: You can use the survey as a point of evaluation, right? From the papers
beforehand, do you recover the seminal papers by doing automatic analysis?
Do you segment them in the way the survey segments them? Are you assuming
they're available for --

>> Dafna Shahaf: I was looking at it. I was looking at planning surveys.
There are surveys that have completely different ways of segmenting this, and
they mention different papers. So maybe I should go, like, one level higher,
and just look if they find the good authors. I guess this is less
controversial. Maybe. But no, I guess you can use surveys, yes. I want to
write this down. I actually like it. Yeah, anyway, we haven't done this.
Yes?
>>: Can you remind us one more time the definition of connectivity for your
purposes?
>> Dafna Shahaf: Yes, it was -- okay. For the first one, it was the easy one.
Just every two lines that intersect, you get a point.
>>: So in other words, another way to put it is it's simply the number of
edges in the intersection graph, where the nodes are lines? So why don't you
use actual connectivity -- is this graph connected?
>> Dafna Shahaf: Because sometimes it just, it really can't be connected
because of what I showed you in the scientific domain. Some lines are just,
especially when the query's wide, some lines are just too out there. They
can't be connected.
>>: Why don't you use the number of [indiscernible]? Because say you have
ten lines: you could have each connected to the next so they form a chain,
and alternatively, four of them could have nine edges together and all the
rest are separate.
>> Dafna Shahaf: You can use the number of connected components. But it's
also interesting to see how the things interconnect, how the components are
connected, right?

>>: Right, right, but it seems just the number of connections --

>> Dafna Shahaf: Yes, that would work.

>>: -- doesn't quite capture --

>> Dafna Shahaf: Since we're doing a local search, this is actually the
easiest objective to change, right? [inaudible]. Yes?
>>: I want to [indiscernible] on Paul's suggestion earlier. When we were
talking, I was thinking, well, here are the [indiscernible]. You could have
as a unit a paragraph, right? So you take all the paragraphs and you shuffle
them and you [indiscernible], and you assume the paper is the [indiscernible]
story. And now you can see how all the papers [indiscernible] hopefully,
basically, score with the different measures that you have.

>> Dafna Shahaf: Yes.
>>: If they score well, it would be actually, you know, a recognition that
it's a [indiscernible], assuming that the paper has some coherence, coverage.
>> Dafna Shahaf: Which is a [indiscernible] assumption for most authors.
Actually, Eric, remember we talked about it at some point? Yes, this is
somewhere on my to-do list -- which is long, I admit. And I think a single
paragraph is not enough for you to actually see that two paragraphs are
coherent, but maybe, I don't know, a third of the article would be good
enough. So yes, this is something we've been --

>>: The second question is, there seems to be an assumption that all the
items are somewhat comparable. But you could imagine feature axes. One would
be, let's say, [indiscernible] versus [indiscernible]. Another one would be
survey-like and [indiscernible]. And so you could either restrict to some of
these axes and see how the stories differ, or you could go for diversity
between these things, or not consider them at all [inaudible].
>> Dafna Shahaf: I really like that question. So this is where I think
personalization can come in really useful. Because then -- well, not on the
axes you were talking about, but you can definitely bias towards Republican or
Democrat. When you use the words "Obama care," it's typically clear what you
think about it. And I was just talking this morning about maybe I should try
doing this: compute a map for the same query, one from the New York Times and
one from the Wall Street Journal or something else, and see how they differ or
give different points of view. [indiscernible] how to find the other axes
that you're talking about, high level/low level.
>>: Well, it's [indiscernible].
>> Dafna Shahaf: That's doable, yeah. I guess technically you can do it with
personalization as well, right -- just increase the weight of some words that
are very charged. Why would you want that, by the way?
>>: Oh, at least want to [indiscernible].
>> Dafna Shahaf: That's a perfectly good reason. Okay. Those are really
interesting, especially if you try to connect chains between end points that
are not really connected -- you really get something like a conspiracy theory
generator, right, because the most coherent story is not very coherent.

>>: You can imagine how a propaganda machine [indiscernible] just connections
between the gypsies of Europe, now, with the downfall of the banks.
>>: So anyway, any other comments or questions? Okay. Thanks very much,
Dafna.