Eyal Lubetzky: Good morning, everyone. It's a great pleasure to have Dafna Shahaf here from Carnegie Mellon and Stanford. She's going to talk about "The Aha Moment: From Data to Insight."
Dafna Shahaf: Thank you. Okay, first of all, can you hear me? Excellent. So I'm Dafna Shahaf, and like Eyal said, I'm going to be talking about the recent project that I'm working on. So what does it mean? The thing is, there was a time not so long ago when getting your hands on good data was really, really difficult, like this poor interviewer here who had to ride a horse across the country and ask farmers how many cows they had. And of course today it's a completely different story. Cows, they tell you where they are and what they've been up to today. So getting your hands on data is no longer much of a problem. We have more data than we know what to do with. And this is actually great news for everybody, because large-scale data has the potential to transform almost every aspect of our lives, from science to business to sports to public health. And it really addresses most of society's most pressing problems. The thing is, for this potential to be realized, it's not enough to collect this data, to acquire this data, or even to search through this data. You have to actually understand it. You have to make sense of it. You really have to turn the heaps of data into insight. And this is where I come in. My goal is to develop computational approaches for this [indiscernible] of turning data into insight. These are just some of the questions that I'm most curious about. For example, what is insight? Or, given lots of data, how do you help [indiscernible] the structure, or the most interesting bits and pieces? And also, how do you use these ideas to build a system that will facilitate discoveries? I'm going to give you an example. This is going to be in the context of news. So here's the scenario. Suppose you're trying to follow a really complex news story, [indiscernible] or a presidential debate.
So what do you do? Well, most of us would go to a search engine, right? We love search engines. The thing is, search engines are great at giving you those nuggets of knowledge, but they won't show you how the 57 million results fit together. So there's absolutely no structure. Now, I'm being slightly unfair here, because there has been a lot of work on incorporating structure into search results, including work that came from here, like NewsJunkie. But most of it boils down to a timeline. Like this one, for the Greek debt crisis. And I'm going to claim that this kind of summarization really only works for simple stories that are linear by nature. And real stories are nothing like linear. If I had to draw a picture of the Greek debt crisis, it's not a single line, it's more like this. So what do you do with these spaghetti-type stories? Let me show you the Holy Grail. How many of you have seen [indiscernible] before? Okay, that's kind of what I expected. So this is an issue map. It's a set of seven posters about the great debate about whether computers can think. Let me just zoom in. It's a graph where each node is an argument, like "machines can't have emotions." And this is saying that this argument is supported by that argument, which is disputed by the other argument. So you're supposed to look at this and get the big picture. Now, I stumbled upon those issue maps at Carnegie Mellon, and it's the type of stuff that I absolutely love, one of my favorite topics. So I stood there for an hour and read through the whole thing. And finally it made sense in my brain; finally everything clicked together. So I immediately started jumping up and down, like, okay, great, so how did they make those beautiful creatures? And it took them 20 [indiscernible] years, and they did pretty much everything manually. So my question is, great, how do you build those things mathematically? [indiscernible] So [indiscernible] are complex creatures. I'm going to start simpler.
And the system I propose is called Metro Maps. The input is a set of documents. You can think of them as the result of a query, say documents about the Greek debt crisis. And the output is a Metro Map, where a map is a set of lines. Each line is a sequence of articles, and each line follows a coherent narrative thread. So this one tells you how the Greek bonds are junk and how they have to get to 30 points to get a bailout. And you have different lines that focus on different aspects. You might have another one about the strikes and riots triggered by [indiscernible], and another one about Germany. So you're supposed to look at it and really get both the temporal dynamics and the structure, and how they relate to each other. So how do you go about finding those maps? Finding a map is a really hard problem, because it's very intuitive. If I show you a good map, you know it's a good map, but why? So what I do is start at the intuitive level: what makes a good map? For example, each line is coherent. But coherent is a really fuzzy term, so how do you formulate it mathematically? How do you come up with an objective function, something computers can play with? And once you have an objective function you're happy with, everything interesting I do tends to be NP-hard, so how do you optimize it, how do you come up with an algorithm with some guarantees? So: at the intuitive level, what are the properties of a good map; formalize them; and find a way to optimize the objective. Okay, so let's start. Take a couple of seconds and just think about it. What makes a good map? Well, I guess I gave you the first thing already, right? It's coherence. So each line follows a coherent [indiscernible]. But what does that mean? So we had an entire paper on this question, at KDD 2010. And the question was, given a chain of articles, how do you measure the coherence of the chain?
And I asked this question about coherence to a whole bunch of people, and they always came up with the same answer. Oh, it's a really easy question, just make sure that you have strong transitions. Okay, document 1 is similar to document 2, document 2 is similar to document 3, 3 to 4, and you're good to go. And the entire point of this paper is that strong transitions are not enough. Coherence is not just a property of local interactions between neighboring articles along the chain. Let me show you what I mean. So the bars here mean that the word on the left appears in the article above it. So this is an article about the Greek debt crisis. Now, suppose you want to be a [indiscernible]. So the first thing you do is, you know, you try to find a second article linking up to the first one. And you might come up with this: what the Republicans think about the debt crisis. Now, if you completely forget about the first article and you try to find a third one similar to number 2, you might come up with this one: what the pope thinks about Republicans. And it is going to keep drifting farther and farther away. And you see this circus behavior? It really means that you get a stream of consciousness here: each transition is strong, but for a completely different reason than the other transitions. So the overall effect is incoherent. Okay, let's try again. Same first document, Greek debt crisis. Same second document, what Republicans think about the debt crisis. But this time you know where it came from, and you know the [indiscernible] are not the main point. So you keep finding things that are on topic. And you can see the overall behavior is much smoother and nicer; you don't see the circus behavior. And most importantly, there's a small number of words that can capture the entire story. So let's try formulating it. The first thing we want for a chain to be coherent is for each transition to be strong. So this means we need to define the score for a transition.
First thing we tried, super simple: the score of a transition between documents D_i and D_{i+1} is the number of shared words. Okay, just sum over the words; a word scores one point if it's shared. Super simple. This was way, way too coarse. Because first of all, some words are more important than others. And also, words are noisy features, so you might have an article about judge and jury but not lawyers. So it's really implicitly there. So I really had to replace this indicator function with a softer notion that we call the influence of word W on the transition between documents D_i and D_{i+1}. Intuitively, this influence is high if both documents are related and W plays a role in what makes them related, even if it does not appear in either of them. I'm not going to go deep into how we did this, but just to give you the flavor: we made a bipartite graph between documents and words, and we looked at random walks between D_i and D_{i+1}. And then we looked at the same thing when you're not allowed to go through word W, to see how word W affects those random walks.
>>: Can I ask you a small question?
Dafna Shahaf: Yeah, sure.
>>: How are these documents selected?
Dafna Shahaf: Think of the query. You have the query Greek debt crisis, just big document -- oh, you mean the documents of the chain?
>>: Yes.
Dafna Shahaf: So at this point I'm asking, given an entire chain, how do you measure the coherence. Then we're going to talk about how to actually find a good chain.
>>: Okay.
Dafna Shahaf: Okay, so this is the score for a single transition. Since you want them all to be strong, you really want the weakest link to be strong. Yeah?
>>: Could you just -- you said you didn't want to go into too many details. But so one side of the bipartite graph was --
Dafna Shahaf: Words and documents.
>>: Documents.
Dafna Shahaf: So an edge means that the word appears in the document, and you can weigh it by [indiscernible].
>>: How do you do the walk?
Dafna Shahaf: Okay, we should probably talk about this offline, but the idea is a random walk with random [indiscernible] that's relatively high, so the walks are short and you don't get too far away. And then you look at a random walk from D_i, just back and forth through the words, to D_{i+1}, and then you do the same thing with W made into a sink node, so the walk gets trapped until the next restart. Okay, we'll talk about it later. So you want the weakest link to be as strong as possible. This means that all of the transitions are strong. So now I have that all transitions are strong, but I just told you that there has to be a global theme, you can't have a stream of consciousness, there needs to be a small number of words that capture the entire story, right? So we turned this into an optimization problem: actually find the small number of [indiscernible] that captures the entire story. So how about I give you a budget? You're only allowed three segments, okay? And I'm going to pretend that these are the only three words that appear in the documents. And it's scored just like before, but only using those words. So if you go through the segment, there is nothing between document 1 and document 2, right, so your [indiscernible] is zero. But if you pick [indiscernible], your weakest link is much, much better. This is the extent of all the words, [indiscernible] all the words you call active. And the problem becomes just an optimization over activation patterns. So we look for a way to choose those active words subject to constraints that try to mimic this behavior of coherent chains. Okay? Breathe in, breathe out. This is our notion of coherence. And the way we solved it was using an LP and a rounding algorithm.
>>: So the documents are ordered and there's --
Dafna Shahaf: By chronology.
>>: Okay. So how do you get the word --
Dafna Shahaf: Time stamps.
>>: -- it's a whole big thing.
Dafna Shahaf: No, no, it's news articles, they all have time stamps.
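[Editor's sketch, not from the talk: the weakest-link idea with a budget of active words can be illustrated in a few lines of Python. This is a deliberate simplification. It uses the plain shared-word count rather than the influence score, keeps one global set of active words for the whole chain, and picks that set by brute force instead of the LP and rounding algorithm the speaker mentions. The function names and the toy documents are hypothetical.]

```python
from itertools import combinations


def transition_score(doc_a, doc_b, active):
    """Number of active words shared by two documents (indicator version)."""
    return len(doc_a & doc_b & active)


def chain_coherence(chain, active):
    """Coherence of a chain is the score of its weakest transition."""
    return min(transition_score(a, b, active) for a, b in zip(chain, chain[1:]))


def best_activation(chain, budget):
    """Brute-force the set of `budget` active words that maximizes the weakest link.

    Assumption: a single global active set; the talk's formulation is richer.
    """
    vocab = sorted(set().union(*chain))
    best_score, best_words = -1, frozenset()
    for words in combinations(vocab, min(budget, len(vocab))):
        score = chain_coherence(chain, frozenset(words))
        if score > best_score:
            best_score, best_words = score, frozenset(words)
    return best_score, best_words


# Toy chains (hypothetical): each document is a set of words.
drifting = [
    frozenset({"greece", "debt", "crisis"}),
    frozenset({"debt", "crisis", "republicans"}),
    frozenset({"republicans", "pope"}),
    frozenset({"pope", "vatican"}),
]
on_topic = [
    frozenset({"greece", "debt", "crisis"}),
    frozenset({"debt", "crisis", "republicans"}),
    frozenset({"greece", "debt", "crisis", "bailout"}),
    frozenset({"debt", "crisis", "imf"}),
]

print(best_activation(drifting, budget=2))
print(best_activation(on_topic, budget=2))
```

[With a budget of two words, no pair of words links every transition of the drifting chain, so its weakest link stays at zero, while "debt" and "crisis" carry the on-topic chain from start to end.]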
>>: But these stories came out at different times. And you would have a news story that appears later and then describes earlier events.
Dafna Shahaf: Yes. I was wondering about it. Later I'm going to talk about this for books and movies, and I was wondering if I could do --
>>: Also the most coherent. Maybe they're using the word differently than you are.
Dafna Shahaf: So as far as I'm concerned, if the news story, even if it came out later and talks about something earlier, still follows this behavior, then people will still get something from reading it in order.
>>: So you need to determine when the events took place in the article --
Dafna Shahaf: No, I'm just using the time stamp. It seems to be okay. I was worried about this, you know, like in the books where they have flashbacks, but this would be good even if the time stamps were not completely matching the time of the event. Okay? That's an assumption. We can argue about it.
>>: What's the definition of maxed activations and --
Dafna Shahaf: Which word do you want to pick to be active? Which segment do you pick?
>>: [Indiscernible].
Dafna Shahaf: It appears here that basically you have a budget of how many you're allowed to pick. Okay, good.
>>: So it's a maximum over the whole subsets, of different sizes.
>>: I may be stretching a bit, but the Greek debt crisis was written up in English, in German, in Greek, and maybe even in Turkish. And how do you review these documents? In other words, not only in other languages: a life insurance salesman will tell you "last survivor" and you will understand "second to die" and say the [indiscernible].
Dafna Shahaf: Yes, language: we're only focusing on English right now, exactly because of this. Now, I was worried about using words. A part of the goal of this project was to see how far you can push really stupid features.
Named entities, noun phrases, and see if I actually have problems and see if I need to use something more sophisticated. It didn't seem to be a problem. Not yet. Good. Coherence. So we have a first property, coherence. And at this point you should be asking yourself, great, are we done? Can we just take the top three coherent chains and call it a day? So let me show you what happens when you actually try to do this. So your query is Greek debt crisis, and these are the top three coherent chains. Great, so there's one about Asian markets and two about strikes and riots. So what's wrong? Well, two things are wrong. First of all, I have a budget of three lines, and Asian markets are not really the most important thing that was going on, right? There are so many more important things, like what Germany was doing. Second, the bottom two lines are redundant; there's really no reason for the map to include both of them. So the challenge seems to be balancing coherence with what I call coverage. The map should be about diverse topics that are important to the user. Okay, so what does that mean? So let's formalize it. Coverage. First of all, I keep calling it coverage, so let's talk about the elements that I'm trying to cover. And you can think of them as words. So Obama and China. So we have a [indiscernible] that each document covers each word. And that can be based on [indiscernible]. One more thing that we have is a weight for each word, for how much we care about covering it. And if you don't know anything, this can be based on frequency. Everybody's talking about Obama, so he might be important. But this is a perfect place to plug in personalization, if you do know something about the user, like they don't care about politics but they love sports. And what I said in the earlier slide was that a high-coverage map should be about important things and should encourage diversity. And diversity just means a more [indiscernible]. I have this intuition of diminishing returns.
So I'm going to show you how this works. So this is our [indiscernible]. This is how the [indiscernible] with frequency. And you have the documents covering words fractionally. Because this blue document is lots about Obama and Washington, and these are important words, you get lots of [indiscernible] coverage. And then if you pick another document, it completely saturates Obama in New York. So at this point, if you pick another document about Obama, you get no [indiscernible] coverage. So the objective can push you to pick another word that's important and hasn't been covered yet. Okay? So each document of the map flips a coin with this [indiscernible]. If all of the map's documents try doing this independently, this is the probability that at least one of them succeeds. Then you take this, which is kind of how much the entire map covers each word, and sum over the words, weighted by how much you want to cover each word. So: how much do I care about this word, times how much the map covered it, summed over the words. And this is our notion of coverage. Okay, now I have two things, coherence and coverage. And how do they play together? Now, hopefully I convinced you with Asian markets, for example, that you don't necessarily want the most coherent chains in your map. Coherence is more like a constraint. A chain is either coherent enough for you, or it's not. And coverage is the thing that you're really after. So the problem becomes: find a coherent map that achieves the maximum possible coverage. Okay, now I'm going to show you what happens when I try to optimize this. You get this map. Okay, same Greek debt crisis; one line about strikes, one line about Germany, one line about the IMF. The [indiscernible] they're important, they're diverse. What's wrong? Come on, you're all thinking it. It's not a map, it's a set of disconnected lines.
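[Editor's sketch, not from the talk: the coverage notion just described, where each document independently "flips a coin" to cover a word and the map's coverage of a word is the probability that at least one document succeeds, can be written out directly. The per-document coverage values and word weights below are made-up numbers for illustration, and the function names are hypothetical.]

```python
from math import prod


def word_coverage(map_docs, word, cover):
    """Probability that at least one document in the map covers `word`."""
    return 1.0 - prod(1.0 - cover[doc].get(word, 0.0) for doc in map_docs)


def map_coverage(map_docs, weights, cover):
    """Weighted sum over words of how much the whole map covers each word."""
    return sum(w * word_coverage(map_docs, word, cover) for word, w in weights.items())


# Hypothetical inputs: cover[doc][word] in [0, 1]; weights[word] = how much we care.
cover = {
    "d1": {"obama": 0.8, "washington": 0.5},
    "d2": {"obama": 0.9, "new york": 0.6},
    "d3": {"china": 0.7},
}
weights = {"obama": 0.4, "washington": 0.2, "new york": 0.1, "china": 0.3}

base = map_coverage(["d1"], weights, cover)
gain_second_obama_doc = map_coverage(["d1", "d2"], weights, cover) - base
gain_china_doc = map_coverage(["d1", "d3"], weights, cover) - base
print(base, gain_second_obama_doc, gain_china_doc)
```

[Once "obama" is mostly covered by d1, a second Obama-heavy document adds less than a document about an uncovered word, which is the diminishing-returns behavior that pushes the map toward diversity.]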
And it's especially frustrating when you see this thing at the bottom about Germany and the IMF, and it should have been connected to the red line. So the last thing I add is connectivity. These two lines are related, and I want the map to reflect this. Okay, good. So there are many, many ways to formalize connectivity. We did a user study to see what people care about. And people got really upset when two lines were related but not connected. But they didn't seem to care much about how they were connected: at one point, multiple points, beginning, end. So I started with a really simple objective, trying to encourage connections. So when two lines intersect, you score a point. Super simple, just [indiscernible] if they intersect, you score a point. Now, later on we went with a more complex objective, but this is our first shot. So you have three things now: coherence, which is a [indiscernible] that we solve with LP rounding; coverage, which is a submodular function; and connectivity, which is a super simple thing to just encourage intersections. Okay, good. So how do they play together? Hopefully, again, like before, coherence is still a constraint. And now coverage and connectivity, you know, ideally I'd like to optimize both. But if I show you a map that is super well connected about something you couldn't care less about, then you probably still couldn't care less about it. So coverage is really the primary objective, and connectivity is secondary. So the problem becomes: consider all coherent maps that achieve the maximum possible coverage, and out of those find the one that's most connected. And this here is like lexicographic optimization, so the first objective is infinitely more important than the second. Okay, good, let me just give you a brief overview of how we optimize this objective. So our input is a set of documents. Again, you can think of them as the result of a query. Now, I said that coherence is a constraint.
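[Editor's sketch, not from the talk: the lexicographic objective described above (coherence as a hard constraint, coverage first, connectivity only as a tie-breaker) can be illustrated by brute force over a handful of candidate maps. The real system never enumerates maps this way; the candidates, the coverage numbers and the coherence flags are all hypothetical.]

```python
from itertools import combinations


def connectivity(lines):
    """One point for every pair of lines that share at least one article."""
    return sum(1 for a, b in combinations(lines, 2) if set(a) & set(b))


def pick_map(candidates, is_coherent, coverage):
    """Among coherent maps, maximize coverage; break ties by connectivity.

    Python compares tuples lexicographically, so coverage always dominates.
    """
    feasible = [m for m in candidates if is_coherent(m)]
    if not feasible:
        return None
    return max(feasible, key=lambda m: (coverage(m), connectivity(m)))


# Hypothetical candidate maps: each map is a tuple of lines, each line a tuple of articles.
map_a = (("d1", "d2"), ("d3", "d4"))   # high coverage, lines disconnected
map_b = (("d1", "d2"), ("d2", "d4"))   # same coverage, lines intersect at d2
map_c = (("d5", "d6"), ("d6", "d7"))   # connected but low coverage
map_d = (("d1", "d9"), ("d9", "d4"))   # best coverage but not coherent

coverage_of = {map_a: 0.9, map_b: 0.9, map_c: 0.5, map_d: 1.0}
coherent = {map_a: True, map_b: True, map_c: True, map_d: False}

best = pick_map([map_a, map_b, map_c, map_d], coherent.get, coverage_of.get)
print(best)
```

[Note that the tie-break only fires when coverage values are exactly equal, which is what "infinitely more important" means here; with floating-point coverage scores a real implementation would need a tolerance.]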
Ideally I'd like to enumerate all possible coherent chains that can serve as metro lines. But it's clearly infeasible. So what I do instead is encode all coherent chains in a structure that I call the coherence graph. And the idea is that each node of this graph is a short coherent chain. And an edge means you can concatenate them and it remains coherent. And it's a transitive property, so paths in this graph correspond to longer and longer coherent chains. So I really encode all coherent chains as paths in this graph. The next thing I want to do is pick paths from this graph, which again correspond to coherent chains, so that the underlying documents maximize coverage. I want to pick a high-coverage map that's coherent. So let's talk for a second about finding a high-coverage path in this graph. If you think about it, "find me a path of some length L that maximizes the coverage of the underlying articles" is just a special case of the more general problem of finding a path of [indiscernible] maximizing some function of the nodes visited. Which is luckily a well-studied problem called orienteering. Many smart people have paid attention to it before. So we use the algorithm of Chekuri and Pal: since our coverage function is submodular, we can use a submodular orienteering algorithm. And it's a very nice algorithm with an approximation guarantee. Okay, good. So just going back: we have the set of documents, we encode them as paths in a graph, and then we [indiscernible] in order to find a set of paths that are high coverage and coherent. The last thing we do, we have a local search step that tries to increase the connectivity without sacrificing coverage. So let me just show an example output of this algorithm. So this was a very simplistic map of the Greek debt crisis; now, that's the real thing. So they have one line about how Greeks struggle to stay afloat and they need help, but is it enough. Another line about the strikes and riots. Another line about Germany.
And a tiny line about the IMF. At this point you should all be staring at this and saying, okay, it's a nice picture, but is it any good? Like, how do you even [indiscernible] those things? And evaluating maps is challenging, because you don't have ground truth, we don't have a gold standard. And you can use all those surrogate methods from machine learning and data mining. But what I'm going to argue is that for what I do, user studies are crucial. First of all, because I want to make sure that we capture those intuitive notions we started from, you know, coherence, and also to make sure that what we're building is useful for somebody. I really want those maps to be useful. So I want to talk about the user study. So the question of the study was, well, can maps help news readers understand news events better than the state of the art? The data was New York Times articles, 2008 to 2010. And I picked three queries: the miners trapped in Chile, the earthquake in Haiti, and the Greek debt crisis. And the question is, again, can maps help people understand those stories better. So the first thing we tried was really simple question answering. We gave people ten questions, like how many miners were trapped, and we measured how well they answered and how long it took them, using either maps or competitors like Google News or something called topic detection and tracking. And we had roughly 340 users. And we were doing better than the competitors, but nothing to write home about, nothing major. And I was talking to people, and they said, you know, if I wanted to learn the name of the Greek Prime Minister, I would Google it, I would go to Wikipedia. This is complete overkill. And they're absolutely right. So I think what we learned is that maps [indiscernible] about those Ctrl-F type of searches, but to really show the big picture. So we need to design another set of maps to help people understand the big picture. Now, how do you know if somebody understands the big picture?
So, okay, if you've taught a class you know what I'm talking about. You only understand something if you can explain it to somebody else. That's what I think. So what we did was ask people to look at the maps or look at Google News and write one paragraph explaining the story to somebody who has absolutely no idea. Okay, then we took those paragraphs, we put them on Mechanical Turk, double blind, and asked people, so which paragraph does a better job explaining the story? So we had 15 paragraph writers write roughly 300 [indiscernible]. And this result was much, much better. So the Greek debt crisis had 72 percent of the people preferring paragraphs generated by map users. Now, they didn't do as well for Haiti, which was under 60 percent. And I was curious about why. And I went looking at the actual paragraphs. And it turns out that Haiti had one major story, earthquake and damages and [indiscernible], and a couple of tiny lines about, what was it, [indiscernible] running for the presidency, or missionaries accused of kidnapping children, something like this. And everybody who was asked to summarize this map just focused on the one major storyline. So I think the bottom line is that maps are most useful as high-level summaries for stories that don't have a single dominant storyline. Yes?
>>: So the people doing the evaluating, do they know anything about the topic beyond what a random person knows? Do they already know about the topic or do they --
Dafna Shahaf: So this is Mechanical Turk, with, like, slight quality control, and they have to know English and they have to --
>>: How about someone who -- [indiscernible] --
Dafna Shahaf: Yeah.
>>: -- one could argue that that's a flaw. Because it's just a paragraph that is appealing and appears to be helpful to a non-expert.
Dafna Shahaf: Yes, one can totally argue this. And I was talking about this a little bit in the paper. But it seems like a reasonable baseline.
So I tried to convince you that maps are good for news. And my goal for the next few years is to show that maps are not just about news. They turn out to be really, really easy to adapt to other domains. Because the main principles, you know, coverage, coherence, connectivity, stay just the same. But you might be able to use domain knowledge [indiscernible] smarter objective. Now I'm going to talk about three examples: science, legal documents, and books. So I'll start with science. So the goal of this project was to see if maps could help somebody understand the state of the art of some field. For example, what [indiscernible]. And the data we had was ACM papers, and we needed to do some slight modifications to the objective, mostly taking advantage of the citation graph, but other than that, the algorithm stayed exactly the same. So I'll show you an example. This is about reinforcement learning. I don't expect you to read it. I'll just walk you through it. So there's one line about the multi-agent setting, one line about the MDP [indiscernible], one line about controlling robotic arms, one line about [indiscernible] as an expression in notation, and a line about [indiscernible]. You see those are actually disconnected, but there are those funny dashed gray lines between them. That's because in the scientific domain you might have a line about theory and a line about application, and there's really not a single article that would fit them both. There's no direct intersection. But those lines are clearly related; there are also citations going on between them. So in the scientific [indiscernible] we allow for indirect connectivity. So if two lines have lots of impact going on between them, then we count that as well. So we start to see stuff like how the [indiscernible] had impact on the [indiscernible], or how the MDP [indiscernible] line had impact both on the robotic arms line and on the multi-agent line. Okay, I'm going to talk about the user study, because this is where the fun was actually.
So the question was, can maps help somebody, say a first-year grad student, understand the state of the art of some field better than [indiscernible]. So what we did, we brought people into my office and we told them: pretend to be a first-year grad student who is kind of embarking on their first-year learning project. You know, you got the professor, you're all excited, you want him to teach you everything he knows about reinforcement learning. And the professor gives you a survey paper. The only problem: this paper is from 1996. So the goal is really to update this survey paper, to find the more recent [indiscernible] and the relevant papers. And they could use either Google Scholar, or Maps and Google Scholar. Okay? So I had 30 participants. We basically combined all papers into one long list, and we had an expert judging precision, to say which papers were relevant, and some topic recall. So we composed a list of the top ten sub-areas of reinforcement learning in the last years, and we wanted to see how many of those areas they managed to find. This is just in a nutshell. And the result: on average, the map users had 10 percent more of their papers than [indiscernible], and of those top 10 areas, they managed to find on average almost three more. So that made us very happy. One more thing I want to do in order to convince you that maps are a good idea for science is that we made a map for our own related work at some point. So this is us connecting the dots to Metro Maps. And again, I don't want you to read it, just see what's been around us. So there's a focus on summarization, especially news. Lots of [indiscernible] narrative, some on coverage notions. This pink line is visualization, like Concept Maps and Mind Maps, and this [indiscernible] line is about mapping science. Good, so let's do legal documents. So there is a company in town that came knocking on our door one day, building a search engine for lawyers, and they wanted to know if maps could help lawyers argue a case.
So they gave us some [indiscernible] decisions. If you've seen some [indiscernible] decisions before, you might know that they're insanely long. They can be hundreds of pages. So my [indiscernible] idea was completely not finding signal in those. [indiscernible] So what turned out to work was that when you cite another case, you have the same [indiscernible] cite them. So, you know, "in blah versus blah the defendant [indiscernible] applied here." So if we just use this anchor text, you can pinpoint the most interesting parts of the document, and then everything else will just fall beautifully into place. I want to show you an example. This is one map that we computed for them, for the commerce clause. You can see, for example, this purplish line about who can [indiscernible] the community. And if you work for a state-owned company, can you sue them? Okay, great, but how about in federal court? Great, does this section apply or not? So we basically showed this map to the lawyers to get a reality check. And they first of all said it made perfect sense; they were even nice enough to label each line for us. And then we went ahead and computed the words that made each line coherent from our point of view. So you can see, for example, the third one: the lawyers said that the 11th amendment states [indiscernible]. We said immunity, sovereign, amendment, eleventh. Or the last one: regulating wholesale energy sales, and we said wholesale, electricity, resale, [indiscernible]. So we were happy about this; they're probably integrating it into their search engine now. Okay, the last thing I did, and this was just for the fun of it: I wanted to see if maps could help somebody understand the structure of a complex book. And by a complex book I actually mean "Lord of the Rings." Mostly because I refuse to read A Song of Ice and Fire until they actually finish writing it. So we had a lot of learnings. And my biggest problem was coherence.
Think about it: journalists are really nice, because they actually tell you what happened before. But books don't really work this way; they don't say, okay, now that we're done with the [indiscernible] and this guy's dad, we can go in and do that. So coherence was completely breaking. So what we decided to do is say that, hopefully, a single character's point of view is a coherent narrative thread. So we focus a lot more on named entities. And this just shows a little from the Lord of the Rings map. So this is the Hobbit and Gandalf starting to walk on their merry way; they collect people all the way to the castle, and then they split up. And you see people [indiscernible] going somewhere, and some, instead of going that way, they [indiscernible]. The bad guys are down there, and they're going to eventually meet the good guys. So there's a lot of structure already emerging. Okay, one more thing I want to tell you is what we did recently to make things more usable. So the first thing I was worried about was scalability. I really wanted to do a web-scale [indiscernible]. So basically we ran our objective, everything I used to say was [indiscernible], actually made the -- sorry, made it parallel, and came up with a hierarchy, like a clustered version of Metro Maps in which a metro stop is not just a single article anymore. And we brought it down from 11 minutes on several thousand articles to 30 seconds per query on hundreds of thousands of articles. The second thing I was worried about was interaction. And I really think it's going to make or break Metro Maps, because I'm not going to nail the right map based on a couple of keywords. But on the plus side, Metro Maps has so many awesome interaction mechanisms. So we tried to think. One is what they call a [indiscernible] solution, where you can zoom in to learn more or zoom out to get the high-level overview.
But the most interesting technical bit was that we had to come up with a community [indiscernible] algorithm to make this [indiscernible] function of dense overlaps. So we had a [indiscernible] algorithm with some block coordinate and gradient descent. The second thing we did was [indiscernible]. Remember in the current slide I told you this is a perfect place to plug in personalization to those weights? So we actually had a mechanism that let people say I don't care much about what Germany is doing, but Portugal is interesting. And the map would recompute based on whatever I think is important now, with ideas from [indiscernible] from [indiscernible] feature-based feedback. Another thing I did this semester, and this is a really fun project with a student, is think about controversial topics. So look for something like Obamacare. There's really not one map; it's really more Democrats versus Republicans. So we looked at how to formalize this notion of controversy using polarized sentiment, and how to cluster documents based on those sentiments and compute two different maps, representing two different points of view. One more thing that's been going on: we have a website, in the very final stage of debugging, and I have a student whose entire mission for the quarter is to come up with an open-source package so people can plug in their data and see what comes out. So this has been a really, really exciting semester. So the entire point of this project was to take a news reader or first-year student or [indiscernible] or really anybody that has lots of data and needs to rely on storage. And we wanted to show them a perspective of their field. We want to show them the structure and how things connect to each other. And we talked about how to formalize this: coherence, coverage, connectivity. We have the algorithm and we have user studies to evaluate our idea. Now at this point I was kind of staring at this and trying to think what to do next. 
So what don't I like about Metro Maps? The thing I dislike the most is that Maps can only show you connections that were explicitly made by somebody. A journalist told us that those two are related. But what if you want to make new connections? What if you want to discover something new? So this is how our project came to life, where the goal is: you have lots of data, how do you find some insight [indiscernible], or really how do you define this notion of insight. So just a word of caution, this is work in progress, it's a lot less mature than Metro Maps, but I have been having so much fun with it, I thought I'd tell you some. [Indiscernible] now there's been lots, lots, lots of work about this, right. There's psychology, cognitive psychology, there's data mining. You can argue that [indiscernible] about taking data and getting insight. Same thing for lots of info conferences. And I was going through a lot of papers and trying to abstract those ideas. So, just like before, what makes an insight? The first thing is almost [indiscernible], right, it has to be surprising. If you know about it, nobody cares. But surprise alone is not enough, because give me enough data, I'll find plenty of things that will surprise you, just because there's noise or bias or coincidence. So it has to be what I call plausible, or really supported by the data. And this is a super general idea. So let me just show you how this plays out in the medical domain. And the medical domain is perfect for me, because there's lots of data just lying around and every day you see those articles about how researchers found the link between blank and blank. So there's a potential for many, many new links that nobody's discovered. So we want to use this idea to build a system that will take researchers and give them some promising research directions, so identify where the gaps are in our current knowledge. So I said plausible and surprising. How does that work? 
First of all, for the purposes of this presentation, I'm going to restrict myself to a really simple kind of insight. It's a pair of medical terms. Like there's a connection between sleep apnea and diabetes that I think is insightful. Because it's just [indiscernible] of medical terms. Now for something to be plausible, I need it to actually [indiscernible] a lot in practice. So many sleep apnea patients actually do get diabetes. And in order for this to be surprising, you go through the [indiscernible] and nobody ever thought about it or nobody ever noticed it before. So we look for things that [indiscernible] a lot in practice, but nobody in the literature seems to know about them. So what kind of data do you need to compute those things? For plausible we have seventeen years of hospital notes. [Indiscernible] notes, we have about ten million of them. And for surprising we have about eleven million papers from [indiscernible]. Now this is an overview of the system. And if I lost you, by the way, this is a good place to pick up. So I have a system, it starts from a query. Now it doesn't actually have to start from a query, but researchers usually have something they care deeply about, so let's start from a query. In this case sleep apnea. So what the system does, first of all, is go through the medical notes and look for plausible candidates. So what happens to sleep apnea patients in practice? You take the candidates and you rank them according to MEDLINE. So what's surprising? So sleep apnea: what happens in practice and what does the literature not know about it? Now, for this to work, I need to tell it three things. Where are the terms coming from, what's plausible, and what's surprising? Okay, terms. So we extracted medical terms from the notes and from MEDLINE. And first of all, this is a lot more [indiscernible] than I expected. 
Because it's natural language, so physicians might tell you sleep apnea, acute sleep apnea, recurrent sleep apnea. And this is completely messing up my counts. So I need to know when I can merge something and when I can't. So I decided to use medical hierarchies. So we have this kind of thing, it's a [indiscernible], you have stuff like migraine disorders that has two [indiscernible], common migraine, and not so common migraine. And you really want to know when you see something if you can merge it up or not. So what we decided to do was use [indiscernible] divergences. So compute how much information is lost when you use a [indiscernible] in order to approximate a child. And you can see, for example, if it's a common migraine, you can propagate it all the way to migraine disorders, maybe to vascular headaches, depending on your threshold. But if it's this other type of migraine, no, it's a completely different beast. Okay, now surprise. What is surprising? Well, first of all, they can't [indiscernible] too often, right? So we have a threshold. So the number of papers mentioning them together can't be over the threshold. But that's still not good enough, because there might be two terms that just don't appear together because nobody cares about them, because five people in the universe combined have them. So it's more surprising if the terms are popular; if nobody noticed the connection between sleep apnea and diabetes, which are both really well researched. So, just like before, we have weights. So the importance of a term. And the way I like thinking about this is it has novelty and it has utility. So this is surprise. Now, plausibility. The way it works is exactly the other way around. So two things you [indiscernible] together in practice a lot. So what we did is we aggregated all the notes that a single patient received in a year, and that is our basic document. And the thing we computed is really the [indiscernible] coefficient. 
So how many patients have both of those things over how many patients have at least one. And let me just show you what happens when you try to plug those two objectives into our system. Here's some stuff on dementia. Those are the top six [indiscernible] things that happen with dementia. So the first three are Alzheimer's medications used to treat dementia. So they're going to be filtered away by MEDLINE. But then you're left with hip fractures, atrial fibrillation and wheelchairs. And this [indiscernible] gets really, really suspicious and says hip fractures and wheelchairs, oh, we might have a problem here. Because it might not be about dementia, it just might be that that population tends to be old. So we needed what we call [indiscernible] power. And I'm trying not to say the word [indiscernible], but you can think about it this way if it helps you. The idea is that we took a group of people that are really, really similar to dementia patients but don't have dementia. And we compared this to the [indiscernible]. And we say hip fractures, are they a lot more common for dementia patients than for this other group that is really similar, but just doesn't happen to have dementia? And if they're not, then you're not capturing the right thing here. It's not about dementia, it's about them being old or something else. So plausibility is about having [indiscernible] and also about passing this matching test. So let me show you. How about we start with dementia. Now, wheelchairs and hip fractures don't even pass the test. And the only thing you're left with is atrial fibrillation and, okay, what do you do with it? Is it an insight? Again, how do you evaluate? Ideally I would make just a series of bold predictions and send an army of physicians to chase them down. But this requires, you know, time and money and physicians, and I don't quite seem to have any of those things. So instead, what we did was early discovery. 
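The two parts of the plausibility test described above can be sketched in a few lines. The ratio "patients with both over patients with at least one" is the Jaccard coefficient; the comparison against a similar group without the condition is shown here as a simple rate ratio. This is only a minimal sketch under assumptions: each patient-year is modeled as a set of terms, and the function names, the toy data, and the `min_ratio` threshold are hypothetical, not taken from the talk.

```python
def jaccard(documents, term_a, term_b):
    """Documents containing both terms divided by documents containing at least one."""
    both = sum(1 for doc in documents if term_a in doc and term_b in doc)
    either = sum(1 for doc in documents if term_a in doc or term_b in doc)
    return both / either if either else 0.0


def rate(documents, term):
    """Fraction of documents that mention the term."""
    if not documents:
        return 0.0
    return sum(1 for doc in documents if term in doc) / len(documents)


def passes_matching_test(cases, controls, candidate, min_ratio=2.0):
    """True if the candidate is at least `min_ratio` times as common among
    patients with the query condition as among similar patients without it.
    `min_ratio` is an illustrative threshold, not a value from the talk."""
    case_rate = rate(cases, candidate)
    control_rate = rate(controls, candidate)
    if control_rate == 0.0:
        return case_rate > 0.0
    return case_rate / control_rate >= min_ratio


# Toy data (made up): one set of terms per patient-year.
cases = [  # patients with dementia
    {"dementia", "hip fracture", "atrial fibrillation"},
    {"dementia", "atrial fibrillation"},
    {"dementia", "hip fracture"},
    {"dementia", "atrial fibrillation"},
]
controls = [  # similar patients without dementia
    {"hip fracture"},
    {"hip fracture", "wheelchair"},
    {"atrial fibrillation"},
    {"wheelchair"},
]

for candidate in ("hip fracture", "atrial fibrillation"):
    score = jaccard(cases + controls, "dementia", candidate)
    keep = passes_matching_test(cases, controls, candidate)
    print(candidate, round(score, 2), keep)
```

In this toy data hip fractures are equally common in both groups, so they fail the matching test, while atrial fibrillation is three times as common among the dementia patients and passes.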
So you ask a physician to give us a list of breakthroughs of the last five years, and we time travel on the data and say if you had run this algorithm five years ago, what could I have told you? And specifically, what would show up in the top three results of my search engine? Now, if you can predict anything this way, it's a really strong indication you have something. Now I said it was really [indiscernible], they only gave us four things: obesity and colon cancer, diabetes type 2 and sleep apnea, atrial fibrillation and dementia, and increase in [indiscernible]. But out of those four we actually managed to figure out two. So they are much happier and much more willing to cooperate with me right now, and they promised to give me a much longer list. Now, I started by saying this is a really general idea. So I wanted to show you how this exact algorithm, this exact formulation, works in a completely different domain. This is the commerce domain. So the idea is to get a search engine that encourages serendipity. Here's the way to think about it. Suppose I wanted to buy a laundry hamper. If I want a laundry hamper, I need a place to store laundry. So the idea of what the algorithm does is find products that are plausible, in the sense that they solve a similar problem, and that are surprising, in the sense that when you go to Amazon's "people who viewed this" list, nobody in their right mind who is looking for a hamper would consider this other product as an alternative. So I have a search engine that says you don't need a laundry hamper, how about you buy a really big trash can? And I was telling it to a friend of mine and he says he uses a trash can for a hamper and I was so happy. 
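The product-search version of the same two criteria can be sketched as follows. This is an illustration under assumptions only: `USES` stands in for whatever source says what products are used for, `CO_VIEWED` stands in for a "people who viewed this" graph, the overlap measure and the `min_plausibility` threshold are choices made for this sketch, and all product data is made up.

```python
# Hypothetical data: what each product can be used for.
USES = {
    "laundry hamper": {"store laundry", "hold clothes"},
    "large trash can": {"store trash", "store laundry", "hold clothes"},
    "laundry basket": {"store laundry", "hold clothes", "carry laundry"},
    "alarm clock": {"wake up"},
}

# Hypothetical co-view graph: products shoppers already look at together.
CO_VIEWED = {
    "laundry hamper": {"laundry basket"},
}


def plausibility(a, b):
    """Overlap between the uses of two products (Jaccard over their use sets)."""
    uses_a, uses_b = USES[a], USES[b]
    return len(uses_a & uses_b) / len(uses_a | uses_b)


def is_surprising(a, b):
    """True if neither product appears in the other's co-view list."""
    return b not in CO_VIEWED.get(a, set()) and a not in CO_VIEWED.get(b, set())


def serendipitous_alternatives(query, min_plausibility=0.5):
    """Products that solve a similar problem but that shoppers do not already
    consider as alternatives, most plausible first."""
    scored = []
    for other in USES:
        if other == query:
            continue
        score = plausibility(query, other)
        if score >= min_plausibility and is_surprising(query, other):
            scored.append((score, other))
    return [name for _, name in sorted(scored, reverse=True)]


print(serendipitous_alternatives("laundry hamper"))
```

With this toy data the laundry basket is plausible but not surprising, because it is already co-viewed with the hamper, so only the large trash can is returned.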
But anyway, my entire point here is that the algorithm is just the same, just instead of medical notes we use [indiscernible] in order to learn common sense, what things are used for, and instead of MEDLINE we use Amazon's "people who viewed this viewed this" graph, and everything else is just the same. Now, I already tell everybody, I can give you some shopping tips from playing with my algorithm, just in case you want to buy something. So here's some things I learned. First of all, at least in this country, pet products and baby products are surprisingly interchangeable. It just keeps showing up; it's scary. Also I want to look up the [indiscernible] department every now and then. So here I was looking for the thing you put in the bathtub in order not to slip, and cars really have the same thing. And also forget the idea that [indiscernible] for stickers that people put on the [indiscernible] in order not to slip and fall. Which made perfect sense to me. Okay, the Aha project. The point was that medical researchers, I really wanted to give them a tool that would let them discover some promising new ideas. And we formalized surprise and plausibility, and we have this early discovery of some medical breakthroughs. And I didn't quite talk about it, but there are so many applications in other domains. So there is the product search, there is also something called medical [indiscernible], there's [indiscernible] by discover reinforcing learning, there's also -- we did this through Wikipedia, lots and lots of other places this could work out. This seems like a good place to take a step back and try to answer what we had here. So I talked about two projects, Metro Maps and the Aha project. What's the common thread, the underlying common thing? So I guess the obvious answer is this idea of turning lots of data into insight. But there's really more. 
The projects I like the most are the ones that take a really intuitive problem to the finish: you know, what's a coherent story line or what's an insight, and then you formulate it mathematically, optimize it, and then [indiscernible] user study, both to make sure that you're categorizing and to make sure that your system is actually useful for anybody. I really want to build those things to be useful. Now in order for this to work [indiscernible] borrow ideas from data mining and machine learning and information retrieval, lots of algorithms, especially optimization and graph algorithms, and some [indiscernible] visualization. Okay, this is what comes in. What goes out is I try to apply this in as many domains as I can. And today I talked about medicine and science and news and legal documents and commerce and literature. So this is what I do. Let's talk about what I want to do next. So first of all, I am the insight person. It's super [indiscernible] and I am really excited about this idea of building a set of tools that would help anyone, okay. So scientists, and really anyone, just plug in their data and see what's the most insightful thing that comes out. And really to enable some new discovers reinforcing learning. And again, I wanted to find the importance of finding [indiscernible] that generalizes across domains. I really think it makes a technique much stronger. So I talked about several applications that I already tried, and let me briefly tell you about the collaborations I've been most excited about recently. So first of all, I gave a talk at the Stanford computational social science conference. And there's been all this, you know, social scientists and political scientists came to me afterwards carrying awesome, awesome buckets of data. So there's Congress notes and crime data, lots of really beautiful data sets. Life sciences, I really want to apply this to biology. 
I actually don't know how, but if you know anybody who might be interested, you know where to find me. And there's a history professor who really wanted to apply this idea of Metro Maps to telegrams. Personal data is something I've been dying to do for a really long time, because of search and browse history. Because suppose you're trying to plant a tree and you find yourself an hour later with 75 tabs open [indiscernible] and wait, what just happened? So really I want to organize your browsing history into some structure. Okay, there's been some interest from corporations about ordering corporate data, financial data. This idea of insight for investigative journalism. And the last thing on this list really surprised me. I never saw this working for anything that's not text based, but recently some researcher that [indiscernible] applied my algorithm to summarize a video in a sequence of images. Yeah. [Indiscernible]. So anyway, apparently this thing also works for images. Again, it really surprised me. Okay, so this was short-term. Long-term, like I said, I really like this idea of taking fuzzy things and formalizing them. And out of those my favorite by far is this thing here in yellow, this creativity. And I know it sounds kind of megalomaniac, so I'll just tell you briefly about one thing I started doing this semester about it. So just two slides. So suppose you own a company, you own a product. You went to college and you want to change your product in order to extend your business. So what do you do? And we found this thing called this [indiscernible] model that's basically a set of questions you should be asking yourself: what can you combine your product with, how can you put it to another use, how can you reverse its functionality. My favorite example is this company that used to make water pressure shower heads and now they're making water pressure dental floss, which I think is completely brilliant. 
But anyway, my point is I built a prototype system using the same ideas of Concept Map in order to answer those questions, and Amazon to filter out the obvious things. Now I have a search engine where you type in something like alarm clock and it just spits out suggestions like combining it with a coffee machine. You know, you wake up to a fresh cup of coffee, or combining it with a dimmer so the room becomes lighter and lighter as you wake up, or maybe make a silent alarm clock, something to [indiscernible], either for deaf people or if you don't want to wake somebody else up. It just keeps spitting out suggestions. Do you know SkyMall, the thing you get on airplanes? [Indiscernible] right now, but hopefully it goes that way and gets better soon. Okay, good. So breathe in, breathe out. The [indiscernible] we have plenty of data and this is excellent, because data can help us understand the universe and make better decisions. But it's not enough to collect this data or to even search through this data. We really have to make sense of this data. We have to [indiscernible] the structure, like in Metro Maps, and we have to eventually discover unknown connections, like the [indiscernible]. And we had user studies and we have early discovery to validate our ideas, and if there's one thing I want you to take from this talk, it's this image right here. We really have to go beyond just searching our data. That's about it. Thanks. Eyal Lubetzky: Any questions? >>: I have a question. Because you were saying that the people who looked at the Metro Maps didn't care so much about the relations that might hold between the nodes in the map. Is that because you are drawing the map at the document level? Dafna Shahaf: It might actually be. I'm not entirely sure why they're doing this. But it's pretty clear from the data. I know we showed them two things that were related and they couldn't just like at the beginning or end. 
They did say something was [indiscernible], they just kept like bouncing on and off. But other than this, they didn't seem to care much. And I'm not sure why, actually. >>: When you do medical discoveries, it probably will be more important. Dafna Shahaf: Then, I completely agree. >>: And I wonder if you move this to the sub-document level -- Dafna Shahaf: If I really wanted -- you know, the place I did at the beginning, they have a node as an argument, and I just don't know how to get this, how to abstract. But hey, we're going to have a meeting and talk about it; right? Excellent. How does it work? People are watching it online, do they have any way of asking questions? Eyal Lubetzky: They can run downstairs. Dafna Shahaf: Hold the bus. Eyal Lubetzky: As they go busting through the door. >>: I have a question. So when you're measuring the number of active words [inaudible], how did you set the threshold? Because the more words you have the less effective it becomes, but the -- Dafna Shahaf: Yeah, the more words I have I can actually go back to this circus behavior, because each transition is going to have its own budget. >>: Yes, so how do you set the threshold? Dafna Shahaf: I just tried it on a couple of stories that were not useful to use as studies. And we just [indiscernible] this. Again, eventually I would love to learn this. But it's just tweaking parameters. >>: Do you do some experimenting with -- I mean, what you did was select a threshold and then treat all these words equally, just sum over the active words and -- Dafna Shahaf: So for every active word, we summed over the influences of this word over the transitions, it's not just -- >>: But then you picked like three or five, which was your [indiscernible]. Dafna Shahaf: Yeah. >>: As opposed to just -- Dafna Shahaf: Yeah, we didn't let them -- I mean it's an algorithm, it's not rounding. It would still work. Eyal Lubetzky: Any other questions? 
>>: If we have a lot of data and -- how can we use your project [indiscernible]? Dafna Shahaf: I would say email my student and bug him. But you should probably email me, so I'll bug him. But, yeah, he was supposed to finish it like a couple of weeks ago. You know, it's probably going to take him a couple more. But it's almost ready. Now I'm curious about what questions you have. >>: Actually, I think we have a lot of data, we're just trying to find a way of using the data and finding the insights. And now we're still at the level of [indiscernible]. Yeah, we want to move forward and this is very interesting. Dafna Shahaf: Bring it up. I like your drive. >>: Have you published the work on the medical discovery? Dafna Shahaf: So we really want it to be a Nature paper, which has been taking me longer than any other paper in my life. If you want I can send you a rough draft. >>: So which hospital were you collaborating with that you could get the -- Dafna Shahaf: I'm pretty sure it's the Stanford Hospital. It's the Stanford medical group. It's either this or the Palo Alto Hospital. >>: Because that's an amazing data set. Usually you have no -- like I've worked with data sets with at most 500 patients, and that takes like two years to get access to. So that's an amazing data set. Dafna Shahaf: The Stanford medical team has been awesome about this. >>: Yes. Eyal Lubetzky: Okay. Let's thank Dafna again. [Applause]