Eyal Lubetzky: Good morning, everyone. It's a great pleasure to have Dafna Shahaf here from
Carnegie Mellon and Stanford. She's going to talk about "the Aha moment, from data to insight."
Dafna Shahaf: Thank you. Okay, first of all can you hear me? Excellent.
So I'm Dafna Shahaf and like Eyal said I'm going to be talking about my recent project that I'm
working on.
So what does it mean? The thing is there was a time not so long ago when getting your hands on good
data was really, really difficult, like this poor interviewer here that had to ride a horse across a country
and ask farmers how many cows they had. And of course today it's a completely different story. Cows,
they tell you where they are and what they've been up to today. So getting your hands on data is no
longer much of a problem. We have more data than we know what to do with. And this is actually
great news for everybody, because large-scale data has the potential to transform almost every aspect of
our life, from science to business to sport to public health. And it could really address many of society's
most pressing problems.
The thing is, for this potential to be realized, it's not enough to collect this data, to acquire this data, or
even to search through this data. You have to actually understand it. You have to make sense of it.
You really have to turn the heaps of data into insight.
And this is where I come in. My goal is to develop computational approaches for this [indiscernible]
of turning data into insight. These are just some of the questions that I'm most curious about. For
example, what is insight? Or given lots of data, how do you help [indiscernible] the structure, or
the most interesting bits and pieces? And also, how do you use these ideas to build a system that will
facilitate discoveries?
I'm going to give you an example. This is going to be in the context of news. So here's the scenario.
Suppose you try and listen to a really complex news story, [indiscernible] or a presidential debate. So
what do you do? Well, most of us would go to a search engine; right? We love search engines. Thing
is search engines are great at giving those nuggets of knowledge, but they won't show you how the 57
million results fit together. So there's absolutely no structure.
Now I'm being slightly unfair here, because there has been a lot of work on incorporating structure into
search results, including work that came from here, like NewsJunkie. But most of it boils down to a
timeline. Like this is the Greek debt crisis. And I'm going to claim that this kind of summarization really
only works for a simple story that is linear by nature. And real-life stories are nothing like linear. If I had
to draw a picture of the Greek debt crisis, it's not a single line, it's more like this.
So what do you do with these spaghetti-type stories? Let me show you the Holy Grail. How many of you
have seen [indiscernible] before? Okay, that's kind of what I expected. So this is an issue map. It's a
set of seven posters about the great debate about whether computers can think. Let me just zoom in.
It's a graph where each node is an argument, like machines can't have emotions. And this is saying
that that argument is supported by that argument, which is disputed by the other argument. So you're
supposed to look at this and get the big picture.
Now, I stumbled upon those issue maps at Carnegie Mellon, and it's the type of stuff that I absolutely
love, one of my favorite topics. So I stood there for an hour and read through the whole thing. And
finally it made sense in my brain; finally everything clicked together. So I immediately started jumping
up and down, like, okay, great, so how did they make those beautiful creatures? And it took them 20
[indiscernible] years, and they did pretty much everything manually. So my question is, great, how do
you build those things mathematically? [indiscernible]
So [indiscernible] are complex creatures. I'm going to start simpler. And the system I propose is called
Metro Maps. The input is a set of documents. You can think of them as the result of a query,
documents about the Greek debt crisis. And the output is a metro map, where a map is a set of lines. So
each line is a sequence of articles. And each line follows a coherent narrative thread. So this one tells
you how the Greek bonds are junk and how they have to get to 30 points to get a bailout. And you have
different lines that focus on different aspects. You might have another one about the strikes and riots
triggered by [indiscernible] and another one about Germany. So you're supposed to look at it and get
really both the temporal dynamics and the structure and how they relate to each other.
So how do you go about finding those maps? So anyway, finding a map is a really hard problem,
because it's very intuitive. If I show you a good map, you know it's a good map, but why? So what I
do is start on the intuitive level: what makes a good map? For example, each [indiscernible] is coherent.
But coherent is a really fuzzy term, so how do you formulate it mathematically? How do you come up
with an objective function, something computers can play with? And once you have an objective
function you're happy with, everything interesting I do tends to be [indiscernible] hard, so how do you
optimize it, how do you come up with an algorithm with some guarantees?
[Indiscernible] intuitive level, what are the properties of a good map, formalize them, and find a way to
optimize the objective.
Okay, so I start. Take a couple of seconds and just think about it. What makes a good map? Well, I
guess I gave you the first thing already; right? It's coherence. So each line follows a coherent
[indiscernible]. But what does that mean? So we had an entire paper on this question, in KDD 2010.
And the question was, given a chain of articles, how do you measure the coherence of the chain? And I
asked this question about coherence to a whole bunch of people and they always came up with the same
answer. Oh, it's a really easy question, just make sure that you have strong transitions. Okay, document 1
is similar to document 2, document 2 is similar to document 3, 3 to 4, and you're good to go. And the
entire point of this paper is that strong transitions are not enough. Coherence is not a property of local
interactions between neighboring articles along the chain.
Let me show you what I mean. So a bar here means that the word on the left appears in the article
above it. So this is an article about the Greek debt crisis. Now, suppose you want to be a [indiscernible].
So first thing you do is, you know, you try to find a second article linking up to the first one. And
you might come up with this, what the Republicans think about the debt crisis. Now, if you completely
forget about the first article and you try to find a third one similar to number 2, you might come up with
this one, what the pope thinks about Republicans. And it is going to keep drifting farther and farther
away. And you see this circus behavior? It really means that you get a stream of consciousness here:
each transition is strong, but for a completely different reason than the other transitions.
So the overall effect is incoherent.
Okay, let's try again. Same first document, Greek debt crisis. Same second document, what Republicans
think about the debt crisis. But this time you know where you came from and you know the
[indiscernible] are not the main point. So you keep finding things that are on topic. And you can see
the overall behavior is much smoother and nicer; you don't see the circus behavior. And most
importantly, there's a small number of words that can capture the entire story. So let's try formulating it.
First thing we want for a chain to be coherent is we want each transition to be strong. So this means we
need to define the score for a transition. First thing we tried, super simple: the score of a transition
between documents D_i and D_i+1 is the number of shared words. Okay, just sum over the words; a word
scores one point if it's shared. Super simple.
This was way, way too coarse. Because first of all, some words are more important than others. And
also words are noisy features, so you might have an article about judge and jury but not lawyers. So
it's really implicitly there. So I really had to replace this indicator function with a soft notion that we
call the influence of word W on the transition between documents D_i and D_i+1. Intuitively this influence
is high if both documents are related and W plays a role in what makes them related, even if it
does not appear in either of them. I'm not going to go deep into how we did this, but just to give you
the flavor: we made a bipartite graph between documents and words and we looked at
random walks between D_i and D_i+1. And then we looked at the same thing when you're not allowed
to go through word W, and saw how word W affects those random walks.
>>: Can I ask you a small question?
Dafna Shahaf: Yeah, sure.
>>: How are these documents selected?
Dafna Shahaf: Think of the query. You have the query Greek debt crisis, just big document -- oh, you
mean the document of the chain?
>>: Yes.
Dafna Shahaf: So at this point I'm asking about given an entire chain how do you measure the
coherence. Then we're going to talk about how to find actually a good chain.
>>: Okay.
Dafna Shahaf: Okay, so this is the score for a single transition. Since you want all of them to be strong,
you really want the weakest link to be strong.
Yeah?
>>: Could you just -- you said you didn't want to go into too many details. But so one side of the
bipartite graph was --
Dafna Shahaf: Words and documents.
>>: Documents.
Dafna Shahaf: So an edge means that the word appears in the document and you can weight it by
[indiscernible].
>>: How do you do the walk?
Dafna Shahaf: Okay, we should probably talk about this offline, but the idea is a random walk with
random [indiscernible] that's relatively high, so it's short walks, you don't get too far away. And then
you look at random walks from D_i, just back and forth to the words, to D_i+1, and then you do
the same thing with W becoming a sink node, so the walk gets trapped until the next restart. Okay,
we'll talk about it later.
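The random-walk idea in this exchange can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the method from the paper: the helper names (`build_bipartite_graph`, `walk_distribution`, `influence`), the restart probability of 0.3, the unweighted edges, and the choice to measure influence as the drop in probability mass reaching the next document are all illustrative.

```python
def build_bipartite_graph(docs):
    """Adjacency lists for a graph with documents on one side and words on the other."""
    graph = {}
    for doc, words in docs.items():
        for word in words:
            graph.setdefault(("doc", doc), []).append(("word", word))
            graph.setdefault(("word", word), []).append(("doc", doc))
    return graph


def walk_distribution(graph, start, restart=0.3, sink=None, iterations=200):
    """Long-run visit probabilities of a random walk with restart from `start`.

    If `sink` is given, a walker that reaches it stays there until the next restart.
    The restart value is an assumption chosen for illustration.
    """
    probs = {node: 0.0 for node in graph}
    probs[start] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in graph}
        nxt[start] += restart  # restarting walkers jump back to the start
        for node, mass in probs.items():
            moving = (1.0 - restart) * mass
            if node == sink:
                nxt[node] += moving  # trapped walkers stay put
            else:
                share = moving / len(graph[node])
                for neighbor in graph[node]:
                    nxt[neighbor] += share
        probs = nxt
    return probs


def influence(graph, doc_a, doc_b, word):
    """Drop in mass reaching doc_b from doc_a when `word` becomes a sink."""
    start, target = ("doc", doc_a), ("doc", doc_b)
    free = walk_distribution(graph, start)[target]
    blocked = walk_distribution(graph, start, sink=("word", word))[target]
    return free - blocked


docs = {
    "d1": {"greece", "debt"},
    "d2": {"debt", "republicans"},
}
graph = build_bipartite_graph(docs)
debt_influence = influence(graph, "d1", "d2", "debt")
greece_influence = influence(graph, "d1", "d2", "greece")
print(debt_influence, greece_influence)
```

In this toy graph every route from d1 to d2 passes through "debt", so blocking it removes all of the mass that reaches d2, while blocking "greece" removes only part of it.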
Dafna Shahaf: So you want the weakest link to be as strong as possible. This means that all of the
transitions are strong.
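The weakest-link rule can be sketched as follows, using the simple shared-word count as the transition score rather than the softer influence score. The function names and the toy chains are assumptions made for illustration.

```python
def transition_score(doc_a, doc_b):
    """Number of distinct words the two documents share."""
    return len(set(doc_a) & set(doc_b))


def weakest_link_coherence(chain):
    """Score of the weakest transition in a chain of documents."""
    if len(chain) < 2:
        raise ValueError("a chain needs at least two documents")
    return min(transition_score(a, b) for a, b in zip(chain, chain[1:]))


# Every consecutive pair shares at least one word.
chain = [
    {"greece", "debt", "crisis"},
    {"debt", "crisis", "republicans"},
    {"republicans", "pope"},
]
# The last transition shares nothing, so the whole chain scores zero.
broken = [
    {"greece", "debt", "crisis"},
    {"debt", "crisis", "republicans"},
    {"pope", "vatican"},
]
print(weakest_link_coherence(chain), weakest_link_coherence(broken))
```

Note that the drifting chain still gets a positive score here, which is exactly why strong transitions alone are not enough.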
So now I have that all transitions are strong, but I just told you that there has to be a global theme, so
you can tell it apart from a stream of consciousness; there needs to be a small number of words that
capture the entire story; right? So we turned this into an optimization problem: actually find the small
number of [indiscernible] that captures the entire story.
So how about I give you a budget? You're only allowed three segments, okay? And I'm going to
pretend that these are the only three words that appear in the documents. And it's scored just like
before, but only using those words. So if you go through the segment, there is nothing between
document 1 and document 2; right, so your [indiscernible] is zero. But if you pick [indiscernible], your
weakest link is much, much better. This is the extent of all the words, [indiscernible] all the words you
call active. And the problem becomes just an optimization over activation patterns. So we look at a
way to choose those active words subject to constraints that are trying to mimic this behavior of
coherent chains.
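The budget idea can be illustrated with a brute-force sketch. It assumes a single global set of active words and simply tries every set of the allowed size; the talk's actual constraints on activation patterns and its LP formulation are not reproduced here, and `budgeted_coherence` is a hypothetical name.

```python
from itertools import combinations


def budgeted_coherence(chain, budget):
    """Best weakest-link score when only `budget` words may be active.

    A transition scores one point per active word shared by both documents.
    Returns (score, active_words).
    """
    vocabulary = sorted(set().union(*chain))
    size = min(budget, len(vocabulary))
    best_score, best_words = -1, ()
    for active in combinations(vocabulary, size):
        active_set = set(active)
        score = min(
            len(active_set & set(a) & set(b)) for a, b in zip(chain, chain[1:])
        )
        if score > best_score:
            best_score, best_words = score, active
    return best_score, best_words


drifting = [
    {"greece", "debt", "crisis"},
    {"debt", "crisis", "republicans"},
    {"republicans", "pope"},
    {"pope", "vatican"},
]
on_topic = [
    {"greece", "debt", "crisis"},
    {"debt", "crisis", "republicans"},
    {"debt", "crisis", "europe"},
    {"debt", "crisis", "germany"},
]
print(budgeted_coherence(drifting, 2))
print(budgeted_coherence(on_topic, 2))
```

With a budget of two words the drifting chain cannot keep every transition alive, while the on-topic chain is fully explained by "crisis" and "debt".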
Okay? Breathe in, breathe out.
This is our notion of coherence. And the way we solved it was using an LP and a rounding algorithm.
>>: So the documents are ordered and there's --
Dafna Shahaf: By chronology.
>>: Okay. So how do you get the word --
Dafna Shahaf: Time stamps.
>>: -- it's a whole big thing.
Dafna Shahaf: No, no, it's news articles, they all have time stamps.
>>: But these stories came out at different times. And you would have a news story that appears later
and then describes earlier events.
Dafna Shahaf: Yes. I was wondering about it. Later I'm going to talk about this for books and movies
and I was wondering if I could do --
>>: Also the most coherent. Maybe they're using the word differently than you are.
Dafna Shahaf: So as far as I'm concerned if the news story, even if it came out later and talks about
something earlier and still follows this behavior, then people will still get something from reading it in
order.
>>: So you need to determine when the events took place in the article --
Dafna Shahaf: No, I'm just using the time stamp. It seems to be okay. I was worried about this, you
know, like in the books where they have flashbacks, but this would be good even if the articles were
not entirely -- if the time stamps were not completely matching the time of the event. Okay? That's an
assumption. We can argue about it.
>>: What's the definition of maxed activations and --
Dafna Shahaf: Which words do you want to pick to be active? Which segment do you pick?
>>: [Indiscernible].
Dafna Shahaf: It appears here that basically you have a budget of how many you're allowed to do.
Okay, good.
>>: So it's got a maximum of the whole subsets, different size.
>>: I may be stretching a bit, but the Greek debt crisis was written up in English, in German, in Greek,
and maybe even in Turkish. And how do you review these documents? In other words, not only in
other languages, life insurance salesman will tell you the last survivor and you will understand second
to die and say the [indiscernible].
Dafna Shahaf: Yes, language: we're only focusing on English right now, exactly because of this. Now
I was worried about using words. A part of the goal of this project was to see how far you can push
really stupid features, named entities, noun phrases, and see if I actually have problems and see if I
need to use more sophisticated optimals. It didn't seem to be a problem. Not yet.
Good. Coherence. So we have a first property, coherence. And at this time you should be asking
yourself, great, are we done? Can we just take the top three coherent chains and call it a day? So let me
show you what happens when you actually try to do this. So your query is Greek debt crisis, and these are
the top three coherent chains. Great, so there's one about Asian markets and two about strikes and riots.
So what's wrong? Well, two things are wrong. First of all, I have a budget of three lines, and Asian
markets are not really the most important thing that was going on; right? There are so many more
important things, like what Germany was doing.
Second, the two bottom lines are redundant; there's really no reason for the map to include both of them.
So the challenge seems to be balancing coherence with what I call coverage. The map should be about
diverse topics that are important to the user. Okay, so what does that mean?
So let's formalize it. Coverage. First of all, I keep calling it coverage, so let's talk about the elements
that I'm trying to cover. You can think of them as words, like Obama and China. So we have a
[indiscernible] that each document covers each word. And that can be based on [indiscernible]. One
more thing that we have is a weight for each word, for how much we care about covering it. And if you
don't know anything, this can be based on frequency: everybody's talking about Obama, so he might be
important. But this is a perfect place to plug in personalization, if you do know something about the
user, like they don't care about politics but they love sports.
And what I said on the earlier slide was that a high-coverage map should be about important things and
should encourage diversity. And diversity just means a more [indiscernible]. I have this intuitive
notion of diminishing returns. So I'm going to show you how this works.
So this is our [indiscernible]. This is how the [indiscernible] with frequency. And you have the
documents covering words fractionally. This blue document is largely about Obama and
Washington, and these are important words, so you get lots of [indiscernible] coverage. And then if you
pick another document, it completely saturates Obama in New York. So at this point if you pick
another document about Obama, you don't get [indiscernible] coverage. So the objective pushes
you to pick another word that's important and hasn't been covered yet. Okay?
So each document of the map flips a coin with this [indiscernible]. If all of the map's documents
try doing this independently, this is the probability that at least one of them succeeds. Then you take
this, which is kind of how much the entire map covers each word, and you sum over the words, weighing
them by how much you want to cover each word. So it's a sum over the words of how much I care
about this word, times how much the map covered it. And this is our notion of coverage.
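The coin-flip description corresponds to a short computation. The sketch below assumes per-document coverage values between 0 and 1 and word weights that are simply made up for the example; the function names and the numbers are illustrative only.

```python
from math import prod


def word_coverage(map_docs, word, cover):
    """Probability that at least one document in the map covers `word`."""
    return 1.0 - prod(1.0 - cover[doc].get(word, 0.0) for doc in map_docs)


def map_coverage(map_docs, cover, weights):
    """Weighted sum over words of how well the whole map covers each word."""
    return sum(
        weight * word_coverage(map_docs, word, cover)
        for word, weight in weights.items()
    )


# Hypothetical values: how strongly each document covers each word.
cover = {
    "doc_a": {"obama": 0.8, "washington": 0.5},
    "doc_b": {"obama": 0.9},
    "doc_c": {"china": 0.7},
}
# Hypothetical weights: how much we care about covering each word.
weights = {"obama": 0.5, "washington": 0.2, "china": 0.3}

base = map_coverage(["doc_a"], cover, weights)
gain_b = map_coverage(["doc_a", "doc_b"], cover, weights) - base
gain_c = map_coverage(["doc_a", "doc_c"], cover, weights) - base
print(base, gain_b, gain_c)
```

Once "obama" is mostly covered by the first document, a second Obama-heavy document adds little, while a document about an uncovered word adds more.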
Okay, now I have two things, coherence and coverage. And how do they play together? Now
hopefully I convinced you with Asian markets, for example, that coherence is not necessarily -- you don't
want the most coherent chains in your map. It's more like a constraint. A chain is either coherent
enough for you, or it's not. And coverage is the thing that you're really after. So the problem becomes:
find a coherent map that achieves the maximum possible coverage.
Okay, now I'm going to show you what happens when I try to optimize this. You get this map. Okay,
same Greek debt crisis; one line about strikes, one line about Germany, one line about the IMF, the
[indiscernible], they're important, they're diverse. What's wrong? Come on, you're all thinking it.
It's not a map, it's a set of disconnected lines. And it's especially frustrating when you see this thing
at the bottom about Germany and the IMF, and it should have been connected to the red line.
So the last thing I added was connectivity. These two lines are related and I want the map to reflect this.
Okay, good. So there are many, many ways to formalize connectivity. We ran a user study to see
what people care about. And people got really upset when two lines were related but not
connected. But they didn't seem to care much about how they were connected: at one point, multiple
points, beginning, end. So I started with a really simple objective, trying to encourage connections.
So when two lines intersect, you score a point. Super simple, just [indiscernible] if they intersect, you
score a point.
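This first connectivity objective is simple enough to write down directly. The sketch assumes each line is a list of article identifiers and that two lines intersect when they share at least one article; the identifiers are invented for the example.

```python
from itertools import combinations


def connectivity(lines):
    """Number of pairs of lines that share at least one article."""
    return sum(1 for a, b in combinations(lines, 2) if set(a) & set(b))


lines = [
    ["a1", "a2", "a3"],  # shares a3 with the second line
    ["a3", "a4"],
    ["a5", "a6"],        # disconnected from the others
]
print(connectivity(lines))
```

A pair scores one point no matter how many articles it shares, matching the observation that people cared whether lines were connected, not how.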
Now, later on we went with a more complex objective, but this is our first shot. So you have three
things now: coherence, which is a [indiscernible] that we solved with LP rounding; coverage, which is a
[indiscernible] function; and connectivity, which is a super simple thing to just encourage
intersections.
Okay, good. So how do they play together? Again, like before, coherence is still a
constraint. And now coverage and connectivity, you know, ideally I'd like to optimize both. But if I
show you a map that is super well connected about something you couldn't care less about, then you
probably still couldn't care less about it. So coverage is really the primary objective, and connectivity
is secondary. So the problem becomes: consider all coherent maps that achieve the maximum possible
coverage, and out of those find the one that's most connected. And this here is like a lexicographic
optimization, so the first objective is infinitely more important than the second.
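The "first objective wins, second breaks ties" rule can be sketched with tuple comparison. The candidate maps and their scores below are hypothetical, and all candidates are assumed to have already passed the coherence constraint.

```python
from typing import NamedTuple


class ScoredMap(NamedTuple):
    name: str
    coverage: float
    connectivity: int


def pick_map(candidates):
    """Highest coverage first; connectivity only breaks ties in coverage."""
    return max(candidates, key=lambda m: (m.coverage, m.connectivity))


candidates = [
    ScoredMap("disconnected", coverage=0.9, connectivity=0),
    ScoredMap("connected", coverage=0.9, connectivity=2),
    ScoredMap("well_connected_but_irrelevant", coverage=0.5, connectivity=10),
]
best = pick_map(candidates)
print(best.name)
```

With real floating-point coverage scores, exact ties are rare, so a practical version would likely compare coverage up to a small tolerance.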
Okay, good, let me just give you a brief overview of how we optimize this objective. So our input is a
set of documents. Again you can think of them as the result of a query. Now, I said that coherence is a
constraint. Ideally I'd like to enumerate all possible coherent chains that can serve as metro lines. But it's
clearly infeasible. So what I do instead is I encode all coherent chains in a structure that I call the
coherence graph. And the idea is that each node of this graph is a short coherent chain. And an
edge means you can concatenate them and remain coherent. And it's a transitive property, so paths in this
graph correspond to longer and longer coherent chains. So we really encode all coherent chains as
paths in this graph.
Next thing I want to do is pick paths from this graph, which again correspond to coherent
chains, so that the underlying documents maximize coverage. I want to pick a high-coverage map
that's coherent.
So let's talk for a second about finding a high-coverage path in this graph. If you think about it, it's
just: find me a path of some length L that maximizes the coverage of the underlying articles, which is
just a case of the more general problem of finding paths of [indiscernible] maximizing some function of
the nodes visited. Which is luckily a well-studied problem called [indiscernible]. Many smart people
have paid attention to it before. So we use the algorithms of [indiscernible] and paths; since our
coverage function is submodular, we can use a submodular [indiscernible] algorithm. And it's a very nice
algorithm with an approximation guarantee.
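To show why a diminishing-returns coverage function is convenient to optimize, here is a sketch of the plain greedy heuristic for picking a fixed number of documents. This is a deliberately simpler, swapped-in technique: it ignores the path structure of the coherence graph and is not the path-based algorithm the talk refers to. The data and names are hypothetical.

```python
from math import prod


def coverage(selected, cover, weights):
    """Weighted probabilistic coverage of the selected documents."""
    return sum(
        weight * (1.0 - prod(1.0 - cover[doc].get(word, 0.0) for doc in selected))
        for word, weight in weights.items()
    )


def greedy_select(cover, weights, budget):
    """Repeatedly add the document with the largest marginal coverage gain."""
    selected = []
    remaining = sorted(cover)
    for _ in range(min(budget, len(remaining))):
        base = coverage(selected, cover, weights)
        best = max(
            remaining,
            key=lambda doc: coverage(selected + [doc], cover, weights) - base,
        )
        selected.append(best)
        remaining.remove(best)
    return selected


cover = {
    "doc_a": {"obama": 0.8, "washington": 0.5},
    "doc_b": {"obama": 0.9},
    "doc_c": {"china": 0.7},
}
weights = {"obama": 0.5, "washington": 0.2, "china": 0.3}
chosen = greedy_select(cover, weights, budget=2)
print(chosen)
```

The greedy pass skips the second Obama-heavy document in favor of the one about China, because its marginal gain is larger once Obama is already covered.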
Okay, good. So just going back: we have the set of documents, we encode them as paths in a
graph, and then we [indiscernible] in order to find a set of paths that are high coverage and coherent.
The last thing we do, we have a local search step that tries to increase the connectivity without
sacrificing coverage. So let me just show an example of the output of this algorithm. So this is a very
simplified map of the Greek debt crisis; now that's the real thing. So we have one line about how Greeks
struggle to stay afloat and they need help, but is it enough. Another line about the strikes and riots.
Another line about Germany. And a tiny line about the IMF. At this point you should all be staring at
this and saying, okay, it's a nice picture, but is it any good? Like how do you even [indiscernible] those
things? And evaluating maps is challenging, because you don't have ground truth, we don't have a gold
standard. And you can use all those surrogate methods from machine learning and data mining. But what
I'm going to argue is that for what I do, user studies are crucial. First of all, because I want to make
sure that we capture those intuitive notions we started from, you know, coherence, and also to make sure
that what we're building is useful for somebody. I really want those maps to be useful.
So I want to talk about the user study. So the question of the study was, well, can maps help news
readers understand news events better than the state of the art? The data is New York Times articles,
2008 to 2010. And I picked three queries: the miners trapped in Chile, the earthquake in Haiti, and the
Greek debt crisis. And the question is again, can maps help people understand those stories better?
So the first thing we tried was really simple question answering: we gave people ten questions,
like how many miners were trapped, and we measured how well they answered and how long it took
them using either maps, Google News, or something called topic detection and tracking. And we
had roughly 340 users. And we were doing better than the competitors, but nothing to write home about,
nothing major. And I was talking to people, and they said, you know, if I wanted to learn the name of the
Greek Prime Minister, I would Google it, I would go to Wikipedia. This is complete overkill. And
they're absolutely right. So I think what we learned is maps [indiscernible] small thing about those
Ctrl-F type of searches to really show the big picture. So we needed to design another study to
see whether maps help people understand the big picture.
Now, how do you know if somebody understands the big picture? So, okay, if you've taught a class you
know what I'm talking about. You only understand something if you can explain it to somebody
else. That's what I think. So what we did was we asked people to look at the maps or look at Google
News and write one paragraph explaining the story to somebody that has absolutely no idea. Okay,
then we took those paragraphs, we put them on Mechanical Turk, [indiscernible] double blind, and asked
people which paragraph does a better job explaining the story. So we had 15 paragraph writers,
roughly 300 [indiscernible]. This went much, much better. So the Greek debt crisis had 72
percent of the people preferring paragraphs generated by map users. Now, it didn't do as great for
Haiti, where it was under 60 percent. And I was curious about why. And I went looking at the actual
paragraphs. And it turns out that Haiti had one major story, earthquake and damages and
[indiscernible], and a couple of tiny lines about, what was it, [indiscernible] running for the presidency or
missionaries accused of kidnapping children, something like this. And everybody I asked to summarize
this map just focused on the one major story line.
So I think that the bottom line is that maps are most useful as high-level summaries for
stories that don't have a single dominant story line. Yes?
>>: So the people doing the evaluating, do they know anything about the topic beyond what the
random person knows? Do they already know about the topic or do they --
Dafna Shahaf: So this is Mechanical Turk with, like, slight quality control, and they have to know
English and they have to --
>>: How about someone who -- [indiscernible] --
Dafna Shahaf: Yeah.
>>: -- one could argue that that's a flaw. Because it's just a paragraph that is appealing and appears to
be helpful to a non-expert.
Dafna Shahaf: Yes, one can totally argue this. And I talked about this a little bit in the paper. But
it seems like a reasonable baseline.
So I tried to convince you that maps are good for news. And my goal for the next few years is to show
that maps are not just about news. They turn out to be really, really easy to adapt to other domains,
because the main principles, you know, coverage, coherence, connectivity, stay just the same. But you
might be able to use domain knowledge [indiscernible] smarter objective.
Now I'm going to talk about three examples: science, legal documents, and books. So I'll start with
science. So the goal of this project was to see if maps could help somebody understand the state of
the art of some field. For example, what [indiscernible]. And the data we had was ACM papers, and we
needed to do some slight modifications to the objective, mostly taking advantage of the citation graph,
but other than that, the algorithm stayed exactly the same.
So I'll show you an example. This is about reinforcement learning. I don't expect you to read it. I will
just walk you through it. So there's one line about the multi-agent setting, one line about the MDP
[indiscernible], one line about controlling robotic arms, one line about [indiscernible] as an expression
in notation, and a line about [indiscernible]. You see those are actually disconnected, but there are those
funny dashed gray lines between them. That's because in the scientific domain you might have a line
about theory and a line about application, and there's really not a single article that would fit them both.
There's no direct intersection. But those lines are clearly related; there are also citations going on
between them. So in the scientific [indiscernible] we allow for indirect connectivity. So if two lines
have lots of impact going on between them, then we count that as well. So we start to see stuff like how
the [indiscernible] had impact on the [indiscernible] or how the MDP [indiscernible] line had impact both
on the robotic arms line and on the multi-agent line.
Okay, I'm going to talk about the user study, because this is where the fun was actually. So the
question was, can maps help somebody, a first-year grad student, understand the state of the art of some
field better than [indiscernible]? So what we did, we brought people into my office and we told them,
pretend to be a first-year grad student who is kind of embarking on the first year learning project. You
know, you go to the professor, you're all excited, you want him to teach you everything he knows about
reinforcement learning. And the professor gives you a survey paper. The only problem is, this paper is
from 1996.
So the goal is really to update this survey paper, to find the more recent [indiscernible] than the relevant
papers. And they could use either Google Scholar, or Maps and Google Scholar. Okay? So I had 30
participants. We basically combined all papers into one long list and we had an expert judging
precision, to say which papers were relevant, and some topic recall. So we composed a list of the top
ten sub-areas of reinforcement learning in the last years and we wanted to see how many of those areas
they managed to find. This is just in a nutshell. And the result: on average the map users had 10
percent more of their papers than [indiscernible], and of those top 10, they managed to find on average
almost three more. So that made us very happy.
One more thing I want to do in order to convince you that maps are a good idea for science is that we
made a map for our own related work at some point. So this is us, connecting the dots to Metro Maps.
And again I don't want you to read it, just see what's around us.
So lots focus on summarization, especially news. Lots of [indiscernible] narrative, some on
coverage notions. This pink line is about visualization, like Concept Maps and Mind Maps, and this
[indiscernible] line is about mapping science.
Good, so let's do legal documents. So there is a company in town that came knocking on our door one
day. They do a search engine for lawyers and they wanted to know if maps could help lawyers argue a
case. So they gave us some [indiscernible] decisions. If you have seen some [indiscernible] decisions
before, you might know the problem is that they're insanely long. They can be hundreds of pages. So
my [indiscernible] idea was completely not finding signal in those. [Indiscernible] so what turned out
to be working was that when you cite another case, you have the same [indiscernible] cite them. So,
you know, in blah versus blah the defendant [indiscernible] applied here. So if you just use this anchor
text, you can pinpoint the most interesting parts of the document, and then everything else will just fall
beautifully into place.
I want to show you an example. This is one map that we computed for them, for the Commerce Clause.
You can see, for example, this purplish line about who can [indiscernible] the community. And if you
work for a state-owned company, can you sue them? Okay, great, but how about in federal court? Great,
does this section apply or not?
So we basically showed this map to the lawyers to get a reality check. And they first of all said it made
perfect sense; they were even nice enough to label each line for us. And then we went ahead and
computed the words that made each line coherent from our point of view. So you can see, for example,
the third one: the lawyers said that the 11th Amendment states [indiscernible]. We said immunity,
sovereign, amendment, eleventh.
Or the last line: they said regulating wholesale energy sales, and we said wholesale, electricity, resale,
[indiscernible]. So they were happy about this; they're probably integrating it into their search engine
now.
Okay, the last thing I did, this was just for the fun of it: I wanted to see if maps could help somebody
understand the structure of a complex book. And what does a complex book actually mean? "Lord of the
Rings," mostly because I refuse to read A Song of Ice and Fire until they actually finish writing it.
So we had a lot of learnings. And my biggest problem was coherence. Think about it, journalists are
really nice, because they actually tell you what happened before. But books don't really work this way;
they don't say, okay, now that we're done with the [indiscernible] and this guy's dad, we can go on and do
that. So coherence was completely breaking. So what we decided to do is say, hopefully a single
character's point of view is a coherent narrative thread. So we focused a lot more on named entities. And
I'll just show a little from the Lord of the Rings map. So this is the hobbits and Gandalf starting to walk
on their merry way; they collect people all the way to the castle and then they split up. And you see
people [indiscernible] going somewhere and some, instead of going that way, they [indiscernible]. The
bad guys are down there and they're going to eventually meet the good guys. So there's a lot of structure
already emerging.
Okay, one more thing I want to tell you is what we did recently to make things more useable. So first
thing I was worried about was scalability. I really wanted to do a web scale [indiscernible]. So
basically we ran our objective, everything I used to say was [indiscernible], actually made the -- sorry,
made it parallel and came up with a hierarchy, like a clustered version of Metro Maps in which a metro stop
is not just a single article anymore. And we brought it down from 11 minutes on several thousand
articles to 30 seconds per query on hundreds of thousands of articles.
Second thing I was worried about was interaction. And I really don't think it's going to make or break
Metro Maps, because I'm not going to nail the right map based on a couple of key words. But on the
plus side the Metro Maps has so many awesome interaction mechanisms. So we tried to think. One is
they call a [indiscernible] solution, where you can zoom in to learn more or zoom out to get the
high-level overview. But the most interesting technical bit was we had to come up with a community
[indiscernible] algorithm to make this [indiscernible] function of dense overlaps. So we had a
[indiscernible] algorithm with some block coordinate and gradient descent.
Second thing we did was [indiscernible]. Remember in the coverage slide I told you this is a perfect
place to plug in personalization to those weights? So we actually had the mechanism that let people
say I don't care much about what Germany is doing, but Portugal is interesting. And the map would
recompute based on whatever I think is important now. With ideas from [indiscernible] from
[indiscernible] feature-based feedback.
Another thing I did this semester, and this is a really fun project with a student, is think about
controversial topics. So if you look for something like Obamacare, there's really not one map; it's really
more Democrats versus Republicans. So we looked at how to formalize this notion of controversy
using polarized sentiment and how to kind of cluster documents based on those sentiments and
compute two different maps, representing two different points of view.
One more thing that's been going on: we have a website, in the very final stage of debugging, and I have a
student whose entire mission for the quarter is to come up with an open-source package so people can
plug in their data and see what comes out. So this has been a really, really exciting semester.
So the entire point of this project was to take a news reader or first-year student or [indiscernible] or
really anybody that has lots of data and needs to rely on storage. And we wanted to show them a
perspective of their field. We want to show them the structure and how things connect to each other.
And we talked about how to formalize this: coherence, coverage, connectivity. We have the algorithm and
we have user studies to evaluate our idea. Now at this point I was kind of staring at this and trying to
think what to do next. So what don't I like about Metro Maps? And the thing I dislike the most is Maps
can only show you connections that are explicitly made by somebody. A journalist told us that those two
are related. But what if you want to make new connections? What if you want to discover something
new? So this is how our project came to life, where the goal is: you have lots of data, how do you find
something insightful [indiscernible], or really how do you define this notion of insight.
So just a word of caution, this is work in progress, it's a lot less mature than Metro Maps, but I have
been having so much fun with it, I thought I'd tell you some. [Indiscernible] now there's been lots, lots,
lots of work about this, right. There's psychology, cognitive psychology, there's data mining. You can
argue that [indiscernible] about taking data and getting insight. Same thing for lots of info conferences.
And I was going through a lot of papers and trying to abstract those ideas. So what makes an insight.
Just like before, what makes an insight?
First thing is almost [indiscernible], right, it has to be surprising. If you know about it, nobody cares.
But surprise alone is not enough, because give me enough data, I'll find plenty of things that will
surprise you, just because there's noise or bias or coincidence. So it has to be what I call plausible, or
really supported by the data. And this is a super general idea. So let me just show you how this plays
out in the medical domain. And the medical domain is perfect for me, because there's lots of data just
lying around and every day you see those articles about researchers found the link between blank and
blank. So there's a potential for many, many new links that nobody's discovered. And we want to
use this idea to build a system that will kind of take researchers and give them some promising research
directions, so identify where the gaps are in our current knowledge.
So I said plausible and surprising. How does that work? First of all for the purposes of this
presentation, I'm only going to restrict myself to a really simple kind of insight. It's a pair of medical
terms. Like there's a connection between sleep apnea and diabetes that I think is insightful. Because
it's just [indiscernible] of medical term. Now for something to be plausible, I need to actually
[indiscernible] a lot in practice. So many sleep apnea patients actually do get diabetes. And in order
for this to be surprising, you go through the [indiscernible] and nobody ever thought about it or nobody
ever noticed it before. So look for plenty of things that [indiscernible] a lot in practice, but nobody in
the literature seems to know about it. So what kind of data do you need to know to compute those
things? So for plausible we have seventeen years of hospital notes. [Indiscernible] notes, we have
about ten million of them. And for surprising we have about eleven million papers from [indiscernible].
Now this is an overview of the system. And if you lost me by the way, this is a good place to pick up.
So I have a system, it starts from a query. Now it doesn't actually have to start from a query, but
researchers usually have something they care deeply about, so let's start from a query. In this case sleep
apnea. So what the system does, first of all, it goes through the medical notes and looks for a plausible
candidate. So what happens to sleep apnea patients in practice? You take the candidates and you rank
them according to MEDLINE. So what's surprising? So sleep apnea, what happens in practice and what
does the literature not know about it?
Now, for this to work, I need to tell it three things. Where are the terms coming from, what's plausible,
and what's surprise? Okay, terms, so we extracted medical terms from the notes and from MEDLINE.
And first of all, this is a lot more [indiscernible] than I expected. Because it's natural language, so
physicians might tell you sleep apnea, acute sleep apnea, recurrent sleep apnea. And this is completely
messing up my counts. So I need to know when I can merge something and when I can't. So I decided
to use medical hierarchies. So we have this kind of thing, it's a [indiscernible], you have stuff like
migraine disorders that has two [indiscernible], common migraine, and not so common migraine. And
you really want to know when you see something if you can merge it up or not. So what we decided
to do was use [indiscernible] divergences. So compute how much information is lost when you use a
[indiscernible] in order to approximate a child. And you can see for example, if it's a common migraine,
you can propagate it all the way to migraine disorders, maybe in vascular headaches, depends on your
threshold. But if it's this other type of migraine, no, it's a completely different beast.
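As a rough illustration of the merge test described here, the following is a minimal sketch only. The transcript does not preserve which divergence was used, so Kullback-Leibler divergence is swapped in as an assumption, as are the idea of comparing distributions over co-occurring context terms, the function names, the threshold value, and the toy numbers.

```python
import math


def kl_divergence(child, parent):
    """KL(child || parent) in bits: information lost when the parent's
    distribution is used to approximate the child's.

    Assumes both are dicts over the same keys and that the parent gives
    non-zero probability to everything the child does.
    """
    return sum(p * math.log2(p / parent[key]) for key, p in child.items() if p > 0)


def can_merge(child, parent, threshold=0.1):
    """Hypothetical rule: merge the child term into its parent when little is lost."""
    return kl_divergence(child, parent) <= threshold


# Invented toy distributions over two context terms, for illustration only.
parent_dist = {"context_x": 0.50, "context_y": 0.50}
similar_child = {"context_x": 0.55, "context_y": 0.45}
different_child = {"context_x": 0.05, "context_y": 0.95}

merge_similar = can_merge(similar_child, parent_dist)
merge_different = can_merge(different_child, parent_dist)
```

With these made-up numbers the first child is close enough to its parent to be propagated upward, while the second is not; where the cut falls depends entirely on the threshold, as the talk notes.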
Okay, now surprise. What counts as surprising? Well, first of all, they can't [indiscernible] too often,
right? So we have a threshold. So the number of papers mentioning them both can't be over that threshold. But
that's still not good enough, because there might be two terms that just don't appear because nobody
cares about them, because five people in the universe combined have them. So it's more surprising if the
terms are popular; if nobody noticed the connection between sleep apnea and diabetes, which are really well
researched. So we have, just like before, we have weights. So the importance of a term. And the way
I like thinking about this is it has novelty and it has utility. So this is surprise.
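The two ingredients just described (a cap on how often the pair already appears together in the literature, and a weight for how important each term is) could be combined roughly as follows. This is a sketch under assumptions: the talk does not say how the weights are combined, so the product, the default cap, the function name, and the toy numbers are all hypothetical.

```python
def surprise(co_mentions, weight_a, weight_b, max_co_mentions=5):
    """Score a pair of terms for surprise.

    co_mentions: number of papers that mention both terms.
    weight_a, weight_b: importance of each term (e.g. how well researched it is).
    Novelty is the cap on co-mentions; utility is the product of the weights
    (the product is an assumption, not something stated in the talk).
    """
    if co_mentions > max_co_mentions:
        return 0.0  # the literature already knows about this pair
    return weight_a * weight_b


# Invented numbers, for illustration only.
popular_unlinked_pair = surprise(co_mentions=2, weight_a=0.9, weight_b=0.8)
obscure_pair = surprise(co_mentions=0, weight_a=0.1, weight_b=0.1)
well_known_pair = surprise(co_mentions=400, weight_a=0.9, weight_b=0.8)
```

The point of the weighting is visible in the toy values: two obscure terms that never co-occur score low, while two heavily studied terms that are rarely mentioned together score high.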
Now, plausibility. The way it works is exactly the other way around. So two things you [indiscernible]
together in practice a lot. So what we did is we aggregate all the notes that a single patient received in
a year, and that is our basic document. And the thing we computed is really the [indiscernible]
coefficient. So how many patients have both of those things over how many patients have at least one.
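The ratio described here (both over at least one) is the standard Jaccard coefficient of the two sets of documents. A minimal sketch, assuming notes arrive as (patient, year, terms) tuples; the record layout, function names, and toy notes are hypothetical.

```python
from collections import defaultdict


def build_documents(notes):
    """Merge all notes one patient received in one year into a single set of terms."""
    documents = defaultdict(set)
    for patient_id, year, terms in notes:
        documents[(patient_id, year)].update(terms)
    return documents


def plausibility(documents, term_a, term_b):
    """Jaccard coefficient: documents with both terms over documents with at least one."""
    both = sum(1 for terms in documents.values() if term_a in terms and term_b in terms)
    either = sum(1 for terms in documents.values() if term_a in terms or term_b in terms)
    return both / either if either else 0.0


# Invented toy notes, for illustration only.
notes = [
    ("p1", 2001, {"sleep apnea"}),
    ("p1", 2001, {"diabetes"}),  # same patient and year: merged with the note above
    ("p2", 2001, {"sleep apnea", "diabetes"}),
    ("p3", 2002, {"sleep apnea"}),
    ("p4", 2002, {"diabetes"}),
    ("p5", 2002, {"migraine"}),
]
documents = build_documents(notes)
score = plausibility(documents, "sleep apnea", "diabetes")
```

Note that this sketch counts patient-year documents rather than distinct patients; whether the original system deduplicated across years is not stated.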
And let me just show you what happens when you try to plug those two objectives into our system.
Here's some stuff on dementia. Those are the top six [indiscernible] things that happen with dementia.
So the first three are Alzheimer's medications used to treat dementia. So they're going to be filtered
away by MEDLINE. But then you're left with hip fractures, atrial fibrillation and wheelchairs. And this
[indiscernible] gets really, really suspicious and says hip fractures and wheelchairs, oh, we might have a
problem here. Because it might not be about dementia, just might be that that population tends to be
old. So we needed what we call [indiscernible] power. And I'm trying not to say the word
[indiscernible], but you can think about it this way if it helps you. The idea is that we took a group of
people that are really, really similar to dementia patients but don't have dementia. And we compared
this to the [indiscernible]. And we say hip fractures, are they a lot more common for dementia patients
than for this other group that is really similar, but just doesn't happen to have dementia. And if they're not,
then you're not capturing the right thing here. It's not about dementia, it's about them being old or
something else. So plausibility is about having [indiscernible] and also about passing this matching test.
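One way to picture the matching test is as a comparison of how common a term is among the query patients versus a matched group without the condition. The sketch below assumes the matched group has already been selected (the talk does not say how similarity was defined) and uses a hypothetical minimum ratio; the function names and toy records are invented for illustration.

```python
def prevalence(group, term):
    """Fraction of patients in the group whose records mention the term."""
    if not group:
        return 0.0
    return sum(1 for terms in group if term in terms) / len(group)


def passes_matching_test(cases, matched_controls, term, min_ratio=2.0):
    """True if the term is clearly more common among cases than among matched controls."""
    case_rate = prevalence(cases, term)
    control_rate = prevalence(matched_controls, term)
    if control_rate == 0.0:
        return case_rate > 0.0
    return case_rate / control_rate >= min_ratio


# Invented toy records: each set holds the terms seen for one patient.
dementia_patients = [
    {"hip fracture", "atrial fibrillation"},
    {"hip fracture"},
    {"atrial fibrillation"},
    {"atrial fibrillation"},
]
similar_patients_without_dementia = [
    {"hip fracture"},
    {"hip fracture"},
    set(),
    {"atrial fibrillation"},
]

hip_ok = passes_matching_test(dementia_patients, similar_patients_without_dementia, "hip fracture")
afib_ok = passes_matching_test(dementia_patients, similar_patients_without_dementia, "atrial fibrillation")
```

In the toy data hip fractures are equally common in both groups, so the term fails the test, which is the pattern one would expect if age rather than dementia is driving it.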
So let me show you. How about we start with dementia. Now, wheelchairs and hip fractures don't
even pass the test. And the only thing you're left with is atrial fibrillation and, okay, what do you do
with it? Is it an insight? Again, how do you evaluate? And ideally I would make just a series of bold
predictions and send an army of physicians to chase them down. But this requires, you know, time and
money and physicians, and I don't quite seem to have any of those things. So instead what we did
was early discovery. So we asked a physician to give us a list of breakthroughs of the last five years,
and we time travel on the data and say if you had run this algorithm five years ago, what could I have
told you? And specifically what would show up in the top three results of my search engine. Now
if you can predict anything this way, it's a really strong indication you have something.
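The early-discovery check can be sketched as a top-k lookup: for each known breakthrough pair, see whether the partner term appears among the first three results for the query when the system is run only on data from five years earlier. The function name, the data layout, and the placeholder terms below are hypothetical; no real rankings are shown.

```python
def early_discovery_hits(rankings, breakthroughs, k=3):
    """Return the breakthrough pairs whose partner term is in the top k results.

    rankings: dict mapping a query term to its ranked candidate list, computed
              on data truncated to the earlier date.
    breakthroughs: list of (query, partner) pairs supplied by a domain expert.
    """
    hits = []
    for query, partner in breakthroughs:
        if partner in rankings.get(query, [])[:k]:
            hits.append((query, partner))
    return hits


# Placeholder data, for illustration only.
rankings = {
    "query_1": ["term_a", "term_b", "term_c", "term_d"],
    "query_2": ["term_e", "term_f", "term_g", "term_h"],
}
breakthroughs = [("query_1", "term_b"), ("query_2", "term_h")]
hits = early_discovery_hits(rankings, breakthroughs)
```

Here the second pair is missed because its partner term is ranked fourth, just outside the cutoff.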
Now I said it was really [indiscernible], they only gave us four things, obesity and colon cancer,
diabetes type 2 and sleep apnea, atrial fibrillation and dementia, and increase in [indiscernible]. But out
of those four we actually managed to figure out two. So they are much happier and much more
willing to cooperate with me right now, and they promised to give me a much longer list.
Now, I started by saying this is a really general idea. So I wanted to show you how this exact algorithm,
this exact formulation works in a completely different domain. This is the commerce domain. So the
idea is to get a search engine that encourages serendipity. Here's the way to think about it. Suppose I
wanted to buy a laundry hamper. I doubt -- if I want a laundry hamper, I need a place to store laundry.
So the idea of what the algorithm does is find products that are plausible, in the sense that they
solve a similar problem, and they're surprising in the sense that when you go to Amazon's "people who
viewed this," nobody in their right mind who is looking for a hamper would consider this other
product as an alternative. I have a search engine that says you don't need a laundry hamper, how about
you buy a really big trash can? And I was telling it to a friend of mine and he says he uses a trash can
for a hamper and I was so happy. But anyway, my entire point here is that the algorithm is just the
same, just instead of medical notes we use [indiscernible] in order to learn common sense, what things are
used for, and instead of MEDLINE we use Amazon's "people who viewed this also viewed this" graph, and
everything is just the same.
Now, as I always tell everybody, I can give you some shopping tips from playing with my algorithm, just in
case you want to buy something. So here's some things I learned. First of all at least in this country,
pet products and baby products are surprisingly interchangeable. It just keeps showing up; it's scary.
Also I want to look up the [indiscernible] department every now and then. So here I was looking for
the thing you put in the bathtub in order not to slip, and cars really have the same thing. And also forget
the idea that [indiscernible] for stickers that people put on the [indiscernible] in order not to slip and
fall. Which made perfect sense to me.
Okay, the Aha project. The point was that medical researchers, I really wanted to give them a tool that
would let them discover some promising new ideas. And we formalized surprise and plausibility, we
have this early discovery of some medical breakthroughs. And I didn't quite talk about it, but there are so
many applications in other domains. So one is the product search, one is also something called
medical [indiscernible], there's [indiscernible] by discover reinforcing learning, there's also -- we did
this through Wikipedia, lots and lots of other places this could work out.
This seems like a good place to take a step back and try to answer what we had here. So I talked about
two projects, the Metro Maps and the Aha project. What's the common thread, what's the underlying
common theme? So I guess the obvious answer is this idea of turning lots of data into insight. But
there's really more. The projects I like the most are the ones that start from a really intuitive problem
definition, like, you know, what's a coherent story line or what's an insight, and then you formulate it
mathematically, optimize it, and then [indiscernible] user study, both to make sure that you're
categorizing and to make sure that your system's actually useful for anybody. I really want to build
those things to be useful. Now in order for this to work [indiscernible] borrow ideas from data mining
and machine learning and information retrieval, lots of algorithms, especially optimization and graph
algorithms, and some [indiscernible] visualization.
Okay, this is what comes in. What goes out is I tried to apply this in as many domains as I can. And
today I talked about medicine and science and news and legal documents and commerce and literature.
So this is what I do. Let's talk about what I want to do next. So first of all, I am the insight person. It's
super [indiscernible] and I am really excited about this idea of building a set of tools that would help
anyone, okay. So scientists, and really anyone, just plug in their data and see what's the most insightful
thing that comes out. And really to enable some new discovers reinforcing learning. And again, I
wanted to find the importance of finding [indiscernible] that generalizes across domains. I really think
it makes a technique much stronger.
So I talked about some applications that I already tried and let me briefly tell you about the collaborations
I've been mostly excited about recently. So first of all I gave a talk at the Stanford computational
social science conference. And there's been all this, you know, social scientists and political scientists
came to me afterwards carrying awesome, awesome buckets of data. So there's Congress notes and
crime data, lots of real beautiful data sets.
Life sciences, I really want to apply this to biology. I actually don't know how, but if you know
anybody who might be interested, you know where to find me. And there's a history professor who
really wanted to apply this idea of Metro Maps to telegrams.
Personal data sensing I've been dying to do for a really long time because of search browse history.
Because suppose you're trying to plant a tree and you find yourself an hour later with 75 tabs open
[indiscernible] and wait what just happened. So really I want you to organize your browsing history
into some structure.
Okay, there's been some interest from corporations about ordering corporate data, financial data. This
idea of insight for investigative journalism. And the last thing on this list really surprised me. I never
saw this work for anything that's not text based, but recently some researcher that [indiscernible]
applied my algorithm to summarize a video in a sequence of images. Yeah. [Indiscernible]. So
anyway, apparently this thing also works for images. Again, really surprised me.
Okay, so this was short-term. Long-term, like I said, I really like this idea of taking fuzzy
things and formalizing them. And out of those my favorite by far is this thing here in yellow, this
creativity. And I know it sounds kind of megalomaniac, so I'll just tell you briefly about one thing I
started doing this semester about it. So just two slides. So suppose you own a company, you own a
product. You went to college and you want to change your product in order to extend your business.
So what do you do? And we found this thing called the [indiscernible] model that's basically a set of
questions you should be asking yourself; what can you combine your product with, how can you put to
another use, how can you reverse functionality. My favorite example, this company that used to make
water pressure shower heads and now they're making water pressure dental floss, which I think is
completely brilliant. But anyway my point is I built a prototype system using the same ideas of
Concept Map in order to answer those questions, and Amazon to filter out the obvious things. Now I have
a search engine where you type in something like alarm clock and it just spits out suggestions like
combining it with a coffee machine. You know, you wake up to a fresh cup of coffee, or combining it with
a dimmer so the room becomes lighter and lighter as you wake up, or maybe make a silent alarm
clock, something to [indiscernible], either for deaf people or if you don't want to wake somebody
else up.
It just keeps spitting and spitting suggestions. Do you know SkyMall, the thing you get on airplanes?
[Indiscernible] right now, but hopefully it goes that way and gets better soon.
Okay, good. So breathe in, breathe out. The [indiscernible] we have plenty of data and this is excellent,
because data can help us understand the universe and make better decisions. But it's not enough to
collect this data or to even search for this data. We really have to make sense of this data. We have to
[indiscernible] the structure, like in Metro Maps, and we have to eventually discover unknown
connections, like the [indiscernible].
And we had user studies and we have early discovery to validate our ideas, and if there's one thing I
want you to take from this talk, it's this image right here. We really have to go beyond just
searching our data.
That's about it. Thanks.
Eyal Lubetzky: Any questions?
>>: I have a question. Because you were saying that the people who looked at the Metro Maps didn't
care so much about the relations that might hold between the nodes in the map. Is that because you are
drawing a map in document level?
Dafna Shahaf: It might actually be. I'm not entirely sure why they're doing this. But it's pretty clear
from the data. I know we showed them two things that were related and they couldn't just like at the
beginning or end. They did say something was [indiscernible], they just kept like bouncing on and off.
But other than this, they didn't seem to care much. And I'm not sure why, actually.
>>: When you do medical discoveries, it probably will be more important.
Dafna Shahaf: Then I completely agree.
>>: And I wonder if you move this to the sub-document level --
Dafna Shahaf: If I really wanted -- you know, the place I did at the beginning, they have a node as an
argument, and I just don't know how to get this, how to abstract. But hey, we're going to have a
meeting and talk about it; right? Excellent.
How does it work? People are watching it online, do they have any way of asking questions?
Eyal Lubetzky: They can run downstairs.
Dafna Shahaf: Hold the bus.
Eyal Lubetzky: As they go busting through the door.
>>: I have a question. So whether you're measuring the number of active words [inaudible], how did
you set the threshold? Because the more words you have the less effective it becomes, but the --
Dafna Shahaf: Yeah, the more words I have I can actually go back to this circus behavior, because each
transition is going to have its own budgets.
>>: Yes, so how do you build the threshold?
Dafna Shahaf: I just tried it in a couple of stories that were not useful to use as studies. And we just
[indiscernible] this. Again, eventually I would love to learn this. But it's just tweaking parameters.
>>: Do you do some experimenting with -- I mean, what you did was select a threshold and then treat
all these words equally, just sum over the active words and --
Dafna Shahaf: So summing for every active word, we summed over the influences of this word over
the transitions, it's not just --
>>: But then you copied like three or five, which was your [indiscernible].
Dafna Shahaf: Yeah.
>>: As opposed to just --
Dafna Shahaf: Yeah, we didn't let them -- I mean it's an algorithm, it's not rounding. It would still
work.
Eyal Lubetzky: Any other questions?
>>: If we have a lot of data and -- how can we use your project [indiscernible]?
Dafna Shahaf: I would say email my student and bug him. But you should probably email me, so I'll
bug him. But, yeah, he was supposed to finish it like a couple of weeks ago. You know, it's probably
going to take him a couple more. But it's almost ready. Now I'm curious about what questions you
have.
>>: Actually, I think we have a lot of data, we're just trying to find a way of using the data, finding the
insights. And now we're still at the level of [indiscernible]. Yeah, we want to move forward and this is
very interesting.
Dafna Shahaf: Bring it up. I like your drive.
>>: Have you published the work on the medical discovery?
Dafna Shahaf: So we really want it to be a Nature paper, which has been taking me longer than any
other paper in my life. If you want I can send you a rough draft.
>>: So which hospital were you collaborating with that you could get the --
Dafna Shahaf: I'm pretty sure it's the Stanford Hospital. It's the Stanford medical group. It's either this
or the Palo Alto Hospital.
>>: Because that's an amazing data set. Usually you have no -- like I've worked with data sets with at
most 500 patients, and that takes like two years to get access to. So that's an amazing data set.
Dafna Shahaf: Stanford medical team has been awesome about this.
>>: Yes.
Eyal Lubetzky: Okay. Let's thank Dafna again.
[Applause]