>>: It's an honor today to have Dafna Shahaf with us. Dafna is finishing up
her Ph.D. at Carnegie Mellon University, working with Carlos Guestrin and
[indiscernible]. She did her Bachelor's work at Tel Aviv University, her
Master's at UIUC, and she has many interests and directions, but ended up
focusing in her dissertation work on methods to help us deal with large
amounts of information, to visualize it and understand it, going beyond lists
of search results, for example, and navigating the web and deriving information
from the web to richer stories, fabrics and maps.
And she has won best paper awards, including KDD 2010, has been a Microsoft
Research fellow, so we're proud to have our label on her forehead, I guess,
and she is also a Siebel scholar. So I'll just turn it over to Dafna, who is
talking about Trains of Thought: Generating Information Maps.
>> Dafna Shahaf: Thank you. Okay. First of all, can you all hear me? In the
back? Excellent. So my name is Dafna Shahaf. I'm at Carnegie Mellon, and
I'm indeed going to talk to you about my recent research project, which is
called Trains of Thought: Generating Information Maps.
So what's this project about? Let me start with my new favorite quote: the
abundance of books is a distraction. This was said by Seneca, who lived in the
first century.
Now, a lot of things have changed since the first century, but Seneca's
[indiscernible] has only gotten worse. And you've all seen the numbers. Here
are just some of them. First is the Google estimate of the number of books
out there, and I have no idea how they got that specific resolution.
Also, the number of blogs skyrockets. And even if you just look at scientific
publications, PubMed has 19 million papers and is adding one by the minute.
Scopus has twice as many.
Now, I was browsing the internet to look for figures to use in the slide, one
of those exponential growth figures. You've all seen them. I came across this
one, and I just had to share it with you. Okay. So this is a paper from the
'80s. The X axis is a timeline. The solid line is the number of papers about
some topic that they found interesting. But what I really like about this
figure is the dashed line. The dashed line is what they call innovative
papers. I think what they're trying to say here, politely, is that the number
of papers grows exponentially; the number of papers worth reading, not so much.
Okay.
So hopefully you're convinced that there is a lot of data out there.
So, suppose you want to learn some complex topic. It might be news, like
covering the financial crisis in Europe. Or it might be some research area
that you want to start looking into. So where do we go? Most people I know,
their answer would be: we go to a search engine, right? We all love search
engines. Search engines are really great at retrieving nuggets of knowledge,
but they don't often show you how those, what, 30 million results fit together,
what's the big picture.
And there have been some systems in the past trying to summarize and visualize
complex topics, for example NewsJunkie, which came out of here. But usually
they try to construct a story line or a timeline, and I'm going to claim that
this style of summarization works only for really simple stories that are
linear by nature.
Real stories are not linear. They spaghetti into branches and side stories
and intertwining narratives. If you just think about research: if I had to
come up with a picture of what research is like, it wouldn't be a line. It
would be more like something like this. Come on, I'm sure you all know the
feeling, right? One day I'm going to make this screen turn on, and then I'm
going to get my dissertation.
Anyway, so you're dealing with this type of messy tangle. So what do you do?
Let me show you the inspiration, the holy grail. How many of you have seen
issue maps before? Okay. I hadn't seen them until not too long ago. So this
is an issue map. It's a set of seven posters. I first stumbled upon them in a
corridor at CMU, and this one specifically charts the big old [indiscernible]
debate: can computers think? Okay. Let me just zoom in, and zoom in some more.
You see, each node in this graph is an argument, like machines can have
emotions, and you can follow a path and see this argument is supported by that
argument, or it's disputed by that argument. And you're supposed to just sit
there and read it beginning to end and understand the big picture.
When I first stumbled upon it, I was fascinated. It's a topic I liked earlier,
but reading it made it all fit nicely in my brain, okay. Then I started
reading a bit about how they made it, and it turned out they needed about 20
man-years to generate them. So the next question was, hey, can we actually
build it [indiscernible]? Just have one issue map for every query you want.
Wouldn't that be nice?
So I don't know how to [indiscernible], but in this talk I'm going to talk
about the first few steps we took in this direction. And this system is
called Metro Maps. Okay. And the idea is, again, you start with a query, like
Greece debt crisis, and you get what looks like a metro map, where each line
tells a coherent story and different lines are about different aspects, and
you see how they intersect and overlap.
Okay. So this is a very, very simplified financial-crisis-in-Europe map. You
can see the blue line tells the story of how Greece's status was reduced to
junk and they had to come up with the austerity plan to get the bailout. The
red line is how they had all those protests and strikes because of those
austerity plans, and you see how both lines intersect at an article about
austerity plans.
Clear?

>>: Sort of?

>> Dafna Shahaf: Great. So how do we do that? Yes?

>>: Is this simplistic? For example, protests?
>> Dafna Shahaf: This is just for you. Each node actually corresponds to an
article. I just couldn't fit the titles, but you'll see the real map later in
the talk, okay? Each node here is an entire article. Yes?
>>: So is the linear structure alignment, or is it just a set up?

>> Dafna Shahaf: Is the linear structure of what?

>>: The blue line goes to [indiscernible] generally, but not in that order.

>> Dafna Shahaf: It's chronological.

>>: Okay. What about the red line?

>> Dafna Shahaf: The red line is supposed to be slanted, but --
I need to ask the Greek people. Okay. Anything else? Again, you'll see a map
later and it's all going to be much nicer.
So how do we do this? Now, maps are complex creatures, so we start with a
simple problem of how do you construct a single metro line? What makes a good
line? Then we will move to maps for news, and finally I'll tell you how to
adapt it to the scientific domain.
Okay. So I'll start with lines and we tackled this in a KDD 2010 paper called
Connect the Dots, where we made our life even simpler by assuming that we know
the end points. Okay. So here's the situation. You want to know about
financial crisis, this time in the U.S., so you pick two articles: your start
and your goal. And the idea is that you vaguely remember it had something to
do with the housing crisis and the bailout.
Okay. So this is the input of the system. The output is a smooth chain of
articles that bridges the gap between them. For example, the output might
look like this. So you have a chain telling you that people borrowed money
from the bank to pay for houses. And the mortgage crisis begins to spiral
because banks rely on debt too much. Investors want the Congress to react.
The bailout plan starts rolling, and finally, bailout. This is the type of
output we're looking for. Fair enough?
Okay. So how do you go about finding such a good chain? When I ask people
this, almost always their reaction is: just the shortest path. It's not a real
problem, right? Just build a graph, add a node for each article and edges
based on your favorite similarity metric, and just find a shortest path between
them, or a bottleneck path, or your favorite.
So why isn't it good enough? Let me show what happens when you actually do
this. So we tried to combine those two articles, one about the Monica Lewinsky
story and another about the Florida election recount. Everybody familiar with
those stories? Remember? Sorry, the data is a bit old.
So let me show you what happens when you try to connect those two with
shortest path. And let me just show the important parts here. You don't need
to read it. The important part is that this chain is rather erratic, okay. It
goes from Monica Lewinsky to Microsoft, to Palestinians, to Florida. Doesn't
make any sense.

But if you look at each transition in context, it starts to make sense.
Because the first two documents are about trials. They share a lot of
vocabulary words: judges and lawyers and juror terminology.
The next two are about Microsoft. By the way, it's not for you; I've been
doing those slides for the last two years. It's in the paper from 2010. So
the point is that what you get is this stream-of-consciousness effect, where
each transition makes sense out of context, but the overall effect is --
there's no global [indiscernible] throughout it.
Okay. Now what would you like the chain to look like? Same two articles.
Ideally, the chain would look something like this. Clinton admits the story.
He's about to be impeached. He's impeached. He's acquitted. Al Gore starts
his campaign and tries to break away from Clinton, because it's a messy thing.
Election draws near and finally, election and recount.
So hopefully you all agree that this chain is better. But why? And it looks
like what we're looking for, really, is what we call coherence, okay. This
chain is more coherent than the other. So that's the property we're after.
But, of course, that's just a different problem. Now instead of looking for a
good chain, I'm looking for a coherent chain.
How do you define coherence? Let me just give you an overview of this talk
and my work in general. A huge chunk of my work is just formulating, crafting
objective functions. Okay. Formulating all those terms I've been using. What
does coherence even mean?
So after you find an objective you're happy with, you need to come up with an
algorithm to optimize it, to find good chains. And finally, I need to convince
you that it works. Okay?
So, crafting an objective. I decided, in order to see why this second chain
was better than the first one, to look into word patterns. What you see here,
this is the shortest path chain, and the bars correspond to whether the word
on the left appears in the article above it. Okay. So Clinton appears at the
beginning and the end.
And you see this stair-like behavior here? Okay. This means that the topic
changes with every transition. Every two documents are related because of a
different set of words. Now, compare it to the second chain, and you see that
everything is smoother and nicer. The transitions, everything is longer.
Clinton is there beginning to end. Lewinsky is there almost everywhere, and Al
Gore starts showing up and keeps going for a while. Everything is smoother
and nicer and consistent.
So we decided to use this as our intuition when we define coherence. Step
back, quick list of desiderata: coherent chains are going to have strong
transitions -- we're not giving that up -- but also something global running
throughout. Okay. So let's start with strong transitions. How do you do
that?
And perhaps the most naive way to start thinking about it is to just say,
well, for every transition, just count the number of words that the two
articles share. That is, a sum over words w of an indicator function: does
word w appear in both document d_i and d_{i+1}? Yeah, that's the face I was
making when I first did this.

So if you want to score the coherence of the entire chain, you take the
minimum, right, because the chain is only as strong as its weakest link.
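Written down, the naive objective she is describing would be something like
this (notation mine, reconstructed from the spoken description):

    \mathrm{Coherence}(d_1, \dots, d_n) \;=\; \min_{i=1,\dots,n-1} \; \sum_{w} \mathbf{1}\big[\, w \in d_i \cap d_{i+1} \,\big]

Score each transition by the number of words the two articles share, and score
the chain by its weakest transition.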
Now, this really doesn't work. And why? Just take a look at this indicator
function, and I'm going to claim that it's way, way too coarse. It completely
ignores the importance of different words, based on [indiscernible] level and
transition. And also, it misses words. Because, again, if the document has
the words judge and jury but not lawyer, then lawyer is really implicitly
there.
Okay. So we had to replace it with something softer, which we call the
influence of document d_i on document d_{i+1} with respect to word w. What do
I mean by that? Just intuitively, you want to think of the influence as high
if the two documents are related and w plays an important role in what makes
them related, even if it doesn't appear in either of them. You with me? I
think I lost some of you.
>>: [inaudible].

>> Dafna Shahaf: What's the what?

>>: [inaudible].

>> Dafna Shahaf: Yes, I will get to this, okay, in the next slide. So, just
to tell you, there have been a lot of influence notions in the literature, but
they usually assume that there are some edges, and -- yeah.
>>: [indiscernible].

>> Dafna Shahaf: Became what?

>>: Symmetric.

>> Dafna Shahaf: No, not symmetric.

>>: But now it is [indiscernible].

>> Dafna Shahaf: Yeah.

>>: Because it's [indiscernible].

>> Dafna Shahaf: Oh, but I only compute it in chronological order. So for
d_i, I will not -- I will not even compute influence if it's after d_{i+1}.

>>: I see.

>> Dafna Shahaf: All chains go forward in time.

>>: Okay.
>> Dafna Shahaf: Okay. The way we computed it, not [indiscernible], but I'm
not going to go into this. Yes?

>>: [indiscernible] document, there's no order in the document?

>> Dafna Shahaf: In the document, no. A document is a unit, okay? Okay. So
what was I saying? Oh, that we don't have a single edge in our data set. So
we had to come up with our own notion of influence, which I'm not going to get
into, but it basically uses word co-occurrence, okay, to achieve those two
properties. Yeah?
>>: So when you say the W times [indiscernible] --

>> Dafna Shahaf: Yeah, sure. So what I was really doing, I was constructing a
bipartite graph between words and documents: you connect a word to a document
if it appears in it, and put some [indiscernible] on the edges. And I was
looking at, when you want to go from d_i to d_{i+1}, you zigzag on this
bipartite graph -- how often do you need to go through w? Okay. Enough? If
you want, I can talk about it for a whole lot more. I knew I should have kept
backup slides.
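To make that zigzag picture concrete, here is a minimal Python sketch of one
way to read it: estimate, by simulating random walks on the word-document
bipartite graph, how often a walk from one article to the next passes through
a given word. The unweighted edges and the Monte Carlo estimate are my
simplifications; the actual computation in the paper is different, and she
skips it here on purpose.

    import random
    from collections import defaultdict

    def build_bipartite(docs):
        """docs: dict doc_id -> iterable of words. Nodes are ('doc', id) and
        ('word', w); a word is linked to every document it appears in."""
        adj = defaultdict(list)
        for d, words in docs.items():
            for w in set(words):
                adj[('doc', d)].append(('word', w))
                adj[('word', w)].append(('doc', d))
        return adj

    def influence(adj, d_i, d_j, w, n_walks=5000, max_steps=20):
        """One reading of 'how often do you need to go through w': among
        random walks that make it from d_i to d_j, the fraction that
        visited the word w along the way."""
        reached = through_w = 0
        for _ in range(n_walks):
            node, saw_w = ('doc', d_i), False
            for _ in range(max_steps):
                node = random.choice(adj[node])
                if node == ('word', w):
                    saw_w = True
                elif node == ('doc', d_j):
                    reached += 1
                    through_w += saw_w
                    break
        return through_w / reached if reached else 0.0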
Anyway, what [indiscernible] said was correct. Okay. We have strong
transitions, but remember what I started out telling you: you have to have
this global thing, smooth and nice transitions. And the thing is, with our
[indiscernible] objective, [indiscernible] chains and shortest path chains can
score really well. But the important thing to notice is that they need a
whole lot more words, you know, to get that score.
Good chains can usually be represented by a much smaller number of words.
Okay. Let's play a game. Suppose I tell you that I allow you to choose only
three segments, okay? This is a segment. And I'm going to pretend that these
are the only words that appear in the documents, and then compute the score.
Okay. Like, these are the only words. What do you do? So for the chain on
the right, you can go with, for example, Lewinsky, impeachment and Gore and
still get a really good score. For the chain on the left, however, there are
no three segments that will give you a good score, precisely because each
transition uses a different set of words.
>>: Do you really mean segments rather than words? The fact that Lewinsky is
missing for one means you can't --

>> Dafna Shahaf: I do mean segments, because you might have, like, documents
one and two related because of a word, and also three and four, but not two
and three, so you have this zigzagging pattern. So we tried actually both
ways; this works better.
So instead of, like we did before, taking all words into account, now we only
take active words into account -- the segments that we picked -- and we turn
this into an optimization problem: which segments would you pick to get the
best score?

And we have some constraints on these activations, because we want to simulate
the behavior of good chains. Like, you can't choose too many words. You
can't choose too many words per transition. Or, like you just said, you can't
have words zigzagging, turning on and off and on and off.
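One plausible way to write that down (again, notation mine; what matters for
the next step is that the constraints can be expressed linearly):

    \mathrm{Coherence}(d_1, \dots, d_n) \;=\; \max_{\text{activations } a_{w,i}} \; \min_{i} \; \sum_{w} \mathrm{infl}(d_i, d_{i+1} \mid w)\, a_{w,i}

subject to a small budget on the number of active words, overall and per
transition, and to each word's active transitions forming one contiguous
segment -- no zigzagging.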
Okay. You with me? Okay. So we finally have a coherence definition. I told
you it would be a big chunk. Now, regarding the algorithm, how to actually
find a good chain: the bad news is that it's NP-hard. You all guessed it.
The good news is that if you don't care about having binary activations -- if
you're okay with, like, choosing 0.5 of this word over here -- then it has a
very natural formulation as a linear program. A linear program, an LP. Yes?
>>: Speaking of words, suppose the word is Clinton, but the article is really
about Hillary. That's one thing. And another thing is, suppose both articles
are about Hillary, but Hillary is Clinton, and here's how and why.

>> Dafna Shahaf: I would like to see --

>>: And the context is clear whose wife she is.
>> Dafna Shahaf: Yeah, you're talking about NLP problems, and there are people
who spend their entire research career on this disambiguation thing. So part
of the point of this paper was to see how far you can get using just the most
basic features. Just words. See if it's good enough before you start
throwing in the big cannon, like what you said about disambiguation and
coreference and wives. And just words can do quite well. We also tried it
with some more interesting features, like topic models. But hey, if it works.
Okay? Does that answer your question? Good.
So we have an LP and we have a rounding scheme, and I'm just going to tell you
that we have some approximation guarantees: in expectation we can control the
length of the chain we want to get, and [indiscernible], we can tell
essentially how close we are to the optimal chain, okay?
Next thing, I need to convince you that it works. So the sad part: we can't
do the standard things. We don't have ground truth. We don't have a gold
standard. I can't throw up those nice precision-recall curves. So we had to
do user studies: let people use our chains and competitor chains and see what
they like best.
So these are our competitors. We have shortest paths, after I spent ten
minutes or so bashing them. We have Google News Timeline, and we have a
system called Event Threading, or [indiscernible]. And just to give you an
idea, this is one of the chains we showed the users. We're trying to connect
the O.J. Simpson trial to the verdict. And this is the chain: Simpson
strategy, there are several killers. Book deal controversy. April
transcripts. And something completely unrelated, about the Tandoori murder
case. This is one chain we got from Google News Timeline.
Second chain, same articles. Issue of racism erupts in the Simpson trial.
L.A. police have some racial tensions. More about L.A. police, and finally,
lawyers trying to use it in order to get an acquittal, and the verdict. Okay.
So this was one of our chains.
Now, we've run several user studies. I'm just going to talk to you about one
of them. The data is The New York Times -- this is an old one, from 1995 to
2003. We had 18 users. We chose five prominent news stories, like the O.J.
Simpson trial. And we showed them two chains, generated by different methods,
double blind. And the first thing we asked them is just which one is more
coherent, because we wanted to see if we captured their notion of coherence.

Also, despite the fact that we're not directly optimizing for it, we also
asked which one is more redundant and which one is more relevant. And let me
show you the results. The Y axis here is the fraction of times [indiscernible]
preferred to the other. People could say two things are the same, so it
doesn't have to sum to 100 percent.
And the first thing is coherence. The only conclusion we really get out of it
is that we're doing better, which is the entire point of this paper, okay. So
we were happy.
Now, things look a bit less good when we look at redundancy. But then we
looked at relevance, and it all started to make sense. If you think about it,
there's a very clear trade-off between relevance and redundancy. If you want
to remove redundancy, [indiscernible] random articles and your relevance
drops. Or, if you want full relevance, just stay really close to your input
articles, and then [indiscernible]. You can even see it in the chain I'm
showing you, right, because the Tandoori murder case is definitely not
redundant, but it's also not relevant. We think this is what happens here:
that we pay for relevance with some redundancy, okay. Again, this is just to
give you a flavor of what questions you can ask about those chains. Yeah?
>>: So you looked into New York Times only? Because that may bias --

>> Dafna Shahaf: Oh, it definitely does.

>>: Articles, they use their favorite words, so that now Google looks --

>> Dafna Shahaf: So you can actually restrict it to The New York Times. You
can restrict it to New York Times.

>>: That's what you did?

>> Dafna Shahaf: I think that's what I did. It was two years ago, but yeah,
it was two years ago.

>>: Because otherwise, it may be uneven.
>> Dafna Shahaf: By the way, what you said was the key. Because we are just
using words and because different writers do tend to use their own words, you
can actually see that sometimes it prefers chains by the same writer.
>>: The same article, the same vein, the same issue from Wall Street Journal
may have nothing to do with it.
>> Dafna Shahaf: Yeah. I actually don't have anything other than the New York
Times, but it would be fun to play with the Wall Street Journal. Yeah?
>>: How did you choose to use --

>> Dafna Shahaf: The five news stories?

>>: Yes.

>> Dafna Shahaf: I think we went to one of those websites, what is the top
news story of the year or something like this, and we picked the top two for
every year or something.
So one thing I really like about chains: they allow some interesting forms of
interaction. For example, the O.J. Simpson trial -- there are so many ways to
connect those two end points and tell a coherent story. So what we did is we
added an interaction mechanism, where users were shown a tag cloud. They
could say, give me more about this word, or less about that word. And you can
do this with online learning or [indiscernible], but just to give you an idea
of what it looks like in practice.
So this user got a chain focusing on the verdict, okay? And they say I don't
care about the verdict. Give me more about the racial aspect. Then they got a
chain very similar to what you saw earlier about racial issues and L.A. police.
They could say give me more about blood and glove and then they get a chain
about DNA expert and fiber evidence. So there's a lot of playing room here.
So hopefully I've convinced you I know how to construct good lines. But like
I said, lines are not good enough. Again, O.J. Simpson: there are so many
different aspects to be covered. So next we switch to maps. And just as a
quick reminder, this creature is a map. Lines are coherent, and different
lines focus on different aspects, so where they overlap, they intersect.
Now, let me just define it semi-formally. So a map is just a graph, G, and a
set of paths, and all you need to know is that the nodes correspond to news
articles and the edges are the underlying edges of the paths. So the graph is
just the union of all of the paths.
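In code, the structure amounts to something like this little sketch (encoding
each line as a chronological list of article ids is my choice, not anything
from the talk):

    from dataclasses import dataclass

    @dataclass
    class MetroMap:
        """A map is a set of paths (metro lines) plus the graph G they
        induce; the graph is just the union of the lines."""
        lines: list  # each line: a chronological list of article ids

        def nodes(self):
            return {a for line in self.lines for a in line}

        def edges(self):
            return {(ln[i], ln[i + 1])
                    for ln in self.lines for i in range(len(ln) - 1)}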
Now, how do you define a good map? Well, the first property I gave you for
free, right? It's coherence. Every line should tell a coherent story. But
is it good enough? Can I just return the top three coherent stories in the
data set and call it a map? So let me show you what happens when you actually
do this. This is the map we got for the query Clinton -- again, old data set,
so Hillary was not around. The first line is about Clinton's visit to
Belfast. And then you have two more lines about Clinton's relationship with
some religious leaders.
Now, just taking it [indiscernible], what's wrong with these maps? And the
thing is, there are two things wrong here. First of all, I don't know how to
say it, but those are not really important stories, okay. There's so much to
be said about Clinton's presidency, and his visit to Belfast is not one of
them. Also, there's nothing to go against redundancy, right? Those two
religion lines are pretty much the same. And yeah, they're both coherent, but
they don't give me anything that the other one did not.
So there's importance, and then redundancy. In other words, the challenge is
really to balance this coherence with what we call coverage. Okay. So lines
should be coherent, but they should also be about topics that the user cares
about, and as many of them as you can. You with me? Yes?
So how do we do that? We tackled a very similar problem in a KDD '09 paper
called Turning Down the Noise in the Blogosphere, where the idea was to just
find a small set of articles that are both diverse and important. So this is
a tag cloud about everything that happened on January 17, 2009. The size of a
word corresponds to its frequency. You can see that Obama was very frequent --
this was Obama's inauguration. And also the Israel-Gaza conflict. And New
York, because the airplane landed on the Hudson river.
So the idea was to pick articles that are about important stories. And just a
one-slide summary of how we do this: all we need to know is that the documents
cover concepts. Concepts, you can think about them as words. For example,
the document covers some of Obama, some of Washington, some of U.S. You throw
in the orange one, you completely cover New York and add some coverage to the
U.S. At some point, you start looking at documents that cover some other
things. All you need to know is that we use this coverage notion; it
addresses both of the problems that we had, importance and redundancy.
>>: So what does it mean, that you covered New York?
>> Dafna Shahaf: Oh, it means that when your algorithm looks for another
document to increase coverage, it's not going to pick something about New
York, because it's not getting any additional gain from it.
>>: But I still don't understand what coverage is. I mean, the blue and the
orange both contain references to New York. So --
>> Dafna Shahaf: So it means that New York played -- was important in this
document. Okay. So this document was about something, in this case the
Hudson river, so they mention New York quite often.

So in other words, you're not going to -- actually, let's go with Obama, it's
an easier example. You had an article about the inauguration. It covered
Obama somewhat. It covered the inauguration somewhat. You keep on picking
articles that cover those two things, and you have this diminishing returns
property. At some point, you just stop adding.
>>: What is it that's diminishing? I don't understand.

>> Dafna Shahaf: Oh, so yeah, that's because I'm hiding it up my sleeve. So
the notion of coverage is a function that has diminishing returns. So I don't
have the formula here, but it's basically -- do you want me to go into the
formula? I can do that.

>>: I want you to just say what's the idea. Because at the moment, I don't
understand. Some documents mention Obama and New York. Some mention others.
How do you know --

>> Dafna Shahaf: Okay. For each document, we figure out what the important
words are, just standard NLP stuff. Then what you do is you say, well, this
document covered Obama, say, a third. Then you add another document. And you
don't want -- if this one also covered him a third, you don't want to just
keep adding Obama and Obama on and on and on.

So what you do is you turn it into a probabilistic max coverage problem, where
each document flips a coin and with some probability covers the concept Obama.
So when you have more documents about Obama, the probability that at least one
of them covers it gets closer to one, and then you don't get any additional
coverage from a new document, and you need to go into the Gaza-Israel stories.
You're still not convinced.
>>: Let's discuss it later.
>> Dafna Shahaf: All I want to tell you is that we have a coverage notion --
it's not part of the main line of work here -- that helps us figure out if a
set of documents is about important things and is also not redundant. Fair
enough?
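For the curious, here is a minimal sketch of the probabilistic coverage idea
from that exchange, assuming we are given, for each document, how much it is
"about" each concept (the doc_concepts and concept_weights inputs are my
stand-ins for what the system actually estimates):

    def covers(doc_concepts, selected, concept):
        """Each selected document d covers the concept with probability
        doc_concepts[d][concept] ('this document covered Obama a third'),
        so a set of documents covers it with 1 - prod(1 - p_d). Each extra
        Obama article adds less than the one before: diminishing returns."""
        miss = 1.0
        for d in selected:
            miss *= 1.0 - doc_concepts[d].get(concept, 0.0)
        return 1.0 - miss

    def total_coverage(doc_concepts, selected, concept_weights):
        # Weight concepts by importance, e.g. how prominent they were that day.
        return sum(weight * covers(doc_concepts, selected, c)
                   for c, weight in concept_weights.items())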
And this is what happens when you incorporate coverage, when you look for a
map that's both high coverage and coherent. Okay. So this is about Greece
again, and you have a line about strikes, a line about Germany, and a line
about the IMF. Now, what's wrong with this map? Come on, you're all thinking
it. Yes, precisely: they're not connected. And it's especially annoying
because we have this article about Germany and the IMF. For crying out loud,
at least those two should have intersected.
Our last property is connectivity. If two lines are connected, then I want to
know about it, okay? And there are multiple ways to formalize connectivity.
We experimented with users, and it seems like the only thing they care about
is: those two lines, I know they're related, but the map doesn't show it to
me. They didn't seem to care how the lines were connected -- whether it was
at the beginning or the end, through one article or multiple articles. Just:
are they connected or not? So we just went with the really simple objective
of counting the number of lines that intersect.
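As a sketch, that objective is literally a pair count (lines encoded as lists
of article ids, as in the map sketch earlier):

    from itertools import combinations

    def connectivity(lines):
        # One point for every pair of metro lines sharing at least one article.
        return sum(1 for a, b in combinations(lines, 2) if set(a) & set(b))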
So now we have objectives for coherence, coverage and connectivity -- our
three C's, I guess. And how do you turn them into one big objective function?
The idea is that it's really a game of tradeoffs. Like I told you, if you
maximize coherence, you get all those Clinton/Belfast stories with low
coverage. And if you try to maximize connectivity, then again you're going to
get those lines that are almost the same -- they're definitely connected, but
they're about the same thing, so again coverage drops. If you try to maximize
coverage, your connectivity drops, and so on. So here are your properties.
How would you combine them? Let's start with coherence.
Now, hopefully I've convinced you that we're not after maximizing coherence.
We don't necessarily want the most coherent chains that we have. Rather, it's
really a constraint: you only want the chains to be coherent, right, to be
above some threshold.
Now, we're left with coverage and connectivity. We really want to maximize
both, but think about it. If I tell you, here's a map, here are chains that
you care about, but I don't tell you how they're connected -- versus, here's a
map, here are chains that you don't care about, but I'll show you how they're
connected -- what do you prefer?
So hopefully you agree that coverage is more important than connectivity. So
this is our primary objective, and connectivity is our secondary. So the way
you would write it down: let kappa be the maximal coverage you can achieve
with coherent maps, and you try to find a map that's coherent and that
maximizes connectivity, given that coverage is already maximized.
You look skeptical. So the only problem is, this generates disconnected maps,
because coverage is a set function. So really, there's no reason ever to use
the same article in two different lines, because you don't get any extra
coverage from it.

Okay. So we had to introduce some slack. We're willing to sacrifice an
[indiscernible] fraction of the coverage if it tells us something about the
connectivity.
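Written down, the two-level objective she is describing might look like this
(tau is the coherence threshold and epsilon the slack; notation mine):

    \kappa \;=\; \max_{M \,:\, \text{every line of } M \text{ is } \tau\text{-coherent}} \mathrm{Cover}(M)

    \max_{M} \;\mathrm{Conn}(M) \quad \text{s.t.} \quad \text{every line of } M \text{ is } \tau\text{-coherent}, \;\; \mathrm{Cover}(M) \,\ge\, (1 - \varepsilon)\,\kappa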
So that's the map objective. Now, let me just give you a very high-level
overview of how to get good maps. Okay. So we start from a set of documents.
Next thing: remember I told you that coherence is a constraint, that we only
care about coherent chains? So we need to find a way to represent all the
candidate chains to be used in the map.

So what we do is encode all coherent chains as a graph, which we call the
coherence graph. Basically, each node here corresponds to a short coherent
chain, and edges between nodes correspond to [indiscernible] that still
remain coherent. It's a transitive property, so really, each path in this
graph is a coherent chain, okay?
Next thing you do is try to find a set of high-coverage chains in this graph,
okay -- so they're also coherent. Now, if you think about finding a path in
this graph, you're really looking for a path that maximizes some function of
the nodes visited, right? And somebody already solved this problem for us.
It's called orienteering, and it's a hard problem. But luckily, again, our
coverage notion is submodular -- that's what I was trying to tell you earlier
about diminishing returns, which didn't completely work. So we can use this
algorithm for submodular [indiscernible]. It's a nice little greedy that
gives us [indiscernible]. It's recursive and it has some approximation
guarantees.
So we know how to find a set of high-coverage chains in this graph that are
also coherent. Final step: we just have a local search step that tries to
increase connectivity without sacrificing coverage.
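Here is a rough sketch of those last two steps under the definitions above.
The greedy selection stands in for the recursive submodular-orienteering
algorithm she mentions, and every argument is an assumed input: candidate
paths from the coherence graph, plus the coverage and connectivity functions
sketched earlier.

    def pick_lines(candidates, k, coverage_fn):
        """Greedily pick k coherent candidate chains by marginal coverage --
        the standard heuristic for a submodular objective like this one."""
        chosen = []
        for _ in range(k):
            chosen.append(max(candidates,
                              key=lambda c: coverage_fn(chosen + [c])))
        return chosen

    def add_connectivity(lines, candidates, coverage_fn, connectivity_fn,
                         eps=0.05):
        """Local search: swap a line for a candidate whenever connectivity
        improves and coverage stays within (1 - eps) of the starting value."""
        kappa = coverage_fn(lines)
        improved = True
        while improved:
            improved = False
            for i in range(len(lines)):
                for c in candidates:
                    trial = lines[:i] + [c] + lines[i + 1:]
                    if (coverage_fn(trial) >= (1 - eps) * kappa and
                            connectivity_fn(trial) > connectivity_fn(lines)):
                        lines, improved = trial, True
        return lines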
Now, the perfect time -- yes?
>>: Sometimes you may have some sources that are reliable news sources, and
some sources may not be reliable. For some problems you may care about
reliability, but for other problems you may not. And sometimes you need to
make a trade-off between coherence, coverage, reliability. Can this approach
do those kinds of tradeoffs?
>> Dafna Shahaf: So here we just said it's The New York Times, we trust
everything they have to say. Which, yeah, has some limitations. We actually
came across this problem in the previous paper, where we were dealing with
blogs. So yeah, there were some really bad things we had to filter. But
here, it's nice -- it's The New York Times. We're going to come back to trust
when we talk about scientific papers. Yeah?
>>: So the whole process of formulating objectives and constraints seems
mostly creative, in the sense that you sort of look at the output and see
what's wrong. [indiscernible] feedback or problem pair-wise, or is that
outside of scope?

>> Dafna Shahaf: How can I quantify it, I mean, other than the user studies
again?

>>: So coming up with the coherence [indiscernible] and so on, it seems like
you sort of look at it and you say, what's wrong with this. So it seems it's
largely sort of looking at it, and it's a very subjective --

>> Dafna Shahaf: Banging my head against the wall for a few months coming up
with a notion that I like is a pretty good summary.
>>: So [indiscernible] users and then use the clicks or use pair wise?
Otherwise, [indiscernible].
>> Dafna Shahaf: So yeah. So one thing might be giving users a chain and
asking them: coherent, not coherent? If not coherent, is there a small change
you can make to make it more coherent? And you'll see through some -- I don't
know if it's going to work, but just sift through some local gradient that you
can follow. Like, what would make this more coherent, or is it just beyond
repair?
But for now, actually, that's pretty much what I've been doing so far: just
trying to come up with objectives. The thing is, there are too many possible
chains out there to do the standard machine learning -- just give me a couple
chains that are good and a couple that are bad; it's not going to work. So
maybe feedback on a finer level. Does that sort of answer it? I need to
think about it some more.
Okay. So this is the algorithm. Now, just let me show you an example of what
a map looks like. This is the real Greece map, not the simplified version you
saw earlier. And you see there's a line about the deficit-cutting plan --
they have to make cuts, they're rated junk, and so on. Next you have a line
about strikes and riots, next you have a line about Germany, and finally a
tiny line about the IMF coming out at the end.
So this is what the maps look like and this is really sorted chronologically.
Next, how do you know that those maps are any good? So again, a very
high-level overview of the user study here. We again had the New York Times
data set, this time slightly newer -- 18,000 articles or so. And we tried to
see what maps are good for, really.
So the first thing we tested is what we call micro-knowledge. Okay. So, just
using maps as information retrieval tools: suppose the user has some questions
in mind, like who's the prime minister of Greece -- is the map any good for
helping them locate the answer faster? And it did show some improvement, but
it was minor; compared to the competitors, we didn't really do a lot better.
Definitely not statistically significant. And like some people told us in the
study: if I wanted to know the answer to this question, I would just search
for it. There's really no need to go looking at a map.
So the second thing we tried is what we called macro-knowledge: seeing if the
map can help people understand the big picture. And how do you test this? So
we decided the way to see if somebody really understands a story is to see if
they can explain it to somebody else. Just think about the last time you
TA'd. So we asked people to look at the map, or look at the competitors, and
give a one-paragraph summary of the story.
And then we threw the paragraphs on Mechanical Turk and asked: here are two
paragraphs, one of them generated by map users, the other by competitor users.
Which one tells a more coherent and complete version of the story?
Okay. And here are the results. For the Greece debt crisis, 72% of the
Turkers preferred our map paragraphs. It looked less good for the Haiti
earthquake, where we only got 59%. Then I actually had to go look at the
paragraphs. And although the map did have a lot of other aspects of the
story -- like, what was it, some kidnapped orphans, and some temporary
immigration laws established in the U.S. to help -- the users that summarized
those paragraphs just followed the main story line, okay: earthquake, lots of
damage, distributing aid, and so on.
So our conclusion for now, the bottom line, is that maps are useful for those
macro-summaries, as tools to understand the big picture -- especially for
stories that are complex, versus stories that you can feel have a single
dominant story line. Yes?
>>: Are you going to do more on this study? I was going to ask a question,
but I can wait.

>> Dafna Shahaf: No, I'm going to switch to the next.

>>: So what were the competitors here?
>> Dafna Shahaf: So there was Google news again, just Greece debt crisis or
whatever they wanted to type and just read the first, I think, up to five pages
or so. And second was TBT that I talked about earlier.
>>: And were Greece and Haiti the only two --
>> Dafna Shahaf: Greece, Haiti and Chile -- the miners trapped underground.
Again, a pretty small-scale study. There's a limited number of undergrads
that can be convinced to come for pizza.
>>: So since it's just these three, did you look at the maps that were
generated for Haiti and Chile and see if they sort of were as high quality as
the Greece map?
>> Dafna Shahaf: Actually, again, I think they were good quality, but people
did not care about the side stories as much. Seems like when you talk about an
earthquake, there's just the main story line and everything else is
distractions. While in Greece, they somehow liked other lines more.
>>: So you assume it's more because of the topic, not because of the quality
of the map?
>> Dafna Shahaf: Yeah. At least that's what I think. Okay. So we know how
to construct maps, I hope. Now, how do you adapt it to science? And
[indiscernible] you're supposed to stop and ask: wait, why even bother
adapting it? These techniques should still work, right, for scientific
papers -- why have you been changing them? And you'd actually be right.
Those things work out of the box.

The real nice thing is that science just gives us all this wonderful
additional structure, in the form of the citation graph. And maybe we can do
something smarter with all this extra information.
So let me just walk you through how I would modify the maps. Okay. So this
is a quick summary of what you saw so far. We have three objectives:
coherence, coverage, connectivity. And this is what we did in the news
domain. Now, let's just go one by one, and I'll tell you how I would modify
each for scientific papers. So coherence first. Hopefully you remember this
slide; this used to be our coherence objective. And if you take a close look
at it, there are really two main ideas going on. One is computing the
influence of words per transition, and the other is choosing a small set of
words that capture a story well.
And the second thing still works, but how about computing influence? Remember
how I told you there are lots of influence notions in the literature, but we
can't use them because we don't have edges? Well, now we do have edges. We
have all those people citing each other and really telling us who influenced
them. So maybe we can use that.
So we're going to change influence. And the idea is that we want to capture
the way ideas travel in the scientific literature. So when you write a paper,
your ideas are influenced by your previous work, by the papers you cite,
hopefully with some novelty involved. We use this notion of influence from
Beyond Keyword Search, from KDD '11, and the idea, again briefly, is that for
each word, you construct a graph. Nodes are papers. And edges mean either
citation or same authors.

And [indiscernible] where the idea came from, because for each edge you come
up with, what's the chance that this word -- the one the graph is constructed
for -- came from Q or from R, or it's something novel. Then you use this
graph: in Beyond Keyword Search, they define direct influence, which is just
the probability that paper P2 got this idea directly from P1, okay. Maybe
through a bunch of other papers in the middle, but it originated at P1.
And when we use this notion of influence, just plug it into our coherence
notion, it really limits the type of chains we can hope to look for, right,
because it will only give you chains of papers that directly influence each
other, that build on top of each other -- usually from the same research
group. So it's not as interesting.

Therefore, we replace it with this notion of ancestor influence. We don't
care if P1 directly influenced P2, as long as they both got it from a common
source, okay, from a common ancestor. And this gives us some nicer chains.
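A minimal sketch of one reading of ancestor influence: score a pair of papers
by the best common ancestor that directly influenced both of them on that
word. The max-min combination is my assumption, and direct_infl stands in for
the Beyond Keyword Search-style direct-influence score, which is not
implemented here.

    def ancestor_influence(direct_infl, papers, p1, p2, w):
        """How strongly are p1 and p2 related through word w via some
        common ancestor a? direct_infl(a, p, w) is an assumed input."""
        return max((min(direct_infl(a, p1, w), direct_infl(a, p2, w))
                    for a in papers), default=0.0)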
By the way, if there are [indiscernible] people in the audience, I could use
some help coming up with better algorithms for this. Okay. So that's
influence. How would you change coverage? Now, all I really wanted you to
know about coverage is that our original notion just covered concepts:
documents cover concepts.
Well, that's not really good enough in the scientific domain, because really,
words are not enough. Think about those two papers: SVM in Oracle Database
versus support vector machines in a relational database. They have very
similar content. Okay. This is their content, and you see SVM, data,
database, performance, efficiency. But they had very different impact.
Again, if you look here, this is the cloud of the papers citing them, okay.
So if a paper cited you, we put its authors and venues in this cloud. And
what you should see here is, first of all, the paper on the left affected more
authors and venues, just because there are more words here. Also, despite the
fact they're solving the same problem, they're barely related -- there's very
little intersection. I think there's only a single paper citing both of them.
Okay. So we decided that in the scientific domain, instead of covering words,
we want to cover the papers themselves. So a paper will cover the papers that
it had a big impact on. If you think about it, what I'm saying is that a
high-coverage map is a small set of documents that together had impact on a
large chunk of the corpus.

And some people might think that descendants are counterintuitive, right?
Because how can a paper cover future contributions -- like, how can number
theory papers cover [indiscernible]? But when you think about it, really,
looking at ancestors only gives you some idea of the context where the paper
was written, while if you look at the descendants, you can really get the gist
of what the contribution was.

Okay. So we're covering papers instead of concepts.
Last thing: connectivity. Previously, we just counted the number of lines
that intersect. And it can work. This is a detail of the map about support
vector machines, and you can see there is a line here about large-scale SVMs
and a line about multiclass SVMs, and they both intersect at a paper about
large-scale multiclass SVMs. So sometimes this notion does work. But more
often, it doesn't.
Because really, in scientific papers, there's a rich palette of interaction
possibilities. You might cite a paper for many reasons. And again, coherence
works against us all the time. Because, see the blue line here? It's a
coherent chain about linear classifiers, perceptrons, SVMs and kernel SVMs.
And the orange chain is about SVM applications to vision: facial detection,
facial recognition. And there's not a single paper that can comfortably fit
in both chains, right? You can't really get them to intersect and remain
coherent.
But they're clearly related, right? All those vision papers cite the theory
papers. You with me? So we decided to reward chains not just for direct
intersection, but also for having high impact on one another.
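A sketch of how that relaxed connectivity might look, with impact(p, q) as an
assumed pairwise influence score (for example, the ancestor-influence sketch
above) and the threshold as a free knob:

    from itertools import combinations

    def science_connectivity(lines, impact, threshold):
        """Two lines count as connected if they share a paper, or if some
        paper on one line had high enough impact on a paper on the other."""
        def related(a, b):
            return bool(set(a) & set(b)) or any(
                impact(p, q) >= threshold for p in a for q in b)
        return sum(1 for a, b in combinations(lines, 2) if related(a, b))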
So just to show you what this results in, this is a map about reinforcement
learning.
Let me zoom in. The first line is about MDPs, POMDPs, something called EMDPs.
And you can see how it affected a line about coordination and cooperation in
multiagent systems. You can see here that this paper cites this one; they say
that POMDPs extend to [indiscernible]. And you can also see how the MDP line
affected this line about robotic arm movement.

At the other end of the map, there's a line about the exploration-exploitation
dilemma and bandit problems, and you can see how it interacts with this line
about analysis and bounds of reinforcement learning.
>>: So where is the text in these speech bubbles from?

>> Dafna Shahaf: Oh, so this is actually a direct citation, not just some
couple-of-levels impact. This is the text around the citation, [indiscernible]
with the limitations of PDF text extraction.
Okay. So we know how to adapt maps to the scientific literature. One last
thing, and this is a user study; I'm actually going into a little bit more
detail. How do you evaluate those? Evaluating maps for science is really
tricky. First of all, you can't do double blind -- there's no way, the output
is just too unique. Which means you need to get a group of users, ask them
all the same questions, let half of them play with maps, the other half with
competitors.

This means you have to find a research domain the entire group can
understand -- they need to read those papers -- and also they must not be
experts in it in advance. So we chose reinforcement learning, and we
constructed maps over the ACM corpus, about 35,000 papers.
Let me just tell you how the user study went. So we had people stepping into
my office, and I told them to pretend they're a first-year grad student, all
excited about doing a project in reinforcement learning. They step into the
professor's office going, yes, teach me everything you know about
reinforcement learning. And the professor gives them a survey paper to read.

Now, the last survey paper I know of in reinforcement learning that is
actually fitted for a first-year grad was written in 1996. So their task was
really to update it: to find some more recent research directions and some
relevant papers to fit into this new survey. And they're given 40 minutes,
which is not a whole lot of time -- just to simulate a quick first pass over
the data.
They could use Google Scholar, or they could use our maps and Google Scholar.
They're given no instructions whatsoever; they just stumble upon this thing.
And we also have two baselines, which are the map itself and the Wikipedia
entry.
We took snapshots of their progress and we recorded their browsing history.
Next thing we did, first of all, we ended up with 30 participants. We had to
get rid of four of them that didn't quite understand the task and wrote me
really nice essays about reinforcement learning. And then we took the papers
that all of the people mentioned. We combined them into one really long list
and we sent this to a judge who is an expert in the area.
And the judge had to, for every paper -- they didn't know where it came
from -- tell me relevant, irrelevant or seminal. Also, since the users put
every paper under some research direction category, it had labels, and the
judge also told me whether the label is good or bad, okay?
So, precision. All we could get out of it were the blue lines -- the green
lines -- and we're doing better both in the score of the papers and the score
of the labels, okay? If you want to know about the baselines, Wikipedia did
quite poorly, really. First of all, it had 15 citations and only four of them
qualified for what we were looking for, meaning research papers written after
1996. Now, out of those four, only two were deemed relevant. Although, in
Wikipedia's defense, a lot of those references were books that could have been
useful for our hypothetical first-year grad student.
Now, the map -- it is a bit harder to compare the map, just because there were
more papers, but just to give you the flavor: there were 45 papers. Seven of
them were deemed seminal and another 21 relevant. Interestingly enough, many
of the irrelevant papers seemed like they were used to bridge between two
relevant papers, just to form a chain.
Also, it was somewhat concerning that the map has all those seminal papers and
users didn't quite see all of them. They didn't list all of them. So there's
definitely some research going into this area of how to show people what's
important in the map. Actually, I guess, how to know what's important in the
map first.
And the last thing is recall. Because it's really nice that they get papers
that are relevant, but are they completely overlooking some really important
research direction? So we composed a list of the top ten areas of
reinforcement learning, and we just computed the fraction of areas that each
user found, and again we're [indiscernible] Google Scholar users alone.
And at the end of the experiment, we asked people to just tell us what they
thought. So, to summarize: they thought that the maps were helpful in
noticing directions that they didn't know about, and a useful way to get a
basic idea of what science is up to. And a lot of their negative comments can
be chalked up to my [indiscernible] skills, frankly -- things like, the legend
is confusing, or it's hard to understand from the paper title alone.
Okay. Just to remind you where we're headed: the direction is still the same.
We're trying to build issue maps automatically for every query you have in
mind. And one thing I think could really add a lot to issue maps is this
interactive component, okay -- this personalization. There are so many ways
you can interact with this, right? You can zoom into something you care
about, or zoom out. Or maybe, like the chains: I want to know more about
Germany's role in the debt crisis -- just increase the importance of that
word.
Another thing I was playing with recently was having a map that reflects what
you already know, your background. Because if I search for reinforcement
learning, versus some expert, we're really looking for different things. So
maybe just give the map as input, or [indiscernible] text file: hey, these are
the papers I know about, can you use that in the query somehow? I think it
could really make it much more useful.
Okay. One more thing. How am I doing on time? Hm? Do I have time for a
one-minute demo?

>>: Absolutely.
>> Dafna Shahaf: Just to show you -- and I don't have internet access here
for some reason, so you're going to have to survive with what I did at the
hotel. This is our site currently. You see, there's a map. Okay. I can do
both. And you can click on an article and you can read the article -- like I
said, not so much Greek skills -- and wait. Yeah. Anyway, that's what I
wanted to show you. We have a website. We hope to launch it soon, after we
finish fighting all the HTML5 kinks, and then I guess I'm going to get a lot
more data and see what works and what doesn't work.
Okay. Conclusions. A huge chunk was, like I said, just formulating those
metrics, just coming up with good objectives: what's coherence, what's
coverage, how do you measure connectivity. And then coming up with efficient
methods to actually compute them, with some theoretical guarantees.
We have some user studies that highlight the potential of these methods, and
the website is on the way, hopefully soon. Now, if there's one thing I want
you to take out of this talk, it's probably this one, okay? Search engines
are great, but sometimes you need more than that. Sometimes you have more
complex information needs, and then hopefully you're going to use the maps.

Thank you. Now, if you have any questions.
>>: Did you consider [indiscernible] doing, like, automatic evaluations?
Your mention of surveys seems like an obvious one, where you can take the
papers from before the survey and see if your recall covers basically what's
cited in the survey, by generating from the papers beforehand -- and even, in
the maps, separating into the sections prevalent in the survey.

>> Dafna Shahaf: I lost you at some point.

>>: You can use the survey as a point of evaluation, right? From the papers
beforehand, do you recover the seminal papers by doing automatic analysis?
Do you segment them in the way the survey segments them? Are you assuming
they're available for --

>> Dafna Shahaf: I was looking at it. I was looking at planning surveys.
There are surveys that have completely different ways of segmenting this, and
they mention different papers. So maybe I should go, like, one level higher,
and just look if they find the good authors. I guess this is less
controversial. Maybe. But no, I guess you can use surveys, yes. I want to
write this down. I actually like it. Yeah, anyway, we haven't done this.
Yes?
>>: Can you remind us one more time the definition of connectivity for your
purposes?
>> Dafna Shahaf: Yes, it was -- okay. For the first one, it was the easy one.
Just every two lines that intersect, you get a point.
>>: So in other words, another way to put it is it's simply the number of
edges in the intersection graph, where the nodes are lines? So why don't you
use actual connectivity -- is this graph connected?
>> Dafna Shahaf: Because sometimes it just, it really can't be connected
because of what I showed you in the scientific domain. Some lines are just,
especially when the query's wide, some lines are just too out there. They
can't be connected.
>>: Why don't you use the number of [indiscernible]? Because say you have
ten lines: you could have each connected to the next so they form a chain,
and alternatively, four of them could have nine edges together and all the
rest are separate.
>> Dafna Shahaf: You can use the number of connected components. But it's
also interesting to see how the things interconnect, how the components are
connected, right?

>>: Right, right, but it seems just the number of connections --

>> Dafna Shahaf: Yes, that would work.

>>: -- doesn't quite capture --

>> Dafna Shahaf: Since we're doing a local search, this is actually the
easiest objective to change, right? [inaudible]. Yes?
>>: I want to [indiscernible] on Paul's suggestion earlier. When we were
talking, I was thinking, well, here are the [indiscernible]. You could have
as a unit a paragraph, right? So you take all the paragraphs and you shuffle
them and you [indiscernible], and you assume the paper is the [indiscernible]
story. And now you can see how all the papers [indiscernible] hopefully,
basically, score with the different measures that you have.

>> Dafna Shahaf: Yes.
>>: If they score well, it would be actually, you know, a recognition that
it's a [indiscernible], assuming that the paper has some coherence, coverage.
>> Dafna Shahaf: Which is a [indiscernible] assumption for most authors.
Actually, Eric, remember we talked about it at some point? Yes, this is
somewhere on my to-do list -- which is long, I admit. And I think a single
paragraph is not enough for you to actually see that two paragraphs are
coherent, but maybe, I don't know, a third of the article would be good
enough. So yes, this is something we've been --

>>: The second question is, there seems to be an assumption that all the
items are somewhat comparable. But you could imagine feature axes. One would
be, let's say, [indiscernible] versus [indiscernible]. Another one would be
survey-like and [indiscernible]. And so you could either restrict to some of
these axes and see how the stories differ, or you could go for diversity
between these things, or not consider them at all [inaudible].
>> Dafna Shahaf: I really like that question. So this is where I think
personalization can come in really useful. Because then -- well, not on the
axes you were talking about, but you can definitely bias towards Republican or
Democrat. When you use the words "Obama care," it's typically clear what you
think about it. And I was just talking this morning about maybe I should try
doing this: compute a map for the same query, one from the New York Times and
one from the Wall Street Journal or something else, and see how they differ or
give different points of view. [indiscernible] how to find the other axes
that you're talking about, high level/low level.
>>: Well, it's [indiscernible].
>> Dafna Shahaf: That's doable, yeah. I guess technically you can do it with
personalization as well, right -- just increase the weight of some words that
are very charged. Why would you want that, by the way?
>>: Oh, at least want to [indiscernible].
>> Dafna Shahaf: That's a perfectly good reason. Okay. Those are really
interesting, especially if you try to connect chains between end points that
are not really connected -- you really get something like a conspiracy theory
generator, right, because the most coherent story is not very coherent.

>>: You can imagine how a propaganda machine [indiscernible] just connections
between the gypsies of Europe, now, with the downfall of the banks.
>>: So anyway, any other comments or questions? Okay. Thanks very much,
Dafna.