>> Michel Galley: So … let’s get started. So as a former student of Kathy McKeown, it’s my immense
pleasure to welcome her to Microsoft Research. So … there is no need, of course, to introduce Kathy,
but she’s done a lot of work in generation, summarization, and question answering. So she’s a professor
at Columbia University and the director of the newly founded Institute for Data Sciences and Engineering.
And so today, she’ll be talking about natural language applications from fact to fiction. Kathy?
>> Kathleen McKeown: Okay, thanks Michel. So I’m going … [coughs] sorry. I have a cold, so I hope I
won’t cough too much. I’m going to be talking today about work that’s been done in my group over
about a ten-year period, where we started looking at data that was primarily about the world—fact—
and we have moved sort of full circle and have recently been looking at work that is fiction. So I’ll be
talking today about the work of some of my current students, who I have here. I want to acknowledge
them upfront, but you’ll see occasionally they’ll pop up at different points during the talk. It also builds
on the work of my past students, who are situated at various places around the country and actually,
around the world.
So I’m going to be talking about work that we’ve done with data that falls along a continuum from fact
to fiction. So at the top, you see texts—kinds of texts that are more factual in nature: they refer to
things that either happened in the world or they report on fact as we know from scientific experiment—
and as we move down the continuum, we come to data which is more subjective in nature. We’re likely
to get things like opinions expressed; it’s likely to be a bit more informal; and when we get to the
bottom of the continuum, then we have data that’s actually not about the real world at all in any way
and has to do more with what people have written. So one could argue about whether scientific journal
articles or news are more factual, but in order to align the continuum with the order in which we’ve
done the work, I’m going to put news first. And if we look at the different kinds of genres that we have
here, one of the things that we can see is that the language of genres is remarkably different. I’m going
to be using a lot of examples from references to Hurricane Sandy, which for us in New York, was a pretty
big deal and we still continue to talk about it.
If you look at all that comes down, you can see that it’s quite different, and even from the language
that’s used, you can probably identify which genre is which, but let’s just try it as we go through. So you
can call out whether it’s social media, scientific journal, news, or novel. So the first one: what do people
think? [murmurs] Social media. The second one? [murmurs] Scientific paper. The third one? [murmurs]
News, yes. And the fourth? [murmurs] Novels. ‘Kay? Now this, just in overview before we get into the
talk itself, is a word cloud of the kind of work that we’ve done in my group, and so you can see
summarization is there right front-and-center—that’s been a lot of the focus of our work—this comes …
this word cloud was created from the abstracts of my group’s papers over the last … I think it was about
fifteen years. In addition to summarization, we can see some other things, like generation; we’ve done
some multilingual work, so translation shows up; a little bit of work with patient … I’m not sure why
speech is so large—we … I guess that’s all due to Michel Galley, here in front.
So our vision is to be able to generate presentations that connect events across these
different sources. So to be able to connect information about events, opinions about those events,
personal accounts about the same events and their impact on the world. We’d like to be able to link to
supporting science, and ultimately to link to fiction—that’s a little bit more in the future. So let’s start
with news, and as you all know, there’s been a big effort on news within the natural language
community—a lot of work within the Linguistic Data Consortium—on collecting the news, annotating it,
making it available to research groups to use for their research, and for that reason, a lot of the initial
progress, both in my own group and in other groups around the world, was on news. Certainly in my
group, a lot of the work that we’ve done has been on summarization, and shown here is a page out of
our Newsblaster system, which was developed … gee, almost fifteen years ago—a little bit less than
that—we made it go live in 2001. And this … here we are looking to generate summaries of multiple
articles about the same event. So here you see one on Hurricane Sandy, and as we saw, this is where
the initial example came from. “Hurricane Sandy churned about two hundred and ninety miles off the
mid-Atlantic coast Sunday night …” and so forth.
Now the first question you may ask is: why is summarization hard? Well, it seems to require both
interpretation and generation of text. Furthermore, it seems to require doing that in unrestricted
domain; we don’t know ahead of time what topic we’re gonna get text on and what kind of domain
information we need. We need to be able to handle those documents, though, robustly, and thus, it
seems that we need to be able to operate without full semantic interpretation. And in the
summarization field, this has led many researchers to use what is called sentence selection, where
sentences are selected out of the input documents on the basis of salience or importance, and then the
sentences are strung together to form the summary. And this is a way to get a system working quickly.
It also, though, can lead to some problems, where sentences placed side-by-side may create some
misconceptions, or they may have missing information. For example, who is being referred to in the
article may be very clear, but when you put it in the summary, it’s not. So our approach at Columbia has
been to use sentence selection, but then to edit the selected sentences. And the kinds of things that we
worked on have been to correct references to people that are infelicitous. So the first time you have a
reference to a person, you’d like to have a full reference so that we can understand who the person is,
and if we continue referring to that person throughout the summary, then we might have a pronoun. So
this was work done by Ani Nenkova.
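As a rough illustration of that reference-editing idea (not the actual Nenkova system), here is a minimal Python sketch that keeps the full, descriptive reference at a person’s first mention in the summary and shortens later mentions; the names and the single shortening rule are invented for the example.

```python
# Minimal sketch of reference editing: first mention gets the full reference,
# later mentions get a shortened form. Names and the rule are illustrative.
import re

def rewrite_references(sentences, full_ref, short_ref):
    """E.g. full_ref='President Barack Obama', short_ref='Obama'."""
    seen = False
    pattern = re.compile(re.escape(full_ref) + "|" + re.escape(short_ref))
    out = []
    for sent in sentences:
        def repl(_match):
            nonlocal seen
            if not seen:
                seen = True
                return full_ref   # first mention: full, informative reference
            return short_ref      # subsequent mentions: shortened form
        out.append(pattern.sub(repl, sent))
    return out

summary = ["Obama toured the flooded neighborhoods on Tuesday.",
           "President Barack Obama promised federal aid."]
print(rewrite_references(summary, "President Barack Obama", "Obama"))
```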
Compression is still a big topic, and perhaps even a bigger topic now, because when we extract
sentences—especially from news—they can be quite long, and so what we’d like to do is to be able to
remove extraneous material from those sentences to make them shorter so we have a concise
summary. And particularly, if we think now about doing summarization on mobile devices, we might
like something quite short—just a single sentence—to appear there. This is work that we’ve … an area
that we’ve worked in for quite a while from Hongyan Jing in 2000, from Michel Galley in his dissertation,
and it’s something that we’re continuing to work on now. We … given that a lot of the work that we’ve
done has been done in a multilingual environment, and we’ve had to generate either summaries or
answers from translated texts which are often disfluent—as a matter of fact, the first … yes … the first
time we began working in this area, and we saw what we were going to have to generate answers from,
we were like, “are you kidding?” But of course, things have gotten a lot better; nonetheless, when we
do summarization or question answering from translated text, we have a lot more context from the task
that was not available at the time of translation, and we use that information to make fluent sentences
from disfluent translated ones. And then, we’ve been doing work on generating new sentences by
selecting phrases from the input sentences through fusion. Now this is walking a fine line: it’s easy to
make a good sentence bad. If you simply extract a sentence, at least somebody has written it. There has been quite a bit of
work though now in the field of text to text generation, and I won’t go through it, but you can see that a
lot of different people have been looking at this problem.
In our current work, which is the work of Kapil Thadani, we’re modeling text transformation as a
structured prediction problem. So the input is one or more sentences with parses, and the output is a
single sentence with a parse. And we’re doing it with multi-view structured prediction, so we can take
different kinds of information into account at the same time—simultaneously—as we re-word the
sentence. So we’re doing joint inference over word choice—constraints about word choice—using
information from n-gram ordering and dependency structure. We do this in a supervised fashion; so we
start with a data set for compression, and you can see here one of the very long sentences is in input,
and then in output, we have removed all of the information in white—so, “flying towards Europe
yesterday” which may not be as important for this summary as just the fact that air force fighters were
scrambling to intercept a Libyan airliner. So the framework that has been developed can be used for
different kinds of text to text generation tasks. It can be used for sentence compression: here the input
is a single sentence, and the output will be a shorter sentence with salient information. And we use as
data the kind of data that I showed in the previous slide—so we have a lot of cases where we have a
longer sentence and shorter sentences, and we’re using a baseline from Clarke and Lapata in 2008,
who’ve done a lot of work. And you can see here the two different perspectives on the sentence: the n-grams, which are shown in yellow—so this shows you the pairs of words which are likely to occur
together—and dependency structure—so we take dependency relations between pairs of words and
these also form constraints on the output.
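To make the joint-inference idea concrete, here is a hedged sketch in Python using the PuLP ILP library: binary variables decide which tokens to keep, the objective sums per-token salience, and constraints enforce a length budget and that a word is kept only if its dependency head is kept. The salience scores and the tiny parse are invented, and this is not Thadani’s actual model.

```python
# Hedged sketch of joint inference for compression: keep/drop decisions over
# tokens, maximizing made-up salience scores under dependency and length
# constraints. Requires the PuLP library (pip install pulp).
import pulp

tokens = ["Fighters", "flying", "toward", "Europe", "scrambled",
          "to", "intercept", "a", "Libyan", "airliner"]
salience = [0.9, 0.2, 0.1, 0.1, 0.8, 0.6, 0.8, 0.5, 0.7, 0.9]
# head[i] = index of token i's syntactic head (None for the root); illustrative parse.
head = [4, 0, 1, 2, None, 6, 4, 9, 9, 6]
max_len = 7

prob = pulp.LpProblem("compression", pulp.LpMaximize)
keep = [pulp.LpVariable(f"keep_{i}", cat="Binary") for i in range(len(tokens))]

prob += pulp.lpSum(salience[i] * keep[i] for i in range(len(tokens)))  # objective
prob += pulp.lpSum(keep) <= max_len                                    # length budget
for i, h in enumerate(head):
    if h is not None:
        prob += keep[i] <= keep[h]   # a dependent needs its head in the output

prob.solve(pulp.PULP_CBC_CMD(msg=0))
print(" ".join(t for t, k in zip(tokens, keep) if k.value() == 1))
```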
We can take the same framework and use it for sentence fusion, simply by changing the data set and by
changing some of the features that we use. So here, the input is multiple sentences, and the output will
be a sentence with common information from both. So it’s a way of taking multiple sentences and
finding what is most salient, because it has been repeated across sentences. We use here a data set
that is created from summarization evaluations—I don’t know if people are aware of the pyramid data,
which has been used for evaluating summaries. But here, we have a case where we have many different
summaries of the same data, and it’s been coded down to the phrase level of what is common across
human summaries, and therefore should appear in an output. And we use that for the … for training our
fusion. And then we have changed the features in some way, so we have some fusion-specific
features—for example, repetition may be one of them. And you can see an example here where we
take a part of the first sentence, another part of the second sentence, go back to the first, and end up at
the second to get a sentence like: “Six years later, independent booksellers’ market share was seventeen
percent.”
We’re also using sentence fusion for machine translation, and here we sit at the end of a pipeline, where
different systems have done translations—we’ve been doing this in a joint project with Martha Palmer,
Kevin Knight, Dan Gildea, and Nianwen Xue, but I’m showing it here with translations that are available
off the web, and in fact, we do often experiment with that. So you have a Chinese sentence at the top,
the reference sentence so you can see what should have been translated, and then you see three
different translations by online engines: Google, Bing, and Systran. And no one of them is perfect, but
nonetheless, there are phrases in each of them that are good, and so our goal is to be able to fuse these
three sentences and to take phrases which can be used and improve the final output. So we do two
types of sentence-level fusion. One is what’s called a sentence-level combination, and that’s where we
look at the output, and we want to choose the sentence that is best. For that, we have developed a
structural language model that we can use on the translations—this is a supertag language model—and
so it can capture sort of long-distance dependencies between the phrases, and we use this to rank the
different outputs and choose the one that is best … followed by this. Now, that serves as our … what
we’re calling the backbone sentence, and now we use phrase-level combination, given this best
sentence, to pick the phrases from the different other translations that we have. And we use a variety
of different kinds of feature-based scoring functions; we do look at consensus in the different
translations on phrases; we have syntactic indicators on whether a phrase works well; and then we have
information about the semantic role labels between the source and the target translation. So if we go
back to this combination, we’re first going to choose one of them as the best using the language model,
and this would be the Systran one, and now we’ll choose phrases from each of them to decide how to
put it together. And we use a paraphrase lattice, which shows the paraphrases between the different
systems—I’m showing just this part of it, where we’ve got translations from each of them—and we’re
going to choose the top as the best sort of n-gram as we go through, and as we go through the second
part of it, “is unlikely to”—we’ll choose this, because it gives us the best … the syntactic features show it
to be the best in that case.
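A minimal sketch of the backbone-selection step follows. The talk uses a supertag-based structural language model to rank the candidate translations; here a simple n-gram consensus score stands in for it, just to show the shape of the pipeline, and the candidate strings are placeholders.

```python
# Backbone selection for system combination, with n-gram consensus standing in
# for the supertag language model used in the actual work.
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def consensus_score(candidate, others, n=2):
    """Fraction of the candidate's n-grams supported by the other systems."""
    cand = ngrams(candidate.split(), n)
    if not cand:
        return 0.0
    support = Counter(g for o in others for g in ngrams(o.split(), n))
    return sum(1 for g in cand if support[g] > 0) / len(cand)

candidates = [            # placeholder outputs from three systems
    "the meeting is unlikely to be held this week",
    "meeting unlikely to hold in this week",
    "the meeting is unlikely to hold this week",
]
scores = [consensus_score(c, [o for o in candidates if o is not c])
          for c in candidates]
backbone = candidates[max(range(len(candidates)), key=scores.__getitem__)]
print(backbone)   # phrase-level combination would then edit this backbone
```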
So what have we learned from this work? With monolingual compression, we get a five percent
increase in n-gram recall when we use joint inference with dependency relations over previous
baselines. In the case of multilingual fusion, we get an increase of one BLEU point for combined
MT output. Of course, things can go wrong. As I said, we’re not relying on how people wrote the input,
and in fact, one of the things that often goes wrong is when we have multiple people from the same
family who appear in the news. So this example is from way back when—when Newsblaster was first
picked up by the press and was appearing in the press—and of course, the journalists wanted to find
everything that was wrong with it. And on the death of the Queen Mother in England, Newsblaster had
Queen Elizabeth attending her own funeral when we re-wrote the references.
‘Kay, so let’s turn now to scientific journal articles. We’re dealing with a number of different kinds of
articles, some are from Nature—they tend to be mostly scientific—we have a number from Elsevier as
well, but it’s full texts. The first thing that you might ask is: how are articles different? We saw that the
language of the genre was different, and we can see right away, just from the beginning page of it, that
we have some structured information that we didn’t have before: we have titles, authors, and
abstracts—as shown here below—we have citations, typically; and of course the language is different.
So one of the first things we can see that we can get out of a set of scientific articles that we can’t get
elsewhere is a citation network. So if we’re working with this group, we can start at the bottom and see
that in fact from Yarowsky, quite a bit of work came out of that same genre. This is all citing back to
paper … other papers which eventually cite Yarowsky on the same topic. If we look in at the text, we
can see that we can get things out of the language of the text which is not available otherwise. So in the
project that we’re working on with scientific articles—Simone Teufel from Cambridge is working with us—
we can identify, for each sentence in the article, the purpose of that sentence in relation to the overall
article and in relation to how publications go. So here is the aim of the article—one of the points or
contributions that it’s trying to make. Given that there are citations within an article, we can also
discover the text around that citation and what kind of sentiment it has towards it: whether it’s aligning
itself with it, whether it’s positive, negative, or whether it’s simply … it’s objective. And so one of the
problems, of course, is identifying what scope of text around the citation refers to it, but also sentiment.
So here, this is a negative one: “we argue that this approach misses biologically important phenomena,”
and the citation would be in there where the dot, dot, dots are.
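A small illustrative sketch of building the citation network with networkx; the paper identifiers and edges below are invented, and real edges would come from parsing the references in the full-text articles.

```python
# Toy citation network. In-degree is a crude prominence signal; the project
# pairs graph features like this with indicators drawn from the full text.
import networkx as nx

citations = [                       # (citing paper, cited paper) -- invented
    ("paperA", "Yarowsky1995"),
    ("paperB", "Yarowsky1995"),
    ("paperC", "paperA"),
    ("paperC", "paperB"),
]
G = nx.DiGraph()
G.add_edges_from(citations)

print(sorted(G.in_degree(), key=lambda x: -x[1]))
```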
So in this project, we’re working on prediction of scientific impact. Our input is a term to represent a
concept in the field, and … or a document. We extract indicators from the full texts of documents that
are related to the term. So we first generate a set of documents that all have to do with a particular
concept, and then our goal is to predict the prominence of either the term or that particular scientific
article. And one of the kinds of features that we use is time series, and you can see here time series for
two different terms: “climate change” and “climate model.” You can see that “climate change” shows a
burst in this particular feature later than “climate model,” and continues to climb while “climate model”
goes down. So that’s just one of the many clues that we use for this, although our work is primarily
focused on indicators from full texts. As part of this project, we’re generati … we want to be able to
explain why the system came up with its prediction, and we do this in sort of two parts: one is a
summary which defines what the technical concept was that corresponds to the term, and then we also
do a justification of the prediction using the output from machine learning—so that’s an original
generation task with … where our input is basically the features or attributes that were used and how.
So this gives you two examples in “what is tissue engineering?”—this is, sort of, one of the terms that
we got as input—and this is done by doing query-focused summarization—over a very large number of
articles. We will typically have at least a thousand in our input, and from that, we
have to go down to a single sentence which will tell us what the concept is. For tissue engineering, this
gives you an idea of what our output looks like. This is a very small portion of it—it’s actually relatively
long, and we’re continuing to work on this. Here we have: “we predict a prominence of point seven
three.” We give information about the most prominent indicators, and then we provide some
description about each one of them, so “overall the sentiment in the set of documents is
overwhelmingly objective,” which …
>>: Sorry, so … question … so yeah. So if you were to compare the result that you got from
summarization of—you know—of this lot, compared with the Wikipedia summ … you know, kind of
active. Is there any preference for subjective … people to look for them [indiscernible] supposed to do is
summarize …
>> Kathleen McKeown: Yeah, we did not … when we run the system, we do not have access to … we
only have access to the scientific documents. But we could use that as a sort of ground truth for how
well it’s done. We actually do use Wikipedia to construct an ontology of the concepts that we’ll have
and we use that in the …
>>: After [indiscernible] these examples, I probably would think this is as good as Wikipedia.
>> Kathleen McKeown: Uh huh. Well, we haven’t measured it that way, but it would be a good way to
do it. I’ll just say a word about our summarization approach here. It’s an unsupervised approach; we
begin by selecting non-subjective sentences from the text, and we use the argumentative labels to do
that—so background sentences. We then rank by similarity and centrality in a similarity graph, so
sentences appear at the nodes, and they’ll be connected to sentences that are more lexically related,
and then we choose the most central of those, and then we re-rank the top candidates by using
definition-specific heuristics.
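A rough sketch of the centrality step, in the spirit of LexRank, assuming the subjective sentences have already been filtered out using the argumentative labels; the sentences and the crude definition heuristic below are illustrative.

```python
# Centrality ranking over a lexical similarity graph, then a simple
# definition-oriented re-ranking of the top candidates.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

background = [
    "Tissue engineering combines cells and scaffolds to restore tissue function.",
    "The field of tissue engineering emerged from biomaterials research.",
    "Scaffold porosity was varied across experimental conditions.",
    "Tissue engineering is the use of engineered materials to repair tissue.",
]

sim = cosine_similarity(TfidfVectorizer().fit_transform(background))
G = nx.from_numpy_array(sim)
centrality = nx.pagerank(G)
ranked = sorted(range(len(background)), key=lambda i: -centrality[i])

# Definition-specific re-ranking, crudely approximated by preferring copular
# "X is ..." sentences among the most central candidates.
top = ranked[:3]
definition = next((background[i] for i in top if " is " in background[i]),
                  background[ranked[0]])
print(definition)
```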
So what have we learned? Well, different forms of summarization are definitely needed when we
change genres. We want information about terms and justifications, and while I didn’t talk about it, we
have looked at this in medical scientific articles as well, where we needed work that was tailored to the
reader, whether a physician or a patient. We have information we can exploit which we did not have in
news: the structure of the article itself and the networks—so we have explicit networks based on
citations, and sentiment toward cited work plays a role.
So I’m gonna turn now to our work on online discussion forums, and this would be the kind of data that
we might have available to answer questions. Twitter is one source, but we’re also looking at online
discussion forums, where we would have a bit more information. So how is online
discussion different from the kinds of genres we looked at so far? First, it provides an unedited
perspective from the everyday person, so the language is going to be more informal. It’s often in the
form of dialogue, so we will have some back-and-forth about what people are saying. It contains a lot of
opinions, viewpoints, emotion. And of course, the language of social media is not the same as the
language of news.
We’re looking at being able to answer questions about events, so this would be—and we’re looking at,
sort of, these open-ended questions—this would be a case of what we also refer to as query-focused
summarization. So here’s one case. This was in the course of developing the system. One answer that
the system generated at one point in time—we don’t always get such nice answers.
>>: So this is actual system output as well?
>> Kathleen McKeown: This is. But it’s …
>>: That’s actually pretty cool. It’s almost like novel-style writing, you know? Like something literary …
>> Kathleen McKeown: Yes, but this is one where it chose a larger chunk. So a lot of it comes from the
same chunk, and I can’t claim that all system output looks like this. Here, what is the effect of Hurricane
Sandy on New York City? And you can see: “It’s dark. There’s minor price gouging. There are
restaurants selling hot food through their bay windows. The police are doing an amazing concern … job
with traffic concerns. Many stores have set up recharging stations.” That was actually kind of
interesting when you were there, because when you walk through that dark part, it looked like there
were fires and people were all huddled around them, and when you got close, you saw that it was
actually electrical outlets with … the glowing was from the phones. One of the things that is hard
here is that very often there’s no word overlap between the question and the answer. So we had
“Hurricane Sandy,” “New York City,” “effect”—none of that appears in the response, so we need to be
able to do some inference about how things are related.
So our approach has been to start with a small amount of manually-annotated seed data, where we
have query-sentence pairs—where the answers were … we used Amazon Mechanical Turk to get
sentences from the documents that were actually related—we then augmented this with nine years of
unlabeled data from Newsblaster, and we made the assumption that the summary headline was a query
about the event, and the summary was approximately an answer. Of course, it would contain many
sentences that are relevant to that query, and a small number of irrelevant sentences. And then we also
looked for query-answer pairs on sites like Yahoo Answers or Quora. Then we developed this semi-supervised method that used multiple simple classifiers, for example: keyword overlap, named entity
overlap, and we experimented with different kinds of semantic relatedness. We also have expanded
that earlier approach to use more unlabeled data that we can get on the web, and so we’ve gone to
features that we draw from DBpedia. So we have over one point eight billion facts that we extracted …
that were extracted from Wikipedia info boxes. These have been encoded in the semantic web, and you
can see we have there sort of triples here, and this lets us see relationships, for example, between
Hurricane Sandy and location, which helps us to determine—in some cases—relevance. So our goal
here, in moving forward, is to be able to decompose articles or online discussion into a main event—this
initial impact of the hurricane—and sub-events—what happened afterwards: the Manhattan blackout,
Breezy Point fire, public transit outage.
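A minimal sketch of the “multiple simple classifiers” idea for judging whether a sentence is relevant to an open-ended query; the overlap features and thresholds are illustrative, not the trained system.

```python
# Two weak relevance signals (keyword overlap, named-entity overlap) combined
# by a simple vote. A real system would learn the combination and add
# semantic-relatedness features, e.g. drawn from DBpedia triples.
def keyword_overlap(query, sentence):
    q = set(query.lower().split())
    s = set(sentence.lower().split())
    return len(q & s) / max(len(q), 1)

def entity_overlap(query_entities, sentence_entities):
    q, s = set(query_entities), set(sentence_entities)
    return len(q & s) / max(len(q), 1)

def is_relevant(query, sentence, q_ents, s_ents,
                kw_threshold=0.2, ent_threshold=0.5):
    votes = [keyword_overlap(query, sentence) >= kw_threshold,
             entity_overlap(q_ents, s_ents) >= ent_threshold]
    return sum(votes) >= 1

query = "What is the effect of Hurricane Sandy on New York City?"
sentence = "Lower Manhattan is dark and many stores set up recharging stations."
print(is_relevant(query, sentence,
                  q_ents=["Hurricane Sandy", "New York City"],
                  s_ents=["Lower Manhattan"]))
```

Run on the Sandy example from the talk, both overlap signals miss the relevant sentence and the function prints False, which is exactly the inference gap that the DBpedia-style relatedness features are meant to close.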
Of course, you may ask, if we’re constructing answers like this from online discussion, when
can we assume that an individual post—or pieces of it—is reliable enough to be able to answer a
question. One factor in this is influence, and that’s something else that we’re looking at in the context
of online discussion. So the research question is: we want to be able to detect online influencers; what
conversational features are important towards that task, and how can we identify situational influence
that is made apparent by the conversation, not by the links between who follows whom? So we
don’t particularly want to identify Justin Bieber, for example. So in our work, an influencer is somebody
whose opinions or ideas profoundly affect the listener or the reader. We’re doing some of this work in
discussion forums like Wikipedia discussions—so these are online discussions that take place between
Wikipedia editors, and there’s a lot of sort of back-and-forth about how editing should
take place. So here we can see we have a conversation: we have a person who makes an
initial post, a responder who replies to that, and so forth. And the discussion here is about
whether Ahmadinejad was lying about having served in the Iran-Iraq war, and he’s recommending
working a certain piece of information into Wikipedia. The woman replies, “That’s a very weak source.
I’d like to ignore it,” and the original poster agrees: “Thanks, I guess we’ll have to wait and see.” So here
the influencer is the woman.
We’re doing this with cascaded machine learning, so we have a number of features at the right, some of
which are fairly complex, and we need to learn them themselves, and then those pass up into a machine
learner for influence. So if we look at a couple of examples for dialogue patterns—let’s say we have a
structure of posts like we see on the left—our pattern—here, the feature irrelevance—would tell us that
posters … posts that have no replies, and which therefore seem irrelevant, are coming from people who
are less likely to be influencers—that would be our intuition, and it’s one feature—whereas someone
who takes the initiative or incites a lot of response would be more likely to be the influencer. Now,
agreement and disagreement is another factor in that, and this is something that we also worked on
fairly early with Michel Galley, and we’re continuing to look at. It’s hard; you can see here with
disagreement: “That’s a very weak source. I’d ignore it”—at no point in time does the speaker say no.
So we have developed a machine learning component which will look at the kinds of
features that we need in order to determine whether we’ve got agreement or
disagreement—sentiment plays a major role in that.
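An illustrative sketch of two of the dialogue-pattern features just mentioned (posts that draw no replies, and replies incited), computed over a toy thread; the real system cascades learned components such as agreement and disagreement into a final influence classifier.

```python
# Per-author dialogue-pattern features from thread structure. The thread
# itself is invented.
from collections import defaultdict

# (post_id, author, parent_post_id or None)
posts = [
    (1, "A", None),   # A starts the discussion
    (2, "B", 1),      # B replies to A
    (3, "A", 2),
    (4, "C", None),   # C's post draws no replies
]

replies = defaultdict(int)
for _, _, parent in posts:
    if parent is not None:
        replies[parent] += 1

features = defaultdict(lambda: {"posts": 0, "no_reply_posts": 0, "replies_received": 0})
for pid, author, _ in posts:
    f = features[author]
    f["posts"] += 1
    f["replies_received"] += replies[pid]   # responses the post incited
    if replies[pid] == 0:
        f["no_reply_posts"] += 1            # the "irrelevance" signal

for author, f in features.items():
    print(author, f)
```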
So what have we learned here? In our work so far, we do significantly better than a baseline, but
detecting influencers is really hard. If we look at F-measure, we … we’re still … we have a long way to
go. We can gain intuition about language use in social contexts, and so for Wikipedia, we found that
agreement is more useful than the dialogue patterns, but in some of the other online discussion that
we’ve looked at—some blogs—it’s the other way around. And we can validate some hypotheses we
have about which conversational features are more important for different genres. Yes?
>>: Are you ignoring the up-votes and down-votes in various discussion media, where viewers can give a
posting a plus- or a minus-vote? ‘Cause that’d be an easy one.
>> Kathleen McKeown: Yes. In the … in this particular case—the data that we’ve looked at—yes, we did
ignore that. We have gone on to look at … each of the forums is different in the kind of information
that you can get—right now, we’re looking at CreateDebate, which
is sort of interesting because in the data, you have pros and cons, and so from that, we can easily see
agreement and disagreement, and we can gather a lot of data about it. So follow-on commenters are
typically …
Okay. So going forward, the question that I showed is a kind of question that you might normally get
when you’re looking at the news. If we were looking at other kinds of questions that we might want to
pose when we’re looking at online discussion forums, they’re of a different nature—so what is the
reaction? And if we look into the blogs, we can see things like emotion plays a role: “I’m still speechless
at the widespread damage.” How do people expect Sandy to impact the election—here, we have a lot
of opinionated information: “I can only imagine how this will make the nightmare of voting even worse.”
And then, experiential questions, so it gives us the opportunity to look at a lot of different kinds of
language analysis than we do when we are looking at news.
So I’ll move now to our work on personal narrative. The Autobiography of Malcolm X would be an
example of the kind of thing, but not so long. We’re not looking at novels here, we’re looking at short,
online … But we’re looking at things that provide a coherent telling of a story; they have a component
that is particularly compelling—almost shocking—in terms of what happens, so it’s grabbing your
attention. Unlike the online discussion forums, we typically have a monologue, but like the online
discussion forums, we have informal language. So here is an example of what we would get, and we can
sort of chunk it into different areas. So in the front, we have some information that orients us as to
what was going on: “We were sitting down to a late-night dinner on Monday night, when the storm was
supposed to hit.” We then have sort of a sequence of events that is happening, and we end up with this
very compelling, almost shocking element to the story: “He went upstairs to get a tool, and in those few
seconds, ocean waves broke the steel door lock, and flooded the basement six feet high in minutes.”
So our goal in this work is to be able to look at these kind of narratives that occur online and predict
when a narrative is interesting. When will it go viral? Or when should it be selected as part of a larger
story? And we’re approaching the work initially by analyzing the narratives through William Labov’s
theory of narrative. And this theory proposes different structures that should be present in a narrative
which will be interesting. We expect to find background information; we expect to find a series of these
complicating actions; and most importantly, we expect to find this reportable event—this sort of really
compelling, shocking statement about what happened. I’m not gonna say much about how we’ve done;
we’re still at very early stages, but we have developed a supervised approach to be able to identify a
structure that is consistent with Labov’s theory, and we’re getting about seventy point four F-measure.
You’ll note that the kind of labels that we’re putting on structure—I don’t … if you’re familiar with the
discourse relations that come from the Penn Discourse Treebank—it’s different because we’re not
looking at relations between sentences, rather we’re labelling blocks, but we do use the relations from
the Penn Discourse Treebank to help us in that process.
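A toy sketch of supervised block labeling in the spirit of the Labov analysis, using scikit-learn; the training snippets and labels are invented, and the real system also draws on Penn Discourse Treebank relations as features.

```python
# Classify narrative blocks as orientation, complicating action, or the most
# reportable event. Training data here is invented and far too small to be
# meaningful; it only illustrates the setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_blocks = [
    "We were sitting down to a late-night dinner on Monday night.",
    "The wind picked up and water started coming in under the door.",
    "He went upstairs to get a tool.",
    "In those few seconds, ocean waves broke the steel door lock and flooded the basement.",
    "It was a quiet evening before the storm was supposed to hit.",
    "The waves tore the dock apart in front of us.",
]
labels = ["orientation", "complicating_action", "complicating_action",
          "most_reportable_event", "orientation", "most_reportable_event"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_blocks, labels)
print(clf.predict(["Waves broke through the door and flooded the basement in minutes."]))
```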
‘Kay, so let me close by looking at our work on narrative. And here, we have looked at a corpus of
novels that come from the nineteenth century—we chose them because they’re available online
through Google Books. If you look at the language of novels, again, you’ll see it’s
quite different. We do have a lot of conversation with different people speaking, and often talking to
each other. So here: “‘What is the matter?’ I cried. ‘A wreck! Close by!’ I sprung out of bed, and asked,
‘What wreck?’ ‘A schooner from Spain or Portugal, loaded with fruit and wine.’” Hey, so we’re using a
corpus of novels from the nineteenth century, and we’re working together with faculty from
comparative literature. And we looked at whether we could use the analysis of novels in order to decide
what theories of … literary theories are of interest—we discussed what collections we could work with
in order to provide evidence for or against them—and there has been quite a bit of work in the
comparative literature on these various theories, but using what is called a close read—one or two
novels in a lot of detail. So what we wanted to do here was use what is called a distant read—look at a
lot of novels to make our conclusion.
So we were looking at whether we could provide evidence for or against literary theory; we did this by
using social network extraction from literature, and as I said, this corpus of nineteenth-century British
literature, where the network was based on the conversations that happen in the quoted speech. So we want
to have a method where we identify who talks to whom, and then extract features from the graph to
evaluate hypotheses about literary theory. So each node would correspond to a character; a link will
come between them if they’re talking to each other. If we looked in literary theory, the kinds of things
that this particular person in comparative literature was interested in was a hypothesis that had been
developed that said that as the novel moved from rural settings, with a very small number of characters,
to the cities, with a very large number of characters, then the network tends to be less connected. And
you can see this in quotes from various people: Franco Moretti, who said, “At ten or twenty characters,
it’s possible to include distant or … and openly hostile groups;” Terry Eagleton, who says, “In a large
community, most of our encounters consist of seeing rather than speaking, glimpsing each other as
objects rather than conversing.” So we want to look at whether we can show empirically that
conversational networks with fewer people are more closely connected.
So to construct the network, the nodes represent people who said something, and we did this
work—published at AAAI—on quote attribution, which we could do with eighty-three percent accuracy. So the
idea is: can you identify for a particular quote in a novel who said it? Even that is not trivial. And our
edges are people who are talking to each other, and here we use quote adjacency as a heuristic for
detecting conversations. So … I should mention that even goes to the point—often you have in novels
where you have conversation where it alternates between one person and the next—you have no
identification of who’s speaking, but you’ll have a quote, quote, quote, and so that was something that
we could pick up. And then we set the edge weight to the share of the detected conversations, so we
sort of have a … information on how much talking they did. We’re able to identify these links with very
high precision, but only fifty percent recall, so that’s something that could still use some more work.
Nonetheless, as you’ll see, this should work against us in what we aim to provide.
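A hedged sketch of the network construction: speakers become nodes, adjacent attributed quotes by different speakers add a weighted edge (the adjacency heuristic), and the weights are normalized to each pair’s share of detected exchanges. The quote attributions below are illustrative.

```python
# Conversational network from attributed quotes, via the adjacency heuristic,
# plus the graph statistics used to test the literary hypotheses.
import networkx as nx

# Attributed quotes in document order: (speaker, quote). Attributions invented.
quotes = [
    ("David", "What is the matter?"),
    ("Ham", "A wreck! Close by!"),
    ("David", "What wreck?"),
    ("Ham", "A schooner from Spain or Portugal, loaded with fruit and wine."),
    ("Peggotty", "Heaven help us all!"),   # invented, to give a third node
]

G = nx.Graph()
adjacent_pairs = 0
for (s1, _), (s2, _) in zip(quotes, quotes[1:]):
    if s1 != s2:
        adjacent_pairs += 1
        weight = G.get_edge_data(s1, s2, default={"weight": 0})["weight"]
        G.add_edge(s1, s2, weight=weight + 1)

# Normalize edge weights to each pair's share of the detected exchanges.
for _, _, data in G.edges(data=True):
    data["weight"] /= adjacent_pairs

print("density:", nx.density(G))
print("triangles:", sum(nx.triangles(G).values()) // 3)
```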
When we look at the network size, as the number of named characters increases, given our hypothesis,
we would expect to find same or less total speech. And in fact, we found that, but with a weak “yes”:
the number—normalized number—of quotes was flat. We would expect to find a less lopsided
distribution of quotes among speakers, and yes we did find that: the share of quotes by the top three
speakers decreases. As the number of named characters increases, we would expect to find lower
density, if our hypothesis is true—that is, each person would have fewer conversational partners as a
percentage of the population—and we did not find this. We found that larger networks are more
connected. We would expect to find the same number of cliques or fewer—these smaller groups that actually
converse with one another—and we did not find this—in fact, we found that the three-clique rate
increased and larger networks form cliques more often. As the number of speakers increases, we
would—so that was with named characters, now let’s look at just the people who are speaking; perhaps
that would give us the information we were looking for—so as the number of speakers increases, we
would also expect to find less overall dialogue—this is this glimpsing rather than speaking—and we find
that not to be the case. Larger networks are more talkative. And we would also expect to find lower
density, and again, we find that not to be the case: in larger networks, speakers know more of their
neighbors.
We did find an alternative explanation in the data that we were looking at, and that was text
perspective, which dominates the network’s shape. So in third-person tellings, as opposed to first-person, we did find significant increases in the normalized number of quotes, the average degree—that
is, the number of people each person was speaking to—the graph density—that is, the
percentage of the population that we saw them speaking to—and in the rate of three-cliques, and this was with no significant difference in the number of characters or speakers. Our
hypothesis here is that, when we have a first-person narrator, they’re not privy to the other characters’
conversations with each other; they see things only from their own viewpoint. And we can see this in
the graph, so this is a network from a third-person narrative novel, Jane Austen, Persuasion, and we can
see it’s—well, Anne is talking to a lot of people. When we go to what is called a close third—so it’s told
in the third-person, but from the perspective of one character—we can see the shape changes:
Robert dominates, and most of the links are between Robert and other people, we’re not seeing who
else they’re talking to. And when we go to the first-person narrative, we get a dramatic change, so
everything is seen through the “I.”
So what have we learned here? Well, we’ve learned that high-precision conversational networks can be
extracted from literature, and I think that while the natural language community has avoided looking at novels,
the time is right now to do more exploratory analysis of fiction. And we’re beginning by combining the
work that we’ve done by looking at the social function of characters and see how that plays a role. ‘Kay,
so to conclude, we’ve looked at a lot of different sources of data and what we can do with them. Our
goal is actually to do some integration—I’m not sure yet how novels fit in, but we’d like to bring the rest
in, and we’re at early stages with personal narrative.
So I’d like to show you now a mock-up of where we want to go with this. Let’s see if I can … [pause] of
course, things that work when you start … [pause] I may have to just do it like this. [pause] I can see I
don’t have … well I had this all set up, but … we may just have to go through it by hand. Here is a sort of
timeline of Hurricane Sandy, where we have the introductory information about Hurricane Sandy
approaching, and then we move through to provide information along a time and space, which will tell
us about what hap … what has been happening. So here, we have the personal narrative, the
compelling event. We then move along, Tuesday eight PM, where New York City has become dark,
again drawing from social media. A little bit later, a conversation about postponing the vote. And finally, a
month later, where people are beginning to work on the impact of it, and … so drawing from scientific
articles. ‘Kay? So to conclude, that’s our goal, and I’m, at this point, just ending with showing you a
picture of our research group. So thank you. [applause] So … I don’t know if there are any questions.
Yes?
>>: I’d be really curious what kind of reaction you got from the literary theorists, because they can be
sometimes a little detached from reality. So I wonder how they react to—you know—empirical
[indiscernible]
>> Kathleen McKeown: Well, I was … we were working with the chairman of the comparative literature
department at Columbia, and he was really interested in this. In fact, he did not worry that our evidence
came out against the theory. He found that very interesting, and I have to say the evidence is only
against—it doesn’t disprove it, it just provides some information that suggests that it may not be as true
as people thought. The field of comparative literature is moving towards doing more empirical analysis
of texts; Franco Moretti at Stanford—who provided one of that … those quotes—that is what he does.
They tend to use less sophisticated tools than you have in natural language processing, so—you know—I
think there’s a lot of room for interaction. And since we did that work, the department at Columbia
hired a faculty member in computational English, and I thought—you know—how cool to have
Columbia, who is so conservative, hire in that area. And he’s a person who actually does use fairly
sophisticated tools—he does topic modelling and, you know, various other kinds of things. Yeah?
>>: So there—I mean, just for confirmation—there is—I don’t know—nineteenth century, what we’re
studying, right? So I’m thinking we show author writes in the second person: “you did that,” you … I
mean, basically puts you there. I mean, nothing of that would … if you looked at different authors, the
styles … how much more of a difference there is between—I know—authors by themselves versus the
type of novel: with lots of characters, fewer characters …
>> Kathleen McKeown: There’s a whole lot you can do. I mean, so you could look … we talked about a
number of the different things that we could look at. For example, we talked—at one point before
starting—about whether we would look at difference between genres, and of course, when you’re
working with someone in comparative literature, they have a lot more nuanced view of the genres than
I do. But there’s mystery novels and whether … but we didn’t have enough data on each kind of genre
to do that, and that was one of the reasons—you know, you need to be able to get a large enough
corpus that is available online. We had about sixty thousand novels from that time period.
>>: It’s a very interesting presentation. One of the things that I wanted to ask you—for your insight
about—is: literature is very structured in the way that it … and very little sarcasm and very much
normalized. A lot of the—sort of—the social postings today—like Twitter or stuff like that—there’s
facetiousness, a lot of sarcasm, and a lot of negativity. So I want to ask you for your thoughts about how you
think this type of approach may work in—sort of—the newer social media types of language.
>> Kathleen McKeown: Well, we are … I mean, we are working with social media, and different portions
of the work that I talked about, we use different techniques. So what we’re doing with novels, we’re not
doing with online discussion forum. In our work with online discussion, sentiment analysis plays a big
role. For example, whether somebody has influence can depend in part on positive or
negative reaction to what they’ve said. There is work at Columbia going on on detection of sarcasm;
we—my group—is not yet using it, but it’s being done by Weiwei Guo in … with Mona Diab and—there’s
another researcher, Smaranda Muresan—who are looking at that and working together as a team.
So it’s something that we could eventually fold in. I think sarcasm is hard to detect, but of course, it
negates whatever sentiment—you know—the person has expressed, so it’s important.
>>: Did the other … I’m sorry to follow a question like that. Do you have any insights about loss of
context when the—sort of—the message is so short? Because a lot of the things—you mention these
things—depend on knowing all the stuff around it. When you have tweets that are so short and no
context around it, it sometimes can be very difficult to determine, like, what are they even talking
about? Do you have any …
>> Kathleen McKeown: Yes. So I have to say: that’s why I stress a little bit that we are looking more at
online discussion forums, where the context is longer. So in our work on influencer detection, we found
a number of very nice sites on political discussion. Even in our work on disaster, we find some. But with
… we have also done work on Twitter, and within Twitter, we’ve sort of focused on finding threads
where there is conversation. So you can see some back-and-forth as opposed to individual posts,
because otherwise, I do agree with you—like, I’m … obviously, I’m not of the younger generation, and
I’m not always sure of the value of Twitter. Although clearly in the context of disaster, it is important
when—you know—things first happen, and sort of to see the progression of events. Yeah?
>>: How do you—on the subject of influencers—how do you distinguish influencers from the trolls you also in general have in discussions?
>> Kathleen McKeown: Ah. That’s a good question—we don’t right now. I … and I can’t give you a good
answer about how we would do it. We’re having a hard enough time with influencers, so we’re just
assuming for the moment that that doesn’t exist.
>>: Been thinking about the research that just came out talking about trolls and how they’re aligned
with the dark tetrad of psychological attributes.
>>: Well … I was actually wondering whether analyzing the text to identify, that might actually …
>>: Yeah, so …
>>: Identify trolls for the future …
>>: That’s a question that I’ve been asking inside the science community—and so has she—about how
we might be able to identify and sideline the influencing trolls by recognizing: “gee, these are the kinds
of psychological models, and then what kind of language and behavior aligns with those?” I think that’s
a long time in the future.
I did have a question about the influence model.
>> Kathleen McKeown: Mmhmm?
>>: I was curious: the work you do, is it normally in online discussion forums within one forum?
Because there is—when you evaluate the influence and the authority of someone—if they come from
Reddit, I evaluate them very differently to if they come from somewhere else. So I’m curious about …
>> Kathleen McKeown: So, we are looking across different kinds of discussion forums. I agree they can
be quite different. We had earlier started out with live journal blogs, mainly because they had a lot of
metadata available about the poster, so we could get more information. We then moved to Wikipedia
discussion forums—those are very different. We’re now looking at online political discussion forums,
and we have worked some with Twitter. And part of what we’re doing is a bit of domain adaptation, so
with our different features, which are learned, we can … as we … we test, as we move from one genre to
another, how well it … how well we do. For example, do we need to retrain on the new genre? Can we
use a combination of training material from both? And so forth. But we haven’t looked specifically at
what you’re raising now, which is where the person’s natural habitat is. Yeah?
>>: In your summary—the Libyan airliner—what percent of your readers would read “air force” and
assume it’s US Air Force?
>> Kathleen McKeown: Why did we drop that?
>>: Right.
>> Kathleen McKeown: So in other words, why did we drop “US Air Force?”
>>: Italian, pardon.
>> Kathleen McKeown: Um … you know, I can’t give you a logical rationale as to why. It’s learned, so it
happens in the data that we’re looking at; when people did summaries, they dropped “US Air Force.”
>>: They dropped “US Air Force,” but did not drop “Italian Air Force.”
>> Kathleen McKeown: Yeah. So it may have been assuming a US readership. I think all the people
doing the data—the majority of the people … this came from data also that was done by people … must
have been from the US. Yeah?
>>: So the analysis that you show on the literature, constructing the network of the people who populate it—what kind of practical application do you have for this kind of construction?
>> Kathleen McKeown: We don’t.
>>: No? It might help to the … for the reading of novels—you know, big novel … people get confused,
you know [indiscernible]
>> Kathleen McKeown: Yeah. I—sort of—I—you know—I don’t do patents as much as I should, if at all.
And I know it’s very different from being in a company. We actually end up putting a lot of our work
online for free. I have worked with our patent office, and it was enough to make me decide no.
[laughter] I don’t have to do it, I’m not going to.
>>: You know, sometimes compression works, so … you stood … you started from the dot bay—is that
right—from the Pyramid evaluation. So have you looked at what would be needed if you’re moving to a
different language or a different domain than news. So what would you need now from news—
annotated—to have similar quality, because that kind of data it won’t graft—right—because you have to
have all these systems running—you know—and human evaluators getting into the loop, and it’s a big
effort to have—sort of—data to use …
>> Kathleen McKeown: I … right. I think we need to move more toward semi-supervised and
unsupervised approaches. Either that or also using data from the web that we can find that can serve.
So our work on making use of all of the Newsblaster summaries to serve as question-answer pairs, which
is—you know—it’s not as accurate as if you did it with human annotation, but it’s huge. And so—you
know—that can compensate for some of the errors that are there. Now, for … we have thought a lot
about how to find good data that we don’t have to sit and annotate to get; nothing is perfect. So we
went with the Pyramid data, which actually is quite nice. We thought about and have looked at using
simple Wikipedia, where you go from longer to shorter sentences—which was partly okay, but wasn’t
clean data, like we were having to do a lot of work in figuring out which were good pairs and which were
not. In previous work—this is medical—where we’ve gone from—you know—Reuters posts, news
releases—when a journal article is released—and so there you have a pair with a short lay version of the
new … of the journal article. It’s helpful—but again, not perfect, because the data’s sparse. So I think
that would … that’s the direction that we’re going in, is: how can we look to find the kind of data that we
need as opposed to annotate it? Okay? Thank you very much. [applause]