>> Sue Dumais: Hey, good morning. It's my pleasure today to introduce Doug Oard
who's visiting us from the University of Maryland where he's a professor in the College
of Information Studies and has several other affiliations.
Today Doug is going to talk to us about identity resolution in large e-mail collections.
But Doug's also done a bunch of work. I first met him when he did work in
cross-language retrieval. He's done a lot of interesting work in trying to analyze and
support access to oral histories of various kinds and other types of conversational media.
And he's also one of the leaders of the effort in the TREC community on e-discovery: how to support people in a more thorough examination of important corporate e-mail and other records related to litigation of various kinds.
So today it's my pleasure to introduce Doug who will, again, tell us about identity
resolution.
>> Douglas Oard: Okay. Great. Thanks, Sue.
So actually this work comes from all of those things. Starting with cross-language retrieval, I realized after a while that I wasn't really interested in language. Sorry to all the NLP people here. What I was really interested in was transformation of language, and transforming from speech to searchable written form was just as interesting as transforming from one language to another, just as interesting as OCR, just as interesting as handwriting.
So I'm interested in searching language, but more specifically in searching transformed language.
Then I got involved in all this oral history stuff and started hanging with the speech folks.
And what I learned was that speech folks are very difficult to work with, not because they
have difficult personalities, they're actually very nice people, but because they have very
high resource requirements in order to get that speech converted.
So the translation people are great fun to hang out with because, I mean, all they do is
count the words, which is pretty much what IR people do. And you'll end up with usable
representations at least for search.
But the -- to work on speech, you have to do all of this acoustic processing, all this
acoustic transformation. Before you can get anything, you need a fairly strong language
model in order to recover some sense out of that signal. And so speech is just much
harder to work with than foreign languages.
So that motivated me to say, well, let's look at what I really care about, what I've learned
I really care about, which is in conversational speech the conversational aspects to it.
And so that's where this talk comes from. It intersects with a problem where a lawyer
walked in one night at nine o'clock at night to my office after I'd been teaching and he
said, I have a problem. And I'm like, oh, this is very bad when lawyers walk into your
office and say they have a problem. It's particularly bad at nine o'clock at night when
their problem is so urgent that it couldn't wait until the next day.
So here's the problem that he had. The Clinton administration had deposited 32 million
e-mails at the National Archives. Now, they didn't actually deposit at this National
Archives. This National Archives is downtown and you can go see the Constitution and
the Bill of Rights and things like that. They deposited at Archives II, which is their
storage facility, which is on our campus at the University of Maryland.
So all of the electronic records from all of the presidential administrations since Reagan
are located in College Park. And this lawyer had gotten a request, and the request was to
find all of the e-mails in that collection that were about tobacco policy. And so this was a
part of tobacco lawsuit.
And normally what the National Archives does is they say, well, here's all the stuff,
you're welcome to come, you know, flip through the pages and find what you need, not
our problem once we've made it public.
Problem is you can't make this stuff public because there's all sorts of stuff in the Clinton
e-mails that are fairly sensitive. So, you know, there's like invasions of countries, there's
a whole bunch of stuff on the Presidential Management Intern Program. So there are all
kinds of things that would need to be reviewed. So somebody's going to have to look
through all 32 million e-mails one day, decide which can be made public. And this will
happen, you know, a long time after the Clinton administration.
And by then we think the Obama administration is going to put something on the order of
a billion e-mails into the National Archives. So there are several hundred million e-mails
from the second -- from Bush 2, from the George W. Bush administration.
So a billion e-mails seems pretty reasonable. And we're never going to be able to
manually review all those. So, in any case, we have to think about how are we going to
support this process.
Well, here's what they actually did. They built a bunch of Boolean queries for things like
Phillip Morris Institute, or PMI, which also turns out to be Presidential Management
Interns, so they had to like not Presidential Management Intern. So they made their
Boolean queries, and then they hired 25 lawyers to work on this for six months to
actually look at every one of the 200,000 documents that came back from the Boolean
query.
And so you look at this, and this is what he wanted to solve. Right? This was the
problem that, you know, he said can we make this, you know, 3,000 or something and get
it to the things that we want. And so this is what the TREC Legal Track does. And that's
where the TREC Legal Track came from was that evening conversation.
This is the problem. After they did this they decided that 80,000 of those e-mails should
be released to the other side. A hundred thousand of those e-mails were relevant, but 20,000 of them were subject to a claim of privilege. So they put those 20,000 in privilege logs and they released 80,000.
So the actual original question here, how can I make this process more efficient, is very
hard to get traction on because the requests that come in are so broad that an awful lot of
stuff gets released.
Quick example from the TREC Legal Track last year. We were running a model of this process on 7 million documents, actually on tobacco documents, not on e-mail. And in one of the requests -- it was great fun to watch the lawyers actually sue each other, and then we take the suits and we put them into the process -- we did an interactive search and we found 700,000 relevant documents out of 7 million. A 10 percent yield rate -- actually, an 11 percent yield rate on that particular topic.
So it's very hard from an IR standpoint to get traction here. But there's a second problem,
and that is that now we've dumped 80,000 e-mails on these people who didn't work in the
Clinton White House, they don't know who these people are, they don't know what these
words they're using mean, they don't know which things are happening at which times.
So now the question is how do you make what you might think of as sense making tools
for these people. Right? Okay.
So e-mail is interesting because it's conversations. Conversations in e-mail are
interesting because they're simpler to get at than conversations in speech. But what we
learn in e-mail might be useful there. So there's the technology path and here's the
application.
So here's the specific problem that I want to talk to you about today. This is from the
Enron collection. The Enron collection is what we used for all the development here.
We have other e-mail collections that are difficult to share. This one's very widely
shared.
And so you can see here Kay Mann says to Suzanne Adams did Sheila want Scott to
participate. And there are lots of questions you could ask.
So what was this GE conference call about? Now, I've got GE conference call, I've got
the date, I know who's asking. I could go get their calendar if there's a lawsuit going on
and I could probably make sense of that. Right?
Who's Sheila? Well, I might have an org chart and I might know who's related
organizationally to Kay Mann that's named Sheila. And so I might be able to make sense
out of that.
Who's Scott? Right? So we don't know when the call is, but we do know that the call is
too late for Scott to participate in, so we might get some idea where Scott is or something
out of that. So there's a lot of sense to be made out of this. And I want to talk about only
one problem, which is WhoDat, right? So who's Sheila.
So the entire talk today on that. Who cares? Well, I've told you already the lawyers care.
You can imagine that you show up and you see somebody's computer, it's got a bunch of
e-mail on it, they're engaged in some criminal activity. You want to make sense of what's
going on there. Sometimes under very tight time pressure. Right? So law enforcement
or intelligence or whatever might care.
And then historians care. Now, I'm driven by trying to help the historians. Nobody
writes letters anymore but people are writing a lot of e-mail. And so we're going to have
quite a lot of that e-mail make it into the future. But the historians have two problems.
They have no idea what they would want such a system to do because they don't have
these kinds of collections or tools yet, so they're not the best people to ask about what we
should be doing. And they have no money, so they're not the best people to ask for the
resources to do it.
But, on the other hand, given that the lawyers and law enforcement are both interested in
this, what we can do is we can do Robin Hood, right? So we can get the resources and
the insights that we need into the problem, and then we can solve the problem for not just
lawyers and law enforcement but for everybody.
Now, I want to make a distinction in my interests between this sort of exploitation task,
the information is sitting there static and I want to understand it, and personal information
management.
This is not the problem you would start with if you were interested in personal
information management, because if this is my e-mail, I probably know who these people
are. So you have to really think about the sense-making standoff task from not being
inside to understand this.
Okay. So we're back to the problem. And it turns out in this collection after you do a
little bit of looking around, we can associate 55 people with the name Sheila. So we've
seen Sheila someplace in the collection that causes us to believe that there are these 55
different Sheilas. We might be a little wrong on that, there might be 53 or 59, but we've
got fairly good models. I'll show you how good the models are of how many Sheilas
there are.
So we can think of this problem as simply tell me which of these Sheilas it is. Now, it
turns out for every one of these Sheilas we know an e-mail address. These are people
who sent or received e-mail in the collection. So you might wonder how many people
are referred to, like Clinton got referred to a lot in the White House e-mail, did Clinton
have an e-mail address. Bush got referred to a little bit in the Clinton e-mail. Did Bush
have an e-mail address in the Clinton White House, right?
So there are people who don't have e-mail addresses. But we're going to make a simplifying assumption that the people whose references we're trying to tag do have e-mail addresses. Then if I can tag that reference, we can use that e-mail address as a pivot and build sort of a bio on that person.
And then we can come back at the end and ask how reasonable was that assumption to
have been made, and if it wasn't good, then we can build a classifier that tries to guess
whether people have an e-mail address and to do something with the people who don't.
Okay. So I imagine that I want to send Sue an e-mail and I want to refer to somebody
who sent me an e-mail recently, and I say, hey, Sue, I just got an e-mail from Mark.
Now, that's not going to make a whole lot of sense to Sue, Sue knows probably a lot of
people named Mark. I know a lot of people named Mark, and we probably know several
people named Mark in common.
So there's no context for Sue to attach Mark to the person. So I need a more complex
model than that. I need a way of knowing what kind of reference I can use with Sue.
So if I say something in the context of SIGIR 2004 conference, well then it's probably
going to be pretty clear that it's Mark Sanderson because he was the general chair of the
SIGIR 2004 conference. So I could have -- after I choose a person to mention, which in
this case is not Mark, I could pick some context.
So we wanted to mention here this person. And so Kay Mann wants to mention that
person, in the context of a GE conference call, now which people are associated with
things like GE conference calls, that will tell me what kind of mention to use. Do I have
to use a full-name mention or a first-name mention, or can I use some nickname.
Now, once you see this generate a model, it's pretty obvious how to solve this. All you
do is you reverse the process. You sprinkle a little Bayes rule fairy dust on this. And
instead of estimating the probability of a mention given the person, you want to estimate
the probability that it was the person that you were intending given the mention. Right?
And so that's all we're going to do.
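A minimal sketch of that Bayes inversion, assuming the prior over candidates and the per-candidate mention model are already available as dictionaries; the addresses and probabilities below are made up for illustration, not taken from the actual system:

```python
# Sketch of the Bayes inversion described here (illustrative values only).
# p_mention_given_candidate[c][m]: how a person c tends to be mentioned.
# p_candidate[c]: prior probability of mentioning person c at all.

def resolve(mention, p_candidate, p_mention_given_candidate):
    """Return candidates ranked by P(candidate | mention) via Bayes rule."""
    scores = {}
    for c, prior in p_candidate.items():
        likelihood = p_mention_given_candidate.get(c, {}).get(mention, 0.0)
        scores[c] = prior * likelihood          # proportional to the posterior
    total = sum(scores.values())
    if total == 0.0:
        return []                               # mention unknown for every candidate
    return sorted(((s / total, c) for c, s in scores.items()), reverse=True)

# Hypothetical example:
priors = {"sheila.walton@enron.com": 0.6, "sheila.tweed@enron.com": 0.4}
likelihoods = {
    "sheila.walton@enron.com": {"Sheila": 0.8, "Sheila Walton": 0.2},
    "sheila.tweed@enron.com": {"Sheila": 0.3, "Sheila Tweed": 0.7},
}
print(resolve("Sheila", priors, likelihoods))
```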
Okay. So to do that we have to go through three things. We have to know who the
people are and what names they can be referred to by. So we'll call this the identity
modeling problem. And then we have to model the contexts. And then it's very
straightforward to do the Bayes inversion and to estimate the probability that any one of
these people might have been the mentioned person. So this comes together at the end
once you do these two steps.
Okay. So step one, let's build a computational model of identity. Where do we get it.
Well, since we have e-mail, we can look at three kinds of things that go on. So here's an
e-mail from Elizabeth Sager that Elizabeth signs Liza. Now, I might not have guessed
that Elizabeth would have been called Liza. I might have had some lookup table, but that
would make all the Elizabeths into Lizas.
So but I can look and see that this person signed an e-mail this way. If this is the only
time they ever did it I can say, okay, they're not very often referred to this way. And if
they do this all the time I can say this is how they're normally referred to. So I can build
a probability distribution across the types of references for each person.
Okay. Now, this is based on user behavior. So it's just a social convention that we stick
our name at the bottom of e-mails and we stick different names at the bottom of e-mails
depending on who we're talking to. Right? So I sign my e-mails differently with my
mother than I do with people who I work for and than I do with people who work for me.
And so you could build a more complex model with a hidden variable in it for the kind of
relationship which we don't have. Right? So we're simply going to note that this person
signed their name this way this time.
Okay. Now, that's a user behavior. That's the hardest thing to get at. The easiest thing to
get at is that we can pull things out of e-mail standards. So, for example, we can find out
that Elizabeth Sager is Sager, Elizabeth, and so we get some idea now which one's the
first name and which one's the last name, which we might not have been able to guess as
easily before. Right?
So it's very easy to get stuff here, but it might not be nearly as informative. It's more
complex to get stuff here. And then the third place you can get stuff is what the mailers put in, which depends on how the mailer is actually written. This is free text, but it's written by a machine, and so we can just reverse engineer some standard clients and pull stuff out.
Okay. So we do this with rules, we do this with rules, and I'll show you how we do that.
Okay. So this is just one of the rules firing that says, okay, here's a name and we found
another -- or we found the full text name with it. This is an artifact of the way the Enron
collection was put up by CMU, that you have to make these associations. Normally
they're given more tightly in the standard.
This is rules firing to get things out of the quoted text. And here's the somewhat
interesting one, how do you know that Sheila is a signature. And what you do is you just
take all the e-mails and you hold them up to the light and you see what's the same in all
of them. Right?
And so if you do that, you'll see pretty quickly that the signature block is there. And you
can do the same thing for salutation blocks. So you just tail-align for the signature blocks and head-align for the salutation blocks. You have to do things like get rid of
all the included text to do this so you know where things actually end.
And so there's some cleanup parsing ahead of that that's all done with manually written
rules.
Okay. And then it turns out the social convention is you stick the handwritten part of your signature above the automatically included signature block, so the only little trick here
is you go to the top and then once you've done that, you can just stop-word out things like
thanks and pick up what we call nicknames, right, the ways people informally refer to
each other. Okay. And then you can make that association. And you can do the same
thing for the salutations.
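A rough sketch of the hold-it-up-to-the-light idea, under the assumption that the trailing lines shared across one sender's messages are the fixed signature block and the short line just above it is the handwritten sign-off; the helper names and toy messages are mine, and the real hand-written rules do considerably more cleanup:

```python
from collections import Counter

CLOSINGS = {"thanks", "thanks,", "regards,", "best,", "cheers,"}

def common_tail(bodies, min_share=0.8):
    """Trailing lines shared across most of one sender's messages:
    the automatically appended signature block."""
    tails = Counter()
    for body in bodies:
        lines = [ln.strip() for ln in body.strip().splitlines() if ln.strip()]
        for ln in set(lines[-4:]):            # only look near the bottom
            tails[ln] += 1
    return {ln for ln, n in tails.items() if n >= min_share * len(bodies)}

def nicknames(bodies):
    """Handwritten sign-offs found just above the shared signature block."""
    block = common_tail(bodies)
    found = Counter()
    for body in bodies:
        lines = [ln.strip() for ln in body.strip().splitlines() if ln.strip()]
        personal = [ln for ln in lines if ln not in block]
        if (personal and len(personal[-1].split()) <= 3
                and personal[-1].lower() not in CLOSINGS):
            found[personal[-1]] += 1          # e.g. "Liza"
    return found

msgs = ["Can you set up the call?\n\nLiza\nElizabeth Sager\nEnron Legal",
        "Looks fine to me.\n\nElizabeth\nElizabeth Sager\nEnron Legal"]
print(nicknames(msgs))   # Counter({'Liza': 1, 'Elizabeth': 1}) on these toy messages
```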
And so we get models that look like this. This is an actual model out of our system. And
so when I said we know 55 people named Sheila, what I meant was there are 55 models
in which Sheila appears as a name. Right? Now, we can get Sheila one of two ways.
We can either go back and we can parse this name -- actually three ways. We can parse
this name or we can parse that one.
So we've got 55 in which Sheila is included in a way that we think would be useful.
There are 133,000 e-mail addresses in the Enron collection, which is a collection of only 149 people's mailboxes. So I was a little surprised by how large that number is. There are 250,000 unique messages in the collection. And so about one out of every two messages adds a new e-mail address to the set we know exists.
And this collapses it by about a factor of two. Right? So we can sometimes put together
a home and work e-mail address if we observe the same unusual first and last name. So
when we -- we use full names to fold addresses together.
We're not doing anything to try to figure out if somebody is intentionally maintaining two
separate identities. There are some cases in Enron where two people use the same e-mail
address.
So, for example, an executive and their assistant. And the assistant will actually sign
messages with their name, not with the executive's name. So it's quite clear what's
happening. But we don't try to process that at all. We just treat that as the executive has
their assistant's name. Right? So that messes up our models a little bit.
In some of our early processing, I'll show you at the end -- well, I can show you at the
end, if you want to see it, what happened. We weren't processing mailing lists, and they
really screw up this model, because the mailing lists get a capture on every name on the
list. Right? So you have to be careful about things like that.
Okay. So two things we could ask here. One is how well can I use these identity models
to do the resolution I want to do in stage three. But we could also ask how well have I
done -- how well have I built my identity model in the first place. And if I've built it
poorly, then probably going further doesn't make sense.
So the first step we did was we said, okay, let's just take all the names that we learn from
main headers that we saw only once, right, very weak evidence, and let's just see how
often those are right. So probably it will be right some of the time and wrong some of the
time. And then if we have stronger evidence, we'll see how much better that is.
We'll do the same thing for the quoted headers where now we're reverse engineering the
regularities of the systems. If we see it in both headers, fine.
And then we'll also look at these address nickname associations, which we expect might
be more informative but also more error prone. Right? So, for example, we have some
where we have a "hi, all," and so somebody's name ends up being "all." And it's like,
well, you know, it probably wasn't the right thing to do.
So, anyway, we measure all of this with a stratified sample of 600 across these conditions, and so we can measure probabilities with about 5 percent error bars on them.
>>: How do you know when you're right?
>> Douglas Oard: So what we do is we hire the next graduate student over and he tells
us what he thinks. And we don't do any inter-annotator agreement. There's no test
collection being built here. This is actually coming out of the system. Right? So we just
get the data out of the system, we have one person look at it, and we see how well that
person agrees with the system. So the person does know that the system hypothesized
this.
Okay. There are four possibilities. We might be wrong. This is an actual case where
we're wrong. So we think kmpresto's name is home e-mail. It's like, well, you can see
how we made the mistake. If you were, you know, teaching school you might give your
kids partial credit for this, but, you know, not in our business.
There's another one where we learn that June Rick is June Rick, and it's like, yeah, right,
okay, good, it's correct, but I don't really care.
And then down here we learn that this person's name here is Phyllis and we didn't have
any good clues for that. So the question is how often does this happen. That's really
what we care about. But we'd also like to have that happen not very often. Right? So if
this happens, there's not much we can do about it, but we really want to avoid this and we
really want to get that.
Okay. So here are the results. This is how often we're correct. The short version is if we
get it out of the headers, we're always correct. So this is everything we took out of the
headers, that's the 300 samples, all right, and we're right all the time.
Okay. So we don't have the problem with processing data out of the headers. We do
have a problem with processing data out of the salutations and signatures, so we're wrong
about 10 percent of the time.
And in the very worst case, which is very weak evidence coming out of the signatures
where we have to find the top of the signature block on that sort of thing, now we're
wrong 20 percent of the time.
So you look at this and you're like, well, not bad for a bunch of rules. We don't see
numbers like this that often on the first try, let's not worry further about this.
Okay. Now, over here, this is how often we learn something. So this is an indication of,
you know, did we do what we tried to do; this is an indication of was it worth doing.
And you can -- if you just focus on the overall number, you can see them stratified over
here, that we learn something that is very informative.
So this isn't the case where we learn how to split a name, which was that middle case; this
is the case where we actually learn something we could not have observed. And we learn
something about half the time coming out of the headers. Right? And this is very easy
data to process. We're very accurate and we learn a fair amount. So this is low-hanging
fruit.
Here we only learn something about 20 percent of the time coming out of the salutations
and signatures. And it's kind of interesting, the case where we have the weakest evidence
coming out of the signatures is the case where we happened to learn the most, which was
the case where we were making the most mistakes over here.
So this suggests that if you're going to work on this problem further you should actually
work on the low-data part of the problem, right, that the -- the cases where you haven't seen things very often have got information in them that is worth looking at. Right?
Every once in a while you sign your name a certain way.
Okay. Anyway, that's not the main purpose of the thing. That's just to get you to
something like this. So what you want to do is you want to say for these people I'm
going to assume some kind of a prior, that this person isn't called Sheila very often and
that person's called Sheila a whole lot.
And to get to that we have to adopt some kind of a frequentist interpretation of
probability, and this is very straightforward. Right? So this says how often do I see this
e-mail address out of all of the e-mail addresses that I see. Okay. So this person gets
talked about a lot.
We can do that all the way down to how often do I see this type of name -- so, for
example, a first name, a last name, or a nickname -- mentioned this way. So the
particular nickname S.G., how often do I see that given it's this candidate over all the
S.G.s that I see.
Right? So it was just a frequentist interpretation probability. It takes the models that you
saw previously and it hallucinates probabilities, right, that are essentially priors.
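A minimal sketch of those frequentist estimates: the prior for a candidate is the relative frequency of observing that address, and the mention model is the relative frequency of each way of referring to that candidate. The observation log below is hypothetical:

```python
from collections import Counter, defaultdict

# Hypothetical observation log: (candidate_address, mention_type, word)
observations = [
    ("sheila.walton@enron.com", "first", "Sheila"),
    ("sheila.walton@enron.com", "nick",  "S.G."),
    ("sheila.tweed@enron.com",  "first", "Sheila"),
    ("sheila.tweed@enron.com",  "first", "Sheila"),
]

cand_counts = Counter(c for c, _, _ in observations)
mention_counts = defaultdict(Counter)
for cand, mtype, word in observations:
    mention_counts[cand][(mtype, word)] += 1

def p_candidate(cand):
    """P(candidate): relative frequency of seeing this address at all."""
    return cand_counts[cand] / sum(cand_counts.values())

def p_mention_given_candidate(mtype, word, cand):
    """P(mention type and word | candidate): relative frequency within that
    candidate's observed references; zero counts stay zero (no smoothing)."""
    seen = mention_counts[cand]
    return seen[(mtype, word)] / sum(seen.values()) if seen else 0.0

print(p_candidate("sheila.walton@enron.com"))                       # 0.5
print(p_mention_given_candidate("first", "Sheila",
                                "sheila.tweed@enron.com"))           # 1.0
```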
Okay. Then we do something that will make you cringe. Any machine learning folks in
the audience? So this was okay. But now what you want to do is smooth, right? Okay.
So we don't -- we just -- if it's zero probability in the prior, we just ignore it. And this
gets our set of candidates down.
Okay. So you could probably do a little bit better here, but, you know, bring money
because you're going to now be computing over the full set of identities, which in our
case is 77,000. But if you run -- I don't know, do you guys run an e-mail service here,
Hotmail or something like this? So you try to do this over Hotmail, you'd be at it for a
little while.
So this is a convenient assumption, and we'll just ask how well does it work if we just
ignore all the zero probability cases. So we know we're throwing out some good ones
here.
Okay. So we're back to the problem --
>>: That example had an interesting -- the only people who you associated the name
Sheila with were female. Do you use any gender information?
>> Douglas Oard: So we have no gender information available to us, so we don't have -- we're just processing the e-mails. We don't see address books, we don't have side
information, we don't have org charts. So there's a lot you could do if you had side
information.
And indeed these are just pretty pictures. Right? I mean, you know, in fact, I think one
of them is one of our faculty members. So we just -- we're grabbing pictures off of some
photo site to make exactly this point; that, you know, there aren't a lot of Sheilas that are
guys.
Okay. So, anyhow, we're back to the original question but now we're on this second
piece, context reconstruction. So I want to find the context of this mention. So the first
thing we're going to do is we're going to say, well, if you do coreference resolution you could say, well, the context of this mention is different from the context of that mention.
But I'm not going to do that. I'm just going to say the context of this mention is the
context of the e-mail. Okay. So another simplifying assumption that anything in the let's
say topical context of this mention will also be in the topical context of this mention in
exactly the same way.
Okay. So now all I have to find is given this e-mail find me things in its context. So how
many contexts can we think of? I can think of five. Maybe you can think of more. But
here are the ones we started with.
We might find the answer in the e-mail. So somebody could say did Sheila want Scott to
participate, and it could be cc'd to Sheila, some Sheila. Right? And I could say, oh, that's
a pretty good guess; that might be it.
In fact, a collection has been made this way. Right? And all you do is you take out the
cc, hide it from the system, and see if it can guess the cc. Right? So you just go
recognizing those cases, it's sort of a classic way of building a known item retrieval test
collection.
So I'll show you results on that. Okay. So, in any case, we just give our full mechanism
the local context and we say if you can find an e-mail address for a person for whom Sheila is the name, then there's some information.
The second possibility is that you might use the thread structure. Now, the thread
structure is a little bit dangerous to use for a couple of reasons. One is the thread
structure is not given in the Enron collection. So you have to reconstruct it and you may
make a mistake reconstructing it.
The second thing is even if you reconstruct it correctly, this is the reply chain structure,
and that's not necessarily the sort of semantic thread structure. This is merely every time
I hit reply.
So, for example, I was here, I don't know, half a decade ago, and so I sent a message to
Sue that said I'm coming up. And where did I get her e-mail? Well, from an e-mail half
a decade ago, right, and I just hit reply and all the previous stuff was in there and, you
know, hey, Sue I'm coming back.
So these aren't always perfect, and there are things you could do to clean it, but what we
do initially is we simply thread in the obvious way based on subject lines. And that's a
really bad idea. Because any of you get messages that have nothing in the subject line?
So the way we thread on subject lines is we stop-word the word "re" and index the
subject line.
And so what happens now is everything that had a blank subject line becomes a single
thread. It's like, well, that's probably a bad idea. So the second thing you do is you go in,
you find all the included text, and you search with the included text for the message that
that text was included from. And you can infer threads that way. And that's a much
stronger way of doing it. And if you have these two, it sort of triangulates it.
So we have reasonably good threads and we don't work any further on that. But you
could do more on this if you wanted to.
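A rough sketch of the two threading signals, assuming messages are simple dictionaries with a subject and a body; the quoted-text lookup here is exact containment, a simplification of the retrieval-style matching described above:

```python
import re
from collections import defaultdict

def norm_subject(subject):
    """Strip 're:'/'fw:' prefixes; blank subjects give no threading evidence."""
    s = re.sub(r"^\s*((re|fw|fwd)\s*:\s*)+", "", subject, flags=re.I).strip().lower()
    return s or None

def subject_threads(messages):
    """Weak evidence: group message ids by normalized subject line."""
    threads = defaultdict(list)
    for mid, msg in messages.items():
        key = norm_subject(msg["subject"])
        if key is not None:
            threads[key].append(mid)
    return threads

def quoted_parent(mid, messages):
    """Stronger evidence: find the message whose text appears as quoted text."""
    quoted = "\n".join(ln.lstrip("> ").strip()
                       for ln in messages[mid]["body"].splitlines()
                       if ln.startswith(">"))
    if not quoted:
        return None
    for other, msg in messages.items():
        if other != mid and quoted in msg["body"]:
            return other
    return None

msgs = {
    "m1": {"subject": "GE conference call", "body": "Did Sheila want Scott to join?"},
    "m2": {"subject": "RE: GE conference call",
           "body": "Too late for Scott.\n> Did Sheila want Scott to join?"},
}
print(subject_threads(msgs))        # {'ge conference call': ['m1', 'm2']}
print(quoted_parent("m2", msgs))    # 'm1'
```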
The topical context is where we had the most fun. So we're IR folks, right, this is the
Ph.D. dissertation work of Tamer Elsayed, who is -- does IR work. And so to get a
topical context, what I do is I take this message, this focal message, and I say what other
messages are in its topical context.
So this is my query. I index these things. We count the words in the usual way. There's
some tf-idf floating around, and we find out that this thread here is in a strong
relationship to that, this thread here is in a strong relationship to that.
Now, I can take the entire thread here as a query and find threads. I can take the message
as a query and find messages. And you can play around and ask what's the right thing to
do, and it turns out those are both the wrong thing to do.
So this is actually one of the really interesting things about conversational content is that
in IR we make two assumptions that make the IR problem easy.
One is a document is a bucket that contains information which, when it's language, it's a
bucket that contains words, so that's the bag of words assumption. And we have no
boundaries on our bucket, now, right? So my bucket is kind of bleeding between these
things. So is this a document or is this a document or is that a document? I don't quite
know. Right? So nobody's given me document boundaries.
The second thing is we assume that I can just show you the document and you can say,
oh, yes, that's the one I want to see, which of course is what we're trying to fix with the
entire talk. So we're breaking both assumptions. So I have to say what a document is for
the purpose of finding topical context.
So here's how we define the document. A document is every message between the focal
message and the root. So it's here, the root of the thread. Or in this case it's this path,
right? And so the similarity between this e-mail and this e-mail is the similarity between
the set of words here and the set of words there.
And if you play around with lots of different possibilities, this one comes out well. And
if you look at it, it sort of makes sense, right, that the other side of the thread probably is
less informative, got more opportunity to be adding noise.
So that's a nice heuristic that you can walk away with and say just the thread back to the
root is a handy first thing to try, right, which, after we tried several things, worked best
for what I'll show you.
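A minimal sketch of that path-to-the-root representation: the pseudo-document for a focal message is the bag of words of every message from it up to the thread root, and topical similarity is cosine over tf-idf of those pseudo-documents. The parent map and toy messages are illustrative:

```python
import math
from collections import Counter

def path_to_root(mid, parent):
    """Messages from the focal message up to the thread root."""
    path = [mid]
    while parent.get(path[-1]) is not None:
        path.append(parent[path[-1]])
    return path

def pseudo_doc(mid, parent, text):
    """Bag of words over the whole root path: the 'document' for topical context."""
    words = []
    for m in path_to_root(mid, parent):
        words.extend(text[m].lower().split())
    return Counter(words)

def cosine(a, b, idf):
    dot = sum(a[t] * b[t] * idf.get(t, 0.0) ** 2 for t in a if t in b)
    na = math.sqrt(sum((a[t] * idf.get(t, 0.0)) ** 2 for t in a))
    nb = math.sqrt(sum((b[t] * idf.get(t, 0.0)) ** 2 for t in b))
    return dot / (na * nb) if na and nb else 0.0

text = {"m1": "did Sheila want Scott to participate in the GE conference call",
        "m2": "the call is too late for Scott to join",
        "m3": "notes from the GE call with Sheila Walton",
        "m4": "please send the revised contract to legal"}
parent = {"m1": None, "m2": "m1", "m3": None, "m4": None}

docs = {m: pseudo_doc(m, parent, text) for m in text}
n = len(docs)
idf = {t: math.log(n / sum(1 for d in docs.values() if t in d))
       for d in docs.values() for t in d}
print(cosine(docs["m2"], docs["m3"], idf))  # m2's root path shares GE/call/Sheila with m3
```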
So here's an example. I've got the GE conference call, is the e-mail you saw before, came
in to Suzanne Adams. And these three words are common to this document also. So I've
got Sheila GE call. And so if there were a Sheila that was known all the way down to the
e-mail address here, which there is in this case, then we can say, okay, maybe it's Sheila
Walton.
Now, look, I'm not doing anything semantic, I don't know if this is the same GE issue -- we do look and make sure that we're at least in the same time frame, so this is December
15th and that's December 20th. So we put some bounds on the time frame to do this. But
we just say, okay, fine, this message is in the topical context of that message. Right?
And so I will pick up some probability now that this Sheila should be emphasized over
others. That's independent now of what I did before in the identity modeling. Right? So
the identity model is the prior and then this is the observables. Okay. So we pick, you
know, that Sheila.
Now, the other place you might go, and if you're a social networks person you've been
sitting here going, oh, gotta do social networks, is you might go looking off in the social
network and find a social context.
And so here's an example of a social context. Kay Mann has sent an e-mail to Suzanne
Adams and now we see somebody has sent an e-mail to Kay Mann and they mention
Sheila Tweed. And we're like, well, I don't know about that topical context, that might
have been trash, but I know for sure that Kay Mann knows Sheila Tweed. Right? So
there's a Sheila that I've got some belief in.
The problem is here I don't know Sheila Tweed's e-mail address. So now I have to take
this Sheila Tweed and go do resolution on it. And so we're going to be at this for a while.
Okay.
Oh, so to do this we cheat because we're IR people, so we don't know anything about
social network analysis. So what we do is we say, well, previously all the words in the
e-mail, they were the thing that we built our model on. Well, now we just build it on all
the e-mail addresses in the e-mail or in the thread or in the path back to the root or
whatever it is we're doing.
And so if like the archive server is always getting cc'd, we'll learn to give that low IDF.
Right? So in fact this works remarkably well as a quick hack, is just forget the fact that
these things aren't words and treat them as if they were. Right?
And what we lose essentially is that there's not really very good TF evidence here. We
do see TF because we have multiple messages in the path back to the root, but it's very
spotty because the messages are short.
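A minimal sketch of the pretend-addresses-are-words trick, assuming each message's context has already been reduced to the addresses seen along its root path; IDF weighting is what keeps an archive address that is cc'd on everything from mattering:

```python
import math
from collections import Counter

# Hypothetical contexts: the addresses seen in each message's root path.
contexts = {
    "m1": ["kay.mann@enron.com", "suzanne.adams@enron.com", "archive@enron.com"],
    "m2": ["kay.mann@enron.com", "sheila.tweed@enron.com", "archive@enron.com"],
    "m3": ["trader@enron.com", "archive@enron.com"],
}

n = len(contexts)
df = Counter(addr for addrs in contexts.values() for addr in set(addrs))
idf = {addr: math.log(n / d) for addr, d in df.items()}

def social_similarity(c1, c2):
    """Cosine over IDF-weighted address 'term' vectors (TF is mostly 0/1 here,
    which is the weakness mentioned in the talk)."""
    v1, v2 = Counter(contexts[c1]), Counter(contexts[c2])
    dot = sum(v1[a] * v2[a] * idf[a] ** 2 for a in v1 if a in v2)
    n1 = math.sqrt(sum((v1[a] * idf[a]) ** 2 for a in v1))
    n2 = math.sqrt(sum((v2[a] * idf[a]) ** 2 for a in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

print(social_similarity("m1", "m2"))   # tied through Kay Mann, not the archive address
print(social_similarity("m1", "m3"))   # 0.0: only the archive address in common
```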
Okay. So in that case this will tell us some other Sheila, which, if we could resolve it, we
could use.
Okay. Now, if you're a modeler, at this point you're thinking I've got a lot of
information -- oh, I told you there were five, we only used four because the Enron collection that CMU posted has no attachments.
We're in the process, for the TREC Legal Track, of building an Enron collection with attachments, but that's not yet widely distributed, so this work is done with the distributed one. But coattachment would be a fifth one. Right? So I see the same document with
the same check sum and I'll just note that it's been attached to these two e-mails and that
will define a fifth context for me.
Okay. So now I've got four contexts. And what I want is I want some way of saying
how strong this attachment is. So we'll ask what's the probability that in some context,
let's say the topical context, of some e-mail, let's say e-mail 42, we should be attaching
some other e-mail, let's say e-mail 6.
So that's the representation. And I've got four different ways of doing it. So if you're a
modeler at this point you go off and you get some held-out data and you say let's learn
this. And if you're not, you say, well, let's just add them up and divide by four. Right?
So, actually, if you're a modeler, you'd say that was a uniform distribution and, you
know, that you had made an assumption that only a single context -- but we just add them
up and divide by four. Okay?
So this is another place where if you believed that many contexts were active at a time,
you should do a log linear model or something like that. All right? But if you believe
that only one context was active at a time, this actually isn't such a bad thing to do.
And since we have no belief on that, adding up and dividing by four was convenient.
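A minimal sketch of the uniform mixture just described: the strength with which another e-mail attaches to the focal e-mail's context is the average of the four per-context scores. The weights could be learned instead if you believed several contexts were active at once:

```python
def context_strength(per_context_scores):
    """per_context_scores: dict with one attachment score per context type,
    e.g. {'local': 1.0, 'thread': 0.0, 'topical': 0.3, 'social': 0.7}.
    The choice described in the talk is a uniform mixture: add them up and
    divide by four."""
    weights = {"local": 0.25, "thread": 0.25, "topical": 0.25, "social": 0.25}
    return sum(weights[c] * per_context_scores.get(c, 0.0) for c in weights)

# Hypothetical strengths tying e-mail 6 to the context of e-mail 42:
print(context_strength({"local": 0.0, "thread": 1.0, "topical": 0.3, "social": 0.7}))  # 0.5
```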
Okay. So what that means is I can rethink my problem as if I've got a mention here and
through the e-mail that it's mentioned in I can attach through the topical context Sheila
Walton, I can attach Sheila Tweed through the social context with some strength, I can
attach JSheila, an actual observed e-mail address through the social context, and you can
see that I can build this very complex graph structure on which I can now reason.
Okay. So you could look at this as a graph reasoning problem. But you can also take a
simpler approach. So here's the very simplest thing you could think of doing. Actually,
very simplest thing we could think of doing.
We want to find the probability of a candidate given a mention and the context. Okay.
So we can approximate that as a probability to a candidate given a mention and forget the
context. Context-free resolution. Right? And just apply Bayes rule. And we can get
that probability not just for Sheila but we can get that probability for everybody.
Now, the problem is there are 1.3 million references to people in the 250,000 e-mails in
the collection. And so you're going to be doing this for a little while. But after you get
done, you'll have context-free probabilities for all of these. In fact, if that's the only one
you have to resolve, you only have to do the ones that attach. Right? So this is actually
fairly efficient.
So then we can use these context-free resolutions to estimate this. Now, this is the only
one that's going to be useful here because the rest of them don't have e-mail addresses,
and that's what we're trying to attach.
But then you could do that for all 1.3 million of them -- you can see the compute cycles
going up here -- and resolve all those using context-free resolution in the first stage to
find these, and then you could get a better estimate of this. And you could iterate, you
know, until fully cooked, right?
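A rough sketch of the iteration, with the scoring functions stubbed out: pass zero resolves every mention context-free, and each later pass re-scores a mention using the resolutions of the mentions found in its context. All of the scores and addresses below are hypothetical:

```python
def resolve_iteratively(mentions, candidates, context_of,
                        context_free_score, context_score, iterations=2):
    """mentions: list of mention ids; candidates[m]: addresses m could refer to;
    context_of[m]: the other mentions found in m's context; the two scoring
    callables stand in for the Bayes-rule machinery sketched earlier."""
    # Pass 0: context-free resolution for every mention.
    resolved = {m: max(candidates[m], key=lambda c: context_free_score(m, c))
                for m in mentions}
    for _ in range(iterations):
        updated = {}
        for m in mentions:
            ctx = [resolved[o] for o in context_of.get(m, [])]
            updated[m] = max(candidates[m],
                             key=lambda c: context_free_score(m, c)
                                           + context_score(c, ctx))
        resolved = updated          # iterate "until fully cooked"
    return resolved

# Hypothetical scores: context evidence flips the first mention to Sheila Tweed.
cands = {"m1": ["sheila.walton@enron.com", "sheila.tweed@enron.com"],
         "m2": ["kay.mann@enron.com"]}
ctx = {"m1": ["m2"], "m2": []}
cf = lambda m, c: 0.6 if c == "sheila.walton@enron.com" else 0.5
cs = lambda c, resolved_ctx: 0.4 if (c == "sheila.tweed@enron.com"
                                     and "kay.mann@enron.com" in resolved_ctx) else 0.0
print(resolve_iteratively(["m1", "m2"], cands, ctx, cf, cs))
```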
Okay. So that's what we do. And so if you tried to do that, you would very quickly find
that the conversational expansion, so this is the threads, the local expansion, this is the
message itself, have no computational problems at all. The only computational problems
you run into are doing the topical expansion, doing the social expansion, and doing the
resolutions.
Okay. So it's convenient to do this in MapReduce. Everybody here speak MapReduce?
Okay. So all that MapReduce is doing is it's breaking things up, doing a single
processing stage, shuffling, doing a second processing stage and going back out to
storage.
And so this is how the topical context is found in MapReduce. You simply take the
postings list -- so this is like some term, GE, appears in only let's say ten documents. So
those will be on the postings list.
Now, if you work out the way in which we do the similarity in IR, those are the only documents whose similarity this term can contribute to.
So we have to do an inner product, which means we have to do multiplications and sums.
So we do the multiplications across the postings list, we shuffle, and then do the sums
back in the document space.
So this lets you get as wide parallelism as you like without having to manage all of this
shuffling. That's all the MapReduce is doing for us.
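A small in-memory imitation of that data flow, not a real Hadoop job: the map phase walks each term's postings list and emits partial products for every pair of documents sharing that term, the shuffle groups by pair, and the reduce phase sums them into inner products. The postings and weights are invented:

```python
from collections import defaultdict

def map_phase(postings):
    """For each term's postings list, emit ((doc_i, doc_j), partial_product)."""
    for term, weights in postings.items():          # weights: {doc_id: tf-idf weight}
        docs = sorted(weights)
        for i, a in enumerate(docs):
            for b in docs[i + 1:]:
                yield (a, b), weights[a] * weights[b]

def shuffle_and_reduce(pairs):
    """Group by document pair and sum the partial products (the inner product)."""
    sums = defaultdict(float)
    for key, value in pairs:
        sums[key] += value
    return dict(sums)

# Hypothetical postings with tf-idf weights; high-DF terms would be dropped
# before this step, which is the aggressive stop-wording trick described next.
postings = {"ge":     {"d1": 0.8, "d3": 0.6},
            "sheila": {"d1": 0.5, "d2": 0.9, "d3": 0.4},
            "call":   {"d1": 0.3, "d2": 0.2}}
print(shuffle_and_reduce(map_phase(postings)))
# d1-d3 accumulates the ge and sheila products; d1-d2 sheila and call; d2-d3 sheila only.
```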
Let's see. I'll give you timing data on that. So it turns out that the social context where
we have a very sparse representation, because there aren't very many e-mail addresses,
that's very fast, and that all those multiplications only have to be done in the topical
context.
We have a trick, a fairly straightforward trick, that gets that down to about 60 minutes on
a quarter million by a quarter million matrix to do the full similarities. We do a million by a million in about three hours, right, on standard IR test collections.
The trick is we just go in and aggressively stop-word, so we take out all the high DF stuff.
The high DF stuff is where all the multiplications are being done, and they're the very
smallest contributions to the results.
And so if you do that, you'll end up giving up about 2 percent by typical IR measures of
effectiveness, and you'll end up picking up something on the order of 96 percent in
efficiency. Right? So you're chopping the time down to 4 percent of the time and you're
giving up very little.
So this is -- the trick we use to do this, we do the same trick in here but it's not very
important, and the whole process takes us about three hours to run an iteration, so you
can turn the crank as many times as you like.
Now, the last thing to say about this is that you can use MapReduce to do this as well, so
in the mappers you just read in the resolutions, so you can compute the first time around
the context-free resolution, and then you just shuffle the contributions and you do -- in
the reduce stage you do the resolution over here. So we do the entire thing massively
parallel. Okay.
Okay. So the big question you should have is fine, interesting, nice trick, how does it
work. So I gave you an intrinsic evaluation before of just the identity modeling part. I'll
give you an extrinsic evaluation here.
So being IR folks what we want is a test collection so we can do repeatable evaluations.
So the documents are obviously going to be e-mails but now we need queries. So here's
how we get the query.
I've got 250,000 e-mails, close my eyes and I pick one uniformly. Now I go through and
I hand mark that e-mail for every reference to a person. So now I don't have any problem
with automated systems not doing that well as well as people can do that. We hand mark
it. We then put it back into the system and it blindly picks some reference. That's the
reference that has to be resolved.
Okay. So this just assumes that I've got a user coming along who cares about only one of
the things that we might need to resolve, and we don't know which one.
Okay. We do 600 of those. And then this is the little bit squishy part. We hire not the
next graduate student over but three undergraduates. And the three undergraduates go
through and we give them a search engine and we give them beer money and we say
would you please see if you can figure out who these people are, and we don't tell them
anything about what our system has done with those. In fact, we haven't even run our
system at this point.
Okay. And they go through and they try to find a resolution. And sometimes they can't
and sometimes they disagree and all the things you'd expect. All right.
And then we just use mean reciprocal rank to evaluate. So you get full credit if you're in
the first position at the top of the list, you get half credit if you're in second position, no
matter what the difference in the probabilities are in your estimates.
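Mean reciprocal rank as used here, in a few lines: each query scores one over the rank of the annotator's answer in the system's list, and the collection score is the mean. The ranked lists and gold answers below are toy data:

```python
def mean_reciprocal_rank(ranked_lists, answers):
    """ranked_lists[q]: system's candidates in order; answers[q]: annotator's choice.
    Full credit for rank 1, half credit for rank 2, and so on; zero if missing."""
    total = 0.0
    for q, ranking in ranked_lists.items():
        if answers[q] in ranking:
            total += 1.0 / (ranking.index(answers[q]) + 1)
    return total / len(ranked_lists)

runs = {"q1": ["sheila.walton@enron.com", "sheila.tweed@enron.com"],
        "q2": ["kay.mann@enron.com"],
        "q3": ["scott.x@enron.com", "scott.y@enron.com"]}
gold = {"q1": "sheila.tweed@enron.com",   # rank 2 -> 0.5
        "q2": "kay.mann@enron.com",       # rank 1 -> 1.0
        "q3": "someone.else@enron.com"}   # not returned -> 0.0
print(mean_reciprocal_rank(runs, gold))   # (0.5 + 1.0 + 0.0) / 3 = 0.5
```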
I'm going to skip this for a second. So this is -- oh. So anyway, it's three people. We'll
end up with a total of 584 queries. And they worked for about 60 hours. So they're
working for about six minutes apiece on average. Some taking longer, some taking
shorter.
And they couldn't resolve -- so this is first annotator. So first annotator can't resolve it in
20 percent of the cases. If you remove those 20 percent of the cases, in about 80 percent
of the cases it resolves to somebody at Enron. And in about 20 percent of the remaining
cases, taking out the unresolvable, it resolves to somebody not at Enron.
We were quite surprised by this. We had expected much more resolution to people
outside Enron because almost every e-mail address in the collection is from somebody
outside Enron. Right? There are 133,000 e-mail addresses, only 149 Enron mailboxes.
Now, there are more Enron people, but there aren't 133,000 Enron people. Right? So
most of the e-mail addresses are out of Enron, but most of the resolutions are in.
This unresolvable number is the one we use to bound how bad things can be if you make a ludicrous assertion of an association for somebody who quite clearly doesn't have an e-mail address in the collection. You can't hurt yourself by more than 20 percent. But
this 20 percent can hurt us on the iterative solution. Right? Because we go through the
first time and somebody captures that, and then that inserts noise the second time around.
So presently we don't try to classify this case. You can think of things that would classify
this case, but we just haven't built the classifier yet.
Okay. Now, when you do inter-annotator agreement, it actually comes out reasonably
well. So if the person works at Enron, then in 90 percent of the cases a second person
who is annotating it, who knows only that it has been previously annotated, they know
nothing more than that, they annotate it the same to the same person at Enron 90 percent
of the time. They don't agree that it's at Enron; they agree that it's the same person and at
Enron. But they didn't know it was an Enron person when they started. Right? So we're
just stratifying it for analysis.
If it's not at Enron, it's a harder problem for the people to do. And they get it right a little
over 60 percent of the time. And so on average we're about 80 percent correct right
across those two.
Okay. So we have a test collection you could use to measure things as long as the results
don't get up, you know, in the 80 percent range. The 80 percent range, we just can't
measure stuff above that.
Okay. So here are the results on our test collection and on previous test collections. I'll
tell you ours first. So if you look across everything, we're right 78 percent of the time.
And remember where the inter-annotator agreement is, it's at 80 percent of the time. So
we're right almost as often as we can measure.
Now, this isn't one best; this is mean reciprocal rank, so we're giving ourselves a little
more credit. You could actually get above 80 percent here.
If you look only at Enron, we're up at 82 percent. You'll remember we're at 90 percent
for inter-annotator agreement here. And if you look at nonEnron, we're at inter-annotator
agreement. Again, this is mean reciprocal rank, though. Right? But we're up here right
in the range of where inter-annotator agreement is.
So we're doing remarkably well. And this is all first iteration. Right? So a context-free
resolution and then use the context-free resolution in one pass to try to guess this.
If you look at the other collections that have been built, these are the two collections that
were built by Minkov at CMU, so we call them the M collections. One is a -- the
mailbox of a fellow named Sager and another is a mailbox of somebody named Shapiro.
And these are very easy problems because all the resolutions occur inside the mailbox.
So we know a very small set of messages that will contain the resolution, and so we only
have a very small number of candidates to resolve to. And we do exceedingly well on
these.
Galileo Nomantha [phonetic] at the University of Maryland built another test set where
he was interested in the question of if I have many people who are sending e-mails to
each other, so not our problem where everything in the e-mail collection, but just the
Enron people sending e-mails to Enron people, then how well can we do.
And so he built a collection that had 54,000 e-mails. This is the N-subset, had 78 queries,
had about a hundred-way ambiguity. And we get very good performance on that.
And his performance was about the same, but he reported results on 54 queries, and we
don't know which 54. So population means we're very close. We're a little bit above, but
that's kind of meaningless in this case.
We threw in all the data from his -- or from the Enron collection just to noise it up, so all
the answers are in here. But now the system's got more ambiguity to deal with. And we
use this as our training collection. So when we test on training, so we use this for all our
parameter settings.
When we test on training, we're at 93 percent. And then when we run on our collection
we see it's actually not quite as good when we're out here randomly sampling out in the
collection.
Okay. So that's the answer. Now, I find this slide very disappointing, I must tell you.
We did, as you would expect, an ablation study to find out whether the social context was
better than the topical context, you know, which if I was only going to do one, where
should I spend my time tuning.
And it turns out the social context is the winner for Enron and for nonEnron. And as an
IR guy, that's just not the answer you want, right? But when you start working with
conversational media, then, you know, maybe you have to change the way you think.
So the social context is very useful. Now, note the bottom of the graph is not at zero, it's
at 40 percent on mean reciprocal rank. But look what happens in nonEnron. So I get
very little from adding additional context over here. But we get quite a lot. So we're
going from below 50 to above 60 in nonEnron by adding in additional context.
So what's happening over here is I have very weak evidence in all the contexts, and so
I'm picking up more by adding additional context. Over here I don't really have to add
additional context in order to do well. It's pretty much what you'd expect, right? I have a
better shot at picking up things where I'm not already in a dense part of the space.
Okay. So then the question is what happens when you iterate. And so this is just looking
at the social context and the topical context separately. And you can see that you're
getting very small improvements in the first iteration in both cases. So these are really
blown up here. And then it's not helping you, right? So not being able to classify the things that are in that 20 percent is just letting noise in.
Now, I don't have this one broken out yet. This just came in yesterday. I do have some
older data. So this was done with two problems. So our name recognizer wasn't very
good here. So it turns out that we originally grepped for names. So everything in every
identity model we just took as a name and grepped for it.
And our bad luck, there's somebody at Enron named May. So, anyway, this was a really
bad idea.
The second thing is that we didn't have the mailing -- we had a mailing list detector and
we didn't have it turned on when we did this. And so the mailing list soaks up
everybody's name that works at Enron. So there's this somewhat famous mailing list at
Enron that's got a lot of people on it, and it's just every name attached to it.
And so this makes things hard for us, both of those things, or we make them hard for
ourselves. But still we see an improvement in the first iteration of the social context and
not in the topical context, but only in nonEnron. So again in the case where we're weak.
So what I want to do is I want to go back and I want to look at this and break out what's
going on at Enron/nonEnron and see if we start to see that we're getting a large
improvement in the nonEnron, maybe even losing something in the Enron.
Okay. So what we learned. Well, it's too easy a problem. So WhoDat, yeah, you could
tune a hundred different things, but if I want to actually run WhoDat, I'd actually run this.
These are numbers I could use.
And the next thing to do is build WhatsDat, right? So I want to know what's the GE
conference call. If I'm going to try to do WhatsDat, I can't use this evaluation paradigm
because I don't have something that I can build a classification scheme around, like those
e-mail addresses, so I have to do something that's clustering. And so we need a test
collection of a different design.
And so we've been giving some thought to what to do for evaluation of WhatsDat, and
once you have that, we have all the mechanisms for finding the contexts, and so it's pretty
straightforward for what to do there.
The other thing that you could imagine doing after you'd done that is if you had a good
WhoDat, then you could use WhoDat to improve your social contexts, which might be
then used in WhatsDat, to give you a better idea of the topical contexts. And so there
may be a virtuous cycle here that you could go around once you got everything tuned up
reasonably well.
Okay. So the bottom line is that if you want to run this, the data's available at Enron
now. And if you would -- if you're a social network person you might start thinking
about social networks in a somewhat more complex way. So instead of just reading
things out of the header, you could look into the content and have them in essentially the
same vocabulary that the things in the header were in, and now you can start to build typed links that are more than sent from, sent to, received by. And there may be something
here.
I'll give you a quick example of this that was fun. We did this on National Security
Council e-mail. And we built a system that would build quick bios, so what does this person say, what is said to them. And we added to it what do people say about this person.
So there are only 500 messages in this collection. It's just a little toy collection. But
it's -- they're interesting people, they were in the middle of the Iran-Contra affair, they
almost got Reagan impeached. You know, it's sort of high interest to the historians.
And in our bio of George Shultz there's this line that says: George is an idiot. And we're
like, well, was this in something sent by George Shultz, something received by George
Shultz, or something said about George Shultz. And I'll leave you to guess. So that's the
end of the talk.
>> Sue Dumais: Thank you.
[applause]
>> Sue Dumais: We have time for questions.
>>: Have you done a failure analysis to know what other features you might want? Two
that come to mind are -- you mentioned in passing that you are restricting the topical stuff
to things that had happened in the recent -- in a more or less same time frame --
>> Douglas Oard: Topical and social both get restricted that way.
>>: Okay. So presumably if you don't do that it's a real mess.
>> Douglas Oard: Um --
>>: And another kind of --
>> Douglas Oard: Yeah, we didn't tune that very well. And so you could imagine
throwing that in -- what you really want is a discriminative framework here, and you
want to throw time in as a feature, right? And so we're sort of hacking that because we're
in a generative model and so we have to kind of like hide that in our model someplace.
>>: And the other is what external resources do you think would help this? So if you
were running this like in a real enterprise where you had things like address books and
calendars and --
>>: Or charts.
>>: And what?
>>: Org charts.
>>: Yeah, yeah, exactly.
>> Douglas Oard: Yeah.
>>: Those change over time too, which is interesting. So org charts change a lot,
actually.
>> Douglas Oard: Yeah. So we've thought a lot about -- so we have org charts available
for Enron. And we thought about doing that. We don't have calendars available for
Enron. We also have all the trades made by all the traders, so there's a subset of the
collection that are people who we know are traders and we know what their trading
activity is like. And so in cases where that's your focus, you might get some benefits out
of that.
We also have all the phone calls that are made between people -- actually, we have some
phone calls that are made between some of the people in the collection, and we have
some funny cases where people will say things like I can't put this in an e-mail, give me a
call.
And then we have the call, you know, and it's like, well, that's, you know, [laughter]
really -- we're not thinking too clearly here now, are we. And then we also have cases
during a phone call where somebody will say I can't talk to you about that on this line;
call my cell phone. And we don't have that.
So there are some people -- this is sort of like survival of the fittest, right? Yeah, right.
So --
>>: [inaudible]
>> Douglas Oard: Yeah. So, anyhow, so we have a whole bunch of things like this that
we could put around that, and we made a conscious decision at the beginning of this not
to mess up the central question with all those extra features. But it'd be very nice to go
back now and ask can we pick up something with those features --
>>: Or maybe does your failure analysis lead you to believe what resources or what new
features might be --
>> Douglas Oard: We don't have anything yet that comes from looking at the question
that way. We probably should. We did go back and look at why we were having trouble
in the iterative solution, and it pointed very clearly at the two things I showed you. So we
have done a little bit of failure analysis, but we haven't done -- we're very close to the
inter-annotator agreement, and so I'm worried about trying to tune too much further with
this evaluation framework.
We actually need to change evaluation frameworks if we're going to see much more I
think. The evaluation framework we have in mind is that we have e-mail from one of our
faculty members at Maryland who's kept it for 20 years and who's allowed us to publish
on it. And so --
>>: [inaudible]
>> Douglas Oard: What's that?
>>: [inaudible]
>> Douglas Oard: And he has it labeled in certain ways that we could ignore, but he's got
it sorted into folders. And he is very interested in working with us, so he could be an
independent source.
One of the critiques of the approach that we used to the human annotation is that the
humans are essentially looking at the same features that the machines have. And so it
might be that real people could do something -- you know, there are more answers out
there to be found that we can't find. So it may look to us like we're doing better than we
are.
We also have several other e-mail collections, some of which we could conceivably get
people to annotate. So we have a collection of corporate e-mail that is available for
research use but it's available under the restrictions that Census data is under. So
essentially what we do is we frisk you on the way out and we make sure you're not
walking out with anything in your socks. And, you know, pre-pub review stuff.
But you can ship algorithms into that data and work on it. And the reason it's sensitive is
because all those people are still around. But that was data that wasn't made public. It
was deposited at the Library of Congress under a court order, from a bankruptcy court.
So the actual way to get collections like this is to buy bankrupt companies. Because the
companies own the data.
Other questions? Okay.
>>: Do public distribution lists look like this at all when you use them?
>> Douglas Oard: No. So we actually ran this on Usenet. So someplace way back in the
beginning. This is not the right way to get there.
So the Human Language Technology Center of Excellence --
>>: We can help you with that.
>> Douglas Oard: Yeah. Okay. The Human Language Technology Center of
Excellence supported this work. And so they're at Johns Hopkins. And they were
participating in a cross-document coreference task for ACE last year. And that
cross-document coreference task, like many other NLP bake-offs these days, is trying to
get more diverse data.
So instead of, you know, train on this and then test on something that looks exactly like
that, it's, you know, train on anything you can find and now test on, you know, we'll
make a sample of the messiness of the world for you.
So one of the things they threw in was Usenet news. And so we said, well, okay, fine, the
two addresses are the newsgroup and the From address, and let's just turn the crank and
see how this thing works.
And they gave us 10,000 documents, which drove the NLP people nuts because they
were like 10,000 documents, what are we going to do? They've got all these -- you
know, so they were just at this for a long time trying to figure out how to do that at scale.
And we had all the MapReduce stuff running, and so we were like 10,000, you know,
three minutes later. And they were like, okay, but could you please remove all the stuff
we filtered. And it was like, okay, it's five hours to filter the stuff that should have been
filtered.
The upshot of that was that this failed completely in every case. And it failed completely
in every case because there were only 15 Usenet messages in the test collection. And all
of them were Usenet messages in which someone had quoted a news article.
So we completely lost the social structure. Everything looked exactly like news. And so
we learned from that something about building test collections. But we didn't learn from
that anything about what this would look like in public speech.
I actually don't think you need something like this nearly as much in public speech. So
on something like a newsgroup. Because the people who write for a newsgroup are
expecting many people to have to resolve it, so they're not writing with someone else's
negotiated terminology in mind; they've got some broader view so that the feature that
we're trying to leverage is more diffuse and the need is less.
So I don't think it's the low-hanging fruit part of the problem. But it was an interesting
place to look at it and we were very surprised by what the problem turned out to be. So it
taught us a little something about evaluation.
>>: That is the interesting thing, you know, from the NLP perspective: how do people
choose to mention the people that they -- [inaudible] has done some interesting work on
news articles, that there is a pattern to it, you know -- with news articles you
first kind of get the most expansive reference and then you kind of pare down, pare
down, and maybe if you've drifted off the topic, then you need to bring that
person back with a more expansive reference, but that's kind of [inaudible] for that
one single document. But there are patterns.
>> Douglas Oard: Right. And they are --
>>: And that might be more appropriate for the Usenet groups.
>> Douglas Oard: Yeah. And there are patterns that you see with audiences, right? So if
you're reading something in the U.K., you'll often see somebody just referred to by name
that in the U.S. their role is given. But everybody knows that's the prime minister, right?
Except the people in the U.S. who, you know, didn't keep up on like the elections in the
U.K. for the last ten years might not know that's the prime minister. So in the U.S. you
have to tell -- you have to name the role more.
So there are things like this that you can see. But the conversation -- so almost all of the
information content in the world, if you just count words, is spoken. Right? So roughly
the -- I don't know how large Bing's index is, but I did this for Google a few years ago,
and the number of words spoken or heard by people who have cell phones in their pocket
in a day is roughly equal to the number of words that are indexed by Google.
So people -- if you take a Google as a -- as a unit, people speak a Google a day, right?
Now, you'd say to yourself, well, okay, fine, but most of that's trash. It's like, well, okay,
fine, but isn't that what the Web is, right? Isn't that where search is king?
>>: [inaudible]
>> Douglas Oard: What's that?
>>: [inaudible]
>> Douglas Oard: [inaudible] yeah. So most of the content to be found in the world,
there are all sorts of social structures around this, of course, but most of the content that
anybody might ever want to find in the world is being spoken. And so it's mostly
conversational. It's mostly dyadic. And so understanding what to do there strikes me as
being sort of the driving application.
And then you can think about, okay, and how would I take these techniques and move
them out to news stories or to speeches or to classroom lectures. Or, you know, all those
things have their own sort of genre-specific conventions that you might leverage off of.
But conversational speech is the thing we know the least about what to do in the
information access world and it's where our tools probably could be used to the greatest
effect.
>>: [inaudible] had all of his life [inaudible].
>>: He has a lot of it digitized.
>>: Digitized. So presumably his conversations [inaudible].
>> Douglas Oard: Yeah. So this is actually the sort of thing, this life-logging idea,
would be a very natural place for this, because essentially the first insight in this work
was that people must have negotiated what this term meant before they used it. Right?
So you could just -- if you were looking at something and you didn't know what it meant,
you could look back to the beginning where it was first set up, right? So that was our
initial thinking about this.
And if you do that, the first thing you realize is, yes, true, but how often do I negotiate
meaning outside the observables and then use it inside. And so we didn't actually
build off of that insight. But that's where we started this work.
If I had life-logging, then that's actually what I'd want to do. I'd just want to roll back to
the first interaction among these people and roll that interaction forward and learn what
words mean, and I'd probably want to pay the most attention to the things that happen
very early on and the things that happen most recently. Right? So I'd actually start
sticking some bias on that a priori.
And so that -- that'd be fascinating data to look at, but I'd take a different approach than I
took here if I were going to work with that. Because we have at that point more in the
way of expectation of the structure of the interaction.
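A minimal sketch of that kind of prior, assuming timestamped interactions measured in days; the decay constant is an arbitrary illustrative choice, not a tuned value:

    import math

    def interaction_weight(t, t_first, t_last, tau=30.0):
        # Weight interactions near the very beginning (where terms get negotiated)
        # and near the present most heavily; tau controls how quickly the
        # emphasis falls off in between.
        early = math.exp(-(t - t_first) / tau)
        recent = math.exp(-(t_last - t) / tau)
        return early + recent

    # weights = [interaction_weight(t, min(times), max(times)) for t in times]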
>>: How much of negotiation of what terms mean is never that explicit because people
make assumptions, you know, based on what they know about each other's background.
So, for example, in your talk today, you use the terms generative model and
discriminative model. You didn't define that anywhere, but you assumed given your
audience that we would know what that meant.
>> Douglas Oard: So I would agree with you that you would have to do -- you would
also have to look into context, right? I might stick to my original position and say that I
haven't negotiated this with you but I've negotiated it with people like you. Because I've
been to ACL and I have some idea of how people talk there. And I know where I picked
up this concept, right, and I know who it was that first skewered me and said why didn't
you build a discriminative model. And so I can sort of -- I can build a similarity function
among people that way.
So I could still do what I was trying to do, but then I could do some kind of a spreading
activation thing that would move people -- that would move characteristics of prior
knowledge around a social network. So someplace in between those two.
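A rough sketch of that spreading-activation idea, assuming the social network is a dict mapping each person to the set of people they correspond with; the decay factor and number of rounds are illustrative assumptions:

    def spread_activation(graph, seeds, decay=0.5, rounds=3):
        # seeds: people with whom the concept is known to have been negotiated.
        # Each round, a decayed share of a person's activation reaches their neighbors.
        activation = {person: 0.0 for person in graph}
        for s in seeds:
            activation[s] = 1.0
        for _ in range(rounds):
            updated = dict(activation)
            for person, neighbors in graph.items():
                for n in neighbors:
                    updated[n] = max(updated.get(n, 0.0), activation[person] * decay)
            activation = updated
        return activation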
There are some things that I said today that you probably had to get just from
context, right, that I actually, you know, through oversight on my part or not enough time
or not what I was emphasizing, there are probably some things that if you wanted in this
you had to put them together. So I think both are active.
>> Sue Dumais: Let's thank Doug again.
[applause]