>> Sue Dumais: Hey, good morning. It's my pleasure today to introduce Doug Oard
who's visiting us from the University of Maryland where he's a professor in the College
of Information Studies and has several other affiliations.
Today Doug is going to talk to us about identity resolution in large e-mail collections.
But Doug's also done a bunch of work. I first met him when he did work in
cross-language retrieval. He's done a lot of interesting work in trying to analyze and
support access to oral histories of various kinds and other types of conversational media.
And he's also one of the leaders of the effort in the TREC community on e-discovery: how to support people in a more thorough examination of important corporate e-mail and other records related to litigation of various kinds.
So today it's my pleasure to introduce Doug who will, again, tell us about identity
resolution.
>> Douglas Oard: Okay. Great. Thanks, Sue.
So actually this work comes from all of those things. Starting with cross-language retrieval, I realized after a while that I wasn't really interested in language. Sorry to all the NLP people here. What I was really interested in was transformation of language, and transforming from speech to searchable written form was just as interesting as transforming from one language to another, just as interesting as OCR, just as interesting as handwriting.
So I'm interested in searching language, but more specifically in searching transformed language.
Then I got involved in all this oral history stuff and started hanging with the speech folks.
And what I learned was that speech folks are very difficult to work with, not because they
have difficult personalities, they're actually very nice people, but because they have very
high resource requirements in order to get that speech converted.
So the translation people are great fun to hang out with because, I mean, all they do is
count the words, which is pretty much what IR people do. And you'll end up with usable
representations at least for search.
But the -- to work on speech, you have to do all of this acoustic processing, all this
acoustic transformation. Before you can get anything, you need a fairly strong language
model in order to recover some sense out of that signal. And so speech is just much
harder to work with than foreign languages.
So that motivated me to say, well, let's look at what I really care about, what I've learned
I really care about, which is in conversational speech the conversational aspects to it.
And so that's where this talk comes from. It intersects with a problem where a lawyer
walked in one night at nine o'clock at night to my office after I'd been teaching and he
said, I have a problem. And I'm like, oh, this is very bad when lawyers walk into your
office and say they have a problem. It's particularly bad at nine o'clock at night when
their problem is so urgent that it couldn't wait until the next day.
So here's the problem that he had. The Clinton administration had deposited 32 million
e-mails at the National Archives. Now, they didn't actually deposit at this National
Archives. This National Archives is downtown and you can go see the Constitution and
the Bill of Rights and things like that. They deposited at Archives II, which is their
storage facility, which is on our campus at the University of Maryland.
So all of the electronic records from all of the presidential administrations since Reagan
are located in College Park. And this lawyer had gotten a request, and the request was to
find all of the e-mails in that collection that were about tobacco policy. And so this was a
part of tobacco lawsuit.
And normally what the National Archives does is they say, well, here's all the stuff,
you're welcome to come, you know, flip through the pages and find what you need, not
our problem once we've made it public.
Problem is you can't make this stuff public because there's all sorts of stuff in the Clinton
e-mails that are fairly sensitive. So, you know, there's like invasions of countries, there's
a whole bunch of stuff on the Presidential Management Intern Program. So there are all
kinds of things that would need to be reviewed. So somebody's going to have to look
through all 32 million e-mails one day, decide which can be made public. And this will
happen, you know, a long time after the Clinton administration.
And by then we think the Obama administration is going to put something on the order of
a billion e-mails into the National Archives. So there are several hundred million e-mails
from the second -- from Bush 2, from the George W. Bush administration.
So a billion e-mails seems pretty reasonable. And we're never going to be able to
manually review all those. So, in any case, we have to think about how are we going to
support this process.
Well, here's what they actually did. They built a bunch of Boolean queries for things like
Phillip Morris Institute, or PMI, which also turns out to be Presidential Management
Interns, so they had to like not Presidential Management Intern. So they made their
Boolean queries, and then they hired 25 lawyers to work on this for six months to
actually look at every one of the 200,000 documents that came back from the Boolean
query.
And so you look at this, and this is what he wanted to solve. Right? This was the
problem that, you know, he said can we make this, you know, 3,000 or something and get
it to the things that we want. And so this is what the TREC Legal Track does. And that's
where the TREC Legal Track came from was that evening conversation.
This is the problem. After they did this they decided that 80,000 of those e-mails should
be released to the other side. A hundred thousand of those e-mails were relevant, but 20,000 of them were subject to a claim of privilege. So they put those 20,000 in privilege logs and they released 80,000.
So the actual original question here, how can I make this process more efficient, is very
hard to get traction on because the requests that come in are so broad that an awful lot of
stuff gets released.
Quick example from the TREC Legal Track last year. We were running a model of this process on 7 million documents, actually on tobacco documents, not on e-mail. And in one of the requests -- it was great fun to watch the lawyers actually sue each other, and then we take the suits and we put them into the process -- we did an interactive search and we found 700,000 relevant documents out of 7 million. A 10 percent yield rate -- actually, an 11 percent yield rate on that particular topic.
So it's very hard from an IR standpoint to get traction here. But there's a second problem,
and that is that now we've dumped 80,000 e-mails on these people who didn't work in the
Clinton White House, they don't know who these people are, they don't know what these
words they're using mean, they don't know which things are happening at which times.
So now the question is how do you make what you might think of as sense making tools
for these people. Right? Okay.
So e-mail is interesting because it's conversations. Conversations in e-mail are
interesting because they're simpler to get at than conversations in speech. But what we
learn in e-mail might be useful there. So there's the technology path and here's the
application.
So here's the specific problem that I want to talk to you about today. This is from the
Enron collection. The Enron collection is what we used for all the development here.
We have other e-mail collections that are difficult to share. This one's very widely
shared.
And so you can see here Kay Mann says to Suzanne Adams did Sheila want Scott to
participate. And there are lots of questions you could ask.
So what was this GE conference call about? Now, I've got GE conference call, I've got
the date, I know who's asking. I could go get their calendar if there's a lawsuit going on
and I could probably make sense of that. Right?
Who's Sheila? Well, I might have an org chart and I might know who's related
organizationally to Kay Mann that's named Sheila. And so I might be able to make sense
out of that.
Who's Scott? Right? So we don't know when the call is, but we do know that the call is
too late for Scott to participate in, so we might get some idea where Scott is or something
out of that. So there's a lot of sense to be made out of this. And I want to talk about only
one problem, which is WhoDat, right? So who's Sheila.
So the entire talk today on that. Who cares? Well, I've told you already the lawyers care.
You can imagine that you show up and you see somebody's computer, it's got a bunch of
e-mail on it, they're engaged in some criminal activity. You want to make sense of what's
going on there. Sometimes under very tight time pressure. Right? So law enforcement
or intelligence or whatever might care.
And then historians care. Now, I'm driven by trying to help the historians. Nobody
writes letters anymore but people are writing a lot of e-mail. And so we're going to have
quite a lot of that e-mail make it into the future. But the historians have two problems.
They have no idea what they would want such a system to do because they don't have
these kinds of collections or tools yet, so they're not the best people to ask about what we
should be doing. And they have no money, so they're not the best people to ask for the
resources to do it.
But, on the other hand, given that the lawyers and law enforcement are both interested in
this, what we can do is we can do Robin Hood, right? So we can get the resources and
the insights that we need into the problem, and then we can solve the problem for not just
lawyers and law enforcement but for everybody.
Now, I want to make a distinction in my interests between this sort of exploitation task,
the information is sitting there static and I want to understand it, and personal information
management.
This is not the problem you would start with if you were interested in personal
information management, because if this is my e-mail, I probably know who these people
are. So you have to really think about the sense-making standoff task from not being
inside to understand this.
Okay. So we're back to the problem. And it turns out in this collection after you do a
little bit of looking around, we can associate 55 people with the name Sheila. So we've
seen Sheila someplace in the collection that causes us to believe that there are these 55
different Sheilas. We might be a little wrong on that, there might be 53 or 59, but we've
got fairly good models. I'll show you how good the models are of how many Sheilas
there are.
So we can think of this problem as simply tell me which of these Sheilas it is. Now, it
turns out for every one of these Sheilas we know an e-mail address. These are people
who sent or received e-mail in the collection. So you might wonder how many people
are referred to, like Clinton got referred to a lot in the White House e-mail, did Clinton
have an e-mail address. Bush got referred to a little bit in the Clinton e-mail. Did Bush
have an e-mail address in the Clinton White House, right?
So there are people who don't have e-mail addresses. But we're going to make a simplifying assumption that the people whose references we're trying to tag do have e-mail addresses. Then if I can tag that reference, we can use that e-mail address as a pivot and build sort of a bio on that person.
And then we can come back at the end and ask how reasonable was that assumption to
have been made, and if it wasn't good, then we can build a classifier that tries to guess
whether people have an e-mail address and to do something with the people who don't.
Okay. So I imagine that I want to send Sue an e-mail and I want to refer to somebody
who sent me an e-mail recently, and I say, hey, Sue, I just got an e-mail from Mark.
Now, that's not going to make a whole lot of sense to Sue, Sue knows probably a lot of
people named Mark. I know a lot of people named Mark, and we probably know several
people named Mark in common.
So there's no context for Sue to attach Mark to the person. So I need a more complex
model than that. I need a way of knowing what kind of reference I can use with Sue.
So if I say something in the context of SIGIR 2004 conference, well then it's probably
going to be pretty clear that it's Mark Sanderson because he was the general chair of the
SIGIR 2004 conference. So I could have -- after I choose a person to mention, which in
this case is not Mark, I could pick some context.
So we wanted to mention here this person. And so Kay Mann wants to mention that
person, in the context of a GE conference call, now which people are associated with
things like GE conference calls, that will tell me what kind of mention to use. Do I have
to use a full-name mention or a first-name mention, or can I use some nickname.
Now, once you see this generate a model, it's pretty obvious how to solve this. All you
do is you reverse the process. You sprinkle a little Bayes rule fairy dust on this. And
instead of estimating the probability of a mention given the person, you want to estimate
the probability that it was the person that you were intending given the mention. Right?
And so that's all we're going to do.
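A minimal sketch of that Bayes inversion, assuming the prior over candidates and the per-candidate mention model are already available as dictionaries; the addresses and probabilities below are made up for illustration, not taken from the actual system:

```python
# Sketch of the Bayes inversion described here (illustrative values only).
# p_mention_given_candidate[c][m]: how a person c tends to be mentioned.
# p_candidate[c]: prior probability of mentioning person c at all.

def resolve(mention, p_candidate, p_mention_given_candidate):
    """Return candidates ranked by P(candidate | mention) via Bayes rule."""
    scores = {}
    for c, prior in p_candidate.items():
        likelihood = p_mention_given_candidate.get(c, {}).get(mention, 0.0)
        scores[c] = prior * likelihood          # proportional to the posterior
    total = sum(scores.values())
    if total == 0.0:
        return []                               # mention unknown for every candidate
    return sorted(((s / total, c) for c, s in scores.items()), reverse=True)

# Hypothetical example:
priors = {"sheila.walton@enron.com": 0.6, "sheila.tweed@enron.com": 0.4}
likelihoods = {
    "sheila.walton@enron.com": {"Sheila": 0.8, "Sheila Walton": 0.2},
    "sheila.tweed@enron.com": {"Sheila": 0.3, "Sheila Tweed": 0.7},
}
print(resolve("Sheila", priors, likelihoods))
```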
Okay. So to do that we have to go through three things. We have to know who the
people are and what names they can be referred to by. So we'll call this the identity
modeling problem. And then we have to model the contexts. And then it's very
straightforward to do the Bayes inversion and to estimate the probability that any one of
these people might have been the mentioned person. So this comes together at the end
once you do these two steps.
Okay. So step one, let's build a computational model of identity. Where do we get it.
Well, since we have e-mail, we can look at three kinds of things that go on. So here's an
e-mail from Elizabeth Sager that Elizabeth signs Liza. Now, I might not have guessed
that Elizabeth would have been called Liza. I might have had some lookup table, but that
would make all the Elizabeths into Lizas.
So but I can look and see that this person signed an e-mail this way. If this is the only
time they ever did it I can say, okay, they're not very often referred to this way. And if
they do this all the time I can say this is how they're normally referred to. So I can build
a probability distribution across the types of references for each person.
Okay. Now, this is based on user behavior. So it's just a social convention that we stick
our name at the bottom of e-mails and we stick different names at the bottom of e-mails
depending on who we're talking to. Right? So I sign my e-mails differently with my
mother than I do with people who I work for and than I do with people who work for me.
And so you could build a more complex model with a hidden variable in it for the kind of
relationship which we don't have. Right? So we're simply going to note that this person
signed their name this way this time.
Okay. Now, that's a user behavior. That's the hardest thing to get at. The easiest thing to
get at is that we can pull things out of e-mail standards. So, for example, we can find out
that Elizabeth Sager is Sager, Elizabeth, and so we get some idea now which one's the
first name and which one's the last name, which we might not have been able to guess as
easily before. Right?
So it's very easy to get stuff here, but it might not be nearly as informative. It's more
complex to get stuff here. And then the third place you can get stuff is what the mailers put in, which depends on how the mailer is actually written. This is free text, but it's written by a machine, and so we can just reverse engineer some standard clients and pull stuff out.
Okay. So we do this with rules, we do this with rules, and I'll show you how we do that.
Okay. So this is just one of the rules firing that says, okay, here's a name and we found
another -- or we found the full text name with it. This is an artifact of the way the Enron
collection was put up by CMU, that you have to make these associations. Normally
they're given more tightly in the standard.
This is rules firing to get things out of the quoted text. And here's the somewhat
interesting one, how do you know that Sheila is a signature. And what you do is you just
take all the e-mails and you hold them up to the light and you see what's the same in all
of them. Right?
And so if you do that, you'll see pretty quickly that the signature block is there. And you
can do the same thing for salutation blocks. So you just tail-align for the signature blocks and head-align for the salutation blocks. You have to do things like get rid of
all the included text to do this so you know where things actually end.
And so there's some cleanup parsing ahead of that that's all done with manually written
rules.
Okay. And then it turns out the social convention is you stick the handwritten part of your signature above the automatically included signature block, so the only little trick here
is you go to the top and then once you've done that, you can just stop-word out things like
thanks and pick up what we call nicknames, right, the ways people informally refer to
each other. Okay. And then you can make that association. And you can do the same
thing for the salutations.
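A rough sketch of the hold-it-up-to-the-light idea, under the assumption that the trailing lines shared across one sender's messages are the fixed signature block and the short line just above it is the handwritten sign-off; the helper names and toy messages are mine, and the real hand-written rules do considerably more cleanup:

```python
from collections import Counter

CLOSINGS = {"thanks", "thanks,", "regards,", "best,", "cheers,"}

def common_tail(bodies, min_share=0.8):
    """Trailing lines shared across most of one sender's messages:
    the automatically appended signature block."""
    tails = Counter()
    for body in bodies:
        lines = [ln.strip() for ln in body.strip().splitlines() if ln.strip()]
        for ln in set(lines[-4:]):            # only look near the bottom
            tails[ln] += 1
    return {ln for ln, n in tails.items() if n >= min_share * len(bodies)}

def nicknames(bodies):
    """Handwritten sign-offs found just above the shared signature block."""
    block = common_tail(bodies)
    found = Counter()
    for body in bodies:
        lines = [ln.strip() for ln in body.strip().splitlines() if ln.strip()]
        personal = [ln for ln in lines if ln not in block]
        if (personal and len(personal[-1].split()) <= 3
                and personal[-1].lower() not in CLOSINGS):
            found[personal[-1]] += 1          # e.g. "Liza"
    return found

msgs = ["Can you set up the call?\n\nLiza\nElizabeth Sager\nEnron Legal",
        "Looks fine to me.\n\nElizabeth\nElizabeth Sager\nEnron Legal"]
print(nicknames(msgs))   # Counter({'Liza': 1, 'Elizabeth': 1}) on these toy messages
```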
And so we get models that look like this. This is an actual model out of our system. And
so when I said we know 55 people named Sheila, what I meant was there are 55 models
in which Sheila appears as a name. Right? Now, we can get Sheila one of two ways.
We can either go back and we can parse this name -- actually three ways. We can parse
this name or we can parse that one.
So we've got 55 in which Sheila is included in a way that we think would be useful.
There are 133,000 e-mail addresses in the Enron collection, which is a collection of only 149 people's mailboxes. So I was a little surprised by how large that number is. There are 250,000 unique messages in the collection. And so about one out of every two messages adds a new e-mail address to the set we know exists.
And this collapses it by about a factor of two. Right? So we can sometimes put together
a home and work e-mail address if we observe the same unusual first and last name. So
when we -- we use full names to fold addresses together.
We're not doing anything to try to figure out if somebody is intentionally maintaining two
separate identities. There are some cases in Enron where two people use the same e-mail
address.
So, for example, an executive and their assistant. And the assistant will actually sign
messages with their name, not with the executive's name. So it's quite clear what's
happening. But we don't try to process that at all. We just treat that as the executive has
their assistant's name. Right? So that messes up our models a little bit.
In some of our early processing, I'll show you at the end -- well, I can show you at the
end, if you want to see it, what happened. We weren't processing mailing lists, and they
really screw up this model, because the mailing lists get a capture on every name on the
list. Right? So you have to be careful about things like that.
Okay. So two things we could ask here. One is how well can I use these identity models
to do the resolution I want to do in stage three. But we could also ask how well have I
done -- how well have I built my identity model in the first place. And if I've built it
poorly, then probably going further doesn't make sense.
So the first step we did was we said, okay, let's just take all the names that we learn from
main headers that we saw only once, right, very weak evidence, and let's just see how
often those are right. So probably it will be right some of the time and wrong some of the
time. And then if we have stronger evidence, we'll see how much better that is.
We'll do the same thing for the quoted headers where now we're reverse engineering the
regularities of the systems. If we see it in both headers, fine.
And then we'll also look at these address nickname associations, which we expect might
be more informative but also more error prone. Right? So, for example, we have some
where we have a "hi, all," and so somebody's name ends up being "all." And it's like,
well, you know, it probably wasn't the right thing to do.
So, anyway, we measure all of this with a stratified sample of 600 across these conditions, and so we can measure probabilities with about 5 percent error bars on them.
>>: How do you know when you're right?
>> Douglas Oard: So what we do is we hire the next graduate student over and he tells
us what he thinks. And we don't do any inter-annotator agreement. There's no test
collection being built here. This is actually coming out of the system. Right? So we just
get the data out of the system, we have one person look at it, and we see how well that
person agrees with the system. So the person does know that the system hypothesized
this.
Okay. There are four possibilities. We might be wrong. This is an actual case where
we're wrong. So we think kmpresto's name is home e-mail. It's like, well, you can see
how we made the mistake. If you were, you know, teaching school you might give your
kids partial credit for this, but, you know, not in our business.
There's another one where we learn that June Rick is June Rick, and it's like, yeah, right,
okay, good, it's correct, but I don't really care.
And then down here we learn that this person's name here is Phyllis and we didn't have
any good clues for that. So the question is how often does this happen. That's really
what we care about. But we'd also like to have that happen not very often. Right? So if
this happens, there's not much we can do about it, but we really want to avoid this and we
really want to get that.
Okay. So here are the results. This is how often we're correct. The short version is if we
get it out of the headers, we're always correct. So this is everything we took out of the
headers, that's the 300 samples, all right, and we're right all the time.
Okay. So we don't have the problem with processing data out of the headers. We do
have a problem with processing data out of the salutations and signatures, so we're wrong
about 10 percent of the time.
And in the very worst case, which is very weak evidence coming out of the signatures
where we have to find the top of the signature block on that sort of thing, now we're
wrong 20 percent of the time.
So you look at this and you're like, well, not bad for a bunch of rules. We don't see
numbers like this that often on the first try, let's not worry further about this.
Okay. Now, over here, this is how often we learn something. So this is an indication of,
you know, did we do what we tried to do; this is an indication of was it worth doing.
And you can -- if you just focus on the overall number, you can see them stratified over
here, that we learn something that is very informative.
So this isn't the case where we learn how to split a name, which was that middle case; this
is the case where we actually learn something we could not have observed. And we learn
something about half the time coming out of the headers. Right? And this is very easy
data to process. We're very accurate and we learn a fair amount. So this is low-hanging
fruit.
Here we only learn something about 20 percent of the time coming out of the salutations
and signatures. And it's kind of interesting, the case where we have the weakest evidence
coming out of the signatures is the case where we happened to learn the most, which was
the case where we were making the most mistakes over here.
So this suggests that if you're going to work on this problem further you should actually
work on the low-data part of the problem, right, that the -- the cases where you haven't seen things very often have got information in them that is worth looking at. Right?
Every once in a while you sign your name a certain way.
Okay. Anyway, that's not the main purpose of the thing. That's just to get you to
something like this. So what you want to do is you want to say for these people I'm
going to assume some kind of a prior, that this person isn't called Sheila very often and
that person's called Sheila a whole lot.
And to get to that we have to adopt some kind of a frequentist interpretation of
probability, and this is very straightforward. Right? So this says how often do I see this
e-mail address out of all of the e-mail addresses that I see. Okay. So this person gets
talked about a lot.
We can do that all the way down to how often do I see this type of name -- so, for
example, a first name, a last name, or a nickname -- mentioned this way. So the
particular nickname S.G., how often do I see that given it's this candidate over all the
S.G.s that I see.
Right? So it was just a frequentist interpretation probability. It takes the models that you
saw previously and it hallucinates probabilities, right, that are essentially priors.
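A minimal sketch of those frequentist estimates: the prior for a candidate is the relative frequency of observing that address, and the mention model is the relative frequency of each way of referring to that candidate. The observation log below is hypothetical:

```python
from collections import Counter, defaultdict

# Hypothetical observation log: (candidate_address, mention_type, word)
observations = [
    ("sheila.walton@enron.com", "first", "Sheila"),
    ("sheila.walton@enron.com", "nick",  "S.G."),
    ("sheila.tweed@enron.com",  "first", "Sheila"),
    ("sheila.tweed@enron.com",  "first", "Sheila"),
]

cand_counts = Counter(c for c, _, _ in observations)
mention_counts = defaultdict(Counter)
for cand, mtype, word in observations:
    mention_counts[cand][(mtype, word)] += 1

def p_candidate(cand):
    """P(candidate): relative frequency of seeing this address at all."""
    return cand_counts[cand] / sum(cand_counts.values())

def p_mention_given_candidate(mtype, word, cand):
    """P(mention type and word | candidate): relative frequency within that
    candidate's observed references; zero counts stay zero (no smoothing)."""
    seen = mention_counts[cand]
    return seen[(mtype, word)] / sum(seen.values()) if seen else 0.0

print(p_candidate("sheila.walton@enron.com"))                       # 0.5
print(p_mention_given_candidate("first", "Sheila",
                                "sheila.tweed@enron.com"))           # 1.0
```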
Okay. Then we do something that will make you cringe. Any machine learning folks in
the audience? So this was okay. But now what you want to do is smooth, right? Okay.
So we don't -- we just -- if it's zero probability in the prior, we just ignore it. And this
gets our set of candidates down.
Okay. So you could probably do a little bit better here, but, you know, bring money
because you're going to now be computing over the full set of identities, which in our
case is 77,000. But if you run -- I don't know, do you guys run an e-mail service here,
Hotmail or something like this? So you try to do this over Hotmail, you'd be at it for a
little while.
So this is a convenient assumption, and we'll just ask how well does it work if we just
ignore all the zero probability cases. So we know we're throwing out some good ones
here.
Okay. So we're back to the problem --
>>: That example had an interesting -- the only people who you associated the name
Sheila with were female. Do you use any gender information?
>> Douglas Oard: So we have no gender information available to us, so we don't have -- we're just processing the e-mails. We don't see address books, we don't have side
information, we don't have org charts. So there's a lot you could do if you had side
information.
And indeed these are just pretty pictures. Right? I mean, you know, in fact, I think one
of them is one of our faculty members. So we just -- we're grabbing pictures off of some
photo site to make exactly this point; that, you know, there aren't a lot of Sheilas that are
guys.
Okay. So, anyhow, we're back to the original question but now we're on this second
piece, context reconstruction. So I want to find the context of this mention. So the first
thing we're going to do is we're going to say, well, if you do coreference resolution you could say, well, the context of this mention is different from the context of that mention.
But I'm not going to do that. I'm just going to say the context of this mention is the
context of the e-mail. Okay. So another simplifying assumption that anything in the let's
say topical context of this mention will also be in the topical context of this mention in
exactly the same way.
Okay. So now all I have to find is given this e-mail find me things in its context. So how
many contexts can we think of? I can think of five. Maybe you can think of more. But
here are the ones we started with.
We might find the answer in the e-mail. So somebody could say did Sheila want Scott to
participate, and it could be cc'd to Sheila, some Sheila. Right? And I could say, oh, that's
a pretty good guess; that might be it.
In fact, a collection has been made this way. Right? And all you do is you take out the
cc, hide it from the system, and see if it can guess the cc. Right? So you just go
recognizing those cases, it's sort of a classic way of building a known item retrieval test
collection.
So I'll show you results on that. Okay. So, in any case, we just give our full mechanism
the local context and we say if you can find an e-mail address for a person for whom Sheila is the name, then there's some information.
The second possibility is that you might use the thread structure. Now, the thread
structure is a little bit dangerous to use for a couple of reasons. One is the thread
structure is not given in the Enron collection. So you have to reconstruct it and you may
make a mistake reconstructing it.
The second thing is even if you reconstruct it correctly, this is the reply chain structure,
and that's not necessarily the sort of semantic thread structure. This is merely every time
I hit reply.
So, for example, I was here, I don't know, half a decade ago, and so I sent a message to
Sue that said I'm coming up. And where did I get her e-mail? Well, from an e-mail half
a decade ago, right, and I just hit reply and all the previous stuff was in there and, you
know, hey, Sue I'm coming back.
So these aren't always perfect, and there are things you could do to clean it, but what we
do initially is we simply thread in the obvious way based on subject lines. And that's a
really bad idea. Because any of you get messages that have nothing in the subject line?
So the way we thread on subject lines is we stop-word the word "re" and index the
subject line.
And so what happens now is everything that had a blank subject line becomes a single
thread. It's like, well, that's probably a bad idea. So the second thing you do is you go in,
you find all the included text, and you search with the included text for the message that
that text was included from. And you can infer threads that way. And that's a much
stronger way of doing it. And if you have these two, it sort of triangulates it.
So we have reasonably good threads and we don't work any further on that. But you
could do more on this if you wanted to.
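A rough sketch of the two threading signals, assuming messages are simple dictionaries with a subject and a body; the quoted-text lookup here is exact containment, a simplification of the retrieval-style matching described above:

```python
import re
from collections import defaultdict

def norm_subject(subject):
    """Strip 're:'/'fw:' prefixes; blank subjects give no threading evidence."""
    s = re.sub(r"^\s*((re|fw|fwd)\s*:\s*)+", "", subject, flags=re.I).strip().lower()
    return s or None

def subject_threads(messages):
    """Weak evidence: group message ids by normalized subject line."""
    threads = defaultdict(list)
    for mid, msg in messages.items():
        key = norm_subject(msg["subject"])
        if key is not None:
            threads[key].append(mid)
    return threads

def quoted_parent(mid, messages):
    """Stronger evidence: find the message whose text appears as quoted text."""
    quoted = "\n".join(ln.lstrip("> ").strip()
                       for ln in messages[mid]["body"].splitlines()
                       if ln.startswith(">"))
    if not quoted:
        return None
    for other, msg in messages.items():
        if other != mid and quoted in msg["body"]:
            return other
    return None

msgs = {
    "m1": {"subject": "GE conference call", "body": "Did Sheila want Scott to join?"},
    "m2": {"subject": "RE: GE conference call",
           "body": "Too late for Scott.\n> Did Sheila want Scott to join?"},
}
print(subject_threads(msgs))        # {'ge conference call': ['m1', 'm2']}
print(quoted_parent("m2", msgs))    # 'm1'
```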
The topical context is where we had the most fun. So we're IR folks, right, this is the
Ph.D. dissertation work of Tamer Elsayed, who is -- does IR work. And so to get a
topical context, what I do is I take this message, this focal message, and I say what other
messages are in its topical context.
So this is my query. I index these things. We count the words in the usual way. There's
some tf-idf floating around, and we find out that this thread here is in a strong
relationship to that, this thread here is in a strong relationship to that.
Now, I can take the entire thread here as a query and find threads. I can take the message
as a query and find messages. And you can play around and ask what's the right thing to
do, and it turns out those are both the wrong thing to do.
So this is actually one of the really interesting things about conversational content is that
in IR we make two assumptions that make the IR problem easy.
One is a document is a bucket that contains information which, when it's language, it's a
bucket that contains words, so that's the bag of words assumption. And we have no
boundaries on our bucket, now, right? So my bucket is kind of bleeding between these
things. So is this a document or is this a document or is that a document? I don't quite
know. Right? So nobody's given me document boundaries.
The second thing is we assume that I can just show you the document and you can say,
oh, yes, that's the one I want to see, which of course is what we're trying to fix with the
entire talk. So we're breaking both assumptions. So I have to say what a document is for
the purpose of finding topical context.
So here's how we define the document. A document is every message between the focal
message and the root. So it's here, the root of the thread. Or in this case it's this path,
right? And so the similarity between this e-mail and this e-mail is the similarity between
the set of words here and the set of words there.
And if you play around with lots of different possibilities, this one comes out well. And
if you look at it, it sort of makes sense, right, that the other side of the thread probably is
less informative, got more opportunity to be adding noise.
So that's a nice heuristic that you can walk away with and say just the thread back to the
root is a handy first thing to try, right, which, after we tried several things, worked best
for what I'll show you.
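A minimal sketch of that path-to-the-root representation: the pseudo-document for a focal message is the bag of words of every message from it up to the thread root, and topical similarity is cosine over tf-idf of those pseudo-documents. The parent map and toy messages are illustrative:

```python
import math
from collections import Counter

def path_to_root(mid, parent):
    """Messages from the focal message up to the thread root."""
    path = [mid]
    while parent.get(path[-1]) is not None:
        path.append(parent[path[-1]])
    return path

def pseudo_doc(mid, parent, text):
    """Bag of words over the whole root path: the 'document' for topical context."""
    words = []
    for m in path_to_root(mid, parent):
        words.extend(text[m].lower().split())
    return Counter(words)

def cosine(a, b, idf):
    dot = sum(a[t] * b[t] * idf.get(t, 0.0) ** 2 for t in a if t in b)
    na = math.sqrt(sum((a[t] * idf.get(t, 0.0)) ** 2 for t in a))
    nb = math.sqrt(sum((b[t] * idf.get(t, 0.0)) ** 2 for t in b))
    return dot / (na * nb) if na and nb else 0.0

text = {"m1": "did Sheila want Scott to participate in the GE conference call",
        "m2": "the call is too late for Scott to join",
        "m3": "notes from the GE call with Sheila Walton",
        "m4": "please send the revised contract to legal"}
parent = {"m1": None, "m2": "m1", "m3": None, "m4": None}

docs = {m: pseudo_doc(m, parent, text) for m in text}
n = len(docs)
idf = {t: math.log(n / sum(1 for d in docs.values() if t in d))
       for d in docs.values() for t in d}
print(cosine(docs["m2"], docs["m3"], idf))  # m2's root path shares GE/call/Sheila with m3
```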
So here's an example. I've got the GE conference call, is the e-mail you saw before, came
in to Suzanne Adams. And these three words are common to this document also. So I've
got Sheila GE call. And so if there were a Sheila that was known all the way down to the
e-mail address here, which there is in this case, then we can say, okay, maybe it's Sheila
Walton.
Now, look, I'm not doing anything semantic, I don't know if this is the same GE issue -- we do look and make sure that we're at least in the same time frame, so this is December
15th and that's December 20th. So we put some bounds on the time frame to do this. But
we just say, okay, fine, this message is in the topical context of that message. Right?
And so I will pick up some probability now that this Sheila should be emphasized over
others. That's independent now of what I did before in the identity modeling. Right? So
the identity model is the prior and then this is the observables. Okay. So we pick, you
know, that Sheila.
Now, the other place you might go, and if you're a social networks person you've been
sitting here going, oh, gotta do social networks, is you might go looking off in the social
network and find a social context.
And so here's an example of a social context. Kay Mann has sent an e-mail to Suzanne
Adams and now we see somebody has sent an e-mail to Kay Mann and they mention
Sheila Tweed. And we're like, well, I don't know about that topical context, that might
have been trash, but I know for sure that Kay Mann knows Sheila Tweed. Right? So
there's a Sheila that I've got some belief in.
The problem is here I don't know Sheila Tweed's e-mail address. So now I have to take
this Sheila Tweed and go do resolution on it. And so we're going to be at this for a while.
Okay.
Oh, so to do this we cheat because we're IR people, so we don't know anything about
social network analysis. So what we do is we say, well, previously all the words in the
e-mail, they were the thing that we built our model on. Well, now we just build it on all
the e-mail addresses in the e-mail or in the thread or in the path back to the root or
whatever it is we're doing.
And so if like the archive server is always getting cc'd, we'll learn to give that low IDF.
Right? So in fact this works remarkably well as a quick hack, is just forget the fact that
these things aren't words and treat them as if they were. Right?
And what we lose essentially is that there's not really very good TF evidence here. We
do see TF because we have multiple messages in the path back to the root, but it's very
spotty because the messages are short.
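A minimal sketch of the pretend-addresses-are-words trick, assuming each message's context has already been reduced to the addresses seen along its root path; IDF weighting is what keeps an archive address that is cc'd on everything from mattering:

```python
import math
from collections import Counter

# Hypothetical contexts: the addresses seen in each message's root path.
contexts = {
    "m1": ["kay.mann@enron.com", "suzanne.adams@enron.com", "archive@enron.com"],
    "m2": ["kay.mann@enron.com", "sheila.tweed@enron.com", "archive@enron.com"],
    "m3": ["trader@enron.com", "archive@enron.com"],
}

n = len(contexts)
df = Counter(addr for addrs in contexts.values() for addr in set(addrs))
idf = {addr: math.log(n / d) for addr, d in df.items()}

def social_similarity(c1, c2):
    """Cosine over IDF-weighted address 'term' vectors (TF is mostly 0/1 here,
    which is the weakness mentioned in the talk)."""
    v1, v2 = Counter(contexts[c1]), Counter(contexts[c2])
    dot = sum(v1[a] * v2[a] * idf[a] ** 2 for a in v1 if a in v2)
    n1 = math.sqrt(sum((v1[a] * idf[a]) ** 2 for a in v1))
    n2 = math.sqrt(sum((v2[a] * idf[a]) ** 2 for a in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

print(social_similarity("m1", "m2"))   # tied through Kay Mann, not the archive address
print(social_similarity("m1", "m3"))   # 0.0: only the archive address in common
```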
Okay. So in that case this will tell us some other Sheila, which, if we could resolve it, we
could use.
Okay. Now, if you're a modeler, at this point you're thinking I've got a lot of
information -- oh, I told you there were five, we only used four because the Enron collection that CMU posted has no attachments.
We're in the process, for the TREC Legal Track, of building an Enron collection with attachments, but that's not yet widely distributed, so this work is done with the distributed one. But coattachment would be a fifth one. Right? So I see the same document with
the same check sum and I'll just note that it's been attached to these two e-mails and that
will define a fifth context for me.
Okay. So now I've got four contexts. And what I want is I want some way of saying
how strong this attachment is. So we'll ask what's the probability that in some context,
let's say the topical context, of some e-mail, let's say e-mail 42, we should be attaching
some other e-mail, let's say e-mail 6.
So that's the representation. And I've got four different ways of doing it. So if you're a
modeler at this point you go off and you get some held-out data and you say let's learn
this. And if you're not, you say, well, let's just add them up and divide by four. Right?
So, actually, if you're a modeler, you'd say that was a uniform distribution and, you
know, that you had made an assumption that only a single context -- but we just add them
up and divide by four. Okay?
So this is another place where if you believed that many contexts were active at a time,
you should do a log linear model or something like that. All right? But if you believe
that only one context was active at a time, this actually isn't such a bad thing to do.
And since we have no belief on that, adding up and dividing by four was convenient.
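A minimal sketch of the uniform mixture just described: the strength with which another e-mail attaches to the focal e-mail's context is the average of the four per-context scores. The weights could be learned instead if you believed several contexts were active at once:

```python
def context_strength(per_context_scores):
    """per_context_scores: dict with one attachment score per context type,
    e.g. {'local': 1.0, 'thread': 0.0, 'topical': 0.3, 'social': 0.7}.
    The choice described in the talk is a uniform mixture: add them up and
    divide by four."""
    weights = {"local": 0.25, "thread": 0.25, "topical": 0.25, "social": 0.25}
    return sum(weights[c] * per_context_scores.get(c, 0.0) for c in weights)

# Hypothetical strengths tying e-mail 6 to the context of e-mail 42:
print(context_strength({"local": 0.0, "thread": 1.0, "topical": 0.3, "social": 0.7}))  # 0.5
```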
Okay. So what that means is I can rethink my problem as if I've got a mention here and
through the e-mail that it's mentioned in I can attach through the topical context Sheila
Walton, I can attach Sheila Tweed through the social context with some strength, I can
attach JSheila, an actual observed e-mail address through the social context, and you can
see that I can build this very complex graph structure on which I can now reason.
Okay. So you could look at this as a graph reasoning problem. But you can also take a
simpler approach. So here's the very simplest thing you could think of doing. Actually,
very simplest thing we could think of doing.
We want to find the probability of a candidate given a mention and the context. Okay.
So we can approximate that as a probability to a candidate given a mention and forget the
context. Context-free resolution. Right? And just apply Bayes rule. And we can get
that probability not just for Sheila but we can get that probability for everybody.
Now, the problem is there are 1.3 million references to people in the 250,000 e-mails in
the collection. And so you're going to be doing this for a little while. But after you get
done, you'll have context-free probabilities for all of these. In fact, if that's the only one
you have to resolve, you only have to do the ones that attach. Right? So this is actually
fairly efficient.
So then we can use these context-free resolutions to estimate this. Now, this is the only
one that's going to be useful here because the rest of them don't have e-mail addresses,
and that's what we're trying to attach.
But then you could do that for all 1.3 million of them -- you can see the compute cycles
going up here -- and resolve all those using context-free resolution in the first stage to
find these, and then you could get a better estimate of this. And you could iterate, you
know, until fully cooked, right?
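A rough sketch of the iteration, with the scoring functions stubbed out: pass zero resolves every mention context-free, and each later pass re-scores a mention using the resolutions of the mentions found in its context. All of the scores and addresses below are hypothetical:

```python
def resolve_iteratively(mentions, candidates, context_of,
                        context_free_score, context_score, iterations=2):
    """mentions: list of mention ids; candidates[m]: addresses m could refer to;
    context_of[m]: the other mentions found in m's context; the two scoring
    callables stand in for the Bayes-rule machinery sketched earlier."""
    # Pass 0: context-free resolution for every mention.
    resolved = {m: max(candidates[m], key=lambda c: context_free_score(m, c))
                for m in mentions}
    for _ in range(iterations):
        updated = {}
        for m in mentions:
            ctx = [resolved[o] for o in context_of.get(m, [])]
            updated[m] = max(candidates[m],
                             key=lambda c: context_free_score(m, c)
                                           + context_score(c, ctx))
        resolved = updated          # iterate "until fully cooked"
    return resolved

# Hypothetical scores: context evidence flips the first mention to Sheila Tweed.
cands = {"m1": ["sheila.walton@enron.com", "sheila.tweed@enron.com"],
         "m2": ["kay.mann@enron.com"]}
ctx = {"m1": ["m2"], "m2": []}
cf = lambda m, c: 0.6 if c == "sheila.walton@enron.com" else 0.5
cs = lambda c, resolved_ctx: 0.4 if (c == "sheila.tweed@enron.com"
                                     and "kay.mann@enron.com" in resolved_ctx) else 0.0
print(resolve_iteratively(["m1", "m2"], cands, ctx, cf, cs))
```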
Okay. So that's what we do. And so if you tried to do that, you would very quickly find
that the conversational expansion, so this is the threads, the local expansion, this is the
message itself, have no computational problems at all. The only computational problems
you run into are doing the topical expansion, doing the social expansion, and doing the
resolutions.
Okay. So it's convenient to do this in MapReduce. Everybody here speak MapReduce?
Okay. So all that MapReduce is doing is it's breaking things up, doing a single
processing stage, shuffling, doing a second processing stage and going back out to
storage.
And so this is how the topical context is found in MapReduce. You simply take the
postings list -- so this is like some term, GE, appears in only let's say ten documents. So
those will be on the postings list.
Now, if you work out the way in which we do the similarity in IR, those are the only documents whose similarity this term can contribute to.
So we have to do an inner product, which means we have to do multiplications and sums.
So we do the multiplications across the postings list, we shuffle, and then do the sums
back in the document space.
So this lets you get as wide parallelism as you like without having to manage all of this
shuffling. That's all the MapReduce is doing for us.
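A small in-memory imitation of that data flow, not a real Hadoop job: the map phase walks each term's postings list and emits partial products for every pair of documents sharing that term, the shuffle groups by pair, and the reduce phase sums them into inner products. The postings and weights are invented:

```python
from collections import defaultdict

def map_phase(postings):
    """For each term's postings list, emit ((doc_i, doc_j), partial_product)."""
    for term, weights in postings.items():          # weights: {doc_id: tf-idf weight}
        docs = sorted(weights)
        for i, a in enumerate(docs):
            for b in docs[i + 1:]:
                yield (a, b), weights[a] * weights[b]

def shuffle_and_reduce(pairs):
    """Group by document pair and sum the partial products (the inner product)."""
    sums = defaultdict(float)
    for key, value in pairs:
        sums[key] += value
    return dict(sums)

# Hypothetical postings with tf-idf weights; high-DF terms would be dropped
# before this step, which is the aggressive stop-wording trick described next.
postings = {"ge":     {"d1": 0.8, "d3": 0.6},
            "sheila": {"d1": 0.5, "d2": 0.9, "d3": 0.4},
            "call":   {"d1": 0.3, "d2": 0.2}}
print(shuffle_and_reduce(map_phase(postings)))
# d1-d3 accumulates the ge and sheila products; d1-d2 sheila and call; d2-d3 sheila only.
```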
Let's see. I'll give you timing data on that. So it turns out that the social context where
we have a very sparse representation, because there aren't very many e-mail addresses,
that's very fast, and that all those multiplications only have to be done in the topical
context.
We have a trick, a fairly straightforward trick, that gets that down to about 60 minutes on
a quarter million by a quarter million matrix to do the full similarities. We do a million by a million in about three hours, right, on standard IR test collections.
The trick is we just go in and aggressively stop-word, so we take out all the high DF stuff.
The high DF stuff is where all the multiplications are being done, and they're the very
smallest contributions to the results.
And so if you do that, you'll end up giving up about 2 percent by typical IR measures of
effectiveness, and you'll end up picking up something on the order of 96 percent in
efficiency. Right? So you're chopping the time down to 4 percent of the time and you're
giving up very little.
So this is -- the trick we use to do this, we do the same trick in here but it's not very
important, and the whole process takes us about three hours to run an iteration, so you
can turn the crank as many times as you like.
Now, the last thing to say about this is that you can use MapReduce to do this as well, so
in the mappers you just read in the resolutions, so you can compute the first time around
the context-free resolution, and then you just shuffle the contributions and you do -- in
the reduce stage you do the resolution over here. So we do the entire thing massively
parallel. Okay.
Okay. So the big question you should have is fine, interesting, nice trick, how does it
work. So I gave you an intrinsic evaluation before of just the identity modeling part. I'll
give you an extrinsic evaluation here.
So being IR folks what we want is a test collection so we can do repeatable evaluations.
So the documents are obviously going to be e-mails but now we need queries. So here's
how we get the query.
I've got 250,000 e-mails, close my eyes and I pick one uniformly. Now I go through and
I hand mark that e-mail for every reference to a person. So now I don't have any problem
with automated systems not doing that well as well as people can do that. We hand mark
it. We then put it back into the system and it blindly picks some reference. That's the
reference that has to be resolved.
Okay. So this just assumes that I've got a user coming along who cares about only one of
the things that we might need to resolve, and we don't know which one.
Okay. We do 600 of those. And then this is the little bit squishy part. We hire not the
next graduate student over but three undergraduates. And the three undergraduates go
through and we give them a search engine and we give them beer money and we say
would you please see if you can figure out who these people are, and we don't tell them
anything about what our system has done with those. In fact, we haven't even run our
system at this point.
Okay. And they go through and they try to find a resolution. And sometimes they can't
and sometimes they disagree and all the things you'd expect. All right.
And then we just use mean reciprocal rank to evaluate. So you get full credit if you're in
the first position at the top of the list, you get half credit if you're in second position, no
matter what the difference in the probabilities are in your estimates.
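Mean reciprocal rank as used here, in a few lines: each query scores one over the rank of the annotator's answer in the system's list, and the collection score is the mean. The ranked lists and gold answers below are toy data:

```python
def mean_reciprocal_rank(ranked_lists, answers):
    """ranked_lists[q]: system's candidates in order; answers[q]: annotator's choice.
    Full credit for rank 1, half credit for rank 2, and so on; zero if missing."""
    total = 0.0
    for q, ranking in ranked_lists.items():
        if answers[q] in ranking:
            total += 1.0 / (ranking.index(answers[q]) + 1)
    return total / len(ranked_lists)

runs = {"q1": ["sheila.walton@enron.com", "sheila.tweed@enron.com"],
        "q2": ["kay.mann@enron.com"],
        "q3": ["scott.x@enron.com", "scott.y@enron.com"]}
gold = {"q1": "sheila.tweed@enron.com",   # rank 2 -> 0.5
        "q2": "kay.mann@enron.com",       # rank 1 -> 1.0
        "q3": "someone.else@enron.com"}   # not returned -> 0.0
print(mean_reciprocal_rank(runs, gold))   # (0.5 + 1.0 + 0.0) / 3 = 0.5
```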
I'm going to skip this for a second. So this is -- oh. So anyway, it's three people. We'll
end up with a total of 584 queries. And they worked for about 60 hours. So they're
working for about six minutes apiece on average. Some taking longer, some taking
shorter.
And they couldn't resolve -- so this is first annotator. So first annotator can't resolve it in
20 percent of the cases. If you remove those 20 percent of the cases, in about 80 percent
of the cases it resolves to somebody at Enron. And in about 20 percent of the remaining
cases, taking out the unresolvable, it resolves to somebody not at Enron.
We were quite surprised by this. We had expected much more resolution to people
outside Enron because almost every e-mail address in the collection is from somebody
outside Enron. Right? There are 133,000 e-mail addresses, only 149 Enron mailboxes.
Now, there are more Enron people, but there aren't 133,000 Enron people. Right? So
most of the e-mail addresses are out of Enron, but most of the resolutions are in.
This unresolvable number is the one we use to bound how bad things can be if you make a ludicrous assertion of an association for somebody who quite clearly doesn't have an e-mail address in the collection. You can't hurt yourself by more than 20 percent. But
this 20 percent can hurt us on the iterative solution. Right? Because we go through the
first time and somebody captures that, and then that inserts noise the second time around.
So presently we don't try to classify this case. You can think of things that would classify
this case, but we just haven't built the classifier yet.
Okay. Now, when you do inter-annotator agreement, it actually comes out reasonably
well. So if the person works at Enron, then in 90 percent of the cases a second person
who is annotating it, who knows only that it has been previously annotated, they know
nothing more than that, they annotate it the same to the same person at Enron 90 percent
of the time. They don't agree that it's at Enron; they agree that it's the same person and at
Enron. But they didn't know it was an Enron person when they started. Right? So we're
just stratifying it for analysis.
If it's not at Enron, it's a harder problem for the people to do. And they get it right a little
over 60 percent of the time. And so on average we're about 80 percent correct right
across those two.
Okay. So we have a test collection you could use to measure things as long as the results
don't get up, you know, in the 80 percent range. The 80 percent range, we just can't
measure stuff above that.
Okay. So here are the results on our test collection and on previous test collections. I'll
tell you ours first. So if you look across everything, we're right 78 percent of the time.
And remember where the inter-annotator agreement is, it's at 80 percent of the time. So
we're right almost as often as we can measure.
Now, this isn't one best; this is mean reciprocal rank, so we're giving ourselves a little
more credit. You could actually get above 80 percent here.
If you look only at Enron, we're up at 82 percent. You'll remember we're at 90 percent
for inter-annotator agreement here. And if you look at nonEnron, we're at inter-annotator
agreement. Again, this is mean reciprocal rank, though. Right? But we're up here right
in the range of where inter-annotator agreement is.
So we're doing remarkably well. And this is all first iteration. Right? So a context-free
resolution and then use the context-free resolution in one pass to try to guess this.
If you look at the other collections that have been built, these are the two collections that
were built by Minkov at CMU, so we call them the M collections. One is a -- the
mailbox of a fellow named Sager and another is a mailbox of somebody named Shapiro.
And these are very easy problems because all the resolutions occur inside the mailbox.
So we know a very small set of messages that will contain the resolution, and so we only
have a very small number of candidates to resolve to. And we do exceedingly well on
these.
Galileo Nomantha [phonetic] at the University of Maryland built another test set where
he was interested in the question of if I have many people who are sending e-mails to
each other, so not our problem where everything in the e-mail collection, but just the
Enron people sending e-mails to Enron people, then how well can we do.
And so he built a collection that had 54,000 e-mails. This is the N-subset, had 78 queries,
had about a hundred-way ambiguity. And we get very good performance on that.
And his performance was about the same, but he reported results on 54 queries, and we
don't know which 54. So population means we're very close. We're a little bit above, but
that's kind of meaningless in this case.
We threw in all the data from his -- or from the Enron collection just to noise it up, so all
the answers are in here. But now the system's got more ambiguity to deal with. And we
use this as our training collection. So when we test on training, so we use this for all our
parameter settings.
When we test on training, we're at 93 percent. And then when we run on our collection
we see it's actually not quite as good when we're out here randomly sampling out in the
collection.
Okay. So that's the answer. Now, I find this slide very disappointing, I must tell you.
We did, as you would expect, an ablation study to find out whether the social context was
better than the topical context, you know, which if I was only going to do one, where
should I spend my time tuning.
And it turns out the social context is the winner for Enron and for nonEnron. And as an
IR guy, that's just not the answer you want, right? But when you start working with
conversational media, then, you know, maybe you have to change the way you think.
So the social context is very useful. Now, note the bottom of the graph is not at zero, it's
at 40 percent on mean reciprocal rank. But look what happens in nonEnron. So I get
very little from adding additional context over here. But we get quite a lot. So we're
going from below 50 to above 60 in nonEnron by adding in additional context.
So what's happening over here is I have very weak evidence in all the contexts, and so
I'm picking up more by adding additional context. Over here I don't really have to add
additional context in order to do well. It's pretty much what you'd expect, right? I have a
better shot at picking up things where I'm not already in a dense part of the space.
Okay. So then the question is what happens when you iterate. And so this is just looking
at the social context and the topical context separately. And you can see that you're
getting very small improvements in the first iteration in both cases. So these are really
blown up here. And then it's not helping you, right? So not being able to classify the things that are in that 20 percent is just letting noise in.
Now, I don't have this one broken out yet. This just came in yesterday. I do have some
older data. So this was done with two problems. So our name recognizer wasn't very
good here. So it turns out that we originally grepped for names. So everything in every
identity model we just took as a name and grepped for it.
And our bad luck, there's somebody at Enron named May. So, anyway, this was a really
bad idea.
The second thing is that we didn't have the mailing -- we had a mailing list detector and
we didn't have it turned on when we did this. And so the mailing list soaks up
everybody's name that works at Enron. So there's this somewhat famous mailing list at
Enron that's got a lot of people on it, and it's just every name attached to it.
And so this makes things hard for us, both of those things, or we make them hard for
ourselves. But still we see an improvement in the first iteration of the social context and
not in the topical context, but only in nonEnron. So again in the case where we're weak.
So what I want to do is I want to go back and I want to look at this and break out what's
going on at Enron/nonEnron and see if we start to see that we're getting a large
improvement in the nonEnron, maybe even losing something in the Enron.
Okay. So what we learned. Well, it's too easy a problem. So WhoDat, yeah, you could
tune a hundred different things, but if I want to actually run WhoDat, I'd actually run this.
These are numbers I could use.
And the next thing to do is build WhatsDat, right? So I want to know what's the GE
conference call. If I'm going to try to do WhatsDat, I can't use this evaluation paradigm
because I don't have something that I can build a classification scheme around, like those
e-mail addresses, so I have to do something that's clustering. And so we need a test
collection of a different design.
And so we've been giving some thought to what to do for evaluation of WhatsDat, and
once you have that, we have all the mechanisms for finding the contexts, and so it's pretty
straightforward for what to do there.
The other thing that you could imagine doing after you'd done that is if you had a good
WhoDat, then you could use WhoDat to improve your social contexts, which might be
then used in WhatsDat, to give you a better idea of the topical contexts. And so there
may be a virtuous cycle here that you could go around once you got everything tuned up
reasonably well.
Okay. So the bottom line is that if you want to run this, the data's available at Enron
now. And if you would -- if you're a social network person you might start thinking
about social networks in a somewhat more complex way. So instead of just reading
things out of the header, you could look into the content and have them in essentially the
same vocabulary that the things in the header were in, and now you can start to build typed links that are more than sent from, sent to, received by. And there may be something
here.
I'll give you a quick example of this that was fun. We did this on National Security
Council e-mail. And we built a system that would build quick bios, so what does this person say, what is said to them. And we added to it what do people say about this person.
So there are only 500 messages in this collection. It's just a little toy collection. But
it's -- they're interesting people, they were in the middle of the Iran-Contra affair, they
almost got Reagan impeached. You know, it's sort of high interest to the historians.
And in our bio of George Shultz there's this line that says: George is an idiot. And we're
like, well, was this in something sent by George Shultz, something received by George
Shultz, or something said about George Shultz. And I'll leave you to guess. So that's the
end of the talk.
>> Sue Dumais: Thank you.
[applause]
>> Sue Dumais: We have time for questions.
>>: Have you done a failure analysis to know what other features you might want? Two
that come to mind are -- you mentioned in passing that you are restricting the topical stuff
to things that had happened in the recent -- in a more or less same time frame --
>> Douglas Oard: Topical and social both get restricted that way.
>>: Okay. So presumably if you don't do that it's a real mess.
>> Douglas Oard: Um --
>>: And another kind of --
>> Douglas Oard: Yeah, we didn't tune that very well. And so you could imagine
throwing that in -- what you really want is a discriminative framework here, and you
want to throw time in as a feature, right? And so we're sort of hacking that because we're
in a generative model and so we have to kind of like hide that in our model someplace.
>>: And the other is what external resources do you think would help this? So if you
were running this like in a real enterprise where you had things like address books and
calendars and --
>>: Or charts.
>>: And what?
>>: Org charts.
>>: Yeah, yeah, exactly.
>> Douglas Oard: Yeah.
>>: Those change over time too, which is interesting. So org charts change a lot,
actually.
>> Douglas Oard: Yeah. So we've thought a lot about -- so we have org charts available
for Enron. And we thought about doing that. We don't have calendars available for
Enron. We also have all the trades made by all the traders, so there's a subset of the
collection that are people who we know are traders and we know what their trading
activity is like. And so in cases where that's your focus, you might get some benefits out
of that.
We also have all the phone calls that are made between people -- actually, we have some
phone calls that are made between some of the people in the collection, and we have
some funny cases where people will say things like I can't put this in an e-mail, give me a
call.
And then we have the call, you know, and it's like, well, that's, you know, [laughter]
really -- we're not thinking too clearly here now, are we. And then we also have cases
during a phone call where somebody will say I can't talk to you about that on this line;
call my cell phone. And we don't have that.
So there are some people -- this is sort of like survival of the fittest, right? Yeah, right.
So --
>>: [inaudible]
>> Douglas Oard: Yeah. So, anyhow, so we have a whole bunch of things like this that
we could put around that, and we made a conscious decision at the beginning of this not
to mess up the central question with all those extra features. But it'd be very nice to go
back now and ask can we pick up something with those features --
>>: Or maybe does your failure analysis lead you to believe what resources or what new
features might be --
>> Douglas Oard: We don't have anything yet that comes from looking at the question
that way. We probably should. We did go back and look at why we were having trouble
in the iterative solution, and it pointed very clearly at the two things I showed you. So we
have done a little bit of failure analysis, but we haven't done -- we're very close to the
inter-annotator agreement, and so I'm worried about trying to tune too much further with
this evaluation framework.
We actually need to change evaluation frameworks if we're going to see much more I
think. The evaluation framework we have in mind is that we have e-mail from one of our
faculty members at Maryland who's kept it for 20 years and who's allowed us to publish
on it. And so --
>>: [inaudible]
>> Douglas Oard: What's that?
>>: [inaudible]
>> Douglas Oard: And he has it labeled in certain ways that we could ignore, but he's got
it sorted into folders. And he is very interested in working with us, so he could be an
independent source.
One of the critiques of the approach that we used to the human annotation is that the
humans are essentially looking at the same features that the machines have. And so it
might be that real people could do something -- you know, there are more answers out
there to be found that we can't find. So it may look to us like we're doing better than we
are.
We also have several other e-mail collections, some of which we could conceivably get
people to annotate. So we have a collection of corporate e-mail that is available for
research use but it's available under the restrictions that Census data is under. So
essentially what we do is we frisk you on the way out and we make sure you're not
walking out with anything in your socks. And, you know, pre-pub review stuff.
But you can ship algorithms into that data and work on it. And the reason it's sensitive is
because all those people are still around. But that was data that wasn't made public. It
was deposited at the Library of Congress under a court order, from a bankruptcy court.
So the actual way to get collections like this is to buy bankrupt companies. Because the
companies own the data.
Other questions? Okay.
>>: Do public distribution lists look like this at all when you use them?
>> Douglas Oard: No. So we actually ran this on Usenet. So someplace way back in the
beginning. This is not the right way to get there.
So the Human Language Technology Center of Excellence --
>>: We can help you with that.
>> Douglas Oard: Yeah. Okay. The Human Language Technology Center of
Excellence supported this work. And so they're at Johns Hopkins. And they were
participating in a cross-document coreference task for ACE last year. And that
cross-document coreference task, like many other NLP bake-offs these days, is trying to
get more diverse data.
So instead of, you know, train on this and then test on something that looks exactly like
that, it's, you know, train on anything you can find and now test on, you know, we'll
make a sample of the messiness of the world for you.
So one of the things they threw in was Usenet news. And so we said, well, okay, fine, the
two addresses are the newsgroup and the From address, and let's just turn the crank and
see how this thing works.
And they gave us 10,000 documents, which drove the NLP people nuts because they
were like 10,000 documents, what are we going to do? They've got all these -- you
know, so they were just at this for a long time trying to figure out how to do that at scale.
And we had all the MapReduce stuff running, and so we were like 10,000, you know,
three minutes later. And they were like, okay, but could you please remove all the stuff
we filtered. And it was like, okay, it's five hours to filter the stuff that should have been
filtered.
The upshot of that was that this failed completely in every case. And it failed completely
in every case because there were only 15 Usenet messages in the test collection. And all
of them were Usenet messages in which someone had quoted a news article.
So we completely lost the social structure. Everything looked exactly like news. And so
we learned from that something about building test collections. But we didn't learn from
that anything about what this would look like in public speech.
I actually don't think you need something like this nearly as much in public speech. So
on something like a newsgroup. Because the people who write for a newsgroup are
expecting many people to have to resolve it, so they're not writing with someone else's
negotiated terminology in mind; they've got some broader view so that the feature that
we're trying to leverage is more diffuse and the need is less.
So I don't think it's the low-hanging fruit part of the problem. But it was an interesting
place to look at it and we were very surprised by what the problem turned out to be. So it
taught us a little something about evaluation.
>>: That is the interesting thing, you know, from the NLP perspective: how do people
choose to mention the people that they -- [inaudible] has done some interesting work on
news articles, that there is a pattern to it, you know -- with news articles you
first kind of get the most expansive reference and then you kind of pare down, pare
down, and maybe if you've drifted off the topic, then you need to bring that
person back with a more expansive reference, but that's kind of [inaudible] for that
one single document. But there are patterns.
>> Douglas Oard: Right. And they are --
>>: And that might be more appropriate for the Usenet groups.
>> Douglas Oard: Yeah. And there are patterns that you see with audiences, right? So if
you're reading something in the U.K., you'll often see somebody just referred to by name
that in the U.S. their role is given. But everybody knows that's the prime minister, right?
Except the people in the U.S. who, you know, didn't keep up on like the elections in the
U.K. for the last ten years might not know that's the prime minister. So in the U.S. you
have to tell -- you have to name the role more.
So there are things like this that you can see. But the conversation -- so almost all of the
information content in the world, if you just count words, is spoken. Right? So roughly
the -- I don't know how large Bing's index is, but I did this for Google a few years ago,
and the number of words spoken or heard by people who have cell phones in their pocket
in a day is roughly equal to the number of words that are indexed by Google.
So people -- if you take a Google as a -- as a unit, people speak a Google a day, right?
Now, you'd say to yourself, well, okay, fine, but most of that's trash. It's like, well, okay,
fine, but isn't that what the Web is, right? Isn't that where search is king?
>>: [inaudible]
>> Douglas Oard: What's that?
>>: [inaudible]
>> Douglas Oard: [inaudible] yeah. So most of the content to be found in the world,
there are all sorts of social structures around this, of course, but most of the content that
anybody might ever want to find in the world is being spoken. And so it's mostly
conversational. It's mostly dyadic. And so understanding what to do there strikes me as
being sort of the driving application.
And then you can think about, okay, and how would I take these techniques and move
them out to news stories or to speeches or to classroom lectures. Or, you know, all those
things have their own sort of genre-specific conventions that you might leverage off of.
But conversational speech is the thing we know the least about what to do in the
information access world and it's where our tools probably could be used to the greatest
effect.
>>: [inaudible] had all of his life [inaudible].
>>: He has a lot of it digitized.
>>: Digitized. So presumably his conversations [inaudible].
>> Douglas Oard: Yeah. So this is actually the sort of thing, this life-logging idea,
would be a very natural place for this, because essentially the first insight in this work
was that people must have negotiated what this term meant before they used it. Right?
So you could just -- if you were looking at something and you didn't know what it meant,
you could look back to the beginning where it was first set up, right? So that was our
initial thinking about this.
And if you do that, the first thing you realize is, yes, true, but how often do I negotiate
meaning outside the observables and then use it inside. And so we didn't actually
build off of that insight. But that's where we started this work.
If I had life-logging, then that's actually what I'd want to do. I'd just want to roll back to
the first interaction among these people and roll that interaction forward and learn what
words mean, and I'd probably want to pay the most attention to the things that happen
very early on and the things that happen most recently. Right? So I'd actually start
sticking some bias on that a priori.
And so that -- that'd be fascinating data to look at, but I'd take a different approach than I
took here if I were going to work with that. Because we have at that point more in the
way of expectation of the structure of the interaction.
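A minimal sketch of that kind of prior, assuming timestamped interactions measured in days; the decay constant is an arbitrary illustrative choice, not a tuned value:

    import math

    def interaction_weight(t, t_first, t_last, tau=30.0):
        # Weight interactions near the very beginning (where terms get negotiated)
        # and near the present most heavily; tau controls how quickly the
        # emphasis falls off in between.
        early = math.exp(-(t - t_first) / tau)
        recent = math.exp(-(t_last - t) / tau)
        return early + recent

    # weights = [interaction_weight(t, min(times), max(times)) for t in times]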
>>: How much of negotiation of what terms mean is never that explicit because people
make assumptions, you know, based on what they know about each other's background.
So, for example, in your talk today, you use the terms generative model and
discriminative model. You didn't define that anywhere, but you assumed given your
audience that we would know what that meant.
>> Douglas Oard: So I would agree with you that you would have to do -- you would
also have to look into context, right? I might stick to my original position and say that I
haven't negotiated this with you but I've negotiated it with people like you. Because I've
been to ACL and I have some idea of how people talk there. And I know where I picked
up this concept, right, and I know who it was that first skewered me and said why didn't
you build a discriminative model. And so I can sort of -- I can build a similarity function
among people that way.
So I could still do what I was trying to do, but then I could do some kind of a spreading
activation thing that would move people -- that would move characteristics of prior
knowledge around a social network. So someplace in between those two.
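A rough sketch of that spreading-activation idea, assuming the social network is a dict mapping each person to the set of people they correspond with; the decay factor and number of rounds are illustrative assumptions:

    def spread_activation(graph, seeds, decay=0.5, rounds=3):
        # seeds: people with whom the concept is known to have been negotiated.
        # Each round, a decayed share of a person's activation reaches their neighbors.
        activation = {person: 0.0 for person in graph}
        for s in seeds:
            activation[s] = 1.0
        for _ in range(rounds):
            updated = dict(activation)
            for person, neighbors in graph.items():
                for n in neighbors:
                    updated[n] = max(updated.get(n, 0.0), activation[person] * decay)
            activation = updated
        return activation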
There are some things that I said today that you probably had to get just from
context, right, that I actually, you know, through oversight on my part or not enough time
or not what I was emphasizing, there are probably some things that if you wanted in this
you had to put them together. So I think both are active.
>> Sue Dumais: Let's thank Doug again.
[applause]