>> Bill Dolan: So I'm really pleased today to welcome Ellen Voorhees, who is visiting us from NIST, where she's the manager of the text retrieval group -- just the retrieval group -- which encompasses a range of technologies. I'm sure you're familiar with many of these evaluations, like TREC, a newer one, TRECVID, and a newly forming set of evaluations called TAC that I'm sure Ellen will be telling us all about. Ellen's background is a Ph.D. in information retrieval from Cornell working with Gerard Salton. She joined NIST after ten years at Siemens, where she did a lot of interesting work that I remember really well with WordNet and information retrieval back in the '90s. So I won't belabor this. Thank you, Ellen. >> Ellen Voorhees: Thank you. So thanks very much; I'm glad to be here today. I'm going to be talking about some of the language technology evaluations that we're doing at NIST. And one of the questions I frequently get -- not necessarily meanly, people just wonder -- is why on earth NIST does this at all; why should NIST care about language evaluations? So here you have up at the top of the slide the official NIST mission. Just to make sure everybody knows it, NIST is the National Institute of Standards and Technology. It's part of the federal government's Commerce Department. It's also the home of the national clock, so NIST time is from NIST. It is the home of the nation's kilogram. It is the home of the definition -- well, the kilogram is actually still an artifact, right? There are a lot of people at NIST working to make it not an artifact anymore, because all of the other standard units of measure are now defined through other processes, such as the meter being defined in terms of the wavelength of radiation from a particular element, et cetera, et cetera. Information technology hasn't really gotten to that level of standardization. But the NIST mission as a whole is to promote innovation and competitiveness by advancing measurement science, standards and technology, and we fit into this quite well: our group's mission statement is to do that promotion of innovation and industrial competitiveness, but focusing on information technology problems. One of the reasons why it makes sense for NIST to be running these sorts of evaluations -- and I'll get to what sorts in a moment -- is that we're a technology-neutral site, by which we mean we don't have an axe to grind about whether you do your retrieval or your entailment by Gaussian networks or by rules or whatever method, right; we're there to evaluate the end result, not to promote any particular method by doing that. And at this point we have a reasonable amount of expertise and experience in operationalizing these tasks and in evaluating whether the evaluation itself is actually accomplishing what we want it to accomplish. So within the retrieval group, the most well known of our evaluations is probably TREC. TREC started in 1992. It's a workshop focused on building the infrastructure for evaluating (text) retrieval systems. I have text in parentheses there because we have branched out into some other sorts of media, but basically we're still concentrating on text. The three main things that TREC has done are to create realistic test collections, a standardized way of evaluating retrieval, and a forum for exchanging those results.
And as I said, it started in 1992, so at this point we're now in our 17th TREC. We started with just two tasks -- these tasks here -- but over the years we have branched out into a variety of different tasks. I mostly have this slide up here just so you get the idea that it is not a single monolithic thing: there are these individual tracks, and each TREC has its own combination of tracks and its own combination of things that people are focusing on. If you take the whole set of different tasks that TREC has looked at over the years, we've actually looked at a fair number of different problems, and many of these tracks have had individual tasks within them as well. So we're probably looking at upwards of 100 or so individual tasks which have been evaluated over the years. So what good is that? Well, TREC itself did not come up with the way that retrieval systems are evaluated. It was built upon an already existing paradigm in the information retrieval community that was based on the idea of a test collection. We're talking here about ad hoc search -- ad hoc meaning it's sort of a one-off search, you don't have any history for this search; it's the prototypical search of, say, a Web search or a search of any large collection. And the abstraction there was to obtain a set of queries or questions, a document set, and then relevance judgments. These relevance judgments are the things that say which documents should be retrieved for which questions, okay. This was originally created by Cyril Cleverdon in the late 1960s. It's called the Cranfield paradigm because he was at the Cranfield Aeronautical College when he did this. And the whole idea was specifically to try and abstract away the problems of having individual users while trying to evaluate search. Okay? The idea behind it is that in order to do evaluation you want to have an abstract task, because it is difficult to evaluate the real end-user task given the amount of variability in there. So the idea was to abstract away to a task which is sufficiently close to a real user task to be applicable to it, but which lets you control a lot of the variability, and therefore to be able to do more controlled experiments and get more valuable results from those experiments. This sort of core task is what Karen Spärck Jones called the core competency of an evaluation task: an abstract task which has real application to a user task but has enough control that you can actually learn something from an experiment. The consequence of this abstraction is that you do gain control over lots of these variables, so you get more power for less expense, at the cost of losing some realism, because you are obviously and purposefully doing an abstraction here. Whatever is not accounted for in the abstraction is a loss of realism. So is Cranfield reliable? Lots of people will argue, and have argued for decades now, that you can't possibly evaluate retrieval systems if you're going to base it on a set of static relevance judgments, right, because we all know -- and we do know -- that, number one, when a real user uses a retrieval system their information need changes while they interact with the collection, and so it's not static.
We also know that if you give the same set of documents to two people who have putatively the same information need, they are going to consider different sets of documents to be relevant, okay? And furthermore, we now know from having a big cumulative set of results, such as the one the different TRECs have allowed us to look at, that even though Cranfield was specifically developed to try to abstract away the user effect, the user effect is still the biggest variable within the Cranfield paradigm. If you look at an ANOVA of the TREC 3 results -- and this was reported in the first issue of the journal Information Retrieval -- they took the TREC 3 results and ran an ANOVA over those results with systems as one set of variables and topics as a second set of variables, looked for interaction effects, et cetera, and found that the topic effect was larger than the system effect and that the interactions were significant but smaller, right. So basically, the topic effect -- remember, topic is TREC-ese for what question was asked -- so here you are trying to evaluate systems when the topic effect, what was asked, is bigger than the effect you're trying to measure, okay. That complicates retrieval evaluation by a whole bunch. It doesn't make it impossible, but it complicates it a lot. The other thing that people will argue about, in whether Cranfield is at all meaningful, is the measures that you use to evaluate effectiveness. There is a whole slew of different measures that you might use. trec_eval, which is the evaluation package that you can download from the TREC site, reports something like 109, I think it is, different numbers for any run that you give it. And we also know enough from looking at these evaluation measures to know that different measures really are measuring different things. That's what I mean when I say they represent different tasks. So one of the more common measures is precision at 10. Precision means how accurate the set of documents you retrieved was, so precision at 10 means how many of the top 10 documents on average were actually relevant, right? It's a very understandable measure, very intuitive. It is in fact a fairly unstable measure, because it's based only on the top 10 documents. Okay? But that's a very different sort of task being measured, if we're only looking at the top 10, than something like mean average precision, which is the measure that frequently gets reported in the literature. Mean average precision looks at the entire set of documents that are retrieved, so it's based on very many more retrieved documents, but it doesn't have nearly as intuitive an interpretation of what the measure actually means. So as I've already alluded to, these measures mean different things, but they also have different inherent stability. Mean average precision, because it's based on the entire set of documents that are retrieved, is a much more stable measure than precision at 10. So if you're evaluating yourself, how does that all interplay? Well, we have taken the TREC results and we've done a number of studies to try to show what actually is happening under these numbers that get reported.
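[Editor's sketch, not part of the talk: to make the two measures just discussed concrete, here is a minimal Python illustration of precision at 10 and average precision for a single topic, assuming a ranked list of document IDs and the set of documents the assessor judged relevant. The function names are illustrative; this is not trec_eval.]

    # Illustrative only: precision at 10 and average precision for one topic.
    # 'ranking' is a list of document IDs in rank order; 'relevant' is the set
    # of IDs the assessor judged relevant for that topic.

    def precision_at_k(ranking, relevant, k=10):
        top = ranking[:k]
        return sum(1 for doc in top if doc in relevant) / k

    def average_precision(ranking, relevant):
        # Average of the precision values at the rank of each relevant
        # document, divided by the total number of relevant documents, so
        # unretrieved relevant documents count as zero.
        if not relevant:
            return 0.0
        hits, total = 0, 0.0
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
                total += hits / rank
        return total / len(relevant)

    # Mean average precision (MAP) is just average_precision averaged over
    # the topics in the test set.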
So the first thing I want to show here -- I'm going to go through these all very quickly -- is that these averages really do hide a lot. Everything is reported on averages because we know the topic effect is so great that individual topics are not particularly meaningful when trying to evaluate systems; you want to take averages over lots of topics. But those averages hide a lot of stuff. Here the thick red line is the average recall-precision graph for a set of 50 topics from this run. And the little tiny lines are the individual topics that that is an average of, okay. It's actually only about 20, I think, of the 50, because if I put all 50 up there you can't see anything. But you can see that this is one run, it gets reported with this nice typical recall-precision graph, and this is what that graph is hiding. On some of those topics it worked really, really well -- remember, your best point is up there. Other topics were really, really bad. But all you generally see is that big red line, and actually you're sometimes lucky to see even a whole recall-precision graph, because frequently just one number gets reported, okay. So that's what is behind the numbers. The other thing we looked at was, well, okay, we know that relevance judgments do change; how does that affect the evaluation results? One of the things we did was ask different people. We have these assessors that come to NIST; they're actually retired information analysts from the intelligence community. Normally, because we know that different assessors have different opinions about what's relevant -- and we do know that -- each topic gets one relevance judge, so the judgments are at least internally consistent even though they represent one person's opinion, right. What we did on two other occasions, though, was take that same query and the same set of documents that were retrieved by the systems and give them to a second judge, and then to a third judge. And we did that for all 50 topics in the TREC 4 ad hoc test set. And since there were 50 topics and we had three judges per topic, I could create a different set of relevance judgments for the whole 50 topics by choosing one of those three judges for each of the 50 topics. So there are three to the fiftieth different possible sets of relevance judgments, right? Now, that's a whole bunch, and I didn't create all of those, but I created 100,000 of them. And I evaluated the same set of runs -- exactly the same set of runs that retrieved exactly the same documents, because it's the same run -- using each of those 100,000 different relevance judgment sets. What's plotted here, the bar, is the minimum and maximum mean average precision value obtained by that one run over the different relevance sets. And you can see there is a range. Each of these runs actually has a fairly big spread in how that one particular run would have been evaluated depending on whom we had used as assessors. But the important point here is that the scores the runs get using a given set of relevance judgments are highly correlated across whichever set of judges you use -- let me say that again better.
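[Editor's sketch, not part of the talk: an illustration of the resampling just described -- build a relevance judgment set by picking one of the three judges independently for each topic, evaluate the run under it, and track the minimum and maximum MAP over many such sets. It assumes an average_precision function like the sketch above; names and data layout are illustrative, not the actual NIST code.]

    import random

    # judgments[topic][judge] -> set of doc IDs that judge considered relevant
    # for that topic; run[topic] -> ranked list of doc IDs the system returned.

    def map_under_qrels(run, qrels):
        return sum(average_precision(run[t], qrels[t]) for t in run) / len(run)

    def min_max_map(run, judgments, samples=100_000, judges=(0, 1, 2)):
        lo, hi = 1.0, 0.0
        for _ in range(samples):
            # one judgment set: choose one judge independently for each topic
            qrels = {t: judgments[t][random.choice(judges)] for t in judgments}
            score = map_under_qrels(run, qrels)
            lo, hi = min(lo, score), max(hi, score)
        return lo, hi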
If you look at the individual scores -- so for instance the red triangle, that was the official TREC 4 set of judges -- if you use the same set of relevance judgments for each run, you will see that the runs stay basically in the same order. What this means is that we have hard judges and we have lenient judges, but all of the systems generally rank in the same order if you use the same judge for two different runs. >>: I have a question about your reliance on these judges from the intelligence community. I suppose it's good in a sense, because their whole life has been learning how to make these judgments. But it seems like you are getting a particular picture of what relevance is versus just [inaudible] off the street? >> Ellen Voorhees: It depends upon what you're trying to represent. So yes, if you have some very particular notion of relevance that you want your judgments to represent, you had better use judges that represent that notion. We used intelligence analysts because when TREC began it was specifically targeted at the intelligence community. But it is a fairly homogeneous group of assessors, because they're basically all retired information analysts. We repeated this experiment, however, at TREC 6, and there the second judges were University of Waterloo graduate students, who are pretty different from retired intelligence analysts. And the same thing holds. In fact, the variation within our group and within their group was not distinguishable from the variation between those two groups. People are just different. >>: [inaudible]. >> Ellen Voorhees: Actually the graduate students are a lot cheaper; since it was the University of Waterloo doing it and we didn't even do it, it was a lot cheaper for us. But all of this is just to say, yes, it is undoubtedly the case that people's opinions differ. We know that. But as long as all you're trying to do is compare systems -- and the test collection paradigm says that is all you're trying to do -- in fact it doesn't matter, because the comparison is stable. And I've got to go a lot faster than this if I'm going to -- Another study that we did with this set of results was to look at the inherent error rate within a measure. We looked at a set of different measures: precision at 10, precision at 30, average precision, R-precision, and recall at 1,000. If you don't know what all those measures are, don't worry about it. The main thing to see here is what happens when you allow for a bigger delta before you conclude two systems are different. Remember, all you're doing here is comparing two systems; you have a score, say a precision at 10 score, for each of the two systems. If one score is .1 and the other score is .9, you probably want to consider those two different, but if one score is .1 and the other is .11, you probably don't, right. So this fuzziness value is the percentage of the larger score that the difference has to reach before the scores are considered different. With three percent, the difference between the scores has to be at least three percent of the bigger score in order to consider the two scores different; here it's five percent, et cetera. The bigger the difference that you require, obviously the smaller your error rate is going to be.
And here the error rate is computed over multiple comparisons: when one comparison concludes that system A is better than system B, how often does another version of that A-versus-B comparison say that B is better than A? We did this using the same set of topics but different queries, data which we got from the TREC query track. I don't want to go into that detail -- I know that wasn't very clear, but if you want lots of details about it, ask me later. Right now all I want to say is that different measures have different inherent error rates. There are some measures which are inherently less stable than others. That's not really a big surprise; it's just that this shows it. And the reason they have inherently different error rates is largely just related to how much information is encoded in the measure. Precision at a cutoff, which is a very common measure and which people like a lot, and for good reasons, happens to be a very unstable measure, because the only thing the value of that measure depends on is the number of relevant documents at that cutoff. It doesn't matter how they're ranked above or below it; in order to change your retrieval system and change that score, the only thing that matters is whether another relevant document entered that set or left it. So it's a very impoverished measure. It's a very user-friendly measure, but it's a very impoverished measure, and precision at a cutoff is one of the less stable measures. >>: [inaudible]. >> Ellen Voorhees: Yes? >>: I understand this and I think it's great to look at these measures, but there are cases where what you really care about is precision at 10 or precision at 1 or something like that. >> Ellen Voorhees: Right. >>: Do you know how this changes, how the inherent error rate changes, if you add more queries? If you have more queries, shouldn't that reduce your error? >> Ellen Voorhees: Right. The error rates go down; your stability increases. >>: [inaudible] looks like it. >> Ellen Voorhees: It's funny you should ask. The next thing we looked at was how error rates depend upon topic set size. And the big thing to see here is that the more topics you have, the more stable your evaluation becomes, where once again stability means that you are consistently saying the same system is better in your comparison. Okay? Again, these different lines depend upon the difference in score that you're using to decide whether or not systems are actually different. This particular graph is just showing mean average precision, but similar sorts of graphs, except starting at higher error rates, hold for precision and for other measures. >>: [inaudible] error rate over the range of MAP values, say? So maybe MAP is more sensitive at the higher range of values than at the lower range? Something like [inaudible] certainly is. >> Ellen Voorhees: Right. We have not looked -- or at least I have not looked -- specifically at the absolute value of the measure and how sensitive it is at that absolute value, except to say that obviously if you're requiring some given distance between the two measures and the result is very poor, you're not even going to get to that level.
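[Editor's sketch, not part of the talk: one plausible, simplified reading of the swap-based error rate just described -- compare the same pair of systems on several different topic/query sets, call a winner only when the difference exceeds a fuzziness fraction of the larger score, and count how often the conclusion flips. The real studies aggregate over many system pairs; names and details here are illustrative assumptions.]

    # scores_a, scores_b: mean scores (e.g., MAP) for systems A and B on each
    # of several different topic or query sets; 'fuzziness' is the fraction of
    # the larger score the difference must exceed to count as a real difference.

    def verdict(a, b, fuzziness):
        if abs(a - b) <= fuzziness * max(a, b):
            return 0          # too close to call
        return 1 if a > b else -1

    def swap_rate(scores_a, scores_b, fuzziness=0.05):
        verdicts = [verdict(a, b, fuzziness)
                    for a, b in zip(scores_a, scores_b)]
        decided = [v for v in verdicts if v != 0]
        if len(decided) < 2:
            return 0.0
        # a swap: one comparison says A > B while another says B > A
        swaps = sum(1 for v in decided if v != decided[0])
        return swaps / len(decided)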
So very poor, very low values of your measure -- where higher is better -- are actually quite stable, but that's because you have no room to be unstable. Once you get to zero, you're zero. So yes, it will be somewhat more unstable at larger values, but at least part of the explanation is just that you have more room to fluctuate. All right. All of this was to say that we have spent a fair amount of time looking at the Cranfield methodology, to see what you can and can't learn from it. And I, being a TREC person and a big fan of the Cranfield paradigm, come to the conclusion that Cranfield is both reliable and useful, right. It's reliable within certain parameters -- it will not answer every question that you have, but for the purpose of comparing retrieval system behavior it is quite reliable. And it's useful in the sense of what this graph is showing you, which is the improvement in ad hoc retrieval over the first eight years of TREC. This is for one system. This is just the SMART system, which was from Cornell and Sabir Research. What they did is they kept a frozen copy of their system each year, right, because you have a new test set each year, and the test set itself might be harder or easier, so you can't really compare scores from year to year -- that's not a fair comparison. But the SMART group kept a copy of their system for each year and ran each new test set with each system. And basically what these graphs are showing is that retrieval performance roughly doubled between TREC 2 and TREC 6. Now, while this is only one system, the SMART system was among the top systems in all those years, so it's representative of the field as a whole. It also says that ad hoc performance has leveled off since then. Why has it leveled off? Partly because TREC went into all these other different tasks, so people were concentrating on other tasks, and partly because it's not clear where the next big improvement in retrieval systems is going to come from. Now, I wasn't going to do this all just on retrieval. I wanted to use that whole part as sort of a lead-in to what type of evaluation we want for other types of tasks, and other language tasks in particular. So, using all that as background, what makes a good evaluation task? Well, you want it to be an abstraction of some real task. And that's actually not as easy as it sounds. In particular, the entailment task, which I know some of you are involved with here, is actually, I would claim, not an abstraction of a real task. There is not going to be an actual user task in which you ask someone whether this sentence is entailed by that sentence. And that does affect the way that you can build the evaluation, because now when you have questions about, well, should we be weighting this part more heavily than that part or whatever, you don't have the real task to go back to and say we should do it because of this; you just have to say we should do it this way because we think so. All right. But anyway, you want the task to be an abstraction of a real task so that you can control the variables. You do need to control the user effect here. But of course it has to capture the salient aspects of that user task, or else your whole exercise is pointless.
You want the metrics to accurately reflect the predictive value of the thing, right; you don't want your metric to give really high scores to the system which is actually the worst. You want your metrics to reflect reality. And then you get into some more meta-level things, such as having an adequate level of difficulty. If you make the task too easy, everybody does well and you learn nothing -- people are happier, but you don't learn anything. If you make it too difficult, you again don't learn anything, because everybody does abysmally, but now people are angry. So if you're going to err one way or the other, you might as well make it too easy, but still you want to try to hit a sweet spot where the systems can do it but it's not a slam dunk. You'd really like your measures to be diagnostic, by which I mean that if you have a measure that says you're at 50 percent of the maximum value of that measure, it would be good if that measure actually showed you where the problem is -- you, in this case the system builder, would like to know what the problem is. And it's also best if the infrastructure that is needed to do the evaluation is reusable: best in the sense that it helps the community more, because more will be available, but it also makes the costs bearable if you can reuse it. Now, there isn't any evaluation task that I know of that meets all of these in all aspects, but those are the types of things that you want to shoot for. So as Bill mentioned, one of the things that we're starting at NIST this year is the Text Analysis Conference. Basically the idea is that we're trying to create an evaluation conference series that would be for the natural language processing community what TREC has been for the retrieval community. It's the successor to the Document Understanding Conference, DUC, which has been going on for several years now. DUC was focused specifically on summarization, and summarization is a track in this inaugural year of TAC, but we've also taken the question answering track from TREC and put it into TAC, largely because the approaches to question answering these days are much more natural language processing centric than they are retrieval centric. Ken, you look unhappy. >>: [inaudible] just thinking of the bad jokes about ducking and tacking. >> Ellen Voorhees: Yeah, well, it was an incredible fight to get people to agree on the name of this conference. [laughter] Ask Lucy sometime; Lucy was involved in this argument. >>: [inaudible]. >> Ellen Voorhees: Okay. [laughter]. >>: [inaudible]. >> Ellen Voorhees: Right. And so we're keeping TAC. And the third part of TAC this year is going to be a textual entailment task, with both a two-way and a three-way decision, and I'll get into what that means in a minute. So what I wanted to go through now is a more detailed explanation of how we go about defining a new evaluation task when we're starting from scratch. I said that in retrieval we didn't start from scratch -- Cranfield was already well established by the time TREC started. But then we came up with this idea that we wanted to test question answering. And that was different from retrieval, right, so how did we go about defining that evaluation?
Well, one of the things we started with was saying we wanted to start with factoid question answering, so these were little text snippets as opposed to "What's the best way to fix my washing machine?" No, no, no -- we wanted a little short answer based on facts, right. So how many calories are in a Big Mac, who invented the paperclip -- these sorts of little factoid questions. And we had the systems return ranked lists of document IDs and answer strings, where the idea was that the answer string was extracted from that document and contained an answer to the factoid question. We started with factoids because of my previous slide, where I said we wanted tasks that were of adequate difficulty. When we started this task we had no idea how well systems would be able to do, but we had a pretty good belief that answering factoid questions was going to be a heck of a lot easier than answering process-type or, you know, essay questions. We also had a reasonable idea of how we could go about evaluating factoid questions, because we were expecting a single answer, and it was either that answer or it wasn't, and we could compute scores from that. And we did that for several years. It worked reasonably well, and the systems got better. And then we hit TREC 2001 -- the factoid question answering track started in 1999, so three years later -- and we looked at the test set, which we had extracted from logs, one of which was donated by Microsoft, so these were Web search logs. We took the questions from those, and about a quarter of the test set, while they could be answered in little short phrases, weren't really these factoid-type questions; they were much more definition questions. So, what is an atom, what's an invertebrate, who is Colin Powell. These types of questions were hard for the systems to do, but they were also hard for the assessors to judge, because what's an answer to this question? Right? >>: [inaudible]. >> Ellen Voorhees: And it depends critically upon who the user is, or at least what the use case is, right. Is the person who is asking what an atom is a second grader writing a book report, or is it an atomic physicist? Probably we aren't going to get an atomic physicist asking our system what an atom is, but you get the idea. And there's also the problem that looking through a huge corpus of newspaper and newswire documents probably isn't the best way of answering these definition questions, right. >>: [inaudible] something related to [inaudible] seem to be [inaudible]. >> Ellen Voorhees: Well, remember we're at 2001 here, so Wikipedia wasn't as popular then as it is now. But these days, yes, people do go to Wikipedia to answer these questions, and we can get to that. But yes, you've hit on the right idea: newspapers aren't the place to go looking for answers to this type of thing. So what we did is create a new task which was specifically focused on definition questions, where definition questions here means a "what is X?" or "who is X?" type of question, with X an object, a person, or an organization. And the problem with trying to evaluate that is we now have to define what a good answer is. And there are different parts to what makes a good answer.
One part will depend upon what you're assuming your user base is, but there's also the question of how we're actually going to evaluate it in terms of giving scores. What we really wanted to do is reward the systems that retrieve all of the concepts that they should have retrieved and penalize them for retrieving concepts that they shouldn't have retrieved -- but what's a concept? Right? In particular, concepts can be expressed in very many ways, and so we're back to all the problems that we have when we're trying to evaluate natural language processing systems: there's no one-to-one correspondence between the things that we want and the ways they might have been expressed, and now we're going to have to deal with that. In particular, we decided that a good way of evaluating these definition questions would be by recall and precision of these concepts, but different questions had very different sizes of what should have been retrieved, right. So a very, very specific question -- one of the ones we had was about some sort of circuit used in laser printers; that was pretty much all there was to it, it was a circuit used in laser printers -- versus who is Colin Powell, right. I mean, what do you return for Colin Powell? So very much more would get returned for Colin Powell than for this circuit. We had a couple of pilot evaluations of how we were going to evaluate definition questions, and this was all part of the work that we were doing with the AQUAINT program. AQUAINT was a question answering program sponsored by what was at the time ARDA, which was an intelligence research funding group; they are now IARPA. The idea of the first evaluation we tried was that we were just going to say, okay, systems are going to retrieve snippets out of these documents, we're going to collect all these snippets, we're going to give the whole bag of snippets to our assessor, and the assessor is just going to say, on a scale of zero to 10, I think this is a five as an answer -- where zero was completely worthless and 10 was the best possible answer they could imagine. Actually they were going to give two scores, one for content and the other for presentation -- and those of you who are summarizers will know that presentation is a big thing here, too. And we decided, okay, we'll use a linear combination of the two scores and that will be the final score, and that will be it. And by and large that didn't work. It had a couple of good things. It does give you an intuitive feeling for what a human thinks of the set of facts that gets returned, and it's not really possible to game that score. On the other hand, each assessor has their own sort of baseline for the scores they assign, and the differences in the scores among the assessors completely swamped any differences in the systems -- just no contest, there was no signal there, right, it was all noise. And this score doesn't help the system builders at all. Right, the fact that I think this summary is a seven -- now you go change your system, right? I mean, what are you going to do? So we tried another thing. We tried what we called the nugget-based evaluation.
Here we had the assessor actually go through the corpus and the responses that got returned by all the systems and create a list of what I called nuggets. These are concepts that the definition should have contained. And we asked the assessor to make these atomic, in the sense that they could make a binary decision about whether or not that nugget appeared in a system response. That was our definition of atomic. Not only were they to make up this list of things that should appear, but they were also to go and mark the ones that absolutely had to appear in the definition for that definition to be considered good. Those were the vital ones. Then there were concepts that could appear -- it was okay, the assessor wasn't offended that the fact was there in the definition, but it didn't have to be there. And those facts which, while true, they didn't want to see reported should never even have been made nuggets, okay; those don't make the nugget list at all. So nuggets are only those things that the assessor either definitely wants to see or is okay with seeing in a definition of this target. Then, once they created that list, they went back to the system responses and marked where in the system responses these nuggets occurred, if they did, at most once. It could be that an individual item returned by a system had one or more or zero concepts in it, and maybe none of the nuggets were actually in it. Here's an example. One of the questions was "What is a golden parachute?" The assessor created the set of nuggets shown here on the left, those six things, and the ones in red, the first three, are the ones that this particular assessor thought had to be in the definition. One particular response from a system is shown on the right, and those are the nuggets that the assessor assigned -- so that means the assessor saw in response item A from the system both the fact that it provides remuneration to executives and that it's usually generous. Nugget 6 was in item D. Item F didn't contain anything, or else contained only duplicates; we didn't distinguish between those. Now, from that set of things we can compute recall -- we can compute how many of the nuggets the system should have retrieved that it actually did retrieve -- but we still don't have any precision, because we don't know how many concepts were actually contained in the system response. Consider "the big payment that [inaudible] received in January was intended as a golden parachute." How many nuggets of information are contained in that? Who knows? We actually did ask a couple of assessors to try to enumerate them all, and we tried it ourselves -- where ourselves means the NIST staff, so I even tried one of these. It's impossible. You cannot possibly say how many nuggets of information are in a system response. So what are we going to do about precision? We punted. We took a lesson from the summarization evaluation, which was going on at the time, and we just used length -- the idea being that the shorter the response that contains the same number of nuggets, the better, okay. So systems got some basic allotment of space, and then their scores started going down once they exceeded that length.
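[Editor's sketch, not part of the talk: a rough illustration of the flavor of scoring just described -- nugget recall, a length allowance standing in for precision, and an F measure that favors recall. The allowance per matched nugget and the beta value are illustrative assumptions, not figures quoted in the talk.]

    # Rough illustration of length-based nugget scoring for one response.
    # 'vital' and 'okay' are sets of nugget IDs from the assessor's list;
    # 'matched' is the set of nugget IDs the assessor found in the response;
    # 'length' is the character length of the response.

    def nugget_f(vital, okay, matched, length,
                 chars_per_nugget=100, beta=5.0):
        recall = (len(matched & vital) / len(vital)) if vital else 0.0
        # basic allotment of space grows with each matched nugget
        allowance = chars_per_nugget * len(matched & (vital | okay))
        if length <= allowance:
            precision = 1.0
        else:
            precision = 1.0 - (length - allowance) / length
        if precision + recall == 0:
            return 0.0
        b2 = beta * beta          # beta > 1 weights recall more heavily
        return (1 + b2) * precision * recall / (b2 * precision + recall)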
Now, once we had that definition, we wanted to do some of the same sort of evaluation of the evaluation that we had done with the retrieval stuff, and we wanted to look at a couple of things. Differences in scores are going to come about from a variety of things, one of which is mistakes by assessors. Now, when I report scores to participants in our evaluations, they're always horrified at the idea that there might be mistakes. And my general response, and my response now, will be: get over it, right? Assessors are humans; they're going to make mistakes. Deal. We try to make sure that there are as few mistakes as possible, and we try to set things up so that assessors will make fewer mistakes, but there are going to be mistakes, and there were mistakes. In this particular case we could actually measure the mistakes, because there were some things which were duplicated -- some systems actually submitted the exact same thing more than once, and we didn't have any mechanism for catching that -- and they were judged differently on occasion. That's not a difference of opinion; that's a mistake. But there were also differences of opinion. There were differences of opinion even on factoids, so there are certainly going to be differences of opinion here. And there's the sample of questions: some systems might just be able to do one particular type of question more easily than another. For a couple of these we did go ahead and have the questions independently judged by two assessors. And what we saw was roughly the same thing -- no, actually it wasn't really the same thing as for retrieval. Here the scores were highly correlated again, but in this particular evaluation we got swaps, where one assessor's judgments would say one system was better than the other, and if you changed the assessor it would be the other way around, with fairly large differences in scores. And the difference in score that you needed in order to be confident you weren't going to get these swaps was bigger than the differences in the scores that the systems were producing. That's obviously a problem. Here again we also did the same sort of thing of looking at the number of questions that you would need in order to drive that error rate down below five percent, and it ended up, as I said, bigger than the differences the systems were producing, so the only thing we knew to do was to increase the number of questions being asked. And so -- I've labeled this "try two prime" -- what we then did, starting in TREC 2004, was basically make the entire test set for question answering definition questions. It was structured such that there was a set of targets, where the target was the thing you were trying to define, and then there was a series of questions about that target. So we actually had multiple individual questions all about the same target. And this allowed us to do a couple of things. One, it allowed us to have 75 rather than 50 definition questions, because the entire thing was now focused on definition questions.
But it also allowed us to have a little bit of context -- only a little bit, but a little bit of context within these series -- because within a series we started to use pronouns and the like to refer to the target and to previous answers. That maybe doesn't sound like a big thing, but it was a really big deal to the question answering systems at the time, because now they had to deal with that sort of [inaudible] and most of them didn't deal with it well, at least the first couple of times. Within these series the systems were told what type each question was -- this is a factoid question, or this is a list question -- and that was basically only because the type of response they had to give differed depending upon the type. And one of the other things that we have since done to make these definition questions even more stable: remember, way back in the beginning I said that the assessor had to say whether a nugget was vital or just okay? If it was okay, then basically we ignored it in the scoring; that was the upshot of the way it was scored, an okay nugget was just treated as a no-op. Systems didn't get penalized in length for it, but they didn't get any credit for retrieving it either. And if you're using one person's opinion on that, it's really unstable. So instead of doing that -- and since creating the nugget list is the expensive part of this evaluation, because the assessor has to sit there and go through all these things and think about what they want, et cetera -- we now have one assessor create the nugget list, but then we give it to all ten assessors (we generally have ten assessors doing this) and ask whether they think each given nugget is vital. Then, using the number of assessors who think it's vital -- and this is analogous to the pyramid evaluation in summarization -- we weight the scores depending upon how many of the assessors think it's vital. That just gives a nicer, smoother evaluation measure and therefore makes it more stable. Okay. One of the problems with the QA tasks, though, even the factoids, is that they're not reusable in the sense that Cranfield test collections are reusable. We have some partial solutions to this, in the sense that we make answer patterns: if you retrieve a response for a factoid question and your answer matches the answer pattern -- a Perl pattern -- we consider it correct, and if it doesn't, we consider it incorrect. That can go a long way, but you can game it. You can make answer strings that match the patterns but are in fact not correct. And the reason they aren't really reusable is that we still don't have a solution to this problem of not having unique identifiers, right? Once you have unique identifiers, which you do with document IDs in Cranfield, then you're sure that this is the thing that should have been retrieved; but in the absence of unique identifiers you're still back in this sort of squishiness and don't have real reusability. And the problem with not having real reusability is that it makes it much harder to train systems and to learn from previous evaluations, if you can't get comparable scores while you're developing your system. We still don't have a solution to that. I think I probably don't have enough time here to go through the entailment.
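[Editor's sketch, not part of the talk: the pyramid-style weighting just described, where a nugget's weight comes from how many assessors marked it vital and recall becomes weight-based. The normalization here is an assumption for illustration; the official scorer may differ.]

    # Pyramid-style nugget weighting (sketch).  vital_votes[n] is the number
    # of assessors (out of, say, ten) who marked nugget n vital; 'matched' is
    # the set of nugget IDs found in the system response.

    def weighted_nugget_recall(vital_votes, matched):
        top = max(vital_votes.values(), default=0) or 1
        weights = {n: votes / top for n, votes in vital_votes.items()}
        total = sum(weights.values())
        if total == 0:
            return 0.0
        return sum(weights[n] for n in matched if n in weights) / total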
Just to say, basically, the entailment task is to decide whether or not a second sentence must be true given a first sentence. I've got a couple of examples there. And NIST was part of the textual entailment evaluation that has been running for the past couple of years. In the original definition of the entailment task you simply said yes, it was entailed, or no, it wasn't. But "no, it wasn't" includes both "can't possibly be true" and "just simply isn't entailed." So last year we split that out into three ways: it's either entailed; or it contradicts, which is "no, it can't possibly be true"; or neither -- it doesn't contradict, it isn't entailed, it's just, you know, sort of unrelated. So you can see the first example here is an actual contradiction. The second one neither entails nor contradicts. And the upshot of this was that we used the exact same test set that was used for the two-way task and the exact same evaluation measures, et cetera; we figured all that would be fine because it was fine for the two-way case. But then we looked at the results for the three-way case, and we realized that because accuracy, which is the measure being used, is still based on exact agreement -- the system gets its point only if it gives the exact answer, whether that's yes, no, or neither -- it's still a binary score for the systems, but now there are three choices. And that lowers the human agreement rate, because the humans also have three choices, and it lowers it to the point that the evaluation we used last year for the three-way task was unstable, meaning the differences that the systems were producing were within the error bounds implied by the level of human agreement. So what do we do there? Well, we could add more cases, but how many? We don't know how many more. So what are we going to do for TAC? Last year it was 500; I think this year it's going to be 900. We could change the measure, but to what? I don't have any brilliant ideas on what measure, so for right now we're leaving the measure where it is. We could get more annotation agreement -- we could sort of force the assessors to agree more -- but that only works by making up rules, and those rules are only reasonable if they actually represent phenomena you want represented, right; arbitrary rules just for the sake of rules don't help anybody. And the other thing that we can do, and probably will do, is that if two assessors disagree we'll just make the official answer "neither," because the fact that the assessors can't agree is sort of good evidence that the pair neither contradicts nor entails. That's sort of a subcase of forcing more annotation agreement. So here is my case for these community evaluations -- and coming from NIST I may be biased, but I am a firm believer in being able to measure things, to be able to really concretely, you know, put a number on things. They form or solidify an already existing community. They are instrumental in establishing the research methodology for that community. They facilitate technology transfer. They can document what the state of the art currently is, which is helpful especially for students who are coming in and want to see where it is that they need to start from.
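[Editor's sketch, not part of the talk: a small illustration of three-way accuracy plus the adjudication rule mentioned above, falling back to "neither" when the two assessors disagree. The label strings and the two-assessor setup are illustrative.]

    # Three-way entailment scoring sketch.  Labels: "entailed", "contradicts",
    # "neither".  gold_a and gold_b are the two assessors' labels per pair;
    # 'system' is the list of system labels in the same order.

    def adjudicate(gold_a, gold_b):
        # If the assessors disagree, fall back to "neither": their
        # disagreement is itself evidence the pair neither clearly entails
        # nor clearly contradicts.
        return [a if a == b else "neither" for a, b in zip(gold_a, gold_b)]

    def accuracy(system, gold):
        return sum(1 for s, g in zip(system, gold) if s == g) / len(gold)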
And another nice thing is they amortize the cost, right; it's much more cost effective to have one central place creating all this infrastructure that is needed to do these evaluations than to have everybody off in their own research group recreating different, incompatible, but just as expensive infrastructure. Now, there are some downsides. These evaluations do take resources that could be used for other things: there's researcher time, and there's the money to defray the evaluation costs. We try to minimize this effect by keeping the things that exist for the evaluation only, such as the reporting format, as simple as possible, to keep those costs as low as possible. And then we also have the problem of overfitting. Once a data set is out there, right, the entire world uses that data set, and you get this entire community who has decided that three percentage points of F measure over this one data set is the be-all and end-all of their research, right; that's a problem. Especially because any given test set is going to have some peculiarity, and you don't really want people exploiting that peculiarity of the data set. We try to minimize that effect by having multiple test sets that evolve, the same way the question answering task evolved throughout its lifetime, both to make sure that we're testing the things that need to be tested as the systems get better and to prevent this extreme focus on what are almost certainly insignificant performance improvements. And that's it. Thank you. [applause]. >>: [inaudible]. >>: You mentioned -- I mean, this was a long story about a rather specific case, but I think there are a bunch of more general lessons that could be pulled out of it about what makes a good [inaudible] evaluation. You mentioned costs, various benefits like stability and so on, and at many times you sort of longingly referred to Cranfield as a sort of poster child of how it might be and how you wished you could make this example fit that. Now, what I'm kind of wondering is, if we could roll back the clock, knowing what we knew then and not what we know now, would the community have all been united and agreed that Cranfield was just a wonderful success on all the things you're describing? I have a feeling that for a long time -- you know, say for Gerry Salton, who was working without as many resources as we have today -- there wasn't such a consensus that it was such a great success. >> Ellen Voorhees: Okay. There are a couple of answers to that. One is that Cranfield has never enjoyed, and still to this day does not enjoy, community-wide acceptance. It probably enjoys its acceptance more now than ever, and that's largely due to TREC and related things, because we have -- I will claim, at least -- demonstrated that it really has real benefit. Having said that, just go to any SIGIR and look at the workshops; there's bound to be one workshop on why it is that Cranfield is so terrible and what we are going to do about it, right? And they have many good points in there, particularly that Cranfield does not represent the user in the way people now want the user represented. There are, in my opinion, good reasons why Cranfield doesn't represent the user: it was specifically trying not to. But Cranfield is not the abstraction that people who are interested in how users interact with search systems can use.
It's just not the right abstraction. There are people working on trying to extend Cranfield to be such an abstraction, and I have worked some on how you would do that, and I'm personally very pessimistic, because any time you try to change Cranfield, even a little bit, the costs skyrocket, because you've introduced a lot more variability and now you can't conclude anything anymore -- I mean, not without much larger-scale efforts. The one thing you said I kept longingly looking at -- the one thing I would like these other evaluations to produce, which Cranfield does and these don't, is reusability. And I recognize why these don't produce reusability: it's the lack of a unique ID. And I don't know how to insert a unique ID into them. But until you get that reusability, or at least some very good approximation to it, the data sets are not nearly as useful for researchers. They're basically only evaluation data sets and not training data sets. >>: Have you ever [inaudible] using something that [inaudible] evaluating how good [inaudible]. >> Ellen Voorhees: The short answer to that question is yes, I have considered it. The longer answer is that as a NIST evaluation it's not going to happen. I am not allowed to use Mechanical Turk, as I can't give US government funds to a private company to hold in escrow, so Mechanical Turk as such is not going to happen. >>: If somebody else funded it, could it happen? >> Ellen Voorhees: If somebody else funded it and ran it, then it could happen, and I could report it and I could perhaps use those judgments, but -- [laughter]. >>: [inaudible]. I'm just asking. [laughter] I mean, we're not talking about a lot of money. >> Ellen Voorhees: I know that. That's why I was interested in it. [laughter]. >>: [inaudible]. >>: Especially when you say [inaudible]. >> Ellen Voorhees: Right. No, we're not talking about a large amount of money, I know that, but -- >>: Right, and [inaudible] red dollars for green things. But if somebody else [inaudible]. >> Ellen Voorhees: Right. And actually, at this past ACL there were a reasonable number of groups that had in fact gone out to Mechanical Turk to get different types of annotation. >>: [inaudible]. >> Ellen Voorhees: Did, probably, yes. [laughter] Lucy? >>: [inaudible] evaluations or judgments for the RTE where there were two assessors? >> Ellen Voorhees: Okay. The way last year's RTE worked is that the people who created the test set, which was the CELCT organization in Italy, had at least two -- it might have been three, but it was at least two -- assessors both selecting the pairs and deciding whether or not each was entailed. And if they disagreed, it didn't get into the test set at all. Okay? But then when NIST went to do the three-way task, we decided, oh, we might as well, and I had the NIST assessors who were doing this judge the pairs such that the union of what they judged was the entire test set. I didn't necessarily have to do that, because we did not change anything that the original RTE test set said was entailed -- we left it entailed despite what our assessors said, and I knew we were going to do that -- so I didn't necessarily have to have our assessors judge those ones, but I did anyway, to see if there were disagreements, and in fact there were disagreements.
>>: [inaudible] so in those cases where they are disagreeing -- I'm interested in what the disagreements are. >> Ellen Voorhees: Okay. I'm sorry, I misunderstood what you were saying. >>: [inaudible]. >> Ellen Voorhees: The disagreements were things along these lines. The one I remember very particularly was a pair that hinged on whether or not Hollywood and Los Angeles were the same, and the Italians said yes and our assessor said no. >>: [inaudible] how close you were [inaudible]. >> Ellen Voorhees: Right. There was another one where the hypothesis said East Jerusalem did such and such, and the text said Jerusalem did such and such. Now, is East Jerusalem a part of Jerusalem? Is it the same? Yeah, well, maybe -- but is West Virginia part of Virginia? No. Right. >>: [inaudible] it used to be. [laughter]. >> Ellen Voorhees: Yeah. But that doesn't matter anymore, right. There was another one involving two Italian politicians, and I wanted to see what this disagreement was about, so I actually went to Wikipedia to see who these people were. Apparently these politicians were in the same party, but they were opponents within the party. The hypothesis said that so-and-so and so-and-so were opponents, where from the text you wouldn't possibly have known that; they were in the same party, right. And so the NIST assessors said they were not opponents, and the Italians said they were opponents. So really, even at this little level of textual entailment, the context and the world knowledge that you have really, really matter. >>: [inaudible] I mean it would be interesting [inaudible]. >> Ellen Voorhees: I have all the judgments that NIST made. I kept them all, yes. >>: Because [inaudible] the remaining challenges for RTE -- this is a very interesting set: where is world knowledge truly needed? >>: We had a similar experience where we tried to use people in India to judge our HRS process, and we found that they had a hard time distinguishing between American issues and British issues, things like this. They wouldn't know the pop stars, the latest ones. >>: Right. >> Ellen Voorhees: Neither would I. >>: [inaudible] is relevant here is like difficult. >> Ellen Voorhees: Right. All of which is just to say that as much as we might like to -- and we did in fact try to -- make these evaluations independent of things which aren't specific to what's being tested, you really can't do that. We can try, and I personally think it's a good thing to try, because we want to control as much as we can, but we do have to recognize that we aren't going to eliminate it completely. And what do we do with those cases? That's another interesting question, right. If you can't get humans to agree, we can't really expect the systems to agree, so maybe we just give a system a gold star when there's as much agreement between that system and a human as between one human and another. The systems aren't anywhere close to that, right, but if you get that far, then we will have to go on to some other type of evaluation. I'm sorry. Yes. >>: So [inaudible] external resources to [inaudible] Wikipedia instead of the same resources? >> Ellen Voorhees: We're talking now within RTE? >>: Yes.
>> Ellen Voorhees: To the best of my knowledge, RTE assessors don't use any external resources; it's supposed to be the case that their entailment decision is made strictly on the piece of text which is given. But that's really hard for humans to do, right? Humans can't subtract out what they know; it's hard for the systems, and it's also hard for the humans to subtract it out. So no, they don't really use external resources. Sometimes, in document judging or more so in question answering, when we have asked assessors to judge questions for which they have no concept of the area, they will do basic Web searches to sort of get up to speed. But that's pretty much it, because again we have to make sure that it's controlled in some way. Also, in question answering -- I never really did get back to that -- the way we at NIST implemented the question answering track, we allowed systems to use external resources; they were allowed to use basically anything they wanted. Largely because there isn't any effective way of really saying what you can and can't use -- one person's external resource is an integral part of another person's system. I mean, it's a very slippery slope to try to sort all that out. WordNet is a prime example: for some people it is an external resource, but for other people it's just part of their system. So we didn't try; we just said you can use anything that you want, but you have to return a document in the collection that supports the answer. And in fact there were a fair number of systems that went out to the Web, answered the question, and came back and projected the answer back into the document collection. Actually Microsoft's system was one of those that went out to the Web, and very reasonably, right -- the Microsoft one in particular took the attitude of why try to do fancy things when you can just use large numbers, go find the answer, and project it back in. And you could tell which systems did that, because by and large the projection didn't work well. [laughter]. But we did allow it. And for our definition questions, the questions that we selected we actually did try to choose such that we didn't think they could be answered simply by going to biography.com or Wikipedia, et cetera, and scraping that and just returning it. That's actually getting harder now, because the resources on the Web are getting more and more complete. [applause]