>> Bill Dolan: So I'm really pleased today to welcome Ellen Voorhees, who is visiting us from NIST, where she's the manager of the text retrieval group -- just the retrieval group -- which encompasses a range of technologies. I'm sure you're familiar with many of these evaluations, like TREC, a newer one, TRECVID, and a newly forming set of evaluations called TAC that I'm sure Ellen will be telling us all about.
Ellen's background is a Ph.D. in information retrieval from Cornell, working with Gerard Salton. She joined NIST after ten years at Siemens, where she did a lot of interesting work that I remember really well on WordNet and information retrieval back in the '90s.
So I won't belabor this. Thank you, Ellen.
>> Ellen Voorhees: Thank you. Thanks very much. I'm glad you could be here today. I'm going to be talking about some of the language technology evaluations that we're doing at NIST. And one of the questions I frequently get -- not necessarily meanly, people just wonder -- is why on earth NIST does this at all. Why should NIST care about language evaluations?
And so here you have, up at the top of the slide, the official NIST mission. Just to make sure everybody knows it, NIST is the National Institute of Standards and Technology. It's part of the federal government's Commerce Department. It's also the home of the national clock, so official time, NIST time, comes from NIST. It is the home of the nation's kilogram. Well, the kilogram is actually still an artifact, right? There are a lot of people at NIST working to make it not an artifact anymore, because all of the other standard units of measure are now defined through other processes -- the meter is defined by the wavelength of some element at zero degrees Celsius, et cetera, et cetera.
Information technology hasn't really gotten to that level of standardization. But the NIST mission as a whole is to promote innovation and competitiveness by advancing measurement science, standards, and technology, and we fit into this quite well: our group's mission statement is to do that same promotion of innovation and industrial competitiveness, but focusing on information technology problems.
And one of the reasons why it makes sense for NIST to be running these sorts of evaluations -- and I'll get to what sorts in a moment -- is that we're a technology-neutral site. By that we mean we don't actually have an axe to grind about whether you do your retrieval or your entailment by Gaussian networks or by rules or whatever method; we're there to evaluate the end result and not to promote any particular method in doing that. And at this point, we have a reasonable amount of expertise and experience in operationalizing these tasks and in evaluating whether or not the evaluation itself is actually accomplishing what we want it to accomplish.
So within the retrieval group, the most well known of our evaluations is probably TREC. TREC started in 1992. It's a workshop focused on building the infrastructure for evaluating text retrieval systems. I have text in parentheses there because we have branched out into some other sorts of media, but basically we're still concentrating on text. And the three main things that TREC has done are to create realistic test collections, a standardized way of evaluating retrieval, and a forum for exchanging these results.
As I said, it started in 1992, so at this point we're now in our 17th TREC. We started with just two tasks -- these tasks here -- but we have branched out into a variety of different tasks over the years. And I mostly have this slide up just so you get the idea that it is not a single monolithic thing: there are individual tracks, and each TREC has its own combination of tracks and its own combination of things that people are focusing on.
And so if you take the whole set of these different tasks that TREC has looked at over the years, we've actually looked at a fair number of different problems, and many of these tracks have had individual tasks within them as well. So we're probably looking at upwards of about 100 or so individual tasks which have been evaluated over the years.
So what good is that? Well, TREC itself did not come up with the way that retrieval systems are evaluated. It was built upon an already existing paradigm in the information retrieval community that was based on the idea of a test collection. We're talking here about ad hoc search -- ad hoc meaning it's a one-off search, you don't have any history for this search; it's the prototypical search of, say, the Web or any other large collection.
And the abstraction there was to obtain a set of queries or questions, a document set, and then relevance judgments. These relevance judgments are the things that say which documents should be retrieved for which questions, okay.
This was originally created by Cyril Cleverdon in the late 1960s. It's called the Cranfield paradigm because he was at the Cranfield Aeronautical College when he did this. And the whole idea was specifically to try to abstract away the problems of having individual users while trying to evaluate search. Okay? The idea behind it is that in order to do evaluation you want an abstract task, because it is difficult to evaluate the real end-user task given the amount of variability in there. So the idea was to abstract away to a task which is sufficiently close to a real user task to be applicable to it, but which lets you control a lot of the variability, and therefore lets you do more controlled experiments and get more valuable results from those experiments.
And this sort of core task is what Karen Spärck Jones called the core competency of an evaluation task: an abstract task which has real application to a user task but has enough control that you can actually learn something from an experiment.
The consequence of this abstraction is that you do gain control over lots of these variables, so you get more power for less expense, at the cost of losing some realism, because obviously you are purposefully doing an abstraction here. Whatever is not accounted for in the abstraction is a loss of realism.
So is Cranfield reliable? Lots of people will argue, and have argued for decades now, that you can't possibly evaluate retrieval systems if you're going to base it on a set of static relevance judgments, because we all know -- and we do know -- that, number one, when a real user uses a retrieval system their information need changes while they interact with the collection, so it's not static.
We also know that if you give the same set of documents to two people who have putatively the same information need, they are going to consider different sets of documents to be relevant, okay? And furthermore, we now know, from the big set of results that the cumulative effect of the different TRECs has allowed us to look at, that even though Cranfield was specifically developed to abstract away the user effect, the user effect is still the biggest variable within the Cranfield paradigm. If you look at an ANOVA of the TREC-3 results -- this was reported in the first issue of the journal Information Retrieval -- they took the TREC-3 results and ran an ANOVA over those results with systems as one set of variables and topics as a second set of variables, looked for interaction effects, et cetera, and found that the topic effect was larger than the system effect and that the interactions were significant but smaller, right.
So basically what this means -- and remember, "topic" here is TREC-ese for the question that was asked -- is that you're trying to evaluate systems when the topic effect, so what was asked, is bigger than the effect you're trying to measure, okay. That complicates retrieval evaluation by a whole bunch. It doesn't make it impossible, but it complicates it a lot.
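A sketch of that kind of analysis, assuming a table with one effectiveness score per (system, topic) pair; the statsmodels calls are standard, but the data layout here is hypothetical. With only one score per cell the interaction cannot be separated from the residual, so this fits just the two main effects.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# rows: (system_id, topic_id, score) triples, e.g. average precision per topic.
df = pd.DataFrame(rows, columns=["system", "topic", "score"])

# Two-way ANOVA with systems and topics as factors; compare the topic and
# system sums of squares to see which effect is larger.
model = ols("score ~ C(system) + C(topic)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```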
The other thing that people will argue about, in terms of whether Cranfield is at all meaningful, is the measures that you use to evaluate effectiveness. There is a whole slew of different measures that you might use. trec_eval, which is the evaluation package that you can download from the TREC site, reports something like 109, I think it is, different numbers for any run that you give it.
And we also know enough about looking at these evaluation measures to know
that different measures really are measuring different things. That's what I mean
by they represent different tasks. So one of the more common measures is
precision at 10, precision means how accurate were the set of documents that
you retrieved.
So precision at 10 means how many of the top 10 documents on average were
actually relevant, right? It's a very understandable measure, very intuitive of
what it means. It is in fact a fairly unstable measure because it's based only on
the top 10 documents. Okay? But that's a very different sort of task being measured if we're only looking at the top 10 than if you're looking at something like mean average precision, which is the measure that most frequently gets reported in the literature. Mean average precision looks at the entire set of documents that are retrieved, so it's based on very many more documents, but it doesn't have nearly as intuitive an interpretation of what the measure actually means.
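As a rough illustration of the two measures just described, here is a minimal sketch; the function names and data shapes are illustrative, not any official trec_eval code.

```python
def precision_at_k(ranked_docs, relevant, k=10):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked_docs[:k] if d in relevant) / k

def average_precision(ranked_docs, relevant):
    """Sum of precision values at each rank where a relevant doc appears,
    divided by the total number of relevant documents for the topic."""
    hits, score = 0, 0.0
    for rank, d in enumerate(ranked_docs, start=1):
        if d in relevant:
            hits += 1
            score += hits / rank
    return score / len(relevant) if relevant else 0.0

# Mean average precision (MAP) is just the mean of average_precision over topics:
# map_score = sum(average_precision(run[t], qrels[t]) for t in topics) / len(topics)
```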
So as I've already alluded to here, these measures mean different things, but they also have different inherent stability. Mean average precision, because it's based on the entire set of documents that are retrieved, is a much more stable measure than precision at 10. So if you're evaluating yourself, how does all of that interplay?
Well, we have taken the TREC results and we've done a number of studies to try to show what is actually happening under these numbers that get reported. The first thing I want to show -- I'm going to go through these all very quickly -- is that everything really is being reported as averages, because we know the topic effect is so great that individual topics are not particularly meaningful when trying to evaluate systems; you want to take averages over lots of topics. But those averages hide a lot of stuff.
Here the thick red line is the average recall-precision graph for a set of 50 topics from this run. And the little tiny lines are the individual topics that that is an average of, okay. It's actually only about 20, I think, of the 50, because if I put all 50 up there you can't see anything. But you can see that this is one run, it gets reported with this nice typical recall-precision graph, and this is what it's hiding. Some of those topics it worked really, really well on -- remember, your best point is up there. Other topics were really, really bad. But all you generally see is that big red line, and actually you're sometimes lucky to see even a whole recall-precision graph, because frequently just one number gets reported, okay.
So that's what is behind the numbers. The other thing we looked at was, well, okay, we know that relevance judgments do change. How does that affect the evaluation results? We have these assessors that come to NIST; they're actually retired information analysts from the intelligence community. Normally, because we know that different assessors have different opinions about what's relevant, each topic gets one relevance judge, so the judgments are at least internally consistent even though they represent one person's opinion, right. What we did another time, though -- two other times -- was we took the same query and the same set of documents that were retrieved by the systems and we gave it to a second judge and then to a third judge. And we did that for all 50 topics in the TREC-4 ad hoc test set.
And then, since there were 50 topics and we had three judges per topic, I could create a different set of relevance judgments for the whole 50 topics by choosing one of those three judges for each of the 50 topics. So we have 3 to the 50th possible sets of relevance judgments, right? Now, that's a whole bunch, and I didn't create all of those, but I created 100,000 of them.
And I evaluated the same set of runs -- so this is exactly the same set of runs that retrieved exactly the same sets of documents, because it's the same run -- using each of those 100,000 different relevance judgment sets.
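A minimal sketch of that sampling, assuming three alternate judgment sets per topic and reusing the average_precision sketch from above; the names and data layout are assumptions, not the actual study code.

```python
import random

# qrels_by_topic[topic] = [qrels_judge1, qrels_judge2, qrels_judge3], each a set
# of relevant doc ids; run[topic] = the system's ranked list for that topic.
def sample_judgment_set(qrels_by_topic):
    """Choose one of the three judges' qrels independently for each topic."""
    return {t: random.choice(judges) for t, judges in qrels_by_topic.items()}

def map_score(run, qrels):
    """Mean of average_precision (see the earlier sketch) over the 50 topics."""
    return sum(average_precision(run[t], qrels[t]) for t in qrels) / len(qrels)

scores = []
for _ in range(100_000):  # a sample of the 3**50 possible judgment sets
    qrels = sample_judgment_set(qrels_by_topic)
    scores.append(map_score(run, qrels))
print(min(scores), max(scores))  # the spread plotted for this one run
```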
And what's plotted here is, for each run, a bar giving the minimum and the maximum mean average precision value obtained by that one run over the different relevance sets. And you can see there is a range. Each of these runs actually has a fairly big spread in how that one particular run would have been evaluated depending on who we had used as assessors. But the important point here is that the scores the runs get using a given set of relevance judgments are highly correlated with the scores they get using a different set of judges -- let me say that again better.
If you look at the individual scores -- so for instance the red triangle, that was the official TREC-4 set of judges -- if you use the same set of relevance judgments for each run, you will see that the runs stay basically in the same order. What this means is we have hard judges and we have lenient judges, but all of the systems generally rank in the same order as long as you use the same judge for the two different runs you're comparing.
>>: I have a question about your reliance on these judges from the intelligence
community. So I suppose it's good in a sense because their whole life has been
learning this for how to make judgments. But it seems like are you getting sort of
a picture of what relevance is versus just [inaudible] off the street?
>> Ellen Voorhees: It depends upon what you're trying to represent. So yes, if you have some very particular notion of relevance that you want your judgments to represent, you had better use judges that represent that notion. We used intelligence analysts because when TREC began it was specifically targeted at the intelligence community. It's a fairly homogeneous group of assessors, because they're basically all retired information analysts. We repeated this experiment, however, with TREC-6, and there the second judges were University of Waterloo graduate students, who are pretty different from retired intelligence analysts. And the same thing holds. And in fact, the variation within our group and within their group was not distinguishable from the variation between those two sets. So people are just different.
>>: [inaudible].
>> Ellen Voorhees: Actually the graduate students are a lot cheaper, especially since it was the University of Waterloo doing it and we didn't even do it; it was a lot cheaper for us.
But all of this is just to say that yes, it is undoubtedly the case that people's opinions differ. We know that. But as long as all you're trying to do is compare systems -- which is all the test collection paradigm says you should do -- in fact it doesn't matter, because the comparison is stable. And I've got to go a lot faster than this if I'm going to finish.
Another study that we did with this set of results was to look at the inherent error rate within a measure. Basically we looked at this set of different measures: precision at 10, precision at 30, average precision, R-precision, and recall at 1,000.
If you don't know what all those measures are, don't worry about it. The main thing to see here is what happens when you require a bigger delta between scores before you conclude the two systems are different. Remember, all you're doing here is comparing two systems; you have a score, say a precision at 10 score, for each of the two systems. If one score is .1 and the other score is .9, you probably want to consider those two different. But if one score is .1 and the other is .11, you probably don't want to consider those two different, right.
So this fuzziness value is the percentage of the larger score that the difference has to exceed to be considered a real difference. At three percent, the difference between the scores has to be at least three percent of the bigger score in order to consider the two scores different. And here it's five percent, et cetera.
The bigger the difference that you require, obviously the smaller your error rate is going to be. And here the error rate is computed over multiple comparisons: if one comparison concludes that system A is better than system B, the error rate is the number of times that looking at a different version of that A and B comparison says that B is better than A.
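To make the fuzziness idea and the swap-based error rate concrete, here is one way it could be computed; the data layout and function names are assumptions, not the actual TREC analysis code.

```python
def compare(score_a, score_b, fuzziness=0.05):
    """'A', 'B', or 'tie': the difference must exceed fuzziness * the larger
    score before the two systems are considered different."""
    if abs(score_a - score_b) <= fuzziness * max(score_a, score_b):
        return "tie"
    return "A" if score_a > score_b else "B"

def swap_rate(paired_means, fuzziness=0.05):
    """paired_means: list of ((a1, b1), (a2, b2)) tuples, where (a1, b1) are two
    systems' mean scores on one query set and (a2, b2) on another.
    The error rate is the fraction of decided comparisons that flip."""
    decided = swaps = 0
    for (a1, b1), (a2, b2) in paired_means:
        d1, d2 = compare(a1, b1, fuzziness), compare(a2, b2, fuzziness)
        if "tie" in (d1, d2):
            continue
        decided += 1
        swaps += d1 != d2
    return swaps / decided if decided else 0.0
```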
We did this using the same set of topics but different queries, data which we got from the TREC Query Track. I don't want to go into that detail -- I know that wasn't very clear, but if you want lots of details about it, ask me later. Right now all I want to say is that different measures have different inherent error rates. There are some measures which are inherently less stable than others. And it's not really a big surprise; it's just that this shows it.
And the reason why they have inherently different error rates is largely just related to how much information is encoded in the measure. Precision at a cutoff, which is a very common measure and which people like for good reasons, happens to be a very unstable measure, because the only thing the value of that measure depends on is the number of relevant documents at that cutoff. It doesn't matter how they're ranked within the cutoff or below it; the only thing that changes that score is whether another relevant document entered that set or left that set. So that's a very impoverished measure. It's a very user-friendly measure, but it's a very impoverished measure, and precision at a cutoff is one of the less stable measures.
>>: [inaudible].
>> Ellen Voorhees: Yes?
>>: I understand this and I think it's great to look at these measures but there are
cases where what you really care about is precision at 10 or precision at 1 or
something like that.
>> Ellen Voorhees: Right.
>>: Do you know how this changes, how the inherent error rate changes, if you add more queries? If you have more queries, shouldn't that reduce your error?
>> Ellen Voorhees: Right. It goes up. The error rates go down. Your stability
increases.
>>: [inaudible] looks like it.
>> Ellen Voorhees: It's funny you should ask. So the next thing we looked at was how error rates depend upon topic set size. And the big thing to see here is that the more topics you have, the more stable your evaluation becomes, where once again stability means that you are consistently saying the same system is better in your comparison. Okay?
Again, these different lines depend upon the difference in score that you're using to decide whether or not systems are actually different. And this particular graph is just showing mean average precision, but similar sorts of graphs, except starting at higher error rates, hold for precision at 10 and for other measures.
>>: [inaudible] error rate over the range of MAP values, say? So maybe MAP is more sensitive at the higher range of values than at the lower range of values? Something like [inaudible] certainly is.
>> Ellen Voorhees: Right. We have not looked, or at least I have not looked, specifically at the absolute value of the measure and how sensitive it is at that absolute value. Except to say that obviously, if you're requiring some given distance between the two measures and the results are very poor, you're not even going to get to that level. So at very poor, very low values of a measure where higher is better, things are actually quite stable, but that's because you have no room to be unstable. Once you get to zero, you're zero. So yes, it will be somewhat more unstable at larger numbers, but at least part of the explanation is that you have more room to fluctuate.
All right. All of this was to say that we have spent a fair amount of time looking at the Cranfield methodology, to see what you can and can't learn from it. And me being a TREC person and a big fan of the Cranfield paradigm, I come to the conclusion that Cranfield is both reliable and useful, right. It's reliable within certain parameters: it will not answer every question that you have, but for the purpose of comparing retrieval system behavior it is quite reliable. And it's useful in the sense that what this graph is showing you is the improvement in ad hoc retrieval over the first eight years of TREC. This is for one system; this is just the SMART system, which was from Cornell and Sabir Research.
What they did is they kept a frozen copy of their system each year, right,
because you have a new test set each year. The test set itself might be harder
or easier so you can't really compare scores from year to year because that's not
a fair comparison. But the smart group kept a copy of their system for each year
and ran each new test set with each system.
And basically what these graphs are showing is that retrieval performance
basically doubled between TREC 2 and TREC 6. Now, while this is only one
system, the smart system was among the top systems in all those years so it's
representative of the field as a whole.
And then it also says that ad hoc performance has leveled off since then. Why has it leveled off? Partly because TREC went into all these other different tasks and so people were concentrating on other tasks, and partly because it's not clear where the next big improvement in retrieval systems is going to come from.
Now, I wasn't going to do this all just on retrieval. I wanted to use that whole part as sort of a lead-in to what type of evaluation we then want for other types of tasks, and other language tasks in particular. So using all that as background, what makes a good evaluation task? Well, you want it to be an abstraction of some real task. And that's actually not as easy as it sounds. In particular the entailment task, which I know some of you are involved with here, is actually, I would claim, not an abstraction of a real task. There is not going to be an actual user task in which you ask someone whether this sentence is entailed by that sentence.
And that does affect the way that you can build the evaluation, because now when you have questions about, well, should we weight this part heavier than that part or whatever, you don't have the real task to go back to and say we should do it because of this; we just have to say we should do it this way because we think so. All right. But anyway, you want it to be an abstraction of a task so that you can control the variables. You do need to control the user effect here. But of course it has to capture the salient aspects of that user task or else your whole exercise is pointless.
You want the metrics to accurately reflect the predictive value of the thing, right -- you don't want your metric to give really high scores to the system which is actually the worst; you want your metrics to reflect reality. And then you get into some more meta-level things, such as having an adequate level of difficulty. If you make the task too easy, everybody does well and you learn nothing. People are happier, but you don't learn anything.
If you make it too difficult, you again don't learn anything because everybody does abysmally, but now people are angry too. So if you're going to err one way or the other, you might as well make it too easy, but still you want to try to hit a sweet spot where the systems can do it but it's not a slam dunk.
You'd really like your measures to be diagnostic, by which I mean that if you have a measure that says you're at 50 percent of the maximum value of that measure, it would be good if the measure actually showed you where the problem is. If you're a system builder, you'd like to know what the problem is.
And it's also best if the infrastructure that is needed to do this evaluation is reusable. That's best in the sense that it helps the community more, because more will be available, but it also makes the costs bearable if you can reuse it.
Now, there isn't any evaluation task that I know of that meets all of these in all aspects. But those are the types of things that you want to shoot for. So as Bill mentioned, one of the things that we're starting at NIST this year is the Text Analysis Conference. Basically the idea is that we're trying to create an evaluation conference series that would be for the natural language processing community what TREC has been for the retrieval community.
It's the successor to the Document Understanding Conference, DUC, which has been going on for several years now. DUC has been focused specifically on summarization, and summarization is a track in this inaugural year of TAC, but we've also taken the question answering track from TREC and put it into TAC, largely because the approaches to question answering these days are much more natural language processing centric than they are retrieval centric. Ken, you look unhappy.
>>: [inaudible] just thinking of the bad jokes about ducking and tacking.
>> Ellen Voorhees: Yeah, well, it was an incredible fight to get people to agree on the name of this conference. [laughter] Ask Lucy sometime. Lucy was involved in this argument.
>>: [inaudible].
>> Ellen Voorhees: Okay. [laughter].
>>: [inaudible].
>> Ellen Voorhees: Right. And so we're keeping TAC.
And the third part of TAC this year is going to be a textual entailment task, making both a two-way and a three-way decision. I'll get into what that means in a minute.
So what I wanted to go through now is a more detailed explanation of how we go about defining a new evaluation task when we're starting from scratch. As I said, in retrieval we didn't start from scratch; Cranfield was already well established by the time TREC started. But then we came up with this idea that we wanted to test question answering. And that was different from retrieval, right, so how did we go about defining that evaluation?
Well, one of the things we started with was saying we wanted to start with factoid question answering, so these were little text snippets as opposed to, what's the best way to fix my washing machine, right? No, no, no. We wanted a little short answer based on facts. So how many calories are in a Big Mac, who invented the paper clip -- these sorts of little factoid questions. And we had the systems return ranked lists of document IDs and answer strings, where the idea was that the answer string was extracted from that document and contained an answer to the factoid question.
We started with factoids because of my previous slide, where I said we wanted tasks that were of adequate difficulty. When we started this task we had no idea how well systems would be able to do. But we had a pretty good belief that answering factoid questions was going to be a heck of a lot easier than answering process-type or, you know, essay questions. We also had a reasonable idea of how we could go about evaluating factoid questions, because we were expecting a single answer, and it was either that answer or it wasn't, and we could compute scores from that. And we did that for several years. It worked reasonably well and the systems got better.
And then we hit TREC 2001 -- the factoid question answering track started in 1999, so three years later -- and we saw that about 25 percent of the test set, which we had extracted from logs, one of which was donated by Microsoft, so these were web search logs we took the questions from, while they could be answered in little short phrases, weren't really factoid-type questions; they were much more definition questions. What is an atom, what's an invertebrate, who is Colin Powell.
These types of questions were hard for the systems to do, but it was also hard
for the assessors to judge because what's an answer to this question? Right?
>>: [inaudible].
>> Ellen Voorhees: And it depends critically upon who the user is, or at least what the use case is, right. Is the person who is asking what an atom is a second grader writing a book report, or is it an atomic physicist? We probably aren't going to get an atomic physicist asking our system what an atom is, but you get the idea here.
And there's also this problem that looking through a huge corpus of newspaper,
news wire documents probably isn't the best way of answering these definition
questions, right.
>>: [inaudible] something related to [inaudible] seem to be [inaudible].
>> Ellen Voorhees: Well, remember we're at 2001 here, so Wikipedia isn't as popular as it is now. But these days, yes, people do go to Wikipedia to answer these questions, and we can get to that. But you've hit on the right idea: newspapers aren't the place to go looking for answers to this type of thing. So what we did is we created a new task which was specifically focused on definition questions, where a definition question here means a what-is-object, or who-is-person, or what-is-company type of question. And the problem with trying to evaluate that is we now have to define what a good answer is.
And there are different parts to what makes a good answer. One will depend upon what you're assuming your user base is, but there's also how we're actually going to evaluate it in terms of giving scores, because what we really wanted to do is reward the systems that retrieve all of the concepts they should have retrieved and penalize them for retrieving concepts they shouldn't have retrieved. But what's a concept? Right? In particular, concepts can be expressed in very many ways, and so we're back to all the problems that we have when we're trying to evaluate natural language processing systems: there's no one-to-one correspondence between the things that we want and the ways they might have been expressed, and now we're going to have to deal with that.
In particular, we decided that a good way of evaluating these definition questions would be by recall and precision of these concepts, but different questions had very different sizes of what should have been retrieved, right. So a very, very specific question -- one of the ones we had was, I don't know, some sort of circuit that was used in laser printers, and that was pretty much all there was to it, it was a circuit used in laser printers -- versus who was Colin Powell, right. I mean, what do you return for Colin Powell? Very much more would get returned for Colin Powell than for this circuit.
We had a couple of pilot evaluations of how we were going to evaluate definition questions, and this was all part of the work that we were doing with the AQUAINT program. AQUAINT was a question answering program sponsored by what was at the time ARDA, which was an intelligence research funding group. They are now IARPA.
The idea of the first evaluation we tried was that systems were going to retrieve snippets out of these documents, we were going to collect all those snippets and give the whole set, the whole bag of snippets, to our assessor, and the assessor was just going to say, on a scale of zero to 10, I think this is a five as an answer -- where zero was completely worthless and 10 was the best possible answer they could imagine.
Actually they were going to give two scores, one for content and the other for presentation, and those of you who are summarizers will know that presentation is a big thing here, too. And we decided, okay, we'll use a linear combination of the two scores and that will be the final score and that will be it. By and large that didn't work. It had a couple of good things. It does give you an intuitive feeling for what a human thinks of this set of facts that gets returned, and it's not really possible to game that score. On the other hand, each assessor has their own sort of baseline for the scores they assign, and the differences in the scores among the assessors completely swamped any differences in the systems. Just no contest -- there was no signal there, right, it was all noise.
And this score doesn't help the system builders at all. Right, the fact that I think
that this summary is a seven, now you go change your system, right, I mean
what are you going to do, right? So we tried another thing. We tried what we
called the nugget based evaluation. And here we had the assessor actually go
through the corpus and the responses that got returned by all the systems and
create a list of what I called nuggets. These are concepts that the definition
should have contained. Right. And we asked the assessor to make these
atomic in the sense that they could make a binary decision about whether or not
that nugget appeared in a system response. That was our definition of atomic.
Not only were they to make up this list of things that should appear, but they should also go and mark the ones that absolutely had to appear in the definition for that definition to be considered good. Those were the vital ones. And then there were concepts that could have appeared in it -- it was okay, they weren't offended by the fact that this fact was there in the definition, but it didn't have to be there. And those facts which, while true, they didn't want to see reported should never have been made nuggets; they don't even make the nugget list. So nuggets are only those things that the assessor either definitely wants to see or is okay with seeing in a definition of this target.
Then once they created that list, they went back to the system responses and marked where in the system responses these nuggets occurred, if they did, at most once. Okay. It could be that an individual item returned by a system had one or more or zero concepts in it, and maybe none of the nuggets were actually in it.
Here's an example. One of the questions was, what is a golden parachute. The assessor created the set of nuggets shown here on the left, those six things, and the ones in red, the first three, are the ones that this particular assessor thought were important, that had to be in the definition. One particular response from a system is shown on the right, and those are the nuggets that the assessor assigned. So that means that the assessor saw in response item A from the system both the fact that it provides remuneration to executives and that it's usually generous. Nugget 6 was in item D. Item F didn't contain anything, or else contained only duplicates; we didn't distinguish between those.
Now, from that set of things we can compute recall -- we can compute how many of the nuggets the system should have retrieved that it actually did retrieve -- but we still don't have any precision, because we don't know how many concepts were actually contained in the system response. What would you consider "the big payment that [inaudible] received in January was intended as a golden parachute" to be? How many nuggets of information are contained in that? Who knows? We actually did ask a couple of assessors to try to enumerate them all, and we tried it ourselves -- where ourselves means the staff, so I even tried one of these. It's impossible. You cannot possibly say how many nuggets of information are in a system response. So what are we going to do about precision?
We punted. And we took a lesson from the summarization evaluation which was
going on at the time and we just used length. But the idea being that the shorter
the response that contains the same number of nuggets the better, okay.
So systems got some basic allotment of space, and then their scores started going down once they exceeded that length.
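A minimal sketch of this style of scoring: recall over the vital nuggets, with the length allotment standing in for precision. The shape follows the TREC definition-question scoring, but treat the specific constants (a fixed allowance per matched nugget, a recall-heavy beta) as assumptions for illustration.

```python
def nugget_f(vital_returned, vital_total, nuggets_returned, length,
             allowance_per_nugget=100, beta=3):
    """Recall over vital nuggets; 'precision' is 1 until the response exceeds
    its length allotment, then decays with the excess length."""
    recall = vital_returned / vital_total if vital_total else 0.0
    allowance = allowance_per_nugget * nuggets_returned
    precision = 1.0 if length <= allowance else 1.0 - (length - allowance) / length
    if precision + recall == 0:
        return 0.0
    return (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)
```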
So now, once we had that definition, we wanted to do some of the same sort of evaluation of the evaluation that we had done with the retrieval results, and we wanted to look at a couple of things. Differences in scores are going to come about from a variety of things, one of which is mistakes by assessors. Now, when I report scores from our evaluations to participants, they're always horrified at this idea that there might be mistakes. And my general response, and my response now, is: get over it, right. Assessors are humans; they're going to make mistakes. Deal.
We try to make sure that there are as few mistakes as possible, and we even try to set things up so that assessors will make fewer mistakes, but there are going to be mistakes, and there were mistakes.
So in this particular case we could actually measure the mistakes, because there were some things which were actually duplicated -- some systems submitted the exact same thing more than once, and we didn't have any mechanism for catching that -- and they were judged differently on occasion.
So that's not a difference of opinion, that's a mistake. But there were differences of opinion. There were differences of opinion even for factoids, and there are certainly going to be differences of opinion here. And then there's the sample of questions: some systems might just be able to do some particular type of question more easily than another.
For a couple of these we did go ahead and have the questions independently judged by two assessors. And what we saw was -- well, actually it wasn't quite the same thing as for retrieval. The scores were highly correlated again. But in this particular evaluation we got swaps, where with one assessor one system would look better than the other, but if you changed the assessor it would be the other way around, with fairly large differences in scores. And the difference in score that you needed in order to be confident that you weren't going to get these swaps was bigger than the differences in the scores that the systems were producing. And that's obviously a problem.
Here again we also did the same sort of thing of looking at the number of questions you would need in order to drive that error down below five percent. And it ended up, as I said, bigger than the differences the systems were producing, so the only thing we knew to do was to increase the number of questions being asked. I've labeled this try two prime, because what we then did, starting in TREC 2004, was to basically make the entire question answering test set definition questions. It was structured such that there was a set of targets -- the target was the thing you were trying to define -- and then a series of questions about that target. So we actually had multiple individual questions all about that same target.
And this allowed us to do a couple of things. One, it allowed us to have 75 rather than 50 definition questions, because the entire thing was focused on definition questions now. But it also allowed us to do a little bit of context -- only a little bit, but a little bit of context -- within these series, because within a series we started to use pronouns and things to refer to the target and to previous answers. That doesn't sound like a big thing, maybe, but it was a really big deal to the question answering systems at the time, because now they had to deal with that sort of [inaudible], and most of them didn't deal with it well, at least the first couple of times.
Within these series the systems were told what type each question was -- they were told this is a factoid question or this is a list question -- and that was basically only because the type of response they had to give differed depending upon the type.
And one of the other things that we have since done to make these definition questions even more stable -- remember, way back in the beginning I said that the assessor had to say whether a nugget was vital or just okay? If it was okay, then basically we ignored it in the scoring. That was the upshot of the way it was scored: an okay nugget was just treated as a no-op. Systems didn't get penalized in length for it, but they didn't get any credit for retrieving it either.
And if you're using one person's opinion on that, it's really unstable. So instead of doing that -- and since creating the nugget list is the expensive part of this evaluation, because the assessor has to sit there and go through all these things and think about what they want, et cetera -- we have one assessor create the nugget list, but then we give it to all 10 assessors (we generally have 10 assessors doing this) and ask whether or not they think each given nugget is vital. Then we use the number of assessors who think it's vital -- this is sort of analogous to the pyramid evaluation in summarization -- and weight the scores depending upon how many of the assessors think it's vital. That just gives a nicer, smoother evaluation measure and therefore makes it more stable.
Okay. One of the problems with the QA tasks, though, even the factoids, is that they're not reusable in the sense that Cranfield test collections are reusable. We have a partial solution to this in that we make answer patterns: if you return a response for a factoid question and your answer string matches the answer pattern -- a Perl pattern -- we consider it correct, and if it doesn't, we consider it incorrect. That can go a long way, but you can game it. You can make answer strings that match the patterns but are in fact not correct.
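For illustration, pattern-based judging looks roughly like this. The two patterns below are made up for the example, not taken from the actual TREC pattern files, and the last line shows exactly the kind of match-without-being-correct gaming just described.

```python
import re

# Hypothetical answer patterns, one regular expression per factoid question.
patterns = {
    "Who invented the paper clip?": re.compile(r"\bJohan\s+Vaaler\b", re.I),
    "How many calories are there in a Big Mac?": re.compile(r"\b5\d\d\b"),
}

def judge(question, answer_string):
    """Correct iff the answer string matches the stored pattern."""
    pattern = patterns.get(question)
    return bool(pattern and pattern.search(answer_string))

# Matches the pattern, but is not actually about Big Mac calories:
print(judge("How many calories are there in a Big Mac?", "a 540 horsepower engine"))
```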
And the reason they aren't really reusable is that we still don't have a solution to the problem of not having unique identifiers, right? Once you have unique identifiers, which document IDs in Cranfield give you, then you're sure that this is the thing that should have been retrieved; but in the absence of unique identifiers you're still back in that sort of squishiness and don't have real reusability. And the problem with not having real reusability is that it makes it much harder to train systems and to learn from previous evaluations if you can't get comparable scores while you're creating your system.
We still don't have a solution to that. I think I probably don't have enough time here to go through the entailment work in detail. Just to say, basically, the entailment task is to decide whether or not a second sentence must be true given a first sentence. I've got a couple of examples there. NIST was part of the textual entailment evaluation that has been running the past couple of years. Last year we added something: in the original definition of the entailment task you simply said yes, it was entailed, or no, it wasn't.
But "no, it wasn't" includes both "it can't possibly be true" and "it just simply isn't entailed." And so we split that out into three ways: it's either entailed; or it contradicts, which is the "no, it can't possibly be true" case; or neither -- it doesn't contradict, it isn't entailed, it's just, you know, sort of unrelated.
So you can see the first example here is an actual contradiction. The second one neither entails nor contradicts. And the upshot of this was that we used the exact same test set that was used for the two-way task and the exact same evaluation measures, et cetera; we figured all that would be fine because it was fine for the two-way case. But then we looked at the results for the three-way case and realized that because accuracy, which is the measure being used, is exact match -- the system gets its point only if it gives the exact answer, whether that's yes, no, or neither -- it's still a binary score per pair for the systems, but now they have three choices. And that lowers the human agreement rate, because the humans also have three choices, and it lowers it to the point that the evaluation we used last year for the three-way task was unstable, meaning the differences that the systems were reporting were within the error bounds implied by the human agreement.
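A small sketch of why the three-way decision is harder to agree on: with made-up labels from two assessors, exact-match accuracy (the measure in use) drops for the three-way labels even though the collapsed two-way labels agree perfectly.

```python
def accuracy(gold, predicted):
    """Exact-match accuracy over entailment pairs."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def to_two_way(labels):
    """Collapse three-way labels: CONTRADICTION and UNKNOWN both become NO."""
    return ["YES" if label == "ENTAILMENT" else "NO" for label in labels]

# Hypothetical annotations of the same five pairs by two assessors.
a1 = ["ENTAILMENT", "CONTRADICTION", "UNKNOWN", "ENTAILMENT", "UNKNOWN"]
a2 = ["ENTAILMENT", "UNKNOWN", "UNKNOWN", "ENTAILMENT", "CONTRADICTION"]

print(accuracy(a1, a2))                          # 0.6 three-way agreement
print(accuracy(to_two_way(a1), to_two_way(a2)))  # 1.0 two-way agreement
```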
So what do we do there? Well, we could add more cases, but how many? We don't know how many more. So what are we going to do for TAC? We're going to add some: last year it was 500, and I think this year it's going to be 900. We could change the measure, but to what? I don't have any brilliant ideas on what measure, so for right now we're leaving the measure where it is.
We can get more annotation agreement -- we can sort of force the assessors to agree more -- but that only works by making up rules, and those rules are only reasonable if they actually represent phenomena you want represented, right. Arbitrary rules just for the sake of rules don't help anybody.
And the other thing that we can do, and probably will do, is that if two assessors disagree we'll just make the official answer "neither," because the fact that the assessors can't agree is sort of good evidence that the pair neither contradicts nor entails. That's a subcase of forcing more annotation agreement. So my case here for these community evaluations -- coming from NIST I may be a bit biased, but I am a firm believer in being able to measure things, to be able to really, concretely give a number to things -- is that they form or solidify an already existing community. They are instrumental in establishing the research methodology for that community. They facilitate technology transfer, and they can document what the state of the art currently is. That's helpful especially for students who are coming in and want to see where it is that they need to start from.
And another nice thing is they amortize the cost, right: it's much more cost effective to have one central place creating all the infrastructure that is needed to do these evaluations than to have everybody off in their own research group recreating different, incompatible, but just as expensive infrastructure.
Now, there are some downsides. These evaluations do take resources that could be used for other things. There's researcher time, and there's the money to defray the evaluation costs. We try to minimize this effect by keeping the parts that exist for the evaluation only, such as the reporting format, as simple as possible to make those costs as low as possible.
And then we also have the problem of overfitting. Once a data set is out there, the entire world uses that data set, and you get this entire community who has decided that three percentage points of F measure over this one data set is the be-all and end-all of their research. That's a problem, especially because any given test set is going to have some peculiarity, and you don't really want people exploiting that peculiarity of the data set.
And we try to minimize that effect by having multiple test sets that evolve, just the same way the question answering task evolved throughout its lifetime -- to make sure that we're testing the things that need to be tested as the systems get better, but also to prevent this extreme focus on what is almost certainly insignificant performance improvement. And that's it. Thank you.
[applause].
>>: [inaudible].
>>: I mean, this was a long story about a rather specific case, but I think there's a bunch of more general lessons that could be pulled out of this about what makes a good [inaudible] evaluation. You mentioned costs, various benefits like reusability and so on, and at many times you sort of longingly referred to Cranfield as a sort of poster child of how it might be, and how you wished you could make this example fit that.
Now, what I'm kind of wondering is, if we could roll back the clock and knew what we knew then and not what we know now, would the community have all been united and agreed that Cranfield was just a wonderful success on all the things you're describing? I kind of have a feeling that for a long time, you know, say for Gerry Salton, who was working without as many resources as we have today -- there wasn't such a consensus that it was such a great success?
>> Ellen Voorhees: Okay. There are a couple of answers to that. One is that Cranfield has never enjoyed, and still to this day does not enjoy, community-wide acceptance. It probably enjoys its acceptance more now than ever, and that's largely due to TREC and related things, because I will claim at least that we demonstrated that it really has real benefit.
Having said that, just go to any SIGIR and look at the workshops; there's bound to be one workshop on why it is that Cranfield is so terrible and what we are going to do about it, right? And they have many good points in there, particularly that Cranfield does not represent the user in the way people now want the user represented. In my opinion there are good reasons why Cranfield doesn't represent the user: it was specifically trying not to.
But Cranfield is not the abstraction that people who are interested in how users interact with search systems can use. It's just not the right abstraction. There are people working on trying to extend Cranfield to be such an abstraction, and I have worked some on how you would do that, and I'm personally very pessimistic, because any time you try to change Cranfield, even a little bit, the costs skyrocket, because you've introduced a lot more variability and now you can't conclude anything anymore, at least not without much larger-scale efforts. The one thing you said I kept longingly looking at -- the one thing I would like these other evaluations to produce, which Cranfield does and these don't, is the reusability. And I recognize why these don't produce the reusability: it's the lack of a unique ID. And I don't know how to insert a unique ID into them. But until you get that reusability, or at least some very close approximation to it, the data sets are not nearly as useful for researchers. They're basically only evaluation data sets and not training data sets.
>>: Have you ever [inaudible] using something that [inaudible] evaluating how
good [inaudible].
>> Ellen Voorhees: So the short answer to that question is yes, I have considered it. The longer answer is that as a NIST evaluation it's not going to happen. I am not allowed to use Mechanical Turk, as I can't give US government funds to a private company to hold in escrow, so Mechanical Turk as such is not going to happen.
>>: If somebody else funded it, could it happen?
>> Ellen Voorhees: If somebody else funded it and ran it, that could happen, and I could report it and perhaps use those judgments, but -- [laughter].
>>: [inaudible]. I'm just asking. [laughter] I mean, we're not talking about a lot of money.
>> Ellen Voorhees: I know that. That's why I was interested in it. [laughter].
>>: [inaudible].
>>: Especially when you say [inaudible].
>> Ellen Voorhees: Right. No, we're not talking about a large amount of money, I know that, but --
>>: Right, and [inaudible] red dollars for green things. But if somebody else [inaudible].
>> Ellen Voorhees: Right. And actually, there are lots -- well, not lots, but at this past ACL there were a reasonable number of groups that had in fact gone out to Mechanical Turk to get different types of annotation.
>>: [inaudible].
>> Ellen Voorhees: Did probably, yes. [laughter] Lucy?
>>: [inaudible] evaluations are a judgment for the RTE where there were two
assessors?
>> Ellen Voorhees: Okay. The way last year's RTE worked is that the people who created the test set, which was the CELCT organization in Italy, had at least two assessors -- it might have been three, but it was at least two -- both selecting the pairs and deciding whether or not each was entailed. And if they disagreed, it didn't get into the test set at all. Okay? But then when NIST went to do the three-way task, we decided, oh, we might as well, and I had the NIST assessors who were doing this judge such that the union of what they judged was the entire test set.
And I didn't necessarily have to do that, because we did not change anything that the original RTE test set said was entailed -- we left it entailed despite what our assessors said -- and I knew that we were going to do that, so I didn't necessarily have to have our assessors judge those ones. But I did anyway, to see if there were disagreements, and in fact there were disagreements.
>>: [inaudible] so in this case they are disagreeing because I'm interested in
what the disagreements are.
>> Ellen Voorhees: Okay. I'm sorry. I misunderstood what you were saying.
>>: [inaudible].
>> Ellen Voorhees: The disagreements were things along these lines: the one I remember very particularly was a pair that hinged on whether or not Hollywood and Los Angeles were the same, and the Italians said yes and our assessor said no.
>>: [inaudible] how close you were [inaudible].
>> Ellen Voorhees: Right. There was another one where the hypothesis said East Jerusalem did such and such, and the text said Jerusalem did such and such. Now, is East Jerusalem a part of Jerusalem, is it the same? Yeah, well, maybe -- but is West Virginia part of Virginia? No. Right.
>>: [inaudible] it used to be. [laughter].
>> Ellen Voorhees: Yeah. But that doesn't matter anymore, right. There was another one about two Italian politicians, and I wanted to see what this disagreement was, so I actually went to Wikipedia to see who these people were. Apparently these politicians were in the same party but they were opponents within the party. So the hypothesis said such and such and such and such were opponents, where from the text you wouldn't possibly have known that -- they were in the same party, right. And so the NIST assessors said they were not opponents, and the Italians said they were opponents. So really, even at this little level of textual entailment, the context and the world knowledge that you know really, really matter.
>>: [inaudible] I mean it would be interesting [inaudible].
>> Ellen Voorhees: I have all the judgments that NIST made. I kept them all,
yes.
>>: Because [inaudible] challenge that you know the remaining challenges for
RTE, this is a very interesting set, where is world knowledge truly needed.
>>: We have a similar experience where we tried to use people in India to judge
our HRS process and we found that they had a hard time distinguishing between
American issues and British issues. Things like this. They wouldn't know the
pop stars, the latest ones.
>>: Right.
>> Ellen Voorhees: Neither will I.
>>: [inaudible] is relevant here is like difficult.
>> Ellen Voorhees: Right. All of which is just to say that as much as we might like to make these evaluations independent of things which aren't specific to what's being tested -- and we do in fact try -- you really can't do that completely.
We can try, and I personally think it's a good thing to try, because we want to control as much as we can, but we do have to recognize that we aren't going to get it eliminated completely. And what do we do with those cases? That's another interesting question, right. If you can't get humans to agree, we can't really expect the systems to agree, so maybe we just give them a gold star that says there's as much agreement between this system and a human as between one human and another. And if you can get that far -- which the systems aren't anywhere close to -- then we will have to go on to some other type of evaluation. I'm sorry. Yes.
>>: So [inaudible] strong resources to [inaudible] Wikipedia instead of the same
resources?
>> Ellen Voorhees: We're talking now within RTE?
>>: Yes.
>> Ellen Voorhees: To the best of my knowledge, RTE assessors don't use any external resources; it's supposed to be the case that their entailment decision is made strictly on the piece of text which is given. But that's really hard for humans to do, right? Humans can't subtract out what they know -- it's hard for the systems, and it's also hard for the humans to subtract it out.
And so no, they don't really use external resources. Sometimes, in more of a document-judging or question answering setting, when we have asked assessors to judge questions for which they have no concept of the area, they will go do basically Web searches to sort of get up to speed. But that's pretty much it, because again we have to make sure that it's controlled in some way.
Also, in question answering -- I never really did get back to that -- the way we at NIST implemented the question answering track, we allowed systems to use external resources; they were allowed to use basically anything that they wanted. That's largely because there isn't any effective way of really saying what you can use and what you can't use: one person's external resource is an integral part of another person's system. I mean, it's a very slippery slope to try to sort all that out. WordNet is a prime example. For some people that is an external resource, but for other people it's just part of their system.
And so we didn't try; we just said you can use anything that you want, but you have to return a document in the collection that supports that answer. And so in fact there were a fair number of systems that went out to the Web, answered the question, and came back and projected the answer back into the document collection. Actually Microsoft's system was one of those that went out to the Web, and very reasonably, right -- Microsoft's in particular was a "why try to do fancy things when you can just use large numbers, go find the answer, and project it back in." And you could tell which systems did that, because by and large the projection didn't work well. [laughter]
But we did allow it. And for our definition questions, the targets that we selected we actually did try to choose such that we didn't think they could be answered simply by going to biography.com or Wikipedia, et cetera, scraping that, and just returning it. That's actually getting harder now because the resources on the Web are getting more and more complete.
[applause]