>>: Good morning, everyone. It is my great pleasure to welcome Kristen
Grauman here to give us a talk about some of her recent work. Kristen
got her Ph.D. at MIT with Trevor Darrell, and she's also one of the
Microsoft Research New Faculty Fellows. And she's now a professor at the
University of Texas at Austin. Good morning, Kristen.
>> Kristen Grauman:
Thank you.
Good morning, everybody.
So today I wanted to talk to you in some detail about a couple of related
projects that we're doing in object recognition. And this is work done with
Sudheendra Vijayanarasimhan, Prateek Jain, and Sung Ju Hwang.
Please, I'd welcome any questions that you might have as I go along.
So if I show this big picture, what I'm trying to convey here is the richness
of the visual world that we have. I'm interested in recognizing, naming,
re-identifying different objects, activities, people, and so on. And to do so
means we've got a really rich space to work with, quite complex and very
large in scale.
So early models for object recognition turned to more handcrafted
techniques, trying to really mold them to a certain task and so on. Later on,
people started turning to the power of statistical learning methods instead.
So there what we do is shift the problem into one where, if you get some kind
of labeled data, images, videos, then you can train a model, say some
classifier, to distinguish this category from all others.
But what this picture is showing is this big kind of divide between the
vastness and the complexity of this visual space and what we're trying to do
here with the learning algorithms. It's meant to be a cartoon, right? What we
really need to be focusing on now, if we're going to take this path, is this
interaction between the data itself, the learning algorithms we're using, and
the way any annotator is providing some kind of supervision.
So focusing on this part, a couple of issues that I think we need to be concerned
about are, one, that as much as we want to fit the standard
point-and-label mode of supervised learning algorithms, it's not always the
most natural fit for visual data, right? Meaning that, if you try to cram
your visual space into nice vectors with labels, we can surely lose a lot of
information. So necessarily trying to target that space can be too
restrictive.
Secondly, we have the high expense of human annotation effort. So we're
thinking of our annotators here trying to convey everything they can
immediately know about any of these instances to a classifier in the
form of labels. This is expensive, and right now the channel is quite narrow
when they do this.
So I'll talk about our approach to trying to address these possible
limitations, and it has two components. So, one, we're going to look at not
just thinking of this as a one-pass thing where we learn models, annotators
give us labels, and then we fix the models. Instead, we go back through our
models and let them self-improve. Let them look at things that are unclear,
uncertain, and let them actively be refined by asking annotators the most
useful questions.
Secondly, I'm going to look at the flip side, not just going in the direction
from the model data to the annotator, but looking here at this connection
between what the annotators are telling us about visual data and what they're
not telling us but we could get if we listen more carefully. That's a little
vague right now, but I'll get to what I mean in the second half of the talk.
First, let's look at this cost-sensitive active learning challenge. Let me
tell you the standard protocol today that many of you, I'm sure, are
very familiar with for how you might train an object recognition system. You've
got an annotator. They've prepared labeled data. Let's say you're learning
about cows and trees. So you could then come up with some classifier using
your favorite algorithm, and now you're done. You have the model, and you can
take different images and categorize them, or run some kind of detection.
So this is a one-shot kind of pipeline. It's fairly rigid, and it assumes
both that, usually, the more data we put in here, the better, and also that the
annotator knew which data to pick to put in here so that we have a good
representation and, in effect, a good model. So instead of this kind of standard
pipeline, what we're interested in is making this a continuous active learning
process. So the active learner in the human classroom is the one who's asking
questions about what's most unclear and thereby clarifying or more quickly
coming to understand the problem.
So in the same way, we want to have a system where, instead of just this
one-pass kind of effort, the current category models that we have can survey
unlabeled data, partially labeled data, and determine what's most unclear to
the system and come back to the annotator with a very well chosen question. So
the idea is you're not just learning in one-shot pipeline, and you're also
saving yourself a lot of effort if you do it well because, if you're paying
this annotator, you only want to put forth the things that are going to most
change the way you recognize objects. So this is the active selection task.
Okay. So when we start trying to bring this idea into the object recognition
challenge, we first have to think about traditional active learning algorithms.
Here you'd have, say, some small set of positively labeled examples, negative
labels, so cows and non-cows, and some current model. And then the idea is, if
we survey all the unlabeled data, we can ask about one of those points that we
want the label for next. So we won't just accept whatever annotation the
annotator gives us at random passively. Instead, we might pick something
that's going to have the most influence on the model.
And given whatever that label turned out to be, we could retrain the model --
say our decision boundary changed based on the new information. That's a
standard kind of formula for an active learner. The problem is it doesn't
translate so easily to learning about visual object categories, and there are a
few reasons why it doesn't translate.
So one is the assumption that we can just say yes or no about a point. So for
some vector description, we say yes or no, and that tells us about the object.
But, in fact, there are much richer annotations that we can provide to the
system. So we could say yes or no, there's a phone or there's not a phone in
these images, but the annotator could also go so far as to outline the phone
within the image. That's an even more informative annotation, or, going even
further, we could think about annotating all the parts of the object and so on.
Okay. So notice that here we have a trade-off between the kind of level of the
annotation, meaning how unambiguous the information is, and the effort that's
involved to get there, right? So this is the most clean and perfect
annotation, but it costs more in human time. This is really fast to provide,
but it's not so specific to the algorithm because we still don't know which
pixels actually belong to the phone. So we want to actively manage this kind
of cost and information trade-off.
Two other things to notice here. These points -- or not points. Every image
itself possibly contains multiple labels. It's not just one object already
cropped out for us. So it's multi-label data. And the other thing to note, we
can't necessarily afford a system where, every time we get one label
back, we go and retrain and figure out what we need next.
In reality, I want to have multiple annotators working in parallel, so I can
make use of human resources even though I have more than one single annotator
working. So these are the kinds of challenges we need to address if we're
going to do cost-sensitive, active visual learning.
So let me tell you about our approach. I'm going to show you a
decision-theoretic selection function that looks at the data and weighs those
factors I just described: both which example looks most informative to label
next, and also how I should get it labeled. So we call this multi-level active
selection. The levels are the types of annotations that one could provide.
So along with an example, we'll also say what kind of annotation we want, and
we'll do this while balancing the predicted effort of the annotation.
Different things -- more complex images -- require more time or money from the
annotator, and so we'll try to predict this trade-off explicitly. And the idea
here is we expect that the most wise use of human resources is going to require
a mixture of different label strengths. So we'll have to continually choose
among these different levels of annotation.
So there's my cartoon of what I just described. You can imagine the system
that's looking at the data, as you see here, and making these kind of
measurements about the predicted information and the predicted effort, and, of
course, we want to go to the sample next that has the largest gain in
information relative to the effort. So, you know, maybe this image looks quite
complicated. So we could predict that it's going to be expensive to get
annotated, but it's not worth it because it's not going to change our models
too much. Whereas this one over here, it's still going to cost some annotation
because this is fairly complex in the colors and textures, but we think it's
going to change the models based on what we've seen before. So we would choose
this one.
Okay. So let me talk -- I need to explain both the way we can capture these
labels at different granularities, or different levels, and then I'll talk
about how to formulate this function so we can predict how things are
going to change in the model.
>>: Question. How do you calculate the predicted cost for annotating?
>> Kristen Grauman: So just hold on. I'm going to get to that in a few slides in detail.
So multiple instance learning is how we're going to be able to encode these
labels at different granularities. As opposed to just having positives and
negatives labeled at the instance level, we'll have bags that are labeled as
well as instances. So what this picture is showing is the multiple instance
learning setup where you have negative bags. If you have a negative bag, it's
a set of instances, and all of them are negative.
But if you have a positive bag -- here these ellipses in green -- you don't
know the individual labels of the points inside. You just know that one or
more is positive, and you're guaranteed that, and you might have, accompanying
those, some actual points that are labeled as instances. So notice that there's
a nice relationship here between the ambiguity of the label and the different
levels of annotation we want to encode.
So think of images as bags of regions; you could do some segmentation
automatically and come up with different blobs. If someone says, no,
there's no apple in this image, that's a negative bag. All those blobs aren't
apples. But if someone says, here's an image that does have an apple, you know
there's one blob or more that has the apple. You don't know which ones. So
this is how multiple instance learning fits the learning framework and
allows us to provide either coarse labels at the image level or finer instance
labels where we've actually labeled, say, some region.
I'm describing this in the binary two-label case, but, in fact, there are
multi-label variants of multi-instance learning. We're going to think of
it more in the two-label case just for simplicity right now.
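To make the setup concrete, here is a minimal sketch of how such bag-labeled data could be represented in code. The class and field names are illustrative assumptions, not the actual implementation from this work.

```python
# Minimal sketch of the multiple-instance setup described above (illustrative only).
# An image is a "bag" of region descriptors; a coarse image-level tag gives a bag
# label, while an outlined region gives an instance-level label.
from dataclasses import dataclass, field
from typing import List, Optional

import numpy as np

@dataclass
class Bag:
    regions: np.ndarray              # (num_regions, feature_dim) descriptors from segmentation
    bag_label: Optional[int] = None  # +1: at least one region is positive; -1: all negative
    instance_labels: List[Optional[int]] = field(default_factory=list)  # per-region labels, if any

# A negatively tagged image: every blob is a negative instance.
negative_bag = Bag(regions=np.random.rand(6, 128), bag_label=-1)

# A positively tagged image: some unknown region contains the object,
# and here one region has additionally been outlined by the annotator.
positive_bag = Bag(regions=np.random.rand(8, 128),
                   bag_label=+1,
                   instance_labels=[None] * 7 + [+1])
```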
>>: What's the purpose of bagging together negatives?
>> Kristen Grauman: What's the purpose? So they're bagged together here, but,
right, in terms of implementation, you'd unpack that bag, and you'd have a set
of negative instances. So they would exactly be label constraints on each of
those individual points.
>>: [inaudible] dependencies among the objects within the bag? And leverage that
information in doing the policy? Or were they actually independent instances
in the bag?
>> Kristen Grauman: So they're considered -- so the Dietterich work is the
first instantiation of the MIL setting, and it wasn't for the image scenario.
>>: Have they focused on the idea that there are rich dependencies among the
objects within a bag?
>> Kristen Grauman: No, no. So it's just the setting of here are all these
things, and we've captured thus far that something in here is positive, but we
don't know which one or how many.
>>: I mean, for example, in a bag of negative things, a rich bag -- are there
dependencies in the existence of things that co-occur in bags? Which I imagine
would be spatial.
>> Kristen Grauman: Right, yes. Definitely. And this is something that we
can try to exploit, for example, when we're trying to evaluate the information
gain and we know there are relationships -- spatial ones in our case with the
images, or objects that tend to co-occur. So if I know it has a positive label
for apple, I'm more inclined to think the unknown positive label for pear or
something might also be true.
>>: In MIL, does the -- is there opportunity to do really rich work in that
space to really come up with a really -- some meaningful learning approaches
based on the positive learning [inaudible] that was done in the past?
>> Kristen Grauman: I've seen some work with multi-instance, multiple-instance
learning, MMIL, looking at some sort of label co-occurrence setting and
exploiting that in this kind of ambiguous label setting, yeah.
So we have a way now to represent labels at two granularities in this case, and
now what we want to do is make our active selection function choose among
different questions that you can pose given this kind of label space. So
there are three kinds of questions we're interested in. One, we want to be able
to take an image and say, should I ask next for just a label on this region?
That corresponds to labeling an instance here, and the black dot represents an
unlabeled instance. Second kind of question, we could just say, give me an
image-level tag. So name some object that's here, and that would correspond to
giving an image-level or bag-level label.
And finally, possibly the most expensive annotation we could ask for is to say,
here's the image. Segment all the objects and name them. And that would
correspond, in a multi-label version, to taking this positively labeled bag and
saying, what are all the instances within it? And actually drawing those
boundaries.
So what we want to do then -- and this is kind of the important novel aspect of
considering multi-level active learning -- is to trade off these different
questions at the same time as you look at all your points. So let me tell you
briefly how we do this. We're using a value-of-information-based selection
function, and this function is going to look at some candidate annotation and
decide how much is this going to improve my situation versus how much it's
going to cost to get annotated. So z carries with it both the example and the
question you're considering asking about the example.
And what we'll look at to measure VOI is to say, what's the total cost, given my
labeled and unlabeled data under the current data set, and how is that total
cost going to change once I move this z into the labeled set? So this denotes
that there's some label associated with it, and we can measure how the cost
changes. And if we're having a high VOI, then we're reducing this cost.
So breaking out what that cost is composed of: it's both how good is the
model -- how low is our misclassification risk -- and how is that balanced by
the cost of the annotation? So if we look at the before and after as shown
here, that amounts to looking at the risk on the labeled data and the unlabeled
data. So are you satisfying the labels on the labeled data? Is the
unlabeled data clustered well with respect to your decision boundary, for
example? And how does that risk change once this guy gets annotated? And then
penalize that by some cost of getting the annotation.
So it's the risk under the current model, the risk once we add z into the
labeled set, and then the cost of getting z. So you can imagine applying this
function to all your unlabeled data and coming back with the question that
looks most useful to ask next.
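Written out in my own notation (the paper's symbols may differ), the selection criterion just described is roughly:

```latex
\mathrm{VOI}(z) \;=\; \bigl[\mathrm{Risk}(\mathcal{X}_L) + \mathrm{Risk}(\mathcal{X}_U)\bigr]
\;-\; \bigl[\mathrm{Risk}(\mathcal{X}_L \cup \{z\}) + \mathrm{Risk}(\mathcal{X}_U \setminus \{z\})\bigr]
\;-\; \mathrm{Cost}(z)
```

so a high-VOI annotation is one whose predicted reduction in total misclassification risk outweighs its predicted annotation cost.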
So filling in those parts. The first is getting the risk under the current
classifier. That we can do in straightforward ways: is the misclassification
risk low for the labeled data and unlabeled data? When we look at trying to
predict the risk after we add z into the labeled set -- of course, we're asking
for something we don't know because, if we had a label for z, we would not be
wasting time figuring out whether we need to get the label for z. So what we
have to do is estimate how things are going to change based on the predictions
under the current model.
To do that, we'll look at the expected value of this risk. You can imagine
this as a sum over all the possible labels that this z could have: if I insert
that z into the labeled set, labeled as such, then how does the risk change?
And then weight that by the probability it would have that label given your
current model.
So you can compute this expectation easily for the first two kinds of
annotation I mentioned. One, if you're talking about labeling a region, then
in the binary case there are two labels it could have, so two terms in the sum.
If you're talking about an unlabeled bag, same thing. However many classes
there are, that's how many terms you have to evaluate in order to compute this
expected value.
The tricky case, and the one that we have to address to do this full
combination of questions, is the third one, where we're actually asking how
valuable is it going to be if I get someone to segment all the regions in here
and name them? Because now, if you think about the expected value over all
labelings, that's however many combinations of labels we can have. So it's
exponential in the number of these regions.
So we don't want to compute that. We can't. But what we've shown is that we
can take a good sampling approach to compute this expected value. So
essentially, we're sampling likely label configurations for this bag of
regions, and using those samples to fill in this expected risk.
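As a rough sketch of that computation (my notation and placeholder helper functions, not the actual code), the expectation is a short sum for a single region or an image-level tag, and a Monte Carlo average over sampled labelings for the full-segmentation question:

```python
# Rough sketch of the expected-risk estimate described above (illustrative only).
# For a single region or an image-level tag the expectation is a short sum over
# possible labels; for "segment and name everything" we sample label
# configurations for the bag instead of enumerating all of them.
import numpy as np

def expected_risk_single(candidate, labels, label_probs, risk_after_adding):
    """Sum over labels l of P(l | current model) * Risk(labeled set + (candidate, l))."""
    return sum(p * risk_after_adding(candidate, l)
               for l, p in zip(labels, label_probs))

def expected_risk_bag(bag_regions, sample_labeling, risk_after_adding_all, num_samples=25):
    """Monte Carlo version for a full-segmentation request: draw likely joint
    labelings of the bag (e.g., via Gibbs sampling from the current model)
    and average the resulting risk."""
    estimates = []
    for _ in range(num_samples):
        labeling = sample_labeling(bag_regions)           # one plausible joint labeling
        estimates.append(risk_after_adding_all(bag_regions, labeling))
    return float(np.mean(estimates))
```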
>>: [inaudible] how good it is?
>> Kristen Grauman: How good the expected value is?
>>: How good the sampling version is versus the 2^N version?
>> Kristen Grauman: If we did it exhaustively?
>>: Yes.
>> Kristen Grauman: No, we haven't done that.
>>: You also could take advantage of some of the process work -- Andreas'
process work. Andreas will be here tomorrow, I guess. And make assumptions of
submodularity so that a greedy approach to decomposing your image into a set
of n features would come within a close bound of doing an exhaustive search.
[inaudible]. That's in line with Andreas' work. So it would be interesting to
show whether the assumptions of your domain are indeed submodular and move
beyond the Gibbs sampling or something like that. Interesting.
>> Kristen Grauman: That's definitely possible, and that would be interesting.
Computationally, the bottom line for us is that doing this sampling over these
bags is better than what we'd have to do exhaustively, but it is somewhat
expensive. What we do to try to counter that is use incrementally updatable
classifiers, because when you do this sampling, you're retraining the
classifier with candidate labels. So we'll train that incrementally so you're
not as badly off.
>>: People should understand how you get the information -- you had talked
about a value [inaudible] as well. For every imagined configuration, you're
retraining the classifier and then using that to compute updated risk. So
it's quite a cycle.
>> Kristen Grauman: Right.
>>: What's the benefit of -- because you had three scenarios. The first
scenario was saying for a specific segment this was a cat or not a cat. And
the last one was label all the segments and maybe you'll get to this, but it
seems like the first one is kind of a subset of the third one. Is there a
reason for not just breaking the third one into a whole bunch of number ones?
>> Kristen Grauman: Well, so, yeah, what we're expecting is that by computing
the expected value of information over that entire image -- because there is
information in the fact that these regions appeared in the image together --
we might find that actually going to the trouble of bringing up that image and
asking for all the labels is worthwhile, albeit more expensive, versus pulling
out an individual region, which in some cases can be the most informative
question to ask individually.
>>: Because it's the contextual information that's going to give you the gain
on the third?
>> Kristen Grauman: Yeah, and even if you're not reasoning about the context
between the regions, it's not random that these things are together. So this
packet of regions may have some information that, once you evaluate it in this
way, you can predict and then choose that instead. Yeah. And when I get a
little bit later talking about budgeted learning, you can imagine picking a
bunch of instances that fit within some budget you have and comparing that to a
single image label like this, yeah.
Okay. So then this third piece, now that we can predict how good things
look in terms of the information, or the reduction in misclassification risk,
is to say how expensive is this annotation going to be? So we need to predict
the cost. Image data, again, I think is somewhat special and challenging
because it's not necessarily self-evident how expensive an image is to get
labeled, right? I mean, at a very coarse level, I can pay someone on Mechanical
Turk -- do people know what Mechanical Turk is? An online crowdsourcing
interface. So I could pay someone a penny to do an image-level tag, or maybe a
lot less than that, but if I want them to segment it, I have to pay 10 cents,
let's say. But it's even more than that.
I could have a very simple image, and someone's making a lot of money off me
because it only took them a second whereas, if I have a very complex image with
lots of objects, they'll be sad, and they'll spend a lot more time annotating
that for 10 cents. So our goal here was to say, well, how can we predict in an
example-specific way how expensive the annotation task is? Because then we can
pay the right amount, and we'll make the right VOI selection.
The idea is, if you glance at these, who wants to annotate this one versus this
one, where you have to do task number 3, which is full segmentation? You want
to do this one, right? Because it's easier. Do you know that because you
counted the objects? Maybe. But we're suggesting that you don't necessarily
have to know what's in the image or count the objects in order to predict the
expense of the annotation.
So here's the cartoon of why this might be true, right? There's just a notion
of complexity based on how many colors are present, how many textures are
there, how often do they change in the image? And I'm betting you still
prefer this one, and it's not that you know what this is or how many of them
there are. So that was our intuition. And you could learn a function -- you
give it an image, and it says how long it takes to annotate it.
And we learn this function in a supervised way. This cost doesn't have to be
annotation time. It could be things like what kind of expertise do I need to
get this job done -- thinking outside of this particular object recognition
task. It could be the resolution of the data in these images, how long the
video clips are, and so on. There could be some cost, but in many cases for us
it has to be predicted at the example level.
So to estimate this predicted cost, we look at some features we think are going
to be indicative of cost, and I've already alluded to what they might be.
Things like how dense are the edges in regions at different scales? How quickly
do colors change as I look in local windows of the image? Record all of these
as features, and then go online and collect annotation data. So we're watching
Mechanical Turk annotators. In an interface like this, they're drawing
polygons. We're timing them. We've done this redundantly over many examples
and many annotators. Now we have our training data.
So we're going to learn basically a function where you give it features about
complexity, as I'm showing here, and it will give you back a time. And this is
a function, once learned, that we can plug into that value of information and
get an example-specific cost that we trade off.
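A simplified sketch of that kind of cost predictor is below; the two complexity features are crude stand-ins for the edge-density and color-change features described, and the choice of regressor is arbitrary.

```python
# Illustrative sketch of an image-specific annotation-cost predictor: simple
# complexity features regressed against measured annotation times.
import numpy as np
from sklearn.linear_model import Ridge

def complexity_features(image):
    """image: HxWx3 float array in [0, 1]."""
    gray = image.mean(axis=2)
    gy, gx = np.gradient(gray)
    edge_density = np.mean(np.hypot(gx, gy))                      # how dense the edges are
    color_variation = np.mean(np.std(image.reshape(-1, 3), axis=0))  # crude global color-variation proxy
    return np.array([edge_density, color_variation])

def train_cost_model(images, times):
    """images: list of arrays; times: annotation seconds measured from annotators."""
    X = np.stack([complexity_features(im) for im in images])
    return Ridge(alpha=1.0).fit(X, np.asarray(times))

# Usage sketch:
# cost_model = train_cost_model(train_images, train_times)
# predicted_seconds = cost_model.predict(complexity_features(new_image)[None, :])[0]
```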
>>: [inaudible] so you don't know if the people are 100 percent focused on the
task or doing other things?
>> Kristen Grauman: Yeah, right, so there is that issue. Right. So, for
example, they could start annotating, go for coffee, and we're sitting here
running the timer: oh, this is a really expensive example. But it's not quite
so bad because actually the timer is per polygon, unless they start clicking
half the polygon and stop. So it's pretty good. And we have 50 annotators
doing every one. We look for consistency and things like that. Yeah?
>>: [inaudible] interesting down the road. I'm sure you've thought about
this. Getting some three-level feedback on how much fun it was to annotate?
>> Kristen Grauman: Yeah, that's true.
>>: Even if it takes more time, if it's enjoyable versus tedious.
>> Kristen Grauman: Exactly.
>>: You can pay them less money if it's kind of fun.
>> Kristen Grauman: Exactly, yeah. And we already have the leg up in computer
vision in general. The images are fun tasks for annotators versus I don't know
what else. Right. But if you give them a spray can to do this painting, it's
a lot more fun than if you give them a polygon tool. Maybe we could pay less.
>>: Than if you're asking people to listen to voicemail messages and annotate
the urgency, the longer the message, the longer it goes on and on, we have time
right there, but you can imagine that's not as fun as playing with images and
shapes.
>> Kristen Grauman: Definitely, yeah. So this cost function could be even
more fine-grained. If it's not just time, it's also enjoyment. Right.
Definitely.
>>: Also look at how long it takes people to finish the HITs in general.
There are some HITs up there that take weeks to finish and other ones you
could throw up there and be done in an hour or two. So whatever HITs aren't
being finished, they're kind of --
>> Kristen Grauman: They're not fun.
>>: Or even getting completed.
>>: Yeah, just even getting completed is a good measure.
>>: And it's free.
>>: And they do it. So you just keep upping the cost. Everything's a penny
initially, and then after a day, everything goes up. This is big data.
>> Kristen Grauman: Although I also understand they watch the job posters, and
they have favorites among those. You have to balance your reputation of being
a good job provider.
Okay. So we have this function now so that we can rate all our data and decide
what to annotate next. So we really would compute this function for all these
examples and then come back and ask this targeted question. The things to
notice in contrast to traditional active learning: one, this is multi-question,
because I'm not just saying, what is this? I'm saying, here's an image, I want
a full segmentation on it. And, two, we're predicting cost.
So by accounting for variable-complexity data, we're making wise choices and
good use of our money based on how expensive we think things are going to be.
And that's predicted.
Okay. So let me show you a couple of quick results. Here we're looking at
the MSRC data set from MSR Cambridge, and there are 21 different object
categories. These are multi-label images. We have grass and sheep and bikes
and roads and so on. And then let's see some examples of the cost we're
predicting. So here are all these data points. This is the actual time that
annotators took, and then, when we're doing leave-one-out validation, the
predicted time that we would produce. And so there's certainly a correlation
here.
And if we look along this axis -- from the easy things that we predict to be
easy, to the medium-effort things we predict to be medium, to the more
complex -- you can kind of see how those features are being used to make these
predictions. So these are the ones that we're getting almost dead on.
Now here's one with a big error. Here's a nice, wonderful, close-up shot of a
patch of grass on the ground. It's an unlikely photo. But, of course, it
makes us give a bad prediction: this very high-res, high-texture patch of grass
makes us think it's more expensive and complicated than it is. Here are a few
more examples where we think these three images in red would have about equal
and moderate cost.
Okay. So now if we put all these things together and we set that active loop
going, then as we spend more money on annotation, we get models that are
more and more accurate. So these are learning curves; ours is shown in blue
here. Now, if you were just to use random passive learning, which really is
kind of what many of us have been doing to train object recognition systems,
then you learn a lot more slowly, with less wise use of the resources.
If you do traditional active learning, you get this red curve, and that
means you're allowed to look at points and you're allowed to ask for labels,
but you're not allowed to trade off these different kinds of questions. And if
you do our method, you get this third curve. That's what we want: to be
learning accurate models very quickly.
Okay. It's fun to kind of pick apart what the system asked us. I mean, we had
to be careful. This is our system actively asking questions. These are
discriminative models. There's no need that they actually be interpretable,
but we can go back and see what questions were asked.
So here, if we train the system with just two image-level tags per category --
here are all the ones we've already seen with labels -- here are the first
four questions the system asks us. For all four of these, it's asking for an
image-level tag, to name some objects. And if we look back at the existing
training data, here's what we knew about buildings, and here's what we're
asking for next. So this is somewhat intuitive. It's really going to change
the building class if we find out about that. And similarly for sky: these are
different-looking skies.
Now, if we go further down in our chain of questions -- so here down to the
12th question -- notice when the system starts picking out regions that it
wants specific labels for. In some cases this happens where the different
regions are less obviously separable. Even if we knew about something in this
image -- in fact, we had a tag at the image level for this guy -- now we're
coming back to get more specific: well, what actually is this region? And the
one question here in the first 12 that's the really expensive one, where we
asked for full segmentation and labels, is this one here that is a bit more
complex and worth the effort, according to the system.
All right. So thus far I've assumed that we're doing this loop one at a time.
We go to one annotator. We get the answer. We retrain, and then we continue.
Now, the problem -- and we've already talked about Mechanical Turk -- what if,
instead of that single annotator, I wanted to have a distributed set of many
annotators working on this task? They're going to be sitting and waiting while
we do Gibbs sampling and come back with the next question we should ask, if we
really are running this online, live.
And, in fact, what would be great is if we could select instead a bunch of
tasks that we can feed to these annotators.
However, the most straightforward thing we could do is unlikely to be
successful, and that is if I just take that VOI function I had, take the top
K-rated things, and have them labeled next. Why is that not likely to work?
Well, there can be a lot of information overlap in the things that were
highly rated according to VOI. Maybe they're similar examples, and as soon as
you got one, maybe you just wasted money on getting the rest.
There also can be interactions where, if I ask for a large set of examples
to be labeled at once, any portion of those labels can dramatically change the
current model such that the other things I thought were interesting become
uninteresting. So you have this sort of change in the model that's hard to
predict when you're talking about asking for many labels at once so that you
can do these tasks in parallel.
So we have an approach that we call budgeted batch active selection to try to
address this. And the idea is to not just use the current model over here when
you're looking at your unlabeled data, but to also set some kind of budget. So
I'm willing to spend this much money as you get me the next batch of
annotations. And then, when doing the active selection, your job is now not to
pick the single best, most informative thing, but instead to use that budget
and spread out your annotations to select the most informative batch. Then you
bring those back, the annotators can each do the jobs they want, and you can
continue iterating, but you're doing this a batch at a time.
So to address the challenge I mentioned with the naive solution of rating
everything and taking the top chunk that would fit in the budget, we need some
farsighted notion in the selection function. So to do this, here I have kind
of a boiled-down sketch of what our ideal selection would be. We want to
select a set of points. The optimal set will be the one that maximizes the
predicted gain, which I'll define in a minute, for our classifier, such that
the labeling costs are within the budget that we were given.
And this predicted gain then is where we can have some farsightedness in the
selection, because we're not just going to look, according to the current
model, at which things seem most informative. We're going to simultaneously
predict which instances we want labeled as well as under which labeling they'll
most optimistically reduce the misclassification error and increase the margin
in our classifier.
So let me try to draw a picture here to show that. So let's say this is your
current labeled training set. So positives and negatives and then unlabeled is
in black. So we're going to do this predicted gain according to an SVM
objective for the classifier. So that contains both a term for the margin, you
want a wide margin, and a term for the misclassification risk. So you want to
satisfy your labels.
So what we're going to do is try to estimate -- try to select those examples
where the risk is reduced as much as possible while the margin is maintained as
wide as possible under the predicted labeling of that selected set. So let's
say that these were some three examples that fit in your budget. Then what we
will do is choose -- optimize over what possible labels they could have -- and
let's say this ended up being the candidate labeling. And then look at the
predicted gain from what we previously had in the old model to what we have in
this new model, again, in terms of the margin and in terms of misclassification
error.
Okay. So the important thing to note here is that there was a change to the
model, and while making the selection, we're accounting for that change, and
that will be what's farsighted in the batch choice. So I'll -- hmm.
>>:
[inaudible] possible?
>> Kristen Grauman: Right. No, we don't. We don't want to enumerate all
possible labelings. That's what we want to optimize over, right? So I'll try
to kind of sketch out how we can do this effectively to make the selection.
So we're doing this for the SVM objective. I think this is probably familiar
to everyone here: we have this margin term, then we have this slack, and
then here are our labeling constraints.
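For reference, the standard soft-margin SVM objective being referred to is:

```latex
\min_{w,\,b,\,\xi}\;\; \tfrac{1}{2}\lVert w \rVert^2 \;+\; C \sum_i \xi_i
\qquad \text{s.t.}\quad y_i\,(w^\top x_i + b) \;\ge\; 1 - \xi_i,\;\; \xi_i \ge 0 \;\;\forall i.
```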
So now let me define this intermediate function. This will be the cost of a
snapshot of a model and some data. So here's the model, meaning a classifier,
w_B, and here's a set of data, set B.
And we're defining this function to be able to evaluate the margin term on one
set of data -- whatever data was used to train it -- and then evaluate the
misclassification errors on some other set B. So we're breaking it out so we
have this ability to have the model, but then the model applied to the rest of
the data to evaluate this term. So that's this function g. And here are the
two parts, the margin and the misclassification errors.
So now, when we want to do this farsighted selection, we want to optimize both
which set of examples to take and under which labels we think they'll have
the most beneficial impact. So we're going to minimize this difference. And
what we have here is the cost after adding that data into the labeled set --
so the situation under the new model -- and then here the cost before we added
it in. And I know there are a lot of symbols on this, but the high-level part
to capture is this comparison: here we're evaluating things once we've added
these examples into the set, and here we're looking at the old model and how
well it would do on those same examples.
And then this final constraint is saying that the costs associated with the
ones you select have to add up to no more than T.
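In rough notation (mine, not necessarily the paper's), with g(w; D) denoting the margin term of a trained classifier w plus its misclassification penalty evaluated on a set D, the farsighted selection chooses a set S of unlabeled examples and candidate labels to minimize

```latex
g\bigl(w_{L \cup S};\, \mathcal{X}_U \setminus S\bigr) \;-\; g\bigl(w_{L};\, S\bigr)
\qquad \text{s.t.}\quad \sum_{z \in S} \mathrm{cost}(z) \;\le\; T,
```

where w_{L∪S} is the classifier retrained with S added under its candidate labeling and w_L is the current classifier.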
Okay. So finally then, this is what we want. How do we select for this?
Notice that our optimization requires both selecting the examples and the
labels under which they're going to have this most benefit. So what we do is
introduce an indicator vector, a vector q. Say you have u unlabeled points.
This is a 2u-length vector where you put one entry for the first example with
its label being positive -- that's one bit in this indicator -- and you do the
same thing for considering that example as negative. Do that for all your
examples in the unlabeled set.
Now, this q is our selection indicator for both labels and points, and this is
what we're going to optimize over at the same time that we optimize the
classifier parameters. So finally, we can write down this entire objective,
where we want to select both the classifier and the selected points. And the
important things to note here: this is what we get if we simplify that term I
just showed you before -- the cost under the new model minus the cost under the
old model -- and we'll do this subject to a list of constraints that say all
the labels should be satisfied on the labeled points.
But notice that this penalty for misclassification only affects those examples
that we're including in the selection. So it's not all unlabeled data; it's
the ones that we're thinking about adding to the labeled set.
There's also a constraint for the budget: the costs of the ones you select have
to add up to less than T. And this constraint is what's going to say that, in
that big indicator, the paired instances have to sum to at most 1, so we can't
say unlabeled example number 2 is both positive and negative. We have to say
one or the other. And finally, we would like an integer indicator vector.
So this is finally the whole task that gets to the picture I showed on the
earlier slide with the classifier changing. This is an integer programming
problem and NP-hard. But we show that, by relaxing this indicator vector, we
can perform alternating minimization between a convex QP and a convex LP. So
we'll iterate between solving for this, which is like a normal SVM objective,
and solving for this. And between these two, we'll converge to a local optimum
for these selections.
>>: [inaudible] I believe there is a lot of work on heuristic methods to
compute the decision criteria for a set. I'm curious if you've thought about
how even an exact solution to this would relate to the full model where you're
not going SVM, you're sticking with just a set of classifiers, not using the
heuristic of the margin and the two objectives you're mixing here, and look
more at -- try to basically reason about what's an optimal bag of things to ask
about, given the fact that with every answer that boundary is going to be
changing.
>> Kristen Grauman: Right.
>>: And how does this approximation -- this heuristic approximation to the
actual thing you're trying to do -- relate to the thing you're trying to do?
>> Kristen Grauman: Well, I would remove one level of heuristic there, right?
We have the SVM objective, and that's what we care about. And we want to say,
what set of points, under which labels, will most optimistically reduce that
objective -- that's what we should take.
>>: But the SVM objective function, the boundary -- you're assuming your prior
hyperplane is moving with every single answer.
>> Kristen Grauman: Right. And that's why, when we're choosing this, we're
looking at all the predicted answers simultaneously, right? Like how
everything changes once you add that selected batch of examples under the
budget. So that is the farsighted part -- I know this is not the best place to
explain this -- we don't just have this evaluation according to the current
model, which is the key, right? We're saying, how does the objective reduce
when we have these new things introduced into the model and the model itself
has changed?
>>: I guess the question would be -- going back to the SVM question now --
should that really be your objective function?
>> Kristen Grauman: The SVM?
>>: Yeah. When you [inaudible].
>> Kristen Grauman: The SVM, empirically, yes, because in object recognition,
probably the very best results are often using this good and flexible
discriminative classifier, where we can plug in fancy kernels as we like and so
on. So I do think, as a tool, as a classifier, the support vector machine is
worthwhile to study for this active learning case.
And then what we're doing is being able to say, if you have a budget to spend
and variable cost data, how can you make a farsighted choice under the SVM
objective?
>>: How do you handle the fact that, for the labels you don't know, they could
be one label or the other? Do you just say which label is more probable and
assign it to that? Or do you actually do a mix?
>> Kristen Grauman: Right. So it is -- so in the indicator that was doubled
up, you have both options --
>>: You put it in --
>> Kristen Grauman: But then you have the constraint that you should be
choosing one or the other. And, in fact, we do tend to get -- even though this
is a relaxed form, we tend to get integer responses, and I think this is
because, if you enter a slack penalty for both -- like if you have non-zero
values on both, for example -- it's going to hurt your objective, whereas, if
you choose the one that's more aligned with the real answer, you get a lower
slack, a lower penalty.
>>: At the same time, it might get a little bit easier because it kind of
jumps to one side or the other, and it might have a hard time jumping back? Is
that -- I mean, it decides upon the answer pretty quickly.
>> Kristen Grauman: I mean, right. So in practice this will run for about 10
or 12 iterations, and when we initialize this -- because we have to initialize
the indicator itself -- we initialize with the myopic greedy batch selection,
which is the one I was saying is not accounting for the joint information or
the farsighted selection. So we start from there, and then we iterate to
optimize this q selection.
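A minimal sketch of that myopic greedy initialization under the stated budget might look like this; gain and cost are whatever the current model predicts, and nothing here is farsighted yet:

```python
# Myopic greedy batch selection under a budget (illustrative initialization only):
# rank unlabeled candidates by predicted gain per unit cost under the current
# model and fill the budget. The farsighted optimization then refines this.
def greedy_batch(candidates, gain, cost, budget):
    """candidates: iterable of items; gain/cost: callables evaluated under the current model."""
    ranked = sorted(candidates, key=lambda z: gain(z) / max(cost(z), 1e-9), reverse=True)
    batch, spent = [], 0.0
    for z in ranked:
        if spent + cost(z) <= budget:
            batch.append(z)
            spent += cost(z)
    return batch
```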
>>: How do you know that it will converge? Couldn't it be that those two parts
are actually a trade-off between one and the other?
>> Kristen Grauman: Well, so the constraints -- and I don't have all the
equations right now here. But the constraints on the one part are independent
for the variables that we're optimizing in the other. So in each step, we're
improving, and then we're alternating between these two improvements.
So what this gives us is a way to have an active learning system that's
really working with live annotators, where we can farm out these jobs at
once, and they're well-chosen jobs, and we can do all of this under a very
realistic form of budget: if we actually have money or time limits, then
we can give that in the formulation of the task.
And there's actually a very limited amount of work on batch selection in
general for active learning, and the important difference between our work and
any existing batch-mode active selection is the fact that it's working on a
budget with variable costs and that it's farsighted, by predicting model
changes when it makes the choice.
So I want to focus on this one result with video. Here we have all these
video clips from the Hollywood movies data set. This is a benchmark for action
recognition, with realistic videos like the ones you see here taken from
Hollywood movies, and your task is to take a video and decide whether or not it
contains a certain action. So it's actually sort of a retrieval task based on
these categories. And here we're going to measure how expensive an annotation
is based on how long the video is, because someone has to watch it to decide if
that action ever occurred. So in this case, we don't even have to predict
cost. It's just a nice function of the data items themselves.
So here is an example selection that you'd get back, according to three
methods. On the left, passive selection: you had a budget, and you spent it
all on one video. In the middle, this is the myopic batch selection, where you
use the budget and take the top-ranked examples according to your current
model. And here is our selection, where we use that same amount of time to
select a number of videos that we think are going to most improve, in this
case, the stand-up classifier, for people standing up.
So what you see kind of graphically here is that you're getting more
interesting examples that were well chosen. These happened to be shorter ones
that still all fit the budget but that we think look informative, versus even
this quite strong baseline of myopic batch selection, where you're just
choosing a couple that fit in the batch but didn't look at how the model
changed.
>>: [inaudible] Out of curiosity, I don't know the details -- how are you
featurizing the space so that the system gets a good sense of where you'll be
in good condition to do stand-ups, for example? [inaudible].
>> Kristen Grauman: So what are the features?
>>: Just an example.
>> Kristen Grauman: We're using standard features for the video based on local
interest operators in space-time to find, like, Harris corners in 3D, and then
around those local interest points we take histograms of gradients and
histograms of optical flow. So you have a bunch of descriptors; you quantize
those, and you make a bag-of-words representation for the video.
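A rough sketch of that bag-of-words pipeline follows, with the descriptor extraction itself assumed to be available; the vocabulary size and other settings are placeholders rather than the values used in the talk.

```python
# Illustrative bag-of-words video representation: HoG/HoF descriptors around
# space-time interest points are quantized against a learned vocabulary and
# histogrammed per clip.
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_descriptors, k=200):
    """all_descriptors: (N, D) descriptors pooled over training clips."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_descriptors)

def video_bow(clip_descriptors, vocab):
    """clip_descriptors: (M, D) descriptors for one clip -> normalized k-bin histogram."""
    words = vocab.predict(clip_descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```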
>>: So when you say you take the length, you take the length of the original
video? But you could have used a temporal window and say, okay, I'm looking
for a sit down, stand up, right? And then you basically wouldn't have that
length as a cue.
>> Kristen Grauman: Yeah. So there's some of that here. You're saying you
could uniformly chunk up the video and then survey those. Yeah, that is
possible. There are some smarts in how these clips were generated by the data
set collectors. That was based on the scripts of the movies, where they're
seeing these actions. And that's why they do vary in length: given this
scene, they think the action of standing up might be happening. So the lengths
aren't arbitrary. They were guided by the data creation, yeah.
But you could -- you know, if you don't have any top-down knowledge about these
actions, you might do a uniform chunking, and then that cost factor goes away.
But we still have to choose a batch that's farsighted.
And empirically, as before, I can show you the same kind of trade-offs here.
That's what you're seeing: we learn more quickly. We've done this sort of
batch selection for object recognition as well, and here again we're looking
at image segmentation time. And I'm very excited about this kind of framework
just because it gives us this next step where we can make a system that's
live, online, looking at many annotators, and making good decisions about what
we should label.
>>: [inaudible] there was a failure case and why?
>> Kristen Grauman: Oh, is there a failure case? What's that doing there?
Yeah, so this is a failure case. You want your curves going up. All of these
are kind of jumbled. This is the dirty work gloves category. It's a bad
object. It's like white, textureless, and with any of these methods, the
cost-aware selection alone has a hard time doing this. [inaudible].
>>: [inaudible] the SIVAL.
>> Kristen Grauman: So SIVAL has 25 different object categories. It was
developed at Washington University in St. Louis. Basically, you get different
snapshots where the object is there on different backgrounds. So this is the
blue scrunge, a sponge, and it's occurring here. So it's a true classification
problem.
Okay.
So in this next bit -- and I think we need to wrap up.
>>: Well, we have the room for an hour and a half. So why don't you run for
another ten minutes or so.
So why don't you run for
>> Kristen Grauman: Okay. Perfect. So I wanted to tell you about this other
side of managing the human supervision and object recognition. In this, I'm
describing as listening more carefully. So we're used to listening in a
certain way to annotators, and that way is I give you an image, you tell me
object names, or you give me key words, I have some labels about those objects.
But we want to learn a bit more from what those kind of annotations might
provide.
So, for example, if you go to something like Flickr, and you have these human
tagged images, it's clear that we could use those tags and do some kind of
smarter detection. I already know that there's probably a dog in here, and
there's a sofa. So this is a nice way to use almost freely available, loosely
labeled data. And, in fact, there's been a number of different techniques over
the years showing how you can try and learn this association from the weak
annotations of tags, but all these techniques are looking at this noun and
region correspondence, right?
So trying to discover, if you give me tags on an image, which regions goes with
which names.
And instead what we're trying to do is think about what other implicit
information is there in those tags people are giving us. They're giving us a
list of key words, and they're not just necessarily giving us noun
associations. So we're assuming there's implicit information about which -about the objects, not just which ones are there. So now I have to do my
audience test. So here are two images that are tagged. Two people looked at
some images I'm hiding, right? These weren't the images. And they gave these
lists. So what do you know about the mug in both of these images if you just
look at these tags?
>>: [inaudible] on the left it's more prominent.
>> Kristen Grauman: It's more prominent on the left. Okay. And why is that?
>>: Because it's the first thing.
>> Kristen Grauman: Yeah, it's the first thing. If you're going to mention
mug first when you look at an image, probably it's pretty prominent. And what
that might mean for us in terms of detection is maybe it's central, maybe it's
large. So there are localization cues in how these tags are provided. And
here are the images. You're right. This guy is very prominent. Here's the
tiny mug over here. And they mention it twice again.
So this is the very intuition we wanted to capture and try to let it benefit
our object detectors. So it's things like the rank. So the mug is named
first. It's probably important and big. Here it's named later. But there's
also things like the context of these tags themselves. So the fact that you're
mentioning other physically small objects already gives you a sense of the
scope of the scene when you see that list. Whereas here when you're mentioning
physically larger objects, then you have a sense of the scope of that scene.
It may be further back.
Of course, we're not going to write down rules to try and get all of this, but
we can learn these connections from data. So the idea is to learn these
implicit localization cues to help detection, and let me give you the
high-level view of the algorithm. So during training, we have tagged images,
and we also have bounding boxes on the objects of interest. So that's our
supervision. And we'll extract some implicit tag features and essentially
learn in an object specific way a distribution over location and scale given
the tag features we've designed.
Once you have that distribution, you can come in and look at a new test image
that also has tags but you don't know where the object is and use these
implicit localization cues from the tags to do better detection. You could do
it better by doing it faster because you know where you should be looking for
this object, or you could do it better by looking everywhere but taking into
account that, if the tags are indicating mug is supposed to be prominent and
you find some tiny one over here, it's probably an incorrect estimate.
Okay. So I'll briefly describe these implicit tag features. We've already
given a hint of what they might be. One, we're going to look at which
objects are there and which ones aren't, because the ones that aren't also
tell us something about the scale of the scene. So to do this we'll just look
at the count of every word in the tag list -- so a normal bag of words, tag
occurrence, tag frequency. And that's how we'll capture this total scale
based on the objects that are mentioned.
So the assumption is here, of course, that, when you look at a scene like this,
you don't start mentioning some of the least important and small objects only.
You'll mention some of the most important ones for sure.
Now, the second cue, we look at the rank of these tags because the order is
very telling about which things were most noticeable or which were maybe most
important in defining the scene. And rather than looking at just the raw rank
of the tag within the list, we actually look at the percentile rank. So this
is where, if you have -- in this case, let's say the pen is not necessarily the
very first thing mentioned, but it's mentioned rather high for a pen compared
to what we've seen in other training data. So here, by recording the relative
rank, or percentile rank for each word, we get a feature that's describing
relative importance within the scene.
And then the final, third cue we look at is proximity. We already know from
human vision experiments that you don't look at a scene and then jump around
haphazardly. There's a systematic way, after you fixate on some object, that
you'll tend to drift to other nearby ones. And if we can assume that, in doing
that kind of drifting with your eyes, you start mentioning things in that
order, then we can look at the mutual proximity between any two of these tags
to get a spatial cue about the different objects.
So that's what we're going to capture by looking at, between all pairs of tags,
how close they are in the list.
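For illustration, here is a simplified computation of the three cues from a tag list. Note that the percentile rank in the actual work is measured relative to how high that word typically appears across training images, whereas this sketch just normalizes within the list.

```python
# Illustrative computation of the three implicit tag cues: word presence/count,
# a (simplified) percentile rank of the target word, and pairwise proximity in
# the tag list. The exact feature definitions in the paper may differ.
def tag_features(tags, vocabulary, target):
    counts = {w: tags.count(w) for w in vocabulary}          # which objects are (not) mentioned

    if target in tags:
        r = tags.index(target)                               # absolute rank (0 = named first)
        percentile_rank = 1.0 - r / max(len(tags) - 1, 1)    # 1.0 if named first
    else:
        percentile_rank = 0.0

    proximity = {}                                           # closeness of word pairs in the list
    for i, a in enumerate(tags):
        for j, b in enumerate(tags):
            if i < j:
                proximity[(a, b)] = 1.0 / (j - i)
    return counts, percentile_rank, proximity

# Example in the spirit of the talk: "mug" named first suggests a prominent mug.
print(tag_features(["mug", "key", "keyboard", "toothbrush", "pen"],
                   ["mug", "desk", "pen"], "mug"))
```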
>>: Is that equal to equating how two big objects, obviously they'll be first
[inaudible].
>> Kristen Grauman: So these things all come together. Between the rank,
between the presence and absence, and this proximity, we're going to put all
these forth and then learn which combination in which way these are predictive.
And it could be object specific as well, right? Yeah.
So then we can use this to do detection. First, we need to make the
distribution itself. So here we're looking for a distribution where, if you
give it tag features of those three types, it can come back with a distribution
over scale and position, and we use a mixture density network to do this. So
basically, given these features, you're learning to predict the mixture model
parameters -- a Gaussian mixture model in three dimensions, over the spatial
dimensions and scale. And each one of these is object specific. We'll have
one of these for mug, we'll have one of these for person, and so on.
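A minimal mixture density network of that flavor, sketched in PyTorch with illustrative sizes (not the network actually used), would predict mixture weights, means, and variances over (x, y, scale) from the tag features and be trained with the mixture negative log-likelihood:

```python
# Minimal mixture density network sketch: tag features in, a Gaussian mixture
# over (x, y, log-scale) out. Sizes and training details are illustrative.
import torch
import torch.nn as nn

class TagMDN(nn.Module):
    def __init__(self, feat_dim, n_components=4, out_dim=3):
        super().__init__()
        self.out_dim = out_dim
        self.hidden = nn.Sequential(nn.Linear(feat_dim, 64), nn.Tanh())
        self.pi = nn.Linear(64, n_components)                   # mixture weights
        self.mu = nn.Linear(64, n_components * out_dim)         # component means
        self.log_sigma = nn.Linear(64, n_components * out_dim)  # diagonal std devs (log)

    def forward(self, x):
        h = self.hidden(x)
        k = self.pi.out_features
        return (torch.log_softmax(self.pi(h), dim=-1),
                self.mu(h).view(-1, k, self.out_dim),
                self.log_sigma(h).view(-1, k, self.out_dim))

def mdn_nll(log_pi, mu, log_sigma, y):
    """Negative log-likelihood of targets y (batch, 3) under the predicted mixture."""
    dist = torch.distributions.Normal(mu, log_sigma.exp())
    log_prob = dist.log_prob(y.unsqueeze(1)).sum(-1)             # (batch, components)
    return -torch.logsumexp(log_pi + log_prob, dim=-1).mean()
```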
So now we have the ability to look at these images just like you all did a
moment ago and, seeing only the tags, look at that distribution. So if I
sample the top 20 or 30 windows from that probability of localization
parameters given tags, I can come up with something like this. And I think
this is very cool, right? You're just looking at a list of keywords and
deciding where you think this object is going to occur. And we can uncover the
actual image, and this is a fairly good distribution of where to start looking
for the objects. Here again is a failure case.
Boulder and car, I mentioned -- here's the car. But our very top 20 guesses
were going to be to look around here. So certainly the data representation, as
in any task, the more representation we have between tags and images, the
better we'll do.
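
Continuing the same sketch, sampling candidate windows from the predicted
mixture could look like this; treating the samples as the top windows is a
simplification of whatever ranking the actual system uses.

    import torch

    def sample_windows(pi, mu, sigma, n_samples=30):
        """pi: (K,) mixture weights; mu, sigma: (K, 3) over (x, y, log-scale).
        Returns n_samples candidate window parameters for one image/class."""
        ks = torch.multinomial(pi, n_samples, replacement=True)  # pick components
        eps = torch.randn(n_samples, mu.shape[1])
        return mu[ks] + eps * sigma[ks]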
And, of course, we don't stop there. We actually run the detector. So we have
this prior, and now we'll go in and also see what the detector has to say.
We can use this prior to do priming, to do detection faster: you prune out a
whole bunch of windows you shouldn't even bother looking at. And we'll also
use it to improve accuracy. That's where, if the detector itself gives us some
probability of the object's presence in any window, we can modulate that
according to what the tag features are saying. We do this with a logistic
classifier.
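
A rough sketch of that modulation step, assuming a simple two-feature logistic
model over the detector score and the tag-based prior density; the actual
feature combination used in the work may differ.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_modulator(detector_scores, prior_densities, labels):
        """All inputs are 1-D arrays over training windows; labels are 1 for
        correct detections and 0 otherwise."""
        X = np.column_stack([detector_scores, np.log(prior_densities + 1e-8)])
        return LogisticRegression().fit(X, labels)

    def rescore(clf, detector_scores, prior_densities):
        # Modulated confidence per window, combining detector and tag prior.
        X = np.column_stack([detector_scores, np.log(prior_densities + 1e-8)])
        return clf.predict_proba(X)[:, 1]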
We've run experiments with this algorithm on two different labeled datasets,
LabelMe and PASCAL. If you haven't seen PASCAL, it is really the most
challenging benchmark for object detection today, and it has about 20
different categories. Flickr images were the source, so they're quite varied
and quite difficult. What we wanted to show is: if you take the
state-of-the-art object detector, the latent SVM model, and combine it with
our implicit tag features, can we actually get a better detector? And here
are some examples where I show that we can.
So these are Flickr images from the PASCAL data, and the green box is our most
confident estimate for that object -- here on top a bottle, here a car. And
the red one is the top bounding box the latent SVM would return. So notice
that here the latent SVM, the baseline, thinks the bottle is this large one,
but given even just the context of the scene conveyed by the tags, we can
guess a smaller window for this bottle.
Another instance where, again given the list of tags, our estimate for the
person ends up being the smaller one over here, whereas the latent SVM would
make this kind of prediction.
This rank cue is quite powerful. So here you have a dog mentioned first, but
also a hair clip, just something small. And, of course, I'm just reading into
what was actually learned from the data. But something small is there, so
that kind of limits the scale of the scene and tells us to predict here for
this dog.
>>: So how many images did you run the whole system on?
>> Kristen Grauman: Let's see. The training set for PASCAL, I believe, is
around 10,000 images.

>>: That's PASCAL. But you were using Flickr, right? So that's a different
source?
>> Kristen Grauman: I was saying Flickr because they came from Flickr, the
PASCAL images themselves.
>>: And you were able to find all the labels? You went back and found all the
labels?
>> Kristen Grauman: No. These tags came from Mechanical Turk -- and I don't
work for Mechanical Turk, we're just using it for this system. We went back
and collected them; the original tags weren't preserved with the images.
>>: So you took the PASCAL ones, and then you ran them through Mechanical
Turk, and you're republishing that data?
>> Kristen Grauman: Yes, we have it. Right. I think we had about 700 or 800
different people tagging these images for us to make the training set.
>>: So one way to interpret this is: if you run a computer vision algorithm
over this PASCAL challenge, but you also have people not tell you where things
are, just give whole-image labels, you do a lot better if you pay attention to
what the people said?
>> Kristen Grauman: Right, right. And not just which nouns showed up. So
this is a weak supervision setting: these people weren't trained annotators,
and it's not that they knew what the tags would be used for. They were just
giving us names through Mechanical Turk. The next step is to see, if I take a
random set of tagged images, which would probably be less complete than what
we collected -- here we have about five tags on average per image -- can we
get the same effect?
But even at this stage of supervision, this gives you a strong prior on where
the object is, without asking someone to draw those pesky bounding boxes,
yeah.
>>: Well, Flickr tags are used more for searching for the images. People
won't necessarily go and specify the nouns in the image; they might say
something like happy, things like that.
>> Kristen Grauman: That's right. And we're encountering that in some new
data collection. They'll say Honda Accord or whatever, and it's not going to
mention some of the objects. But a possible direction to make this better is
looking at the language side and the meaning of these words, rather than just
the keywords themselves, which should strengthen the representation of what we
learn here.
And there are certainly failure cases. Here are just a few. If you're used
to seeing planes that, when someone mentions them first, are on the runway at
the airport, then you think the airplane is going to be large. But in this
unusual view, where the building is there and so is the airplane, it's
actually very small. You can see some other errors here.
Okay. And we can quantify all this. If you run a sliding window detector,
then for a given detection rate you have to look at a certain percentage of
windows. To get up to an 80 percent detection rate, you're looking at 70
percent or so of the windows; if you do this priming, just based on the tags,
you're looking at many fewer windows, about 30 percent.
And maybe one of my favorite results out of this is that with this tagged
data we're able to provide better detection than the state-of-the-art LSVM
detector alone. So over all 20 categories we're improving accuracy based on
this priming, not just speed.
I'm going to skip this. Finally, we're taking this idea and looking now at
image retrieval itself, with weaker supervision. If we think of the tags and
the importance cues -- meaning the rank and so on that I described -- as two
views of our data, then what we do is learn a semantic space that's importance
aware, because it's paying attention to the rank and relative prominence of
these different tags, so that this is a single representation taking into
account both views of the data. There are many algorithms that take two-view
data and learn a better representation; here we're using kernelized canonical
correlation analysis to get this space.
So when you have such a space, you can come in with an untagged image, project
it into this semantic space, and see which things are most similar. The goal
then is not just to do CBIR based on blobs of color and texture and so on, but
to have some learning in the representation that tells us which things to
retrieve, so that the most important things are preserved.
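
As a simplified sketch of the two-view idea, here is plain linear CCA from
scikit-learn standing in for the kernelized version used in the work; the
feature files and dimensions are hypothetical placeholders.

    import numpy as np
    from sklearn.cross_decomposition import CCA

    # View 1: visual features; View 2: importance-aware tag features
    # (counts, percentile ranks, proximities) for the same training images.
    X_visual = np.load("visual_feats.npy")   # shape (N, d_v), hypothetical file
    X_tags = np.load("tag_feats.npy")        # shape (N, d_t), hypothetical file

    cca = CCA(n_components=30)
    cca.fit(X_visual, X_tags)

    # Project the database images and an untagged query into the shared space,
    # then retrieve nearest neighbors there.
    db_proj = cca.transform(X_visual)
    query_proj = cca.transform(np.load("query_visual.npy").reshape(1, -1))
    dists = np.linalg.norm(db_proj - query_proj, axis=1)
    ranked = np.argsort(dists)               # most similar images first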
And I'll show you two examples. Here, if you just use visual cues and do the
retrieval for this query image, you'll get back these. Globally they
definitely look similar according to the different cues, but the important
objects aren't necessarily preserved. Here's what happens if you learn a
semantic space from the words but just treat them as nouns, not caring about
order: these are the results you'll get.
And finally, if you use our approach and do this retrieval, you'll get images
that look similar but also preserve the most important, defining objects in
the scene. Another example in the same format.
>>: For the query image, you had tags associated?
>> Kristen Grauman: No, we don't. During learning, we have tagged images and
no bounding boxes in this case. Then at retrieval time, you give an image or
you give tags, and you get back images. That's the role of that semantic
space, into which you can project from either view of the data.
>>: It's interesting to think about what the intention is when someone gives
a query image, in terms of their goals. I guess you have different classes,
right? Textures and types. But when it comes to nouns, it's probably
something that's [inaudible] vision central. Realizing the kinds of things
that can be done. The intention of the query image -- it's an interesting
question to ask. What's the goal? Do they want things like this?
>> Kristen Grauman: And the user has their own semantic notion of what the
right answer is, and you're trying to have the features and metrics that give
that back. And here the assumption is, based on what we've learned about what
people think is worth mentioning and when they mention it, that's the kind of
image we want to bring back when you query with something similar.
>>: [inaudible] versus these very interesting objects, mechanical objects.
>> Kristen Grauman: Right. And, you know, we have tested on a data set that's
more like what you're describing, which is LabelMe. These are all PASCAL
images, but LabelMe is more scene oriented -- not nice scenes like a beach,
but outdoor Boston scenes and indoor office scenes. And there we actually see
less gain. All the scenes are so similar that it's hard to get much
differentiation in the words.
>>: It's funny, because looking at the visual-only results, most of them
actually are trains, and none of them survive to the combined result. Is it
because they didn't have the word in the tags?
>> Kristen Grauman: Right. Well, I suspect -- you know, these are all test
images, but I suspect these would have tags -- well, at least a few of them
don't have trains. But maybe the train was mentioned later, or they mentioned
many more things, whereas here maybe just train had been mentioned, and it
matched well for this.
>>: [inaudible] tagging with the query?
>> Kristen Grauman: It does, yeah. I don't know that I -- no, I didn't
include that. But, right, since you have that space, you could go in with the
image and ask for keywords, and you could give those back. The idea is that
they're not just the right words, but they're in the right order.
This one is not okay -- these are some examples, and some of them are not so
good. The larger this data set gets -- during training, but also just
populating the database beyond the training set -- the better I can expect
these to be, since we'll have a denser set of tag lists that exist and that
you can bring back.
Okay. So to conclude what I've just shown you: we have studied how you can
manage the supervision process in object recognition more effectively. The
parts I showed you today address how you do this when the system is actively
evolving based on what it currently knows and what it thinks it needs to know.
The contributions there were to look at how you do multi-question, or
multi-level, active learning with predicted costs, and how to make batch
selections when you have variable-cost data as well.
And then for the second component, I gave the overview of thinking about the
implicit importance of objects, both in recognition and in image retrieval,
and how to exploit what people are telling us beyond what we're used to
listening to. That is, to use these importance-aware cues as implicit tag
features.
I'll stop there.
Thank you.
[applause].
>>: Some more questions?
>>: Have you tried to quantitatively check the annotations it generates?
>> Kristen Grauman: Yeah, definitely. I didn't show it all -- I skipped over
it. But we're looking at information retrieval metrics like NDCG: how well is
your ranking doing compared to the perfect ranking? We define the perfect
ranking either as one where the right object scales and objects are present,
or one where the tag list would have been generated the same way. So this is,
again, what we're getting with those same two baselines.
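
For reference, here is a small sketch of NDCG (normalized discounted
cumulative gain), which is how we read the metric mentioned above; the
relevance values in the usage note are hypothetical.

    import numpy as np

    def dcg(relevances):
        rel = np.asarray(relevances, dtype=float)
        ranks = np.arange(1, len(rel) + 1)
        return np.sum(rel / np.log2(ranks + 1))

    def ndcg(relevances):
        ideal = dcg(sorted(relevances, reverse=True))
        return dcg(relevances) / ideal if ideal > 0 else 0.0

    # e.g. ndcg([3, 2, 3, 0, 1]) compares a retrieved ranking's gains against
    # the best possible ordering of the same items.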
>>: Hopefully. I mean, you could take some of the images that were tagged
through Mechanical Turk and try to generate the tags for them, maybe from the
test set, and see -- I don't know how you could measure the labels.
>> Kristen Grauman: So do you mean withhold the tags on some test set and then
check? Yeah, that is what we're doing. We'll look at, for this query image
where we say we don't have tags, what comes back as the tags, and then how
well do those tags align?
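
A tiny sketch of that held-out check, with an arbitrary top-k overlap measure
standing in for whatever alignment score is actually used:

    def tag_precision_at_k(predicted_tags, true_tags, k=5):
        """Fraction of the top-k predicted tags found among the withheld tags."""
        return len(set(predicted_tags[:k]) & set(true_tags)) / float(k)

    # e.g. tag_precision_at_k(["train", "sky", "tracks"], ["train", "platform"], k=3)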
>>: All right. Thanks.
>> Kristen Grauman: Thank you.