>>: Good morning, everyone. It is my great pleasure to welcome Kristen Grauman here to give us a talk about some of her recent work. Kristen is a graduate of MIT, where she got her Ph.D. with Trevor Darrell, and she's also one of the Microsoft Research New Faculty Fellows. And she's now a professor at the University of Texas at Austin. Good morning, Kristen. >> Kristen Grauman: Thank you. Good morning, everybody. So today I wanted to talk to you in some detail about a couple of related projects that we're doing in object recognition. And this is work done with Sudheendra Vijayanarasimhan, Prateek Jain, and Sung Ju Hwang. I'd welcome any questions that you might have as I go along. If I may show this big picture: what I'm trying to convey here is the richness of the visual world that we have. I'm interested in recognizing, naming, reidentifying different objects, activities, people, and so on. And to do so means we've got a really rich kind of space to work with, quite complex and very large in scale. So early models for object recognition would turn to more handcrafted techniques, trying to really mold them to a certain task and so on. Later on, people started turning to the power of statistical learning methods instead. So there what we do is shift the problem into one where, if you get some kind of labeled data, images, videos, then you can train the model, say some classifier, to distinguish this category from all others. But what this picture is showing is this big kind of divide between the vastness and the complexity of this visual space and what we're trying to do here with the learning algorithms. It's meant to be a cartoon, right? What we really need to be focusing on now, if we're going to take this path, is this interaction between the data itself, the learning algorithms we're using, and the way any annotator is providing some kind of supervision. So focusing on this part, a couple of issues that I think we need to be concerned about are, one, that as much as we want to be able to fit the standard kind of point-and-label mode of supervised learning algorithms, it's not always the most beautiful fit for visual data, right? Meaning that, if you try to cram your visual space into nice vectors with labels, we can surely lose a lot of information. So trying to necessarily target that space can be too restrictive. Secondly, we have the highly expensive human annotation effort. So if we're thinking of our annotators here trying to convey everything they can immediately know about any of these kinds of instances to a classifier in the form of labels, this is expensive, and right now the channel is quite narrow when they do it. So I'll talk about our approach to trying to address these possible limitations, and it has two components. So, one, we're going to look at not just thinking of this as a one-pass thing where annotators give us labels, we learn models, and then we fix them. Instead, we go back through our models and let them self-improve. Let them look at things that are unclear, uncertain, and let them actively be refined by asking annotators the most useful questions. Secondly, I'm going to look at the flip side, not just going in the direction from the model and data to the annotator, but looking here at this connection between what the annotators are telling us about visual data and what they're not telling us but we could get if we listened more carefully. That's a little vague right now, but I'll get to what I mean in the second half of the talk.
First, let's look at this cost-sensitive active learning challenge. Let me tell you sort of the standard protocol today that many of you, I'm sure, are very familiar with for how you might train an object recognition system. You've got an annotator. They've prepared labeled data. Let's say you're learning about cows and trees. So you could then come up with some classifier using your favorite algorithm, and now you're done. You have the model, and you can get different images and categorize them, or run some kind of detection. So this is a one-shot kind of pipeline. It's fairly rigid, and it assumes both that, usually, the more data we put in here, the better, and also that the annotator knew which data to pick to put in here so that we have a good representation and, in effect, a good model. So instead of this kind of standard pipeline, what we're interested in is making this a continuous active learning process. So the active learner in the human classroom is the one who's asking questions about what's most unclear and thereby clarifying or more quickly coming to understand the problem. So in the same way, we want to have a system where, instead of just this one-pass kind of effort, the current category models that we have can survey unlabeled data, partially labeled data, and determine what's most unclear to the system and come back to the annotator with a very well chosen question. So the idea is you're not just learning in a one-shot pipeline, and you're also saving yourself a lot of effort if you do it well because, if you're paying this annotator, you only want to put forth the things that are going to most change the way you recognize objects. So this is the active selection task. Okay. So when we start trying to bring this idea into the object recognition challenge, we first have to think about traditional active learning algorithms. Here you'd have, say, some small set of positively labeled examples, negative labels, so cows and non-cows, and some current model. And then the idea is, if we survey all the unlabeled data, we can ask about one of those points that we want the label for next. So we won't just accept whatever annotation the annotator gives us at random, passively. Instead, we might pick something that's going to have the most influence on the model. And given whatever that label turned out to be, we could retrain the model -- say our decision boundary changed based on the new information. That's a standard kind of formula for an active learner. The problem is it doesn't translate so easily to learning about visual object categories, and there's a few reasons why it doesn't translate. So one is the assumption that we could say yes or no about a point. So some vector description, we say yes or no, and that tells us about the object. But, in fact, there are much richer annotations that we can provide to the system. So we could say yes or no, there's a phone or there's not a phone in these images, but the annotator could also go so far as to outline the phone within the image. That's an even less ambiguous -- or more informative -- annotation, or going even further, we could think about annotating all the parts of the object and so on. Okay. So notice that here we have a trade-off between the kind of level of the annotation, meaning how unambiguous the information is, and the effort that's involved to get there, right? So this is the most clean and perfect annotation, but it costs more in human time.
This is really fast to provide, but it's not so specific to the algorithm because we still don't know which pixels actually belong to the phone. So we want to actively manage this kind of cost and information trade-off. Two other things to notice here: these points -- or not points. Every image is itself possibly containing multiple labels. It's not just one object already cropped out for us. So it's multi-label data. And the other thing to note, we can't necessarily afford to have a system that, every time we get one label back, we go and retrain and figure out what we need next. In reality, I want to have multiple annotators working in parallel. So I could make good use of human resources when I have more than one annotator working. So these are the kinds of challenges we need to address if we're going to do cost-sensitive, active visual learning. So let me tell you about our approach. I'm going to show you a decision-theoretic selection function that looks at the data and weighs those parameters I just described, so both what example looks most informative to label next, but also how should I get it labeled? So we call this multi-level active selection. The levels are the types of annotations that one could provide. So as we pick an example, we'll also say what kind of annotation we want, and we'll do this while balancing the predicted effort of the annotation. So different things -- more complex images -- require more time or money from the annotator, and so we'll try to predict this trade-off explicitly. And the idea here is we expect that the most wise use of human resources is going to require a mixture of different label strengths. So we'll have to continually choose among these different kinds of levels in the annotation. So there's my cartoon of what I just described. You can imagine the system that's looking at the data, as you see here, and making these kinds of measurements about the predicted information and the predicted effort, and, of course, we want to go next to the sample that has the largest gain in information relative to the effort. So, you know, maybe this image looks quite complicated. So we could predict that it's going to be expensive to get annotated, but it's not worth it because it's not going to change our models too much. Whereas this one over here, it's still going to cost some annotation because this is fairly complex in the colors and textures, but we think it's going to change the models based on what we've seen before. So we would choose this one. Okay. So let me talk -- I need to explain both the way we can capture these labels at different granularities, or different levels, and then I'll talk about how to formulate this function so we can predict how things are going to change in the model. >>: Question. How do you calculate, like, the predicted cost for annotating? >> Kristen Grauman: Just hold on. I'm going to get to that in a few slides in detail. So multiple instance learning is how we're going to be able to encode these labels at different granularities. As opposed to just having positives and negatives labeled at the instance level, we'll have bags that are labeled as well as instances. So what this picture is showing is the multiple instance learning setup where you have negative bags. If you have a negative bag, it's a set of instances, and all of them are negative. But if you have a positive bag -- here these ellipses in green -- you don't know the individual labels of the points inside.
You just know that one or more is positive -- you're guaranteed that -- and you might have, accompanying those, some actual points that are labeled as instances. So notice that there's a nice relationship here with the ambiguity of the label that we can use to encode these different levels of annotation. So think of images as bags of regions: you could do some segmentation automatically and come up with different blobs. So if someone says, no, there's no apple in this image, that's a negative bag. All those blobs aren't apples. But if someone says, here's an image that does have an apple, you know there's one blob or more that has the apple. You don't know which ones. So this is how multiple instance learning would fit the learning framework and allow us to provide either coarse labels at the image level or finer instance labels where we've actually labeled, say, some region. I'm describing this in the binary two-label case, but, in fact, there are multi-label variants of multi-instance learning. But we're going to think of it more in the two-label case just for simplicity right now. >>: What's the purpose of bagging together negatives? >> Kristen Grauman: What's the purpose? So they're bagged together here, but, right, in terms of implementation, you'd unpack that bag, and you'd have a set of negative instances. So they would exactly be label constraints on each of those individual points. >>: [inaudible] dependencies among the objects within the bag? And leverage that information in doing the policy? Or were they actually independent instances in the bag? >> Kristen Grauman: So they're considered -- so the Dietterich work is the first introduction of the MIL setting, and it wasn't for the image scenario. >>: Have they focused on the idea that there are rich dependencies among the objects within a bag? >> Kristen Grauman: No, no. So it's just the setting of: here are all these things, and we've captured thus far that something in here is positive, but we don't know which one or how many. >>: I mean, for example, a bag of negative things, or a bag -- even just dependencies in the existence of things that co-occur in bags? Which I imagine would be spatial. >> Kristen Grauman: Right, yes. Definitely. And this is something that we can try to explore, for example, when we're trying to evaluate the information gain and we know there are relationships, and spatial ones in our case with images, based on spatial layout or objects that co-occur. So if I know it has a positive label for apple, I'm more inclined to think the unknown positive label for pear or something might also be true. >>: In MIL, is there opportunity to do really rich work in that space to come up with some meaningful learning approaches based on the positive learning [inaudible] that was done in the past? >> Kristen Grauman: I've seen some work with multi-label, multiple-instance learning, looking at some sort of label co-occurrence and exploiting that in this kind of ambiguous label setting, yeah. So we have a way now to represent labels at two granularities in this case, and now what we want to do is make our active selection function choose among the different questions that you can pose given this kind of label space.
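To make the bag encoding concrete, here is a minimal sketch of how an image and its labels might be represented for multiple-instance learning. This is an illustrative reconstruction, not the actual implementation from the work; the feature dimensions and label conventions are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional
import numpy as np

@dataclass
class Bag:
    """An image viewed as a bag of segmented regions for multiple-instance learning."""
    regions: List[np.ndarray]          # one feature vector per region (blob)
    bag_label: int                     # +1: at least one region is positive; -1: all regions negative
    instance_labels: Optional[List[int]] = None  # only filled in if regions were individually annotated

# Coarse, image-level annotation: "there is an apple somewhere in here."
positive_bag = Bag(regions=[np.random.rand(128) for _ in range(6)], bag_label=+1)

# Negative annotation: "no apple anywhere" -> every region inherits the negative label.
negative_bag = Bag(regions=[np.random.rand(128) for _ in range(4)], bag_label=-1,
                   instance_labels=[-1, -1, -1, -1])

# Finer, instance-level annotation: a specific region was outlined and named as the apple.
segmented_bag = Bag(regions=[np.random.rand(128) for _ in range(3)], bag_label=+1,
                    instance_labels=[-1, +1, -1])
```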
So there are three kinds of questions we're interested in. One, we want to be able to take an image and say, should I ask next for just a label on this region? That corresponds to labeling an instance here, and the black dot represents an unlabeled instance. Second kind of question, we could just say, give me an image-level tag. So name some object that's here, and that would correspond to giving an image-level or bag-level label. And finally, the most expensive annotation we could possibly ask for is to say, here's the image. Segment all the objects and name them. And that would correspond to, in a multi-label version, taking this positively labeled bag and saying, what are all the instances within it? And actually drawing those boundaries. So what we want to do then, and kind of the important novel aspect of considering multi-level active learning, is to trade off these different questions at the same time as you look at all your points. So let me tell you briefly how we do this. So we're using a value-of-information-based selection function, and this function is going to look at some candidate annotation and decide how much is this going to improve my situation versus how much is it going to cost to get annotated? So z carries with it both the example and the question you're considering asking about the example. And what we'll look at to measure VOI is to say, what's the total cost, given my labeled and unlabeled data, under the current model, and how is that total cost going to change once I move this z into the labeled set? So this denotes that there's some label associated with it, and we can measure how the cost changes. And if we have a high VOI, then we're reducing this cost. So breaking out what that cost is composed of: it's both how good is the model, or how low is our misclassification risk, and how is that balanced by the cost of the annotation? So if we look at the before and after as shown here, that amounts to looking at the risk on the labeled data and the unlabeled data. So are you satisfying the labels on the labeled data? Is the unlabeled data clustered well with respect to your decision boundary, for example? And how does that risk change once this guy gets annotated? And then penalize that by some cost of getting the annotation. So: the risk under the current model, the risk once we add z into the labeled set, and then the cost of getting z. So you can imagine applying this function to all your unlabeled data and coming back with the question that looks most useful to ask next. So filling in those parts: the first, getting the risk under the current classifier, we can do in straightforward ways. So what we're measuring is the misclassification risk for the labeled data and the unlabeled data. When we look at trying to predict the risk after we add z into the labeled set -- of course, we're asking for something we don't know because, if we had a label for z, we would not be wasting time figuring out if we need to get the label for z. So what we have to do is estimate how things are going to change based on the predictions under the current model. To do that, we'll look at the expected value of this risk. You can imagine thinking of all labels that this z could have -- there's a sum over the possible labels. If I insert that z into the labeled set, labeled as such, then how does the risk change? And then weight that by the probability it would have that label given your current model. So you can compute this expectation easily for two cases.
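For reference, the selection criterion just described can be written out roughly as follows. This is a reconstruction from the verbal description, so the notation may differ from the paper's; here X_L and X_U are the labeled and unlabeled sets, R is the misclassification risk, and C(z) is the annotation cost.

```latex
% Value of information for a candidate annotation z with predicted cost C(z):
\mathrm{VOI}(z) = \big(\mathcal{R}(\mathcal{X}_L) + \mathcal{R}(\mathcal{X}_U)\big)
               - \big(\mathcal{R}(\mathcal{X}_L \cup z) + \mathcal{R}(\mathcal{X}_U \setminus z)\big)
               - C(z)

% The true label of z is unknown, so the "after" risk is an expectation under the current model:
\mathbb{E}\big[\mathcal{R}(\mathcal{X}_L \cup z)\big]
  = \sum_{\ell} \Pr(y_z = \ell \mid z)\,\mathcal{R}\big(\mathcal{X}_L \cup (z,\ell)\big)
```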
The first two annotation types I mentioned: one, if you're talking about labeling a region, then in the binary case there are two labels it could have, so two terms in the sum. If you're talking about an unlabeled bag, same thing. However many classes there are, that's how many terms you have to evaluate in order to compute this expected value. The tricky case, and the one that we have to address to do this full combination of questions, is this third one, where we're actually asking how valuable is it going to be if I get someone to segment all the regions in here and name them? Because now, if you think about the expected value over all labelings, that's however many combinations of labels we can have. So it's exponential in the number of these regions. So we don't want to compute that. We can't. But what we've shown is that we can take a good sampling approach to compute this expected value. So essentially, we're sampling likely label configurations for this bag of regions, and we use those samples to fill in this expected risk. >>: [inaudible] how good it is? >> Kristen Grauman: How good the expected value is? >>: How good the sampling version is versus the 2^N version, if we did it exhaustively. >> Kristen Grauman: Yes. No, we haven't done that. >>: You also could take advantage of some of the process work -- Andreas' process work. Andreas will be here tomorrow, I guess. And make assumptions of submodularity so that the greedy approach to decomposing your image into a set of N features would come within a close bound of doing an exhaustive search. [inaudible]. That's in line with Andreas' work. So it would be interesting to show if the assumptions of your domain are indeed submodular and move beyond the Gibbs sampling or something like that. Interesting. >> Kristen Grauman: That's definitely possible, and that would be interesting. This is computationally the bottleneck for us; doing this sampling over these bags is better than what we'd have to do exhaustively, but it is somewhat expensive. What we do to try to counter that is use incrementally updatable classifiers, because when you do this sampling, you're retraining the classifier with candidate labels. So we'll train that incrementally so you're not as bad off. >>: People understand that you get the information -- you had talked about a value [inaudible] as well. For every imagined configuration, you're retraining the classifier and then using that to compute the updated risk. So it's quite a cycle. >> Kristen Grauman: Right. >>: What's the benefit of -- because you had three scenarios. The first scenario was saying, for a specific segment, this was a cat or not a cat. And the last one was label all the segments, and maybe you'll get to this, but it seems like the first one is kind of a subset of the third one. Is there a reason for not just breaking the third one into a whole bunch of number ones? >> Kristen Grauman: Well, so, yeah, what we're expecting is that by computing the expected value of information over that entire image -- because there is information in the fact that they appeared in the image together. So by computing this, we might find that actually going to the trouble of bringing up that image and asking for all the information is worthwhile, albeit more expensive than saying, what's this individual region? Versus pulling out an individual region, which in some cases can be the most informative question to ask individually.
>>: Because it's the contextual information that's going to give you the gain on the third? >> Kristen Grauman: Yeah, and even if you're not reasoning about the context between the regions, it's not random that these things are together. So this packet of regions may have some information that, once you evaluate it in this way, you can predict and then choose that instead. Yeah. And when I get a little bit later to talking about budgeted learning, you can imagine picking a bunch of instances that fit within some budget you have and comparing that to a single image label like this, yeah. Okay. So then this third term, now that we can predict how good things look in terms of the information, or the reduction in misclassification risk, is to say how expensive is this annotation? So we need to predict the cost. So image data, again, I think is somewhat special and challenging because it's not necessarily self-evident how expensive an image is to get labeled, right? I mean, at a very coarse level, I can say, I can pay someone on Mechanical Turk. Do people know what Mechanical Turk is? An interface. So I could pay someone a penny to do an image-level tag, or maybe a lot less than that, but if I want them to segment it, I have to pay 10 cents, let's say. But it's even more than that. I could have a very simple image, and someone's making a lot of money off me because it only took them a second, whereas, if I have a very complex image with lots of objects, they'll be sad, and they'll spend a lot more time annotating that for 10 cents. So our goal here was to say, well, how can we predict in an example-specific way how expensive the annotation task is? Because then we can pay the right amount, and we'll make the right VOI selection. The idea is, if you glance at these, who wants to annotate this one versus this one, where you have to do task number 3, which is full segmentation? You want to do this one, right? Because it's easier. Did you know that because you counted the objects? Maybe. But we're suggesting that you don't necessarily have to know what's already in the image or count the objects in order to predict the expense of the annotation. So here's the cartoon of why this might be true, right? There's just a notion of complexity based on how many colors are present, how many textures are there, how often do they change in the image? And I'm thinking you still prefer this one, and it's not that you know what this is or how many of them there are. So that was our intuition. And you could learn a function -- you give it an image, and it says how long it takes to annotate this. We learn this function in a supervised way. And the cost doesn't have to be annotation time. It could be things like what kind of expertise do I need to get this job done -- just thinking outside of this particular object recognition task. It could be what's the resolution of the data in these images? How long are the video clips? And so on. There could be some cost, but in many cases for us it has to be predicted at the example level. So what we did to estimate this predicted cost: we look at some features we think are going to be indicative of cost, and I've already alluded to what they might be. Things like how dense are the edges in regions at different scales? How quickly do colors change as I look in local windows of the image? Record all of these as features, and then go online, collect annotation data, so we're watching Mechanical Turk annotators. In an interface like this, they're drawing polygons. We're timing them. We've done this redundantly over many examples and many annotators. Now we have our training data. So we're going to learn basically a function where you give it features about complexity, as I'm showing here, and then it will give you back a time. And this is a function, once learned, that we can plug into that value of information and get an example-specific cost that we trade off.
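As a rough illustration of the kind of predictor being described -- not the actual features or regressor used in the work -- one could compute a couple of cheap image-complexity measures and regress measured annotation time on them:

```python
import numpy as np
from sklearn.linear_model import Ridge

def complexity_features(image):
    """Cheap proxies for annotation difficulty: edge density and local color variability.
    `image` is an HxWx3 float array in [0, 1]; these particular features are illustrative."""
    gray = image.mean(axis=2)
    gy, gx = np.gradient(gray)
    edge_density = np.mean(np.hypot(gx, gy) > 0.1)   # fraction of strong-gradient pixels
    # how quickly colors change in local windows, approximated by per-block standard deviation
    h, w, _ = image.shape
    blocks = image[: h // 8 * 8, : w // 8 * 8].reshape(h // 8, 8, w // 8, 8, 3)
    local_color_var = blocks.std(axis=(1, 3)).mean()
    return np.array([edge_density, local_color_var])

# Training data gathered by timing annotators (e.g., on Mechanical Turk):
# X = stacked complexity features, y = measured seconds to annotate each image.
X = np.stack([complexity_features(np.random.rand(64, 64, 3)) for _ in range(100)])
y = np.random.uniform(5, 120, size=100)   # placeholder times; real values would come from the timing logs

cost_model = Ridge(alpha=1.0).fit(X, y)
predicted_seconds = cost_model.predict(complexity_features(np.random.rand(64, 64, 3))[None, :])
```

The learned regressor would then be queried inside the VOI computation to supply the example-specific cost term.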
>>: [inaudible] so you don't know if the people are 100 percent focused on the task or doing other things? >> Kristen Grauman: Yeah, right, so there is that issue. Right. So, for example, they could start annotating, go for coffee, and we're sitting here running the timer: oh, this is a really expensive example. But it's not quite so bad because actually the timer is per polygon. Unless they start clicking half the polygon and stop. So it's pretty good. And we have 50 annotators doing every one. We look for consistency and things like that. Yeah? >>: [inaudible] interesting down the road. I'm sure you've thought about this. Getting some three-level feedback on how much fun it was to annotate? Even if it takes more time, if it's enjoyable versus tedious. >> Kristen Grauman: Exactly. Yeah, that's true. >>: You can pay them less money if it's kind of fun. >> Kristen Grauman: Exactly, yeah. And we already have a leg up in computer vision in general. The images are fun tasks for annotators versus I don't know what else. Right. But if you give them a spray can to do this painting, it's a lot more fun than if you give them a polygon tool. Maybe we could pay less. >>: Than if you're asking people to listen to voicemail messages and annotate the urgency -- the longer the message, the longer it goes on and on, we have time right there, but you can imagine that's not as fun as playing with images and shapes. >> Kristen Grauman: Definitely, yeah. So this cost function could be even more fine-grained. If it's not just time, it's also enjoyment. Right. Definitely. >>: Also look at how long it takes people to finish the HITs in general. There are some HITs up there that take weeks to finish and other ones you could throw up there and be done in an hour or two. So whatever HITs aren't being finished, they're kind of -- >> Kristen Grauman: They're not fun. >>: Or even getting completed. >>: Yeah, just even getting completed is a good measure. >>: And it's free. >>: And they do it. So you just keep upping the cost. Everything's a penny initially, and then after a day, everything goes up. This is big data. >> Kristen Grauman: Although I also understand they watch the job posters, and they have favorites among those. You have to balance your reputation of being a good job provider. Okay. So we have this function now with which we could rate all our data and decide what to annotate next. So we really would compute this function for all these examples and then come back and give this targeted question. The things to notice in contrast to traditional active learning, the things we're doing here: one is it's multi-question, because I'm not just saying, what is this? I'm saying, here's an image. I want a full segmentation on it. And, two, we're predicting cost. So when accounting for variable-complexity data, we're making wise choices and good use of our money based on how expensive we think things are going to be. And that's predicted. Okay. So let me show you a couple of quick results. Here we're looking at this MSRC data set from MSR Cambridge, and there are 21 different object categories.
These are multi-label images. We have grass and sheep and bikes and roads and so on. And then let's see some examples of the cost we're predicting. So here are all these data points. This is the actual time that annotators took, and then, when we're doing leave-one-out validation, the predicted time that we would produce. And so there's certainly a correlation here. And if we look along this axis -- so from the easy things that we predict to be easy, to the medium-effort things we predict to be medium, to the more complex -- you can kind of see how those features are being used to make these predictions. So these are the ones that we're getting almost dead on. Now here's one with a big error. Here's a nice, wonderful, close-up shot of a patch of grass on the ground. It's an unlikely photo. But, of course, this makes us give a bad prediction; this very high-res, high-texture patch of grass makes us think it's more expensive and complicated than it is. Here are a few more examples where we think these three images in red would have about equal and moderate cost. Okay. So now if we put all these things together and we set that active loop going, then as we spend more money on annotation, we get models that are more and more accurate. So these are learning curves; ours is shown in blue here. Now, if you were just to use random passive learning, which really is kind of what most -- you know, many of us have been doing just to train object recognition systems, then you learn a lot more slowly, with less wise use of the resources. If you use traditional active learning, though, you get this red curve, and that means you're allowed to look at points, you're allowed to ask for labels, but you're not allowed to trade off these different kinds of questions. And if you do our method, you get this third curve. That's what we want: to be learning accurate models very quickly. Okay. It's fun to kind of pick apart, you know, what did the system ask us? I mean, we had to be careful. These aren't -- this is our system actively asking questions. These are discriminative models. There's no need that they actually be interpretable, but we can go back and see what are the questions that were asked. So here, if we train the system with just two image-level tags per category -- here are all the ones we've already seen with labels. Now here are the first four questions the system asks us, and it's asking here to name some objects -- an image-level tag -- an image-level tag in all of these four, and if we look back at the existing training data, here's what we knew about buildings, and here's what we're asking for next. So this is somewhat intuitive. It's going to really change the building class if we find out about that. And similarly for sky: these are different-looking skies. Now, if we go further down in our chain of questions -- so here down to the 12th question -- notice when the system starts picking out regions that it wants specific labels for; in some cases this happens where the different regions are less obviously separable. Even if we knew about something in this image -- in fact, we had a tag at the image level for this guy -- now we're coming back to get more specific: well, what actually is this region? And the one question here in the first 12 that's the really expensive one, where we asked for full segmentation and labels, is this one here that is a bit more complex and worth the effort, according to the system. All right. So thus far I've assumed that we're doing this loop one at a time.
We go to one annotator. We get the answer. We retrain, and then we continue. Now, the problem -- and we've already talked about Mechanical Turk. What if, instead of that single annotator, I wanted to have a distributed set of many annotators working on this task? They're going to be sitting and waiting while we do Gibbs sampling and come back with the next question we should ask, if we really are running this online, live. And, in fact, what would be great is if we could select instead a bunch of tasks that we can feed to these annotators. However, the most kind of straightforward thing we could do is unlikely to be successful, and that is if I just take that function I had, this VOI function, take your top K-rated things and give them out next. Why is that not likely to work? Well, there can be a lot of information overlap in the things that were highly rated according to VOI. Maybe they're similar examples, and as soon as you got one, maybe you just wasted money on getting the rest. There also could be interactions where, if I asked for a large set of examples to be labeled at once, any portion of those labels can dramatically change the current model such that the other things I thought were interesting become uninteresting. So you have this sort of change in the model that it's hard to predict when you're talking about asking for many labels at once so that you can do these tasks in parallel. So we have an approach that we call budgeted batch active selection to try and address this. And the idea is to not just use the current model over here when you're looking at your unlabeled data, but to also set some kind of budget. So I'm willing to spend this much money as you get me the next batch of annotations, and then use that when doing the active selection. Your job is now not to pick the single best, most informative thing that's using the money wisely, but instead to use that budget and spread out your annotations to select the most informative batch. Then you'll bring those back, everyone can do the jobs at once, and you can continue iterating, but you're doing this a batch at a time. So to address the challenge I mentioned with the naive solution of rating everything and taking the top chunk that would fit in the budget, we need some farsighted notion in the selection function. So to do this, here I have kind of the boiled-down sketch of what our ideal selection would be. So we want to select a set of points. The optimal set will be the one that maximizes the predicted gain, which I'll define in a minute, for our classifier, such that the labeling costs are within the budget that we were given. And this predicted gain is where we can have some farsightedness in the selection, because we're not just going to look according to the current model at which things seem most informative; we're going to simultaneously predict which instances we want labeled as well as under which labeling they'll most optimistically reduce the misclassification error and increase the margin in our classifier. So let me try to draw a picture here to show that. So let's say this is your current labeled training set. So positives and negatives, and then unlabeled is in black. So we're going to do this predicted gain according to an SVM objective for the classifier. So that contains both a term for the margin -- you want a wide margin -- and a term for the misclassification risk. So you want to satisfy your labels.
So what we're going to do is try to estimate -- try to select those examples where the risk is reduced as much as possible while the margin is maintained as wide as possible under the predicted labeling of that selected set. So let's say that these were some three examples that fit in your budget. Then what we will do is choose -- optimize over what possible labels they could have -- and let's say this ended up being the candidate labeling. And then look at the predicted gain from what we previously had in the old model to what we have in this new model, again, in terms of the margin and in terms of misclassification error. Okay. So the important thing to note here is that there was a change to the model, and while making the selection, we're accounting for that change, and that will be what's farsighted in the batch -- the batch choice. So I'll -- hmm. >>: [inaudible] possible? >> Kristen Grauman: Right. No, we don't. We don't want to enumerate all possible labelings. That's what we want to optimize, right? So I'll try to kind of sketch out how we can do this effectively to make the selection. So we're doing this for the SVM objective. I think this is probably familiar to everyone here. So we have this margin term, and then we have this slack, and then here are our labeling constraints. So now let me define this intermediate function. So this will be the cost of a snapshot of a model and some data. So here's the model, meaning a classifier (w, b), and here's a set of data, a set B. And we're defining this function to be able to evaluate the margin term on one set of data, whatever data was used to train it, but then evaluate the misclassification errors on some other set B. So we're breaking it out so we have this ability to have the model, but then the model applied to the rest of the data to evaluate this term. So that's this function g. And here are the two parts, the margin and the misclassification errors. So now, when we want to do this farsighted selection, we want to optimize both which set of examples to take and under which labels we think they'll have the most beneficial impact. So we're going to minimize this difference. And what we have here is the cost after adding that data into the labeled set -- so the situation under the new model -- and then here the cost before we added it in. And I know there are a lot of symbols on this, but the high-level part here to capture is this comparison, and to note that here we're evaluating things once we've added these things into the set, and here we're looking at the old model and how well it would do on those same examples. And then this final constraint is saying you have to have the cost associated with each of the ones you select add up to no more than T. Okay. So finally then, this is what we want. How do we select for this? So notice that our optimization requires both selecting the examples and the labels under which they're going to have this most benefit. So what we do is introduce an indicator vector, a vector q. So say you have u unlabeled points. This is a 2u-length vector where you put one entry for the first example with its label being positive -- so this is one bit in this indicator -- and you do the same thing for considering if that example's negative. Do that for all your examples in the unlabeled set. Now, this q is our selection indicator for both labels and points, and this is what we're going to optimize over at the same time that we optimize for the classifier parameters.
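As a small illustration of the bookkeeping just described, here is how the doubled-up indicator and its two kinds of constraints might look. This is illustrative only -- in the real method q is relaxed and optimized jointly with the classifier -- and the numbers are made up.

```python
import numpy as np

u = 5                                            # number of unlabeled candidates
costs = np.array([3.0, 1.0, 4.0, 2.0, 2.5])      # predicted annotation cost per candidate
budget_T = 6.0

# q has 2u entries: q[2i] selects candidate i with hypothesized label +1,
# q[2i+1] selects the same candidate with hypothesized label -1.
q = np.zeros(2 * u)
q[2 * 1] = 1.0       # take candidate 1, hypothesized positive
q[2 * 3 + 1] = 1.0   # take candidate 3, hypothesized negative

pair_sums = q[0::2] + q[1::2]   # per-candidate selection indicator
assert np.all(pair_sums <= 1.0), "a candidate cannot be selected as both positive and negative"
assert np.dot(costs, pair_sums) <= budget_T, "selected annotations must fit within the budget"
```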
So finally, we can write down this entire objective, where we want to select both the classifier and the selected points. And the important things to note here -- so this is what we get if we reduce -- or simplify that term I just showed you before, the cost under the new model and then the cost under the old model, and we'll do this subject to a list of constraints that say all the labels should be satisfied on the labeled points. But notice that this penalty for misclassification only affects those examples where we're including them in the selection. So it's not all unlabeled data; it's the ones that we're going to think about adding to the labeled set. There's also a constraint for the budget. So for the ones you select, you have to add up their costs, and it has to be less than T. And this constraint is what's going to say that, in that big indicator, you should have the paired entries for an instance sum to no more than one. So we can't say unlabeled example number 2 is both positive and negative. We have to say one or the other. And finally, we would like an integer indicator vector. So this is finally the whole task that gets to the picture I showed on the first slide with the classifier changing. This is an integer programming problem, and it's NP-hard. But we show that, by relaxing this indicator vector, we can perform alternating minimization between a convex QP and a convex LP. So we'll iterate between solving for this, which is like a normal SVM objective, and solving for this. And between these two, we'll converge to a local optimum for these selections. >>: [inaudible] I believe this is a lot of work on a heuristic method to compute the decision criterion for a set. I'm curious how even an exact solution to this would relate to the full model where you're not going SVM, you're sticking with just a list of classifiers, not using the heuristic of the margin and the two objectives you have that you're mixing here, and look more at -- try and basically reason about what's an optimal bag of things to ask about, given the fact that with every answer that boundary is going to be changing. >> Kristen Grauman: Right. >>: And how will this approximation -- or this approximation to the heuristic function for the actual thing you're trying to do -- relate to the thing you're trying to do? >> Kristen Grauman: Well, I would remove one level of heuristic there, right? So we have the SVM objective, and that's what we care about. And we want to say what set of points, under what labels, will most optimistically reduce that objective -- those are the ones we should take. >>: The SVM objective function, the boundary you're assuming as your prior hyperplane, is moving with every single answer. >> Kristen Grauman: Right. And that's why, when we're choosing this, we're looking at all predicted answers simultaneously, right? So, like, how everything changes once you add that selected batch of examples under the budget. So that is the farsighted part, in that we're not -- we don't just -- I know. This is not the best place to do this. So we don't just have this evaluation according to the current model, which is the key, right? We're saying how does the objective reduce when we have these new things introduced into the model and the model itself has changed? >>: I guess the question would be, is that really -- going back to the SVM question now, should that really be your objective function? >> Kristen Grauman: The SVM?
>>: Yeah. The SVM, when you [inaudible]. >> Kristen Grauman: Empirically, yes, because in object recognition terms, probably the very best results are often using this good and flexible discriminative classifier, where we can plug in fancy kernels as we like and so on. So I do think, as a tool, as a classifier, the support vector machine is worthwhile to study for this active learning case. And then what we're doing is being able to say, if you have a budget to spend and variable-cost data, how can you make a farsighted choice under the SVM objective? >>: How do you handle the fact that, for the labels that you don't know, they could be one label or the other? You just say which label is more probable and just assign it to that? Or do you actually do a mix? >> Kristen Grauman: Right. So it is -- so in the indicator that was doubled up, you have both options -- >>: You put it in -- >> Kristen Grauman: But then you have the constraint that you should be choosing one or the other. And, in fact, we do tend to get -- even though this is a relaxed form, we tend to get integer responses, and I think this is because, if you enter in a slack penalty for both -- like if you have non-zero values on both, for example -- it's going to hurt your objective, whereas, if you choose the one that's more aligned with the real answer, you get a lower slack, a lower penalty. >>: At the same time, it might get a little bit easier because it kind of jumps to one side or the other; it might have a hard time jumping back? Is that -- I mean, it decides upon the answer pretty quickly. >> Kristen Grauman: I mean, right. So in practice this will run for about 10 or 12 iterations, and when we initialize this, because we have to initialize the indicator itself, we initialize with the myopic greedy batch selection, which is the one I was saying is not accounting for the joint information or the farsighted selection. So we start from there, and then we iterate to optimize this q selection. >>: How do you know that it will converge? It could be that those two parts are actually a trade-off between one and the other? >> Kristen Grauman: Well, so the constraints -- and I don't have all the equations right here. But the constraints on the one part are independent of the variables that we're optimizing in the other. So in each step, we're improving, and then we're alternating between these two improvements. So what this gives us is a way to have now an active learning system that's really working with live annotators, where we can farm out these jobs at once, and they're well-chosen jobs, and we can do all this under a very realistic form of budget where, if we actually have money or time limits, then we can give this in the formulation of the task. And there's actually a very limited amount of work on batch selection in general for active learning, and the important difference between our work and any existing batch-mode active selection is the fact that it's working on a budget with variable costs and that it's farsighted by predicting model changes when it makes the choice.
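Here is a minimal sketch in the spirit of that alternating scheme, with an SVM retrain standing in for the QP step and a greedy, budget-limited re-selection standing in for the LP step. The real method solves a relaxed LP derived from the SVM objective, so this is only a structural illustration: the scoring rule, function names, and data layout are all assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

def select_batch(X_lab, y_lab, X_unlab, costs, budget, n_iters=10):
    """Alternate between retraining the classifier under the current hypothesized batch
    and re-picking a batch that fits the budget. y_lab must contain both classes (+1/-1)."""
    chosen, labels = [], {}                          # indices into X_unlab and hypothesized labels
    for _ in range(n_iters):
        # (1) "QP" step: retrain on labeled data plus the current hypothesized selections.
        X_aug = np.vstack([X_lab] + [X_unlab[i:i + 1] for i in chosen])
        y_aug = np.concatenate([y_lab, [labels[i] for i in chosen]]) if chosen else y_lab
        clf = LinearSVC(C=1.0).fit(X_aug, y_aug)
        # (2) stand-in for the LP step: give each candidate the hypothesized label with the
        # smaller hinge loss, then greedily fill the budget by loss-per-cost (a crude placeholder
        # for the relaxed LP over the doubled-up indicator q).
        margins = clf.decision_function(X_unlab)
        best_labels = np.where(margins >= 0, 1, -1)
        losses = np.maximum(0.0, 1.0 - best_labels * margins)
        order = np.argsort(losses / costs)
        new_chosen, spent = [], 0.0
        for i in order:
            if spent + costs[i] <= budget:
                new_chosen.append(int(i))
                spent += costs[i]
        chosen = new_chosen
        labels = {i: int(best_labels[i]) for i in chosen}
    return chosen, labels
```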
So I want to show you -- focus on this one result with video. So here we have all these video clips in the Hollywood movies data set. This is a benchmark for action recognition, with realistic videos like the ones you see here from Hollywood movies, and your task is to take a video and decide whether or not it contains a certain action. So it's actually sort of a retrieval task based on these categories. And here we're going to measure how expensive an annotation is based on how long the video is, because someone has to watch it to decide if that action ever occurred. So in this case, we don't even have to predict cost. It's just a nice function of the data items themselves. So here is an example kind of selection that you'd get back, according to three methods. So on the left, passive selection: you had a budget, and you spent it all on one video. In the middle, this is the myopic batch selection, where you use the budget and take the top-ranked examples according to your current model. And here is our selection, where we use that same amount of time to select a number of videos that we think are going to most improve, in this case, the stand-up classifier, for people standing up. So what you see kind of graphically here: you're getting more interesting examples that were well chosen. These happened to be shorter ones that still all fit the budget but that we think look informative, versus even this quite strong baseline of myopic batch selection, where you're just choosing a couple that fit in the budget but didn't look at how the model changed. >>: [inaudible] out of curiosity, I don't know the details -- how are you featurizing the space so that the system gets a good sense of where you'll be in good condition to do stand-ups, for example? [inaudible]. >> Kristen Grauman: So what are the features? >>: Just an example. >> Kristen Grauman: We're using standard features for the video based on local interest operators in space-time to find, like, Harris corners in 3D, and then around those local interest points, we take histograms of gradients and histograms of optical flow. So you have a bunch of descriptors; you quantize those, and you make a bag-of-words representation for the video. >>: So when you say you take the length, you take the length of the original video? But you could have used a temporal window and said, okay, I'm looking for a sit down, a stand up, right? And then you basically wouldn't have that length as a cost. >> Kristen Grauman: Yeah. So there's some of that here. You're saying you could uniformly chunk up the video and then survey those. Yeah, and that is possible. There are some brains or smarts in how these clips were generated by the data set collectors. That was based on the scripts of the movies, where they're seeing these actions. And that's why they do vary in length, because given this scene, they think the action standing up might be happening. So the lengths aren't arbitrary. They were guided by the data creation, yeah. But you could -- you know, if you don't have any top-down knowledge about these actions, you might do a uniform chunking, and then that cost factor goes away. But we still have to choose a batch that's farsighted. And empirically, as before, I can show you the same kinds of trade-offs here. That's what you're seeing. We learn more quickly. We've done this sort of batch selection for object recognition as well. And here again we're looking at image segmentation time. And I'm very excited about this kind of framework just because it gives us this next step where we can make the system that's live, online, looking at many annotators, and making good decisions about what we should label. >>: [inaudible] there was a failure case and why? >> Kristen Grauman: Oh, there's a failure case? What's that doing there? Yeah, so this is a failure case. You want your curves going up. All of these are kind of jumbled. This is the dirty work gloves category. It's a bad object.
It's like white, textureless, and with any of these methods, cost-aware or not, we have a hard time doing this. [inaudible]. >>: [inaudible] the SIVAL. >> Kristen Grauman: So SIVAL is 25 different object categories. It was developed at Washington University in St. Louis. Basically, you get different snapshots where the object is there and then where it's on different backgrounds. So this is the blue scrunge, a sponge, and it's occurring here. So it's a true classification problem. Okay. So in this next bit -- and I think we need to wrap up. >>: Well, we have the room for an hour and a half. So why don't you run for another ten minutes or so. >> Kristen Grauman: Okay. Perfect. So I wanted to tell you about this other side of managing the human supervision in object recognition. This is what I'm describing as listening more carefully. So we're used to listening in a certain way to annotators, and that way is: I give you an image, you tell me object names, or you give me key words, and I have some labels about those objects. But we want to learn a bit more from what those kinds of annotations might provide. So, for example, if you go to something like Flickr, and you have these human-tagged images, it's clear that we could use those tags and do some kind of smarter detection. I already know that there's probably a dog in here, and there's a sofa. So this is a nice way to use almost freely available, loosely labeled data. And, in fact, there have been a number of different techniques over the years showing how you can try and learn this association from the weak annotations of tags, but all these techniques are looking at this noun and region correspondence, right? So trying to discover, if you give me tags on an image, which region goes with which name. And instead what we're trying to do is think about what other implicit information is there in those tags people are giving us. They're giving us a list of key words, and they're not just necessarily giving us noun associations. So we're assuming there's implicit information about the objects, not just which ones are there. So now I have to do my audience test. So here are two images that are tagged. Two people looked at some images I'm hiding, right? These weren't the images. And they gave these lists. So what do you know about the mug in both of these images if you just look at these tags? >>: [inaudible] on the left it's more prominent. >> Kristen Grauman: It's more prominent on the left. Okay. And why is that? >>: Because it's the first thing. >> Kristen Grauman: Yeah, it's the first thing. If you're going to mention mug first when you look at an image, probably it's pretty prominent. And what that might mean for us in terms of detection is maybe it's central, maybe it's large. So there are localization cues in how these tags are provided. And here are the images: you're right. This guy is very prominent. Here's the tiny mug over here. And they mention it twice again. So this is the very intuition we wanted to capture and try to let it benefit our object detectors. So it's things like the rank. So the mug is named first. It's probably important and big. Here it's named later. But there are also things like the context of these tags themselves. So the fact that you're mentioning other physically small objects already gives you a sense of the scope of the scene when you see that list. Whereas here, when you're mentioning physically larger objects, then you have a sense of the scope of that scene. It may be further back.
Of course, we're not going to write down rules to try and get all of this, but we can learn these connections from data. So the idea is to learn these implicit localization cues to help detection, and let me give you the high-level view of the algorithm. So during training, we have tagged images, and we also have bounding boxes on the objects of interest. So that's our supervision. And we'll extract some implicit tag features and essentially learn, in an object-specific way, a distribution over location and scale given the tag features we've designed. Once you have that distribution, you can come in and look at a new test image that also has tags but where you don't know where the object is, and use these implicit localization cues from the tags to do better detection. You could do it better by doing it faster, because you know where you should be looking for this object, or you could do it better by looking everywhere but taking into account that, if the tags are indicating mug is supposed to be prominent and you find some tiny one over here, it's probably an incorrect estimate. Okay. So I'll briefly describe these implicit tag features. We've already given a hint of what they might all be. One, we're going to look at which objects are there and which ones aren't, because the ones that aren't also tell us something about the scale of the scene. So to do this we'll just look at the count of every word in the tag list. So normal bag of words, tag occurrence, tag frequency. And here's how we'll capture this overall scale based on the objects that are mentioned. So the assumption here is, of course, that, when you look at a scene like this, you don't start mentioning only some of the least important and small objects. You'll mention some of the most important ones for sure. Now, the second cue: we look at the rank of these tags, because the order is very telling about which things were most noticeable or which were maybe most important in defining the scene. And rather than looking at just the raw rank of the tag within the list, we actually look at the percentile rank. So this is where, if you have -- in this case, let's say the pen is not necessarily the very first thing mentioned, but it's mentioned rather high for a pen compared to what we've seen in other training data. So here, by recording the relative rank, or percentile rank, for each word, we get a feature that's describing relative importance within the scene. And then the final, third cue we look at is proximity. So we already know from human vision experiments that you don't look at a scene and then jump around haphazardly. There's a systematic way, after you fixate on some object, that you'll tend to drift to other nearby ones. And if we can assume that, in doing that kind of drifting with your eyes, you start mentioning things in that order, then we can look at the mutual proximity between any two of these tags to get a spatial cue about the different objects. So that's what we're going to capture by looking, between all pairs of these, at how close they are in the list. >>: Is that equal to equating how big objects, obviously, they'll be first [inaudible]. >> Kristen Grauman: So these things all come together. Between the rank, the presence and absence, and this proximity, we're going to put all of these forth and then learn which combination, in which way, is predictive. And it could be object specific as well, right? Yeah.
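A minimal sketch of the three implicit tag cues just described -- word presence/frequency, rank, and pairwise proximity. For simplicity, the percentile rank here is approximated by the word's relative position within the list rather than by comparison to its typical rank in training data, as the talk describes; the vocabulary and example tag list are made up.

```python
from collections import Counter
from itertools import combinations

def tag_features(tag_list, vocabulary):
    """Encode an ordered tag list as the three kinds of cues: frequency, rank, proximity."""
    n = len(tag_list)
    counts = Counter(tag_list)
    # 1) presence / frequency of each vocabulary word (bag-of-words over the tags)
    frequency = {w: counts.get(w, 0) for w in vocabulary}
    # 2) relative rank: how early a word is mentioned (1.0 = first, 0.0 = last)
    first_pos = {w: tag_list.index(w) for w in vocabulary if w in counts}
    relative_rank = {w: 1.0 - first_pos[w] / max(n - 1, 1) for w in first_pos}
    # 3) mutual proximity: how far apart two mentioned words are in the list
    proximity = {(a, b): abs(first_pos[a] - first_pos[b])
                 for a, b in combinations(sorted(first_pos), 2)}
    return frequency, relative_rank, proximity

vocab = ["mug", "key", "keyboard", "toothbrush", "pen", "photo"]
print(tag_features(["mug", "key", "keyboard", "toothbrush", "photo"], vocab))
```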
So then we can use this to do detection. First, we need to make the distribution itself. So here we're looking for a distribution where, if you give it tag features of those three types, we can come back with a distribution over scale and position, and we use a mixture density network to do this. So basically, given these features, you're learning, and you're able to predict, the mixture model parameters -- a Gaussian mixture model in three dimensions, over the spatial dimensions and scale. And each one of these is object specific. We'll have one of these per object model -- one for person and so on. So now we have the ability to look at these images just like you all did a moment ago and, only seeing the tags, look at that distribution. So if I sample the top 20 or 30 windows from that probability of localization parameters given the tags, I could come up with something like this. And I think this is very cool, right? So you're just looking at a list of key words and deciding where you think this object's going to occur. And we can uncover the actual image, and this is a fairly good distribution of where to start looking for the objects. Here again is a failure case. Boulder and car, I mentioned -- here's the car. But our very top 20 guesses were going to be to look around here. So certainly, as in any task, the more data we have relating tags and images, the better we'll do. And, of course, we don't stop there. We actually run the detector. So we have this prior, and now we'll go in and also see what the detector has to say. And so we can use this to do priming, to do detection faster. So you'll prune out a whole bunch of windows you shouldn't even bother looking at, and we'll also use it to improve accuracy. And that's where, if the detector itself gave us some certain probability of the object's presence in any window, we can modulate that according to what the tag features are saying. So we do this with a logistic classifier. And we've done the experiments with this algorithm on two different data sets, LabelMe and PASCAL. If you haven't seen PASCAL, this is really the most challenging benchmark for object detection today, and it has about 20 different categories. Flickr images were the source. So they're quite varied, and they're quite difficult. So what we wanted to do is show, if you take the state-of-the-art object detector, the latent SVM model, and you combine it with our implicit tag features, can we actually get a better detector? And here are some examples where I show that we can. So these are Flickr images from the PASCAL data, and the green box is our most likely -- or most confident -- estimate for that object. Here on the top, bottle; here, car. And the red one is what the latent SVM would return as the top bounding box. So notice here the latent SVM, the baseline, thinks the bottle is here and large, but given even just the context of this scene conveyed by the tags, we can guess a smaller window for this bottle. So another instance where, here again, given the list of tags, our estimate for the person ends up being the smaller one over here, whereas the latent SVM would make this kind of prediction. This rank cue is quite powerful. So here you have a dog mentioned first but also a hair clip, just something small. And, of course, I'm just reading into what was totally learned from the data. But something small is there, so that kind of limits the scope of the scene and tells us to predict here for this dog. >>: So how many images did you run the whole system on?
>> Kristen Grauman: Let's see. The training set for PASCAL, I believe, is around 10,000 images. >>: That's PASCAL. But you were using Flickr, right? So that's a different source? >> Kristen Grauman: I was saying Flickr because they came from Flickr, the PASCAL images themselves. >>: And you were able to find all the labels? You went back and found all the labels? >> Kristen Grauman: No. These tags came from Mechanical Turk; we're just using it for this system, because the original tags weren't preserved for those images. >>: So you took the PASCAL ones, and then you ran them through Mechanical Turk, and you're republishing that data? >> Kristen Grauman: Yes, we have it. Right. So this is where the people -- I think we had about 700 or 800 different people tagging these images for us, to make the training set. >>: So one way to interpret this is, if you run a computer vision algorithm over this PASCAL challenge, but then you also have people not tell you where things are, but just give whole-image labels, you do a lot better if you pay attention to what the people said? >> Kristen Grauman: Right, right. And not just the nouns that showed up. So it could be a weak supervision setting where these people are intended to be amateurs, or we hope so -- not that these people knew what they were doing; they were just giving us tags on Mechanical Turk. The next step is to see, if I take a random set of tagged images, which would probably be less complete than what we collected -- here we have about five tags on average per image -- can we get the same effect? But even at this level of supervision that we intend, this gives you a strong prior on the object and where it is, without asking someone to draw those pesky bounding boxes, yeah. >>: Well, Flickr tags are used more for searching for the images. People won't necessarily go and specify the nouns in the image but might say something like happy, things like that. >> Kristen Grauman: That's right. And we're encountering that in some new data collection. They'll say Honda Accord or whatever, and it's not going to mention some of the objects. But a possible direction to make this better is looking at the language side and the meaning of these words rather than just the key words themselves, which should strengthen the representation in what we learn here. And there are certainly failure cases. Here are just a few. If you're used to seeing planes that, when someone mentions them first, are on the runway at the airport, then you think the airplane is going to be large. And in this unusual view, where the building is there and so is the airplane, it's actually very small. You can see some other errors here. Okay. And we can quantify all this. So if you run a sliding window detector, for a given detection rate you have to look at a certain percentage of windows. To get up to 80 percent, you're looking at 70 percent or so of the windows, and if you do this priming, just based on the tags, you're looking at far fewer windows, about 30 percent. And maybe one of my favorite results out of this is that, with this tagged data, we're able to provide better detection than using the state-of-the-art LSVM detector. So we're improving, over all these 20 different categories, based on this priming, in accuracy, not just in speed. I'm going to skip this.
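To make the tag-conditioned localization prior and the priming step described above a bit more concrete, here is a minimal sketch in PyTorch. This is not the speaker's actual system: the tag feature encoding, network sizes, the three-component mixture, and the hand-set logistic fusion weights are all illustrative assumptions; only the overall shape (a mixture density network over (x, y, scale), top-fraction window pruning, and detector-score modulation) follows the description in the talk.

```python
# Minimal sketch (illustrative, not the published model) of a tag-conditioned
# localization prior: tag features -> Gaussian mixture over (x, y, log-scale),
# then pruning of sliding windows and fusion with a detector's scores.
import torch
import torch.nn as nn

class TagMDN(nn.Module):
    """Mixture density network: tag features -> GMM over (x, y, log-scale)."""
    def __init__(self, tag_dim, n_components=3, hidden=64):
        super().__init__()
        self.n = n_components
        self.backbone = nn.Sequential(nn.Linear(tag_dim, hidden), nn.Tanh())
        self.pi = nn.Linear(hidden, n_components)              # mixture weights
        self.mu = nn.Linear(hidden, n_components * 3)          # component means
        self.log_sigma = nn.Linear(hidden, n_components * 3)   # diagonal std devs

    def forward(self, tag_feats):
        h = self.backbone(tag_feats)
        pi = torch.softmax(self.pi(h), dim=-1)
        mu = self.mu(h).view(-1, self.n, 3)
        sigma = torch.exp(self.log_sigma(h)).view(-1, self.n, 3)
        return pi, mu, sigma

def mixture_log_prob(pi, mu, sigma, windows):
    """Log p(x, y, log-scale | tags) for each candidate window (n_windows, 3)."""
    comp = torch.distributions.Independent(torch.distributions.Normal(mu, sigma), 1)
    log_p = comp.log_prob(windows.unsqueeze(1))                # (n_windows, n_components)
    return torch.logsumexp(torch.log(pi) + log_p, dim=-1)

def prime_and_rescore(prior_log_p, detector_scores, keep_frac=0.3, w=(1.0, 1.0, 0.0)):
    """Keep only the top fraction of windows under the prior, then fuse the
    detector score with the prior via a (here hand-set) logistic combination."""
    k = max(1, int(keep_frac * prior_log_p.numel()))
    keep = torch.topk(prior_log_p, k).indices                  # windows worth scanning
    fused = torch.sigmoid(w[0] * detector_scores[keep] + w[1] * prior_log_p[keep] + w[2])
    return keep, fused
```

Training such a prior would minimize the negative log-likelihood of each ground-truth (x, y, log-scale) under the predicted mixture, with a separate model fit per object category, matching the "one of these per object" description in the talk.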
Finally, we're taking this idea and looking now at image retrieval itself, with this weaker supervision. So if we think of these tags and the importance cues, meaning the rank and so on that I described, as two views of our data, then what we do is learn a semantic space that's importance aware, because it's paying attention to the rank and relative prominence of these different tags, so that this is a single representation taking into account both views of the data. There are many algorithms to take two-view data and learn a better representation; here we're using kernelized canonical correlation analysis to get this space. So once you have such a space, you can come in with an untagged image, project it into this semantic space, and see which things are most similar. And the goal then is not just to do CBIR based on blobs of color and texture and so on, but to also have some learning in the representation that tells us which things to retrieve, so that we have the most important things there. And I'll show you two examples. Here, if you just look at visual cues and do the retrieval for this query image, you'll get back these. Globally, they definitely look similar according to the different cues, but the important objects aren't necessarily preserved. Here's what happens if you learn a semantic space looking at words but just treating them as nouns, not caring about order: these are the results you'll get. And finally, if you use our approach and do this retrieval, you'll get images that look similar but are also preserving the most important, defining objects in the scene. Another example in the same format. >>: For the query image, you had tags associated? >> Kristen Grauman: No, we don't. During learning, we have tagged images and no bounding boxes in this case. And then at retrieval time, you give an image or you give tags, and you get back images. So that's the role of that semantic space, into which you can project from either view of the data. >>: It's interesting to think about what the intention is when someone gives a query image, in terms of their goals. I guess you have different classes, right? Textures and types. But when it comes to nouns, it probably is something that's [inaudible] vision central. Realizing the kinds of things that can be done. The intention of the query image. It's an interesting question to ask. What's the goal? Do they have things like this? >> Kristen Grauman: And the user has their own semantic notion of what the right answer is, and you're trying to have the features and metrics that give that back. And here the assumption is, based on what we've learned about what people think is worth mentioning early when they tag, those are the kinds of images we want to bring back for a similar query. >>: [inaudible] versus these very interesting objects, mechanical objects. >> Kristen Grauman: Right. And, you know, we have tested on a data set that's more like what you're describing, which is LabelMe. These are all PASCAL images, but LabelMe is more scene oriented. It's not nice scenes like a beach; it's outdoor Boston scenes and indoor office scenes. And there we see less gain, actually. There, all the scenes are so similar that it's hard to get much differentiation in the words. >>: It's funny, because looking at the visual only, most of them actually are trains, and none of them survive into the combined results. Is it because they didn't have the words in them? >> Kristen Grauman: Right.
Well, it looks like -- I suspect, you know, these are all test images, but I suspect these would have tags -- well, at least a few of them don't have trains. Or maybe the trains were mentioned later, or they mention many more things, whereas here maybe just train had been mentioned, and it matched well for this. >>: [inaudible] tracking with the query? >> Kristen Grauman: It does, yeah. I don't know that I -- no, I didn't include that. But, right, since you have that space, you could go in with the image and ask for key words, and you could give those back. The idea is they're not just the right words, but they're in the right order. This one is not okay. These are some examples; some of them are not so good, like what we get first here. The larger this data set gets -- during training, but also just populating the database beyond the training set -- the better I can expect these to be, because we'll have a denser set of tag lists that exist and that you can bring back. Okay. So to conclude what I've just shown you: we have studied how you can manage the supervision process in object recognition more effectively. The parts I showed you today asked how you can do this where the system is actively evolving based on what it currently knows and what it thinks it needs to know. The contributions there were to look at how you do multi-question, or multi-level, active learning with predicted costs, and how to make batch selections when you have variable-cost data as well. And then for the second component, I gave the overview of thinking about the implicit importance of objects, both in recognition and in image retrieval, and how to exploit what people are telling us beyond what we're used to listening to, and that's to use this importance-aware kind of feature as an implicit tag feature. I'll stop there. Thank you. [applause]. >>: Any more questions? >>: Have you tried quantitatively checking the annotations that it generates? >> Kristen Grauman: Yeah, definitely. I didn't show all of it -- I skipped over it. But we're looking at information retrieval metrics like NDCG: how well is your ranking doing compared to the perfect ranking, where we say the perfect ranking is either one where the right object scales and objects are present, or where the tag list would have been generated the same way. So this is, again, what we're getting with those same two baselines. >>: Hopefully. I mean, you could take part of the images that were tagged by the Mechanical Turk and try to generate the tags for them, maybe from the test set, and see -- I don't know how I could measure the label. >> Kristen Grauman: So do you mean withhold the tags on some test set and then check? Yeah, that is what we're doing. So we'll look at, for this query image where we're saying we don't have tags, what comes back as the tags, and then how well do those tags align? >>: All right. Thanks. >> Kristen Grauman: Thank you.
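For the importance-aware retrieval portion of the talk, here is a rough, textbook-style sketch of kernelized CCA over the two views: a visual kernel and a tag/importance kernel computed on the training images, plus projection of an untagged query through the visual view. This is a generic regularized KCCA formulation, not the exact one used in the work; the kernel choices, regularization constant, and centering details are placeholder assumptions.

```python
# Generic regularized kernel CCA sketch for two-view (visual, tag/importance) data.
import numpy as np

def center(K):
    """Center a kernel matrix in feature space (apply to both training kernels)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kcca(Kx, Ky, n_dims=10, reg=1e-3):
    """Regularized KCCA on two pre-centered kernel matrices.
    Returns dual weights: alpha for the visual view, beta for the tag view."""
    n = Kx.shape[0]
    Rx = np.linalg.inv(Kx + reg * n * np.eye(n))
    Ry = np.linalg.inv(Ky + reg * n * np.eye(n))
    M = Rx @ Ky @ Ry @ Kx                      # eigenproblem for visual-view directions
    vals, vecs = np.linalg.eig(M)
    order = np.argsort(-vals.real)[:n_dims]
    alpha = vecs[:, order].real
    rho = np.sqrt(np.clip(vals[order].real, 1e-12, None))
    beta = Ry @ Kx @ alpha / rho               # corresponding tag-view directions
    return alpha, beta

def project_visual(K_new_vs_train, alpha):
    """Embed new images into the shared semantic space via the visual view.
    Rows are kernel values between new images and the training images
    (centering against the training statistics is omitted for brevity)."""
    return K_new_vs_train @ alpha

def retrieve(query_embed, db_embed, top_k=5):
    """Nearest neighbors in the shared semantic space (cosine similarity)."""
    q = query_embed / np.linalg.norm(query_embed)
    d = db_embed / np.linalg.norm(db_embed, axis=1, keepdims=True)
    return np.argsort(-(d @ q))[:top_k]
```

At retrieval time, both the database images and the untagged query are embedded through the visual view and matched in the shared space; tagged items can equally be embedded through the tag view using beta, which is what lets either view serve as the query, as described above.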
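On the evaluation question at the end: assuming the ranking metric referred to is NDCG (normalized discounted cumulative gain), a standard implementation looks like the sketch below. The relevance grades would come from comparing a predicted tag ranking against the held-out human tags; that grading scheme is an assumption here, not something specified in the talk.

```python
# Standard NDCG: how close a predicted ranking is to the ideal ordering
# of the same relevance values.
import numpy as np

def dcg(relevances):
    """Discounted cumulative gain of relevance values in ranked order."""
    relevances = np.asarray(relevances, dtype=float)
    discounts = np.log2(np.arange(2, relevances.size + 2))   # log2(2), log2(3), ...
    return np.sum(relevances / discounts)

def ndcg(predicted_relevances):
    """DCG of the predicted ranking divided by the DCG of the ideal ranking."""
    ideal = dcg(sorted(predicted_relevances, reverse=True))
    return dcg(predicted_relevances) / ideal if ideal > 0 else 0.0
```

For example, ndcg([3, 2, 3, 0, 1]) measures how close that predicted ordering comes to the ideal ordering [3, 3, 2, 1, 0], with 1.0 meaning the ranking is perfect.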