>> Ran Gilad-Bachrach: Umar is not new to Microsoft Research. He spent the summer in Mountain View working with Nina Mishar and others on search-related problems. He also spent the summer at AT&T working on voice-activated interactive applications. Umar has won several prestigious awards, including the best student paper award at UAI 2007 and the computer science department award at Princeton in 2004. He also won the Google student award at the machine learning symposium twice, in 2007 and 2009. So without further ado, let's hear from Umar Syed about robust semi-supervised learning. >> Umar Syed: Okay. Thank you, Ran, for that very kind introduction. So as Ran said, today I'll be talking about robust semi-supervised learning. This is joint work with Ben Taskar at the University of Pennsylvania. So just to set the scene a little bit, let me review supervised learning, which is the oldest, most traditional learning model in machine learning. So in supervised learning we are given a training set of examples, and a labeler carefully examines this set of examples and reveals all the labels of the examples to a learning algorithm, and the learning algorithm's job is to try to fit a function that predicts the labels of these examples as accurately as possible. But it has been commonly observed that labeling examples is an expensive and time-consuming process, and many alternative approaches to supervised learning have been proposed to deal with this. And one of the most popular is semi-supervised learning. So semi-supervised learning is based on the observation that while labeled data is expensive, unlabeled data is typically quite cheap and easy to obtain, and so our learning algorithm should try to exploit this abundance of unlabeled data. So in semi-supervised learning we're given a training set of unlabeled examples. A labeler examines the examples and reveals the labels of some of the examples to the learning algorithm. 
Now a question that sort of immediately comes to mind is how are the examples selected to be labeled. And in most analyses of semi-supervised learning and the way most algorithms are derived it is assumed that the examples to be labeled are chosen randomly and usually uniformly at random; that is the unlabeled data and the labeled data are drawn independently from the same distribution. But in most naturally occurring partially labeled datasets, the examples that are labeled are not selected at random. So consider a website like Flickr. Users on Flickr can tag pictures with tags that indicate the content of the picture. Something similar happens on Facebook. You can tag pictures. You can also indicate whether you like the picture or not. And the process by which users are selecting pictures to label is unknown to us but it's almost certainly not a random process. In fact, most websites on the Internet that are soliciting user feedback are generating in one way or another a partially labeled dataset and the examples that are labeled in that dataset are selected according to a process that is difficult to model and difficult even to note. Now, let me just point out that in all these examples I'm showing you, the user selects an example to label not in a sort of sterile isolated environment but rather in the context of a webpage typically in the context of other examples. So this selection bias -- the selection bias by which the chooser chooses examples to label is highly correlated across examples and seems difficult to model. And so today I'm going to talk about an approach to semi-supervised learning that attempts to totally lift this modeling burden. So we're going to assume that the examples in the training set that have been selected to be labeled are chosen arbitrarily even adversarially. And so we don't want to make any probabilistic assumptions about which examples are labeled. 
Within this framework, I'm going to derive some generalization bounds, tight upper and lower generalization bounds on learning in this framework. And using those bounds, I'll derive an efficient and scalable algorithm that can learn a predictor using partially labeled datasets. And then towards the end, I'll describe some experiments that illustrate the benefits of our approach. And these experiments will illustrate that our approach is more robust than traditional semi-supervised learning algorithms when the labels are provided by a particularly unhelpful or confusing labeler. Okay. So here's a roadmap of the talk. After I describe robust semi-supervised learning in detail, at the end I will briefly discuss some other projects I've worked on related to bandit algorithms. I've worked on these projects during both my PhD and my post-doc. And I think some people here might be interested in perhaps talking about these projects later. So I'll just kind of briefly summarize them at the end, and maybe we can discuss them later. Okay. So let's talk about the learning model. So what do we want from this learning model? Remember that we want to lift the burden of having to model how users select examples to label. So we don't want to make any probabilistic assumptions about how those examples are chosen. Additionally and somewhat orthogonally, we don't just want the labeler to be able to give us discrete label information about each example, but also to be able to tell us some global information about the labeling of the entire dataset. And this kind of soft global information is going to be particularly useful for our experiments. And I'll describe it in more detail in a little bit. So here's our framework in some more precise detail. So we're given a training set of examples, X and Y. Here X and Y are vectors, so X is a vector of examples and Y is a vector of labels corresponding to those examples. The labeler examines the training set. 
And then he reveals to the algorithm a label regularizer function. Now, this function is chosen arbitrarily, perhaps even adversarially from some known function class. And the function is going to encode all of the information that the labeler wishes to reveal to the learning algorithm. So the semantics of this -- what is this label regularizer? The semantics of this regularizer are that of an asymmetric penalty function on labels of the training set. So, in other words, if R assigns a large value to sort of full labeling of the training set, this means that that labeling is not likely to be the correct label of the data. If R assigns a small value to some complete label of the training set, then R may -- then that labeling may or may not be the right value -- the right labeling of the dataset. So in other words, roughly speaking the minima or near minima of the function R, which is defined on label space -- labeling space are possible labelings of the dataset. >>: This doesn't talk about choice when you get to that next slide, this just seems to talk about [inaudible]. >>: [inaudible] function. >>: Okay. >> Umar Syed: All right. So we're just given this function. That's it. >>: Okay. >> Umar Syed: So just ->>: One kind of clarifying question. How is the expert supposed to give you this function? It is given to you by something, so what [inaudible] give you the sense of how the expert is going to come up with this function? >> Umar Syed: So you're going to need to be able to follow the gradient of this function. >>: How are we supposed to expect an expert to give you this function? >> Umar Syed: I'll give you some examples of specific functions and maybe that will clear it up. Yeah. So that will be the next slide. But just to recap, this is the learning model. Again so a training set X and Y is drawn independently from some distribution. 
The labeler looks at that training set, replaces the labels with this regularizer function, and that is the only information that the learning algorithm sees. Okay. So here are some examples. So this framework is quite general, and it can capture several existing approaches to semi-supervised learning. So here's an example. Suppose that the labeler is going to tell you for each example a set of possible labels that this example could have. So a regularizer function R that would encode that information is a function that assigns zero to any full labeling of the dataset that is consistent with these sets and infinity otherwise. It looks like in the transfer some symbols got lost. So this should be infinity and this should be element. Sorry about that. Oh, yeah. Oh, boy. I hope this doesn't happen too much. Okay. So here's another kind of information that a regularizer function R could encode. So suppose the labeler wants to give you a similarity measure on examples, and he wants to assert that two examples that are similar have similar labelings. So this is basically what's called Laplacian regularization. And it can be encoded using this form here. And notice here that I've replaced a labeling of the dataset Y with Q, which is a distribution on labelings of the dataset. This is sort of basically a soft version of a labeling of the dataset. And for the rest of the talk when I say a labeling, I really mean this softer version of a labeling, distributions on labelings. Another thing the labeler could tell you is the expected values of some features under the true posterior distribution. And this kind of information is called posterior regularization. And this is yet another approach to semi-supervised learning. I hope that answers your question. These regularizers can also be combined. 
For example, if the labeler wants to tell you a set of labels per example and also similarity information, then that can be encoded using this simple sum. >>: So if I wanted to just [inaudible] let's say I wanted to be really adversarial and I want to choose the worst possible examples in each class to show you, how would I [inaudible]. I want to just go and I want to mess you up. Let's say there's two overlapping Gaussians and I want to take the labels from the wrong -- >> Umar Syed: Right, right, right. So I'm going to do an experiment with exactly that kind of noise later. >>: So how do you express R -- how do you express that R? >> Umar Syed: Oh, so you would say that -- so he wants to tell you information, say, about two examples. So for any labeling, the function will be independent of what labels are assigned to the other examples. And for those two examples, you know, it will have a low value for those labelings that the labeler says are allowed and a high value for those that [inaudible]. >>: So how do you -- okay. Well, keep going. Keep going. >> Umar Syed: So one question you might want to ask is -- so I've just explained how several existing approaches to semi-supervised learning can be put in our framework. So how is this any different from ordinary semi-supervised learning? So all existing approaches to semi-supervised learning, at least the ones that we're familiar with, are optimistic, all right, in the sense that they tend to find a predictor that has good performance on the best possible complete labeling of the partially labeled dataset. And our algorithm, as I'll demonstrate to you, really finds the predictor that has the best performance for the worst possible complete labeling of the dataset. And so in this sense existing approaches are optimistic while our approach is more pessimistic. 
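The regularizers described above (a set of allowed labels per example, a similarity-based penalty, and their sum) can be sketched in code. This is only an illustration under stated assumptions: the slide formulas are not in the transcript, so the quadratic graph penalty below is the standard Laplacian form rather than necessarily the speaker's, and the function names and the toy similarity matrix `W` are hypothetical.

```python
import math
import numpy as np


def label_set_regularizer(labeling, allowed_sets):
    """Return 0 if every label lies in its allowed set, infinity otherwise."""
    consistent = all(y in allowed for y, allowed in zip(labeling, allowed_sets))
    return 0.0 if consistent else math.inf


def laplacian_regularizer(q, W):
    """Penalize similar examples (large W[i, j]) that get different soft labels.

    q: (n, k) array, each row a distribution over k labels.
    W: (n, n) symmetric similarity matrix (assumed form of the penalty).
    """
    diffs = q[:, None, :] - q[None, :, :]          # pairwise differences
    sq_dist = np.sum(diffs ** 2, axis=2)           # squared distance per pair
    return 0.5 * float(np.sum(W * sq_dist))


def combined_regularizer(labeling, allowed_sets, W, num_labels):
    """Sum of the two regularizers, evaluated on a hard labeling."""
    q = np.eye(num_labels)[list(labeling)]         # one-hot soft labeling
    return label_set_regularizer(labeling, allowed_sets) + laplacian_regularizer(q, W)


# Toy data: example 0 must be label 0, example 2 must be label 1,
# example 1 is unconstrained and is most similar to example 0.
allowed_sets = [{0}, {0, 1}, {1}]
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.5],
              [0.0, 0.5, 0.0]])

r_agree = combined_regularizer((0, 0, 1), allowed_sets, W, 2)
r_disagree = combined_regularizer((0, 1, 1), allowed_sets, W, 2)
r_forbidden = combined_regularizer((1, 0, 1), allowed_sets, W, 2)
```

In this toy case the labeling that agrees with the stronger similarity gets the smaller finite penalty, and any labeling that violates an allowed set gets an infinite penalty regardless of the similarity term.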
And this is particularly good for scenarios where the examples whose labels have been revealed are not representative or are confusing or misleading in some way. And our experiments will illustrate this. And moreover, as kind of a bonus, this pessimistic versus optimistic approach has the effect of convexifying some existing approaches to semi-supervised learning that are non-convex. For example, posterior regularization, one of the examples I showed you, is a non-convex formulation of the problem, and our pessimistic approach convexifies that approach. Okay. So I hope the learning model is clear. So now let me give you a few bounds about this model. So before I kind of jump into stating what the theorems are, let me step back and ask what is the point of these theorems. So I'm going to give you an upper and lower bound on generalization error in this learning model. And the algorithm I present later in the talk is going to minimize the upper bound. And because this upper and lower bound are going to be almost tight, this will show that our learning algorithm is near optimal. In other words, the right thing to do. So here's the upper bound. So let's -- you know, again, X and Y is a training set drawn from some distribution. And L is the loss of theta, the parameter that you're trying to learn. It could be log loss or hinge loss or something like that. And R is the label regularizer function that encodes the information. Then with probability one minus delta we have the following somewhat difficult to parse upper bound. But luckily the bound on the right-hand side is the sum of three terms. And each of these three terms I think has a nice intuitive interpretation. So let me walk you through it to help you understand what this bound is really saying. So the first term is a maximization. And this maximization is balancing between two objectives. 
So this first part of the first term is the expected loss on the training set under labeling Q, and this is the penalty essentially that the regularizer assigns to Q. So this maximization is trying to maximize the empirical loss that we observe subject to this penalty. And so this first term will tend to be large whenever R has a lot of minima, or near minima, because that gives this maximization more opportunities to be large, more places where it can be maximized. So in other words, this term will tend to be large whenever R is ambiguous, you know, many, many labelings are possible or allowed. The second term is the penalty that is assigned by the regularizer function to the true labeling of the dataset. And so this term will tend to be large when R is misleading, when the labeler is lying to you about what the true labeling is. And the last term is the term that goes to zero as the number of samples goes to infinity, and this term comes from standard arguments from uniform convergence. And so taken together, what this bound is saying is that as the number of examples goes to infinity the loss of any parameter is bounded by the ambiguity and misleadingness of the label regularizer function that the labeler gave you. Which is sort of an expected kind of result. Okay. So this is the upper bound. Now, it turns out that without further assumptions we're not going to be able to prove a matching lower bound. So what I'm going to describe to you is a set of what I think, I hope you will agree, are natural assumptions about the behavior of a labeler. And under these assumptions we'll be able to prove a tight lower bound. Okay. So there are basically three assumptions. So the first assumption is basically a truth telling assumption. So here's the function R that the labeler gives you. And let this be the true labeling of the data. 
And the assumption is that the true labeling is within epsilon of the minimum of the function. In other words, you know, the labeler is not highly misleading about the truth. I should perhaps also point out that this function R typically will have many minima, so it's not enough just to identify the minimum of the function to be able to recover the labeling. >>: I have a question [inaudible] the simple -- so the labeler is handing you both the regularizer R and subset of labels. >> Umar Syed: Just the R. >>: Just the R. >> Umar Syed: The only information [inaudible]. >>: But isn't this -- how is this -- earlier you said you were taking a pessimistic view of [inaudible] and then here it seems maybe I misunderstand you but it seems that you kind of assume that [inaudible] misleading you [inaudible]. >> Umar Syed: Oh, right. So the pessimistic view -- so that's a very good question. So I would say that the pessimistic view is not so much a consequence of the model but rather a consequence of the bounds that I'm describing to you. So this upper bound is pessimistic because we have this max over essentially all possible labelings that are essentially consistent with what he told you, right? And I'm going to describe to you a scenario where this bound is tight. >>: So you're saying that worst case in terms of, you know, given the setup with that particular -- >> Umar Syed: That's right. That's right. >>: The R itself is not [inaudible]. >> Umar Syed: Right. So I should make a distinction between being misleading and being ambiguous. Right. So this is saying that the labeler is not misleading. But he can be as ambiguous as he wants. >>: I have confusion also on the previous slide. The -- all right. The loss -- this is the loss of the model assuming Q is the correct labeling. >> Umar Syed: On the right-hand side, that's correct. >>: On the left, the first term on the right-hand side. 
>> Umar Syed: The first term on the right-hand side. >>: Okay. Okay. >>: You write Q as a state. Yeah. Okay. So it's [inaudible]. >> Umar Syed: So draw an example from the training set and then draw its label according to Q. Okay. So again, the first assumption is that the labeler is not too misleading. The second assumption is an assumption about the stability of the labeler. So if I have two training sets and they are very similar to each other, and I'll explain what I mean by similar in a second, if they're very similar, then the labeler will give me the same information for both training sets. And by similar, I mean think of those two training sets as empirical distributions in input space. If those two distributions are close by this amount that we're going to parameterize by lambda, then that's what we mean by close. So that's stability. And the last assumption is kind of an unusual one, but it's necessary. And that is what we call the no coding assumption. And that is the labeler can't hide the true labeling in some clever way in the function R. We want to exclude this possibility because if he hides the true labeling in a clever way, then a very clever algorithm might be able to recover it, and we have no hope of showing that our algorithm's optimal. So let me just give you an example of a clever hiding of the information. So consider a dataset where the examples have binary labels and the labeler is going to erase some of the labels and not erase others. So there are M examples. There are 2 to the M possible labelings and there's also 2 to the M possible ways to erase labels. So there could, in fact, be a one-to-one mapping between labelings and erasure patterns. And a very clever algorithm could invert that mapping and recover the truth. Now, obviously this is a highly artificial, unrealistic setting, and this is exactly the kind of artificial thing we want to rule out. 
Because if we don't rule it out, then I can't show that the algorithm is optimal. >>: What would be the [inaudible]. >> Umar Syed: So the proper definition is that -- so R -- so R assigns some labeling's finite value and other labeling's infinite value, right? So let's call things that are finite allowed and things that are infinite not allowed, right? The penalty means it's not allowed. So the assumption is that no algorithm can conclude that a labeling that is allowed is not allowed -- is not possible. So if the regularizer assigns a finite value to a labeling then it's possible. That's the assumption. >>: So on [inaudible]. >>: Yeah. >>: [inaudible] where people tend to rate things that are [inaudible]. >> Umar Syed: Yeah. Yeah. >>: And so that would be kind of the situation that seems to be analogous to a mapping between -- not a deterministic mapping but nevertheless a mapping between labelings and [inaudible] mappings, right, because what's happening essentially erasing the three star gradient. >> Umar Syed: That's true. >>: And [inaudible]. >> Umar Syed: That's a good point. That's true. That's a good point. I wouldn't argue with that. >>: Your lower bound cannot [inaudible]. >> Umar Syed: No, no. It's fine with that. Because -- so what would the R look like for that setting? >>: [inaudible]. >>: Well, no, it's more than that, right, because you're saying that the cases that are -the cases that are left unlabeled essentially telling me something about ->>: The actual label. >>: The actual label. >>: But it's ->> Umar Syed: Right. So if it ->>: [inaudible]. >> Umar Syed: No, no. [laughter]. So if it were the case that anybody unlabeled is definitely not a one or definitely not a five ->>: [inaudible] right? >> Umar Syed: Well, I think ->>: What type of ->>: So is one soft it's okay. >> Umar Syed: Yeah, I think so. >>: Oh, okay. >> Umar Syed: So the assumption is -- does not involve probability. The no coding assumption. 
>>: So there can't be a deterministic way to determine [inaudible]. >>: Conversely a [inaudible] R of truth [inaudible] I mean you can still do full labeling, right? >> Umar Syed: Absolutely, yes. >>: So R truth equals zero [inaudible]. >> Umar Syed: That's the supervised setting. >>: That still works? >> Umar Syed: That still works. Okay. So under these three assumptions, which I hope I've convinced you are reasonable, we have the following lower bound. So if the number of examples is at least this many, here lambda is from the stability assumption, then with at least constant probability no algorithm is guaranteed to have generalization error that's more than this much better than the upper bound. In other words, there's no algorithm that can guarantee with high probability a generalization error bound that's more than this much better than the upper bound that I showed you. And again, here lambda is from the stability assumption, and epsilon is from the truth telling assumption. >>: That square root, is it real square root or [inaudible] I'm just curious if it actually cancels -- >> Umar Syed: It's probably O, you're right. >>: But does it actually cancel the term? Does it cancel the term in your upper bound? Is this -- >> Umar Syed: It's the same term. >>: Yes. So this is saying that you can't be more than max over Q of L theta, you know, Q. >> Umar Syed: Oh, no, no. >>: Q minus epsilon. >> Umar Syed: No, no, no, no. It's the -- so the gap will be twice. So in other words, the upper bound has a slack of this much. >>: Yes. >> Umar Syed: And the lower bound has sort of a complementary slack below it. >>: I see. >> Umar Syed: So the gap is twice this term, yeah. >>: But it goes. >> Umar Syed: Yes, it goes to -- asymptotically goes to zero. So asymptotically the gap is epsilon. Okay. So now that we have our matching upper and lower bounds, we're ready to derive an algorithm. And the algorithm is really simple. 
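The upper bound described in words above can be written schematically as follows. This is a reconstruction from the spoken description, not a copy of the slide: the symbols \hat{L}, q, y, m and the rate term \varepsilon_m(\delta) are assumed notation, and the exact form of the third term is not given in the talk.

```latex
% With probability at least 1 - \delta over the training set (x, y), for all \theta:
\[
  L(\theta) \;\le\;
  \underbrace{\max_{q}\Big[\hat{L}(\theta; q) - R(q)\Big]}_{\text{ambiguity of } R}
  \;+\;
  \underbrace{R(y)}_{\text{misleadingness of } R}
  \;+\;
  \underbrace{\varepsilon_m(\delta)}_{\to\, 0 \text{ as } m \to \infty}
\]
% Here \hat{L}(\theta; q) is the expected loss on the training set when an
% example is drawn from the training set and its label is drawn according to q.
```

Read this way, the first term grows when many labelings have small penalty, the second grows when the true labeling has large penalty, and only the third depends on the sample size.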
And now that we've shown that the upper bound is tight, at least under some reasonable assumptions, the algorithm is just going to find the theta that minimizes the upper bound. And hopefully we've argued that that's the right thing to do. And we're going to minimize the upper bound while controlling the norm of theta to help with generalization error. And so here I've just plugged in the upper bound from the previous slides, discarding terms that are independent of theta. Oh, sorry. So how could I minimize this objective? So one thing I could do, that perhaps some of you are familiar with, is I could try subgradient descent. I want to minimize this function. This function is not differentiable but I can find its subgradient at various points. And I can follow the subgradient until I find the minimum. But something that we found seems to work better is the algorithm I'm about to describe, which is essentially a two-step algorithm for finding theta star, the theta that minimizes this objective. And this algorithm assumes that both the loss function and the regularizer function are convex. And all of the examples of regularizer functions that I've given you so far are, in fact, convex. And the algorithm is called game because there's a game-theoretic theorem in the proof of its convergence. So here's the algorithm. Essentially what we're going to do is -- excuse me. So we have this minimax objective. So we're basically going to solve the problem inside out. So by solving it inside out I mean I'm going to swap the min and the max in the objective. Okay? And now I'm going to find not the best theta for the worst possible labeling but rather the worst labeling for the best theta. In other words, I'm sort of changing roles. I'm no longer the learning algorithm. I'm like nature trying to mess up my learning algorithm. And that worst labeling is called Q star. And the next step is to find the best response. 
So assuming that this worst labeling is instead the true labeling, I just find the best theta for this Q star, this worst labeling. Okay? So I'm kind of solving the problem inside out. >>: So you do kinds of like [inaudible] when you -- between the steps? >> Umar Syed: No, not exactly. I'll describe what I'm going to do. But think of this as a high-level argument. So there's a couple of questions. First of all, the first question is, you know, what is this buying us? Why does turning the problem inside out help in any way? An even more basic question is why does this give us the right answer? In other words, why is the theta star here the same as the theta star for the original objective? Right. And so let me address that problem first. So first of all, is it clear what I mean by solving the problem inside out? Okay. And the reason why this works is that Q star, that worst possible labeling of the data, in fact uniquely determines theta star. The parameter that registered -- >>: As long [inaudible]. >> Umar Syed: Yes. That's right. Well, a bit more on that. Strongly convex. >>: Strongly convex. >> Umar Syed: Yeah. So here's exactly what you're talking about. So here's the objective for this second step of the algorithm. So I found my Q star. The worst labeling. And indeed theta star is the unique minimum of this objective. And this function is strongly convex because of the regularizer alpha that I've added to it. If I had not added that, it might be just convex. It could have flat regions. But due to strong convexity it's the unique minimum. But we have kind of a problem, which is that no algorithm is going to find the exact minimum but rather an approximate minimum. And at least in this picture, the true minimum and the approximate minimum are quite far apart, because when the function is just almost flat, you have this problem. 
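A minimal sketch of the two-step inside-out procedure described above, under several assumptions that are not in the talk: binary labels, a soft labeling q in [0, 1] per example, a regularizer that is zero inside per-example intervals [lo, hi] and infinite outside, log loss, and plain projected gradient ascent on q in place of the dual and exponentiated gradient approach the speaker mentions later. All function names and the toy data are hypothetical.

```python
import numpy as np


def log_losses(theta, X):
    """Per-example log loss if the label were 1, and if it were 0."""
    z = X @ theta
    return np.logaddexp(0.0, -z), np.logaddexp(0.0, z)


def objective(theta, q, X, alpha):
    """Expected log loss under soft labeling q plus (alpha / 2) * ||theta||^2."""
    loss1, loss0 = log_losses(theta, X)
    return np.mean(q * loss1 + (1 - q) * loss0) + 0.5 * alpha * theta @ theta


def best_response(q, X, alpha, steps=500, lr=0.5):
    """Best theta for a fixed labeling q, by gradient descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))     # predicted P(label = 1)
        theta -= lr * (X.T @ (p - q) / len(q) + alpha * theta)
    return theta


def worst_case_labeling(X, lo, hi, alpha, rounds=50, lr=1.0):
    """Step 1: ascend on q the value min over theta of the objective."""
    q = (lo + hi) / 2.0
    for _ in range(rounds):
        theta = best_response(q, X, alpha)
        loss1, loss0 = log_losses(theta, X)
        # The objective is linear in q, so this is its gradient in q;
        # clipping projects back onto the allowed intervals.
        q = np.clip(q + lr * (loss1 - loss0) / len(q), lo, hi)
    return q


# Toy data: four labeled points (lo == hi) and two unlabeled ones.
X = np.array([[1, 2.0], [1, 1.5], [1, -1.0], [1, -2.0], [1, 0.3], [1, -0.2]])
lo = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])
hi = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 1.0])
alpha = 0.1

q_star = worst_case_labeling(X, lo, hi, alpha)     # step 1: worst labeling
theta_star = best_response(q_star, X, alpha)       # step 2: best response to it
```

The alpha term plays the role described in the talk: it makes the second step strongly convex, so an approximate minimizer for q_star is close to the unique exact one.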
So what we do, and this is sort of done in more than one place, is that if you just increase alpha you make the function more strongly convex, and then you can be sure that any approximate minimum is going to be close to the true minimum. And so this is the effect there. And so this kind of illustration I hope explains the convergence guarantee we have for the algorithm, which is that if both steps of the algorithm, this inside-out approach, find epsilon approximate solutions to the problem, then the theta that's output by the algorithm is going to be this close to the parameter that you're after. And the precision of your answer is governed by how approximate the answers were and also this alpha. So you need that alpha to make the thing more curved so that the approximate answer is close to the true answer, and then you have to make epsilon at least that big to drive everything close. Go ahead. >>: You said you promised to do better than the subgradient minimization. Do you mean it was just faster or do you mean it found a better answer? >> Umar Syed: It just seemed to be faster. And I can expand a little bit on that. So if you were to just kind of do just regular subgradient minimization, what you would find essentially is that -- so this approach does, you know, sort of one step all at once, it finds the Q all at once and then finds the theta. And the subgradient approach would kind of alternate between those two things. It would make a step of progress in the Q direction, a step of progress in the theta direction, and it would just alternate between those two. And it seems, at least on the examples that we tried, that doing things all at once got you to the answer faster than doing a step of one first and then a step of the other. >>: You were saying that R often has many local minima of this [inaudible] R is convex. >> Umar Syed: It has many minima but not many local minima. >>: Oh, you're saying that they're contiguous or something? 
>> Umar Syed: Yeah, yeah. So all these [inaudible] that I give you are convex. >>: I mean so like your previous slide you're showing you like you know I mean as you change alpha you're going to get closer to the optimum. But only in a sense that is meaningful in terms of that regularizer. Right? Like I mean that theta, even though your [inaudible] data is kind of arbitrary, right, has nothing to do with the really decision of the problem and [inaudible] does that really mean anything? I guess it means something in terms of the particular expression. >> Umar Syed: Right, right. Yeah, I mean, so I would say a couple of things. One is that I would appeal to -- there's a number of sort of online learning algorithms that play the same trick, which is that their convergence guarantee is in terms of how strongly convex the objective is. So we have the same issue I guess. So that is one thing I would say. And the other thing I would say is that if you -- you know, if you have enough time, right? So there's two terms here. There's alpha and there's also epsilon which is how accurate each step is. So if you have enough time to get a very -- an order alpha approximate solution, then this distortion that's imposed by increasing alpha is not going to hurt you. Do you know what I mean? >>: So I can think of [inaudible] like algorithms that give you non-convex [inaudible]. >> Umar Syed: Yeah. >>: Yeah. So this is -- but you could still do epsilon subgradient in those cases, but I guess there's no fear that says how -- how accurate you will end up [inaudible]. >> Umar Syed: Right. Right. I mean definitely. And I think that would -- yeah. And that would still apply here. So I haven't even -- so this theorem applies to the meta algorithm, how I'm solving each of these steps is not ->>: [inaudible]. >> Umar Syed: Yeah. Go ahead. 
>>: So is the alpha and the epsilon, are they coupled in some way so if you make it steeper, the epsilon -- >> Umar Syed: No, they're -- >>: [inaudible] epsilon is an input in terms of added to the theorem. >> Umar Syed: Yes, yes. >>: But in practice is -- if you were to change the steepness of the R, then the effective epsilon you had before you changed the alpha would be different? >> Umar Syed: Yes. Yes. >>: And so these two things are kind of coupled in a kind of an interesting way? >> Umar Syed: Definitely. So they're independent parameters, but this theorem is telling you that they had better be coupled, like they had better be the same order if you want the performance. >>: So I guess the question is when you could get them to be useful [inaudible]. >> Umar Syed: Right. Right. So -- >>: One or two layers. >> Umar Syed: Right. So this epsilon is governing how approximate each step of the algorithm is. And typically that's just a matter of time. So the longer you run it -- I mean, I'll explain this in a moment, but each one is just an optimization. So the longer you run the optimization the more accurate it is going to be. Okay. Okay. So there are these two steps, right? And I said if you solve each step pretty well you get a good answer. But how do we solve each step? And it's not clear that we've won anything yet because I started with a minimax problem and step one is a maximin problem, so how is that any better? Well, what's interesting is for the loss functions that we're interested in, specifically log loss and hinge loss, we can turn this into something that's much more friendly. And what we do is we dualize basically. We take the dual of the inner minimization. And so this minimization, I'm going to take its dual. And after I take its dual, for log loss I get this maximization. And so this is the entropy of -- well, let me just say it. So this is basically very similar to maximizing log likelihood. 
So here -- and you know the dual of that is maximizing entropy. And so here I'm maximizing entropy subject to these expectation constraints on the features. But now I'm not matching the empirical distribution, I'm matching this worst case distribution Q. And this -- this objective has a nice form and sort of a well studied form, and we can use existing algorithms like for example exponentiated gradient descent to solve this problem sort of just right out of the box. So that's step one. We dualize the inner minimization and we just kind of use a standard approach to solving the problem. And then once I found the best -- the worst case labeling Q star, I just plug it into here and now this is just normal maximum likelihood. And I can use whatever I want, for example stochastic gradient descent to find the best parameter for this worst case labeling of the data. Okay. So now I'd just like to spend a little time talking about some experiments that illustrate the features of our approach. Okay. So we tried our algorithm on some image labeling tasks. These are both -- and we compared our algorithm, excuse me, on both binary and multi-class classification against some standard traditional semi-supervised learning approaches, and we did it using labeling noise that is both simulated and also some label noise that we got by submitting the data to labelers on Amazon Mechanical Turk. So for the first set of experiments we tried it on the binary classification task. We compared our algorithm to some semi-supervised learning variants of support vector machines. The first is Laplacian SVM and the second is transductive SVM. For our game algorithm, we used the regularizer that I had talked about earlier. For each example we give a set of possible labels that it might have plus a Laplacian regularizer. And in all of the experiments that I'm going to report, the algorithm uses log loss for its objective but in the results I'll show you accuracy. Yes? >>: [inaudible]. 
>> Umar Syed: That's right. Just one of the features. So the first kind of label noise that we wanted to test against was an unhelpful labeler. And an unhelpful labeler what we're trying to simulate is a person who erroneously but perhaps sincerely thinks that his best effort is labeling those examples that are exceptions to the general rule. So this is the example that you wanted, right? So as -- when labels are requested, they are labeled in the order of outliers first. More precisely in decreasing order of the number of neighbors who are of a different class. So this is one of the noise models, outliers first. It's a very confusing kind of noise. We tried this on a couple of different image -- datasets. The first is the Columbia Object Image Library. This is a dataset of household images. And the second is a dataset of EEG brain scans. These are both datasets part of a standard semi-supervised learning benchmark. And to give you an idea of what an outlier in these datasets is, here is the most outlierish example in the Object Image Library dataset. So these two classes are medicine cartons and toy cars. And on the right circled in red, although it doesn't look like it, this is actually a picture of a toy car. It just looks like a medicine carton because of the strange angle of the image. So this is the first outlier [inaudible]. >>: [inaudible] heavy encode [inaudible] in a car. [inaudible] algorithm ceases the [inaudible] function, right? >> Umar Syed: Oh, yeah. So I'm going to label say 10 -- I'm going to reveal the labels of 10 examples, and that will be a trial. And then I'll reveal the labels of, you know, 15 examples, and that will be a trial. And so on. And so I'll show you how the algorithms perform. >>: Algorithm just -- just with different offers? >> Umar Syed: Different offers, right. And I'll show you how they perform as more and more information is revealed. And what I'm telling you now is how that information is revealed. 
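[Editor's sketch: the "outliers first" ordering described above, where examples are labeled in decreasing order of the number of neighbors of a different class, might be simulated roughly as follows. The talk does not say how neighbors are defined, so the k-nearest-neighbor rule with Euclidean distance, the value of `k`, and the function name `outliers_first_order` are all assumptions made for illustration, not the authors' code.]

```python
import math

def outliers_first_order(points, labels, k=3):
    """Return example indices, most 'outlierish' first.

    Assumption: an example's neighbors are its k nearest points in
    Euclidean distance. The score is the number of those neighbors
    whose class differs from the example's own class.
    """
    scores = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        neighbors = others[:k]
        scores.append(sum(1 for j in neighbors if labels[j] != labels[i]))
    # Stable sort: ties keep their original index order.
    return sorted(range(len(points)), key=lambda i: -scores[i])

# Toy data: index 2 has class 1 but sits among class-0 points.
points = [(0,), (1,), (2,), (10,), (11,), (12,)]
labels = [0, 0, 1, 1, 1, 1]
order = outliers_first_order(points, labels, k=2)
```

[A trial that reveals n labels would then reveal the true labels of the first n indices in `order`.]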
So -- in the trial that reveals the least amount of information I just show you the outliers, and the trial that reveals somewhat more information I show you those plus a few more and plus a few more. >>: So you reveal a subset how [inaudible] then epsilon of R your best Q? So let's say in this example you just [inaudible] toy car. Right? So how is R defined. So for that being toy car. >> Umar Syed: Yes. >>: And that [inaudible] half of all possible R, that's the domain, and that's zero and then the other half is infinite, oh, because you're not lying. >> Umar Syed: I'm not lying, yeah. >>: The truth is in that -- >> Umar Syed: Yeah. I always reveal the correct label. >>: I see. >>: [inaudible]. >>: Okay. >> Umar Syed: Right. So this is kind of what the data was like. And so here -- okay. So here we are on the -- this is actually the brain-computer interface dataset. So on the X axis I have how much of the training set I have revealed or labeled. And again, this is in order of hardness. So the hardest first and then so on. And I can only point out this is a very challenging kind of noise. So this is why the accuracy levels are fairly low. But generally speaking the game algorithm does better than the other methods. Although as the amount of training set that is revealed increases their performance tends to converge. >>: [inaudible]. >> Umar Syed: I didn't -- not necessarily. I guess I didn't -- I didn't check explicitly but we didn't enforce that in any way. >>: [inaudible] do you know what the [inaudible] is on this with the -- what is the fraction -- >> Umar Syed: It's 50/50. >>: It is 50/50? >> Umar Syed: Yes. >>: Okay. >>: [inaudible] trials what changes? So like you have verifiers, right. >> Umar Syed: Yes. Oh, I mean so for example -- oh, so we are -- we are subsampling a fraction of the data, yeah. So that's where the noise comes from. >>: So you subsample then rank based on outlier [inaudible] -- >> Umar Syed: That's right. 
And we have something similar on the image library dataset. The gap is not as big here. But again I would point out that at the very low levels of the training set being labeled our algorithm does do somewhat better. >>: [inaudible]. >> Umar Syed: Excuse me? >>: [inaudible]. >> Umar Syed: Oh, yeah. You mean the examples that are labeled are just chosen uniformly. >>: Yes. >> Umar Syed: Not in this experiment but in a later one I do that. >>: [inaudible]. >> Umar Syed: Yeah. Yeah. For sure. >>: [inaudible] accuracy from a [inaudible] at least 90 percent. >> Umar Syed: Yes. So [inaudible]. >>: It's twisted. This is [inaudible]. [brief talking over]. >>: [inaudible]. >>: Yeah, that's right. This is almost but not quite lying [inaudible]. >>: If only choose the worst part of that because [inaudible] a much better [inaudible]. >>: It's not the [inaudible] boundary, right? [brief talking over]. >>: [inaudible] the other side of the boundary. >>: Yeah. It's errors. He's labeling almost like errors. >> Umar Syed: So okay. So I do show some results for uniform sampling on a different dataset so maybe [inaudible]. >>: In this presentation you mean? >> Umar Syed: In this presentation, yes, in just a little bit. So in the second set of experiments we tried it on a multi-class dataset. And here we compare to some semi-supervised algorithms that use -- try to infer the missing labels using a kind of maximum likelihood approach. And so here the kind of noise we wanted to test against was a labeler who is reluctant to label examples that seem to be ambiguous. So here the labeler is going to label examples on the border between two classes last. So as the number of examples whose labels are requested increases, the labeler examples -- labels examples in increasing order of the distance from the centroid of the class cluster. That's the precise definition that we're using here. But generally speaking this labeler is avoiding the border regions which are the harder examples. 
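[Editor's sketch: the "reluctant labeler" ordering just defined, labeling examples in increasing order of distance from the centroid of their class cluster, could be simulated along these lines. The centroid is taken here to be the per-class mean of the feature vectors and the distance to be Euclidean; both choices, and the name `centroid_first_order`, are assumptions for illustration only.]

```python
import math
from collections import defaultdict

def centroid_first_order(points, labels):
    """Return example indices, closest to their own class centroid first."""
    groups = defaultdict(list)
    for p, y in zip(points, labels):
        groups[y].append(p)
    # Assumption: the class centroid is the coordinate-wise mean.
    centroids = {
        y: tuple(sum(coord) / len(ps) for coord in zip(*ps))
        for y, ps in groups.items()
    }
    dist = [math.dist(p, centroids[y]) for p, y in zip(points, labels)]
    # Stable sort: ties keep their original index order.
    return sorted(range(len(points)), key=lambda i: dist[i])

# Toy data: index 2 is the class-0 point nearest the class-1 cluster.
points = [(0,), (1,), (5,), (10,), (11,)]
labels = [0, 0, 0, 1, 1]
order = centroid_first_order(points, labels)
```

[Under this ordering the example that drifts toward the other class (index 2 in the toy data) is the last one to receive a label.]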
So this is a guy who is reluctant to label hard examples. And we tried this on a dataset called Faces in the Wild. This is a dataset of facial photographs of public figures. The features are the eigenfaces. And so just to give you an example of what a border region between two classes looks like, these are the two classes in our dataset that had the largest border region by our definition. And Arnold Schwarzenegger and Gerhard -- this is the former German prime minister, Gerhard Schroder. And they kind of look alike, right? I think they do. Okay. So here are the results. And again this is a 10-class example. And again we're reporting accuracy. And I would say over the range of the fraction of the dataset that's labeled we're showing small but consistent improvement over the existing algorithms. >>: [inaudible] is more surprising to me. Like I mean it [inaudible] it will be interesting to [inaudible] results in that one too. Because there you know kind of like you're essentially giving it the good sort of labels, things that are squarely in the midst of the classes, right? And take the [inaudible] you hope would kind of grow out in the right sort of way. >> Umar Syed: Right. >>: What's your intuition [inaudible] so it's not really such an adversarial solution, right? >> Umar Syed: Right. Right. Right. Yeah, I think so -- I think what GAME is basically doing is hedging, right? If you think of like EM, right? So what the EM approach basically does is it's trying to infer the missing label. And EM can get caught in a local minimum. And what EM does is it doesn't hedge at all. It guesses sort of wildly -- I mean, I'm [inaudible] but it -- but, you know, by getting caught in the local minimum it's sort of being very optimistic. It's guessing some labels for the examples that are missing and then finding that, in fact, I can fit those quite well, whereas our algorithm hedges. 
If there's uncertainty about ->>: So you're using the EM against over confident [inaudible] treats the problem [inaudible]. >> Umar Syed: Exactly. And the way that I labeled it A is that the examples that are missing are exactly the ones that you should be hedging. >>: [inaudible]. >> Umar Syed: Yeah, well, you know, so a bunch of them are like synthetic, right? So like I don't know, at least three or four of them are like synthetic. And then we showed you two. >>: [inaudible]. >> Umar Syed: Yeah. >>: [inaudible]. >> Umar Syed: Which ones [inaudible]. >>: [inaudible]. >> Umar Syed: Okay. >>: That's very hard [inaudible]. >>: Is that the one that has a lot of classes? >>: Pardon me? >>: Does it have a lot of classes? >>: [inaudible]. >>: But a lot of features. >>: The [inaudible]. >> Umar Syed: Yeah. Right. So we saw the two datasets out of those eight real ones. We had some real practical considerations like something like too many features. I don't remember there being -- I don't know. I can't say that -- I don't remember the exact reasons why we selected the ones that we did. Okay. And finally we wanted to try it on some sort of label noise that comes from real people and not from simulation. So we selected from our Faces in the Wild dataset, we selected a set of photographs from -- for two celebrities who have a lot of photos in the dataset, Laura Bush and Jennifer Aniston. And rather than showing these pictures to the labelers one by one as is typically done in a labeling scenario, we wanted to simulate something more realistic, something more like the motivating examples I gave you in the beginning of the talk. I'm sorry, did you have a question? Something more like the motivating examples I gave at the beginning of the talk. So we actually showed them all the pictures simultaneously in a sort of tile format, which is here. Like this. And we said label any five images you like, exactly five images. 
And so here what we're trying to test is, you know, we want the users to select pictures in the context of seeing other pictures, and we want them to use their own biases and methods for selecting pictures to label whatever they like. >>: Can you tell them what the purpose of it was by giving -- I'm just curious because I mean you admit that earlier you had set this thing about like well, here's a user who, you know, thinks they're being helpful while they're getting these really hard ones or something, so I'm wondering if you tried to motivate them by saying hey you're helping a learning algorithm so do the right thing. >> Umar Syed: Right. Right. No, no, no. So this is the entire task. >>: I see so [inaudible]. >> Umar Syed: All the guidelines are there right. And -- >>: I'm really curious to see your analysis of how many people just pick the first five. >> Umar Syed: Oh, yeah. I'm going to show you that in a second. >>: All right. [laughter]. >> Umar Syed: Not quite that, but something like that. >>: Okay. >> Umar Syed: So this is the entire task. I think that you have to give it a title and I just said image labeling easy. That was the title. Easy meaning easy money. I was trying to attract people. [laughter]. So this is what it looks like. And so the first thing to note is that there is indeed strong bias in how people choose example -- pictures to label. So here I've ordered the pictures from most labeled to least. And the most labeled picture I think is about three times more likely to be labeled than if they were labeled at random. >>: [inaudible]. >> Umar Syed: This is not -- this is not presentation. >>: No, no, [inaudible]. >> Umar Syed: This -- the presentation order is fixed. >>: Is it fixed. >>: But is that picture number the presentation order? >> Umar Syed: No, it's not. >>: You've just sorted it. >> Umar Syed: I just sorted it. >>: So you know what the [inaudible] is with the -- >> Umar Syed: Yeah, I'll show you. 
>>: [inaudible] that picture chosen. >> Umar Syed: This is just arbitrary. [inaudible]. So this is to demonstrate that there are biases. And let me kind of pull out two kinds of biases. So the first is that Jennifer [inaudible] [laughter]. There were an equal number of each kind of picture, but Jennifer Aniston was much more likely to be labeled. And the second kind of bias is I think what you're getting at which is location on the webpage. So -- >>: [inaudible]. Stronger than that. >> Umar Syed: Well it might be influenced by the fact that the submit button is at the bottom of the page. So we think that people click on a bunch of pictures on the top. >>: [inaudible] labeled more than five. >> Umar Syed: Oh, so if someone -- that didn't usually happen, but if someone labeled more than five or they made a mistake I just discarded the trial. So I'm not interested in that. >>: [inaudible]. >>: I mean essentially 50/50. 50 percent is [inaudible] 50 percent not [inaudible]. >> Umar Syed: It seems like you're saying basically we think that people clicked on some things, scrolled to the bottom, clicked on some things and submit. Go ahead. >>: Did people mostly label one person or would they do both? Like would they do just pick out all the pictures of one person [inaudible]. [brief talking over]. >>: I mean it doesn't matter but I'm just curious. >> Umar Syed: So here's the performance if you -- if you run the SVM variants and our algorithm on this dataset. These are box plots over all the trials. So not only is our algorithm doing somewhat better on accuracy -- >>: [inaudible] one labeler here. >> Umar Syed: One labeler, yes. These are box plots. So this is giving you -- the whisker plots are the range -- the whiskers are the range and the top and bottom of the boxes are the 75th and 25th percentile of the accuracy. So we're getting somewhat better accuracy and also more tightly -- a tighter range. 
And maybe to address what you were saying earlier, what if this data was just labeled uniformly at random, right, rather than in this biased fashion? So we tried just labeling the data uniformly, then the advantage of our algorithm basically goes away. >>: [inaudible]. >> Umar Syed: Yeah. So we did some cross-validation. >>: [inaudible]. >>: [inaudible]. >> Umar Syed: I think it was normalized. When you say normalized, you mean -- so we take the distance and we exponentiate the distance, and then we have this scaling parameter that controls -- kind of controls the dynamic range. That's the kind of Laplacian [inaudible]. >>: [inaudible] divide by the square root [inaudible]. >>: [inaudible]. >>: Of each node. [brief talking over]. >>: Do you think that they're reduced variance [inaudible] or do you expect that or -- >> Umar Syed: So, I think it's intuitively meaningful, but I have no theoretical results to explain why. So the reason it's intuitively meaningful is that, you know, it's a -- for example, unlike the transductive SVM it's a convex problem. And so it's optimizing globally, whereas transductive SVM can get caught in a local minimum. So that's one -- that's a partial explanation. >>: [inaudible]. >> Umar Syed: Yeah. >>: [inaudible]. >> Umar Syed: Excuse me? >>: [inaudible]. [brief talking over]. >> Umar Syed: Okay. So that concludes the semi-supervised portion of the talk. If I have time, I don't know if I do, I can briefly talk about some of the work I've done on bandits. >>: [inaudible] but people may have to leave. >> Umar Syed: Okay. All right. So in my PhD and also my post-doc I've worked on some bandit algorithms for sponsored web search. I'll describe a couple of the projects that I've worked on just very briefly. The formulation of the problem is a contextual bandit problem. So what is a contextual bandit problem? Well, it's a problem -- a decision making problem that proceeds over several rounds. 
In each round one is supposed to choose one of several actions. And depending on the action that you choose and also an exogenous context you receive some payoff. So you can imagine walking into a casino and you're faced with all these slot machines. You choose one slot machine to play each round, and then this is where the analogy falls apart. There's some context as well: both the context and the choice of slot machine you made affect your payoff. So why -- so, you know, where -- >>: [inaudible]. >> Umar Syed: The context is fully observed. And this learning problem was formulated and was motivated by sponsored web search because this is an excellent model for sponsored web search. So in sponsored web search in this model the actions are the advertisements that you can choose to display to a user coming to a search engine. The context is the query that the user input to the search engine as well as anything else on the search results page. And the payoff, the expected payoff is the click-through rate, the probability that a user is going to click on the ad given that that's the query they searched for and everything else. Okay. So we -- we've addressed two kinds of challenges of these contextual bandit problems. The first is that in any realistic setting the number of actions or the number of contexts is going to be very, very large. Because that's the number of ads and the number of queries. And of course that's enormous. And the second challenge that we address is that click-through rates are usually not stationary but rather they can change over time. So I'll describe a solution for each of these problems briefly. So the click-through rate of any ad is a function of both the features of the ad and the features of the query. And so here we have, you know, a search, tickets to Seattle, and here's an ad, you know, flights to New York. So we want to estimate what is the click-through rate when this ad is presented in response to this query. 
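[Editor's sketch: the round-by-round protocol described above (observe a context, choose an action, receive a payoff only for the chosen action) can be written as a short loop. The learner below keeps one empirical click rate per (context, action) pair and acts greedily; it is a hypothetical baseline chosen only to make the protocol concrete, not the algorithm from the talk. The names `CountingPolicy`, `run_rounds`, and `click_prob` are illustrative assumptions.]

```python
import random
from collections import defaultdict

class CountingPolicy:
    """Greedy learner with one counter per (context, action) pair.

    Hypothetical baseline for illustration; it does not generalize
    across contexts or actions.
    """
    def __init__(self, actions):
        self.actions = list(actions)
        self.shows = defaultdict(int)
        self.clicks = defaultdict(int)

    def choose(self, context):
        # Try each action once in this context, then exploit.
        for a in self.actions:
            if self.shows[(context, a)] == 0:
                return a
        return max(
            self.actions,
            key=lambda a: self.clicks[(context, a)] / self.shows[(context, a)],
        )

    def update(self, context, action, reward):
        self.shows[(context, action)] += 1
        self.clicks[(context, action)] += reward

def run_rounds(policy, contexts, click_prob, seed=0):
    """Play one round per context; only the chosen action's payoff is seen."""
    rng = random.Random(seed)
    total = 0
    for context in contexts:
        action = policy.choose(context)
        reward = 1 if rng.random() < click_prob(context, action) else 0
        policy.update(context, action, reward)
        total += reward
    return total

# Toy run: in context "q", ad 1 is always clicked and ad 0 never is.
policy = CountingPolicy([0, 1])
total = run_rounds(policy, ["q"] * 5, lambda c, a: 1.0 if a == 1 else 0.0)
```

[In the toy run the learner spends one round on each ad and then keeps showing ad 1. Its table has one entry per distinct (query, ad) pair, so it grows with the number of queries times the number of ads.]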
And the basic problem is that we can't just use counts because the number of distinct ads and queries is enormous. So we have to do something to generalize. And so our approach, and I just have one slide on explaining our approach, is that the functional relationship between the click-through rate and how that depends on the ad and the query, decomposes in some nice way over the words that are in the ad and that are in the query. And so this graph. So we have this graph, and we place an edge -- and so we have, you know, a node in this graph for the words in the query and also for the words in the ad. And we place an edge between two words if we think the click-through rate depends jointly on those two -- on those two things. So for example the click-through rate is going to depend both on the fact that there was the word tickets in the query and the word flights in the ad and -- I'm sorry. Go ahead. >>: So are you saying this edge is an input, you are to determine where the edges are or the algorithm [inaudible]. >> Umar Syed: This is input [inaudible]. >>: Okay. Oh, I see. >> Umar Syed: So, yes, this is all prior information about the problem that is being encoded by us, the algorithm designer. >>: [inaudible] just say the edge exists. >> Umar Syed: Yes, that's right. You don't know the weight, you just know the edge exists. And likewise the fact that Seattle is in the query and New York City is in the ad has some effect on the click-through rate, probably has a down -- downward effect. And the fact that the word cheap is in the ad, this graph is indicating that that has some effect on the ad that's independent of what was in the query. >>: [inaudible] have multiple edges connecting them and cycles and so on? Do you assume that the decomposition is always just in terms of edges or -- if you have more complicated [inaudible]. >> Umar Syed: So -- so here's what I'm really assuming, which is that I assume that the function decomposes into a bunch of terms. 
Each term depends on only a few words. And then if two words appear in a term, I put an edge between them. >>: So you can have a full factor [inaudible] where each term would be a factor and [inaudible] a number of words from either -- >> Umar Syed: That's right. >>: [inaudible] so here it might as well be a full two sided graph, right? So -- right? You mean you would like to just be -- you would like to assume that you don't need -- you need to look at both terms like, you know, cheap flights. >> Umar Syed: Yes. >>: But cheap could be connected to any game term and so it could be all the links [inaudible]. >> Umar Syed: That's right. That's right. Any edge could be in that graph. And then the regret and the complexity of the algorithm will depend on how sparse the graph is. >>: [inaudible] assumption is mostly around the fact that there is two sided? >> Umar Syed: No. I just made it two sided to make it look good, but you could have edges between two things on the same side. But so you're going to have a graph, it could look like anything you'd like, and then the regret and complexity of the algorithm will depend on terms that perhaps you guys are familiar with, things like the tree width of the graph and the degree of the nodes in the graph or -- >>: Can the single thing still be involved as the settlement to the [inaudible] function as well. >> Umar Syed: No, they can be. >>: Oh, they can? >> Umar Syed: Yes. And so this is exactly the kind of assumption that we're making in graphical models by encoding independence assumptions about a distribution. But here we're -- >>: [inaudible]. >> Umar Syed: That's right. That's right. But you can -- so you can take your factor graph and then you can make it a graph by just adding -- you know, adding edges between everything in the same factor. And we like that representation better because our bounds are in terms of properties of that graph, not the factor graph. And so here I'm just repeating what I said a moment ago. 
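[Editor's sketch: the conversion described above, taking each term (factor) of the payoff function and adding an edge between every pair of words that appear in it, might look like the following. The word labels with `q:` and `ad:` prefixes, the example factors, and the helper names `factors_to_edges` and `max_degree` are assumptions for illustration. Tree width, which the talk also mentions, is not computed here.]

```python
from itertools import combinations

def factors_to_edges(factors):
    """Connect every pair of words that share a factor (clique per factor)."""
    edges = set()
    for factor in factors:
        for u, v in combinations(sorted(set(factor)), 2):
            edges.add((u, v))
    return edges

def max_degree(edges):
    """Largest number of edges touching any single word."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return max(degree.values(), default=0)

# Hypothetical factors loosely following the talk's example.
factors = [
    ("q:tickets", "ad:flights"),         # joint effect of query word and ad word
    ("q:seattle", "ad:new york city"),
    ("ad:cheap",),                       # single-word term: adds no edge
]
edges = factors_to_edges(factors)
```

[A factor over three words, such as `("ad:cheap", "ad:flights", "ad:new york city")`, would contribute a triangle of three edges.]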
So we have an algorithm that exploits this structure of the payoff function and this algorithm will have -- it is a contextual bandit problem, an algorithm for selecting an action to take in every round. And both the running time and the regret of this algorithm are going to be polynomial as long as this graph structure is sparse. And more precisely it has low tree width and low degree. Go ahead. >>: I'm sorry to keep asking all [inaudible] the graph thing. I want to understand if I understood what they were saying correctly. Can you go back one? >> Umar Syed: Yeah, you bet. >>: Can you have a factor like if I think it's important to have cheap flights and New York City all at the same time or not at all, I can express that in your [inaudible]. >> Umar Syed: You can. You can. >>: And how would you split that out into -- >> Umar Syed: So that would be a [inaudible]. >>: That would be a [inaudible]. Right. So you're allowed to okay. >>: [inaudible]. >> Umar Syed: Okay. And so that's all I was going to say about that. This is obviously kind of a quick tour. >>: [inaudible] result. You proved polynomial on tree width; is that right, empirical result or something -- >> Umar Syed: [inaudible]. >>: Okay. >> Umar Syed: But to the best of our knowledge, this is the sort of first end-to-end algorithm for contextual bandit problems where the number of contexts and the number of actions can be very large. And by end-to-end I mean that we don't make any assumptions about access to an oracle that can solve some problem that maybe we can't solve. This is sort of a full solution and exploits the structure of the graph. >>: [inaudible] right, you can have arbitrary properties of either and just define the [inaudible] any way you want? >> Umar Syed: Sure. You bet. >>: Okay. 
>> Umar Syed: So the second kind of problem that we wanted to deal with in contextual bandits is that these click-through rates typically are not fixed over time, they change, right, the rate at which people click on things changes over time. And in the study of bandit algorithms, when you're talking about the payoffs changing over time, there's usually two kinds of extreme forms of assumptions that people make. One is stationary, that these payoffs or these click-through rates are fixed for the entire duration of the interaction. And the second is the other extreme which is that these click-through rates can change every single round, and they can change even adversarially, right? And so for us, these assumptions are either too weak or too strong. And in our view it seemed like a more realistic assumption is something that's kind of in the middle which is that for many queries click-through rates change -- they can change abruptly but only rarely. So let me give you kind of an example. So before October 2009 the query balloon boy, if you were to input it to any search engine would give you a company called balloon boys which makes rooftop inflatable advertisements. You've probably seen these things. But in October 2009, if you guys were watching the news around that time, there was this kid in Colorado I think who went up -- who went up into the sky on this sort of handmade balloon. >>: [inaudible]. >> Umar Syed: Or no, he didn't. Right. [laughter]. We thought he did. >>: This is [inaudible] half of the change, right, but it seems like there's a -- like the huge other factor which is seasonal effects and some things where a ton of queries change slowly but they're all changing every day. >> Umar Syed: Yeah, you're right. You're right. So this -- this work is focused on the kinds of queries that change abruptly and suddenly, not the ones that follow a kind of rhythm. You're right. This is not well suited to that. 
>>: You can apply what you did to that or is it -- it's really targeted towards the [inaudible]. >> Umar Syed: We did. And we found it didn't work so well for the cyclical. >>: Okay. >> Umar Syed: It's really designed for [inaudible]. All right. So what's interesting is that -- so the meaning of this query changed very abruptly and therefore the associated click-through rates changed very abruptly. But this change was not like -- the change was associated with all kinds of other signals that were indicating that a change happened; for example the volume of the query spiked, the occurrence of the query in news articles spiked. And so there were lots of indications that something had happened to the query. And so our approach -- and this is work I did while I was at Search Labs in Silicon Valley for one summer, which is that we combined an online prediction algorithm with a bandit algorithm. And so the prediction algorithm uses these contextual signals like the volume of the query, its occurrence in news articles and so forth to predict whether or not the meaning of the query has recently changed and, therefore, whether its associated click-through rates might have dramatically changed. And then that prediction algorithm outputs a prediction. And then our bandit algorithm takes that prediction as input and decides whether or not it should modify its behavior accordingly. >>: [inaudible] the optimal [inaudible] when looking [inaudible] computer program? >> Umar Syed: Yes, this is the algorithm that knows the option that has the highest click-through rate, and it just offers the best [inaudible]. >>: But it's even sort of super optimal, it's not even assuming that it's a constant [inaudible]. >> Umar Syed: Right, right. This is not the best single ad in hindsight or anything, this is every round. >>: [inaudible]. >> Umar Syed: Yes. Well, so it knows -- so there's all these options, and they each have some click-through rate. It knows which is the best one. 
>>: No, but on what period [inaudible] if you [inaudible]. >> Umar Syed: Oh, I see. So [inaudible] granularity is [inaudible]. >>: Oh. >> Umar Syed: So we know that today's [inaudible]. >>: Why wouldn't the best one in hindsight be any different than [inaudible]. >> Umar Syed: Because the best one is changing in this experiment. >>: But if you're only giving an answer for that day's -- >> Umar Syed: Yeah, yeah, we're only giving it for that day. But I think John was asking are we [inaudible]. >>: [inaudible] it's usually the best constant algorithm, so you would average the whole click-through, the whole dataset but that's not [inaudible]. >> Umar Syed: [inaudible]. >>: Okay. >> Umar Syed: We [inaudible] anyway, we find that using this online classifier [inaudible] and then modifying the bandit algorithm to use this information and take advantage of it can result in some dramatic improvements in the regret. And that's all I had. Thank you very much. [applause]