>> Dengyong Zhou: All right. Today we're hosting Abhradeep Guha Thakurta from Penn State. Abhradeep has done some really interesting work on differential privacy. He just had two papers in COLT, which is really, really awesome. And he's going to tell us the grand view of all of this work today.

>> Abhradeep Guha Thakurta: Good morning. I'm Abhradeep Guha Thakurta from Penn State, and I'm going to talk about differentially private empirical risk minimization and high-dimensional regression. It's joint work with Prateek Jain from MSR India, Daniel Kifer from Penn State, Pravesh Kothari from UT Austin, and Adam Smith from Penn State. This talk is essentially a merging of two different papers I had in COLT, and I'll try to give a high-level view of them.

The primary intention of this work was to design learning algorithms with rigorous privacy guarantees. Before going forward, let's try to understand what kind of privacy guarantees we are looking at. Suppose you have a dataset of a bunch of individuals. You can think of each entry in the dataset as some kind of sensitive information about an individual. For example, it might contain the results of a bunch of medical tests and a bit which tells whether the person has AIDS or not. And you have an algorithm sitting in front of this dataset which is releasing some statistic about the dataset. For example, it might release a classifier to the external world to predict whether a person will have AIDS or not. Now, you might think of this summary statistic as going to an adversary who has the intention of breaching the privacy of the dataset. By breach of privacy I mean he might want to extract specific information about a particular individual in the dataset.

So thinking more about it, there are essentially two conflicting goals here. First, you want to release these kinds of statistics to the external world, because they're just useful. On the other hand, there is this privacy concern: releasing the statistic might reveal information about an individual, which is clearly an issue. And it so happens that this is a pretty difficult problem. People have designed systems with ad hoc notions of privacy, and there have been a lot of high-profile breaches against those systems. To name some of them: there was the work by Arvind Narayanan and Vitaly Shmatikov where they deanonymized the Netflix Prize database, which Netflix had claimed was anonymized and privacy-preserving. There was the work by Aleksandra Korolova in 2011 where she showed vulnerabilities in the Facebook advertisement system. And there was work at Oakland last year where people showed privacy breaches against recommendation systems. So the question really arises: what kind of privacy guarantees do these ad hoc systems give, and do we need to significantly rethink the problem?

So in the last three to four years, people have been trying to design algorithms with rigorous privacy guarantees. This has been an active area of research spanning almost all of computer science: databases, learning theory, programming languages, cryptography, algorithms, just to name a few. In this talk I'll narrow my scope down a bit. For learning algorithms, I'll specifically talk about empirical risk minimizers.
For the privacy guarantee I'll use this notion called differential privacy, which has been reasonably successful in the recent past; it is one of the rigorous notions of privacy. Just to get a hang of it: in empirical risk minimization you have a dataset of n points, and you are trying to get a model parameter which minimizes an empirical risk. Now, what is an empirical risk? For each data point, for a given model parameter, you have a loss function which gives you the risk for that data point, and the empirical risk is the average of this risk over all the data points. For the purposes of this talk we'll think of the loss function as convex in its first parameter; this is just for technical reasons. The empirical risk minimizer is the model parameter which minimizes the empirical risk. Additionally, you add a term there, a regularizer, to incorporate your prior belief about the minimizer. For example, if you think the minimizer you're trying to learn has very few nonzero entries, you might put an L1 regularizer here. It's essentially encoding the prior belief into the system.

Throughout the talk, I'll basically be making references to the specific form of the linear regression problem. It is one of the [inaudible] problems, which is pretty common. But the empirical risk minimizers I'll be telling you about cover all the usual situations: [inaudible], logistic regression, median, support vector machines, et cetera.

Now, having defined empirical risk minimization, the question arises: what kind of privacy threats does it pose? Is it even worthwhile to investigate designing private algorithms for empirical risk minimization? Consider this very simple toy example. You have a dataset of five points: minus five, minus two, minus one, two, and five. It is trivial to see that minus one is the median of this dataset. Now, you can write the problem of finding the median as an empirical risk minimization problem: it is essentially minimizing, over the parameter theta, the sum of the absolute distances from each of the data points. If you plot the graph of this function, it looks like a piecewise-linear bowl: the slopes go minus five, minus three, minus one, plus one, plus three, plus five, and the median is the point where the slope changes sign.

>>: Aren't you actually computing the average?

>> Abhradeep Guha Thakurta: No, I'm computing the median here. If you plot this graph, this is the median. The average would be minimizing the sum of (theta minus d_i) squared; that is the average. So thinking about it: the median is an empirical risk minimization problem, and it reveals one data point exactly. If your purpose was to protect the privacy of that data point, this is not good news. And if this seems like a trivial problem because I'm dealing with one dimension, let me give you a more concrete problem: support vector machines. Suppose you're solving support vector machines in the dual form, where you output the support vectors. Think about it: the support vectors you output are essentially the data points on the margin, and these support vectors are reasonably high-dimensional. So you're actually revealing a lot of information about individuals. So with these kinds of issues with empirical risk minimization, the question arises: what can you do about it?
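To pin down the objects in play, here is the empirical risk minimization problem in symbols, transcribing the definitions above (lambda denotes the regularization weight):

\[ \hat{\theta} \;=\; \operatorname*{argmin}_{\theta \in \mathcal{C}} \; \frac{1}{n}\sum_{i=1}^{n} \ell(\theta; d_i) \;+\; \lambda\, r(\theta). \]

For the median example, \( \ell(\theta; d) = |\theta - d| \) and \( r \equiv 0 \), so on the dataset \( \{-5, -2, -1, 2, 5\} \) the minimizer is \( \hat{\theta} = -1 \) -- exactly one of the data points.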
So these are the problems, and we need to mitigate them. To that end, we use this notion called differential privacy. It was initially proposed by Dwork, McSherry, Nissim, and Smith in 2006, and there was a follow-up by Dwork et al., also in 2006. At a high level, what this definition tells you is: suppose I'm releasing an interesting statistic about a dataset; the only privacy breach the statistic can inflict on you is through the event of you being in the dataset or not. To be more concrete, suppose you have two datasets D and D prime. The third entry is your entry, and in the other dataset your entry has been changed to something else. The algorithm runs either on D or on D prime, and the output is given to an adversary. The requirement is that the adversary cannot decide whether the algorithm was running on dataset D or on D prime.

To formalize this: a randomized algorithm A is (epsilon, delta)-differentially private -- epsilon and delta are the privacy parameters, and the lower their values, the better the privacy guarantee -- if for all datasets D and D prime that differ in one element, and for all possible sets of answers S, the probability that the algorithm running on D produces something in S and the probability that the algorithm running on D prime produces something in S are close, up to a multiplicative factor of e to the epsilon and an additive delta. If you have seen the definition before, good; if you haven't, that's fine. You don't need to remember the details of the definition. The only thing you need to keep in mind is that epsilon and delta are privacy parameters, and the lower their values, the better. Any questions so far?

>>: Go back further in the slides. I don't understand why [inaudible] SVM reduces the privacy of some people.

>> Abhradeep Guha Thakurta: So if you're running SVM in the dual form, you're outputting the support vectors. What is a support vector? Support vectors are feature vectors -- some of the actual feature vectors. And these feature vectors can be individuals' information. For example, in the initial setting I told you about, it might be a bunch of medical tests which you have done. And the final thing is [inaudible].

>>: Coming from the support vectors.

>> Abhradeep Guha Thakurta: Yes.

>>: I see.

>> Abhradeep Guha Thakurta: So that is the definition of differential privacy. Let me try to motivate the definition a bit -- why we have chosen this definition. Why is it a good definition? One of the key features of differential privacy is this property called composition. What composition tells us is: suppose you have a dataset and I'm running two algorithms on it. Algorithm one guarantees epsilon-delta differential privacy, and algorithm two, which I also run on the dataset, likewise guarantees epsilon-delta differential privacy. Each of these algorithms reveals something about the dataset -- neither gives out zero information. Now the question arises: when I put these two pieces of information together, what is the privacy guarantee I have? Differential privacy trivially tells you that you will have (two epsilon, two delta) privacy. To my knowledge this is the only rigorous definition which guarantees composition so easily. For other rigorous definitions of privacy, like [inaudible] privacy or noiseless database privacy, getting a composition result of this clean form is very difficult.
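For reference, here is the definition in symbols, transcribing what was just said. A randomized algorithm A is \( (\varepsilon, \delta) \)-differentially private if for all datasets D, D' differing in one element, and for all sets of outputs S,

\[ \Pr[A(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[A(D') \in S] \;+\; \delta. \]

And composition: if \( A_1 \) is \( (\varepsilon_1, \delta_1) \)-differentially private and \( A_2 \) is \( (\varepsilon_2, \delta_2) \)-differentially private, then releasing both outputs is \( (\varepsilon_1 + \varepsilon_2, \delta_1 + \delta_2) \)-differentially private; with equal parameters this is exactly the \( (2\varepsilon, 2\delta) \) guarantee mentioned above.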
So this is one of the key features of differential privacy. Additionally, it gives you guarantees against arbitrary auxiliary information. If you remember, in the definition I never mentioned anything about auxiliary information; I assume the adversary might know everything except your own entry. However, let me also say what it does not imply -- recently there have been a lot of works where people misinterpret or overinterpret this definition. It does not protect you against the information an adversary can learn from the [inaudible] statistic itself. For example, suppose someone runs a survey which concludes that people who smoke get cancer. Now, someone sees you outside smoking, and from the survey result that smoking causes cancer, the person who sees you smoking might have a significantly higher belief that you have cancer. But this is independent of the fact of you being in the dataset or not. This might be a privacy breach, or it might not be, but differential privacy does not protect you against it. And this distinction is pretty important.

>>: That means differential privacy cannot prevent you from getting the auxiliary information. Is that what this means?

>> Abhradeep Guha Thakurta: This means that if the public information that is out there breaches your privacy, differential privacy does not protect you against that. What it protects you against is the computation which the system is doing for releasing that information -- that computation is indifferent to you being in there or not. So, yeah. So this talk is essentially about empirical risk minimization and how you design differentially private algorithms for empirical risk minimization.

I'll be segregating the talk into two parts: one in the low dimension, the other in the high dimension. By low dimension I mean the number of sample points you have is much larger than the dimensionality of the model parameter -- the theta I'm trying to learn has dimensionality smaller than the number of samples. The other setting I'll talk about is the high dimension, where the dimensionality of the problem is much larger, maybe exponential in the number of sample points you have.

In the low dimension I'll initially talk about some of the existing work on output perturbation. This was initially proposed by Chaudhuri et al. in 2011, and there was a follow-up by Rubinstein et al. in 2011. Then I'll talk about another technique called objective perturbation, which is in the same 2011 paper; in one of our recent papers we make a significant improvement with respect to objective perturbation. The third thing is using online learning techniques for guaranteeing differential privacy; this is work by Prateek Jain, Pravesh Kothari, and me. Then I want to compare all three approaches and see which one is better. Not that I have a very good answer right now for how these pieces fit together, but there are some initial directions I can point to about which one is better and why.

>>: So along with the output and the objective, isn't there also the possibility of perturbing the data points themselves -- to create false data that's in --

>> Abhradeep Guha Thakurta: Sure. Yeah. The reason I didn't put it there was in some sense intentional.
What are you doing when you perturb the data points? Differential privacy in some sense ensures that one data point is hidden from the output. Now, if you want to perturb all the data points up front, without looking at what algorithm you're running, that essentially means the effect of every data point is almost gone. For example, take this example of averaging a bunch of numbers, and say the numbers are bits. The average can change by at most 1/n when you change one of the entries, so in the output you'd be adding noise on the order of 1/n. If instead you're adding noise to the inputs, you're adding noise on the order of one to each input; when you take the average, the errors cancel out only down to about 1/root n. So input perturbation leaves a significantly higher error, 1/root n instead of 1/n.

>>: But you could also find the solution -- in your support vector example, find the solution, and then perturb data points in the neighborhood and find something which comes close --

>> Abhradeep Guha Thakurta: Yes, that approach is somewhat related to output perturbation; it has a similar flavor. Output perturbation essentially uses the structure of the problem a bit.

So, in the high-dimensional setting, the dimensionality of the problem is much larger than the number of samples you have. These kinds of problems are really hard to address, because even nonprivately they're very hard to address: the system remains extremely underdetermined [inaudible] with that number of samples. What we do is give private algorithms for sparse regression. Sparse regression means the underlying model parameter you're trying to learn from the high-dimensional problem has very few nonzero entries. Then you can do something interesting, and I'll tell you how.

The first two bullets are mostly for batch empirical risk minimization, where you have all the sample points in one go. Now, suppose you're dealing with advertisement systems, or any kind of system which changes over time, so you do not have the complete dataset up front. That brings us to online empirical risk minimization, where the data points arrive online. The challenge here is much different from the batch setting: in the batch setting you output essentially one thing, but online, every time a data point comes you have to give some output. The number of outputs is almost equal to the number of inputs, and there's a significant challenge in terms of protecting the privacy. There we give a very generic technique for designing differentially private online learning. This is our work in the other paper.

So, coming to batch empirical risk minimization in the low dimension, there are three algorithms I'll talk about: output perturbation, objective perturbation, and an online-convex-programming-based one, and I will compare the three. Output perturbation was proposed by Chaudhuri et al. The idea is: first you sample a random vector b from a Gamma distribution -- a mean-zero distribution whose scale is on the order of 1/epsilon, where epsilon is the privacy parameter. Then you take the objective function of the original risk minimization problem, find its minimizer, and add this noise to it.
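To make the input-versus-output perturbation comparison from the discussion concrete, here is a minimal sketch for the bit-averaging example. It is illustrative only: Laplace noise is one standard choice for pure epsilon-privacy and is an assumption here, not the Gamma mechanism of the actual algorithm.

import numpy as np

rng = np.random.default_rng(0)
eps = 0.5
bits = rng.integers(0, 2, size=10_000)  # dataset of n bits
n = len(bits)

# Output perturbation: the mean changes by at most 1/n when one entry
# changes, so Laplace noise of scale 1/(eps*n) suffices for eps-privacy.
private_mean_output = bits.mean() + rng.laplace(scale=1.0 / (eps * n))

# Input perturbation: noise the inputs up front, before seeing the query;
# each entry needs constant-scale noise, and averaging only shrinks the
# total error to about 1/sqrt(n) -- much worse than 1/n.
noisy_bits = bits + rng.laplace(scale=1.0 / eps, size=n)
private_mean_input = noisy_bits.mean()

print(abs(private_mean_output - bits.mean()))  # error ~ 1/(eps*n)
print(abs(private_mean_input - bits.mean()))   # error ~ 1/(eps*sqrt(n))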
So if you look at the picture: this is your theta hat, where the gradient is zero, and you add noise to it. The claim is that the result is differentially private. For full disclosure, the actual algorithm is slightly different and slightly more tricky, but for the purposes of this talk I think this is fine: basically, find the minimizer and add noise to it.

As for the privacy guarantees: this algorithm is (epsilon, 0)-differentially private, so the delta term is zero. In terms of utility, the price you pay for privacy -- comparing the empirical risk without the privacy requirement to the risk with privacy -- is around L times the norm of theta times p log p / (epsilon n). Here L is a bound on the gradient of the loss (a Lipschitz constant for the loss functions) and the norm of theta is its L2 norm. You can ignore those two parameters for the purposes of this talk, because all the guarantees I will show have that factor in common. What you really need to compare is p log p / (epsilon n): a dependence of p log p on the dimensionality p, and 1/n on the number of samples.

Great. One issue with this kind of approach is that the privacy argument requires you to reach the true optimum of the objective function; privacy is guaranteed only for the true optimum. In practice you might not reach it, and then the guarantees do not hold -- the analysis doesn't go through. This is a significant problem: in practice, what does it even mean to guarantee privacy in these kinds of settings? Over a sequence of slides I'll show you how we mitigate this issue.

The other algorithm is objective perturbation. I'll refer to the original version as the objective perturbation of CMS, because we make an improvement over it; this is what CMS did. Again, sample a random vector b from a Gamma distribution. This time, instead of adding noise to the output, you add a random linear term to the objective function. You can think of it as taking the objective function and randomly tilting it. Alternatively: this was the original theta hat for the loss plus the regularizer; what you are doing here is finding a model parameter where the gradient of the objective function is equal to minus b. If you are conversant with convex duality, this is essentially working with the dual [inaudible]. But we will not go there. And this is a simplistic view I'm giving you, because you need to choose the regularizer in a proper manner to make things work, but that's fine.

Now, the existing result on objective perturbation required the space you're optimizing over to be unconstrained -- you cannot put hard constraints on the set; it has to be all of R^p. This rules out problems like linear regression for technical reasons, because to bound certain parameters for privacy you need to work with a constrained set in the case of linear regression. Additionally, they required the regularizer to be smooth, meaning twice differentiable -- at least the analysis required it. This essentially rules out the most common regularizers, like L1 regularization or the nuclear norm, et cetera. So what we do is allow convex constraints and nondifferentiable regularizers; a sketch of the basic mechanism is below.
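Here is a minimal sketch of the objective-perturbation idea for regularized logistic regression, assuming scipy's L-BFGS solver, a Gaussian noise vector (in the spirit of our variant), and a simplified noise scale; the exact noise calibration and the conditions on the regularizer in the actual privacy proof are more delicate than this.

import numpy as np
from scipy.optimize import minimize

def objective_perturbation(X, y, eps, lam, rng):
    """Sketch of objective perturbation for logistic regression.

    Minimizes  (1/n) * sum_i log(1 + exp(-y_i x_i^T theta))
             + (lam/2) * ||theta||^2 + (1/n) * b^T theta,
    where b is a random vector with scale ~ 1/eps. The noise scale and
    distribution here are simplified placeholders, not the exact ones
    required by the privacy proof.
    """
    n, p = X.shape
    b = rng.normal(scale=np.sqrt(p) / eps, size=p)  # assumed Gaussian variant

    def f(theta):
        margins = y * (X @ theta)                    # y_i in {-1, +1}
        loss = np.logaddexp(0.0, -margins).mean()    # logistic loss
        return loss + 0.5 * lam * (theta @ theta) + (b @ theta) / n

    return minimize(f, np.zeros(p), method="L-BFGS-B").x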
To handle constrained optimization and nondifferentiable regularizers, we actually needed a significantly different approach from what had been analyzed in the past, and I'll tell you what this approach is. The second contribution: the original analysis used Gamma noise, and we noticed that if we use Gaussian noise instead -- Gaussian noise is more tightly concentrated around its mean -- we can get a factor of root p improvement in the error. So where it was p log p earlier, we get root p. And another thing we noticed is that objective perturbation somehow respects the structure of the problem more: for nice datasets we can give a tighter analysis, which saves another factor of root p in the error.

>>: [inaudible].

>> Abhradeep Guha Thakurta: Sure.

>>: I don't understand the question. It seems like if you're using linear regression instead of SVM regression, wouldn't it be inherently private, because you're not exposing any data features, any specific --

>> Abhradeep Guha Thakurta: The kind of privacy guarantee we are working with is a probabilistic guarantee. The first point is that no deterministic algorithm can satisfy differential privacy; you need randomization. The point is this: suppose my algorithm running on database D produces an output O. If I change D to D prime, then since my algorithm depends on the dataset, any nontrivial deterministic algorithm would change its output. So the probabilities would be like one output having probability zero and the other probability one, and that does not give us any finite epsilon. And coming to this problem of linear regression: the specific reason we need the constraint set is that if you take the gradient of the loss for linear regression, the gradient contains the model parameter theta -- the gradient is like X transpose y minus X transpose X theta -- and for the privacy guarantee we need to bound the gradient. You cannot bound the gradient unless you have a bound on the constraint set. For logistic regression this is not the case, because the derivative of the logistic loss is essentially the logistic function, and that is bounded for all inputs.

So, yeah, these are our contributions. Coming to the guarantees: we get (epsilon, delta)-differential privacy, and in the utility guarantee we now lose the factor of log p -- the log p goes away -- and instead of p we get a root p here. But we pay a price in terms of privacy, with the delta showing up: the delta is no longer zero, and a log(1/delta) factor appears, which is fine. And as I told you, we can make further improvements in the dimensionality dependence for nice datasets: where the dataset has some nice properties, we can get a better utility guarantee and further reduce the dependence on the dimensionality of the problem.

Coming back to the contributions, I'll not talk about those two parts. What I will tell you is how you relax the smoothness requirement, and this is in some sense the most interesting part of the work. Recall that if the regularizer is not differentiable, your objective function looks something like this; let's say this is the point of nondifferentiability. And to work with constraints, you have a constraint set here -- this is just denoting the constraint. The objective function can extend outside the constraint set.
What you do is: for the regularizer, we construct a sequence of smooth approximations. I'll tell you how you get them. So for the nondifferentiable regularizer, I get a sequence of approximations, each of which is differentiable, and in the limit the sequence converges to the actual regularizer. With this sequence of approximations, the objective function will look like this, because you have a smooth bend out here, and slowly, as i goes to infinity, it starts looking like the original objective function. Additionally, we add a convex penalty term which penalizes the objective heavily when you go out of the constraint set.

There were two significant challenges in constructing this approximation; it took a reasonable amount of time to find the right way of doing it. First, we needed to keep the strong convexity property of the objective unchanged -- the algorithm relies on some form of strong convexity, and we need to preserve it, otherwise the privacy guarantees take a hit. The second challenge was that we needed to find the right notion of convergence.

So let me tell you how we do the smooth approximation so that both of these challenges are satisfied. The first thing we do is take the regularizer and convolve it with a smooth kernel -- think of convolving with a Gaussian kernel -- and we reduce the width of the kernel as we go along. A standard property of convolution tells me that if my kernel is twice differentiable, or infinitely differentiable, the convolution is also differentiable. And even though the regularizer is nondifferentiable, in the limit the convolution converges to it: as you shrink the width of the kernel, in the limit you get a delta function, and convolving with a delta function gives back the function itself. So in the limit you recover the actual regularizer. And we add a convex penalty term F -- trust me, it is convex -- which makes you pay if you go outside the set; as i goes to infinity, we grow the weight on the penalty, so it penalizes going out of the convex set more and more heavily.

Coming to the second challenge, we needed the right notion of convergence. You can think of this algorithm as some kind of function which takes two parameters: the randomness -- the noise -- and the dataset D. What I'm doing is approximating this function with a sequence F_1, ..., F_k. One of the classic ways of proving this kind of result is via convergence in distribution: showing that the sequence of algorithms converges in distribution as i tends to infinity. But that does not work here: if you take, say, L1 regularization, then depending on the noise b there is no density function for the output, because the point of nondifferentiability carries a probability mass. So convergence in distribution doesn't hold. What we prove instead is weak convergence in measure. Here we need to recall differential privacy a bit: in differential privacy we give guarantees over probability measures of output sets.
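In symbols, the construction looks roughly like this (a sketch, with a Gaussian kernel as one concrete choice of smoother):

\[ r_i \;=\; r * k_{\sigma_i}, \qquad r_i(\theta) = \int r(\theta - z)\, k_{\sigma_i}(z)\, dz, \qquad \sigma_i \to 0, \]

\[ J_i(\theta) \;=\; \frac{1}{n}\sum_{j=1}^{n} \ell(\theta; d_j) \;+\; \frac{1}{n}\, b^{\top}\theta \;+\; \lambda\, r_i(\theta) \;+\; c_i\, F(\theta), \]

where \( k_\sigma \) is a smooth density of width \( \sigma \), F is the convex penalty that vanishes on the constraint set and is positive outside it, and \( c_i \to \infty \). Each \( J_i \) is differentiable and strongly convex, \( r_i \to r \) pointwise, and the claim to be established is that the output distributions of the approximating algorithms converge, in the weak sense discussed next, to that of the limiting algorithm.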
What weak convergence in measure buys you is this; roughly, the argument goes as follows. Suppose my limiting algorithm -- the limit of the F_i as i tends to infinity -- were not differentially private. But I know that with the smooth approximations, the whole sequence of algorithms is differentially private. The argument is: if the limit is not differentially private, then there must be an algorithm in the sequence which is not differentially private. That gives the contradiction, because we have already argued that every algorithm in the sequence is differentially private, since it uses smooth approximations. Questions? It's important to digest this part.

So next we show a sequential closure property of differential privacy. This is a theorem which might be of independent interest. What it says is: suppose you have a sequence of algorithms A_1, A_2, and so on, all of which are (epsilon, delta)-differentially private, and suppose there exists an algorithm A such that for all datasets D, the sequence converges weakly to A -- when I say weakly, it converges pointwise over output sets. Then, given this pointwise convergence of the algorithms to A, and given that each algorithm in the sequence is differentially private, the limiting algorithm A is also (epsilon, delta)-differentially private. So if A_1, A_2, and so on converge in the limit to A, the complete block is differentially private. This is one of the theorems we used for proving the property above.

Now, after having talked about output perturbation, where you perturb the output, and objective perturbation, where you tilt the objective function and solve, let me give you a significantly different approach from the other two. This is based on online convex programming: we'll be using online learning techniques to solve this problem. Just to give some background on what online convex programming is: you can think of it as the following setting. There's an n-round game between two parties, a player and a challenger. In each round, the player chooses a point theta_i, which lies in some convex set. The challenger, having seen the point the player has chosen, gives back a cost function l_i, and the player has to pay the cost l_i(theta_i). So this is a player-versus-challenger kind of game, and the player wants to perform well -- he doesn't want to pay too much cost. An online convex programming algorithm is one that gives the player a strategy for choosing these points. The property you want from the online convex programming algorithm is that the regret goes down to zero as you play more and more. The idea of regret is this: theta_i is the parameter the player outputs at round i based on the algorithm, so you take the average cost the player paid, minus the following quantity: suppose someone knew all the cost functions at once; what is the best fixed theta he could have chosen? This is called the offline best. You want to perform well against the offline best -- as you see more and more samples, you want to get closer and closer to it. The regret is written out below.
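Transcribing that definition, the (average) regret after n rounds is

\[ R(n) \;=\; \frac{1}{n}\sum_{i=1}^{n} \ell_i(\theta_i) \;-\; \min_{\theta \in \mathcal{C}} \frac{1}{n}\sum_{i=1}^{n} \ell_i(\theta), \]

and "sublinear regret" in the sense used here means \( R(n) \to 0 \) as \( n \to \infty \).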
So this is what any good online convex programming algorithm will give you, and this is what they call sublinear regret; you want to minimize the regret. Great. Now, how do you use this framework for our differential privacy problem? For the cost functions, you can think of the loss functions for each of the data points: l(theta; d_i). Additionally, we need the following property from the online convex programming algorithm: that it is stable. By stability I mean: take two datasets D and D prime which differ in exactly one entry; then the L2 distance between the i-th iterate the player outputs on D and the i-th iterate on D prime should go down as 1/i. So as you go further down in time, the dependence on one single data point decreases linearly. This is the property I need, and we call such an algorithm a stable algorithm. Additionally, we want the original OCP algorithm A to have sublinear regret, meaning the regret goes down to zero as n goes to infinity.

With these two properties we can do something interesting. We use what is by now a pretty common thing in online learning, the online-to-batch conversion. You run the online convex programming algorithm on the cost functions with an appropriately chosen regularizer; the OCP algorithm gives you outputs theta_1 through theta_n. You take the average of all these outputs and add Gaussian noise whose scale is roughly root p / (epsilon n). This has a similar flavor to output perturbation -- you take an output and add noise -- so it looks very similar, but it has a significant advantage over output perturbation. Before going there, let me address this: I told you we need stable online convex learning algorithms. Do stable online convex learning algorithms even exist in the literature? That's the first question you want to ask. What we show -- and in fact it's still not clear to us why this is happening -- is that every online convex learning algorithm we picked up satisfies the stability property we wanted. Each and every one. More precisely, the Implicit Gradient Descent algorithm has this stability property. The GIGA algorithm of Martin Zinkevich from 2003 has this property, conditioned on the cost function being differentiable -- though I'm not sure whether that condition is an artifact of the proof or whether we actually need it. And Follow the Regularized Leader -- all of them have this property. Yeah, sure.

>>: The stability property, just as a -- you don't change your output too much based on a single --

>> Abhradeep Guha Thakurta: Data point.

>>: Difference?

>> Abhradeep Guha Thakurta: There's something more: as you go further in time, the change that one point can cause goes down with time, as 1/i. So essentially you want to say that the last point you output almost does not depend on any single sample, as this gap goes to zero.

>>: The difference between two datasets gets [inaudible].

>> Abhradeep Guha Thakurta: Yes. At a very high level, what this is saying is that the point of a learning algorithm is to get something I can use on a new sample point; in that sense, I should not be too tied to any one data point. This is essentially saying that I should not overfit to one data point. Although, again, it is not clear -- it's an open problem -- whether something fundamental is going on here, whether we can say that, plausibly, any algorithm that is differentially private has sublinear regret, or vice versa. I don't know.
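Here is a minimal sketch of the online-to-batch conversion with output noise, using projected online gradient descent as a stand-in for the stable OCP algorithm; the step sizes and the noise scale (the rough root p / (epsilon n) figure from the slide) are indicative assumptions, not the paper's exact calibration.

import numpy as np

def project(theta, radius):
    """Project theta onto the L2 ball of the given radius."""
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def private_online_to_batch(data, grad, p, radius, eps, delta, rng):
    """Online-to-batch conversion with output noise (a sketch).

    Runs projected online gradient descent over the data points, then
    averages the iterates and adds Gaussian noise scaled roughly as
    sqrt(p)/(eps*n) per coordinate.
    """
    n = len(data)
    theta = np.zeros(p)
    iterates = []
    for i, d in enumerate(data, start=1):
        theta = project(theta - grad(theta, d) / np.sqrt(i), radius)
        iterates.append(theta)
    avg = np.mean(iterates, axis=0)
    sigma = np.sqrt(p * np.log(1.0 / delta)) / (eps * n)  # assumed scale
    return avg + rng.normal(scale=sigma, size=p)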
So, what we show is that our OCP-based algorithm is (epsilon, delta)-differentially private, and the utility guarantee is very similar to objective perturbation, except that we pick up a polylog(n) term for technical reasons. But that's fine, because a polylogarithm of the sample size is fine. The key property of this algorithm, and why we looked at it, is that it is much more practical than objective perturbation and output perturbation. If you look at the privacy guarantee, it does not depend on you exactly minimizing any objective function: the OCP algorithm essentially works via gradient descent, and the final output is just an average with noise added to it. So the privacy guarantee holds no matter what you do -- wherever you stop, the privacy guarantee holds.

With these two things in mind, the question now is which one is more useful. I don't have a very clear answer, but let's try to see. There are two approaches we designed: objective perturbation and the OCP-based one. Objective perturbation requires stronger assumptions than online convex programming: in our work we allowed the regularizer to be nondifferentiable, but objective perturbation requires the empirical risk itself to be differentiable, and we can prove that this is necessary -- it's not a limitation of the proof technique; it is necessary for the algorithm. At the cost of that, you get better utility guarantees: it exploits the structure of the problem better than output perturbation or OCP does. Contrary to that, the OCP algorithm is much more practical, in the sense that you do not need to worry about whether the objective function is actually minimized, or how well your convex optimization package is working: you just take a bunch of outputs, take the average, and add noise. I haven't put output perturbation here, because OCP performs unconditionally better than output perturbation -- up to a polylog-factor dependence on the dataset size, but that's fine -- so essentially the comparison is between objective perturbation and OCP.

>>: You're comparing a batch algorithm with an --

>> Abhradeep Guha Thakurta: No, this is a batch algorithm. OCP is a technique I'm using for solving the batch problem. Let me make this clear: I have a sequence of data points, and OCP is an online algorithm, but I'm pretending the data points are coming online while I'm solving a batch problem. The batch techniques are existing work, and it just so happened that people never looked at using online algorithms for batch [inaudible] algorithms.

So, in this talk so far I've told you about the low dimension: three approaches -- output perturbation, objective perturbation, and online convex programming -- and a comparative study of the three. Now to the next part: we want to do sparse regression in high dimensions. The idea of high dimension is that the dimensionality of the problem is much larger than the number of sample points you have, and this makes the problem significantly challenging. Why is that? If you use the previous techniques -- output perturbation, objective perturbation, or the OCP-based one -- the error scales polynomially in the dimensionality and inversely in the sample size.
But I told you that the dimensionality of the problem is significantly higher than the number of samples you have. So to get any consistent estimator, you need to control the dependence on the dimensionality significantly: what you want is something logarithmic in it. What we show is that in the high-dimensional case, for sparse problems, and specifically for linear problems, the error scales polynomially in the logarithm of the dimensionality -- polylog of the dimension. To do this, we assume that the underlying model parameter we're trying to learn is sparse, meaning it has very few nonzero entries. This is basically what sparse regression is: the number of dimensions of the problem is much larger than the number of samples you have, but the true parameter is sparse.

Essentially, in these empirical risk minimization problems we are minimizing some objective function. Now, when the dimensionality of the problem is much larger than the number of sample points, the objective function looks something like this: you can view it as a boat, with a lot of flat area in the middle. Any of the model parameters in the flat region will work equally well on the dataset. But recall that your goal was to work well on a new sample, not just on the dataset, so you need to choose the right model parameter. If I just give you the problem in its original form, with the dimensionality larger than the number of samples, the problem is extremely [inaudible]. So how do you go about it? It so happens that statisticians observed that if the underlying model parameter you are trying to learn is sparse -- has very few nonzero entries -- then you can restrict your search to parameter vectors which have this sparsity property. In that case you can constrain your search space significantly, rather than searching the entire flat region. This was mostly developed by statisticians; from a computer science perspective, it is essentially the feature selection problem: given a bunch of features, you want to select a subset of relevant features.

For doing this privately, we design two algorithms, and the two have a very similar flavor. First you run an algorithm to select a bunch of features. Then, in the second stage, we use the idea of objective perturbation: having selected the features, restrict the problem to that small subset and run objective perturbation on it, because once the features are out, you're in the low-dimensional setting. For the first stage we use two approaches. One is an exponential-sampling-based algorithm, which is roughly based on some of the ideas from [inaudible] in 2007, and the other is a subsampling-based approach, which loosely relates to the sample-and-aggregate framework from [inaudible] in 2007. In this talk I'll specifically speak about the subsampling-based approach: the exponential-sampling one is in some sense the first-cut solution you'd get, but it is not computationally efficient, whereas the subsampling-based one is computationally efficient, and that is what I'm going to speak about.
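For concreteness, the sparse linear regression setting just described, in symbols:

\[ y = X\theta^* + w, \qquad X \in \mathbb{R}^{n \times p}, \qquad \|\theta^*\|_0 = s \ll n \ll p, \]

and the standard (nonprivate) estimator minimizes, for instance in the LASSO variant,

\[ \hat{\theta} = \operatorname*{argmin}_{\theta} \; \frac{1}{n}\|y - X\theta\|_2^2 \;+\; \lambda \|\theta\|_1. \]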
So what the subsampling-based algorithm does is this: you take the complete dataset and break it up into root n blocks, each of size root n. Run your favorite feature selection algorithm on each of the blocks; each block gives you a bunch of features it thinks are important. From those outputs you take a noisy vote -- how you take the noisy vote, I'll tell you. This is the high-level picture; you can plug in any feature selection algorithm you want, the scheme is essentially oblivious to it. So block one comes and votes for a bunch of features -- s of them. Block two does the same, and so on, up to block root n. Now you take the vote counts for each feature across the root n blocks, add noise on the order of s/epsilon to each count, select the top s features in terms of these noisy counts, and give that set out. The claim is that this is differentially private.

>>: So this count --

>> Abhradeep Guha Thakurta: So these blocks vote for it. A vote is plus one or zero, so each feature gets a count of votes.

>>: The count already has --

>> Abhradeep Guha Thakurta: Yes, for the votes.

>>: When you say add noise --

>> Abhradeep Guha Thakurta: Yeah, there I was being a bit sloppy. Add noise in the sense of adding, say, Laplace noise whose scale is s/epsilon.

>>: When you say the adding of the noise -- you're not actually exposing that?

>> Abhradeep Guha Thakurta: Yes, I am exposing. What I'm exposing is the coordinates I'm choosing; the set of coordinates to which I'm reducing the problem can be shown to the world. And this is differentially private -- epsilon-differentially private.

So we take this general scheme and study it for linear regression. In linear regression you have a design matrix, which is n by p, and the model parameter, and then additive noise, and the response is y. The hatched part you can think of as personal information about a person -- this is, say, whether he has AIDS -- and these are the feature vectors. The way you solve the linear regression problem is you want to find theta star, so you minimize an objective which gives you an estimate theta hat of theta star: the squared L2 norm of the difference between y and X theta, plus an appropriately chosen regularizer. For different problems you choose different things: for the LASSO and its variants you use L1 regularization, et cetera. So these are the settings: theta star is the underlying model parameter, X is the design matrix, y is the output vector. The exponential-sampling algorithm requires around s^2 log p samples, and the subsampling-based algorithm requires around s^2 log^2 p samples. If you note, both algorithms have sample complexity poly(s) times log p. So I have pulled the dependence down from p to log p, at the cost of a polynomial dependence on the sparsity, but that's all right. Now even if p grows faster than n, I'll still get a good convergence rate, as long as n grows like log p. This bypasses some of the impossibility results in the literature, which say that deterministic algorithms cannot be differentially private; but since we are working with randomized algorithms, that's fine. So those are the details. So, this talk so far: low dimensions -- output perturbation, objective perturbation, online convex programming.

>>: s is smaller than --

>> Abhradeep Guha Thakurta: s is smaller than n.

>>: Smaller than n, but no requirement on it being square root or --

>> Abhradeep Guha Thakurta: No.
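Here is a minimal sketch of the subsampling-based selector with noisy votes described above. The per-block feature selector is a hypothetical placeholder (top-s coordinates by absolute correlation), and the Laplace scale follows the s/epsilon figure from the talk; the paper's exact calibration is more careful.

import numpy as np

def private_feature_selection(X, y, s, eps, rng):
    """Subsample-and-vote feature selection (a sketch).

    Split the data into sqrt(n) blocks, run a (placeholder) feature
    selector on each block, add Laplace noise of scale ~ s/eps to the
    vote counts, and keep the s features with the highest noisy votes.
    """
    n, p = X.shape
    k = int(np.sqrt(n))          # number of blocks
    votes = np.zeros(p)
    for block in np.array_split(np.arange(n), k):
        # Placeholder selector: top-s coordinates by |correlation with y|.
        corr = np.abs(X[block].T @ y[block])
        chosen = np.argsort(corr)[-s:]
        votes[chosen] += 1.0
    noisy = votes + rng.laplace(scale=s / eps, size=p)
    return np.sort(np.argsort(noisy)[-s:])  # indices of selected features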
So, in the high dimension, where the dimensionality of the problem is larger than the number of samples, we give differentially private sparse regression. You should tell me when I should stop.

>>: s was the number of nonzero entries?

>> Abhradeep Guha Thakurta: Entries in the underlying parameter -- yes, the number of nonzeros. One very common example people use: suppose you are taking photos of stars in the night sky. You won't find a star in every location; very few places have a star. So in the model you're trying to learn, you can assume there are very few [inaudible]. Essentially, people working with astronomical data use sparse regression quite a bit. So, I don't know -- have I gone over time? No? Then I'll go on; this will take another 10 or 15 minutes, I guess.

So, after having talked about the problem where you have the samples in a batch, now I'll tell you how you do it when the samples are coming to you online. The data points are [inaudible] the cost functions, as I told you -- that's how you map it -- and some [inaudible] online. In this setting -- this is work with Prateek Jain, Pravesh Kothari, and me from 2012 -- we give a very generic transformation: you give me an OCP algorithm, I plug it into the system, and the system gives back a differentially private variant of it. Moreover, if the original OCP algorithm had sublinear regret -- recall, that measured how well I perform against the best possible offline adversary -- the private version will also have sublinear regret. Additionally, we show that if the loss functions have some structure, like quadratic loss or linear loss, then we can get a private variant with almost the optimal nonprivate regret: the nonprivate regret is around log n over n, and we get almost close to that.

This is just a refresher of the structure of the online setting -- now we're actually working in the online setting. There's a player and a challenger. The player chooses a point in the convex set; the challenger chooses a cost function; you pay the cost, and you want to minimize the regret. The issue with online algorithms is this: in the batch setting, the model parameters theta_i were kept within the system, and you only took an average and added noise to the output. But here, the thing you're outputting is the model parameter itself, every round, so this is much more reactive compared to the offline batch setting: you have to update the model parameter every time you see a data point, and output it. And these cost functions can contain sensitive information -- for example, for online classification, this can be the label of whether the person has AIDS or not, together with the feature vector. And you need to guarantee privacy across all the possible outputs. So the point is that if you have n data points, you have to guarantee privacy for all n outputs, and this is a pretty complicated [inaudible].

I'll show you a toy attack to illustrate how the online problem becomes challenging. Suppose there is this adversary, the owner of some perfume brand -- let's call it ABC -- and he wants to find out whether a person who clicks on his ad is a male or a female. So a male comes, types "perfume", and Bing shows this ad for ABC. He clicks on it, and the adversary gets [inaudible] from him basically clicking on it.
And at the same time, the adversary knows that this person clicked on the ad. Now, he wants to find out whether this person -- say P -- is male or female. On Bing's side, what does Bing do? Bing sees that a male has come, searched for perfume, and clicked on ad ABC, so maybe ABC should be reranked and given a higher rating -- the [inaudible] of this ad goes up for all male users. This is just an example; the real system is much more complicated. So the ad gets reranked. Now, after the male clicked, the adversary immediately goes to his own male profile -- he had made a bunch of profiles earlier -- and sees that for males the rank has increased. He immediately goes to his female profile; it hasn't increased. So it must have been a male.

The meta point is that online learning tends to make quick decisions: every time you see some input, you make a decision or you change the system, and that might compromise privacy. So the key thing we did was to design online learning algorithms with controlled changes -- you do not want to make sharp updates based on one single entry -- and additionally we added appropriate noise: when you output these model parameters theta to the external world, we add some noise to them. Now, adding noise is a pretty common thing in differential privacy; everyone adds noise everywhere. What's the interesting part here? The interesting part is that the number of inputs here is equal to the number of outputs. You are not outputting a single thing; you are outputting a sequence of outputs, exactly as many as there are inputs, and it is significantly challenging to guarantee privacy in that setting. The challenge is that privacy should be guaranteed across all of these outputs. Again, for privacy we use differential privacy here: you want the whole sequence of outputs to be differentially private. Let me recall the intuition, because it matters here: the presence of a particular data point should not be visible to the observer; the outputs should be indistinguishable from the outputs without that point.

For online learning, we use this framework of online convex programming. OCP algorithms usually have sublinear regret, meaning that as n goes to infinity the regret goes to zero. Additionally, they have this property called explore versus exploit. Roughly, explore versus exploit says: when a new point arrives, I will explore it, but I will not discard what I have learned from the previous points -- that's the exploit part. So explore versus exploit tells you that the algorithm does not make sharp updates; those algorithms just don't allow you to make sharp updates. If you now think about the intuition for privacy and this explore-versus-exploit philosophy, the two are pretty much in sync, and in our analysis and proofs we essentially use this idea. Although, again, it is not clear to us whether this property is specific to online learning, or whether it holds more generally in learning -- because essentially in learning [inaudible] you do not want to depend on one single point. We don't know. But for online learning, at least, this is true.

Earlier in the talk I gave you a brief idea of what is known in the offline, batch setting. People have analyzed offline learning with privacy a fair bit, but this is almost untouched in the online scenario.
There's only one notable paper, by [inaudible] et al. in 2010, which handled online experts -- a very specific kind of online problem -- with sublinear regret. But that does not extend to general online convex programming, and there lies the challenge. Again, recall that we want the OCP algorithm we're dealing with to have sublinear regret, and it needs to be stable -- stable meaning that as time progresses, the dependence of the i-th parameter on any one data point or cost function should go down as 1/i.

Now, in the online algorithm, what you do is this: we maintain two copies of the model parameter, one kept inside the algorithm and one output to the external world. The noise we add for the output is scaled roughly as root i over n, where i is the index of the iterate. The internal model parameter theta is used for the updates, and the noisy parameter I output to the outside is used for the predictions -- the predictions that can be used, say, by Bing to decide what [inaudible] to show.

Now, for the privacy, the issue is that you have to guarantee privacy across all iterations. There is a result by Dwork et al. from 2006 -- the initial paper on differential privacy -- which says that if I have a sequence of outputs, each of them epsilon-differentially private, then the complete sequence of n outputs is n-epsilon-differentially private. But that's not enough for us to get sublinear regret: if the privacy parameter scales as n, we can show that you will not get regret that goes down to zero with n. So we needed a stronger composition result. What we did was take the ideas from recent papers -- Dwork et al. in 2010, and Hardt and Rothblum in 2010 -- and modify that analysis to our setting to get a composition result which roughly says that a sequence of n outputs is not n-epsilon-differentially private but more like root-n-epsilon-differentially private. And that essentially allows us to get sublinear regret -- we gain a factor of root n over n. So we can guarantee (epsilon, delta)-differential privacy for stable OCP algorithms, and the regret of the private variant is the regret of the nonprivate OCP algorithm plus a term of log^2 n over root n. So the overhead scales as log^2 n / root n, and it's trivial to see that if the original algorithm has regret tending to zero, then the private one also has regret tending to zero.

Here is a toy scenario of how this works. There's the model parameter, and there's the learner -- you can take it to be Bing. An interested party makes a computation: a user comes with a profile, types "perfume", and this is the data. The system shows an ad based on the user profile and the model parameter. Based on whether this user clicks or not, the system incurs a loss or cost, and this goes on for a bunch of iterations. Now, the system updates to theta_2, but it keeps theta_2 to itself; what it shows to the external world is theta_2 plus noise. And this goes on for theta_3 and so on. That's how it works.

>>: When you say it updates according to the original theta_2, does it mean that, in terms of [inaudible] lowering the rank of the plan, we actually use the true information?

>> Abhradeep Guha Thakurta: No, it will use this.

>>: It will use the --

>> Abhradeep Guha Thakurta: The private version. The true theta_2 is only used inside the system for generating the next update. Whatever is shown to the external world goes through the noisy one.
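A minimal sketch of the private online learner just described: keep an internal iterate for the updates, and release a noised copy each round. The growing noise scale here is a simplified stand-in for the paper's calibration, and the composition bookkeeping (the root-n-style argument) happens in the analysis, not in the code.

import numpy as np

def private_ocp(cost_grads, p, radius, eps, rng):
    """Differentially private online convex programming (a sketch).

    cost_grads: iterable of functions g_i(theta) returning the gradient
    of the i-th cost at theta. Internally runs projected online gradient
    descent; each round it publishes theta_i plus Gaussian noise, so the
    guarantee must cover the whole released sequence.
    """
    def project(v):
        nrm = np.linalg.norm(v)
        return v if nrm <= radius else v * (radius / nrm)

    theta = np.zeros(p)
    released = []
    for i, g in enumerate(cost_grads, start=1):
        theta = project(theta - g(theta) / np.sqrt(i))  # private internal copy
        sigma = np.sqrt(i) / eps                        # assumed noise scale
        released.append(theta + rng.normal(scale=sigma, size=p))
    return released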
Coming to quadratic loss: if the cost functions are of this form, then for linear regression we can get the regret down to log n over n instead of 1/root n -- earlier it was 1/root n, and we get it down to log n over n. For this we heavily use the structure of the problem; it uses a technique by Dwork et al., and it's an improvement over the 1/root n bound.

So that brings me to the end of the talk. What I've told you: in the low dimension, I've shown you three approaches to empirical risk minimization in the batch setting. These two algorithms are the most promising ones: objective perturbation has better utility guarantees, and online convex programming is more practical, because the privacy guarantee holds no matter what you do. In the high-dimensional setting, I told you about the first line of work for sparse regression in high dimensions, and I spoke about two different algorithms: the exponential-sampling-based one and the subsampling-based one, for which I gave you the details of the voting scheme; the sample complexity scales as poly(s) times log p. For the online setting, I gave you a generic framework for translating your favorite OCP algorithm into a differentially private one, and I showed -- well, I basically hand-waved over it -- how to get tighter bounds for online linear regression, log n over n instead of 1/root n.

In terms of future work: in the batch setting, high dimension, we have made progress. If you forget about privacy, the sample complexity required is s log p, but the best we are getting privately is s^2 log p, and the current analysis will not allow us to get s log p. The question arises whether this gap between s log p and s^2 log p -- between private and nonprivate -- is necessary, or whether we can actually bridge it. We have made some initial progress, and the short answer is probably yes, we can bridge the gap, but it requires significantly different analysis techniques. For online learning: if each of the cost functions were strongly convex, then the nonprivate regret is known to scale as 1/n, but our algorithm gives you 1/root n. The question is whether we can bridge that gap; that would be a significant improvement over the current algorithm. And the long-term motive is to understand the relationship between differential privacy and learning, because the whole intention of learning is not to overfit to one single sample point, and differential privacy is essentially [inaudible]. Can we formalize that? In learning this notion is called stability, and there are different versions of it. Can we formalize it and establish a deeper connection there? With that, I want to end. [applause] Questions?