>> Dengyong Zhou:
All right. Today we're hosting Abhradeep Guha Thakurta
from Penn State. Abhradeep has done some really interesting work on
differential privacy. He just had two papers in COLT, which is really,
really awesome. And he's going to tell us the grand view of all of this work
today.
>> Abhradeep Guha Thakurta: Good morning. I'm Abhradeep Guha Thakurta from
Penn State. And I'm going to talk about differentially private empirical risk
minimization and high-dimensional regression. It's joint work with Prateek
Jain from MSR India, Daniel Kifer from Penn State and Pravesh Kothari from UT
Austin and Adam Smith from Penn State. This talk is essentially a merging of
two different papers I had in COLT, and tries to give a high-level view of them.
The primary intention of this work was to design learning algorithms with
rigorous privacy guarantees. So before going forward, let's try to
understand what kind of privacy guarantees are we looking at.
Suppose you have a dataset of a bunch of individuals. You can think of each
entry in the dataset as some kind of sensitive information about an
individual.
For example, it might contain the results of a bunch of medical tests and a
bit which tells whether the person has AIDS or not. And you have an algorithm which
is sitting in front of this dataset and releasing some statistic
about the dataset.
For example, it might release a classifier to the external world to predict
whether a person will have AIDS or not.
So the summary statistic which goes out, you might think that this statistic
goes to an adversary who has the intention of breaching the privacy of the
dataset. By breach of privacy I mean he might want to get specific
information about a particular individual in the dataset.
So thinking more about it, there are essentially two conflicting goals here.
First, you want to release this kind of statistics to the external world
because they're just useful. Additionally, there is this privacy concern
that whenever releasing this statistic, it might reveal information about the
individual, which is clearly an issue.
And it so happened that this is a pretty difficult problem. Recently there
has been work where people design systems with some ad hoc notions of
privacy. And there have been a lot of high-profile breaches against those
systems.
To name some of them, there was this work by Arvind [inaudible] and
[inaudible] where they deanonymized the Netflix Prize database -- where
Netflix had said the data was anonymized and that they were preserving the
privacy of people.
There was this work by Aleksandra [inaudible] in 2011 where she showed
vulnerabilities in the Facebook advertisement system. And there was a
recent work at Oakland last year where people showed privacy breaches
against recommendation systems.
So the question really arises: what kind of privacy guarantee do these ad hoc
systems give, or do we need to significantly rethink the problem?
So in the last like three to four years people have been trying to design
algorithms with rigorous privacy guarantees. And this has been an active
area of research spanning almost all of computer science: databases,
learning theory, programming languages, cryptography, algorithms, just to
name a few.
So what I'll do is in this talk I'll narrow my scope down a bit. For learning
algorithms I'll specifically talk about empirical risk minimizers. For the
privacy guarantee I'll use this notion called differential privacy, which has
been reasonably successful in the recent past. And this is one of the
rigorous notions of privacy.
Just to get a hang of it, empirical risk minimization is: you have a dataset
of N points and you are trying to find a model parameter which minimizes an
empirical risk. Now, what is an empirical risk? For each of the data points,
for a given model parameter, you have a loss function L which gives you the
risk for that data point. And the empirical risk is the average of the risk
over all these data points.
And for the purposes of this talk we'll be thinking of this loss function as
convex in its first parameter. This is just for technical reasons.
And the empirical risk minimizer is the model parameter which minimizes the
empirical risk. Additionally, you add that term there, a regularizer, to
encode your prior belief about the minimizer. For example, if you think the
minimizer you're trying to learn has very few nonzero entries, you might put
the L1 regularization here. It's essentially encoding the prior belief into
the system.
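As a reference, this is the objective just described, written out (my own transcription in standard notation; theta is the model parameter, d_1,...,d_N the data points, r the regularizer with weight Lambda):

```latex
\hat{\theta} \;=\; \operatorname*{argmin}_{\theta \in \mathcal{C}}
\; \underbrace{\frac{1}{N}\sum_{i=1}^{N} \ell(\theta; d_i)}_{\text{empirical risk}}
\;+\; \Lambda \, r(\theta)
```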
Throughout the talk, I'll be making references to the specific form of the
linear regression problem. It is one of the [inaudible] problems, which is
pretty common. But the empirical risk minimizers I'll be telling you about
are general, and cover situations like [inaudible], logistic regression,
median, support vector machines, et cetera.
Now, after having defined empirical risk minimization, the question arises:
what kind of privacy threats does it pose? Is it even worthwhile to
investigate designing private algorithms for empirical risk minimization?
Consider this very simple toy example. You have a dataset of five
points, minus five, minus two, minus one, two and five. And if I define the
median of this dataset it is very trivial to see that minus one is the
median.
Now, you can write this problem of finding the median as an empirical risk
minimization problem: you essentially minimize, over the parameter theta, the
sum of the absolute distances from each of the data points.
And if you plot the graph of this function, it will look something like this.
The slopes of the pieces are minus five, minus three, minus one, plus one,
plus three and plus five as you move left to right, and the median is this
guy where the slope changes sign.
>>: You're actually computing the average.
>> Abhradeep Guha Thakurta: No, I'm computing the median here. If you plot
this graph, with these slopes, this is the median. The average would be the
minimizer of the sum of (theta minus d_i) squared -- that is the average.
So thinking of it, median is an empirical risk minimization problem, and you
are revealing one data point exactly. And if your purpose was to protect the
privacy of the data point, this is not good news.
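To make this concrete, here is a minimal sketch (my own illustration, not from the talk) showing that the exact ERM solution of the absolute-loss objective is the median, i.e. a literal data point:

```python
import numpy as np

data = np.array([-5.0, -2.0, -1.0, 2.0, 5.0])

def empirical_risk(theta, data):
    # Average absolute loss: (1/N) * sum_i |theta - d_i|
    return np.mean(np.abs(theta - data))

# Minimize over a fine grid; the minimizer of sum_i |theta - d_i| is the median.
grid = np.linspace(-6, 6, 12001)
risks = np.array([empirical_risk(t, data) for t in grid])
theta_hat = grid[np.argmin(risks)]

print(theta_hat)            # -1.0, exactly one individual's data point
print(np.median(data))      # -1.0 as well
```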
And if this seems like a trivial problem because I'm dealing with one
dimension, let me give you a more concrete problem: support vector machines.
machines. Suppose you're dealing with support vector machines in the dual
form where you're outputting the support vectors.
So think about it. The support vectors you are outputting are essentially
the data points on the margin. And these support vectors are reasonably high
dimensional.
So you're actually revealing a lot of information about individuals. So with
this kind of issue with empirical risk minimization, the question arises what
can you do about it?
So these are the problems and we need to mitigate these problems. So to that
end, we use this notion called differential privacy. It was initially
proposed by Dwork, McSherry, Nissim and Smith in 2006, and there was a
follow-up by Dwork et al. in 2006.
So at a high level, what this definition tells you is: suppose I'm releasing
an interesting statistic about a dataset; the privacy breach the statistic
can cause you is essentially the same whether you are in the dataset or not.
For example, to be more concrete, suppose you have two datasets D and D
prime. The third entry is your entry, and in the other dataset you have
changed your entry to something else. And the algorithm is running either on
D or on D prime.
Now, the output is given to an adversary. And the adversary cannot decide
whether the algorithm was running on dataset D or D prime.
To formalize this: a randomized algorithm A is epsilon delta differentially
private -- epsilon and delta are the privacy parameters, and the lower their
values, the better the privacy guarantee -- if for all datasets D and D prime
that differ in one element, and for all possible sets of answers S, the
probability that the algorithm running on dataset D produces something in S
and the probability that the algorithm running on dataset D prime produces
something in S are close: Pr[A(D) in S] <= e^epsilon * Pr[A(D') in S] + delta.
So if you have seen this definition before, good; if you haven't, it's fine.
You don't need to remember the details. The only thing you need to keep in
mind is that epsilon and delta are the privacy parameters, and the lower
their values, the better.
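As a concrete instance of this definition (my own illustration, not part of the talk): the standard Laplace mechanism releases a counting query differentially privately. Changing one entry changes the count by at most 1, so Laplace noise of scale 1/epsilon gives (epsilon, 0)-differential privacy:

```python
import numpy as np

def private_count(bits, epsilon, rng=np.random.default_rng()):
    """Release sum(bits) with (epsilon, 0)-differential privacy.

    Changing one person's bit changes the true count by at most 1
    (the sensitivity), so Laplace noise of scale 1/epsilon suffices.
    """
    true_count = np.sum(bits)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

bits = np.array([1, 0, 1, 1, 0, 1])   # e.g., who tested positive
print(private_count(bits, epsilon=0.5))
```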
Any questions so far?
>>: Go back further in the slides. I don't understand why outputting
[inaudible] SVM reduces the privacy of some people.
>> Abhradeep Guha Thakurta: So if you're running SVM in this dual form, where
you're outputting the support vectors -- what is a support vector? Support
vectors are some of the feature vectors themselves. And these feature
vectors can contain an individual's information.
For example, in the initial setting I told you, right, it might be a bunch of
medical tests which you have done. And the final thing is [inaudible].
>>: Coming from the support vectors.
>> Abhradeep Guha Thakurta: Yes.
>>: I see.
>> Abhradeep Guha Thakurta: So that is the definition of differential
privacy. Now let me try to motivate it a bit -- why we have chosen this
definition.
So why is it a good definition? One of the key features of differential
privacy is this property called composition. What composition tells us is:
suppose you have this dataset, and I'm running two algorithms on the
dataset.
Algorithm one, which guarantees epsilon delta differential privacy and
algorithm two which I'm running again on the dataset, which also guarantees
epsilon delta differential privacy.
And if I -- so each of these algorithms reveals something about the dataset.
I mean, it doesn't -- I mean, it doesn't give out zero information.
So now the question arises: when I put these two pieces of information
together, what is the privacy guarantee I have? Differential privacy
trivially tells you that you will have (two epsilon, two delta) privacy.
Now, to my knowledge this is the only rigorous definition which guarantees
composition so easily. There are other rigorous definitions of privacy, like
maximal differential privacy or noiseless database privacy, but getting a
composition result of this clean form there is very difficult.
So this is one of the key features of differential privacy. And
additionally, it gives you guarantees against arbitrary auxiliary
information. So if you remember, in the definition I never mentioned
anything about auxiliary information; I assume that the adversary might know
everything except your own entry.
However, what it does not imply -- so recently there have been a lot of works
where people kind of misinterpret or overinterpret this definition.
It does not protect you against the information an adversary can learn from
an [inaudible] statistic. For example, suppose someone runs a survey which
tells that people smoking, people who smoke have cancer.
Now, someone sees you outside smoking, and from the survey result that
smoking causes cancer, the guy who sees you smoking might have a
significantly higher belief that you have cancer.
But this is independent of the fact of you being in the dataset or not. So
this might be a privacy breach, it might not be, but differential privacy
does not protect you against it. And this distinction is pretty important.
>>: That means differential privacy cannot prevent you from getting the
auxiliary information. Is that what this means?
>> Abhradeep Guha Thakurta: This means that if the public information that is
out there breaches your privacy, differential privacy does not protect you
against that. What it protects is the computation which the system is doing
to release that information: that computation is essentially indifferent to
you being in there. So -- yeah. So this talk is essentially about empirical
risk minimization and how you design differentially private algorithms for
empirical risk minimization.
So I'll be segregating the talk into two parts. One is in the low dimension.
The other is in the high dimension. By low dimension I mean that the number
of sample points or data points you have is much larger than the
dimensionality of the model parameter. So for the theta which I'm trying to
learn, its dimensionality is smaller than the number of samples you have.
And the other setting I'll be talking about is the high dimension, where the
dimensionality of the problem is much larger -- maybe exponential in the
number of sample points you have.
In the low dimension, I'll initially be talking about some of the existing
work. This is output perturbation. It was initially proposed by Chaudhuri
et al. in 2011, and there was a follow-up by Rubinstein et al. in 2011.
Then I'll be talking about another technique called objective perturbation,
which is in the same paper, CMS 2011; in one of our recent papers we make a
significant improvement with respect to objective perturbation.
The third thing is using online learning techniques for guaranteeing
differential privacy; this is work by Prateek Jain, Pravesh Kothari and me.
Then I want to compare all three approaches and see which one is better. Not
that I have a very good answer right now for how these pieces fit together,
but there are some initial directions I can point to about which one is
better and why.
>>: So along with this and the objective, isn't there also the possibility of
perturbing the data points themselves, to create false data that's in --
>> Abhradeep Guha Thakurta: Sure. Yeah. So the reason that I didn't put it
was in some sense intentional. When you perturb the data points: differential
privacy in some sense only requires that one data point be hidden in the
output.
Now, if you want to perturb all the data points initially without looking
what algorithm you're running, that essentially means that the effect of the
data point is almost gone. For example, let's take the example of averaging
a bunch of numbers.
Now, say the numbers are bits. The average can change by at most 1/N when
you change one of the entries. So with output perturbation you'll be adding
noise on the order of 1/N. If instead you're adding noise to the inputs,
you're adding noise on the order of one to each of the inputs; when you take
the average, even though errors cancel out, there will be an error on the
order of 1 over root N. So input perturbation has a significantly higher
error.
>>: But you could also find the solution, say in your support vector example,
and then find perturbed data points in the neighborhood which give something
that comes close --
>> Abhradeep Guha Thakurta: Yes, so that approach is somewhat kind of
related to output perturbation. It has a similar flavor.
So output perturbation essentially uses the structure of the problem a bit.
Now, in the high dimensional setting, the dimensionality of the problem is
much larger than the number of samples you have.
And these kinds of problems are really hard to address, because even
nonprivately they are very hard: the system remains extremely
underdetermined [inaudible] with the number of samples.
And what we do is give private algorithms for sparse regression. Sparse
regression meaning the underlying model parameter you're trying to learn in
this high dimensional setting has very few non-zero entries. Then you can do
something interesting, and I'll tell you how.
So the first two bullets are mostly for batch empirical risk minimization,
where you have all the sample points in one go. Now, suppose you're dealing
with advertisement systems or any kind of system which changes over time, so
you do not have the complete dataset beforehand.
That brings us to online empirical risk minimization, where the data points
are coming in online. And the challenge here is much different from the
batch setting, because in the batch setting you are outputting only one
output, but in the online setting, every time a data point comes you have to
give some output. So the number of outputs is almost equal to the number of
inputs you have, and there's a significant challenge in terms of protecting
the privacy.
So there we give a very generic technique for designing differentially
private online learning. So this is our work, the other paper.
So coming to batch empirical risk minimization, low dimension, there are
three algorithms I'll talk about: output perturbation, objective
perturbation, and an online convex programming based approach. And I will
compare the three.
So, output perturbation was proposed by Chaudhuri et al. The idea: first you
sample a random variable B from a gamma distribution; it's a mean zero
distribution where the scale is on the order of one over epsilon, and epsilon
is the privacy parameter.
What you do then is take your objective function of the original risk
minimization and find its minimizer. Then you add this noise to it. So if
you look at this picture, this is your theta hat, where the slope is zero,
and you add noise to it.
And the claim is that this guy is differentially private. For full
disclosure, the actual algorithm is slightly different and slightly more
tricky, but for the purpose of this talk I think this is fine.
So basically: find the minimizer and add noise to it. As for the privacy
guarantee, this algorithm is (epsilon, 0) differentially private, so the
delta term is zero there.
And in terms of utility, the price you pay for privacy: this is the loss of
the empirical risk minimizer without the privacy requirement, and this is
with privacy. The extra loss you pay is around P log P over epsilon N. The
L term showing up there is the bound on the gradient of the loss, for
Lipschitz loss functions, and the other term is the L2 norm of theta. These
two parameters you can ignore for the purpose of this talk, because all the
guarantees I will show have these terms in common.
So essentially what you need to compare is P log P over epsilon N: a
dependence on the dimensionality P, a log P factor, and N, the number of
samples we have.
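A minimal sketch of output perturbation for regularized logistic regression (my own simplified illustration; the noise calibration below is a placeholder in the spirit of the Chaudhuri et al. sensitivity bound for strongly convex objectives, not their exact algorithm):

```python
import numpy as np
from scipy.optimize import minimize

def output_perturbation(X, y, epsilon, lam, rng=np.random.default_rng()):
    """Sketch: minimize the regularized logistic loss, then add noise.

    Noise norm scale ~ 2/(n*lam*epsilon), following the CMS-style
    sensitivity bound for lam-strongly-convex objectives (simplified).
    """
    n, p = X.shape

    def objective(theta):
        margins = y * (X @ theta)
        return np.mean(np.log1p(np.exp(-margins))) + lam / 2 * theta @ theta

    theta_hat = minimize(objective, np.zeros(p)).x

    # Gamma-distributed norm, uniform random direction.
    direction = rng.normal(size=p)
    direction /= np.linalg.norm(direction)
    norm = rng.gamma(shape=p, scale=2.0 / (n * lam * epsilon))
    return theta_hat + norm * direction
```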
Great. So one issue with this kind of approach is that the privacy argument
requires that you reach the true optimum of the objective function. The
privacy guarantee holds only for the true optimum, if you reach the optimum.
Now, in practice you might not, and then the guarantees do not hold, or at
least the analysis doesn't go through. So this is a significant problem. In
practice, what does it even mean to guarantee privacy in this kind of
setting? Over a sequence of slides I'll show you how to mitigate this issue.
So the other algorithm is objective perturbation. I'll refer to this as the
objective perturbation of CMS, because we make an improvement over it; so let
me first tell you what CMS did. What does this algorithm do? Again, sample
a random variable B from a gamma distribution. And this time, instead of
adding noise to the output, what you do is add a random linear term to the
objective function.
You can think of the setting in two ways. First, you take the objective
function and randomly tilt it. Alternatively -- so this was the original
theta hat for the loss plus the regularizer -- what you are doing here is
finding a model parameter where the gradient of the objective function is
equal to minus B.
So if you are conversant with duality, this is essentially working with the
dual [inaudible]. But we will not go there.
Okay, this is a simplistic view I'm giving you, because you need to choose
the regularizer in a proper manner to make things work, but that's fine.
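A minimal sketch of objective perturbation in this spirit (again my own simplified illustration; choosing the regularizer and noise correctly is exactly the delicate part just alluded to):

```python
import numpy as np
from scipy.optimize import minimize

def objective_perturbation(X, y, epsilon, lam, rng=np.random.default_rng()):
    """Sketch: add a random linear term <b, theta> to the objective,
    then minimize. The minimizer satisfies grad(objective) = -b."""
    n, p = X.shape
    # Simplified noise draw; CMS calibrate ||b|| via a gamma distribution.
    b = rng.normal(scale=2.0 / epsilon, size=p)

    def perturbed_objective(theta):
        margins = y * (X @ theta)
        loss = np.mean(np.log1p(np.exp(-margins)))
        return loss + lam / 2 * theta @ theta + (b @ theta) / n

    return minimize(perturbed_objective, np.zeros(p)).x
```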
So the existing result on objective perturbation required the space you're
optimizing over to be unconstrained. You cannot put hard constraints on the
set; it has to be all of R^P. And this rules out problems like linear
regression for technical reasons, because to bound certain parameters for
privacy you need to work with a constrained set in the case of linear
regression.
Additionally, they required the regularizer to be smooth, meaning twice
differentiable -- at least the analysis required it. This essentially rules
out the most common regularizers like L1 regularization or the nuclear norm,
et cetera.
So what we do is allow convex constraints -- we allow constrained
optimization -- and nondifferentiable regularizers. To do this we actually
needed a significantly different approach than what was analyzed earlier, and
I'll tell you what this approach is. It's completely different from what has
been done in the past.
And the second contribution was: we noticed that if we use Gaussian noise
instead of gamma noise -- Gaussian noise is more tightly concentrated around
the mean -- we can get a factor of root P improvement in the error. So
earlier it was P log P; we can get root P in there.
And another thing we noticed is that objective perturbation somehow respects
the structure of the problem more. So we give a tighter analysis, which
saves another factor of root P in the error.
>>: [inaudible].
>> Abhradeep Guha Thakurta: Sure.
>>: I don't understand -- it seems like if you're using linear regression
instead of SVM regression, wouldn't it be inherently private, because you're
not exposing any data features, any specific --
>> Abhradeep Guha Thakurta: So the idea is that the kind of privacy
guarantees we are working with are probabilistic guarantees.
So the first point is: no deterministic algorithm can satisfy differential
privacy; you need randomization. The point is this: suppose my algorithm
running on database D produces an output O. If I change D to D prime, since
my algorithm depends on the dataset, any nontrivial deterministic algorithm
would change its output.
So then one output has probability one under D and probability zero under D
prime, and that essentially cannot satisfy the e^epsilon bound.
And coming to this problem of linear regression, the specific reason we need
the constraint set is: if you take the gradient of the loss for linear
regression, the gradient has the theta parameter in it, and theta is the
model parameter. The gradient is like X transpose Y minus X transpose X
theta. For the privacy guarantee we need to bound the gradient, and you
cannot bound the gradient unless you have a bound on the constraint set.
For logistic regression this is not the case, because the derivative of the
logistic loss function is in terms of the logistic function, and that is
bounded by plus or minus one always.
So, yeah, these are our contributions. Now, coming to the guarantees: we get
epsilon delta differential privacy, and in the utility guarantee we save a
factor of log P -- the log P goes away -- and instead of P we get a root P
here. But we pay a price in terms of privacy, with the delta coming in.
So delta is not zero, and there's a log(1/delta) term, which is fine. And
as I told you, we can make further improvements in the dimensionality for
nice datasets: when the dataset has some nice properties, we can get a
better utility guarantee and further reduce the dependence on the
dimensionality.
So coming back to the contributions, so I'll not talk about these two parts.
But what I will tell you is how you relax the smoothness requirement. And
this is in some sense the most interesting part of the work.
So recall that if the regularizer is not differentiable, your objective
function will look something like this, and let's say this is the point of
nondifferentiability. And to work with constraints, you have a constraint
set here; this is just denoting the constraint. The optimum of the objective
function can be outside of the constraint set.
What you do is: for the regularizer, we construct a sequence of smooth
approximations, and I'll tell you how you get the smooth approximations. So
for the nondifferentiable regularizer, I get a sequence of approximations,
each of which is differentiable, and in the limit I converge to the actual
regularizer.
So with this sequence of approximations, the objective function will look
like this, because you have smoothed it out here. And slowly, as i goes to
infinity, this will start looking like the original objective function.
Additionally, we add a convex penalty term which penalizes the function
heavily when you go out of the constraint set.
Great. Now, there are two significant challenges in constructing this
sequence of regularizers; it took a reasonable amount of time to find the
right way of doing it.
First, we needed to keep the strong convexity property of the objective
unchanged. The algorithm relies on some form of strong convexity, and we
need to keep it unchanged, otherwise the privacy guarantees take a hit.
And the second challenge was that we needed to find the right notion of
convergence. So let me tell you how you even do the smooth approximation so
that we satisfy both of these challenges.
So the first thing we do is take the regularizer and convolve it with a
smooth kernel; think of convolving with a Gaussian kernel. And what we do is
reduce the width of the kernel as we go forward.
And [inaudible] convolution tells me that if my kernel is twice
differentiable or infinitely differentiable, my convolution is also
differentiable. And even if your regularizer is nondifferentiable, in the
limit this guy converges to it, because when you shrink the width of the
kernel, in the limit you get a delta function, and convolving with the delta
function gives the function itself.
So in the limit you recover the actual regularizer. And we add a convex
penalty term -- trust me, this is convex -- so that if you go out of the set
you basically make it pay. And in the i-th approximation we scale the convex
penalty with i, so that as i goes to infinity it penalizes going out of the
convex set more and more heavily.
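In symbols, the construction just described looks roughly like this (my own notation, a sketch under the stated assumptions: k_mu_i is a smooth kernel of width mu_i going to zero, and phi is the convex out-of-set penalty with weight c_i going to infinity):

```latex
J^{(i)}(\theta) \;=\; \frac{1}{N}\sum_{j=1}^{N}\ell(\theta; d_j)
\;+\; \Lambda\,\underbrace{(r * k_{\mu_i})(\theta)}_{\text{smoothed regularizer}}
\;+\; c_i\,\underbrace{\phi(\theta)}_{\text{penalty for leaving }\mathcal{C}},
\qquad \mu_i \to 0,\; c_i \to \infty .
```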
Coming to the second challenge, the right notion of convergence: you can
think of this algorithm as some kind of function which takes two parameters
-- the random noise B, and the dataset D.
What I'm doing is approximating this function with a sequence F1, ..., FK.
One of the classic ways of proving this kind of result is via convergence in
distribution: showing that these functions converge in distribution.
So as i tends to infinity, you want to say these converge in distribution.
But that does not happen here, because, say we were taking L1 regularization:
with L1 regularization, depending on this B, there is no density function for
the output, because the point of nondifferentiability will have a probability
mass. So convergence in distribution doesn't hold.
So instead, we prove weak convergence in measure. Here we need to recall
differential privacy a bit: in differential privacy we give guarantees over
probability measures of output sets.
What weak convergence in measure gives you is roughly the following argument.
Suppose my limiting algorithm were not differentially private -- that is, the
limit of F_i as i tends to infinity is not differentially private. But I
know that, with the smooth approximations, all the algorithms in the sequence
are differentially private. So the argument is: if the limit is not
differentially private, then there must be an algorithm in the sequence which
is not differentially private.
So this sets up the contrapositive and yields the contradiction: if the
limit were not differentially private, there would have to be an algorithm in
the sequence which is not differentially private. But we have already argued
that every algorithm in the sequence is differentially private, because of
the smooth approximations.
Questions? It's important to digest this part. So next, we show a
sequential closure property of differential privacy. This is a theorem which
might be of independent interest. What it says is: suppose you have a
sequence of algorithms A1, A2, and so on, all of them (epsilon, delta)
differentially private, and suppose there exists an algorithm A such that for
all datasets D the sequence converges weakly to A -- when I say weakly, it
converges point-wise. So if you get point-wise convergence of these
algorithms to A, and additionally each of the algorithms is differentially
private, then the limiting algorithm A is also (epsilon, delta)
differentially private.
So this is one of our theorems, which we used for proving this property.
Now, after having talked about output perturbation, where you perturb the
output, and objective perturbation, where you tilt the objective function and
solve it, let me give you a significantly different approach from the other
two. This is based on online convex programming: we'll be using online
learning techniques to solve this problem.
Just to give a background on what online convex programming is: you can
think of it as the following setting. There's an N round game between two
parties, the player and the challenger. In each round, the player chooses a
point theta_i which lies in some convex set. And the challenger, having seen
the point the player has chosen, gives him back a cost function L_i.
And the player has to pay a cost of L_i(theta_i). So this is a player versus
challenger kind of game.
And the player wants to perform well; that means he doesn't want to pay too
much cost. An online convex programming algorithm is one which gives the
player a strategy for choosing these points.
And the property you want from the online convex programming algorithm is
that the regret should go down to zero as you play more and more. The idea
of regret is this: theta_i is the parameter the player outputs based on the
algorithm, so the first term is the average cost the player pays. From that
we subtract the offline best: suppose the player knew all the cost functions
at once -- what is the best single theta he could have chosen? You want to
perform well against this offline best; as you see more and more samples you
want to get closer and closer to it. This is what any good online convex
programming algorithm gives you, and it's called sublinear regret. So you
want to minimize the regret. Great.
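Written out, the regret just described is (standard notation, my transcription):

```latex
\mathrm{Regret}(N) \;=\; \frac{1}{N}\sum_{i=1}^{N} L_i(\theta_i)
\;-\; \min_{\theta \in \mathcal{C}} \frac{1}{N}\sum_{i=1}^{N} L_i(\theta),
\qquad \mathrm{Regret}(N) \to 0 \text{ as } N \to \infty .
```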
Now, how do you use this framework for solving our differentially private
algorithm? You can think of the cost functions as the losses on each of the
data points: L(theta; d_i).
Additionally, we need the property from the online convex programming
algorithm that it is stable. By stability I mean: take two datasets D and D
prime which differ in exactly one entry. Then for the i-th iterate the
player outputs, the L2 distance between theta_i on D and theta_i on D prime
should go down as one over i.
So as you go farther and farther in time, the dependence on any one single
data point reduces linearly. This is the property we need, and we call such
an algorithm stable.
Additionally, we want the original OCP algorithm A to have sublinear regret,
meaning the regret should go down to zero as N goes to infinity.
With these two properties we can do something interesting. What we do is a
pretty common thing now in online learning: the online-to-batch conversion.
You run the online convex programming algorithm with these cost functions and
an appropriately chosen regularizer. The OCP algorithm gives you outputs
theta_1 to theta_N. You take the average of all these outputs and add
Gaussian noise, whose variance is roughly P over N squared. This has a
similar flavor to output perturbation, where you take the output and add
noise -- so it looks very similar to that -- but it has a much more
significant advantage over output perturbation.
And before going there, let me tell you this. I told you we need stable
online convex learning algorithms. Now, do there exist stable online
algorithms in the literature? That's the first question you want to ask.
What we found -- and in fact it's still not clear to us why this is happening
-- is that every online convex learning algorithm we picked up satisfies the
stability property we wanted. Each and every algorithm we picked up.
More precisely, the implicit gradient descent algorithm has this stability
property. The GIGA algorithm of Zinkevich in 2003 has this property,
conditioned on the cost function being differentiable -- but I'm not sure
whether that is an artifact of the proof or whether we actually need it. And
then Follow The Regularized Leader: all of them have this property.
Yeah, sure.
>>: The stability property, just as a -- you don't change your output too
much based on a single.
>> Abhradeep Guha Thakurta: Data point.
>>: Difference?
>> Abhradeep Guha Thakurta: There's something more. As you go farther in
time, the change a single point can cause goes down with time -- it goes down
as one over i. So essentially you want to say that the last point which
you're outputting almost does not depend on any single entry of your sample
set, as this gap goes to zero.
>>: Difference between two datasets gets [inaudible].
>> Abhradeep Guha Thakurta: Yes. At a very high level, what this is saying
is that the point of a learning algorithm is to get something which I can use
on a new sample point. In that sense, I should not be too tied down to one
data point; this is essentially saying that I should not overfit on one data
point. Although it is not clear -- it's an open problem of sorts -- whether
there is something fundamentally deeper going on, so that we could say that,
with high probability, any algorithm that is differentially private has
sublinear regret or vice versa. I don't know. So what we show is that our
OCP-based algorithm is epsilon delta differentially private, and the utility
guarantee is very similar to objective perturbation, except that we get this
poly log N term for technical reasons. But that is fine, because a logarithm
in the sample size is fine. The key property of this algorithm, and why we
looked at it, is that it is much more practical than objective and output
perturbation.
If you look at the privacy guarantee, it does not really depend on you
minimizing any objective function. The OCP algorithm essentially works like
gradient descent, and what you're outputting is essentially an average with
noise added to it. So the privacy guarantee holds no matter what you do;
wherever you stop, the privacy guarantee holds. With these two things in
mind, the question now is which one is more useful. I don't have a very
clear answer, but let's try to see.
So there are two approaches we designed: one is objective perturbation and
the other is OCP-based. Objective perturbation requires stronger assumptions
than online convex programming: in our work we allowed the regularizer to be
nondifferentiable, but objective perturbation requires the empirical risk to
be differentiable. And we can prove that this is necessary -- it's not a
limitation of the proof technique, it is necessary for the algorithm.
But at that cost, you get better utility guarantees; it exploits the
structure of the problem better than output perturbation or OCP does.
Contrary to that, the OCP algorithm is much more practical, in the sense that
you do not need to bother about whether the objective function is actually
minimized or how well, say, your convex optimization package is working. You
just take a bunch of outputs and take the average.
And, yeah, it is more meaningful. So I haven't pulled in output perturbation
here, because OCP performs unconditionally better than output perturbation --
up to a poly log factor dependence on the dataset size, but that's fine, it
performs unconditionally better. So the comparison is essentially between
objective perturbation and OCP.
>>: You're comparing a batch algorithm with an --
>> Abhradeep Guha Thakurta: No, this is a batch algorithm. OCP is a
technique I'm using for solving the batch problem. So let me make this
clear. I have a sequence of data points. OCP is an online algorithm, but
I'm pretending the data points are coming online while I'm solving a batch
problem. So the batch setting is existing work, and it just so happened that
people never looked at using online algorithms for batch [inaudible]
algorithms.
So in this talk so far, I've told you about the low dimensional setting:
three approaches -- output perturbation, objective perturbation and online
convex programming -- and a comparative study of the three. Next is the
first part on high dimensions: sparse regression in high dimensions.
So the idea of high dimension is that the dimensionality of the problem is
much larger than the number of sample points you have. And this makes the
problem significantly challenging. Why is that so?
If you use the previous techniques I told you about -- output, objective or
OCP-based -- the error scales polynomially in the dimensionality and
inversely in the sample size. But the dimensionality of the problem is
significantly higher than the number of samples you have. So to get any
consistent estimator, you need to control the dependence on the
dimensionality significantly: what you want is something logarithmic in it.
What we show is that in the high dimensional case, for sparse problems and
specifically for linear problems, the error scales polynomially in the
logarithm of the dimensionality -- polylog of the dimension.
And to do this we assume that the underlying model parameter we're trying to
learn is sparse. Sparse meaning it has very few nonzero entries.
So this is a picture of what sparse regression is. The number of dimensions
of the problem is much more than the number of samples you have.
Essentially, in empirical risk minimization problems we are minimizing some
kind of objective function.
Now, when the dimensionality of the problem is much larger than the number of
sample points you have, the objective function looks something like this.
You can view it as a boat, with a lot of flat points in the middle. Any of
the model parameters there will work equally well on the dataset. But recall
that your goal is to do well on a new sample you get, not on the dataset. So
for that you need to choose the right model parameter.
And if I give you the problem in its original form, with the dimensionality
larger than the number of samples, the problem is extremely [inaudible]. So
how do you go about it? It so happens that statisticians observed that if
the underlying model parameter you're trying to learn is sparse -- meaning it
has very few nonzero entries -- then you can restrict your search to
parameter vectors which have this sparsity property. In that case you can
constrain your search space significantly, rather than searching over this
entire flat region.
This was mostly done by statisticians; from a computer science perspective,
this is essentially a feature selection problem: given a bunch of features,
you want to select a small set of relevant features.
For doing this, we design two algorithms, and both have a very similar
flavor. First you run an algorithm to select a bunch of features. Then, in
the second stage, we use the idea of objective perturbation: you restrict the
problem to that small subset of features and run objective perturbation on
it, because once you have the features out, you're back in the low
dimensional setting.
For the selection step we use two approaches. One is an exponential sampling
based algorithm, which is roughly based on some of the ideas from [inaudible]
in 2007, and the other is a subsampling based approach, which loosely relates
to the sample-and-aggregate framework from [inaudible] in 2007.
In this talk I'll specifically speak about the subsampling based approach.
The exponential sampling one is in some sense the first-cut solution you get,
but it is not computationally efficient; the subsampling one is
computationally efficient, and that is what I'm going to speak about.
What the subsampling-based algorithm does is: you take the complete dataset
and break it up into root N blocks. Then you run your favorite feature
selection algorithm on each of the blocks.
Each block gives you a bunch of features which it thinks are important, and
from those outputs you take a noisy vote -- I'll tell you how. This is at a
high level; you can plug in any feature selection algorithm you want, the
scheme is essentially oblivious to it. So block one votes for a bunch of
features -- S features it votes for. Then block two does the same, and so on
up to block root N.
Now what you do is take the votes each feature has got from the different
root N blocks and add noise on the order of S by epsilon to each count.
These are the noisy votes you have. You select the top S features in terms
of these noisy counts, and the claim is that this is differentially private.
>>: This count --
>> Abhradeep Guha Thakurta: So these guys vote for it. A vote is like plus
one or zero. So each feature will have a bunch of votes.
>>: The count already has --
>> Abhradeep Guha Thakurta: Yes, for the votes.
>>: When you say add noise --
>> Abhradeep Guha Thakurta: Add noise in the sense -- yeah, there I was kind
of sloppy. Adding noise essentially means the count plus noise, where the
scale of the noise is S by epsilon.
>>: You say the adding of the noise is different -- you're not actually
exposing that?
>> Abhradeep Guha Thakurta: Yes, I am exposing something. What I'm exposing
is the coordinates which I'm choosing: it can be shown to the world that
these coordinates are the thing I'm reducing the problem down to. And this
is differentially private -- epsilon differentially private.
So we take this general scheme and study it for linear regression. In linear
regression you have a design matrix, which is N cross P, a model parameter,
and then additive noise, and the response is Y. The hatched part you can
think of as personal information about a person: this is like whether he has
AIDS, and these are the feature vectors.
The way you solve the linear regression problem: you want to find theta
star, and what you do is minimize this objective function, which gives you an
estimate theta hat of theta star. The minimization is over the L2 norm of
the difference between Y and X theta, plus an appropriately chosen
regularizer -- for different problems you choose different things. So there
are different variants: LASSO, ridge regression, et cetera.
So in this setting -- theta star the underlying model parameter, X the design
matrix, Y the output vector -- the exponential sampling algorithm requires
around S cubed log P samples, and the subsampling based algorithm requires
around S squared log squared P samples.
If you note, both algorithms have sample complexity of poly(S) log P. So I
have pulled the dependence down from P to log P, at the cost of a polynomial
dependence on the sparsity, but that's all right.
So now, even if P grows faster than N, I'll still get a good convergence
rate, because N only needs to grow as log P. This bypasses some of the
impossibility results in the literature, which talk about deterministic
algorithms not being differentially private; but since we are working with
randomized algorithms, that's fine. So those are the details.
So in this talk so far: low dimensions -- output perturbation, objective
perturbation, online convex programming.
>>: S is smaller than --
>> Abhradeep Guha Thakurta: S is smaller than N.
>>: Smaller than N, but no requirement on being square root or --
>> Abhradeep Guha Thakurta: No.
So in the high dimension, we have the dimension of the problem larger than
the number of samples, and we give differentially private sparse regression.
You should tell me when I should stop.
>>: S was the number of non-zero entries.
>> Abhradeep Guha Thakurta: Entries in the underlying parameter -- in this
case, the number of nonzeros. One very common example people use: suppose
you are taking photos of stars in the night sky. Not every place will have a
star; very few places will have a star. So for the model you're trying to
learn, you can assume there are very few [inaudible]. And that's essentially
why people working with astronomical data use sparse regression quite a bit.
So I don't know. Have I gone over time? Not time? So I can stop at this
point also. Sure, I'll go. This will take another 10, 15 minutes I guess.
So after having talked about the problem where you have the samples in batch,
now I'll tell you how you do it online, with samples coming to you one after
another.
So the data points are [inaudible] the cost functions, as I told you in how
you map it, and they come online. In this setting -- this is work with
Prateek Jain, Pravesh Kothari and me in 2012 -- we give a very generic
transformation: you give me an OCP algorithm, I plug it into the system, and
the system gives a differentially private variant of it.
Moreover, if the original OCP algorithm had sublinear regret -- you recall,
that was about how well I perform against the best possible offline choice --
the private version will also have sublinear regret. And additionally we
show that if the loss function has some properties, like the quadratic loss
or the linear loss, then we can get a private variant with almost optimal
nonprivate regret.
So the nonprivate regret is around log T upon T, or log N upon N, and we get
almost close to that.
This is just a refresher of the structure of the online setting -- now we're
actually working in the online setting. There's a player and a challenger.
The player chooses a point in the convex set, the challenger chooses a cost
function, you pay the cost, and you accumulate regret. And you want to
minimize that regret.
So the issue with online algorithms is this: the model parameters theta_i,
when we were working in the batch setting, were kept within the system, and
then you took an average and added noise to the output. But here what you're
outputting is the model parameter itself [inaudible], so this is much more
interactive compared to the off-line batch setting. You have to update the
model parameter every time you see a data point, and output it. And these
cost functions can contain sensitive information: for example, for online
classification, this can be the label of whether the person has AIDS or not,
and this is the feature vector.
And you need to guarantee privacy across all the possible outputs. The point
is that if you have N points, you need privacy across all N outputs, and this
is pretty complicated [inaudible]. I'll show you, via a toy attack, how the
online problem becomes challenging.
Suppose there is this adversary, the owner of some perfume brand -- let's
call it ABC. And suppose you can also find out whether a person who clicks
on this ad is a male or a female. So a male comes, types 'perfume' into
Bing, and Bing shows this ad ABC to him.
Now, he clicks on it, and the adversary gets [inaudible] from the click; and
at the same time the adversary knows this person clicked on it. Now, he
wants to find out whether this person is male or female. On Bing's side,
Bing sees that a male has come, searched for perfume, and clicked on this ad
ABC, so it might decide that ABC should be reranked and given a higher rating
-- say, the [inaudible] of this ad for all male users. This is an example;
the real system is much more complicated. So this ad gets reranked. Now,
after the male clicked, the adversary immediately goes to his male profile --
he had made a bunch of profiles earlier -- and sees that for males the
ranking has increased. Then he immediately goes to his female profile: it
hasn't increased. So it must have been a male.
So the meta point is online learning tends to make quick decisions. Every
time you see some input you make a decision. Or you change the system.
And that might compromise privacy. So the key thing we did was to design
online learning algorithms with controlled changes: you do not want to make
sharp updates based on one single entry.
And additionally we added appropriate noise: the idea is that when you output
these model parameters theta to the external world, you add some noise to
them.
Now, adding noise is a pretty common thing in differential privacy --
everyone everywhere adds noise. What's the interesting part here? The
interesting thing is that the number of inputs is equal to the number of
outputs. You are not outputting a single thing; you are outputting a
sequence of outputs, exactly as many as the inputs you have. It is
significantly challenging to guarantee privacy in that setting.
So the challenge is that privacy should be guaranteed across all of these.
Again, for privacy we use differential privacy, so you want the whole
sequence of outputs to be differentially private. Let me recall the
intuition, because it matters here: the presence of a particular data point
should not be visible to the observer, so the output sequences should be
essentially indistinguishable.
For online learning, we again use this framework of online convex
programming. OCP algorithms usually have sublinear regret, meaning that as N
goes to infinity the regret goes to zero. Additionally they have this
property called explore versus exploit. Roughly, explore versus exploit says
that when I see a new point I will explore it, but I will not discard what I
have learned from the previous points -- that's the exploit part. And this
explore versus exploit behavior does not allow you to make sharp updates. If
you think about it, the intuition for privacy and this explore versus exploit
philosophy are pretty much in sync, and in our analysis and proofs we
essentially use this idea. Although, again, it is not clear to us whether
this property is specific to online learning or holds more generally in
learning, because essentially in learning [inaudible] you do not want to
depend on one single point. We don't know. But for online learning at least
this is true.
Earlier in the talk I gave you a brief idea of what is there in the off-line,
batch setting. So people have analyzed off-line learning with privacy a bit,
but this is almost untouched in the online scenario. There's only one
notable paper by [inaudible] et al. in 2010, which handled online experts --
a very specific kind of online problem -- with sublinear regret. But that
does not extend to general online convex programming, and there comes the
challenge.
Again, recall that the OCP algorithm we're dealing with should have sublinear
regret, and it needs to be stable. Stable meaning that as time progresses,
the dependence of the i-th parameter on any single data point or cost
function should go down -- it should go down as 1 over i.
Now, in the online algorithm, what you do is this: we maintain two copies of
the model parameter. One we keep with the algorithm, and one we output to
the external world. The noise we add for outputting is roughly scaled as
root I upon N, where I is the i-th iterate that I'm [inaudible]. So the
clean model parameter theta is used for updating, and the noisy parameter
which I output to the outside is used for predictions.
That can be used, let's say by Bing, to find out what [inaudible] to show.
Now, for privacy, the issue is that you have to guarantee privacy across all
iterations. There is this result by Dwork et al. in 2006, the initial paper
on differential privacy, which says that if I have a sequence of outputs,
each of them epsilon differentially private, then my complete sequence of N
outputs is (N times epsilon) differentially private. But that's not enough
for us to get sublinear regret, because if the privacy parameter scales as N,
then we can show that you will not get regret which goes down to zero with N.
So we needed a stronger composition result. What we did was use ideas from
recent papers -- Dwork et al. in 2010 and Hardt and Rothblum in 2010 -- and
modify that analysis for our setting to get a composition result which
roughly says that a sequence of N outputs is not (N epsilon) differentially
private but (root N times epsilon) differentially private. And that
essentially allows us to get sublinear regret, gaining a factor of root N
over N. So we guarantee epsilon delta differential privacy for stable OCP
algorithms. The regret for the private variant is the regret of the
nonprivate OCP algorithm plus a term of log squared N upon root N -- so you
can think of the overhead as scaling as log squared N upon root N. And it's
trivial to see that if the original algorithm has regret tending to zero,
then the private one also has regret tending to zero.
This is just a toy scenario of how it works. This is the model parameter;
this is the learner, which you can take to be Bing. A user comes with a
profile -- say he types 'perfume' -- and this is the data. The system gives
out an ad based on the user profile and the model parameter. And based on
whether this guy clicks it or not, the system incurs a loss or cost. This
goes on for a bunch of iterations. Now, the system updates to theta two, but
it keeps theta two to itself; what it shows to the external world is theta
two plus [inaudible] noise.
And this goes on for theta three and so on. That's how it works. Coming to
the quadratic loss: if the cost function is of this form, then we get our
solution to have the -- yeah, sure.
>>: When you say it updates according to the original theta two, does it mean
in terms of [inaudible] lowering the rank of the plan, we actually use the
true information?
>> Abhradeep Guha Thakurta: No, it will use this.
>>: It will use the --
>> Abhradeep Guha Thakurta: The private version. The true one is only used
internally by the system for generating the updates; whatever is shown to the
external world goes through this. So -- yeah, for linear regression, we can
get the regret down to log N upon N instead of one upon root N. Earlier it
was one upon root N; we get it down to log N upon N.
And for this we heavily use the structure of the problem; this uses one
technique by Dwork et al., and it's an improvement over 1 over root N. So
that brings us to the end of the talk. What I've told you: in low dimensions
I've shown you three approaches for empirical risk minimization in the batch
setting. These two algorithms are the most promising ones -- this one has
better utility guarantees, and online convex programming is more practical
because the privacy guarantee holds no matter what you do. In the high
dimensional setting, I told you about the first line of work for sparse
regression in high dimensions, and I spoke about two different algorithms, an
exponential sampling based one and a subsampling based one, for which I gave
you details of this voting kind of scheme, with sample complexity scaling as
poly(S) log P. For the online part, I gave you a generic framework for
translating your favorite OCP algorithm into a differentially private one,
and I showed -- I basically kind of waved my hands over it -- how to get
tighter bounds for online linear regression, log N upon N instead of one upon
root N.
In terms of future work: in the batch setting, high dimensions, we have made
progress. If you forget about privacy, the sample complexity required is
S log P, but the best we are getting is S squared log P, and the current
analysis will not allow us to get S log P. The question arises whether this
gap between S log P and S squared log P for private versus nonprivate is
necessary, or whether we can actually bridge it. We have made some initial
progress, and the short answer is probably yes, we can bridge that gap, but
it requires significantly different analysis techniques. For online
learning: if each of the cost functions, the loss functions, were really
nice -- if they were strongly convex -- then the nonprivate regret is known
to scale as one over N, but our algorithm gives you one over root N. So the
question is whether we can bridge that gap, and it would be a significant
improvement over the current algorithm if we can.
And the long-term goal is to understand the relationship between differential
privacy and learning, because the whole intention of learning is not to
overfit your model to one single sample point, and differential privacy is
essentially [inaudible]. Can we formalize that? In learning this notion is
actually called stability, and there are different versions of it. Can we
formalize it and establish a deeper connection there? And with that I want
to end.
[applause]
Questions?