>> Eyal Lubetzky: Okay. Hi everyone. Today we are very happy to have Ilias Diakonikolas from
Columbia University, a Ph.D. student of Mihalis Yannakakis. He's going to give a talk on threshold
functions, approximation, learning, and pseudorandomness.
>> Ilias Diakonikolas: Thanks, Eyal. Can you hear me? It's like a small room, I guess.
Thank you for inviting me. So this talk -- okay. You see the title. It's joint work with a large
number of people. I will mention them, I guess, during the talk.
Okay. So let me start. So first I will start with regression. I will talk about multi-objective
optimization for like five minutes. This is my thesis topic. But I decided to spend my time here on
like an equally important part of my research regarding threshold functions and applied
probability.
So let me spend a few minutes on that. And I would be happy to discuss more about it with
everyone interested in the one-on-one meetings. So what is multi-objective optimization?
Usually we are used to having some space, having one objective function and trying to find a
solution efficiently that optimizes this function.
But in life, things are more complicated. In practice we have more than one criterion.
So what do we do in this case? Usually in computer science, traditionally, we choose somehow
some combined function: we combine the objectives somehow and then optimize this function. So
essentially we reduce a multi-objective optimization problem to a single-objective problem.
However, in general, this approach, you know, doesn't really make sense, because in many
practical settings we don't really know what the combining function should be.
And this is the usual approach in multi-objective optimization. What do we do? What are we
interested in? We're interested in a more complicated object, the set of Pareto optimal solutions.
So all the solutions in the objective space that are undominated in some sense.
However, this approach of computing the Pareto set is practically intractable, even for the most
simple problems, say shortest path with two criteria, since the Pareto set can be exponentially
large in the size of the input. So we cannot really compute it. So the underlying goal in this area is to
somehow efficiently find some approximation, some succinct approximation to the solution space;
that is, to the Pareto set. And do this efficiently.
Okay. So this has been the focus of my Ph.D. thesis. And this is an example of a multi-objective
problem. So we have a graph, a source, a sink, and every edge has two different weights,
like length and cost. I want to find the path that's as short as possible and as cheap as possible.
However, these two goals are conflicting. There's no single path that optimizes both of them
simultaneously. So in the objective space this set of points is the Pareto set. And somehow I
want to efficiently approximate it.
So the important observation here is that this set is not given to us explicitly. Otherwise things
would be a bit easier. It's given implicitly through
the instance. So we have to find a way to approximate it in polynomial time without explicitly
constructing the entire set.
Okay. I think this is it. And I'll be happy to discuss in person.
Okay. So now let me move to the actual talk. The talk will be about linear threshold functions
and, more generally, low-degree polynomial threshold functions. And let me start
by defining these functions.
What is a linear threshold function? It's a boolean function. The domain is the hypercube, the range
is {0,1} or {-1,1}, and it's expressible in this form, the sign of an affine form. So essentially what is
it? It's a partition of the hypercube by a hyperplane. So one side of the hyperplane is the plus
points, the other side is the minus points. So these functions, you know, even though they might
look very simple, even very basic questions about them are really challenging
to answer. And they have been studied in various fields because they are related to important
problems in these fields.
In particular, perhaps the most important such field, and the one most appealing
to you, might be machine learning, where we have the Perceptron algorithm, the Winnow
algorithm, support vector machines and boosting. So all these algorithms and notions
are intimately related to linear threshold functions. And perhaps the problem of learning an
unknown such function has been one of the most influential problems in machine learning, both in
theory and in practice.
In complexity theory, I mean, they appear in many different settings. One of
the most embarrassing open problems in circuit complexity is this. I give you a depth-two
circuit of polynomial size whose gates are linear threshold functions.
Try to find a lower bound against this class of circuits, depth-two threshold circuits. We know no
lower bound. It's conceivable that this class contains all of NP. Of course, this is ridiculous; it
shouldn't. But we don't know how to prove a lower bound.
>>: So a lower bound on the --
>> Ilias Diakonikolas: So we need to find some function in NP that is not computable by such a
circuit, by a polynomial-size threshold circuit. We don't know of any such function. So there are
conjectures. We know how to solve special cases of this problem but we don't know how to solve
the general problem.
So this is one. This class is called TC0. But even some other results you might be familiar with,
like hardness of approximation, the Majority Is Stablest theorem, all this stuff, they have
intimate connections to linear threshold functions. And also, you know, a field outside computer
science, or I guess at the intersection: social choice theory, where these functions are viewed as
voting schemes. So these x_i's here, the variables, are the voters. So everyone votes
plus one or minus one, and the weights essentially represent the influence of a voter.
Okay. So, okay, this is the definition. These functions are important. So now let me mention some
very basic facts. I care about these functions over the hypercube for the purpose of this talk.
Some very basic facts: since the domain is finite, et cetera, we can assume the
weights, the w_i and theta, are integers. But, unfortunately, they have to be exponentially
large to be able to represent every such function.
Many common functions you have seen in your life belong to this class;
perhaps the most common one is the majority function.
Okay. So a generalization of linear threshold functions is the class of polynomial threshold
functions. So these are functions that are again over the cube but are expressible as the sign of a
low-degree polynomial. For the purpose of this talk, d will denote the degree of this polynomial; it
will be assumed to be some absolute constant.
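To fix notation, here is how I would write the two definitions in symbols (a sketch in my notation; the talk uses w_i, theta and p informally):

```latex
% linear threshold function (halfspace) on the hypercube
f(x) = \mathrm{sign}\bigl(w_1 x_1 + \dots + w_n x_n - \theta\bigr), \qquad x \in \{-1,1\}^n,
% degree-d polynomial threshold function: the sign of a degree-d multilinear polynomial
f(x) = \mathrm{sign}\bigl(p(x)\bigr), \qquad \deg(p) \le d, \quad d = O(1).
```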
Okay. So now let me move to my first set of results: analyzing the sensitivity of low-degree
polynomial threshold functions. So I need to define the notion of the influence of a variable in a
boolean function. I'm sure most of you in this group have seen this before.
Okay. So what is the sensitivity of an input in a boolean function? We
take x and we see how many of its neighbors get a different value under f. Okay. So
how many edges of the hypercube at x are bichromatic. Then we take the expectation of this over
all x, where x is uniformly distributed.
Okay. For every function f this notion of the average sensitivity is between zero and n. An
alternate, equivalent way to view this is via the edges in the i-th direction: the influence of
the i-th variable in the function f is the probability, over a random x, that if we flip the i-th bit of x
the value of the function changes. And the total influence, the sum of the individual influences, is
easily seen to be equal to the average sensitivity, the notion I defined before. So this notion is well
understood for linear threshold functions.
It's actually very easy to show that every linear threshold function has average sensitivity at most
square root of n. And this is actually true for every monotone function, even for every unate
function. And the majority function is the unique maximizer. So every linear threshold function has
average sensitivity at most square root of n, and the majority function attains this. So the bound is tight.
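As a concrete illustration (a small brute-force sketch of my own, not from the talk), one can check the square-root-of-n behavior for small n directly:

```python
# Minimal brute-force sketch: compute the average sensitivity
# AS(f) = sum_i Pr_x[f(x) != f(x with bit i flipped)] for the majority
# function on n bits, and compare it with sqrt(n).
from itertools import product
from math import sqrt

def majority(x):
    # x is a tuple of +1/-1 values; n is odd, so there are no ties
    return 1 if sum(x) > 0 else -1

def average_sensitivity(f, n):
    total = 0
    for x in product([-1, 1], repeat=n):
        fx = f(x)
        for i in range(n):
            y = list(x)
            y[i] = -y[i]
            if f(tuple(y)) != fx:
                total += 1
    # averaging the number of sensitive coordinates over the 2^n inputs
    # is exactly the sum-of-influences definition
    return total / 2 ** n

for n in [3, 5, 7, 9]:
    print(n, average_sensitivity(majority, n), sqrt(n))
```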
So I care about this question for higher degrees. So what is the average sensitivity of degree-d
polynomial threshold functions?
So until very recently nothing was known about this, nothing nontrivial, nothing beyond the trivial
upper bound of n. And this question has an obvious combinatorial interpretation: it is the
number of edges of the hypercube that are sliced by the polynomial surface p = 0.
So let us do a simple example. As we said, the majority function essentially
attains the worst case, at least for the linear case, and this is
an example of a degree-two threshold function.
Okay. And, as I said, for every linear threshold function this upper bound is tight.
You know, the corresponding question for degree d is actually still open. We
have made some progress, which I will describe. So Gotsman and Linial conjectured in 1994 that
essentially the average sensitivity of any degree-d PTF is at most d times square root of n, and in
particular that the symmetric function that slices the middle d layers of the cube is the worst case.
So this function actually attains this upper bound. This question is still open even for d equals 2,
but we are able to show something. We were able to make the first progress, in joint work with
Raghavendra, Servedio and Tan. We are able to show an upper bound of n to the
1 minus 1 over 5d. So this is nontrivial for every constant d, and in fact even for d up to, I don't
know, some power of log n.
And we also have a very different proof that gives a better bound when d is small; for example,
for degree two we can get an upper bound of n to the three-quarters.
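In symbols, the conjecture and the bound just described read roughly as follows (my paraphrase; constants are suppressed, and the exponent is as stated in the talk):

```latex
% Gotsman-Linial conjecture (1994): for every degree-d PTF f on n variables,
\mathrm{AS}(f) \;\le\; O(d)\cdot\sqrt{n},
% with the symmetric PTF slicing the middle d layers of the cube as the conjectured extremal example.
% Bound from the talk (with Raghavendra, Servedio and Tan; independently Harsha, Klivans and Meka):
\mathrm{AS}(f) \;\le\; 2^{O(d)} \cdot n^{\,1 - 1/(5d)}.
```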
>>: So does that constant get bigger as d goes to infinity?
>> Ilias Diakonikolas: Yes. So it's like a hundred to the d. It's a hundred to the d.
Okay. I want to mention that similar results were proved independently by these guys, Harsha,
Klivans and Meka, using relatively similar techniques. So, okay, we don't prove the conjecture,
but we make the first progress on this open problem, and actually this upper bound
suffices to get some nice learning applications, which I will mention later.
So the first application is to noise sensitivity. Eyal is here, so I don't know what I should say.
Okay. What is the noise sensitivity of a boolean function? I take x and flip each bit of the input
independently with probability epsilon. So this corresponds to adding some random noise to the
input. And I want to calculate the probability that the value of the function changes. Okay. So
x is uniform here, and y is obtained from x by flipping each bit with probability epsilon.
Okay. This probability is the noise sensitivity of f at noise rate epsilon. So this has been a
very influential notion in several areas of mathematics. I guess most of you are experts
in some of these areas.
So I won't say more. So, okay, so this is the definition. Now, one basic observation is
that if we have an upper bound on the noise sensitivity of a boolean function, then this
translates relatively easily to an upper bound on the average sensitivity. So in particular it is
essentially straightforward that the average sensitivity of f, for any boolean function, is at most
order of n times the noise sensitivity at rate 1 over n. This is essentially trivial and true for every
boolean function. So what we're able to show is a converse for the class of degree-d PTFs,
which is obviously not true in general. So we prove that any upper bound on the average
sensitivity of degree-d PTFs translates to a similar upper bound on the noise sensitivity. And the
proof is essentially inspired by Peres's proof for linear threshold functions; it's actually very similar.
So this is a reduction. So what does it say? That if AS(n, d) is the maximum average
sensitivity of a degree-d PTF on n variables, then the noise sensitivity of any degree-d PTF at
noise rate epsilon is at most epsilon times the average sensitivity where n is 1 over epsilon.
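Written out, roughly (a paraphrase of the statement, with AS(n, d) as just defined):

```latex
% noise sensitivity at rate \epsilon: x uniform on \{-1,1\}^n, y obtained from x by flipping each bit
% independently with probability \epsilon
\mathrm{NS}_\epsilon(f) \;=\; \Pr_{x,y}\bigl[f(x) \ne f(y)\bigr],
% the trivial direction, valid for every boolean function f:
\mathrm{AS}(f) \;\le\; O(n)\cdot \mathrm{NS}_{1/n}(f),
% and the converse for degree-d PTFs (the Peres-style reduction from the talk):
\mathrm{NS}_\epsilon(f) \;\le\; O(\epsilon)\cdot \mathrm{AS}\bigl(O(1/\epsilon),\, d\bigr).
```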
Okay. And so this yields an upper bound of essentially epsilon to the 1 over 5d on the noise
sensitivity of degree-d PTFs, the first nontrivial upper bound on this quantity, too, and note it is
independent of n. And in particular for me this is interesting mostly because it gives two learning
applications, essentially immediately.
So one of them -- so I don't know if you're familiar with these things, with learning,
but it's like straightforward that an upper bound on the noise sensitivity implies Fourier
concentration, implies that the Fourier spectrum of the boolean function has very little mass above
some level.
And this kind of condition, Fourier concentration, is related to learning
algorithms: if we can prove concentration for a class of functions, then we can learn it, roughly speaking.
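For intuition, the standard connection, as I recall it from the low-degree-algorithm literature (e.g. Klivans-O'Donnell-Servedio and Kalai-Klivans-Mansour-Servedio; constants are only indicative), is:

```latex
% noise sensitivity controls the Fourier tail:
\sum_{|S| \ge k} \hat{f}(S)^2 \;\le\; O\bigl(\mathrm{NS}_{1/k}(f)\bigr),
% so a bound \mathrm{NS}_\epsilon(f) \le \epsilon^{\Omega(1/d)} gives concentration at level
% k = (1/\epsilon)^{O(d)}, and L1 polynomial regression over the monomials of degree at most k
% then yields agnostic learning in time roughly n^{O(k)}.
```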
Okay. And in particular our result implies that the class of degree-d PTFs, for any constant d, is
learnable in polynomial time in the agnostic model. The agnostic model is one of the most
realistic models of learning; it incorporates adversarial noise.
Okay. If there is no noise, it was known that degree-d PTFs are learnable. But in practice there is
noise, and based on the bound on noise sensitivity we can get this learning application. This is
the first application. I can say more about it if you're interested, but it is essentially
straightforward from known techniques given the upper bound.
The second application is PAC learning a class of circuits, in particular constant-depth circuits
with a super-constant number of threshold gates. I don't know, perhaps this problem seems a bit
obscure to you, but it has been an open problem in learning for a while, since at least 2000, to do
this. It was known how to learn this class of functions when there was one majority gate at the top,
but it wasn't known for arbitrary threshold gates. And actually there is an intrinsic difference
between these two cases.
Okay. So this is the result.
>>: Can you go back to the previous slide?
>> Ilias Diakonikolas: So it's not polynomial. It's exponential. But it's the best we know. The
previous was trivial.
So the idea here to learn these circuits is to hit them with a random restriction; what you get will
be low degree with high probability, and then Fourier concentration, elementary machinery and the
upper bound give the result. It's not immediate. And actually I was not involved with this work; I
just mention it as an application.
>>: This is by --
>> Ilias Diakonikolas: Gopalan and Servedio. The slides were made in haste. I shall continue.
I'll give you some rough idea of the proof. So basically the main open problem here is to actually
prove the Gotsman-Linial conjecture, to prove that for any degree-d PTF the average sensitivity is at
most d times square root of n. It would be nice; it's a very natural conjecture and I would like to
solve it.
It actually has some other implications, but I don't know if I have the time to go into it. I can
discuss it later.
Okay. So let me give you a very rough idea of how the proof goes. And actually this pattern
is a recipe that applies to essentially all the problems I'm talking about here related to degree-d
PTFs. So what is a degree-d PTF? It's the sign of a degree-d multilinear polynomial. So I
want to analyze it. How will I analyze it? Well, I don't know. So what is natural to do? If
the x_i's, the random variables in the polynomial, were Gaussian, I would be able to say
something. I know Gaussians are nice. I know many things about low-degree polynomials
over Gaussians. Unfortunately, this is not the case. So what we do is break it into two cases.
First, consider the case of regular degree-d threshold functions, where regular means it essentially
behaves as if we were in the Gaussian setting, approximately, and then I try to reduce the general
case to the regular case. So this reduction will lose something, but that's life. So this is the
general recipe. It applies to this problem and also to many other problems,
in particular to the problem we will discuss today. So again: first solve the regular case,
and I will say in a moment what regular means; then reduce the general case to the regular case.
In particular, what is regular here? Regular means that I look at the influences. So I want to -- let
me use a board for a bit. How much time do I have?
>>: Still have time.
>> Ilias Diakonikolas: Okay. I just want to -- there's lots of material. That's why I want to -- so I
have the function f, okay, the boolean function. Can you see this? That is the degree-d PTF,
the sign of a degree-d polynomial. And I look at the influences of the variables in this
polynomial,
the influences of the variables in p. Okay. So if all these influences are small, then by using the
invariance principle of [indiscernible] I can essentially relate the distribution of p(x), where x is
Bernoulli, to the distribution of p(Z), where Z is Gaussian. So these distributions are close to
each other, up to an error that depends on this tau, which is the regularity parameter.
Since I have this, these two distributions are close, I can deduce that p(x), okay, is actually
anti-concentrated. So the probability that it puts substantial mass in a small interval is small.
The reason I know this is because I know this is true for low-degree polynomials over Gaussians.
So this is the main intuition that we use. And this essentially suffices to prove the result
for the regular case. I won't go over the definition of the influence; I'm sure all of you know it
already. Okay. So under this regularity constraint the argument is simple. We can
easily show, by a combination of a degree-d Chernoff bound, which follows from hypercontractivity,
and an anti-concentration bound, which follows from invariance plus the same property for
low-degree polynomials over Gaussians, that the average sensitivity of low-influence degree-d PTFs
is small. It's a very simple, very simple argument.
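Schematically, the two ingredients look like this (my paraphrase; the second is the Carbery-Wright-type Gaussian anti-concentration inequality, and all constants are only indicative):

```latex
% (1) concentration for a degree-d polynomial p, via hypercontractivity:
\Pr_x\bigl[\,|p(x)| \ge t\,\|p\|_2\,\bigr] \;\le\; e^{-\Omega(t^{2/d})} \quad (t \ge C_d),
% (2) Gaussian anti-concentration (Carbery-Wright type), for \|p\|_2 = 1 and any point t:
\Pr_{Z \sim N(0,I_n)}\bigl[\,|p(Z) - t| \le \epsilon\,\bigr] \;\le\; O(d)\,\epsilon^{1/d},
% and the invariance principle transfers (2) from Gaussian Z to uniform x in \{-1,1\}^n,
% up to an error that goes to 0 with the regularity parameter \tau.
```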
The difficult thing is to actually do it in general. To actually reduce the general case to the regular
case, and this requires like the new machinery. So how do we do this?
>>: I think you went too fast. So how would you -- go back up one more slide. So everybody
understands. One more slide.
>> Ilias Diakonikolas: So are you interested in this? I think I can give it to you. I just don't think
that I will have the time to do it.
>>: So maybe off --
>> Ilias Diakonikolas: Okay. I can go through it. So basically we break up -- so I want to upper
bound the influence of the first variable, okay, the influence of the first variable in the threshold
function f equals sign of p. What do we do? We write the polynomial like this, and
essentially by degree-d Chernoff bounds we know that the probability that this part
is large is small, and the reason is that its L2 norm is small,
essentially bounded by the influence.
So the probability that this is big is small. That's the first term; and the second term, the
probability that the polynomial is very close to the threshold, is also small.
So by taking a union bound we upper bound the influence of the first variable, and we multiply by n
and get the overall bound. Make sense?
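Here is my reading of that argument in symbols (a sketch, up to constant factors; t is a threshold to be optimized):

```latex
% write p(x) = x_1 q(x_2,\dots,x_n) + r(x_2,\dots,x_n); flipping x_1 changes p by 2 x_1 q, so the
% sign can only change when |q| is large or when p is close to the threshold:
\mathrm{Inf}_1\bigl(\mathrm{sign}(p)\bigr) \;\le\; \Pr\bigl[\,|q(x)| \ge t\,\bigr] \;+\; \Pr\bigl[\,|p(x)| \le 2t\,\bigr],
% the first term is small by the concentration bound, since \|q\|_2^2 = \mathrm{Inf}_1(p) is small
% in the regular case, and the second term is small by the anti-concentration bound above.
```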
Okay. So now let me say a few words about the reduction, which I think is the most interesting
part, which is why I skipped the previous things.
So what do we do? Okay. So I have a general degree-d PTF, and it's not regular. What does this
mean? It means there's some variable in the polynomial p that defines it that has large
influence. So the structural lemma that we prove is this, roughly: there is a way to restrict
a small number of variables such that the restricted function is sufficiently regular with
constant probability.
In particular, there is a small set of variables, roughly log n over tau where tau is the regularity
parameter, such that a random restriction of this set of variables is tau times (log n) to the d
regular for at least a 1 over 2 to the d fraction of the restrictions.
And to do this we use some new tool, what we call the critical index of a degree-d
polynomial, which essentially quantifies how fast the influences decrease. This is a bit
technical to give you on the board in one sentence. This is the main structural result.
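As a rough illustration, in the degree-1 (LTF) case, with the weights sorted so that |w_1| >= |w_2| >= ..., one common way to define it is the following (my recollection of the definition; for higher degree one uses the influences Inf_i(p) in place of the w_i^2):

```latex
% the \tau-critical index of w is the least j \ge 0 such that the remaining tail is \tau-regular:
|w_{j+1}| \;\le\; \tau \cdot \Bigl(\sum_{i > j} w_i^2\Bigr)^{1/2},
% i.e. after fixing the j most influential variables, no remaining variable carries a large share
% of the remaining weight.
```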
So based on this, okay, we can just do a case analysis and get the upper bound
on the average sensitivity.
Okay. I think that's roughly it. Any questions?
>>: In the random restriction, do we know what order -- what order for the variables?
>> Ilias Diakonikolas: Yes. So I order the variables according to their influence, in decreasing
order of influence. And then, you know, roughly, I randomly restrict a subset of the most influential
ones. And, you know, for most of the restrictions, what I get will be relatively regular. What is
difficult here? What is difficult is that when I restrict variables, the influences change. But because
the polynomial has low degree, they don't change by a lot.
That's rough intuition. That's not exact -- I'm lying a bit here. There is a case where this cannot be
done, but it can be handled.
Okay. So this sort of summarizes this work, upper bounding the average sensitivity of
degree-d PTFs. And I guess I mentioned the applications to learning. Now, a strengthening of
these results, a strengthening of this average sensitivity result, yields some kind of regularity
lemma for PTFs, which I think is useful probabilistically, but it's also crucial for the second
result which I'm going to talk about, which is pseudorandomness for PTFs. So basically the result,
what I tell you, is this. This is what we proved with Poisso Logan and Miank [phonetic]:
for any n-variable degree-d PTF we can restrict a small set of variables such that
for a constant fraction of restrictions we get a regular PTF, where the regularity parameter is worse
than tau; it's like tau times polylog n. Note that there's a dependence on n in this bound, and
this is not good enough for some of the applications, in particular for the pseudorandomness
applications. However, we can strengthen it with a more careful analysis and get something like
this.
So we can restrict a set of d over tau many variables, okay, with no dependence on n, and get for a
constant fraction of restrictions a regularity of tau times (log 1 over tau) to the d. Actually, it's
not, you know, easy to go from here to here. But it can be done.
Okay. So based on this stronger version of the structural lemma we can get a regularity lemma for
degree-d PTFs, which essentially says what? That any degree-d PTF can be decomposed as a
decision tree of this depth, only a function of tau, okay, and such that essentially most of the
leaves of this tree are regular degree-d PTFs. So this is the statement. I should explain it a
bit because I think it's important for the rest. So what do we do? We start from a degree-d PTF and
we carefully, you know, we carefully restrict variables. So we get a decision tree. We can do this
for every boolean function. The important thing is that degree-d PTFs have some structure.
This allows the depth of the tree to be a constant, only a function of tau, and the leaves, which
correspond to the restrictions of the variables on the corresponding path, are regular.
So this essentially allows us to reduce, you know, several questions on degree-d PTFs from the
general case to the regular case, so this is essentially a reduction.
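In one sentence, my paraphrase of the statement:

```latex
% Regularity lemma for PTFs (paraphrase): every degree-d PTF f = sign(p) can be computed by a
% decision tree, querying single variables, whose depth is a function of d and \tau only
% (independent of n), such that for all but a \tau fraction of random root-to-leaf paths the
% restricted function at the leaf is either \tau-close to a constant or a \tau-regular degree-d PTF.
```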
Yes.
>>: Do you construct the decision tree with the greedy algorithm, each time choosing the variable
that has the highest influence and restricting that?
>> Ilias Diakonikolas: We don't construct it exactly like this. We construct -- we construct it
recursively. We want this statement to be true with high probability. However, the previous
lemma here holds with constant probability, 1 over 2 to the d.
So unfortunately what we have to do is apply the lemma recursively; for the good leaves we are
okay, and for the others we recurse until the probability becomes very high. So it's not -- we don't
take the most influential variables and do it just once. We do it many times. So when we
apply this lemma at a leaf, the influences have changed.
>>: What I'm asking is just to compare this decision tree to the one where you first choose the
most influential variable, fix it and recompute the influences, get the next most influential variable
and so on.
So that's also a way to make a decision tree, which would give you a new function. I'm asking how
that would perform compared to your decision tree.
>> Ilias Diakonikolas: Yeah, I don't -- I'm not exactly sure how that would perform. I really
think in the worst case this would not work, because essentially here we crucially use the fact that
we restrict a specifically defined set of variables in every step.
I will discuss this more individually with you if you're interested. I can tell you more how this
works. So this lemma here allows us to do several things. So one thing is -- this is like a picture.
So one thing it allows us to do is get low-weight approximators for degree-d PTFs. We can
approximate every degree-d PTF over the uniform distribution by another degree-d PTF whose
weight is some function of epsilon times n to the d. This dependence on n,
the n to the d, is optimal for this problem.
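Stated a bit more explicitly (my paraphrase of the slide; W(d, epsilon) is just shorthand for the weight bound):

```latex
% Low-weight approximation (paraphrase): for every degree-d PTF f and every \epsilon > 0 there is a
% degree-d PTF f' = \mathrm{sign}(q), with q having integer coefficients of total magnitude at most
% W(d,\epsilon)\cdot n^{d}, such that
\Pr_{x \sim \{-1,1\}^n}\bigl[f(x) \ne f'(x)\bigr] \;\le\; \epsilon,
% and the n^d dependence cannot be improved.
```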
Now, this problem is motivated by learning, because low-weight threshold functions are nice for
several reasons. For example, algorithms like Perceptron and many heuristics work better,
and actually have provable guarantees, for low-weight PTFs as opposed to general
ones.
Okay. And now let me move to the last part of my talk, which is about pseudorandom generators
against PTFs. In particular, I'll show you something that I think might be interesting for you,
seeing as you're doing probability. I'll show you a derandomization of the central limit theorem via
bounded independence, at least a special case of the theorem.
So actually let me go straight to this and then come back. Okay. So here is a version of the
Berry-Esseen theorem: I have a linear combination of independent Bernoulli random variables, and I
normalize the coefficients so that the sum of the squares is 1. If every weight is at most epsilon
in absolute value, then I know that the Kolmogorov distance between the CDF of X, which is this
linear combination of the Bernoullis, and the CDF of the Gaussian is at most epsilon.
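For reference, one standard form of the statement being derandomized (my phrasing):

```latex
% Y_1,\dots,Y_n independent uniform \pm 1, weights normalized so that \sum_i w_i^2 = 1 and
% \max_i |w_i| \le \epsilon; write X = \sum_i w_i Y_i and \Phi for the standard normal CDF.  Then
\sup_{t \in \mathbb{R}} \bigl|\Pr[X \le t] - \Phi(t)\bigr| \;\le\; O(\epsilon).
```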
Okay. This is the Berry-Esseen theorem. Now, a special case of what I will show you
says that this theorem remains true if we only assume that the bits in the linear
combination of Bernoullis have sufficient independence, 1 over epsilon squared independence. So
for any joint distribution on the Y_i's that has this amount of
independence, the Berry-Esseen theorem still holds. I don't know, I hope this is not known, and
I believe it's not known in probability. Actually, I think a very special case of this was given by
Benjamini, Gurel-Gurevich, who is here, and Peled. That's right, a very special case of this was
given in your paper.
But this is more general. So let me go back now and tell you how this central limit theorem
is related to linear threshold functions. The motivation for me is:
how important is randomness in computation, and do we need randomness?
In general, randomness is very useful, but actually getting perfect randomness is very hard.
So many times we have faulty randomness, or we want to be able to say something
about the running time or the performance of randomized algorithms,
assuming the randomness is faulty. There's an entire area in the field of computation called
derandomization. I care about a specific subclass of this theory, in particular what's the power
of bounded independence against natural classes of functions, and in particular
against threshold functions.
Okay. So a distribution is called k-wise independent if its projection onto any subset of k
of the variables is uniform.
There are many explicit constructions of such
distributions with support of size n to the k, and this is optimal. So this corresponds to a number
of random bits of k log n, as opposed to the n bits that the uniform distribution requires.
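As a small illustration (my own sketch, not from the talk slides) of how few random bits such distributions need, here is the classic construction for k = 2: from m truly random bits, the XORs over all nonempty subsets give 2^m - 1 pairwise independent uniform bits, so the seed length is logarithmic in the number of output bits.

```python
# Pairwise (2-wise) independent bits from a short seed: output bit S is the XOR of the
# seed bits indexed by the nonempty subset S.  Any two distinct outputs are jointly uniform.
from itertools import product, combinations

m = 3
subsets = [S for r in range(1, m + 1) for S in combinations(range(m), r)]

def outputs(seed):
    # XOR (sum mod 2) of the seed bits indexed by each nonempty subset
    return [sum(seed[j] for j in S) % 2 for S in subsets]

# brute-force check of pairwise independence over the uniform seed
for i in range(len(subsets)):
    for j in range(i + 1, len(subsets)):
        counts = {}
        for seed in product([0, 1], repeat=m):
            y = outputs(seed)
            counts[(y[i], y[j])] = counts.get((y[i], y[j]), 0) + 1
        # each of the 4 value pairs should appear equally often
        assert all(c == 2 ** m // 4 for c in counts.values()), (i, j)

print("all", len(subsets), "output bits are pairwise independent")
```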
And the theorem we proved recently with Gopalan, Jaiswal, Servedio and Viola is that any distribution
on the cube that has sufficient independence, independence at least roughly 1 over epsilon squared,
fools the class of linear threshold functions, meaning that if we take the expectation of any
such function under this distribution D, which is k-wise independent, and the expectation of this
function under the uniform distribution, then these are epsilon-close.
So this class of functions cannot distinguish, up to epsilon, the fully uniform distribution from a
k-wise independent one if k is large enough. And note that k is completely independent of n here,
which is something that we would expect for [indiscernible] in this case.
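In symbols (my paraphrase, with the tilde hiding the log factors mentioned next):

```latex
% Bounded independence fools halfspaces (paraphrase): if D is any k-wise independent distribution
% on \{-1,1\}^n with k = \tilde{O}(1/\epsilon^2), then for every linear threshold function f,
\bigl|\,\mathbb{E}_{x \sim D}[f(x)] \;-\; \mathbb{E}_{x \sim U}[f(x)]\,\bigr| \;\le\; \epsilon,
% with k independent of n.
```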
So this theorem is optimal up to the log squared factor. In particular, something stronger is known:
even for the majority function one essentially needs 1 over epsilon squared independence; this was
proved by Benjamini, Gurel-Gurevich and Peled.
From this theorem, I claim, we immediately get the derandomization of the central limit theorem
I mentioned. Because what is a linear threshold function? It's the sign of a linear
form. So the CDF of this random variable X at the point t is exactly the probability that the
corresponding halfspace evaluates to minus 1.
So this statement about fooling halfspaces is equivalent to the derandomization of the central limit
theorem.
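Concretely, with X the normalized linear combination from before and ignoring ties at the boundary:

```latex
\Pr[X \le t] \;=\; \Pr\Bigl[\mathrm{sign}\Bigl(\textstyle\sum_i w_i Y_i - t\Bigr) = -1\Bigr],
% so fooling the halfspace \mathrm{sign}(w \cdot y - t) for every t is the same as matching the CDF
% of X pointwise up to \epsilon, i.e. the derandomized Berry-Esseen statement above.
```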
>>: Does this convergence in distribution have a variant for densities?
>> Ilias Diakonikolas: For the densities? No.
>>: [indiscernible].
>> Ilias Diakonikolas: The thing is that with k-wise independence we have so much smaller support.
The density won't be uniform everywhere; it will be zero outside the support. For densities, I don't
know what to say.
Okay. Now, I guess I hope this is interesting for you. It is definitely interesting in
derandomization, because it gives the first explicit pseudorandom generator for this class of
function. This has been an open problem for a while. Previously only special cases were known.
And there are two main open questions. One question is what happens for degree greater
than 1: does constant independence suffice, meaning only a function of d and epsilon? That's
the main open question from this. And also, you know, in the context of
derandomization, it would actually be interesting to get a generator with support polynomial in n
and 1 over epsilon. The k-wise independent distribution has support n to the k, and therefore here
we fool halfspaces with support n to the 1 over epsilon squared. It would be interesting to actually
get support polynomial in n and 1 over epsilon, too.
And I mean this should be possible; if it's not possible, then BPP is not equal to P, and in
complexity theory it's a standard conjecture that they are equal.
Okay. So let me briefly describe some recent progress I have on this question, on the first
question, with fellow students Daniel Kane and Jelani Nelson. So Daniel is at
Harvard and Jelani is at MIT.
Part of this work happened at IBM Almaden last summer. So we were actually able to
prove that bounded independence suffices for the case of degree-two PTFs, in particular with the
same kind of dependence, polynomial in 1 over epsilon, and we actually introduced some interesting
analytic techniques in this paper that I think will have other applications.
So degree two might seem a bit specialized, and in fact we don't know how to
prove the statement for general degree, but what we do know how to
prove is this: for any degree d, to prove that bounded independence suffices to fool degree-d
PTFs, it suffices to prove the statement for regular degree-d PTFs. So, because of the regularity
lemma I described before, the general question can be reduced to the regular
case. And in fact the most challenging part of this derandomization problem is actually to solve
the regular case; it's essentially as hard as the general case.
Okay. And in order to be able to do this, I mean, there is a standard way to approach this
question. You need to construct sandwiching polynomials for these functions, in a technical
sense.
Okay. I don't know if you're familiar with this condition, but it's a standard duality argument.
To prove that a boolean function F is fooled by k-wise independence, it suffices, and is in fact
necessary, to find two polynomials Q upper and Q lower that have small degree, degree at most
k, that sandwich F from above and below on the domain, on the cube, and such that the
expectation of their difference under the uniform distribution is at most epsilon.
So the one direction of this equivalence, so here I guess I should say if and only if, is
straightforward; it follows just by taking expectations. The other direction uses LP duality. We
don't even need the other direction for the proof; we just need the straightforward one. Okay.
So the main challenge for all these problems is to be able to construct these low-degree
polynomials with these properties for degree-d PTFs.
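The easy direction, written out (this is the standard argument; the key point is that a k-wise independent distribution matches the uniform distribution on every polynomial of degree at most k):

```latex
% suppose Q_\ell \le F \le Q_u pointwise on \{-1,1\}^n, with \deg(Q_\ell), \deg(Q_u) \le k and
% \mathbb{E}_U[Q_u - Q_\ell] \le \epsilon.  Then for any k-wise independent D,
\mathbb{E}_D[F] \;\le\; \mathbb{E}_D[Q_u] \;=\; \mathbb{E}_U[Q_u] \;\le\; \mathbb{E}_U[F] + \epsilon,
% and symmetrically \mathbb{E}_D[F] \ge \mathbb{E}_U[F] - \epsilon, so D fools F to within \epsilon.
```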
Okay. And in particular, for the d equals 1 case, the case of linear threshold functions,
we do this: we construct an approximation to the univariate sign function. The sandwiching
polynomials are n-variable, degree-k polynomials, and I want to construct them somehow. So I do this
using symmetrization. So I construct a good approximation to the univariate sign function under
the Gaussian distribution, and then I just plug in w.x. For the regular case, I know w.x
behaves approximately like a Gaussian. So this essentially reduces the problem from n
dimensions to 1 dimension, and it actually works in the linear case.
I mean, there are many [indiscernible] to be worked out, and we use approximation theory to
construct these univariate approximations to the sign function. But this is the rough idea.
Now for d equals 2, this univariate approach does not work. We cannot just construct univariate
approximations to the sign function; approximating any function of one variable does not suffice,
and we need to do much more. We need to treat the problem in higher dimensions, and this essentially
is what we do in the paper with Kane and Nelson.
This I think gives you like a very rough idea of the proof and let me give you some developments
afterward. And, you know, this paper was essentially published, made available, in February. So
since then there have been a bunch of results on pseudorandomness for linear
threshold functions and higher-degree threshold functions.
In particular, Meka and Zuckerman were able to actually construct PRGs, pseudorandom
generators, for degree-d PTFs. However, the generators are not based on bounded independence;
they use some other specialized distributions. So the problem of whether bounded independence
fools higher degrees is still open, and I think it's quite challenging. Gopalan, O'Donnell, Wu and
Zuckerman generalized our results for the linear case to product distributions. For example,
they get a more general derandomization of the Berry-Esseen theorem, where the x_i's are
not necessarily plus-minus-one Bernoulli but belong to some product distribution with some
general moment assumptions.
Harsha, Klivans and Meka have results on intersections of halfspaces; actually the paper with Kane
and Nelson, my paper, had results for such functions, too. And finally [indiscernible]
were able to say something about a special case of degree-d PTFs, and I don't know much
about the result.
Okay. Again, so I think the message is, again, that in this case, too, what we do is we first solve
the regular case and then we reduce the general case to the regular case, so this is also an
example of the recipe for approaching problems about low-degree PTFs.
Do I have time? Five minutes, okay. So I guess let me give you a very rough idea of how
this univariate polynomial for the sign function works. Again, the context is that I want to fool
linear threshold functions, and this is equivalent to constructing some univariate polynomial that's
a good approximation to the sign function under the Gaussian distribution. So how do I do this and
what are the properties of this function? So imagine, imagine the sign function on the real line,
with the Gaussian distribution on it.
So I want my polynomial to look sort of like this. Okay. So what are the properties of the
polynomial? I have a decomposition of the line into regions. First of all, this is the standard
picture; this is what's important. So the first thing is that in the region where the Gaussian
distribution puts most of its mass, the error between the sign function and my approximation is
small, at most epsilon. Okay. I can guarantee this.
The second is that in the region close to the origin, where the sign function is discontinuous, you
know, the error is at most a constant. This is not that hard. However, I have to make -- I have to
make this region narrow enough so that it doesn't have lots of probability mass. But this is not a
problem, because I know the Gaussian distribution is anti-concentrated: it puts
mass at most order epsilon in any interval of length epsilon. And therefore if I make this
interval narrow, I'm okay. And the third region is the region where the polynomial diverges and
the error between the polynomial and the sign function is big. However, for that region I'm also
okay, because since the polynomial has low degree it cannot increase very fast, and I know that
the decay of the Gaussian tail is very fast. So these two things, you know, can be balanced.
So essentially the contribution to the expectation from this region is also small.
Okay. And these are the properties of the polynomial. The way to construct it is using some
combination of theorems from approximation theory.
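Roughly, the error accounting over the three regions looks like this (my paraphrase; delta and T are the widths of the middle and bulk regions):

```latex
% P is the univariate approximator to \mathrm{sign}, and g \sim N(0,1):
\mathbb{E}\bigl[\,|P(g) - \mathrm{sign}(g)|\,\bigr]
  \;\le\; O(1)\cdot\Pr[\,|g| \le \delta\,]
  \;+\; \epsilon \cdot \Pr[\,\delta \le |g| \le T\,]
  \;+\; \mathbb{E}\bigl[(|P(g)| + 1)\,\mathbf{1}_{\{|g| > T\}}\bigr],
% the first term is O(\delta) since the Gaussian density is bounded, the middle term is at most
% \epsilon, and the last term is small because a low-degree polynomial grows only polynomially
% while the Gaussian tail decays like e^{-T^2/2}.
```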
[applause]
>>: I guess the point of k-wise independence is that the first k moments are as if the bits were
completely independent. So in your setting you get something just by matching moments, and you're
plugging into whatever is known about how close you are to normal when the first k moments match
those of a normal. So I guess I would ask to what extent you are doing something different, whether
you're getting a better result than what you get just by looking at the moments.
>> Ilias Diakonikolas: Yeah, I think so. I think with the approach of just trying to use the moments
[indiscernible], I don't know how to make this work for the general case.
>> Eyal Lubetzky: Thank the speaker.
[applause]