>> Yuval Peres: We're delighted to have Daniel Kane here with us to tell us about the
FT-Mollification Method.
>> Daniel Kane: Hello. I'm here to talk about the FT-Mollification Method, which is a method
that Jelani and I, and maybe a couple of others, have recently been developing to deal with a
number of problems involving polynomial approximation and k-independent random variables
and such.
If you saw Jelani's talk yesterday, he probably discussed some of these techniques and their
application to one problem.
So, well, let's get started. So first a brief outline of what I'll be talking about today. I'll start with
an introduction where I'll describe some of the basic concepts that we'll be dealing with and what
sorts of problems we're dealing with. I'll then talk about how we can use this method to show
that k-independence fools linear threshold functions. I'll then talk about some ways of
generalizing this technique and how to deal with fooling several linear threshold functions at the
same time. I'll then move on to fooling degree 2 polynomial threshold functions and move on to
the conclusion.
So to start off with sort of one of the basic problems that we deal with using this method are
things involving k-independence. So we define a collection of random variables, X1 through
XN, to be k-independent if any collection of K of those random variables are independent of
each other.
And why should we care? Well, importantly for computer science k-independent families are
easy to produce. You can create small seeds that produce k-independent families which is good
for doing de-randomization or producing hash functions or something.
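As a concrete illustration of how cheap these are to produce, here is a minimal sketch (not taken from the talk; the function name and the particular prime are just for illustration) of one standard construction: the seed is the k coefficients of a random polynomial over a prime field, and its values at n distinct points are k-wise independent because any k point-value pairs determine the polynomial uniquely.

import random

def k_wise_sample(n, k, p=2**31 - 1):
    # Seed: k random coefficients of a degree-(k-1) polynomial over Z_p.
    coeffs = [random.randrange(p) for _ in range(k)]

    def poly(x):
        # Evaluate the polynomial at x by Horner's rule, mod p.
        acc = 0
        for c in coeffs:
            acc = (acc * x + c) % p
        return acc

    # Any k of these n values are mutually independent and uniform on {0, ..., p-1}.
    return [poly(i) for i in range(n)]

print(k_wise_sample(10, 4))  # 10 values, any 4 of which are independent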
And, furthermore, even though you can produce them with much lower entropy than fully
independent families of random variables, k-independent families often behave like fully
independent families. So hopefully if something were going to hold for a fully random family, it
will also hold for a k-independent family if K is large enough.
So in what types of ways might we want our k-independent random variable to behave like fully
random variables? Well, one of the standard things we might want to do is fooling various
functions. So this means we have some function and we'd like the expectation of that function to
be approximately correct.
And we'll get into what types of functions later. But in general -- or in this talk at least we'll be
talking about sort of polynomial threshold functions. You apply some polynomial to the
variables and then see whether it's positive or negative. And if you have some algorithm where
whether it works or not depends on whether some polynomial in the variables, which perhaps
you don't know ahead of time, is positive or negative, then being able to fool your polynomial
threshold function means that your probability of success is maintained.
So to talk a little more about what we mean by fooling functions, let's introduce some
terminology. So Y here will be Y1 through YN. This is our collection of fully independent
random variables. And then we'll let X be an arbitrary k-independent family of random variables
such that each of the XIs is distributed in the same way as the corresponding YI.
And there's some function F that we want to fool, which is a function from RN to R. And we'd
like to be able to show that the expectation of F of X is roughly the expectation of F of Y. So
we've got Y -- all the YI have been fixed where we want to know that for any X that's
k-independent with the correct distributions on each variable, the expectation is close to what it
should be.
And one thing to note here is that if F is a polynomial of total degree at most K, then these
expectations in fact are going to be exactly the same, because your polynomial of degree at most K
is a sum of monomials of degree at most K, and each of those has its expectation determined
by k-independence.
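Spelled out: any term of a degree-at-most-K polynomial is a monomial in at most K distinct variables, and for such a monomial k-independence gives
\[
\mathbb{E}\Bigl[\prod_{j=1}^{m} X_{i_j}^{a_j}\Bigr]
\;=\; \prod_{j=1}^{m} \mathbb{E}\bigl[X_{i_j}^{a_j}\bigr]
\;=\; \prod_{j=1}^{m} \mathbb{E}\bigl[Y_{i_j}^{a_j}\bigr]
\;=\; \mathbb{E}\Bigl[\prod_{j=1}^{m} Y_{i_j}^{a_j}\Bigr],
\qquad m \le K,
\]
so the two expectations agree term by term.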
And, in fact, as we'll see a little bit later, the problems of showing something's fooled by
k-independence and approximating by degree K polynomials are very closely related problems.
Moving on to a little bit of notation. We say that k-independence epsilon fools F if for all such
k-independent X the absolute value of the difference between the expectation of F of X and the
expectation of F of Y is at most epsilon.
>>: [inaudible] normalization of FX?
>> Daniel Kane: Yeah. This is assuming that the X is -- yeah. So, I mean, really I should add to
this terminology like with respect to Y, and then also assume that the XIs are distributed in the
same way as the Ys.
And we'll use the notation A approximately-equals-sub-epsilon B just to say that the absolute
value of the difference is O of epsilon.
And really, when we're talking about epsilon fooling something, we'll generally replace that
with O of epsilon fooling, because if you're only off by a constant factor, you can just make epsilon
smaller.
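In symbols, the two pieces of notation being used are, roughly,
\[
\bigl|\mathbb{E}[F(X)] - \mathbb{E}[F(Y)]\bigr| \;\le\; \epsilon \quad\text{(k-independence epsilon-fools } F\text{)},
\qquad
A \approx_{\epsilon} B \;\iff\; |A - B| \;=\; O(\epsilon).
\]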
Okay. So one thing you can look at is let's look at this from a very theoretical level. Let's
suppose that the YI are Bernoulli random variables, they're plus or minus 1 with equal
probability. Suppose that we've got some fixed arbitrary function F. For the moment we don't
even care what it is. And we'd just like to see how far off k-independence can make things.
So, for example, what is the biggest possible expectation that we could have for F of X where X
is k-independent and, again, the individual variables are the same as for the Ys.
Well, it's not that hard. Let's -- for every string A of possible values we'll let P sub-A be
the probability that we get that particular string, which is always nonnegative. K-independence says
that if we fix any K of the entries and sum these probabilities over all strings with those fixed
K entries, we get 2 to the minus K. And subject to these conditions, we want to maximize
the expectation. So we want to maximize the sum over A of P sub-A times F of A.
And this is just a linear program which, as a linear program, it's somewhat complicated and
perhaps not particularly insightful, but we can look at the dual program to this, which actually is
a lot easier.
And if you work out all the details, the dual program is the following. What we want to do is we
want to find G, which is a polynomial of degree at most K. Or, in general, if your variables take
more than two values, so if each of your coordinates has more than two values, we want it to be a
sum of functions that each depend on at most K coordinates. And we furthermore want G of A
to be bigger than or equal to F of A for all A, and we want to minimize the expectation of G of
Y.
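Roughly, the two programs look like this, writing b for an assignment to a set S of K coordinates:
\[
\text{(primal)}\quad \max \sum_{A \in \{-1,1\}^N} P_A\,F(A)
\quad\text{s.t.}\quad P_A \ge 0,\qquad
\sum_{A \,:\, A|_S = b} P_A = 2^{-K} \;\text{ for every } |S| = K \text{ and every } b \in \{-1,1\}^{K};
\]
\[
\text{(dual)}\quad \min \;\mathbb{E}[G(Y)]
\quad\text{s.t.}\quad G(A) \ge F(A) \text{ for all } A, \qquad \deg G \le K.
\]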
And it's easy to see that this gives us an upper bound, because the expectation of G of Y is equal
to the expectation of G of X, because it's a sum of these -- I mean, it's a degree K polynomial.
And because G is everywhere bigger than F, this is bigger than the expectation of F of X.
And so, again, fooling F with k-independence is very similar to saying that F can be well
approximated by degree K polynomial.
Okay. So from here on we're going to let Y be a standard Gaussian. In a lot of the papers where
we've worked on this, we've actually wanted it to work for Bernoulli random variables. And, in fact,
almost everything I'll say here will generalize to Bernoulli random variables, for a few reasons.
One, we have the same sort of moment bounds for polynomials of Bernoulli random variables as
the ones I'm using for Gaussian random variables.
For Gaussian random variables I also use a number of anti-concentration results that, you know,
the probability that your Gaussian is in some small range is small. But as it turns out, if you have
a polynomial with low influence, meaning no one coordinate has a major impact on the final
value, the invariance principle says it behaves more or less like a Gaussian, like the same
polynomial of Gaussians. And, hence, you will still have anti-concentration.
And if you have high-influence random variables, there are a number of regularity lemmas that
say basically after fixing some of the high-influence random variables, you can reduce to the
low-influence case. So you use some of your extra independence to fix a bunch of the
high-influence variables. And then for the rest of our independence you're just dealing with a
low-influence polynomial.
Okay. So let's stick with -- so a lot of these techniques are a lot easier if you can sort of reduce
the dimensionality of your problem. As I've said, a lot of the idea involves approximating some
function by a degree K polynomial. But approximating functions in many, many, many
dimensions by high-degree polynomials can become very, very complicated. So it's a lot easier
if you can write your function as some function in just a few polynomials in your original
variables.
So for simplicity let's assume that, you know -- this is now what I'm calling my variable Z, which
could be either X or Y, so that F of Z is some function little F applied to W.Z. So it's just -- it
depends on some linear function in Z.
And now the idea is instead of approximating big F by a polynomial, let's just approximate little
F by a polynomial and apply this approximation to W.Z.
Now, if little F is smooth, there's a very convenient way to do this. Because if you have a
smooth function, it has a Taylor series, and not all the fonts are showing up the way they should.
I forgot this. I really should have remembered my laptop.
Anyway, yeah, this is the Taylor series. And there's a Taylor error term. Sorry. Throughout this
talk there are going to be a number of poorly fonted things.
Anyway, but you've got an error that looks like the absolute value of Z to the K over K factorial times
the size of Kth derivative. And so if K is even, that error is in fact a polynomial, so your
k-independence is going to determine its size. And if we want to look at the moments of W.Z,
which is our variable little Z here, well, we've got the expectation of W.X to the K times the size
of the Kth derivative over K factorial. That's our Taylor error.
And from the hypercontractive inequality, which gives you bounds for moments of a polynomial
or in general just knowing that a linear polynomial in Gaussians is just a Gaussian, this
turns out to be -- the size of this is roughly the size of W to the K times K to the minus K over 2
times the size of the Kth derivative of F.
So if F has nice bounds on its higher-order derivatives, you can approximate by your Taylor
series and get nice bounds on the expectation of the Taylor error.
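Written out, with T sub K-1 the degree K minus 1 Taylor polynomial of little f at 0, the bound being used is roughly
\[
\Bigl|\mathbb{E}\bigl[f(w\cdot Z)\bigr] - \mathbb{E}\bigl[T_{K-1}(w\cdot Z)\bigr]\Bigr|
\;\le\; \frac{\bigl\|f^{(K)}\bigr\|_\infty}{K!}\;\mathbb{E}\bigl|w\cdot Z\bigr|^{K}
\;=\; \bigl\|f^{(K)}\bigr\|_\infty \cdot O\!\Bigl(\frac{\|w\|}{\sqrt{K}}\Bigr)^{\!K},
\]
using \(\mathbb{E}|w\cdot Z|^{K} = O(\|w\|\sqrt{K})^{K}\) from the Gaussian (or hypercontractive) moment bound, together with \(K! \ge (K/e)^{K}\).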
So unfortunately most of the functions that we're going to care about are not smooth. In
particular I mentioned these polynomial threshold functions. Well, the simplest one is a linear
threshold function. So here we just want to know whether -- we take an indicator function of
W.Z, we get 1 if W.Z is bigger than some constant A and 0 otherwise.
So the problem here is that little F is not a smooth function. It's got this discontinuity at A. It
doesn't have a -- well, we could make a Taylor series somewhere, but it's not going to converge
very well.
So what do we do? Well, the idea is we're going to replace F by a smooth approximation F tilde.
And since we're trying to prove our -- since we're trying to prove our approximation for F and
not F tilde, we've got to do a number of steps to actually show that it works.
But basically what we're going to do is we're going to try and prove the following chain of
approximations. We start with F of Y and then show -- and then approximate by F tilde of Y and
approximate that by F tilde of X, and finally by F of X.
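So the chain we're after is, roughly,
\[
\mathbb{E}[F(Y)] \;\approx_{\epsilon}\; \mathbb{E}[\tilde F(Y)] \;\approx_{\epsilon}\; \mathbb{E}[\tilde F(X)] \;\approx_{\epsilon}\; \mathbb{E}[F(X)].
\]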
And so for each of these steps we use slightly different methods. For the first step, F of Y and F
tilde of Y, since F and F tilde approximate each other, that's basically where it comes from. But
unfortunately since we're approximating a discontinuous function by a continuous function, it
has to have an error somewhere near the discontinuity.
So what we need is to know that the probability that W.Y is near that discontinuity is small. So
we need some sort of anti-concentration result for W.Y.
For the second thing we've got F tilde of Y and F tilde of X. But F tilde was smooth. We do
what I said before, we approximate by a Taylor series and just show that the expectation of the
Taylor error in either case is small and note that k-independence determines the expectation of
the actual Taylor polynomial. So we use polynomial approximation, we use moment bounds.
And for the last step it's very similar to the first step. But now we need anti-concentration of
W.X instead of W.Y, which is harder because Y is fully independent. In fact, in this case, W.Y,
we know it's a Gaussian. We've got very good anti-concentration results for that. W.X we really
don't know what it looks like. But we'll use some of these techniques to be able to derive this
anti-concentration from our anti-concentration results for W.Y. And I'll get into that a little bit
more later.
Okay. So, on the other hand, so this is a great plan. The problem is how do we do our
smoothing, how do we replace F by F tilde. Well, one of the oldest techniques for smoothing
functions -- for approximating by smooth functions is this idea of mollification. What we'll do is
we'll let F tilde be F convolved with some smooth function rho.
And, well, the idea is to show this thing is smooth. We know that if we want to take the
derivative of a convolution, we can take the derivative of either side. So the Kth derivative of F
tilde is just F convolved with the Kth derivative of rho. So if rho is smooth, this thing is smooth.
If rho has good bounds on its high-order derivatives, this thing has good bounds on its high-order
derivatives. Great.
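In symbols, the mollification and the derivative identity being used here are, roughly,
\[
\tilde F \;=\; F * \rho, \qquad
\tilde F^{(K)} \;=\; F * \rho^{(K)}, \qquad
\bigl\|\tilde F^{(K)}\bigr\|_\infty \;\le\; \|F\|_\infty \,\bigl\|\rho^{(K)}\bigr\|_1,
\]
so bounds on the derivatives of rho translate directly into bounds on the derivatives of F tilde.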
So what we need is we need bounds on the derivative of rho so we can get bounds on the
derivatives of F tilde. And we'd like these bounds to be strong. Because, remember, our Taylor
error depends on the size of the Kth derivative of F. And as we'll see slightly later, if our bounds
aren't strong enough in some cases, it really makes this method not work at all.
Also, though, we want F tilde to approximate F. And so in the mollification context, the way
you do this is you want rho to be sort of what people call a smooth approximation to the delta
function. You want its integral over all space to be 1, so you should -- so convolving with rho is
sort of averaging it with nearby stuff. And, furthermore, we want it to decay rapidly. We want
the size of rho to be very small when X is large. So convolving with rho is sort of averaging
your value with a bunch of nearby values. So unless F is discontinuous at that point, you're
averaging with nearby values which are all close to your original value, and you're in good
shape.
Okay. So in the way we're going to construct rho for this application, so, I mean, we wanted to
have those properties for approximation, in particular decay at infinity, and we also wanted it
to have very tight bounds on its high-order derivatives.
So to construct this function, we're going to use the Fourier transform, which is where the name
of this method comes from. But before we do this, let's briefly review some basic properties of
the Fourier transform.
Importantly, the Fourier transform interchanges differentiation with multiplication by X. The
Fourier transform of the derivative is X times the Fourier transform and vice versa.
And what this means is that it interchanges smoothness with decay at infinity. Because we say a
function is very smooth if you can differentiate it a lot of times and it remains small. And, on the
other hand, we say a function decays rapidly at infinity if you can multiply it by X a lot and it
stays small.
So since the Fourier transform interchanges differentiation with multiplication by X, it interchanges
smoothness with decay at infinity.
So what we want rho to be is we want it to be very smooth. We want it to have sharp bounds in
its high-order derivatives, and we also want it to decay at infinity. So if we want to think of rho
as the Fourier transform of something, we want it to be the Fourier transform of something that's
very decayed at infinity and also smooth.
So for that decay at infinity, we're going to make that something compactly supported -- rho is
going to be the Fourier transform of a compactly supported function, because really you can't
decay at infinity faster than that.
So rho is the Fourier transform of a compactly supported function so, for example, rho could be
the Fourier transform of the following smooth compactly supported function on minus 1, 1.
So what do we know? We know that the size of the Kth derivative of rho in this case is O of 1 because if you
want to compute the Kth derivative, what you do is you take the Fourier transform and multiply
by X to the K. X to the K is bounded by 1 on minus 1, 1. Then you take your inverse Fourier
transform so you're bounded. And, remember, this O of 1 doesn't depend on either K or X.
Next we want decay at infinity. And we just -- and to be really weak about this, for every N we
know that it decays faster than X to the minus N. Because we know that when we multiply by X
to the N that's the same as taking Nth derivative of the Fourier transform. The Fourier transform
is still bounded after taking an Nth derivative. So taking an inverse Fourier transform back,
there's some absolute bound that is bigger than rho of X times X to the N.
Furthermore, because I made the inverse Fourier transform of rho to be 1 at the origin, this
means that the integral of rho over the entire line is 1. Just something we wanted.
Furthermore, well, so we can't quite just use this rho, because this rho, it does have decay at
infinity, but it's still got some finite width. And if we always convolved with this rho, then we
always have errors within some finite region of our discontinuity, and since there's some finite
probability that we'll land in that region, this technique will never show that we can get errors
smaller than some constant.
So we need to be able to scale it to make it very, very concentrated. So we let rho sub-C be C
times rho of CX. So this scrunches it by a factor of C and blows it up by a factor of C.
Now the size of the Kth derivative is O of C to the K, the decay at infinity is now
correspondingly more rapid, and its integral over the entire line is still 1.
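Written out, the rescaling and its effect are just
\[
\rho_C(x) \;=\; C\,\rho(Cx), \qquad
\rho_C^{(K)}(x) \;=\; C^{K+1}\,\rho^{(K)}(Cx), \qquad
\bigl\|\rho_C^{(K)}\bigr\|_1 \;=\; C^{K}\,\bigl\|\rho^{(K)}\bigr\|_1, \qquad
\int \rho_C \;=\; \int \rho \;=\; 1.
\]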
Okay. So now we can create F tilde. So let's write F tilde sub-C as our indicator function from A
to infinity convolved with rho sub-C.
So what are the properties of this? Well, its Kth derivative is O of C to the K. And this holds
everywhere. That's good. We want bounds on the error between F and F tilde. On the one hand
this should be at most 1 because F is bounded by 1 and we're convolving with something. And,
furthermore, this is at most (C times the distance to the discontinuity) to the minus N. And this is
basically because of the rate at which rho decays.
Yeah. So polynomial approximation -- my fonts are doing wonky things. Great. Anyway, so
what this should say is, okay, we want to show here that the absolute value of the expectation of
F tilde of Y minus the expectation of F tilde of X is small. So our bounds: we get the size of the Kth
derivative, which is C to the K, and then we get the size of the Kth moment, which is size of W
to the K times K to the K over 2, and we're dividing that by a K factorial, so it's K to the minus K
over 2.
Putting this all together, we've got something that's O of C times the size of W over square root
of K all to the K.
So we've got something to the K. The something has a root K in the denominator. So as long as
the something is small and the K is at least log 1 over epsilon, we're in good shape. So we need
K to be at least something like (2C times the size of W) squared. And we also need it to be at least log 1
over epsilon.
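Putting numbers on it, the Taylor error is roughly
\[
\frac{C^{K}\,\|w\|^{K}\,K^{K/2}}{K!}
\;=\; O\!\Bigl(\frac{C\,\|w\|}{\sqrt{K}}\Bigr)^{\!K}
\;\le\; 2^{-K} \;\le\; \epsilon
\qquad\text{once}\quad K \gtrsim (C\,\|w\|)^{2} \;\text{ and }\; K \ge \log_2(1/\epsilon).
\]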
Next step. We want the anti-concentration. We need that the expectation of F of Y minus the
expectation of F tilde of Y is small. So we're going to use the obvious approximation that's
bounded by the expectation of the absolute value of the difference. And we note that F minus
F tilde is small except when we're near A. And, furthermore, W.Y is anti-concentrated.
So let's remember F minus F tilde is big O -- now I'll just do it -- you know, it's at most either O of 1
or O of (C times (X minus A)) to the minus 2. And ick, so --
>>: [inaudible].
>> Daniel Kane: Sure. So, I mean, the idea here is we're going to write out a geometric series.
We're going to say that if our error is at most -- let's see. If it's at most 2C inverse, then our error
is at most 1. So we've got at most the probability that W.Y minus A less than 2C inverse.
So we've got at most 1 if it's in that range. Now, if it's within 4C inverse, between 2C inverse
and 4C inverse, I should say, our error is at most a quarter. So this should be plus a quarter times
the probability that our absolute value minus A is less than 4C inverse, and then plus 1/16 of the
same thing with 8C inverse and so on. Because, remember, we go down like the square of our
distance.
And now remembering our anti-concentration results, this probability is going to be something
like 2C inverse over size of W, I think. And this is going to be like 4C inverse over size of W.
And this is going to be 8C inverse over size of W.
But, remember, this is being multiplied by 1 and a quarter and a 16th. So these things are going
down by powers of 4. These things are going up by powers of 2. This geometric series
converges. And in the end we get O of 1 over C times the size of W.
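Spelled out, the geometric series being summed is roughly the following, using the Gaussian anti-concentration bound \(\Pr[|W\cdot Y - A| < t] = O(t/\|w\|)\):
\[
\mathbb{E}\bigl|F - \tilde F\bigr|(W\cdot Y)
\;\lesssim\; \sum_{j \ge 1} 4^{-(j-1)}\,\Pr\bigl[\,|W\cdot Y - A| < 2^{j} C^{-1}\bigr]
\;\lesssim\; \sum_{j \ge 1} 4^{-(j-1)}\cdot \frac{2^{j} C^{-1}}{\|w\|}
\;=\; O\!\Bigl(\frac{1}{C\,\|w\|}\Bigr).
\]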
Okay. So the thing to get out of this is that 1 over (C times the size of W) needs to be much,
much less than epsilon, basically, in order for this term to be reasonable.
Okay. So that's that. So next we need to perform an anti-concentration bound for W.X. And,
again, like this is basically the same argument. But now we need anti-concentration for W.X.
So again we have an error that is the same gibberish in this font, but should be exactly this
summation only now all the Ys are replaced with Xs. So what -- we have these bounds for Y, we
know that these probabilities that were within those ranges are good for Y but we need them for
X.
So really what we want to do is we want to show that k-independence forces this function G of Z
which is in the indicator function that were between A minus B and A plus B of W.Z to not have
too large an expectation.
Now, you might think that we haven't gotten anywhere from this because we started by saying
let's try and fool this function, and we've reduced it to the problem of let's try to fool this other
function. But the point is the original problem we had some function and we wanted to show
that we're within epsilon. Here we don't necessarily want to show that we're within epsilon; we
just want to show that we're not too much bigger. We want to show that this probability isn't
more than, say, 8 times what it's supposed to be.
And for that what we can do is we can go through the same mess as we just did, but we're going
to make sure that G is strictly less than G tilde everywhere. So this means that sort of when
we're running our first steps, yeah, sure, the expectation of G of Y is near the expectation of G
tilde of Y, and that's near the expectation of G tilde of X. But, hey, that's strictly bigger than the
expectation of G of X. So we'll be good when we do it this way.
So what we're going to do is we're going to obtain G tilde. Instead of mollifying G, we're going
to mollify something like twice the indicator function of twice the interval. So that, you know,
since we're much bigger and go much further out, it's not hard to cook this up in such a way that
our G tilde strictly upper bounds G. But, I mean, the actual expectation of twice this doubly
large indicator function should only be about four times as big as the thing we're trying to bound
in the first place. So we'll be in good shape.
Okay. So, for fooling linear threshold functions, we have the error from approximating F by F
tilde, which is roughly 1 over (C times the size of W). So we need C to be at least 1 over (epsilon
times the size of W). And to have a small Taylor error, we need K to be at least (2C times the size
of W) squared or so. So K we can get away with being O of epsilon to the minus 2.
So what you need to epsilon fool a linear threshold function is O of epsilon to
the minus 2 independence.
Okay. So that's sort of the first application. Now I'm going to talk about a few tricks that we can
use to make this work out more conveniently. And I'll get back to another problem we can solve
with this a little later. But one of these things is -- so in sort of that last step of the inequality we
wanted a convolution so that G tilde was strictly bigger than G. And to make that work it's
actually convenient to have our rho be a strictly positive function.
Because this means when we do our convolution our F tilde is always between the inf of F and
the sup of F. A lot of the analysis is made easier by doing this. And in particular it guarantees
that our G tilde is strictly bigger than 0 and doesn't accidentally get below 0 near the boundaries.
Okay. So, I mean, how do we do this? Well, we can let rho be the square of the absolute value of
the Fourier transform of, again, a compactly supported function B. We can normalize by letting the
integral be -- well, the integral of rho is just the L2 norm of B squared, by Plancherel, and we're
going to set that equal to 1. Note that that's -- yeah.
So the problem now is, for the Fourier transform of a compactly supported function we knew how
to bound the derivatives. How do we bound the higher derivatives here? Well, again, this is a mess, so let's
use the board.
But the point is that our Kth derivative of rho, well, we could write it as we want to take the Kth
derivative of B hat times -- sorry, B hat times B hat bar. And, well, you can take -- you
differentiate both of these, so you end up with a sum of K choose J.
And now, I mean, for one thing, in this case I'm actually going to bound the L1 norm of rho here,
which is actually, if you think about it, what you want to do, because you're going to be
convolving rho with something that, say, has a bounded L infinity norm and you want to sort of
bound the size of the error. So you really often want to get the L1 norm of this.
And so now we've got a bunch of products of the Jth derivative of B hat and B hat bar --
>>: [inaudible].
>> Daniel Kane: Yes. Of the (K minus J)th derivative. Sorry.
And so the point is we've got sort of an inner product of these. And really, I mean, these should
be like we should be taking the L2 norm of this times the L2 norm of this and using
Cauchy-Schwarz. And so now instead of taking derivatives of the Fourier transform -- for one
thing, the L2 norm is preserved by Fourier transforms, so instead of taking derivatives of this guy
we're going to be taking the Fourier transform and multiplying by X.
So what we've got is sum of K choose J, and now we have X to the J B, L2 times X to the K
minus JB L2. And the point is that if B is compactly supported, multiplying by X doesn't
multiply it by more than however big your support is. And so we get this sum of K choose J.
And this thing looks like the L2 norm of B squared times the size of a support of B to the K.
And, well, this sum of binomial coefficients is 2 to the K, and the L2 norm of B squared is 1 by our
initial normalization. So what we get is at most (2 times the size of the support of B) all to the K.
And by the size of the support here I mean sort of the biggest absolute value of any
point in the support.
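Written out, with s the biggest absolute value of a point in the support of B, with the L2 norm of B equal to 1, and ignoring the 2-pi normalization constants in the Fourier transform, the computation is roughly
\[
\bigl\|\rho^{(K)}\bigr\|_1
\;=\; \Bigl\|\sum_{j=0}^{K}\binom{K}{j}\,\hat B^{(j)}\,\overline{\hat B}^{\,(K-j)}\Bigr\|_1
\;\le\; \sum_{j=0}^{K}\binom{K}{j}\,\bigl\|\hat B^{(j)}\bigr\|_2\,\bigl\|\hat B^{(K-j)}\bigr\|_2
\;=\; \sum_{j=0}^{K}\binom{K}{j}\,\bigl\|x^{j} B\bigr\|_2\,\bigl\|x^{K-j} B\bigr\|_2
\;\le\; 2^{K}\,s^{K}\,\|B\|_2^{2}
\;=\; (2s)^{K},
\]
using Cauchy-Schwarz and the fact that the Fourier transform preserves L2 norms.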
So we get a nice bound here based on the size of the support. Okay. And so another thing that
we're going to want to do is we're going to want to do multidimensional mollification. So
for linear threshold functions, we got kind of lucky. We could write the function that we
cared about as a function of a single polynomial in our actual coordinates. And so we only need
to work with sort of a one-dimensional function that we needed to smooth out.
But in general we're not always going to be so lucky, so we're going to want to have a rho that's a
certain function of several variables. So one thing that we found out that works very well for a
lot of purposes is we're going to use the same strategy as before that I just described for a
positive convolving function. We're going to let B of R be an appropriate normalizing constant
times (1 minus the L2 norm of R squared) for the norm of R less than 1, and 0 outside that.
Note that this is no longer actually a smooth function, but its decay at infinity will be good
enough as we'll get to shortly. And this is a mess. But basically what we're going to do is we're
going to bound the -- we're going to bound the integral of X squared times rho of X DX.
And I guess I'll not get into -- fine. I'll do this very briefly. This is a sum over I of XI squared times
the absolute value of the Fourier transform of B, squared. Note that like this is something squared that we're
integrating. It's an L2 norm. We can interchange this XI with taking the derivative. We do
some computations, dot, dot, dot. This is equal to O of M squared where M is the dimension.
And what this means is, I mean, for one thing, since rho is positive and has integral 1, you can
think of it sort of as a probability distribution. And this is the expectation of X squared. And
since that's finite, it means that we're somewhat concentrated. And in particular -- this isn't
showing up, but what it means is that our integral over all of space where the absolute value of X
is bigger than R of rho of X, DX is going to be at most M squared over R squared.
And that's -- that bound will turn out to be basically good enough for everything that we want to
do. If you wanted -- if you needed tighter bounds, you could of course make B more smooth. It
would require being a little bit more clever perhaps.
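Written out, with m the dimension and again ignoring Fourier normalization constants, the computation and the tail bound it gives are roughly
\[
\int \|x\|^{2}\,\rho(x)\,dx
\;=\; \sum_{i=1}^{m} \bigl\|x_i\,\hat B\bigr\|_2^{2}
\;=\; \sum_{i=1}^{m} \bigl\|\partial_i B\bigr\|_2^{2}
\;=\; O(m^{2}),
\qquad\text{so}\qquad
\int_{\|x\| \ge R} \rho(x)\,dx \;\le\; \frac{1}{R^{2}}\int \|x\|^{2}\,\rho(x)\,dx \;=\; O\!\Bigl(\frac{m^{2}}{R^{2}}\Bigr),
\]
which is just Markov's inequality applied to rho viewed as a probability density.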
Also, from the bounds that we had before, the Kth derivative, and in fact the Kth directional
derivative in any particular direction, at least a unit vector direction, is at most 2 to the K. And
we'll use that later.
Okay. So fooling intersections of half spaces. We now have enough to deal with this. So now
instead of F being a single linear threshold function, it's going to be a product of M different
linear threshold functions. So we want to control which side of several different half spaces we
are in at the same time.
So now instead of depending on just one W.X it depends on a bunch of WI.Xs. Let's for
convenience let the B sub-Is be an orthonormal basis for the span of the WIs. And now we have some
function L of X whose coordinates are just the BI.Xs. And our capital F is just some function little F of L of X.
So, again, this is sort of the same idea, but now instead of big F being just some
function of W.X, it's now some function of this M dimensional function of X.
So now we're going to use multivariate FT-Mollification. The basic plan is the same. We
convolve little F with the rho or a scaled version of this rho to get F tilde, and we prove this
chain of approximations. Yeah. We get F tilde by now convolving with C to the M times rho of
C times X.
Okay. So anti-concentration, again, it's -- the math is being wonky. Okay. So we need a bound
on how far apart F is from F tilde. So let's suppose that X is a distance of at least D from any of
the hyperplanes. So this is after you take L of X, the point that you end up with is distance D
from any of the hyperplanes at which we have these discontinuities. Then the claim is that F of
X minus F tilde of X is at most O of (M over CD) squared.
And the reason for this is that, well, since within a distance D we're on sort of the same half of all
of the hyperplanes. If we integrate sort of -- if we're integrating against rho on that sphere of
distance D, we're just integrating against exactly the value that we want. And all of the error
comes from integrating against stuff outside that range.
But we already had a bound for what we got with integrating stuff outside a range. And the C
comes, of course, from the fact that we rescaled rho by a factor of C. So it would have been just
M squared over D squared, but we rescaled rho, so it's M squared over (CD) squared.
And so what this means is that basically our error is roughly the probability that X is within M
over C of some hyperplane. Because, I mean, if D is roughly M over C, then this is order of 1.
But it falls off quadratically as we get into -- as we get through -- as D becomes multiples of that.
And as we sort of saw before, this falling off quadratically means that sort of we get this
geometric series of terms that we don't care about, so --
>>: [inaudible] subspace is determinant function? Is it supreme [inaudible]?
>> Daniel Kane: Something like the product of the characteristic functions.
>>: [inaudible].
>> Daniel Kane: Okay. So our error is roughly the probability of being within M over C of some hyperplane.
There are M hyperplanes. And, remember, we actually dealt with the BIs, which were unit
vectors. So this error probability is going to be O of M squared over C.
So that's the error we get from our anti-concentration considerations. I won't do the same
thing for X, because the trick is the same. Fine. We need C to be whatever it needs to be.
So the next step is polynomial approximation. Now, for this step, again, we take F tilde, we
approximate by its degree K minus 1 Taylor polynomial at 0. Now since we're in several
dimensions we need to know what we mean by this. And in this case I just mean we take
everything up to degree K minus 1 in all combinations of directions.
Again we've got this problem of what's our Taylor error here. And one trick for doing this, we
can look at the Taylor error along the line from 0 to wherever we happen to land.
So, I mean, in particular we restrict our function to that line, and it turns out that this Taylor
series restricted to that line is the Taylor series of the function restricted to that line. So
we just look at the Taylor error along this line. We get it's the size of the Kth derivative along
that line times the size of L of X to the K over K factorial. The size of the Kth derivative was at
most (2C) to the K.
The expectation of the size of L of X to the K, well, in each coordinate it's just a Gaussian. So each
coordinate would contribute like K to the K over 2. There are M coordinates. So that's like M to
the K over 2 times K to the K over 2.
And so putting this all together, the error is like (4M C squared over K) to the K over 2. So as
long as K is bigger than, you know, MC squared, or with C as big as it needs to be, M to the
5th times epsilon to the minus 2, we'll be in good shape. Because then we'll be taking this small thing
to the K, and K is pretty big.
On the other hand, we did a kind of lousy approximation here. We approximated the size of L to
the K as just M to the K over 2 times the bound on each of the coordinates. And if the
coordinates were somehow highly correlated, this would be an okay bound. For example, if they
were all the same. But they're not. In fact, since we picked the orthonormal basis to define L, if
X were actually a random Gaussian, these coordinates would be completely independent of each
other. So we should expect the bound to be much smaller.
And, in fact, there's this bound on moments of a quadratic form of a Gaussian. And, again, I'll
have to write this down to make it convenient. So what you have is you've got Q a quadratic
form. And we're looking at the expectation of the absolute value of Q of X to the K.
And there is a sort of -- and so the theorem is that this bound is O of the expectation of Q of X.
So there's sort of a bunch of terms here. Plus the Frobenius norm of Q -- this is the square root of
the sum of the squares of the eigenvalues -- times root K, plus the infinity norm of Q -- this is the
operator norm, the size of the biggest eigenvalue -- times K, all to the K.
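Written out, the moment bound being quoted is roughly
\[
\mathbb{E}\bigl|Q(X)\bigr|^{K}
\;\le\; O\Bigl(\bigl|\mathbb{E}[Q(X)]\bigr| \;+\; \sqrt{K}\,\|Q\|_{F} \;+\; K\,\|Q\|_{\infty}\Bigr)^{K},
\]
where X is a standard Gaussian, the Frobenius norm is the square root of the sum of the squares of the eigenvalues, and the infinity norm is the largest eigenvalue in absolute value.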
So the way I like to think about this, there's sort of three different ways that Q could behave.
One, it could -- you could take Q of X to be like the sum of the squares. So -- sorry. Firstly, Q,
you've got a quadratic form. You can always diagonalize it. When you diagonalize, basically
your quadratic form is just a sum of chi-squared random variables that are independent of each
other times certain weights.
So on the one hand you have the expectation of Q. I mean, if you've got like a bunch of
chi-squared random variables with really small weight that you're adding together, it basically
looks like a constant. So your Kth moment is going to look like that constant to the K.
The second term, well, Q2, the Frobenius norm, this is -- on the other hand, if you've got no
constant term, if, say, your weights all add up to 0, then you've got a sum of independent
chi-squared random variables. They've got no expectation. But since you've got a sum of a
bunch of different things that are all independent of each other, you expect it to be a Gaussian.
And Q2 turns out to be what you expect the variance of this Gaussian to be. And so if you're
approximating this giant sum of independent things as a Gaussian, that's what the Kth moment
should be.
But clearly that's not always accurate. Suppose that you just had one chi-squared random
variable. Clearly its Kth moment is not going to look like the Kth moment of a Gaussian. It's
going to look like the 2 Kth moment of a Gaussian.
So this last term is you just take the biggest eigenvalue, the biggest chi-squared that fits inside of
that and take the Kth moment of that. So that's where this comes from. But it can give us a
better bound on the expectation of L of X to the K since L of X squared is a quadratic form. Its
expectation is M. Its Frobenius norm is root M. And its L infinity norm is 1. So we get the
following for our bound. And plugging this in, we can actually get away with K being O of M to
the 4th times epsilon to the minus 2. So this quadratic form moment bound saves a factor of M.
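Concretely, since the BIs are orthonormal,
\[
\|L(X)\|^{2} \;=\; \sum_{i=1}^{M} (B_i \cdot X)^{2},
\qquad
\mathbb{E}\,\|L(X)\|^{K}
\;=\; \mathbb{E}\Bigl[\bigl(\|L(X)\|^{2}\bigr)^{K/2}\Bigr]
\;\le\; O\bigl(M + \sqrt{K M} + K\bigr)^{K/2},
\]
which for K at least M is O(K) to the K over 2 rather than the cruder (MK) to the K over 2, and that is where the factor of M gets saved.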
Okay. So, again, here's one more technique, and then I'll get into degree 2 polynomial threshold
functions. So one thing that we want to do is we want to build an F tilde that (a) approximates F
and (b) has very sharp bounds on its derivatives.
Now, there's a problem here, though. There's sort of a fundamental limitation: F is typically
going to be bounded. And so let's say we want F tilde bounded and to have very strong bounds on its
high-order derivatives. The problem is the derivatives are never going to decay faster than
exponentially, unless it's like a constant or something. And one way to see this is just look at the
Fourier transform of F, which it should have because it's bounded.
And if you look at the Fourier transform, then it's going to have some amount of width, then you
take derivatives of F, you're going to be multiplying the Fourier transform by X to the K. But
since the Fourier transform has some support outside of 0, unless it's a polynomial, this means
that multiplying it by X to the K will at least multiply that part of it by however far you are from
0 to the K. You're not going to be able to decay faster than exponentially. And this is a problem.
On the other hand, you can actually -- if you only care about the value of F on positive numbers,
you can do a little bit better. So, for example, if F tilde only needs to be bounded on positive
numbers, you've got things like cosine of square root of X.
Now, since cosine of X has terms, you know, X to the 2N over (2N) factorial, this thing has
terms X to the N over (2N) factorial. So it's got much better bounds on its higher-order derivatives
than you would expect.
On the other hand, it's only bounded on positive numbers. On negative numbers, it looks like cosh
of root X and blows up exponentially.
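Written out, the point is roughly that
\[
\cos\sqrt{x} \;=\; \sum_{n \ge 0} \frac{(-1)^{n}\,x^{n}}{(2n)!} \quad (x \ge 0),
\qquad
\cos\sqrt{x} \;=\; \cosh\sqrt{|x|} \quad (x < 0),
\]
so on the positive reals the Taylor coefficients fall off like 1 over (2n) factorial rather than 1 over n factorial, at the price of exponential blowup on the negative reals.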
But if somehow we were taking F and it only ever needed to be evaluated on positive numbers,
the fact that it's a really terrible approximation on negative numbers doesn't matter to us.
So in general if we have an F that we want to approximate by F tilde but only care about F tilde
on the positive numbers, well, we can write G of X is F of X squared. So this is sort of a
function that is symmetric and only sees the positive values of F.
Now we get F tilde by taking G, convolving it with our rho, and then evaluating it back at square
root of X. And it turns out that this thing will have derivatives that decay, like instead of C to the
K, they'll decay like C over K to the K, which is a lot better.
And, furthermore, this F tilde approximates F now on the positive reals. Though, again, you
can't evaluate these things on negative reals because you'll get big exponentials and it will be
horrible.
Okay. So using this we can fool degree 2 polynomial threshold functions. So what we want to
do is we let F of X be this indicator function 0 to infinity of P of X where now P is a degree 2
polynomial.
Note that we cannot now naively use the FT-Mollification Method. So, again, we've got a mess,
but if we're trying to approximate F by a smooth thing -- suppose even F were smooth and now
we're going to use our Taylor error, well, our Taylor error looks like the size of the Kth derivative of
F times the Kth moment of P over K factorial.
Now, the size of the Kth derivative, in the case we're looking at, looks like C to the K. But now the
Kth moment is the Kth moment of a degree 2 polynomial. So instead of looking like K to
the K over 2, it looks like, you know, (the size of P times K) to the K. Again, over K
factorial. The K to the K cancels the K factorial, and now we get something that looks like (C
times the size of P) to the K.
But this means that if C is big relative to the size of P, this actually won't get any better as we
increase K. In fact, the errors will get worse as we increase K. And since we can't do that, we'll
never be able to make C big enough, so we'll never be able to convolve with something that's
concentrated enough, and it just won't work.
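Written out, the obstruction is roughly
\[
\frac{\bigl\|\tilde f^{(K)}\bigr\|_\infty}{K!}\;\mathbb{E}\bigl|P(Y)\bigr|^{K}
\;\lesssim\; \frac{C^{K}\,\bigl(\|P\|_{2}\,K\bigr)^{K}}{K!}
\;=\; O\bigl(C\,\|P\|_{2}\bigr)^{K},
\]
which, once C times the L2 norm of P is at least a constant, no longer shrinks as K grows, so taking more independence doesn't help.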
On the other hand, there's a way to get around this. I mean, because -- I mean, this thing says
that we're going to run into problems with moment -- with threshold functions of, say, you know,
a Gaussian square. But clearly a Gaussian square we can just get threshold functions of the
Gaussian, and it would be okay.
So we want some sort of decomposition. So if it's got sort of a few big terms that look like the
square of a Gaussian, you want to just see where those Gaussians lie.
So we're going to decompose P into a couple of functions. So first off we're going to pick a
normalization. Let's assume that the L2 norm of P is 1. This is just the -- the expectation of P
squared should be 1.
What we're going to do first, we're going to write P of X as a quadratic form plus a linear form
plus a constant. And, furthermore, we're going to decompose the quadratic form. We're going to
decompose it so we can always diagonalize our quadratic form, and we're going to split sort of
the eigenvalues into three classes. Q plus takes all the eigenvalues that are bigger than some
delta. Q minus takes all the eigenvalues that are less than minus delta. And Q0 takes all the
eigenvalues that are in between.
And the idea of this decomposition is that both Q plus and Q minus are positive definite. So
since we only ever need to evaluate them on positive numbers, this trick about functions that we
only ever need to evaluate on positive numbers is usable.
Q0, on the other hand, has a small L infinity norm. And so if you remember the
quadratic form moment bound, it's got things that look fine, that look like a Gaussian, unless
your L infinity norm is big. But since the L infinity norm is small, we should be okay.
So formally what do we do? We define M of X to be -- we've got four coordinates. We've got
square root of Q plus, square root of Q minus, Q0 minus the trace of Q0, just so that this
term doesn't get in our way. And then we've got the linear term. And we let little F be
equal to this indicator function applied to W squared minus X squared plus Y plus Z plus the
trace of Q0 plus C. And big F is just little F of M of X.
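Written out, with the convention that Q minus collects the eigenvalues below minus delta with their signs flipped (so both Q plus and Q minus are positive semidefinite), and with l and c the linear and constant parts of P, this is roughly
\[
M(x) \;=\; \Bigl(\sqrt{Q_{+}(x)},\;\; \sqrt{Q_{-}(x)},\;\; Q_{0}(x) - \operatorname{tr}(Q_{0}),\;\; \ell(x)\Bigr),
\]
\[
f(w, x, y, z) \;=\; \mathbf{1}_{[0,\infty)}\bigl(w^{2} - x^{2} + y + z + \operatorname{tr}(Q_{0}) + c\bigr),
\qquad F \;=\; f \circ M,
\]
so that the argument of the indicator is exactly P.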
So what we do is we take little F and we mollify it, and let's see what we get. So the first thing is
we've got anti-concentration to deal with. F minus F tilde is small unless M of X is within about
C inverse of the boundary between its 0 and its 1 regions.
And things are a little bit difficult here, because beforehand we just knew that we were okay
unless the value of a polynomial were close to the boundary.
On the other hand, we can move M of X a little bit. We can move it C inverse. How much can
we change the like effective value of the thing we're taking an indicator function of. Because,
remember, we were taking an indicator function of a W squared minus X squared plus some
linear stuff.
And this means if we move by C inverse we can move the value of this thing that we're taking
our indicator of by something like, you know, W times C inverse plus X times C inverse plus C
inverse plus some stuff.
So in order to -- in order for us to not sort of automatically be very close to the boundary, we
need that the first two coordinates are small. Fortunately, as it turns out, size of M1 and size of
M2, it's not hard to show are about delta to the minus one-half, O of delta to the minus one-half, with
high probability.
And this is just because since they only take large eigenvalues, there can't be too many of those
eigenvalues, and so the total expectation can't be too big.
And from that we have that F minus F tilde is going to be small unless P of X is within about
delta to the minus a half times C inverse of 0. And, well, we've got Gaussian anti-concentration.
In fact, there's an anti-concentration result for polynomials of Gaussians. And given the
normalization of P, our error is going to be about O of delta to the minus a quarter times C to the
minus a half.
Next, polynomial approximation. So our error is again the size of the Kth derivative times the
size of M to the K, the expectation of that over K factorial. Kth derivative is like O of C to the
K. For size of M we've only got four coordinates. Let's bound them each separately.
So the moments of M1 and M2 are like the K over 2 moment of a quadratic form, and it's not
hard to see that that's -- so K to the 1/2 to the K plus delta to the minus 1/2 to the K. These are
just coming from -- the delta to the minus 1/2 is coming from the fact that it's got a big
expectation perhaps, and the K to the 1/2 is just from bounds for quadratics from the
hypercontractive inequality.
So for M3 this is the expectation of (Q0 minus its trace) to the K. Here we use this moment bound. So that's
either the Frobenius norm, which is at most 1, times K to the 1/2, all to the K, or
the other term, which comes from the operator norm, which is at most delta, times K, all to the K.
The last one we have is M4, which is the linear term, so that's like O of root K to the
K. So we get a total error: we've got those four terms, all to the K.
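Collecting these, the per-coordinate moment bounds being used are roughly
\[
\mathbb{E}|M_{1}|^{K},\;\mathbb{E}|M_{2}|^{K} \;\le\; O\bigl(\sqrt{K} + \delta^{-1/2}\bigr)^{K},
\qquad
\mathbb{E}|M_{3}|^{K} \;\le\; O\bigl(\sqrt{K} + \delta K\bigr)^{K},
\qquad
\mathbb{E}|M_{4}|^{K} \;\le\; O\bigl(\sqrt{K}\bigr)^{K},
\]
so the total Taylor error is, up to constants,
\[
\frac{C^{K}}{K!}\;\mathbb{E}\,\|M(X)\|^{K}
\;\lesssim\; O\!\Bigl(\frac{C}{\sqrt{K}} \;+\; \frac{C\,\delta^{-1/2}}{K} \;+\; C\,\delta\Bigr)^{\!K}.
\]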
So putting it all together, what do we need to be true for our total error to be at most epsilon?
Well, we need each of those terms that we saw in the last thing to all be smaller than 1. So we've
got something much smaller than 1 to the K. We want K to be at least log 1 over epsilon, so
we've got, say, 1/2 to the K, which is actually small.
And then we need the delta to the minus 1/4 times C to the minus 1/2 that we got from our
anti-concentration bounds. That should also be much smaller than epsilon. But we can get away
with C being roughly epsilon to the minus 4 and delta being epsilon to the 4 and K being epsilon
to the minus 8.
And if you twiddle with the exact parameters, you can make all the things that are supposed to be
less than 1 actually less than 1. And so you can get -- so you can show here that epsilon to the
minus 8 or so independence, epsilon fools degree 2 polynomial threshold functions.
So there's actually a new result that was more new when I originally gave this talk; that you can
actually fool degree D polynomial threshold functions. Unfortunately, you need K to be
something like epsilon to the minus 2 to the O of D independence.
But what's the basic idea here? Now instead of moment bounds for degree 2 polynomials like
this, you need moment bounds for arbitrary degree polynomials. And basically there's a result
that says that for an arbitrary degree polynomial, it's got the moment bounds you'd expect from
sort of a Gaussian with its variance unless it's got some big component that looks like a product
of two other functions.
So what type of things have high moment bounds? You've got high moments if you've got
like -- if you've got like a single Gaussian to a large power, then that has very large moments. Or
one Gaussian times another Gaussian or something of this form.
But this gives us a natural decomposition. So we have a structure theorem. Basically you've got
your polynomial, it's got small moments unless it's got the component that looks like a product of
the two other things.
And in that case you just make those two other things sort of new variables. You write in terms
of your old thing minus those and these new things. And you sort of -- you sort of then
decompose these recursively. You get some big number of things that you're writing your
original polynomial in terms of, but fortunately all of these things actually have small moments.
Then you use the FT-Mollification Method in some gigantic number of variables, and you have to be
a little bit clever about the fact that in some directions you need it to be more concentrated than in
others, but those are details, and it works.
So further work for this technique. We could probably use a better multidimensional version for
the function that we use to mollify with. The one that we use seems to work, but it's not clear in
any sense whether or not it's optimal.
Also, I mean, we're always looking for new applications of this technique. We have a few. It'd
be nice to have more.
Here are some references to papers that were used in the development of this technique. That's
all I had to say. Are there any questions?
[applause].
>>: Epsilon to the minus 2 to the D, do you suspect that can be improved?
>> Daniel Kane: I suspect it can be improved. In fact, I mean, I think that the actual answer is it
should be like D squared times epsilon to the minus 2. But I have no idea how to prove anything
approximating that.
>>: [inaudible] on the same topic and I am totally ignorant [inaudible].
>> Daniel Kane: Let's see. And I don't know -- I don't think there was anyone who had fooled
degree 2 polynomial threshold functions before we came along. I think there were several
people who had sort of other -- so there had been lots of people who had written papers that
show you could fool linear threshold functions. And sort of a lot of them, instead of using this
framework, they just went and used sort of standard results from approximation theory.
And, I mean, I think that sort of in doing this we've discovered that some of this -- a lot of this
technique is sort of rediscovered from what people did back in the day for approximating by
polynomials. But for a lot of this stuff it really helps to know the inside of your machine and be
able to actually use it so that you can get this thing done in multiple dimensions and figure out
exactly where your errors are and exactly what your bounds and your derivatives look like.
>> Yuval Peres: Any other questions? So I actually have a few other potential applications, but
I'll discuss them with you later.
Let's thank Daniel again.
[applause]