>> Yuval Peres: Okay. I'm delighted to have Jelani Nelson from MIT with us
today. The work he'll describe on moment estimation and data streams is
already quite well known. I've heard a lot indirectly about it. So I'm very happy
we can hear it from the source. Please.
>> Jelani Nelson: I'm Jelani. This is joint work with Daniel Kane and David
Woodruff. And I'm going to start off just by defining the model for this problem.
So we have an N dimensional vector X. Starts off as the zero vector, and you
should think of N as being really, really large, like maybe I'm indexing the vector
by source destination IP pairs or something.
And I have a really long sequence of updates coming in a data stream. And
each update is a pair. It's an index together with an amount. And it says add
that amount to the corresponding index of the vector.
And the amounts are all bounded precision integers. And the goal in this talk is
to output the Pth frequency moment of the vector X where P is given up front.
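A minimal sketch of this turnstile model in plain Python (the toy stream and names are illustrative, not from the talk):

```python
# The vector x starts as the zero vector; each stream update (i, v)
# says "add amount v to coordinate i". The goal is the P-th frequency
# moment F_P = sum_i |x_i|^P, computed here naively for reference.
n = 8
x = [0] * n

stream = [(3, 1), (5, 2), (3, 1), (0, -1)]  # (index, amount) pairs
for i, v in stream:
    x[i] += v

def frequency_moment(x, p):
    return sum(abs(xi) ** p for xi in x)

print(frequency_moment(x, 2))  # here: 1 + 4 + 4 = 9
```

Of course this stores all of x; the whole point of the talk is to approximate F_P in far less space.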
So I imagine like --
>>: Updating this vector. So N is not the time?
>> Jelani Nelson: N is not the time. Here's an example: I have network traffic.
Every packet has a destination and source IP. And X is indexed by these
things. Every time I see a packet I update the corresponding entry by one, and
now the vector tells me, you know, X sub IJ is how many packets went from
person I to J.
Now, is the model okay? So moment estimation gets used like in a black box
way and a bunch of other problems. So one, for example, is empirical entropy
estimation. So you imagine again this IP example, if it's indexed just by
destination IP, you know like the entropy tells me how skewed the distribution is.
>>: So I just want to -- maybe it's good for us to focus on what the most
important example is. Was it this IP one?
>> Jelani Nelson: Yeah, so from now on think of it being indexed by destination
IPs and think of V always being 1. And X sub I is just how many packets were
sent to IP address I.
So then let's say, like in this application, you know the empirical entropy of this
vector now tells me something about how concentrated like the distribution is of
where packets are going. And if it's too concentrated, for example, that could be
a signal for a denial of service attack.
So, I mean, so this thing -- actually they use empirical entropy estimation to do
these things. And the best known algorithm uses moment estimation as a black
box; this is an algorithm I did with Nick Harvey and Krzysztof Onak.
So I want to talk about, I guess, how to solve the problem. So we give an
optimal algorithm for this problem and that's what I'm going to focus on.
There are two natural objectives in developing an algorithm.
One is I want to minimize the amount of memory I use in my algorithm as I'm
processing the data stream. So if you think of N as being really large, the length
of the vector being really large you don't want to have to remember the entire
vector in memory.
And I also want to minimize update time. I don't want to have to take too long to
process every single packet or something.
So there are two easy solutions that use a lot of space but have low update time.
One is remember the whole stream. And then at the end of the stream calculate
the answer, or there's also keep the whole vector in memory.
Remember, little m is the stream length, and each update amount is bounded in
magnitude by capital M.
So this is just how many bits you need to represent a single frequency.
>>: When you say keep the stream in memory, you calculate it afterwards?
>> Jelani Nelson: Yeah.
>>: Is there something clever we do other than get that out of memory?
>> Jelani Nelson: You mean this one or this one?
>>: The second one.
>> Jelani Nelson: I mean, so if you don't care about the reporting time at the
very end, you could keep the whole stream in memory and loop over 1 to n
and sum up all the frequencies.
>>: Okay.
>> Jelani Nelson: You wouldn't use n memory. You would use little m. Is that
what you're asking?
>>: I mean, there's just -- what are we doing after we see the stream?
>> Jelani Nelson: Or you could sort by the index and then just like run through it
once and accumulate.
>>: Your goal is just to compute the moments of the final vector?
>> Jelani Nelson: Yes.
>>: So why would you want to keep more than X?
>>: Maybe N is much larger than M.
>> Jelani Nelson: Yeah. You might also want like, for some of these like
network monitoring applications, you want to like ask what is the moment now,
what is the moment now. Periodically throughout the stream. So it's not
necessarily always at the end of everything. But, yeah, M could be much less
than N which is why this could be much better than that.
>>: The example of --
>> Jelani Nelson: P is something that's very near 1. But not quite 1. And the
reason is -- I can explain why this even makes some kind of sense.
The idea is just like -- so there are these other entropies besides Shannon
entropy. There's Tsallis entropy and Rényi entropy, defined in terms of
moments. Just by L'Hôpital's rule, they converge to Shannon entropy in the
limit as P goes to 1. Let me write down what their definitions are.
Define H sub P to be 1 minus F sub P, over P minus 1. Okay. Where this is
F sub P of the normalized vector. So the frequencies become the
probabilities.
Then this thing, the limit as P goes to 1 of HP is equal to Shannon entropy. So
this is the connection between moments and entropy.
But if you just black box it, if you use the fact that this converges to Shannon
entropy, you're not going to get such a great algorithm, but there are some tricks
you can use where you basically calculate lots of HPs and then interpolate and
then evaluate some interpolated polynomial at 1.
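Written out, the definition the speaker puts on the board (this is the Tsallis form; the Rényi version uses a logarithm but has the same limit), with $q = x/\|x\|_1$ the normalized frequency vector:

```latex
H_P(x) = \frac{1 - F_P(q)}{P - 1}, \qquad F_P(q) = \sum_{i=1}^{n} q_i^{P},
\qquad \lim_{P \to 1} H_P(x) = -\sum_{i=1}^{n} q_i \ln q_i ,
```

where the limit, by L'Hôpital's rule, is exactly the Shannon entropy.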
That's a really short summary of that. Okay. So I'm going to focus in this talk on
minimizing space usage. Okay? And so there's one bit of bad news: the first
work that studied this problem, by Alon, Matias and Szegedy, showed you
cannot output the exact answer deterministically in sublinear space. So you
need to allow for slack in two ways. One is multiplicative error,
and the other is you have to settle for randomized algorithm that has some
constant probability of failure.
So this is our new goal, just out of necessity, if we want a sublinear space
algorithm. And then even more bad news is that polynomial space is required
for large P.
And so I'm going to focus -- in this talk I'm going to focus on P between 0 and 2
and it's a real number. So this is not just three values.
And P is not 0. So greater than 0. So let me tell you what was known before this
work.
>>: That work also had positive results?
>> Jelani Nelson: Alon, Matias and Szegedy, yes. They have a positive result.
When P equals 2 they had an algorithm which actually we show is optimal.
When P is greater than, I think, 5 they also had some like -- I'm sorry, when P is
greater than 2 they also had an algorithm which has been improved upon since
then. But they had positive results.
Okay. So first let me just recall. This is our notation here. So previously the
lower bound -- so this is all the number of bits of space you need to solve the
problem, where you want a 1 plus-or-minus epsilon approximation with, let's
say, 99 percent probability.
Okay. Over the randomness used by the algorithm. So David Woodruff showed
a 1 over epsilon squared lower bound and the original work of AMS showed a log
N lower bound. And when P equals 2, this is one of the positive results that they
had was remember frequencies can never be bigger than little M times capital M.
So this is like the word size you need to store frequency.
They showed you just need 1 over epsilon squared words of memory. And
when P is not equal to 2, Indyk gave an algorithm which had log squared
space. And there were some improvements given by Li later. But I mean it's
the same bit complexity.
So what do we show? We show that for all P between 0 and 2 this is the correct
answer, both upper and lower bounds. And I should say that I'm being slightly
pedantic: all bounds on this slide are hiding log log factors, which I'm not going
to talk about.
So are there any questions about the statement of the problem and what we're
going to try to show? I'm going to show you the upper bound. No? Okay.
Now I'm focusing on the upper bound and this bound. Okay. So the first thing I
want to tell you about are P stable distributions. So these are -- this is a family of
continuous probability distributions. And they're out there for all P between 0 and
2. Let's say they're indexed by P.
In reality, for each P there's like a family of P stable distributions. But let's say
there's only one for each P, which is going to satisfy some property that I'm going
to tell you about now.
So I'm just going to define this distribution by a property it has. Okay? And the
property it has is that if you take any fixed vector X and take its dot product with a
vector of independent P stable random variables. So you take the sum over I
of Q sub I times X sub I; this itself is a random variable which is P stable, up to
a scale factor, which is the LP norm of X.
So that's the only property I care that it has. And you probably have seen this for
Gaussians maybe. But they're out there for all P between 0 and 2.
>>: Curious why you credit [Indyk with] the entire distribution?
>> Jelani Nelson: Oh, yeah. So I've been told to fix this. Is it Lévy who I should
be crediting? Yeah. Lévy. Sorry.
>>: 50 years earlier.
>> Jelani Nelson: I think I did that -- yes, we wrote a book and I forgot to change
it to Lévy. Lévy and who else?
>>: Khintchine.
>> Jelani Nelson: Khintchine. I see. Okay. So let me tell you about the
algorithm.
So it's going to maintain -- you should think like dimensionality reduction. So
rather than maintain the vector X in memory, I'm going to maintain a vector Y
which is some linear map applied to X.
And why can I maintain this in a stream? If I can maintain the linear map A,
basically whenever I see an update saying add V to the ith coordinate of X, I
take V, multiply it by the ith column of A, and add it to Y. So I can do that.
What is A? A is a matrix of independent P stable random variables. And the
idea of using this P stable sketch matrix for data streams was first used in an
algorithm of Indyk. And he's going to estimate the LP norm; by the way, this is
the P norm -- I'm just saying take the moment and raise it to the 1 over P
power.
Okay. He said estimate the Pth norm as the median of the absolute values of
the entries. Li gave a different estimator which is unbiased and has lower
variance and so on. But in both cases the number of rows you need in this
matrix is 1 over epsilon squared.
Okay. So that's great. You just maintain this 1 over epsilon squared dimensional
vector. The only issue is you need to store this matrix A. You need to
consistently use the same matrix A over the stream. And naively, you need a lot
of bits of space to store the matrix. And we're trying to get a small space
algorithm.
So what Indyk did in his algorithm was he didn't actually store A explicitly in
memory. He stored some short random seed and used a pseudorandom
generator to stretch the seed to get the various entries of A.
So at any given time he was only storing the seed, not the matrix. Okay. And
actually -- remember, his algorithm used something like 1 over epsilon squared
times log squared space, and we're trying to get something like 1 over epsilon
squared times log space. The only reason his algorithm was suboptimal was
because of this step. The seed that he needs to store was the dominant thing
in his algorithm in terms of space.
So you can just ask, well, is there a more efficient way to generate this matrix A?
And what we show is that in fact, yes, just via K-wise independence.
So for fixed I -- so I have this matrix. So for a fixed row the entries in that row
need to be K-wise independent and how do you generate K-wise independent
random variables? You take a seed and you do stuff with it.
And those seeds from row to row all need to be pairwise independent. So this
is what we show. And we show that if you pick K to be something like 1 over
epsilon to the P, then Indyk's algorithm, which took the median of the entries of
the vector, works. And we also show that a different estimator works with a
much smaller K.
Although algorithmically I am still not aware of any reason why this is better.
Basically this would give you optimal space as well, and it would give you the
same time as this one, if you do the appropriate tricks.
So I'm just going to focus on this one. Okay. So the remainder of the talk I'm
going to explain to you why this is true. So is it clear what I'm saying here?
>>: Are you going to tell us how you produced these K-wise independent
variables from the pairwise independent seeds?
>> Jelani Nelson: Oh. I didn't want to get into that, but I guess I'll say -- so
let's say you want to generate K-wise independent random variables over some
finite field. You take a random degree K minus 1 polynomial, evaluate it at all
the points, and those are your values.
What you're actually going to do here is well like Cauchy random variables, P
stable random variables are continuous. I can't implement that on the computer.
First step is I'm going to discretize it over some precision. And then how do you
generate -- how do you generate a P stable random variable? Like there are
ways to do it where you pick a random angle and you pick some random number
between 0 and 1 and you do some trigonometric stuff.
You discretize -- what you actually do is you discretize just like the unit interval 0
to 1. You pick the discretized angle. Pick a discretized number between 0 and 1
and do the finite field version in this discretized way. Does that answer your
question?
>>: [inaudible] how are you building up pairwise independence to K-wise
independence? Because the polynomial -- you assume more --
>> Jelani Nelson: Oh, I see. Okay. Okay. So to get my K-wise independent
variables for a particular row, I need to pick coefficients, C sub 0 through
C sub K minus 1. And let's say this is row 1. And then I need to pick another
set of coefficients for row 2.
Okay. And I'm saying basically look at this, just like concatenate the coefficients
and look at this as a single integer. And now I'm saying like these integers need
to be chosen pairwise independently. So you can do that over a larger field.
Basically pick these seeds pairwise independently and then now use the seeds
to generate K-wise independent random variables.
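A sketch of the standard polynomial construction for one row (the field size and degree here are arbitrary illustrative choices of mine):

```python
import random

P_FIELD = 2**31 - 1  # a prime, so arithmetic mod it gives a field

def kwise_value(coeffs, x):
    """Evaluate a degree-(k-1) polynomial at x over the field, via
    Horner's rule. With uniformly random coefficients, the values at
    distinct points are k-wise independent and uniform."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % P_FIELD
    return acc

rng = random.Random(42)
k = 4
seed = [rng.randrange(P_FIELD) for _ in range(k)]  # one row's seed
row = [kwise_value(seed, i) for i in range(1, 11)]  # 10 row entries
```

The concatenated coefficient vectors (the per-row seeds) would then themselves be drawn pairwise independently across rows, as the speaker says.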
>>: K-wise independence is the same within a row but not between the --
>> Jelani Nelson: If this is the matrix.
>>: Different rows.
>> Jelani Nelson: If this is the matrix, you look at a particular row, these entries
are K-wise independent.
>>: But you're not making an assumption about the K-wise independence around
columns?
>> Jelani Nelson: Around columns. No.
>>: There you only have pairwise.
>> Jelani Nelson: Yeah. Because so really what is this matrix doing? Like
here's my vector X. And here's Y. Let's look at like a particular entry. This row.
So if these were independent, then this entry would be a P stable random
variable. And I'm saying replace this full independence with K-wise
independence, and now this is no longer P stable, but I still want to say somehow
it's good.
Okay. So the way I'm going to show you that this works is I'm going to analyze
Indyk's algorithm under full independence in a way that exposes where full
independence is being used. And then just replace it -- I'll show you how to
modify the argument.
So here's an argument for why his algorithm works. Remember, it takes this
Y equals A X and looks at the median of the absolute values of the entries of Y.
So why does that work? So let's assume that for the P stable distribution, the
median of the absolute value is 1. And you can do that just by scaling the
distribution.
And now I'm going to define this indicator function of an interval. And let's look
at one P stable random variable Q. So Q is the sum over I of Q sub I times
X sub I. Okay. So what is this random variable? Q is a P stable random
variable scaled by the LP norm of X. So this is the absolute value of a P stable
random variable.
And the median of it being -- the median of that distribution being 1 means this.
That's exactly what it means. Does anyone want me to draw anything? Okay.
Good. I'm not really saying anything except rewriting stuff. Okay. Good.
Okay. So how are we going to analyze the algorithm? Well, I'm going to take
that interval minus 1 to 1, and I'm going to shrink it on both sides by epsilon.
So I have this P stable distribution. I said the amount of probability mass
between minus one and one is a half. I shrunk it by epsilon on both sides, so
I'm going to lose theta epsilon mass. Similarly, if I stretch it by epsilon on both
sides, I'll gain theta epsilon mass. Now what's this telling me? This is telling me
if I take enough trials -- what's a trial? A trial is a row of this sketch matrix, right?
If I imagine these are my Q sub Is and this is the sum over I of Q sub I times
X sub I, and I'm doing a lot of
of different trials, so what this is saying is if I take R trials, then the fraction -- in
expectation, the fraction of trials that are less than 1 minus epsilon times the LP
norm is less than half of the trials.
And the fraction of trials that are less than 1 plus epsilon times the LP norm,
that's more than half the trials. So if everything were going according to what
these expectations are telling me, like imagine that these weren't just
expectations but this is like what happened, then the median would be between 1
minus epsilon and 1 plus epsilon times the LP norm. So all that I need to happen
is that, like in fact less than one-half are less than 1 minus epsilon times LP norm
etcetera. And if I take enough trials that will happen just by concentration. And
the kind of concentration you need is guaranteed by Chebyshev. And already I'm
telling you that, to analyze it with Chebyshev's inequality, the trials
need to be pairwise independent. That's why all rows need to be pairwise
independent. Within the row I'm using the fact that within a row I have full
independence because I'm saying this is a P stable random variable. And I know
that if I take this P stable random variable I know how much mass is where.
But if I only have K-wise independent QIs, then, for example, maybe a lot of
mass is in this really tiny epsilon interval, and I shrink my epsilon and something
crazy happens, I don't know. Or I guess in this case there's no mass there, and
then I don't lose the epsilon mass.
So the only thing I need to show to make sure that this algorithm works is I need
to show that the expectations -- I need the show these two expectations are
approximately still the same, to within epsilon, in the case of K-wise
independence.
Okay. So this is, I guess -- call it an invariance principle. I'm showing some
invariance. What's the expectation of an indicator function? I'm talking about
CDFs here. So I'm saying the CDF is close if you have full independence
versus K-wise independence.
So that's what I'm showing. Okay. Good. So how might you show that? Well,
you want to show that this expectation is similar under K-wise independence.
Well, we know K-wise independence preserves expectations of degree K
polynomials. That's what they're good for. Let me replace this indicator function
of an interval with a well-approximating low degree polynomial, maybe
sandwich it between two polynomials and use the fact that those are fooled, or
something like that.
Okay. So we're not going to do that for a good reason, which I'm going to tell you
soon. The good reason is that such polynomials don't exist because these P
stable distributions other than Gaussians have infinite variance. So like any
polynomial is either going to have -- it's either going to be like a constant function
or something or it's going to have infinite expectation. Okay. So there are no
such well-approximating low degree polynomials under the P stable measure.
So what we're going to do instead is well-approximate this indicator function of
an interval with some smooth function.
And then I'm going to show you that the smooth function -- or I'm going to show
you some steps; I don't have time to show you the whole thing. The smooth
function is fooled by K-wise independence under the P stable measure.
Okay. Okay. So I'm going to now talk to you about how to get this function.
Okay. Okay. So switching contexts I'm going to talk to you about FT
mollification, which I haven't defined yet but I'm about to define it.
I'm just curious, have people seen mollification before? Mollification? Okay.
Some have, some haven't. That's okay. I'm not assuming that anyone has seen
mollification, I'll tell you what it's all about.
So the idea is this. I have a function F which is not a smooth function. Okay.
Like, for example, it's the indicator function of an interval. It might not even be
continuous.
And I would like a smooth function which well-approximates F. And there are
these two facts out there. One is if you convolve a function F with a smooth
function G, then all of a sudden the convolution becomes smooth as long as like
these derivatives, as long as this integral exists.
Okay. And this is some identity: the Kth derivative of the convolution is F
convolved with the Kth derivative of G. And the next fact is that there's this
limit of functions, the Dirac delta function.
If you convolve F with the Dirac delta function you get F back.
So basically the idea is: convolve F with a smooth approximation of the Dirac
delta function. That's mollification. And mollification looks something like this.
You take a smooth function -- this is my G -- and there are smooth functions
out there which are only supported in some finite interval.
You take B sub C, which means shrink it inwards by a factor of C and make it
taller by a factor of C, and as C goes to infinity it looks like the Dirac delta
function. Convolve it with F and you get something smooth. That's
mollification.
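A small numerical illustration of plain mollification (my own discretized demo, not from the talk): convolving the indicator of [-1, 1] with a tall, narrow smooth bump yields a function that matches the indicator away from the endpoints.

```python
import numpy as np

dx = 0.001
t = np.arange(-3, 3, dx)
f = ((t >= -1) & (t <= 1)).astype(float)  # indicator of [-1, 1]

c = 20.0  # the tuning factor: the bump's support shrinks as 1/c
s = np.arange(-1, 1 + dx, dx)
u = c * s
bump = np.zeros_like(s)
inside = np.abs(u) < 1
bump[inside] = np.exp(-1.0 / (1.0 - u[inside] ** 2))  # smooth bump
bump /= bump.sum() * dx  # normalize to integrate to 1

smooth = np.convolve(f, bump, mode="same") * dx
print(smooth[np.abs(t) < 0.5].min())  # ~1 well inside the interval
```

As c grows the bump approaches the Dirac delta and `smooth` approaches the indicator, but, as discussed next, the derivative bounds degrade.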
Okay. So a problem with mollification -- problem meaning it's a problem for our
particular application -- is that the resulting function is a smooth function, but
just how smooth is it? What are the derivative bounds you get out of mollifying
things? If you use a function which has some finite support like this, you're
going to get derivative bounds which are like C to the K times K to the K for the
Kth derivative. And C, remember, is the tuning factor, where the larger I make
C, the more this looks like the Dirac delta function.
Okay. So what we do instead -- I said FT mollification -- is we'll take the same
B, and rather than convolve with this, we're going to convolve with its Fourier
transform.
So this is B. Its Fourier transform looks like this -- looks like a sinc function,
sort of. And we square it and we convolve with that.
Okay. And we're able to show -- so you can show two things. One is that you
get bounded derivatives. So I wrote K before; basically the Lth derivative is
like C to the L times 2 to the L, whereas before it was C to the L times L to the
L. And also you can write down some explicit bound on how close your indicator function
is at any given point to the smooth function.
And all this means is the larger you make C, the better of an approximation you
get which you expect. But you pay in your derivatives.
And also the farther away you are from the points of discontinuity, A and B, the
better approximation you get, which you also expect from it.
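Concretely, the two properties on the slide can be summarized as follows (up to constants; $\tilde I_{[a,b]}$ denotes the FT-mollified indicator of $[a,b]$, $c$ the tuning parameter, and $d(x)$ the distance from $x$ to the nearer endpoint):

```latex
\left\| \tilde I_{[a,b]}^{(\ell)} \right\|_{\infty} \le (2c)^{\ell}
\quad \text{for all } \ell \ge 0,
\qquad
\left| \tilde I_{[a,b]}(x) - I_{[a,b]}(x) \right|
\;\lesssim\; \min\!\left( 1,\; \frac{1}{c^{2}\, d(x)^{2}} \right).
```

So larger $c$ gives a better pointwise approximation, at the price of larger derivative bounds, and the approximation improves with distance from the discontinuities.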
And here's some plots. I'm making C bigger and bigger. That's a small C. Big
C. Okay. Okay. Back to moment estimation: this thing only matters, for this
talk, because I'm going to use it to prove what I said about K-wise
independence.
So how are you going to prove it? Okay. So suppose that R1 up to RN are
K-wise independent P stable random variables, versus fully independent ones.
And look at the sum over I of R sub I times X sub I. With full independence I
know this is a P stable random variable scaled by the LP norm of X. Let's just
say the LP norm of X is 1.
What I want to show is that this is true: this expectation is within O of epsilon of
that one. If I could show that for any A and B, that would give me what I said in
the beginning.
Okay. And I'm going to show it to you via a chain of three inequalities. First I'm
going to move from the original function to the smooth version, then move to
K-wise independence, and then move back to the original function.
Okay. Okay. So why is the first one true? Okay. So this difference is at most
the difference when I put the absolute values inside the expectation. And now I
just think of some conditioning.
So what is Q? You know, it's either within epsilon of the boundary points AB, or
its distance from the boundary points is between 2 to the S minus 1 epsilon and 2
to the S epsilon for some S.
So if it's within, if it's within epsilon of the boundary, well, that happens with
probability at most O of epsilon. Okay. And I know that these two functions are
never off by more than one from each other. So that contributes at most O
epsilon to the error.
And for the other case, where the distance is between 2 to the S minus 1 times
epsilon and 2 to the S times epsilon for some S, I loop over all S, I get this
thing, and summing the geometric series this is at most O of epsilon.
Okay. And then the next inequality is some technical lemma which says
K-wise independence fools bounded smooth functions. The last inequality is
very similar to the first. The only thing is you have to argue that under K-wise
independence it's still the case that my K-wise independent random variable
will not land in a small interval with large probability.
So now I'm going to move to showing you this one, the second inequality. Any
questions? So suppose I have a function F whose Lth derivative goes like
alpha to the L for some base alpha, and I have K-wise independence where K
is large enough both in terms of some error parameter and alpha. Then K-wise
independence fools F. That's what this says.
Question? Okay. Okay. So, okay, proof strategy. So this is the thing that is
not going to work: well, I have a smooth function, K-wise independence fools
polynomials, so I'll approximate the smooth function by a polynomial, which is
fooled exactly by K-wise independence, and bound the error using Taylor's
theorem. That's not going to work because, well, the moments that you get
from Taylor's theorem are infinite.
And also these polynomials, like they have infinite expectations. All right. So
that's the problem that we had in the very beginning. So what are we going to do
instead? Okay. So we just ask ourselves like why are the moments of this
distribution infinite? It's because it's a wide tailed distribution. Okay. So we'll kill
the tails of the distribution.
Okay. So I'm going to define some new truncated random variables, as
follows. And then I'm going to define some indicator random variable 1 sub S,
which says 1 sub S is 1 when S is exactly the set of indices that got too big and
I truncated them.
Okay. And now from here to here I'm just saying, you know, at any point in my
probability space exactly one of the 1 sub S is 1. Something got truncated --
maybe it's the empty set. Then I use linearity of expectation. Then I say, well,
if S is exactly the set of truncated indices, then for the things not in S, R sub I
prime is still R sub I, so this is an equality. And now I want to say why this
should look like some kind of progress.
Okay. So the reason this looks something like progress is as follows: So wishful
thinking. Pretend there's no 1 sub S here. Now, there are two problems we
have. One is that we don't have full independence, we only have K-wise
independence. The other is, even if we had full independence, then our
moments are infinite. If we try to tailor expand and bound errors we get infinite
stuff.
So for the -- I'm claiming that this should look like progress in terms of the
second problem. So imagine we had all the independence in the world and
1 sub S is not there. What I want to do is something like condition on all the
R sub Is for I in S, okay? And Taylor expand about this point, and now my
error term depends on moments of this random variable, and this has finite
moments, because R sub I prime has no tail.
Okay. Okay. But I also have the problem that I don't have fully independent
random variables. Okay. So I'm going to define new indicator random variables.
1 prime sub S says that S is a subset of the truncated R sub I primes. And
now inclusion/exclusion: S being exactly the set of truncated things means S is
a subset of it and nothing outside S is truncated. So this is
inclusion/exclusion, and I'm just going to plug this in for 1 sub S on the last
slide.
So this is exactly what was on the last slide with that plugged in. And I claim
this looks like even more progress. Okay. And the reason it looks like even
more progress is: now, if I only have K-wise independence, suppose S union T
is small -- suppose S union T has size at most K over 2. Then I can condition
on all the R sub Is for I in S union T. That determines this, and it determines
this. And I can Taylor expand about this point, looking at the K over 2 moment,
using the remaining K over 2 independence to look at moments of this.
Okay. Of course the problem is that not all the S union Ts are small. Okay. So
the last thing that you do is you just ask yourself, well, how much error do I
accumulate by just not including the sum over the large sets? Okay?
So this is the approximate inclusion/exclusion. And what you show --
unfortunately this part just becomes a series of technical calculations, so I'm
not going to dwell on it. But basically what you can show is you don't lose
much by doing this.
So that's the idea. By the way, how long do I have? Another 20 minutes. And
the very last step -- so remember that was inequality two, moving from full
independence to K-wise independence. Now the very last inequality, I said,
was similar to the first but needs an anti-concentration argument. And maybe,
if someone really wants to know, at the end of the talk I can come back to it.
But I'm just going to move on to some other stuff.
Okay. So what did we talk about? So norm estimation and moment
estimation, they're the same thing up to raising things to the 1 over P power.
We showed that moment or norm estimation can be done in big O of this
space. I didn't really talk about the lower bound.
And the main idea was just reducing the proof to showing this invariance
principle for P stable things. Okay. So any questions about what we
just showed? Great. Okay. So there's some other applications of FT
mollification which I'm going to briefly sketch.
One is pseudo random generators for polynomial threshold functions. And there
so the polynomial threshold function is a function F which I can write as the sign
of some polynomial. And it was shown by Diakonikolas, et al. that if you look at
let's say X being from the hypercube, okay, if you look at X being random from
the hypercube, uniformly at random versus Y be having K-wise independent
entries, then as long as K is at least roughly 1 over epsilon squared, this sine
function is fooled. Okay.
And the proof outline was well my degree 1 polynomial P looks like some linear
form minus a constant. And they have a reduction which I won't talk about, which
says we only need to consider the case where none of the AIs are too big in
magnitude. And they show the theorem for this regular case.
Okay. And so with FT mollification you can show that in this regular case you
need 1 over epsilon squared. You don't even need these logs. It's basically the
same proof that I showed you.
>>: P of X is polynomial.
>> Jelani Nelson: P of X is polynomial. And they looked specifically at the
degree 1 case. Yeah. So they showed for degree 1, when P is degree 1, you
only need one over epsilon squared wise independence, roughly. And
basically -- this would look like what I just showed you, because what is the sign of something? What does the sign function look like? So this is the sign function; this is my F. And they're looking at F of the sum of A_I X_I, right? And this really looks a lot like an indicator function: it's minus 1 here and 1 there. Up to shifting and scaling it's the indicator of [theta, infinity): this thing is 1 when the sum over I of A_I X_I is in [theta, infinity). So I just need to look at the indicator function of [theta, infinity). Right. Okay. And in fact it's even a bit simpler than
what I just showed you, because so let's define these random variables. I'm
going to use the same chain of inequalities. The first inequality is the same exact
thing, except now use the fact that you're in this regular case where no AI is too
large, because you need to show that like X is anti-concentrated. There's not too
large a probability that you're near the boundary. Okay. But anyway this thing is
basically exactly the same.
The second step is just Taylor's theorem. You remember I said in the P stable
case we can't just Taylor expand because moments are infinite. But now we're
working over a hypercube so things are finite. Use Taylor's theorem and that's it.
And the last step is the same as last time. Basically the same proof I just
showed you.
And, in fact, you can also use FT mollification to show that polynomial [inaudible]
fools the degree 2 case, and recently Daniel showed how to do it for any
constant degree but only under Gaussian measure, not over the hypercube.
Okay. I'll show you another thing you can do by FT mollifying. So Jackson's
theorem. Some theorem in approximation theory. So what is approximation
theory about? It's about approximating functions by simpler functions. And one
example of a simpler function is a low degree polynomial. So one natural
question is I have some function F and I want to ask myself, on some set S, what's the best approximation I can get by a degree-K polynomial, in L infinity or something.
So here's a natural question. If I give you K, what's the best epsilon you can get?
Jackson's theorem says this. It says that if S is, say, the unit interval and I define
this modulus of continuity which just says if two points are within delta how much
can F differ by the evaluation on those points, then this is the epsilon that
Jackson shows you can get.
Okay. And basically you can prove this just by FT mollifying F and Taylor
expanding to K and that's the polynomial.
So I'm not going to -- you don't have to dwell on the details, but basically I want
to show you it's the exact proof you just saw.
So first I'm going to take this function G which is basically F. And then I'm going to FT mollify it. And now what is G of X minus its FT mollification at X? I can just write it out. Ignore this M. Sorry. And now, okay, one thing that I should say is, what's one reason FT mollification works? I'm convolving with the square of the Fourier transform of some function. And I'm going to normalize the square so its integral is 1. So now that I have this function which is nonnegative that I'm convolving with, and it integrates to 1, I can view it as a density function. And
then I can say, well, actually this integral is really an expectation under that
density. Okay. And then now I can condition on this random variable Y where is
it in that density and I can write something similar to what I wrote last time. And
then I can use the second moment method.
So that's really what FT mollification is about -- why it works. And so then you say, well, G is close to its FT mollification, the FT mollification is close to a polynomial by Taylor's theorem, and you balance the errors, which I won't do in front of you. It's something you can go home and calculate. That's basically the idea and you get the bound.
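Here is a numerical sketch of that picture, with illustrative parameters not from the talk: convolve an indicator function with the normalized square of the Fourier transform of an interval, a Fejer-type kernel. Because the kernel is nonnegative and integrates to 1 it acts as a probability density, so the mollification is an expectation and stays between 0 and 1; because the kernel is bandlimited to [-c, c], all derivatives of the mollification are under control.

```python
import numpy as np

dx = 0.001
x = np.arange(-1, 1, dx)
c = 100.0                                   # cutoff; the mollification is bandlimited to [-c, c]
kernel = np.sinc(c * x / (2 * np.pi)) ** 2  # (sin(cx/2) / (cx/2))^2, a Fejer-type kernel
kernel /= kernel.sum() * dx                 # normalize so it integrates to 1: a density

ind = (np.abs(x) <= 0.2).astype(float)      # the indicator being mollified
moll = np.convolve(ind, kernel, mode="same") * dx

# Nonnegative, unit-mass kernel => the mollification is an expectation, so it stays
# in [0, 1], is close to 1 well inside the interval, and close to 0 well outside it.
assert moll.min() >= 0 and moll.max() <= 1 + 1e-9
assert moll[np.abs(x) < 0.1].min() > 0.85
assert moll[np.abs(x) > 0.4].max() < 0.1
```

Viewing the convolution integral as an expectation over that density is exactly the move that lets the second moment method be applied, as described above.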
And I'm just going to conclude. So I showed you space optimal moment estimation in data streams and some other applications of FT mollification. And the algorithm I just
showed you, the update time, the time it would take to process every update in
the stream would be like 1 over epsilon squared. So one question is can you get
a space optimal algorithm which has constant update time and it's known that
that is possible when P is 2. And for other P, in some newer work we're able to
get it down to at least being poly log as opposed to poly, but it's still not constant.
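For the P = 2 case mentioned here, the classic linear sketch is the AMS estimator. Below is a minimal offline sketch of it, not the talk's algorithm: it assumes fully random signs instead of the 4-wise independent hash functions a real streaming implementation would use, and the vector, repetition counts, and seed are made-up parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(-10, 11, size=1000).astype(float)   # the aggregated stream vector
true_f2 = float(np.sum(x ** 2))

# AMS: Z = <s, x> for random signs s has E[Z^2] = F2. In a stream, an update (i, v)
# just adds v * s[i] to each counter Z -- constant work per counter per update.
# Median-of-means over independent copies controls the variance.
def ams_f2(x, reps=9, per=200):
    means = []
    for _ in range(reps):
        signs = rng.choice([-1.0, 1.0], size=(per, len(x)))
        means.append(np.mean((signs @ x) ** 2))
    return float(np.median(means))

est = ams_f2(x)
print(est / true_f2)   # concentrates around 1
```

The update-time question in the talk is about how much of this per-update work can be collapsed to a constant for general P, not just P = 2.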
Another thing is F2 estimation -- that's the same thing as L2 estimation. Maintaining a linear sketch to estimate the L2 norm is very related to Johnson-Lindenstrauss. Can you get something better for Johnson-Lindenstrauss? That's another question. And the other one is finding more applications of FT mollification.
So if anyone has questions, I'll --
[applause]
>>: Can you say a bit about [inaudible].
>> Jelani Nelson: Actually I have some slides on that. Okay. So agnostic
learning. So I have a domain and a distribution on the domain with like labels.
And so I have a concept class, which is a set of functions mapping domain to labels. And, okay, the optimal value for solving this learning problem is to come up with the best function in my concept class which minimizes like the probability of error. And I want to use as few examples as possible to get within epsilon of optimal. So what do I get? So I can get samples from this distribution, domain items together with labels, and I want as few samples as possible to learn something good. Okay. And so there was some work which basically says
that if you can well-approximate every function in your concept class by a low
degree polynomial, then you can agnostically learn. Okay. And so actually the
work was called, the paper was called agnostic learning of half spaces or
something. And what's a half space? A half space is a degree one polynomial.
And when I showed you this like K-wise independence fools degree one
polynomial basically what are you doing in there? You're FT mollifying and
Taylor expanding gives you a polynomial which well approximates these half
spaces.
So that's the connection. So they show that if you want to well approximate these half spaces in the L1 norm to error epsilon, you would need degree 1 over epsilon to the fourth. With FT mollification you can get 1 over epsilon squared. That's the connection.
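A toy illustration of the low-degree approach, not from the paper: fit a low-degree polynomial to corrupted halfspace labels and output its sign. Least squares stands in here for the L1 polynomial regression the actual algorithm uses, and the threshold, degree, grid, and noise pattern are all arbitrary choices for the sketch.

```python
import numpy as np

# Deterministic toy data: the true concept is the halfspace sign(x - 0.305) on [-1, 1],
# with a fixed 10% of the training labels flipped.
x = np.linspace(-1, 1, 201)
y_true = np.sign(x - 0.305)
y = y_true.copy()
y[::10] *= -1                     # corrupt every tenth label

# Low-degree algorithm: polynomial regression on the noisy labels, then threshold.
coeffs = np.polyfit(x, y, deg=7)
pred = np.sign(np.polyval(coeffs, x))
acc = float(np.mean(pred == y_true))
print(acc)   # well above chance despite the corrupted labels
```

The guarantee behind this is exactly the approximation statement above: because a degree-d polynomial can approximate the halfspace well, regression plus thresholding cannot be much worse than the best halfspace.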
Any other questions?
>>: Can you say something about the anti-concentration?
>> Jelani Nelson: Oh, yeah, sure.
What do I want to show? I want to show if I have a point -- I want to show that
the expectation of the indicator function of minus epsilon-epsilon evaluated at R
is still O of epsilon. Right? Okay. So what I'm going to do is come up with a
function which looks like this. So this is G. I want the property that G is greater
than or equal to I sub minus epsilon-epsilon everywhere. I also want the property
that the expectation of G of Q, this is the fully independent version, is O of
epsilon. I want these two properties.
I also want that I have good derivative bounds on G. Okay. Now, if I had these
things, then what would I say? I would say, well, I know that this is true. I know
that the expectation of G of R, upper bounds, the expectation of the indicator
function of minus epsilon-epsilon are, right? And I know that since the
expectation of G of Q is O of epsilon and G has good derivative bounds, I can
apply that smooth function lemma to say that the expectation of G of R is still O
of epsilon and that would give me that this is O of epsilon.
So I just need to show you that such a function G exists. And such a function G does exist, and basically you can take G to be two times the FT mollification of the indicator of minus 2 epsilon to 2 epsilon, with cutoff C where C is something like 1 over epsilon.
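Reading the construction as G equal to two times the FT mollification of the indicator of [-2 epsilon, 2 epsilon] with cutoff on the order of 1 over epsilon (my reading of the board; the constants and grid below are illustrative), a quick numerical check of the domination property:

```python
import numpy as np

eps, dx = 0.1, 0.001
x = np.arange(-1, 1, dx)
c = 10 / eps                                  # cutoff on the order of 1/eps
kernel = np.sinc(c * x / (2 * np.pi)) ** 2    # Fejer-type mollifier, as in the talk
kernel /= kernel.sum() * dx                   # normalize to a density

wide = (np.abs(x) <= 2 * eps).astype(float)   # indicator of [-2 eps, 2 eps]
G = 2 * np.convolve(wide, kernel, mode="same") * dx

ind = (np.abs(x) <= eps).astype(float)        # indicator of [-eps, eps]
assert np.all(G >= ind)                       # G dominates the indicator on the whole grid
assert G.max() <= 2 + 1e-9                    # and G never exceeds 2
```

Intuitively the factor of two buys slack: the mollification of the wider indicator is at least one half everywhere on [-epsilon, epsilon], so doubling it clears the indicator, while G keeps the derivative bounds needed for the smooth function lemma.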
Yeah.
>>: What kind of bounds do you need per --
>> Jelani Nelson: It's going to be something like O of 1 over epsilon, to the L.
So the --
>>: Could you use a Gaussian for G, or would that --
>> Jelani Nelson: I think Gaussians would give you L to the L derivative bounds. E to the minus X squared -- you'd get L to the L derivatives.
So I think there's like a theorem in complex analysis which says that if you want a function to have Lth derivative bounded by alpha to the L for all L, then its Fourier transform should only be supported in an interval like minus alpha to alpha. So Gaussians would not satisfy this. This is why you have to mollify: by convolving with the Fourier transform of some function of compact support, you're killing the tail.
>> Yuval Peres: Any other questions? Let's thank Jelani.
[applause]