>> Yuval Peres: Okay. I'm delighted to have Jelani Nelson from MIT with us
today. The work he'll describe on moment estimation and data streams is
already quite well known. I've heard a lot indirectly about it. So I'm very happy
we can hear it from the source. Please.
>> Jelani Nelson: I'm Jelani. This is joint work with Daniel Kane and David
Woodruff. And I'm going to start off just by defining the model for this problem.
So we have an N dimensional vector X. Starts off as the zero vector, and you
should think of N as being really, really large, like maybe I'm indexing the vector
by source destination IP pairs or something.
And I have a really long sequence of updates coming in a data stream. And
each update is a pair. It's an index together with an amount. And it says add
that amount to the corresponding index of the vector.
And the amounts are all bounded precision integers. And the goal in this talk is
to output the Pth frequency moment of the vector X where P is given up front.
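A minimal sketch of this turnstile model in plain Python (the toy stream and names are illustrative, not from the talk):

```python
# The vector x starts as the zero vector; each stream update (i, v)
# says "add amount v to coordinate i". The goal is the P-th frequency
# moment F_P = sum_i |x_i|^P, computed here naively for reference.
n = 8
x = [0] * n

stream = [(3, 1), (5, 2), (3, 1), (0, -1)]  # (index, amount) pairs
for i, v in stream:
    x[i] += v

def frequency_moment(x, p):
    return sum(abs(xi) ** p for xi in x)

print(frequency_moment(x, 2))  # here: 1 + 4 + 4 = 9
```

Of course this stores all of x; the whole point of the talk is to approximate F_P in far less space.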
So I imagine like --
>>: Updating this vector. So N is not the time?
>> Jelani Nelson: N is not the time. Here's an example: I have network traffic.
Every packet has a destination and source IP. And X is indexed by these
things. Every time I see a packet I update the corresponding entry by one, and
now the vector tells me, you know, X sub IJ is how many packets went from
person I to J.
Now, is the model okay? So moment estimation gets used like in a black box
way and a bunch of other problems. So one, for example, is empirical entropy
estimation. So you imagine again this IP example, if it's indexed just by
destination IP, you know like the entropy tells me how skewed the distribution is.
>>: So I just want to -- maybe it's good for us to focus on what the most
important example is. Was it this IP one?
>> Jelani Nelson: Yeah, so from now on think of it being indexed by destination
IPs and think of V always being 1. And X sub I is just how many packets were
sent to IP address I.
So then let's say, like in this application, you know the empirical entropy of this
vector now tells me something about how concentrated like the distribution is of
where packets are going. And if it's too concentrated, for example, that could be
a signal for a denial of service attack.
So, I mean, so this thing -- actually they use empirical entropy estimation to do
these things. And the best known algorithm uses moment estimation as a black
box; this is an algorithm I did with Nick Harvey and Krzysztof Onak.
So I want to talk about, I guess, how to solve the problem. So we give an
optimal algorithm for this problem and that's what I'm going to focus on.
There are two natural objectives in developing an algorithm.
One is I want to minimize the amount of memory I use in my algorithm as I'm
processing the data stream. So if you think of N as being really large, the length
of the vector being really large you don't want to have to remember the entire
vector in memory.
And I also want to minimize update time. I don't want to have to take too long to
process every single packet or something.
So there are two easy solutions that use a lot of space but have low update time.
One is remember the whole stream. And then at the end of the stream calculate
the answer, or there's also keep the whole vector in memory.
Remember, little m is the stream length, and each update amount is bounded in
magnitude by capital M.
So this is just how many bits you need to represent a single frequency.
>>: When you say keep the stream in memory, you calculate it afterwards?
>> Jelani Nelson: Yeah.
>>: Is there something clever we do other than get that out of memory?
>> Jelani Nelson: You mean this one or this one?
>>: The second one.
>> Jelani Nelson: I mean, so if you don't care about the reporting time at the
very end, you could keep the whole stream in memory and loop over 1 to n
and sum up all the frequencies.
>>: Okay.
>> Jelani Nelson: You wouldn't use n memory. You would use little m. Is that
what you're asking?
>>: I mean, there's just -- what are we doing after we see the stream?
>> Jelani Nelson: Or you could sort by the index and then just like run through it
once and accumulate.
>>: Your goal is just to compute the moments of the final vector?
>> Jelani Nelson: Yes.
>>: So why would you want to keep more than X?
>>: Maybe N is much larger than M.
>> Jelani Nelson: Yeah. You might also want like, for some of these like
network monitoring applications, you want to like ask what is the moment now,
what is the moment now. Periodically throughout the stream. So it's not
necessarily always at the end of everything. But, yeah, M could be much less
than N which is why this could be much better than that.
>>: The example of --
>> Jelani Nelson: P is something that's very near 1. But not quite 1. And the
reason is -- I can explain why this even makes some kind of sense.
The idea is just like -- so there are these other entropies besides Shannon
entropy. There's Tsallis entropy and Rényi entropy, defined in terms of
moments. Just by L'Hôpital's rule, they converge to Shannon entropy in the
limit as P goes to 1. Let me write down what their definitions are.
Define H sub P to be 1 minus F sub P, over P minus 1. Okay. Where this is
F sub P of the normalized vector. So the frequencies become the
probabilities.
Then this thing, the limit as P goes to 1 of HP is equal to Shannon entropy. So
this is the connection between moments and entropy.
But if you just black box it, if you use the fact that this converges to Shannon
entropy, you're not going to get such a great algorithm, but there are some tricks
you can use where you basically calculate lots of HPs and then interpolate and
then evaluate some interpolated polynomial at 1.
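Written out, the definition the speaker puts on the board (this is the Tsallis form; the Rényi version uses a logarithm but has the same limit), with $q = x/\|x\|_1$ the normalized frequency vector:

```latex
H_P(x) = \frac{1 - F_P(q)}{P - 1}, \qquad F_P(q) = \sum_{i=1}^{n} q_i^{P},
\qquad \lim_{P \to 1} H_P(x) = -\sum_{i=1}^{n} q_i \ln q_i ,
```

where the limit, by L'Hôpital's rule, is exactly the Shannon entropy.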
That's a really short summary of that. Okay. So I'm going to focus in this talk on
minimizing space usage. Okay? And so there's one bit of bad news: the first
work that studied this problem, by Alon, Matias and Szegedy, showed you
cannot output the exact answer deterministically in sublinear space. So you
need to allow for slack in two ways. One is multiplicative error,
and the other is you have to settle for randomized algorithm that has some
constant probability of failure.
So this is our new goal, just out of necessity, if we want a sublinear space
algorithm. And then even more bad news is that polynomial space is required
for large P.
And so I'm going to focus -- in this talk I'm going to focus on P between 0 and 2
and it's a real number. So this is not just three values.
And P is not 0. So greater than 0. So let me tell you what was known before this
work.
>>: That work also had positive results?
>> Jelani Nelson: Alon, Matias and Szegedy, yes. They have a positive result.
When P equals 2 they had an algorithm which actually we show is optimal.
When P is greater than, I think, 5 they also had some like -- I'm sorry, when P is
greater than 2 they also had an algorithm which has been improved upon since
then. But they had positive results.
Okay. So first let me just recall. This is our notation here. So previously the
lower bound -- so this is all the number of bits of space you need to solve the
problem, where you want a 1 plus-or-minus epsilon approximation with, let's
say, 99 percent probability.
Okay. Over the randomness used by the algorithm. So David Woodruff showed
a 1 over epsilon squared lower bound and the original work of AMS showed a log
N lower bound. And when P equals 2, this is one of the positive results that they
had was remember frequencies can never be bigger than little M times capital M.
So this is like the word size you need to store frequency.
They showed you just need 1 over epsilon squared words of memory. And
when P is not equal to 2, Indyk gave an algorithm which had log squared
space. And there were some improvements given by Li later. But I mean it's
the same bit complexity.
So what do we show? We show that for all P between 0 and 2 this is the correct
answer, both upper and lower bounds. And I should say that I'm being slightly
pedantic: all bounds on this slide are hiding log log factors, which I'm not going
to talk about.
So are there any questions about the statement of the problem and what we're
going to try to show? I'm going to show you the upper bound. No? Okay.
Now I'm focusing on the upper bound and this bound. Okay. So the first thing I
want to tell you about are P stable distributions. So these are -- this is a family of
continuous probability distributions. And they're out there for all P between 0 and
2. Let's say they're indexed by P.
In reality, for each P there's like a family of P stable distributions. But let's say
there's only one for each P, which is going to satisfy some property that I'm going
to tell you about now.
So I'm just going to define this distribution by a property it has. Okay? And the
property it has is that if you take any fixed vector X and take its dot product with a
vector of independent P stable random variables. So you take the sum over I
of Q sub I times X sub I; this itself is a random variable which is P stable, up to
a scale factor, which is the LP norm of X.
So that's the only property I care that it has. And you probably have seen this for
Gaussians maybe. But they're out there for all P between 0 and 2.
>>: Curious why you credit [Indyk with] the entire distribution?
>> Jelani Nelson: Oh, yeah. So I've been told to fix this. Is it Lévy who I should
be crediting? Yeah. Lévy. Sorry.
>>: 50 years earlier.
>> Jelani Nelson: I think I did that -- yes, we wrote a book and I forgot to change
it to Lévy. Lévy and who else?
>>: Khintchine.
>> Jelani Nelson: Khintchine. I see. Okay. So let me tell you about the
algorithm.
So it's going to maintain -- you should think like dimensionality reduction. So
rather than maintain the vector X in memory, I'm going to maintain a vector Y
which is some linear map applied to X.
And why can I maintain this in a stream? If I can maintain the linear map A,
basically whenever I see an update saying add V to the ith coordinate of X, I
take V, multiply it by the ith column of A, and add it to Y. So I can do that.
What is A? A is a matrix of independent P stable random variables. And the
idea of using this P stable sketch matrix for data streams was first used in an
algorithm of Indyk. And he's going to estimate the LP norm; by the way, this is
the P norm -- I'm just saying take the moment and raise it to the 1 over P
power.
Okay. He said estimate the Pth norm as the median of the absolute values of
the entries. Li gave a different estimator which is unbiased and has lower
variance and so on. But in both cases the number of rows you need in this
matrix is 1 over epsilon squared.
Okay. So that's great. You just maintain this 1 over epsilon squared dimensional
vector. The only issue is you need to store this matrix A. You need to
consistently use the same matrix A over the stream. And naively, you need a lot
of bits of space to store the matrix. And we're trying to get a small space
algorithm.
So what Indyk did in his algorithm was he didn't actually store A explicitly in
memory. He stored some short random seed and used a pseudorandom
generator to stretch the seed to get the various entries of A.
So at any given time he was only storing the seed, not the matrix. Okay. And
actually -- remember, his algorithm used something like 1 over epsilon squared
times log squared space, and we're trying to get something like 1 over epsilon
squared times log space. The only reason his algorithm was suboptimal was
because of this step. The seed that he needs to store was the dominant thing
in his algorithm in terms of space.
So you can just ask, well, is there a more efficient way to generate this matrix A?
And what we show is that in fact, yes, just via K-wise independence.
So for fixed I -- so I have this matrix. So for a fixed row the entries in that row
need to be K-wise independent and how do you generate K-wise independent
random variables? You take a seed and you do stuff with it.
And those seeds from row to row all need to be pairwise independent. So this
is what we show. And we show that if you pick K to be something like 1 over
epsilon to the P, then Indyk's algorithm, which took the median of the entries of
the vector, works. And we also show that a different estimator works with a
much smaller K.
Although algorithmically I am still not aware of any reason why this is better.
Basically this would give you optimal space as well, and it would give you the
same time as this one, if you do the appropriate tricks.
So I'm just going to focus on this one. Okay. So the remainder of the talk I'm
going to explain to you why this is true. So is it clear what I'm saying here?
>>: Are you going to tell us how you produced these K-wise independent
variables from the pairwise independent seeds?
>> Jelani Nelson: Oh. I didn't want to get into that, but I guess I'll say -- so
let's say you want to generate K-wise independent random variables over some
finite field. You take a random degree K minus 1 polynomial, evaluate it at all
the points, and those are your values.
What you're actually going to do here is well like Cauchy random variables, P
stable random variables are continuous. I can't implement that on the computer.
First step is I'm going to discretize it over some precision. And then how do you
generate -- how do you generate a P stable random variable? Like there are
ways to do it where you pick a random angle and you pick some random number
between 0 and 1 and you do some trigonometric stuff.
You discretize -- what you actually do is you discretize just like the unit interval 0
to 1. You pick the discretized angle. Pick a discretized number between 0 and 1
and do the finite field version in this discretized way. Does that answer your
question?
>>: [inaudible] how are you building up pairwise independence to K-wise
independence? Because the polynomial -- you assume more --
>> Jelani Nelson: Oh, I see. Okay. Okay. So to get my K-wise independent
variables for a particular row, I need to pick coefficients, C sub 0 through
C sub K minus 1. And let's say this is row 1. And then I need to pick another
set of coefficients for row 2.
Okay. And I'm saying basically look at this, just like concatenate the coefficients
and look at this as a single integer. And now I'm saying like these integers need
to be chosen pairwise independently. So you can do that over a larger field.
Basically pick these seeds pairwise independently and then now use the seeds
to generate K-wise independent random variables.
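A sketch of the standard polynomial construction for one row (the field size and degree here are arbitrary illustrative choices of mine):

```python
import random

P_FIELD = 2**31 - 1  # a prime, so arithmetic mod it gives a field

def kwise_value(coeffs, x):
    """Evaluate a degree-(k-1) polynomial at x over the field, via
    Horner's rule. With uniformly random coefficients, the values at
    distinct points are k-wise independent and uniform."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % P_FIELD
    return acc

rng = random.Random(42)
k = 4
seed = [rng.randrange(P_FIELD) for _ in range(k)]  # one row's seed
row = [kwise_value(seed, i) for i in range(1, 11)]  # 10 row entries
```

The concatenated coefficient vectors (the per-row seeds) would then themselves be drawn pairwise independently across rows, as the speaker says.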
>>: K-wise independence is the same within a row but not between the --
>> Jelani Nelson: If this is the matrix.
>>: Different rows.
>> Jelani Nelson: If this is the matrix, you look at a particular row, these entries
are K-wise independent.
>>: But you're not making an assumption about the K-wise independence around
columns?
>> Jelani Nelson: Around columns. No.
>>: There you only have pairwise.
>> Jelani Nelson: Yeah. Because so really what is this matrix doing? Like
here's my vector X. And here's Y. Let's look at like a particular entry. This row.
So if these were independent, then this entry would be a P stable random
variable. And I'm saying replace this full independence with K-wise
independence, and now this is no longer P stable, but I still want to say somehow
it's good.
Okay. So the way I'm going to show you that this works is I'm going to analyze
Indyk's algorithm under full independence in a way that exposes where full
independence is being used. And then just replace it -- I'll show you how to
modify the argument.
So here's an argument for why his algorithm works. Remember, it takes this
Y equals A X and looks at the median of the absolute values of the entries of Y.
So why does that work? So let's assume that for the P stable distribution, the
median of the absolute value is 1. And you can do that just by scaling the
distribution.
And now I'm going to define this indicator function of an interval. And let's look
at one P stable random variable Q. So Q is the sum over I of Q sub I times
X sub I. Okay. So what is this random variable? Q is a P stable random
variable scaled by the LP norm of X. So this is the absolute value of a P stable
random variable.
And the median of it being -- the median of that distribution being 1 means this.
That's exactly what it means. Does anyone want me to draw anything? Okay.
Good. I'm not really saying anything except rewriting stuff. Okay. Good.
Okay. So how are we going to analyze the algorithm? Well, I'm going to take
that interval minus 1 to 1, and I'm going to shrink it on both sides by epsilon.
So I have this P stable distribution. I said the amount of probability mass
between minus one and one is a half. I shrunk it by epsilon on both sides, so
I'm going to lose theta epsilon mass. Similarly, if I stretch it by epsilon on both
sides, I'll gain theta epsilon mass. Now what's this telling me? This is telling me
if I take enough trials -- what's a trial? A trial is a row of this sketch matrix, right?
If I imagine these are my Q sub Is and this is the sum over I of Q sub I times
X sub I, and I'm doing a lot of
of different trials, so what this is saying is if I take R trials, then the fraction -- in
expectation, the fraction of trials that are less than 1 minus epsilon times the LP
norm is less than half of the trials.
And the fraction of trials that are less than 1 plus epsilon times the LP norm,
that's more than half the trials. So if everything were going according to what
these expectations are telling me, like imagine that these weren't just
expectations but this is like what happened, then the median would be between 1
minus epsilon and 1 plus epsilon times the LP norm. So all that I need to happen
is that, like in fact less than one-half are less than 1 minus epsilon times LP norm
etcetera. And if I take enough trials that will happen just by concentration. And
the kind of concentration you need is guaranteed by Chebyshev. And already I'm
telling you that, to analyze it with Chebyshev's inequality, the trials
need to be pairwise independent. That's why all rows need to be pairwise
independent. Within the row I'm using the fact that within a row I have full
independence because I'm saying this is a P stable random variable. And I know
that if I take this P stable random variable I know how much mass is where.
But if I only have K-wise independent QIs, then, for example, maybe a lot of
mass is in this really tiny epsilon interval, and I shrink my epsilon and something
crazy happens, I don't know. Or I guess in this case there's no mass there, and
then I don't lose the epsilon mass.
So the only thing I need to show to make sure that this algorithm works is I need
to show that the expectations -- I need the show these two expectations are
approximately still the same, to within epsilon, in the case of K-wise
independence.
Okay. So this is, I guess -- call it an invariance principle. I'm showing some
invariance. What's the expectation of an indicator function? I'm talking about
CDFs here. So I'm saying the CDF is close if you have full independence
versus K-wise independence.
So that's what I'm showing. Okay. Good. So how might you show that? Well,
you want to show that this expectation is similar under K-wise independence.
Well, we know K-wise independence preserves expectations of degree K
polynomials. That's what they're good for. Let me replace this indicator function
of an interval with a well-approximating low degree polynomial, maybe
sandwich it between two polynomials and use the fact that those are fooled, or
something like that.
Okay. So we're not going to do that for a good reason, which I'm going to tell you
soon. The good reason is that such polynomials don't exist because these P
stable distributions other than Gaussians have infinite variance. So like any
polynomial is either going to have -- it's either going to be like a constant function
or something or it's going to have infinite expectation. Okay. So there are no
such well-approximating low degree polynomials under the P stable measure.
So what we're going to do instead is well-approximate this indicator function of
an interval with some smooth function.
And then I'm going to show you that the smooth function -- or I'm going to show
you some steps; I don't have time to show you the whole thing. The smooth
function is fooled by K-wise independence under the P stable measure.
Okay. Okay. So I'm going to now talk to you about how to get this function.
Okay. Okay. So switching contexts I'm going to talk to you about FT
mollification, which I haven't defined yet but I'm about to define it.
I'm just curious, have people seen mollification before? Mollification? Okay.
Some have, some haven't. That's okay. I'm not assuming that anyone has seen
mollification, I'll tell you what it's all about.
So the idea is this. I have a function F which is not a smooth function. Okay.
Like, for example, it's the indicator function of an interval. It might not even be
continuous.
And I would like a smooth function which well-approximates F. And there are
these two facts out there. One is if you convolve a function F with a smooth
function G, then all of a sudden the convolution becomes smooth as long as like
these derivatives, as long as this integral exists.
Okay. And this is some identity: the Kth derivative of the convolution is F
convolved with the Kth derivative of G. And the next fact is that there's this
limit of functions, the Dirac delta function.
If you convolve F with the Dirac delta function you get F back.
So basically the idea is: convolve F with a smooth approximation of the Dirac
delta function. That's mollification. And mollification looks something like this.
You take a smooth function -- this is my G -- and there are smooth functions
out there which are only supported in some finite interval.
You take B sub C, which means shrink it inwards by a factor of C and make it
taller by a factor of C, and as C goes to infinity it looks like the Dirac delta
function. Convolve it with F and you get something smooth. That's
mollification.
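A small numerical illustration of plain mollification (my own discretized demo, not from the talk): convolving the indicator of [-1, 1] with a tall, narrow smooth bump yields a function that matches the indicator away from the endpoints.

```python
import numpy as np

dx = 0.001
t = np.arange(-3, 3, dx)
f = ((t >= -1) & (t <= 1)).astype(float)  # indicator of [-1, 1]

c = 20.0  # the tuning factor: the bump's support shrinks as 1/c
s = np.arange(-1, 1 + dx, dx)
u = c * s
bump = np.zeros_like(s)
inside = np.abs(u) < 1
bump[inside] = np.exp(-1.0 / (1.0 - u[inside] ** 2))  # smooth bump
bump /= bump.sum() * dx  # normalize to integrate to 1

smooth = np.convolve(f, bump, mode="same") * dx
print(smooth[np.abs(t) < 0.5].min())  # ~1 well inside the interval
```

As c grows the bump approaches the Dirac delta and `smooth` approaches the indicator, but, as discussed next, the derivative bounds degrade.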
Okay. So a problem with mollification -- problem meaning it's a problem for our
particular application -- is that the resulting function is a smooth function, but
just how smooth is it? What are the derivative bounds you get out of mollifying
things? If you use a function which has some finite support like this, you're
going to get derivative bounds which are like C to the K times K to the K for the
Kth derivative. And C, remember, is the tuning factor, where the larger I make
C, the more this looks like the Dirac delta function.
Okay. So what we do instead -- I said FT mollification -- is we'll take the same
B, and rather than convolve with this, we're going to convolve with its Fourier
transform.
So this is B. Its Fourier transform looks like this -- looks like a sinc function,
sort of. And we square it and we convolve with that.
Okay. And we're able to show -- so you can show two things. One is that you
get bounded derivatives. So I wrote K before; basically the Lth derivative is
like C to the L times 2 to the L, whereas before it was C to the L times L to the
L. And also you can write down some explicit bound on how close your indicator function
is at any given point to the smooth function.
And all this means is the larger you make C, the better of an approximation you
get which you expect. But you pay in your derivatives.
And also the farther away you are from the points of discontinuity, A and B, the
better approximation you get, which you also expect from it.
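Concretely, the two properties on the slide can be summarized as follows (up to constants; $\tilde I_{[a,b]}$ denotes the FT-mollified indicator of $[a,b]$, $c$ the tuning parameter, and $d(x)$ the distance from $x$ to the nearer endpoint):

```latex
\left\| \tilde I_{[a,b]}^{(\ell)} \right\|_{\infty} \le (2c)^{\ell}
\quad \text{for all } \ell \ge 0,
\qquad
\left| \tilde I_{[a,b]}(x) - I_{[a,b]}(x) \right|
\;\lesssim\; \min\!\left( 1,\; \frac{1}{c^{2}\, d(x)^{2}} \right).
```

So larger $c$ gives a better pointwise approximation, at the price of larger derivative bounds, and the approximation improves with distance from the discontinuities.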
And here's some plots. I'm making C bigger and bigger. That's a small C. Big
C. Okay. Okay. Back to moment estimation: this thing only matters, for this
talk, because I'm going to use it to prove what I said about K-wise
independence.
So how are you going to prove it? Okay. So suppose that R1 up to RN are
K-wise independent P stable random variables, versus fully independent ones.
And look at the sum over I of R sub I times X sub I. With full independence I
know this is a P stable random variable scaled by the LP norm of X. Let's just
say the LP norm of X is 1.
What I want to show is that this is true: this expectation is within O of epsilon of
that one. If I could show that for any A and B, that would give me what I said in
the beginning.
Okay. And I'm going to show it to you via a chain of three inequalities. First I'm
going to move from the original function to the smooth version, then move to
K-wise independence, and then move back to the original function.
Okay. Okay. So why is the first one true? Okay. So this difference is at most
the difference when I put the absolute values inside the expectation. And now I
just think of some conditioning.
So what is Q? You know, it's either within epsilon of the boundary points AB, or
its distance from the boundary points is between 2 to the S minus 1 epsilon and 2
to the S epsilon for some S.
So if it's within, if it's within epsilon of the boundary, well, that happens with
probability at most O of epsilon. Okay. And I know that these two functions are
never off by more than one from each other. So that contributes at most O
epsilon to the error.
And for the other case, where the distance is between 2 to the S minus 1 times
epsilon and 2 to the S times epsilon for some S, I loop over all S, I get this
thing, and summing the geometric series this is at most O of epsilon.
Okay. And then the next inequality is some technical lemma which says
K-wise independence fools bounded smooth functions. The last inequality is
very similar to the first. The only thing is you have to argue that under K-wise
independence it's still the case that my K-wise independent random variable
will not land in a small interval with large probability.
So now I'm going to move to showing you this one, the second inequality. Any
questions? So suppose I have a function F whose Lth derivative goes like
alpha to the L for some base alpha, and I have K-wise independence where K
is large enough both in terms of some error parameter and alpha. Then K-wise
independence fools F. That's what this says.
Question? Okay. Okay. So, okay, proof strategy. So this is the thing that is
not going to work: well, I have a smooth function, K-wise independence fools
polynomials, so I'll approximate the smooth function by a polynomial, which is
fooled exactly by K-wise independence, and bound the error using Taylor's
theorem. That's not going to work because, well, the moments that you get
from Taylor's theorem are infinite.
And also these polynomials, like they have infinite expectations. All right. So
that's the problem that we had in the very beginning. So what are we going to do
instead? Okay. So we just ask ourselves like why are the moments of this
distribution infinite? It's because it's a wide tailed distribution. Okay. So we'll kill
the tails of the distribution.
Okay. So I'm going to define some new truncated random variables, as
follows. And then I'm going to define some indicator random variable 1 sub S,
which says 1 sub S is 1 when S is exactly the set of indices that got too big and
I truncated them.
Okay. And now from here to here I'm just saying, you know, at any point in my
probability space exactly one of the 1 sub S is 1. Something got truncated --
maybe it's the empty set. Then I use linearity of expectation. Then I say, well,
if S is exactly the set of truncated indices, then for the things not in S, R sub I
prime is still R sub I, so this is an equality. And now I want to say why this
should look like some kind of progress.
Okay. So the reason this looks something like progress is as follows: So wishful
thinking. Pretend there's no 1 sub S here. Now, there are two problems we
have. One is that we don't have full independence, we only have K-wise
independence. The other is, even if we had full independence, then our
moments are infinite. If we try to tailor expand and bound errors we get infinite
stuff.
So for the -- I'm claiming that this should look like progress in terms of the
second problem. So imagine we had all the independence in the world and
1 sub S is not there. What I want to do is something like condition on all the
R sub Is for I in S, okay? And Taylor expand about this point, and now my
error term depends on moments of this random variable, and this has finite
moments, because R sub I prime has no tail.
Okay. Okay. But I also have the problem that I don't have fully independent
random variables. Okay. So I'm going to define new indicator random variables.
1 prime sub S says that S is a subset of the truncated R sub I primes. And
now inclusion/exclusion: S being exactly the set of truncated things means S is
a subset of it and nothing outside S is truncated. So this is
inclusion/exclusion, and I'm just going to plug this in for 1 sub S on the last
slide.
So this is exactly what was on the last slide with that plugged in. And I claim
this looks like even more progress. Okay. And the reason it looks like even
more progress is: now, if I only have K-wise independence, suppose S union T
is small -- suppose S union T has size at most K over 2. Then I can condition
on all the R sub Is for I in S union T. That determines this, and it determines
this. And I can Taylor expand about this point, looking at the K over 2 moment,
using the remaining K over 2 independence to look at moments of this.
Okay. Of course the problem is that not all the S union Ts are small. Okay. So
the last thing that you do is you just ask yourself, well, how much error do I
accumulate by just not including the sum over the large sets? Okay?
So this is the approximate inclusion/exclusion. And what you show --
unfortunately this part just becomes a series of technical calculations, so I'm
not going to dwell on it. But basically what you can show is you don't lose
much by doing this.
So that's the idea. By the way, how long do I have? Another 20 minutes. And
the very last step -- so remember that was inequality two, moving from full
independence to K-wise independence. Now the very last inequality, I said,
was similar to the first but needs an anti-concentration argument. And maybe,
if someone really wants to know, at the end of the talk I can come back to it.
But I'm just going to move on to some other stuff.
Okay. So what did we talk about? So norm estimation and moment
estimation, they're the same thing up to raising things to the 1 over P power.
We showed that moment or norm estimation can be done in big O of this
space. I didn't really talk about the lower bound.
And the main idea was just reducing the proof to showing this invariance
principle for P stable things. Okay. So any questions about what we
just showed? Great. Okay. So there's some other applications of FT
mollification which I'm going to briefly sketch.
One is pseudo random generators for polynomial threshold functions. And there
so the polynomial threshold function is a function F which I can write as the sign
of some polynomial. And it was shown by Diakonikolas, et al. that if you look at
let's say X being from the hypercube, okay, if you look at X being random from
the hypercube, uniformly at random versus Y be having K-wise independent
entries, then as long as K is at least roughly 1 over epsilon squared, this sine
function is fooled. Okay.
And the proof outline was well my degree 1 polynomial P looks like some linear
form minus a constant. And they have a reduction which I won't talk about, which
says we only need to consider the case where none of the AIs are too big in
magnitude. And they show the theorem for this regular case.
Okay. And so with FT mollification you can show that in this regular case you
need 1 over epsilon squared. You don't even need these logs. It's basically the
same proof that I showed you.
>>: P of X is polynomial.
>> Jelani Nelson: P of X is polynomial. And they looked specifically at the
degree 1 case. Yeah. So they showed for degree 1, when P is degree 1, you
only need one over epsilon squared wise independence, roughly. And
basically -- this would look like what I just showed you, because what is the sign of something? What does the sign function look like? So this is the sign function; this is my F. And they're looking at F of the sum of A_I X_I, right? And this really looks a lot like an indicator function: it's minus 1 here and 1 there. Up to shifting and scaling it's the indicator of [theta, infinity): this thing is 1 when the sum over I of A_I X_I is in [theta, infinity). So I just need to look at the indicator function of [theta, infinity). Right. Okay. And in fact it's even a bit simpler than
what I just showed you, because so let's define these random variables. I'm
going to use the same chain of inequalities. The first inequality is the same exact
thing, except now use the fact that you're in this regular case where no AI is too
large, because you need to show that like X is anti-concentrated. There's not too
large a probability that you're near the boundary. Okay. But anyway this thing is
basically exactly the same.
The second step is just Taylor's theorem. You remember I said in the P stable
case we can't just Taylor expand because moments are infinite. But now we're
working over a hypercube so things are finite. Use Taylor's theorem and that's it.
And the last step is the same as last time. Basically the same proof I just
showed you.
And, in fact, you can also use FT mollification to show that polynomial [inaudible]
fools the degree 2 case, and recently Daniel showed how to do it for any
constant degree but only under Gaussian measure, not over the hypercube.
Okay. I'll show you another thing you can do by FT mollifying. So Jackson's
theorem. Some theorem in approximation theory. So what is approximation
theory about? It's about approximating functions by simpler functions. And one
example of a simpler function is a low degree polynomial. So one natural
question is I have some function F and I want to ask myself, on some set S, what's the best approximation I can get by a degree-K polynomial, in L infinity or something.
So here's a natural question. If I give you K, what's the best epsilon you can get?
Jackson's theorem says this. It says that if S is, say, the unit interval and I define
this modulus of continuity which just says if two points are within delta how much
can F differ by the evaluation on those points, then this is the epsilon that
Jackson shows you can get.
Okay. And basically you can prove this just by FT mollifying F and Taylor
expanding to K and that's the polynomial.
So I'm not going to -- you don't have to dwell on the details, but basically I want
to show you it's the exact proof you just saw.
So first I'm going to take this function G which is basically F. And then I'm going to FT mollify it. And now what is G of X minus its FT mollification at X? I can just write it out. Ignore this M. Sorry. And now, okay, one thing that I should say is, what's one reason FT mollification works? I'm convolving with the square of the Fourier transform of some function. And I'm going to normalize the square so its integral is 1. So now that I have this function which is nonnegative that I'm convolving with, and it integrates to 1, I can view it as a density function. And
then I can say, well, actually this integral is really an expectation under that
density. Okay. And then now I can condition on this random variable Y where is
it in that density and I can write something similar to what I wrote last time. And
then I can use the second moment method.
So that's really what FT mollification is about -- why it works. And so then you say, well, G is close to its FT mollification, the FT mollification is close to a polynomial by Taylor's theorem, and you balance the errors, which I won't do in front of you. It's something you can go home and calculate. That's basically the idea and you get the bound.
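Here is a numerical sketch of that picture, with illustrative parameters not from the talk: convolve an indicator function with the normalized square of the Fourier transform of an interval, a Fejer-type kernel. Because the kernel is nonnegative and integrates to 1 it acts as a probability density, so the mollification is an expectation and stays between 0 and 1; because the kernel is bandlimited to [-c, c], all derivatives of the mollification are under control.

```python
import numpy as np

dx = 0.001
x = np.arange(-1, 1, dx)
c = 100.0                                   # cutoff; the mollification is bandlimited to [-c, c]
kernel = np.sinc(c * x / (2 * np.pi)) ** 2  # (sin(cx/2) / (cx/2))^2, a Fejer-type kernel
kernel /= kernel.sum() * dx                 # normalize so it integrates to 1: a density

ind = (np.abs(x) <= 0.2).astype(float)      # the indicator being mollified
moll = np.convolve(ind, kernel, mode="same") * dx

# Nonnegative, unit-mass kernel => the mollification is an expectation, so it stays
# in [0, 1], is close to 1 well inside the interval, and close to 0 well outside it.
assert moll.min() >= 0 and moll.max() <= 1 + 1e-9
assert moll[np.abs(x) < 0.1].min() > 0.85
assert moll[np.abs(x) > 0.4].max() < 0.1
```

Viewing the convolution integral as an expectation over that density is exactly the move that lets the second moment method be applied, as described above.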
And I'm just going to conclude. So I showed you space optimal moment estimation in data streams and some other applications of FT mollification. And the algorithm I just
showed you, the update time, the time it would take to process every update in
the stream would be like 1 over epsilon squared. So one question is can you get
a space optimal algorithm which has constant update time and it's known that
that is possible when P is 2. And for other P, in some newer work we're able to
get it down to at least being poly log as opposed to poly, but it's still not constant.
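For the P = 2 case mentioned here, the classic linear sketch is the AMS estimator. Below is a minimal offline sketch of it, not the talk's algorithm: it assumes fully random signs instead of the 4-wise independent hash functions a real streaming implementation would use, and the vector, repetition counts, and seed are made-up parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(-10, 11, size=1000).astype(float)   # the aggregated stream vector
true_f2 = float(np.sum(x ** 2))

# AMS: Z = <s, x> for random signs s has E[Z^2] = F2. In a stream, an update (i, v)
# just adds v * s[i] to each counter Z -- constant work per counter per update.
# Median-of-means over independent copies controls the variance.
def ams_f2(x, reps=9, per=200):
    means = []
    for _ in range(reps):
        signs = rng.choice([-1.0, 1.0], size=(per, len(x)))
        means.append(np.mean((signs @ x) ** 2))
    return float(np.median(means))

est = ams_f2(x)
print(est / true_f2)   # concentrates around 1
```

The update-time question in the talk is about how much of this per-update work can be collapsed to a constant for general P, not just P = 2.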
Another thing is F2 estimation -- that's the same thing as L2 estimation. Maintaining a linear sketch to estimate the L2 norm is very related to Johnson-Lindenstrauss. Can you get something better for Johnson-Lindenstrauss? That's another question. And the other one is finding more applications of FT mollification.
So if anyone has questions, I'll --
[applause]
>>: Can you say a bit about [inaudible].
>> Jelani Nelson: Actually I have some slides on that. Okay. So agnostic
learning. So I have a domain and a distribution on the domain with like labels.
And so I have a concept class, which is a set of functions mapping domain to labels. And, okay, the optimal value for solving this learning problem is to come up with the best function in my concept class which minimizes like the probability of error. And I want to use as few examples as possible to get within epsilon of optimal. So what do I get? So I can get samples from this distribution, domain items together with labels, and I want as few samples as possible to learn something good. Okay. And so there was some work which basically says
that if you can well-approximate every function in your concept class by a low
degree polynomial, then you can agnostically learn. Okay. And so actually the
work was called, the paper was called agnostic learning of half spaces or
something. And what's a half space? A half space is a degree one polynomial.
And when I showed you this like K-wise independence fools degree one
polynomial basically what are you doing in there? You're FT mollifying and
Taylor expanding gives you a polynomial which well approximates these half
spaces.
So that's the connection. So they show that if you want to well approximate these half spaces in the L1 norm to error epsilon, you would need degree 1 over epsilon to the fourth. With FT mollification you can get 1 over epsilon squared. That's the connection.
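A toy illustration of the low-degree approach, not from the paper: fit a low-degree polynomial to corrupted halfspace labels and output its sign. Least squares stands in here for the L1 polynomial regression the actual algorithm uses, and the threshold, degree, grid, and noise pattern are all arbitrary choices for the sketch.

```python
import numpy as np

# Deterministic toy data: the true concept is the halfspace sign(x - 0.305) on [-1, 1],
# with a fixed 10% of the training labels flipped.
x = np.linspace(-1, 1, 201)
y_true = np.sign(x - 0.305)
y = y_true.copy()
y[::10] *= -1                     # corrupt every tenth label

# Low-degree algorithm: polynomial regression on the noisy labels, then threshold.
coeffs = np.polyfit(x, y, deg=7)
pred = np.sign(np.polyval(coeffs, x))
acc = float(np.mean(pred == y_true))
print(acc)   # well above chance despite the corrupted labels
```

The guarantee behind this is exactly the approximation statement above: because a degree-d polynomial can approximate the halfspace well, regression plus thresholding cannot be much worse than the best halfspace.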
Any other questions?
>>: Can you say something about the anti-concentration?
>> Jelani Nelson: Oh, yeah, sure.
What do I want to show? I want to show if I have a point -- I want to show that
the expectation of the indicator function of minus epsilon-epsilon evaluated at R
is still O of epsilon. Right? Okay. So what I'm going to do is come up with a
function which looks like this. So this is G. I want the property that G is greater
than or equal to I sub minus epsilon-epsilon everywhere. I also want the property
that the expectation of G of Q, this is the fully independent version, is O of
epsilon. I want these two properties.
I also want that I have good derivative bounds on G. Okay. Now, if I had these
things, then what would I say? I would say, well, I know that this is true. I know
that the expectation of G of R, upper bounds, the expectation of the indicator
function of minus epsilon-epsilon are, right? And I know that since the
expectation of G of Q is O of epsilon and G has good derivative bounds, I can
apply that smooth function lemma to say that the expectation of G of R is still O
of epsilon and that would give me that this is O of epsilon.
So I just need to show you that such a function G exists. And such a function G does exist, and basically you can take G to be two times the FT mollification of the indicator of minus 2 epsilon to 2 epsilon, with cutoff C where C is something like 1 over epsilon.
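Reading the construction as G equal to two times the FT mollification of the indicator of [-2 epsilon, 2 epsilon] with cutoff on the order of 1 over epsilon (my reading of the board; the constants and grid below are illustrative), a quick numerical check of the domination property:

```python
import numpy as np

eps, dx = 0.1, 0.001
x = np.arange(-1, 1, dx)
c = 10 / eps                                  # cutoff on the order of 1/eps
kernel = np.sinc(c * x / (2 * np.pi)) ** 2    # Fejer-type mollifier, as in the talk
kernel /= kernel.sum() * dx                   # normalize to a density

wide = (np.abs(x) <= 2 * eps).astype(float)   # indicator of [-2 eps, 2 eps]
G = 2 * np.convolve(wide, kernel, mode="same") * dx

ind = (np.abs(x) <= eps).astype(float)        # indicator of [-eps, eps]
assert np.all(G >= ind)                       # G dominates the indicator on the whole grid
assert G.max() <= 2 + 1e-9                    # and G never exceeds 2
```

Intuitively the factor of two buys slack: the mollification of the wider indicator is at least one half everywhere on [-epsilon, epsilon], so doubling it clears the indicator, while G keeps the derivative bounds needed for the smooth function lemma.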
Yeah.
>>: What kind of bounds do you need per --
>> Jelani Nelson: It's going to be something like O of 1 over epsilon, to the L.
So the --
>>: Could you use a Gaussian for G, or would that --
>> Jelani Nelson: I think Gaussians would give you L to the L derivative bounds. E to the minus X squared -- you'd get L to the L derivatives.
So I think there's like a theorem in complex analysis which says that if you want a function to have Lth derivative bounded by alpha to the L for all L, then its Fourier transform should only be supported in an interval like minus alpha to alpha. So Gaussians would not satisfy this. This is why you have to mollify: by convolving with the Fourier transform of some function of compact support, you're killing the tail.
>> Yuval Peres: Any other questions? Let's thank Jelani.
[applause]