>> Dengyong Zhou: I'm glad to introduce Antony here. Antony is a Ph.D. student from the Department of Statistics at Yale University. Today he'll be talking about achieving information-theoretic limits in regression. Antony. >> Antony Joseph: Hello. So thank you, Dengyong. As you said, my talk is on achieving information-theoretic limits in high-dimensional regression. Before I talk about these information-theoretic limits and how we go about achieving them, let me describe the general framework of high-dimensional regression and give you some examples just to get you motivated. I'm sure almost all of you are familiar with this. You have a linear model, Y equals X beta plus epsilon, where the difference from the classical linear model is that the number of columns of X, which is the dimension, is typically much larger than the number of rows, which is the sample size. Under this assumption, if you want to say something meaningful about beta, you need a sparsity assumption on it. The most common sparsity assumption is that beta has, say, L nonzero entries, where L is typically much smaller than the dimension. >>: Does the assumption have to be sparsity, or could it be some other strong prior assumption? >> Antony Joseph: Yes, you could have other assumptions also -- for example, that beta is contained in a certain L1 ball, or something like that. Definitely. But this is the most common assumption, and my talk is going to focus on this assumption and the information-theoretic limits under it. Yes, you're right. So perhaps the most important problem under this assumption is feature selection: recovering the positions of the nonzeros in beta. And this has been an area of a lot of active interest nowadays. For example, it's got applications in biology, where you're identifying the locations in a gene responsible for a disease; graphical model selection, where you want to estimate a sparse graph, which is also like sparse [inaudible] estimation; and compressed sensing, where you have to recover a signal from relatively few measurements. So before I start the main part of my talk, let me mention two examples that I came across recently. One is this example in face recognition. Here the situation is that you have a number of people -- say 10,000 people -- and you have a number of images of each person. For example, the first person has 16 images and the second person has 12 images. These images are different: they vary in illumination and in the gestures of the person. The goal of the face recognition device is, when a person comes in front of it, to detect whether that person is in the database and, if so, which person it is. This can be formulated as a high-dimensional regression problem. In this case each column of the X matrix would correspond to an image of a person -- so the first person would have 16 columns, the second person 12 columns, et cetera. When a person stands in front of the device, it's like providing the device with a Y variable, and the assumption is that Y is a sparse linear combination of the columns of the X matrix. Since this image is of the second person but is not exactly the same -- it probably differs in certain aspects -- you would expect Y to be approximately a linear combination of the columns corresponding to the second person.
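[Illustration, not from the talk: a minimal Python sketch of the sparse linear model just described -- an IID Gaussian design X with many more columns than rows, a coefficient vector beta with L nonzeros of magnitude at least W, and Y = X beta + epsilon. All names and sizes below are made up for the example.]

```python
# A minimal sketch of the sparse high-dimensional linear model described above.
# All names and sizes here are illustrative, not from the talk.
import numpy as np

rng = np.random.default_rng(0)

n, N, L, W = 200, 2000, 10, 0.5          # sample size, dimension, sparsity, min magnitude
X = rng.standard_normal((n, N))          # IID Normal(0,1) design, n << N
beta = np.zeros(N)
support = rng.choice(N, size=L, replace=False)
beta[support] = W                        # nonzeros of magnitude W (the A* case below)
eps = rng.standard_normal(n)             # Normal(0,1) noise
Y = X @ beta + eps                       # the observed response

print(sorted(support))                   # feature selection = recovering this set
```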
So this is one example. Another example, which I actually found only recently, is this method of output coding for multi-label prediction. The reason I bring this up is that I got started being interested in high-dimensional regression because of a coding problem, which I'll tell you about shortly, and this has relations to that. Here the scenario is that you've got a number of documents, say one up to a thousand, where each document has a few labels from a large label set -- let's say a label set of a thousand labels. These documents could be images also; I think in their paper they considered images from the ESP game dataset. So, for example, for Document 1 you have this large label vector, with most entries zero and a one wherever the label is present in the document. Now, the training data consists of these label vectors along with other explanatory variables for these documents. Because this label vector is very large, it may not be a good idea to fit a model directly to the label vectors. So what they do is encode these label vectors -- here beta is the label vector -- by multiplying by a matrix A of, say, random Gaussian entries, to get a much lower dimensional vector Y. Then instead of fitting the model to this huge label vector you fit it to the Y vector, which is much lower dimensional. And when you have a new document, you first predict its Y vector and then use a reconstruction algorithm to get the original label vector back. This wouldn't be a good idea when the labels have a certain hierarchy; but if all you know is sparsity and nothing else, then this might be a good idea. In fact, if you know something more about the sparsity -- for example that the sparsity appears in groups -- there are ways to include that in the reconstruction algorithm.
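[Illustration, not from the talk: a rough sketch of the output-coding idea described above. The label vectors are compressed with a random Gaussian matrix; the regression model and the sparse reconstruction step are left as placeholders, since the talk does not specify them -- any sparse recovery method, such as the greedy decoder sketched later in this transcript, could be plugged in.]

```python
# Sketch of output coding for multi-label prediction, as described above.
# The regression step and `sparse reconstruction` step are hypothetical
# placeholders: any multi-output regressor and any sparse recovery algorithm
# (e.g. a greedy decoder like the one sketched later in this transcript) fit here.
import numpy as np

def encode_labels(label_vectors, m, rng):
    """Compress 0/1 label vectors (documents x labels) to m-dim targets Y = beta @ A.T."""
    n_labels = label_vectors.shape[1]
    A = rng.standard_normal((m, n_labels))   # random Gaussian encoding matrix
    Y = label_vectors @ A.T                  # each row: compressed label vector
    return A, Y

rng = np.random.default_rng(0)
labels = np.zeros((1000, 1000))
labels[0, [3, 17, 42]] = 1.0                 # toy sparse label vector for document 1
A, Y = encode_labels(labels, m=100, rng=rng)
# Train regressors to predict rows of Y from document features; for a new
# document, predict y_hat and recover its label vector by solving
# y_hat ~= A @ beta with beta sparse (plug in any sparse reconstruction method).
```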
So these are examples of high-dimensional regression. The questions I was interested in were, first, the relationships between sample size, dimension, sparsity, and signal-to-noise ratio for accurate feature selection; and, second, the way these relationships change when one allows a small number of mistakes. By mistakes I mean both false alarms and failed detections -- false positives and false negatives. This is actually important, because under the sparsity assumption that there are L nonzero entries, some of the nonzero entries may be relatively small. So requiring that you recover all of them may be too stringent a criterion. You want to be more flexible, and then you want to see how these relationships change when you allow that flexibility. But the main part of my talk will be on the first part. >>: The signal-to-noise ratio? >> Antony Joseph: Yes, the signal-to-noise ratio -- you can define it through the L2 norm of X beta. That will become clear in my context soon. So let me give you the outline of my talk. I'll first talk about information-theoretic limits of sparse recovery. Then I'll talk about this communication problem, which I'll describe only briefly, but which was the main reason I got interested in the information-theoretic limits. Then I'll provide a discussion of practical approaches -- for example, greedy algorithms -- discussing how close existing practical approaches get to these information-theoretic limits. And third, I'll discuss the performance of an iterative algorithm that we propose, along with a little bit of the theoretical analysis, actually demonstrating that one can reach these information-theoretic limits. So let me start by talking about information-theoretic limits of sparse recovery. Assume you have this linear model, and assume the entries of X are IID with, say, variance one, and epsilon has IID Normal(0, 1) entries -- in general it would be Normal(0, sigma squared), but that's just a scaling, so let's assume Normal(0, 1). And like I said, the coefficient vector beta is sparse. More specifically, assume the coefficient vector belongs to a set A, where A is the set of betas with L nonzeros, the nonzeros having magnitude at least W. So the nonzeros have a minimum magnitude. >>: You have exactly L nonzeros, and all of them are at least of that significant magnitude? >> Antony Joseph: Right, of magnitude at least W. >>: [inaudible] everything is [inaudible]. >> Antony Joseph: Yes. >>: Every individual entry of X. >> Antony Joseph: Right. So the thing is that here we are worried about information-theoretic limits. More specifically, we are interested in lower bounds on the sample size n needed to detect the nonzeros of any beta in A. Now, you're right that this is a strong assumption, but the lower bound would hold for a broad class of matrices. In fact, IID matrices are essentially the best kind of matrices you can have -- the ones for which recovery is possible with the lowest sample size -- so if recovery is impossible there, it is impossible for other matrices as well. >>: So feature selection is kind of an ill-posed problem. In a problem in, say, biology, suppose you have two features which are simply identical. >> Antony Joseph: Right. >>: Would you recognize both of them, or one of them? >> Antony Joseph: Then you would be taking a linear combination of them. >>: But it might just pick one of them, or it might take a combination of the two -- anything could happen. >> Antony Joseph: Right. So in a real-life dataset, as you say, you don't have this IID assumption. In our case, with the randomness in X, that situation wouldn't happen -- those two features would not be identical. But you're right, in real-life datasets you typically don't have that scenario. So the question is, how do we arrive at such a lower bound? What is the lower bound? What statisticians do when trying to find a lower bound for support recovery over the set A is reduce the problem to a testing problem. What do I mean? You consider the set A star, a subset of A, which is the set of betas in A with the nonzero values exactly equal to W. So the beta vector looks like that: there are L nonzero entries, and each of the nonzero entries has value W. The cardinality of A star is N choose L. An algorithm that can detect elements in A can also detect elements in A star, simply because A star is a subset of A -- A is a much larger set. The advantage of doing this is that we are converting the original problem of recovery over the set A into a testing problem.
So an algorithm that has to detect in the set A star is basically doing a test: it has to decide which of these N choose L possibilities for beta to select -- or, rather, it's doing a kind of classification. And the advantage of reducing it to a testing problem is that there are information-theoretic inequalities that give lower bounds on the probability of error in testing problems. What do I mean? >>: Go back. So it's not necessarily true that if you solve for A star you're going to get the same set of [inaudible], right? If you optimize over the full A, you may get a different set -- it may fit better by choosing some of these to be zero [inaudible]. Or is that true? You solve this for the subproblem and you find a certain set of nonzero values. >> Antony Joseph: Right. >>: If you went and solved, with some magic box, the full problem, are you guaranteed to get the same set of nonzero values -- not the same values, but the same set? The same features [inaudible]? >>: No, it's a lower bound. >> Antony Joseph: So the point here is that a lower bound for detection in the set A star is also a lower bound for detection in the set A, because detecting in the set A is a harder problem. But you're right, you cannot automatically generalize from A star to A. Since we are after a lower bound, though, a lower bound for A star is also a lower bound for the much larger set A. >>: [inaudible] bounds. >> Antony Joseph: What's that? >>: You're not trying to solve the problem, you're just -- >> Antony Joseph: Right, right. So as I said, there are information-theoretic inequalities providing lower bounds on the probability of error. What do I mean? The signal-to-noise ratio here, since the entries of X are IID with variance one, is simply the squared norm of the vector beta; in this case it's L times W squared. And an important quantity that appears in these lower bounds is the capacity, which is C equals one half log of one plus SNR. >>: How does this change if the errors don't have unit variance? >> Antony Joseph: If the errors don't have unit variance, then the SNR would be divided by the variance of the errors. Since that's just a scaling, I took the noise to have unit variance. So the lower bound on the sample size for detection in A star is actually a consequence of a very famous theorem in coding, which is Shannon's coding theorem. It's not really meant for regression problems, but it can be readily applied to our situation. It basically says that to be able to detect any beta from A star, one requires the sample size n to be at least the log of the cardinality of A star divided by C. Let me write that down: n should be at least log of the cardinality of A star divided by C. Since A star is a simpler problem than A, the same lower bound holds for detection in A. But we are actually interested in detection in the set A star, not A, for reasons I'll explain soon enough. Let me just mention that Shannon's coding theorem was not meant for such applications, and there are stronger lower bounds on the sample size n that have recently been proved specifically for this setup by these researchers. But those bounds agree with the Shannon lower bound in the regime where the sparsity over the dimension, L over capital N, is small. So in this regime, the Shannon lower bound is still the strongest one there is.
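[Illustration, not from the talk: the Shannon-style lower bound just stated, written out as a small Python helper. It assumes the A-star setup above -- L nonzeros each equal to W, unit-variance IID design -- so that SNR = L W^2 / sigma^2 and C = (1/2) log(1 + SNR); the numbers in the example call are arbitrary.]

```python
# A minimal sketch of the Shannon-style lower bound discussed above: to detect
# any beta in A*, the sample size n must be at least log|A*| / C.
from math import comb, log

def shannon_lower_bound(N, L, W, sigma2=1.0):
    """Lower bound on the sample size n for support recovery over A*.

    A* = coefficient vectors with exactly L nonzeros, each equal to W, so
    |A*| = C(N, L).  With IID unit-variance X and noise variance sigma2,
    SNR = L * W**2 / sigma2 and the capacity is C = 0.5 * log(1 + SNR) (nats).
    """
    snr = L * W**2 / sigma2
    C = 0.5 * log(1.0 + snr)
    return log(comb(N, L)) / C

# Example: N = 10_000 features, L = 100 nonzeros of magnitude W = 0.5.
print(round(shannon_lower_bound(10_000, 100, 0.5)))
```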
>>: A clarification -- can you go back a slide? >> Antony Joseph: Yes. >>: "Detect any beta" [inaudible] -- does that mean any beta from the set, or a particular one we're trying to detect? >> Antony Joseph: Any beta, right. So what this theorem says, more specifically, is: suppose you have some algorithm for detection in A star, and that algorithm gives you an estimate beta hat. What the theorem says is that the probability that beta hat is not equal to beta -- the average of this probability over all beta in A star -- is bounded away from zero if n is less than that quantity. And if the probability of error is bounded away from zero, then you cannot detect. So the Shannon bound is still the strongest you can get in this regime where the sparsity over the dimension is small. But Shannon's bound is a converse result: it tells you what you cannot do. It need not be the case that there is an algorithm that can actually achieve these limits. So what I mean by "achieve" is: can one find a practical algorithm that can detect any element in A star with n near this quantity, log of the cardinality of A star over C? >>: Isn't the variance of the noise, in typical algorithms, a controlled parameter? You could play tricks with the bound by basically turning the noise model up or down. >> Antony Joseph: No, no, the noise here is normal with variance one, right? >>: But when one is actually solving a regression problem, they would very often vary parameters that change the parameterization of the noise -- you could choose to model higher or lower noise. >> Antony Joseph: Yes, but that scales everything. The lower bound is -- >>: But in the equation you were showing, the signal-to-noise ratio would scale with the noise variance, right? >> Antony Joseph: Yes, yes. If the noise variance is small then the signal-to-noise ratio is higher, and then the lower bound would be smaller, right? >>: Okay. >> Antony Joseph: Yes. >>: So in other words, if the actual variance of the noise varies, the bound would scale correspondingly -- log-linearly, or -- >> Antony Joseph: In general, if there is a certain variance for the noise, the SNR would be the squared norm of beta divided by the noise variance. >>: And it would come into the log in the denominator? >> Antony Joseph: Yes, it would come into the log in the denominator. So, can one actually find a practical algorithm that can detect elements in A star with n near this quantity? Go ahead. >>: Since you described a generative model, wouldn't the MAP estimator give you exactly the result you're looking for? >> Antony Joseph: I'll come to the MAP estimator soon. But no -- this lower bound, as I said, is a converse result. It was not obtained by taking the MAP estimator and seeing what sample size it needs; it was not done by analyzing any algorithm. It comes from information-theoretic inequalities for testing problems. But yes, that is the best estimator for this problem, and it's also important to analyze it; I'll talk about that soon. >>: I won't say too much, I'll stop, but back to the theorem. Isn't that a probabilistic statement?
You could detect a beta by chance with a smaller n, right? >> Antony Joseph: So I mean, this probability is averaged over all beta, right? >>: Even if your guess is three, there is some chance that the answer is three. But your probability of guessing wrong, regardless of what method you use, is bounded away from zero by a constant. That's what detection means here. >> Antony Joseph: Okay, thanks. So that's the question. There's another way of phrasing this question. Like I said, n should be at least that quantity, and we want to see whether there's a practical procedure that can make n close to it -- I'll tell you why we're interested in that soon enough. If you define the rate R, which is the log of the cardinality of A star divided by n, then by Shannon's theorem this rate cannot be greater than C if you want to detect the elements in A star. So another way of phrasing the question is: devise a practical algorithm that detects elements in A star with R near C -- rate near the capacity. So let me tell you why we are interested in detection with sample size n near log of the cardinality of A star over C. It's because of a communication problem, which we formulate as a regression problem. The communication problem is as follows. In communication, you want to send messages reliably through a noisy medium, also known as a channel. We assume that the set of messages is A star, so a message corresponds to a coefficient vector in A star. Now, there's quite a bit of story that I'm hiding here. When you're talking on a cell phone you're not actually sending coefficient vectors, you're speaking. But what's actually done is that time is divided into blocks, whatever you speak is digitized -- meaning turned into binary strings -- and those binary strings are mapped to coefficient vectors in A star. That's the story in the background; let's just assume the set of messages is A star. And as I said, you transmit through a noisy medium, and the noisy medium adds Gaussian noise to what is sent. Let me be more specific. Suppose a sender wants to transmit a vector beta in A star. What he does is encode beta using a matrix X of IID normal entries -- this is where the relationship with the output coding approach for multi-label prediction comes in. So you encode beta as X beta. The receiver gets Y, which is a noisy version of X beta; epsilon is the noise added by the medium, and since we're assuming the channel is the Gaussian channel, the noise is normal with variance one. The receiver's goal is to recover the correct beta in A star from the knowledge of Y and X -- so it's a regression problem. The reason you want to make the sample size n as small as possible is that you want to minimize the number of transmissions: the sender transmits X beta by sending its first entry, then its second entry, and so on up to the nth entry, and you don't want the number of transmissions to be large.
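[Illustration, not from the talk: a small simulation of the communication framing just described -- a message beta in A star, the codeword X beta sent entry by entry through a unit-variance Gaussian channel, and the rate R = log|A star| / n compared against the capacity C. The sizes are arbitrary.]

```python
# Sketch of the regression-as-communication setup described above.
import numpy as np
from math import comb, log

rng = np.random.default_rng(1)
N, L, W, n = 2000, 10, 0.5, 300

beta = np.zeros(N)                            # the message: an element of A*
beta[rng.choice(N, size=L, replace=False)] = W
X = rng.standard_normal((n, N))               # encoder: IID Normal(0,1) codebook
Y = X @ beta + rng.standard_normal(n)         # n channel uses, each adds N(0,1) noise

R = log(comb(N, L)) / n                       # rate in nats per channel use
C = 0.5 * log(1 + L * W**2)                   # Gaussian channel capacity
print(f"rate R = {R:.3f}, capacity C = {C:.3f}, reliable decoding needs R < C")
```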
So now let me discuss feasible strategies for solving this problem. One class of strategies is greedy algorithms, which is the approach we take. But let me first say that the best estimator, as you said, is the MAP estimator: you find the beta in A star which minimizes the L2 norm of Y minus X beta. This has the least average probability of error -- by average probability of error I mean precisely the quantity I defined before. However, it is not computationally feasible, because you cannot do an exhaustive search over all beta in A star; A star is a huge set. Even so, it's really important to analyze how this estimator performs. Because, like you mentioned, Shannon's theorem is a converse result -- it tells you what you cannot do. So even though this lower bound is there, it might be that there is no scheme at all, let alone a practical scheme, that achieves it. For that reason it's important to analyze this estimator first: if this estimator performs badly, you cannot even hope that any practical scheme performs well. And in our paper we show that this estimator actually performs quite well, so we have a shot with practical schemes. Okay. >>: So this result shows that the Shannon bound is actually a tight bound. >> Antony Joseph: Right. But, you see, this estimator is not computationally feasible. >>: But just [inaudible]. >> Antony Joseph: That's right, it's a tight bound. So, possible computationally feasible strategies: a very famous one, something that has received a lot of attention, is L1-penalized methods. We, like I said, use the technique of greedy approaches, so let me give you a bit more background on greedy approaches. The way these algorithms work is that they select one term per step, recompute the residuals, and repeat the process. That's the general, high-level way these things work. For example, one way of doing this would be: you start with an initial fit of zero. You update the residual vector, which is Y minus the fit from the previous step. You find the term j_k that maximizes the inner product with the residuals -- the feature which has the most correlation with the residual. You update the fit using the selected term j_k; I'll tell you ways you can update the fit on the next slide. And then you check whether a stopping criterion is met -- a stopping criterion would be whether the norm of this residual vector is small. If it's not met, you repeat. As for ways to update the fit: if you take the new fit to be the previous fit plus the newly selected term X_{j_k}, with the coefficient of that term being the actual inner product, you have the matching pursuit algorithm. If it's not exactly the inner product but a small fraction epsilon_k of that inner product, where epsilon_k is a small positive quantity, you get the forward stagewise algorithm. And if the new fit is the projection of Y onto the selected terms, you get the algorithm known as orthogonal matching pursuit. So there's a whole host of algorithms one can have -- in fact a whole host of ways you can fit the next term -- and that gives rise to a huge class of algorithms. Such algorithms have been used for neural nets, L2 boosting, learning with structured sparsity, learning in reproducing kernel Hilbert spaces, and feature selection, which is the thing we're interested in. In fact, I have done some work myself on feature selection using the orthogonal matching pursuit algorithm.
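[Illustration, not from the talk: the generic greedy loop just described, written as a short Python sketch with an orthogonal-matching-pursuit-style refit. Replacing the projection step with "previous fit plus the selected column times its inner product" (or a small fraction epsilon of it) would give matching pursuit or forward stagewise instead.]

```python
# Generic greedy selection sketch (orthogonal-matching-pursuit-style refit).
import numpy as np

def greedy_select(X, Y, max_terms, tol=1e-6):
    """Repeatedly pick the column with the largest |<X_j, residual>| and refit."""
    n, N = X.shape
    selected, fit = [], np.zeros(n)
    for _ in range(max_terms):
        residual = Y - fit
        if np.linalg.norm(residual) < tol:          # stopping criterion
            break
        scores = np.abs(X.T @ residual)
        if selected:
            scores[selected] = -np.inf              # never re-pick a selected term
        j = int(np.argmax(scores))
        selected.append(j)
        # OMP-style update: project Y onto the span of the selected columns.
        Xs = X[:, selected]
        coef, *_ = np.linalg.lstsq(Xs, Y, rcond=None)
        fit = Xs @ coef
    return selected
```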
So the results from this literature do imply that with a sample size of the order of log of N choose L -- remember, N choose L is the cardinality of the set A star -- you can recover the positions of the nonzeros for any beta in A star. These results do show that. However, we're interested in more than the order of magnitude; we're interested in the constant. Can you recover with n close to log of the cardinality of A star times one over C? >>: What are the constants from those results? >> Antony Joseph: The thing is, they don't really need to worry about the constant there. It could be nine, ten, it could be anything. The reason is that they were focused on generality of applications. We have a specific problem, where the X matrix has IID normal entries and the coefficient vector has the structure where the nonzeros are W. So the question is, can we leverage that to give stronger results than what exists in the current literature? >>: Naive question, but [inaudible] how do you know yours is stronger? I just don't see -- >> Antony Joseph: No, I mean, they say that with n of some constant -- >>: [inaudible]. >> Antony Joseph: Their point there isn't what the constant is; it's the order of magnitude. But, yeah. So let me describe our algorithm. It's actually a minor variant of these greedy algorithms. We have the situation where the nonzeros are W. Notice that if you denote by S the set of columns where the coefficient vector is nonzero, then Y is W times the sum of X_j over j in S, plus noise. So X_j is correlated with Y if and only if j belongs to S. For the first step, we consider the statistic Z_{1,j}, which is the inner product of X_j with Y divided by the norm of Y. The set of terms selected in the first step are those for which Z_{1,j} is greater than a positive threshold tau; I'll tell you what this threshold is soon. For a step k greater than or equal to 2, we consider terms that were not selected in previous steps. Any term selected in a previous step stays selected -- there's no going back and changing that. From the previous steps you get fit vectors, where the fit vector is the sum of the X_j over the set detected in that step, times the scalar W. We compute the residual and calculate the inner product statistic: the inner product of X_j with the residual, divided by the norm of the residual. You select those terms for which the statistic is greater than some threshold, and you stop whenever L terms have been selected or whenever no terms are above the threshold. Now, the thing about this algorithm is that, after some effort, we can actually characterize the distribution of these statistics Z_{k,j}. The distribution is as follows. Recall that S is the set of terms where beta is nonzero. For any j in S, Z_{k,j} is, with high probability, bounded from below by a shifted normal [inaudible]: mu_k plus W_{k,j}, where the W_{k,j} are IID Normal(0, 1) random variables. And for any j not in S, it's simply noise -- a Normal(0, 1) random variable. These means mu_k are greater than zero and they increase with the steps. So it's this mean that gives you a way of discriminating between the correct and the wrong terms.
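[Illustration, not from the talk's paper: a simplified stand-in for the thresholding decoder just described. It uses the inner-product statistic Z_{k,j} = <X_j, residual> / ||residual|| and a threshold of roughly sqrt(2 log(N - L)); the statistic actually analyzed in the paper is, as the speaker says later, a close variant of this.]

```python
# Sketch of the iterative thresholding decoder described above (simplified;
# the statistic analyzed in the paper is a close variant of this inner product).
import numpy as np

def threshold_decoder(X, Y, W, L, max_steps=20):
    n, N = X.shape
    tau = np.sqrt(2.0 * np.log(N - L))        # ~ max of N-L standard normals
    selected = set()
    for _ in range(max_steps):
        fit = W * X[:, sorted(selected)].sum(axis=1) if selected else np.zeros(n)
        residual = Y - fit
        Z = (X.T @ residual) / np.linalg.norm(residual)   # Z_{k,j} statistics
        new = {j for j in range(N)
               if j not in selected and Z[j] > tau}       # terms crossing threshold
        if not new or len(selected) >= L:                 # stop: nothing above
            break                                         # threshold, or L selected
        selected |= new                                   # selections are never undone
    return sorted(selected)
```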
One thing I would like to mention is that these results are nonasymptotic; that is where a lot of our effort went. For the first step it's actually easy to calculate the distribution of Z_1 -- it's not a big effort. But from the second step on, because there are terms detected in the first step, lots of dependencies arise, and actually getting such a characterization required a lot of effort. >>: What is L here? >> Antony Joseph: What is that? >>: What is L? >> Antony Joseph: The sparsity, the number of nonzero terms. So the error probability -- the probability with which this characterization fails to hold -- is exponentially small in L. A little more detail, since I'm actually running out of time. Let me just say how this threshold tau is selected. For any j not in S, the Z_{k,j} are IID Normal(0, 1), and since there are N minus L wrong terms, and since the maximum of N minus L standard normal random variables is roughly the square root of 2 log of N minus L, you would like to put the threshold at that value to minimize false positives. The mean mu_k has this form: it's a function of the fraction of correct detections up to the previous steps, where this function is increasing. What this gives us is that the total fraction of terms detected after step k has expectation lower bounded by g of u_{k-1}, where g is this function -- some function on [0, 1]. So here's the take-home message. Remember the rate R is log of the cardinality of A star divided by n. Suppose you take a rate of 0.45 times C and an SNR of 7. Then the function g looks like that on [0, 1], and the way our theory predicts the algorithm progresses is like this: the first point over here is the fraction of terms detected after the first step; you go up like that; the second point is the fraction detected after the second step, et cetera, and you keep going until this last point. So our theory predicts that you can actually get most of it -- 95 percent -- correct, with an error probability that is like 10 to the minus 4. That was for a rate of 0.45 times C. What if you try to increase the rate a bit more, say to 0.55 times C? Then the function g looks like this -- notice it's touching the y equals x line. So what our theory predicts is that the algorithm just has to stop over there, and you make a lot of mistakes: at least, our bound says you make 85 percent mistakes. The error probability is like 10 to the minus 5, but that doesn't matter because you're already making a lot of mistakes. It turns out that in the setup we are in, and using our algorithm, the rate R can only be made as high as this quantity R0, which is one half times SNR over one plus SNR. And a little bit of algebra shows that this quantity is less than the capacity C, so you really cannot drive the rate high. Notice that R0 increases to one half as the SNR goes to infinity, and in the asymptotic regime a similar threshold of recovery was noticed for the lasso algorithm: we show this threshold for finite SNR, and Wainwright noticed it as SNR tends to infinity. But the thing is, we need the rate to be as high as the capacity, and here, using this setup and our algorithm, we can only make it as high as R0.
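[Illustration, not from the talk: the progress analysis above viewed as a fixed-point iteration x_k = g(x_{k-1}) that keeps gaining while g stays above the line y = x. The particular g below is a made-up stand-in -- the true update function, which depends on the rate and the SNR, is on the speaker's slides.]

```python
# Fixed-point view of the algorithm's progress: x_k = g(x_{k-1}) until g(x) <= x.
# The particular g below is a hypothetical stand-in, NOT the function from the talk.
import math

def g_example(x):
    # Any increasing function on [0, 1]; progress continues while g(x) > x.
    return 1.0 - math.exp(-3.0 * (x + 0.05))

x, trajectory = 0.0, []
for _ in range(50):
    nxt = g_example(x)
    if nxt <= x + 1e-9:          # g has met the y = x line: the algorithm stalls here
        break
    x = nxt
    trajectory.append(round(x, 3))

print(trajectory)                 # fraction of correct detections after each step
```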
So how do we reach rates as high as the capacity? What we do is assume the dimension N to be of the form L times B, and we consider coefficient vectors of this form: the vector is divided into L sections, where each section has B terms, and there is exactly one nonzero in each section. And the sum of squares of the nonzeros is equal to the SNR. We consider coefficient vectors like that and take A star to be the set of all such vectors; the cardinality of this A star is now B to the power L. Notice that with this modified A star we again need to ensure that we can make the sample size as small as before -- A star can be any finite set; all we need to show is that the sample size can be made as small as the log of its cardinality over C. The nonzero values we consider are actually of this form: they decrease exponentially and then flatten out. Now you run the algorithm on this modified A star. And if you do that -- remember, previously I was showing that with rate 0.55 times C the algorithm was getting stuck -- now, with this modified A star, the algorithm actually detects most of the terms correctly. The way the algorithm progresses with this modified A star is as follows. On the x axis you have the section number. In the first step, since the weights are decreasing like that, the initial sections have a higher probability of detection, and since the weight decreases across sections, the probability of detection decreases like that. By the seventh step the picture improves dramatically; by the 13th step even more; and by the 19th step, which is the last step, almost all the sections have a high probability of detection. This can be summarized in a theorem. If you fix any rate less than C_B, where this quantity C_B tends to the capacity for large B, then our result can be stated as follows: for any rate less than C_B, the fraction of mistakes is of order 1 over log B -- if B tends to infinity this tends to 0 -- and the probability of error is exponentially small in L. This quantity delta is the difference between C_B and R. What this tells us is that in certain sparsity regimes it's actually possible to achieve the information-theoretic limits using a practical scheme -- a greedy approach -- with exponentially small error probability. This is something I didn't stress, but the existing results on high-dimensional regression, while more general in nature, do not have exponentially small error probability. One reason is that they focus on exact recovery of the nonzero terms, and recovering all the nonzeros exactly is too stringent a criterion -- you cannot get the error probability that small. Since we allow a small number of mistakes, we can get our error probability really small. >>: Is there a latent sparsity -- if you have log sparsity, would everything become really hard? >> Antony Joseph: What is log sparsity? >>: Basically, the number of relevant terms being linear in the total number of features versus the number of features you want to recover being logarithmic. For example, in other problems people will kind of flood the feature space with various transformations of the -- >> Antony Joseph: The number of nonzeros may not be linear; it needs to be sublinear. I mean -- >>: Because N equals L times B, right? What if, instead of a linear dependence on L, you had N equal to something like the log of L -- >> Antony Joseph: Oh, okay. Okay. >>: Log in the total number of -- >> Antony Joseph: I see, I see. So the number of -- I think it should still work. I mean, yes, I need to think about it a bit more. But, yeah, I don't see an obstacle -- I need to think about it a bit more, thanks.
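[Illustration, not from the talk: a sketch of the sectioned coefficient vectors (the modified A star) just described -- N = L times B coordinates, one nonzero per section of size B, nonzero values decaying across sections with their squares summing to the SNR. The exponential-then-flat decay used here is an assumed stand-in for the allocation shown on the slides.]

```python
# Sketch of the sectioned coefficient vectors (the modified A*) described above.
# The decay profile is an illustrative choice, not the exact one from the talk.
import numpy as np

def sectioned_beta(L, B, snr, rng, floor_frac=0.2):
    """N = L*B coordinates split into L sections of size B, one nonzero each.

    Section weights decay exponentially and then flatten out; they are scaled
    so that the sum of squares of the nonzeros equals the SNR.
    """
    decay = np.exp(-np.arange(L) / max(L / 4, 1))       # exponential decay ...
    decay = np.maximum(decay, floor_frac)               # ... flattening out at a floor
    weights = np.sqrt(snr * decay / decay.sum())        # sum of weights**2 == snr
    beta = np.zeros(L * B)
    picks = rng.integers(0, B, size=L)                  # one position per section
    beta[np.arange(L) * B + picks] = weights
    return beta

rng = np.random.default_rng(0)
beta = sectioned_beta(L=16, B=64, snr=7.0, rng=rng)
print(np.count_nonzero(beta), round(float(np.sum(beta**2)), 3))   # 16 nonzeros, 7.0
```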
And so these results actually provide a theoretically proven, practical communication scheme. The reason we're doing all this is that the communication schemes used nowadays perform very well empirically, but a theoretical understanding of how they perform for a given rate R less than capacity is not known. So by formulating this as a regression problem, we're trying not only to give a practical scheme, but to give a practical scheme for which we have theoretical results on how it performs for any given rate less than capacity. So, future work. One direction is the analysis of soft decisions instead of hard decisions: here, in each step, we make a hard decision to select or not select terms. What if we made soft decisions? We're pretty sure this would improve the algorithm significantly, but the question is how to analyze it theoretically. Another is implications for the general high-dimensional regression problem: do these results carry over -- can we extend them to the general high-dimensional regression problem? The other thing I'm interested in is learning with structured notions of sparsity -- when you know something more than the fact that there are L nonzeros. And I'm interested in working on other problems in high-dimensional statistics, like matrix completion, high-dimensional PCA, and [inaudible] learning. And so with that, thank you, and thank you for the time. Thank you. [applause] >>: I'm still puzzled. >> Antony Joseph: Okay. Please. >>: At the very beginning you make the assumption that the elements of the X matrix are IID. >> Antony Joseph: Right. >>: With variance one. >> Antony Joseph: Right. >>: I understood you to mean that each element of the matrix is IID -- you're not talking about the column vectors being IID with, say, unit covariance matrix. >> Antony Joseph: Right. >>: If every variable is IID, why does the selection matter? You could just take the first L of them -- they'd have the same behavior and predict Y as well as any other set of them. See what I'm saying? If it's IID, the selection thing seems like, why are you doing it in the first place? If that's your assumption, is the selection something huge? >> Antony Joseph: The thing is that we need to ensure selection not just for the first L. So, yes, the entries are IID, and you have Y equals X beta. >>: Beta is not IID. >>: It's just the model. >> Antony Joseph: The nonzeros -- there are L of them, of value W, in our case. >>: But that's choosing a subset of the columns of X. I mean, we -- >> Antony Joseph: So the thing is that the algorithm should guarantee selection for any subset of features, not just -- you don't know the subset that was selected. The algorithm has to detect it no matter what -- maybe -- >>: There's a ground-truth beta; only the coefficients of beta that are nonzero will -- >>: You're trying to find the beta hat. >>: Or you just pick a subset of [inaudible] -- the X are IID. But the beta is already fixed; those are the ones we need to detect and choose. The X are generated IID. >>: Right. >>: But in a sense, if beta has a zero coefficient in the first index, then Y is [inaudible] the first column of X. >>: But who cares -- it's distributed the same as any other column. The regression -- >>: We're trying to find something about beta, not about the Xs. >>: That's what's puzzling me.
>>: If X is the coding matrix and beta is the thing you're trying to recover from the signal, the best way to understand it is through the coding -- the communication -- problem, and you said -- >>: Rather than the regression. >>: Yes. It's misleading to think about it as regression at some level, because it's really a decoding problem. But if one wants to think about it as regression, there's a natural way of doing so. >>: So you're saying that it's a practical -- probably a practical -- communication scheme. But basically you still pre-decide how many entries you need to send, right? And at the receiver end you still need to use a greedy algorithm to recover this. But your message is kind of -- you mentioned there are other types of schemes based on -- >> Antony Joseph: Yes. >>: So do you know the performance of that algorithm? >> Antony Joseph: Yes. So the reason we started working on this was that existing communication schemes perform well in practice, but a theoretical understanding is not there. So we needed a theoretical demonstration that an algorithm can achieve rates up to capacity. Now, I'm pretty sure that under this setup our algorithm in particular performs better than the lasso, but the maximum rate you can achieve with the lasso, I'm pretty sure, must be the same -- the question is how you prove that theoretically. >>: You mean algorithmically it performs -- given how similar the problems are. >> Antony Joseph: Right. >>: Because I think in certain setups the lasso solution would actually do even better than the greedy algorithm. >> Antony Joseph: Right. But the thing here is that in this greedy algorithm we have made use of the fact that the nonzeros are W. And also, I told you that the statistics we use were based on the inner product, right? In our paper it's actually a statistic that's close to that -- the inner product is the motivation for it, but the statistic we analyze is a close variant. So what I'm trying to say is that in the analysis of our algorithm we tried to make use, as much as we can, of the normal structure and the fact that the nonzeros are W. The lasso is a more general thing. So that's one reason why our algorithm performs better than the lasso here. >>: What are the comparable techniques that work well in practice but don't have theoretical bounds? >> Antony Joseph: You mean used in communication? So the techniques used in communication [inaudible] are these LDPC codes. They use belief propagation techniques. Now, belief propagation algorithms -- these are algorithms on graphs; I don't know much about the details -- have a good theoretical understanding when the graphs are trees. But the graphs used in communication are not trees; they have cycles in them. They still use these algorithms and they work, but there's a gap in the theoretical understanding: except for some very special channels, there's still no theoretical understanding of how the codes used nowadays in practice perform. >> Dengyong Zhou: Let's thank the speaker. [applause]