>> Dengyong Zhou: I'm glad to introduce Antony here. Antony is a Ph.D.
student from the Department of Statistics at Yale University. Today he'll be talking about
achieving limits in regression. Antony.
>> Antony Joseph: Hello. So thank you, Dengy. So as you said, my talk is
on achieving information theoretic limits in high dimensional regression.
So before I talk about these information theoretic limits and how we go about
achieving them, let me talk about the general framework of high dimensional
regression and give you some examples just to get you motivated.
So I'm sure that almost all of you must be familiar with this. So you have a linear
model. Y is equal to X beta plus epsilon, where the difference between this and
the classical linear model is that the number of columns is typically much larger
than the number of rows.
In this case the number of columns is the dimension, and it's typically much
larger than the number of rows, which is the sample size.
So under this assumption, if you want to say something meaningful about beta,
you need a sparsity assumption on it. The most common sparsity assumption is
that beta has, say, L nonzero entries, where L is typically much smaller than the
dimension.
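As a concrete illustration of this setup, here is a minimal sketch in NumPy; the sizes and the sparsity level are illustrative choices, and the i.i.d. Gaussian design matches the assumption used later in the talk.

```python
# Minimal sketch of the high-dimensional sparse linear model Y = X beta + epsilon:
# the dimension N is much larger than the sample size n, and beta has only L nonzeros.
import numpy as np

rng = np.random.default_rng(0)
n, N, L = 100, 1000, 10               # sample size, dimension, sparsity (illustrative)

X = rng.normal(size=(n, N))           # design matrix with i.i.d. standard normal entries
beta = np.zeros(N)
support = rng.choice(N, size=L, replace=False)
beta[support] = 1.0                   # the L nonzero coefficients
Y = X @ beta + rng.normal(size=n)     # response = signal plus noise
```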
>>: Does the assumption have to be sparsity, or could it be any other strong prior
assumption?
>> Antony Joseph: Yes. So sparsity -- yes, you could have other assumptions
also, right? For example, that beta is contained in a certain L1 ball or something
like that. Yes, definitely, but this is the most common assumption.
And my talk is going to focus on this assumption and the information theoretic
limits under this assumption. Yes, you're right.
So perhaps under this assumption the most important problem would be feature
selection, that is, recovering the positions of the nonzeros in beta, right?
And this has been an area of a lot of active interest nowadays. For example,
it's got applications in biology, where you're identifying the locations in a gene
responsible for a disease; graphical model selection, where you want to estimate
a sparse graph, which is also like sparse [inaudible] estimation; and compressed
sensing, where you have to recover a signal from relatively few measurements.
So before I start the main part of my talk, let me talk about two examples that I
came across recently. One is this example in face recognition. In this, the
situation is that you have got a number of people.
So here we've got 10,000 people. And you have got a number of images of each
person. So, for example, the first person she's got 16 images. And the second
person has got 12 images.
And these images are actually different. They vary in illumination and in the
gestures of the person. And the goal of this face recognition device is, when a
person comes in front of the device, to detect whether that person is in the
database, and if he is, to detect which person it is.
So this can be formulated as a high dimensional regression problem. In this
case, each column of the X matrix would correspond to an image of a person.
So, for example, the first person would have 16 columns. Second person would
have 12 columns, et cetera. And if a person stands in front of the device, it's like
providing the device with a Y variable. And the assumption is that you have got a
sparse linear combination of the columns of the X matrix.
So this image is of the second person, but it's not exactly the same; it probably
differs in certain aspects. So you would expect this to be a sparse linear
combination of the columns corresponding to the second person.
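A rough sketch of this formulation, not the speaker's method: stack the vectorized training images as the columns of X, treat the probe image as Y, and score each person by how well their own columns explain Y. The function and variable names here are hypothetical.

```python
# Hedged sketch: columns of X are training images, y is the probe image, and each
# person is scored by the residual left after regressing y on that person's columns only.
import numpy as np

def identify_person(X, person_of_column, y):
    best_person, best_residual = None, np.inf
    for person in set(person_of_column):
        cols = [j for j, p in enumerate(person_of_column) if p == person]
        coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)   # fit y using this person's images
        residual = np.linalg.norm(y - X[:, cols] @ coef)
        if residual < best_residual:
            best_person, best_residual = person, residual
    return best_person                                          # person whose images explain y best
```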
So this is one example. Another example, which I actually came across recently,
is this method of output coding for multi-label prediction. The reason I bring this
up is that I got started being interested in high dimensional regression because of
a coding problem, which I'll tell you about shortly.
And this has got relations to that. So here the scenario is that you've got a
number of documents. Say one up to a thousand, where each document has got
a few labels from a large label set.
So let's say you've got a label set of a thousand labels, and each document has
got a few labels from that large label set. These documents could also be
images; I think in their paper they considered images from the ESP game
dataset.
And so, for example, for Document 1 you have this large vector, with most entries
zero and a one wherever the label is present in the document, right?
Now, the training data consists of these label vectors, along with other
explanatory variables for these documents. Because this label vector is very
large, it may not be a good idea to fit a model directly to the label vectors.
So what they do is they encode these label vectors -- so here beta is the label
vector -- by multiplying it by a matrix A of, say, random Gaussian entries, to get a
much lower dimensional vector Y.
And then instead of fitting the model to this huge label vector, you fit it to the Y
vector, which is much lower dimensional. And then when you have got a new
document, you try predicting the Y vector first and then use a reconstruction
algorithm to get the original label vector back.
So this is one advantage of this. It wouldn't be a good idea when the labels
have a certain hierarchy, but if you only know that there's just sparsity and
nothing else, then this might be a good idea.
And in fact there are ways of -- if you know something more about the sparsity,
like the sparsity appears in groups, you can include that in the reconstruction
algorithm.
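A rough sketch of the output-coding idea just described, with illustrative sizes; the reconstruction step is only indicated, since any sparse-recovery method could be plugged in there.

```python
# Hedged sketch of output coding for multi-label prediction: compress the long
# sparse label vector beta into a short code y = A @ beta with a random Gaussian A,
# fit predictors to the entries of y, and reconstruct the labels at test time.
import numpy as np

rng = np.random.default_rng(1)
num_labels, code_length = 1000, 50            # illustrative sizes
A = rng.normal(size=(code_length, num_labels))

beta = np.zeros(num_labels)                   # label vector for one document
beta[[3, 17, 256]] = 1.0                      # the few labels that are present

y = A @ beta                                  # low-dimensional training target for this document

# At prediction time: predict y_hat for a new document with any regression model,
# then run a sparse reconstruction algorithm on (A, y_hat) to recover the label vector.
```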
So these are examples of high dimensional regression. The questions that I was
interested in were, one, the relationships between sample size, dimension,
sparsity and signal-to-noise ratio for accurate feature selection; and, two, the
way in which the above changes when one allows for a small number of
mistakes. What I mean by mistakes is both false alarms and failed detections:
false positives and false negatives.
So actually this is important, because it might be the case that under the sparsity
assumption that I said that there are a few L non-zero entries. Some of the
non-zero entries may be relatively small, right?
So putting the condition that you want to recover all of them, that may be too
stringent a criterion. So you want to be more flexible and then you want to see
how these relationships change when you allow that flexibility.
But the main part of my talk would be on this, the first part.
>>: Signal-to-noise ratio.
>> Antony Joseph: Yes, signal-to-noise ratio. You can define it as the L2 norm
of X beta. That will become clear in context soon.
So let me give you the outline of my talk. So I'll first talk about information
theoretical limits of sparse recovery. And then I'll talk about this communication
problem, which I'll describe only briefly.
But this was the main reason I got interested in the information theoretic limits.
And then I'll provide a discussion of practical approaches, for example greedy
algorithms, discussing how close to these information theoretic limits existing
practical approaches get. And then, third, I'll discuss the performance of an
iterative algorithm that we propose, along with a little bit of the theoretical
analysis, actually demonstrating that one can get to these information theoretic
limits.
So let me start by talking about information theoretic limits of sparse recovery.
Assume that you have this linear model, and assume that the entries of X are
i.i.d. with, say, variance 1, and epsilon has i.i.d. normal(0, 1) entries -- I mean, in
general it would be normal(0, sigma squared), but that's just a scaling. Let's just
assume that the entries are normal(0, 1).
And like I said, the coefficient vector beta is sparse. More specifically, you
assume that the coefficient vector belongs to a set A, where A is the set of betas
with L nonzeros and with the nonzeros having magnitude at least W.
So the nonzeros have a sort of minimum magnitude.
>>: You have exactly L nonzeros, and all of them are at least of a significant magnitude?
>> Antony Joseph: Right. Of magnitude at least W.
>>: [inaudible] everything is [inaudible].
>> Antony Joseph: Yes.
>>: Every individual variable with X.
>> Antony Joseph: Right. So the thing is that here we are worried about
information theoretic limits. More specifically, we are interested in lower
bounds on the sample size N needed to detect the nonzeros of any beta in A, right?
Now, you're right that this is a strong assumption, but this lower bound would
hold for a broad class of matrices.
In fact, i.i.d. matrices are the best kind of matrices you can get, the ones for
which you can get recovery with the lowest sample size possible.
>>: So feature selection is kind of an ill-posed problem. The problem would
be, in something like biology, suppose that you have two features which are simply
identical.
>> Antony Joseph: Right.
>>: Would you recognize both of them or one of them.
>> Antony Joseph: Then you would be doing one -- taking a linear combination
of them.
>>: But it might just pick one of them. It might just take the combination of two,
anything could happen.
>> Antony Joseph: Right. So in real life datasets, as you were saying, you
don't have this ideal assumption.
In our case, with this variability, that wouldn't happen; those two
features wouldn't be identical. So, yes, you're right, in real life datasets you
typically don't have that scenario.
So the question is, then, how do we arrive at such a lower bound? What is the
lower bound, right? What statisticians do in such a case, trying to find a
lower bound for support recovery over the set A, is that they try to reduce this
problem to a testing problem.
So what do I mean? You consider the set A star, which is a subset of A,
which is the set of betas in A with nonzero values exactly equal to W.
So the beta vector looks like that. There are L nonzero entries, right?
And each of the nonzero entries has a value W. So the cardinality of A star is
N choose L. And an algorithm that can detect elements in A can also detect
elements in A star, simply because of the fact that A star is a subset of A. A is a
much larger set, right?
So the advantage of doing this is that we are kind of converting this original
problem of trying to recover in the set A into a testing problem. An algorithm
that has to detect in the set A star is basically doing a test, right? It has to test
between these N choose L possibilities of beta and select one, right?
Or, rather, it's doing a kind of classification. And the advantage of reducing it to a
testing problem is that there are information theoretic inequalities that give lower
bounds on the probability of error in testing problems.
So what do I mean?
>>: Go back. So it's not necessarily true that if you solve for A star you're going to
have the same set of [inaudible], right? If you optimize for the full A, you
may get a different set -- it may fit better by choosing some of these to be zeros
[inaudible]. Or is that true? You solve this for the subproblem and you find a
certain set of nonzero values.
>> Antony Joseph: Right.
>>: You went and solved some magic box, the full problem, guaranteed to get
the same set of nonzero values, not the same values but the same set? The
same features [inaudible].
>>: No, it's a lower bound.
>> Antony Joseph: So the thing here is that a lower bound for detecting in the
set A star would also be a lower bound for detecting in the set A, because
detecting in the set A is a tougher problem.
But you're right, you cannot automatically generalize from A star to A. Since
we are worried about the lower bound, though, a lower bound on A star will
also be a lower bound for the much larger set A.
>>: [inaudible] bounds.
>> Antony Joseph: What's that.
>>: You're not trying to solve the problem, you're just --
>> Antony Joseph: Right. Right. So as I said, there are information theoretic
inequalities providing lower bounds on the probability of error. What do I mean?
So the signal-to-noise ratio you can define here, since the entries of X are i.i.d.
with variance one, as simply the squared norm of the vector beta. So in this
case it's L times W squared, right?
And an important quantity that appears in these lower bounds is this quantity
known as the capacity, which is half log of one plus SNR.
>>: How does this change if the errors have non-unit variance?
>> Antony Joseph: So if the errors have non-unit variance, then the SNR would
be divided by the variance of the errors, right? Since that's just a scaling, I took
it to be unit variance.
So the lower bound on the sample size for detection in A star is actually a
consequence of a very famous theorem in coding, which is Shannon's coding
theorem. It's not really meant for regression problems, but it can be readily
applied to our situation.
So it basically says that to be able to detect any beta from A star, one requires N
to be at least this quantity. It's log of the cardinality of A star divided by C.
So let me write that down. N should be at least log of cardinality of A star divided
by C, where C is -- so since A star is actually a simpler problem than A, the same
lower bound holds for detecting in A.
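In symbols, writing n for the sample size and capital N for the dimension to keep the two apart, the quantities just described and the Shannon lower bound can be written as follows (this is only a transcription of the formulas on the slides, stated with assumed notation):

```latex
\mathrm{SNR} = \|\beta\|^2 = L W^2, \qquad
\mathcal{C} = \tfrac{1}{2}\log\bigl(1 + \mathrm{SNR}\bigr), \qquad
|A^\star| = \binom{N}{L},

n \;\ge\; \frac{\log |A^\star|}{\mathcal{C}}
\quad\Longleftrightarrow\quad
R \,=\, \frac{\log |A^\star|}{n} \,\le\, \mathcal{C}.
```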
But we are actually interested in detection in the set A star. Not A. For reasons
I'll tell you soon enough. So let me just mention that Shannon's coding
theorem was not meant to be used for such applications, so there are stronger
lower bounds on the sample size N that have recently been proved specifically for this
setup by these researchers.
But these bounds agree with the Shannon lower bound in the setting where
the sparsity over the dimension, which is L over capital N, is small.
So in this regime, the Shannon lower bound is still the strongest there is.
>>: Clarification. Can you go back a slide?
>> Antony Joseph: Yes.
>>: So "detect any beta" [inaudible], does that mean any of the set of betas, or a
particular value we're trying to detect?
>> Antony Joseph: So any beta. Right. So any beta. What this theorem
actually says, more specifically, is: suppose you have some algorithm for
detection in A star, and that algorithm gives you an estimate beta hat, right?
What this theorem says is that the probability that beta hat is not equal to
beta -- the average of this probability over all beta in A star -- is bounded away
from 0 if N is less than that quantity.
So if the probability of error is bounded away from zero, then you cannot detect.
So the Shannon bound is still the strongest you can get in this regime where the
sparsity over the dimension is small.
But Shannon's bound is a converse result. A converse result basically tells you
what you cannot do, right? It need not be that there is an algorithm that can
actually achieve these limits.
So what I mean by achieve is: can one find a practical algorithm that can detect
any element in A star with N near this quantity, log of the cardinality of A star
over C?
>>: Isn't the variance of the noise, in typical algorithms, a controlled parameter,
so you could play tricks with the bound by basically turning up or turning down
the noise model that you were working with?
>> Antony Joseph: No, no, the noise I said is normal with variance one, right?
>>: But if you were -- when one is actually solving a regression problem, they
would very often vary parameters that change the parameterization of noise, you
could choose to have higher or lower noise that you were modeling.
>> Antony Joseph: Yes. But this covers everything. The lower bound is --
>>: But in the equation you were saying you would scale the [inaudible] where
you had the signal-to-noise ratio. That would scale with the noise variance,
right?
>> Antony Joseph: Yes. Yes. So if the noise variance is small then the
signal-to-noise ratio would be higher, and then the lower bound would be
smaller, right?
>>: Okay.
>> Antony Joseph: Yes.
>>: So in other words, if you do know that the actual variance of the noise varies,
the bound would scale correspondingly, log linearly or --
>> Antony Joseph: It would be log of -- the SNR would be -- so in general, if there
is a certain variance for the noise, the SNR would be the norm of beta squared
divided by the noise variance.
>>: So it would come into the log, in the bottom?
>> Antony Joseph: Yes. It would come into the log in the bottom, yep.
So can one actually find a practical algorithm that can detect elements in A star
with N near this quantity? Go ahead.
>>: So since you described this model as a generative model, wouldn't the MAP
estimator be giving you exactly the result you're looking for?
>> Antony Joseph: So I'll come to the MAP estimator soon. But, no, I mean, this
lower bound which I mentioned, that was a converse result. It's not like we took
the maximum a posteriori estimator and saw what the sample size was.
It was not done by analyzing any algorithm. It was just some information
theoretic inequalities for testing problems. But, yes, that is the best estimator for
this problem, and yes, it's also important to analyze it. So I'll talk about that
soon.
>>: I won't say too much, I'll stop, but back to the theorem. Isn't that a
probabilistic statement? You could detect a beta by chance with a smaller N,
right?
>> Antony Joseph: So I mean this probability would be -- I mean, this probability
is over all beta, right?
>>: Not that you can -- even if your guess is three, you have some odds that the
answer would be three. But your probability of guessing wrong, regardless of
what method you use, is bounded away from zero by a constant. That is what
the statement means.
>> Antony Joseph: Okay. Thanks. So that's the question. There's another way
of phrasing this question. Like I said, N should be at least that, and we want
to see whether there's a practical procedure that can make N close to that. I'll tell
you why we are interested in that soon enough. But there's another way of
formulating this: if you define this quantity, the rate R, which is the log of the
cardinality of A star divided by N, then by Shannon's theorem this rate cannot be
greater than C if you want to detect the elements in A star, right?
So another way of rephrasing this question is: devise a practical algorithm to
detect elements in A star with R near C, rate near this quantity capacity. Right?
So let me tell you why we are interested in this detection with sample size N
near log of the cardinality of A star over C.
It's because of a communication problem which we formulated as a regression
problem. The communication problem is as follows. In communication,
you want to send messages reliably through a noisy medium, also known as a
channel, right?
So we assume that the set of messages is A star; a message corresponds to
a coefficient vector in A star. Now, there's quite a bit of story that I'm hiding.
When you're actually talking on a cell phone you're not actually sending
coefficient vectors, you're actually speaking. But what's actually done is that
time is divided into blocks, and then whatever you speak is digitized --
digitized meaning turned into binary strings -- and those binary strings are
basically mapped into one of those coefficient vectors in A star. So that's the story
which can be there in the background.
Let's assume that the set of messages is A star. And as I said, you transmit
through a noisy medium, and the noisy medium actually adds normal noise to
what is sent. Let me be more specific. Suppose a sender wants to transmit a
vector beta in A star. What he does is he encodes beta using a matrix X of
i.i.d. normal entries.
This is where the relationship with the output coding approach for multi-label
prediction comes in.
So you encode beta to X beta. The receiver gets Y which is a noisy version of X
beta. The epsilon is the noise added by the medium.
And since we're assuming that the channel which is the noisy medium is the
Gaussian channel, the noise is normal with variance 1.
The receiver's goal is to get the correct beta from A star from the knowledge of Y
and X so it's a regression problem. The reason you want to minimize the sample
size, you want to make the sample size as small as possible, is that you want to
minimize the number of transmissions.
So the way the sender transmits this X beta is he first transmits the first entry of
X beta, then the second entry, up to the Nth entry, and you don't want the number
of transmissions to be large. Right? So you want to make this N as small as
possible for that reason.
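A hedged end-to-end sketch of this communication setup: the message is a coefficient vector in A star with L nonzeros of value W, the codeword X beta is sent one entry per channel use, and the channel adds normal(0, 1) noise. All sizes are illustrative.

```python
# Hedged sketch of the Gaussian-channel setup as a regression problem.
import numpy as np

rng = np.random.default_rng(2)
n, N, L, W = 200, 1000, 8, 0.5            # channel uses, dictionary size, sparsity, nonzero value

X = rng.normal(size=(n, N))               # codebook shared by sender and receiver

support = rng.choice(N, size=L, replace=False)   # the message: which L columns are "on"
beta = np.zeros(N)
beta[support] = W

codeword = X @ beta                       # sent entry by entry, n transmissions in total
Y = codeword + rng.normal(size=n)         # what the receiver observes

# Decoding = support recovery: estimate `support` (equivalently beta) from (X, Y).
```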
So now let me discuss practical, feasible strategies for solving this problem.
One of the strategies would be greedy algorithms, which is the approach we take.
So let me first say that the best estimator, as you said, is the MAP estimator:
you find the beta in A star which minimizes this norm, Y minus X beta, the L2
norm of that. This has the least average probability of error, and by average
probability of error I mean precisely this.
However, this is not computationally feasible, because it requires a search over
all beta in A star, and A star is a huge set. However, it's really important to
analyze how this estimator performs, because, like you mentioned, Shannon's
theorem is a converse result.
So it tells you what you cannot do, right? Even though this lower bound is
there, it may be the case that there is no scheme, let alone a practical scheme,
with which you can achieve this lower bound, right?
So it's for that reason that it's important to analyze this estimator first, because if
this estimator performs badly you cannot even hope that any practical scheme
can perform well. Right?
And in our paper we show that this estimator actually performs quite well. So we
have a shot with practical schemes. Okay.
>>: So this result shows that the Shannon bound is actually a tight bound.
>> Antony Joseph: Right. But you see, this estimator is not computationally feasible.
>>: But just [inaudible].
>> Antony Joseph: That's right. It's a tight bound, right. So a possible
computationally feasible strategy, a very famous one, something that has received
a lot of attention, is L1 penalized methods.
Like I said, we use the technique of greedy approaches, so let me give you
a bit more background on greedy approaches. The way these algorithms work
is that they select one term per step, recompute the [inaudible] and repeat the
process. That's the general, high level way these things work.
So, for example, one way of doing this would be: you start with an initial fit of zero.
You update the residual vector, which is Y minus the fit from the previous step.
You find the term J K that maximizes the inner product with the residual -- you
find the feature which has the most correlation with the residual. You update the
fit using the selected term J K; I'll tell you ways you can update the fit in the next
slide. And then you see whether a particular stopping criterion is met. A stopping
criterion would be whether the norm of this residual vector is small or not.
If it's not met, you do it again. So here are ways you can update the fit. If you
take the new fit to be the previous fit plus the newly selected term X J K, where
the coefficient of that term is the actual inner product, then you have the
matching pursuit algorithm.
If it's not exactly the inner product but a small portion of that inner product, where
epsilon K is a small positive quantity, then you get the forward stagewise algorithm.
And if the new fit is actually a projection of Y onto the selected terms, then it's
the algorithm known as orthogonal matching pursuit. So there's a whole host of
algorithms that one can have; in fact, there's a whole host of ways you can fit
this next term.
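As one concrete instance of this family, here is a minimal sketch of orthogonal matching pursuit as just described: pick the column most correlated with the residual, then refit by projecting Y onto everything selected so far. This is background, not the algorithm the speaker proposes later.

```python
# Minimal sketch of orthogonal matching pursuit (OMP).
import numpy as np

def omp(X, Y, num_terms):
    selected = []
    residual = Y.copy()                        # initial fit is zero, so the residual is Y
    coef = np.array([])
    for _ in range(num_terms):
        scores = np.abs(X.T @ residual)        # correlation of each column with the residual
        scores[selected] = -np.inf             # never reselect a column
        selected.append(int(np.argmax(scores)))
        coef, *_ = np.linalg.lstsq(X[:, selected], Y, rcond=None)  # project Y on selected terms
        residual = Y - X[:, selected] @ coef
    return selected, coef
```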
So that gives rise to a huge class of algorithms. And such algorithms have been
used for neural nets, L2 boosting, learning with structured sparsity, learning in
reproducing kernel Hilbert spaces, and feature selection, which is the thing
we're interested in. In fact, I myself have done some work on feature selection
using the orthogonal matching pursuit algorithm.
So the results from this literature do imply that with a sample size of the order of
log of N choose L -- remember, N choose L is the cardinality of the set A star,
right? -- you can recover the positions of the nonzeros for any beta in A star.
That is what these results show.
However, we're more interested in the constant, not just the order of magnitude:
can you recover with N close to 1 over C times log of the cardinality of A star,
right?
>>: What are the constants from those results?
>> Antony Joseph: The constant -- the thing is, they don't really need to worry
about the constant there. So it could be nine, ten, it could be anything.
The reason is that they were focused on generality of applications, right? We
have a specific problem where the X matrix has i.i.d. normal entries and the
coefficient vector has the structure where the nonzeros are W.
So the question is, can we leverage that to give stronger results than what
exists in the current literature?
So --
>>: Naive question, but [inaudible] how do you know yours is stronger? I just
don't see --
>> Antony Joseph: No, I mean -- no, I mean, they say that with N of some
constant --
>>: [inaudible].
>> Antony Joseph: So their point there is not really about what the constant is;
it's about the order of magnitude of things. But, yeah, so...
So let me describe the [inaudible] algorithm. It's actually a minor variant
of these greedy algorithms. We have a situation like this where the nonzeros
are W.
So notice that, if you denote by S the set of columns where beta, the coefficient
vector, is nonzero, then Y is W times the sum of the X J over J in S, plus noise,
right?
So X J is correlated with Y if and only if J belongs to S. Now, for the first step, we
consider the statistic Z 1 J, which is the inner product of X J with Y divided by the
norm of Y. And the set of terms that are selected in the first step are those
where Z 1 J is greater than a positive threshold tau.
I'll tell you what this positive threshold is soon. For step K, with K greater than or
equal to 2, we consider terms that were not selected in previous steps. Any term
selected in a previous step stays selected; there's no going back and changing
that.
From the previous steps you get these fit vectors, where the fit vector is the sum
of the X J over the set detected in those steps, times this scalar quantity W. We
compute the residual and calculate these inner product statistics: the inner
product of X J with the residual divided by the norm of the residual.
You select those for which the statistic is greater than the threshold, and stop
whenever L terms have been selected or whenever no terms are above the
threshold. Now, the thing about this algorithm is that we can, after some effort,
actually characterize the distribution of these statistics Z K J.
And the distribution of these statistics is as follows. For any J in S -- recall that S
is the set of terms where beta is nonzero, right -- Z K J is bounded from
[inaudible], with high probability, by a shifted normal, mu K plus W K J, where
these W K J are i.i.d. normal(0, 1) random variables.
So for any J in S it's a shifted normal. And for any J not in S it's simply noise, a
normal(0, 1) random variable. These means mu K are greater than zero, and
they increase with the steps.
So it's this mean that gives you a way of discriminating between the correct and
the wrong terms. So a little more detail -- oh, so one thing I would like to mention
is that these results are nonasymptotic.
So that was where a lot of our effort went. For the first step it's
actually easy to calculate the distribution of Z 1; it's not a big effort. But from
the second step on, because there are terms detected in the first step, there are
lots of dependencies that arise, and actually getting such a characterization
required a lot of effort.
>>: What is L here?
>> Antony Joseph: What is that?
>>: What is L?
>> Antony Joseph: The sparsity, the number of nonzero terms. So the error
probability -- the probability with which this characterization fails -- is exponentially
small in L.
A few more details, since I'm actually running out of time. Let me just say how
this threshold tau is selected. For any J not in S, the Z K J are i.i.d. normal(0, 1),
right? And since there are N minus L wrong terms, and since the
maximum of N minus L normal random variables is roughly like square root of 2 log of
N minus L, you'd like to put the threshold at that value to minimize false positives, right?
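Putting the pieces above together, here is a rough sketch of the iterative thresholding decoder as I read the description: normalized inner-product statistics, a threshold near square root of 2 log(N minus L), terms never deselected, and the fit built with the known nonzero value W. The paper analyzes a close variant of this statistic, as the speaker notes later, so treat this as illustrative.

```python
# Rough sketch of the iterative thresholding decoder described in the talk.
import numpy as np

def threshold_decoder(X, Y, L, W):
    n, N = X.shape
    tau = np.sqrt(2.0 * np.log(N - L))        # threshold chosen to control false alarms
    selected = set()
    residual = Y.copy()
    while len(selected) < L:
        Z = (X.T @ residual) / np.linalg.norm(residual)   # Z_{k,j} statistics
        new = {j for j in range(N) if j not in selected and Z[j] > tau}
        if not new:
            break                              # nothing crosses the threshold: stop
        selected |= new                        # once selected, a term stays selected
        fit = W * X[:, sorted(selected)].sum(axis=1)      # fit uses the known value W
        residual = Y - fit
    return selected
```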
The mean mu K has this form: it's a function of the fraction of correct detections
up to the previous steps, where this function is actually an increasing function.
So what this gives us is that the total fraction of terms detected after step K has
expectation lower bounded by g of u K minus 1, where g is this function; it's some
function on [0, 1]. So the take-home message from this is the following.
Remember the rate R is log of the cardinality of A star divided by N. Suppose you
take a rate of .45 times C and take an SNR of 7. Then the function g looks like
that on [0, 1], and the way our theory predicts that our algorithm progresses is
something like that.
So the first point over here is the fraction of terms detected after the first step.
You go like that, and the second point is the fraction of terms detected after the
second step, et cetera, and you keep going until this last point.
So our theory predicts that you can actually get most of it, 95 percent of it,
correct, and with an error probability that is like 10 to the minus 4.
So this was with a rate of .45 times C. What if you try to increase the rate a bit
more, say to .55 times C? Then the function g looks like this.
Now look at this: it's touching the Y equals X line. So what our theory predicts is
that our algorithm actually just has to stop over there.
So you actually make a lot of mistakes; at least, our bound gives us that you make
85 percent mistakes. And the error probability is like 10 to the minus 5, but that
doesn't matter because you're already making a lot of mistakes.
It turns out that the rate R can only be made as high as this quantity R0, which is
half of SNR over 1 plus SNR, in the setup we are in and using our algorithm. And
a little bit of algebra can show you that this quantity is actually less than the
capacity C. So you really cannot drive the rate high.
Notice that R0 increases to one half as the SNR goes to infinity. And in the
asymptotic regime, a similar threshold of recovery was noticed for the lasso
algorithm.
So we show this threshold for finite SNR; Wainwright noticed it as SNR tends to
infinity. But the thing is that we need the rate to be made as high as the capacity,
and here, using our setup and our algorithm, we can only make it as high as R0.
So how do we reach rates as high as capacity?
What we do is we assume N to be of the form L times B, and we consider
coefficient vectors of this form: the vector is divided into L sections, where each
section has B terms, and there's exactly one nonzero in each section. Right?
And the sum of squares of the nonzeros is equal to the SNR. We consider
coefficient vectors like that, and we take A star to be all such vectors, right?
The cardinality of A star is now B to the power L. Notice that using this
modified A star we need to again ensure that we can make the sample size as
small as this. A star can be anything -- it could be any finite set. All we need to
do is show that the sample size can be made as small as that.
The nonzero values we consider are actually of this form: they decrease
exponentially and then flatten out. Now you run the algorithm on this modified A
star. And if you do that -- remember, previously I was showing that with rate .55
times C the algorithm was getting stuck -- now, with this modified A star, if you
run it, you can actually make the algorithm detect most of the terms correctly.
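A hedged sketch of these sectioned coefficient vectors: N = L times B columns split into L sections of B columns, one nonzero per section, squared values summing to the SNR, with an exponentially decaying profile. The precise power allocation in the paper also flattens out at the end; the plain exponential here is only illustrative.

```python
# Hedged sketch of a sectioned ("partitioned") coefficient vector.
import numpy as np

def sectioned_beta(L, B, snr, decay=0.9, rng=None):
    rng = rng or np.random.default_rng(0)
    powers = decay ** np.arange(L)            # exponentially decaying section powers
    powers *= snr / powers.sum()              # squared nonzero values sum to the SNR
    beta = np.zeros(L * B)
    for section in range(L):
        j = section * B + rng.integers(B)     # exactly one nonzero position per section
        beta[j] = np.sqrt(powers[section])
    return beta

beta = sectioned_beta(L=16, B=64, snr=7.0)    # the message set A* has B**L such vectors
```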
So the way this algorithm progresses using this modified A star is that in the first
step, since the weights decrease like that, the initial sections have a higher
probability of detection. So on the X axis you have the section number.
The initial sections have a higher probability of detection, and since the weights
decrease like that, the probability of detection decreases across sections. By the
seventh step the probability of detection has improved dramatically, by the 13th
step even more, and by the 19th step, which is the last step, almost all the
sections have a high probability of detection.
And this can be summarized in a theorem. Fix any rate less than C B, where this
quantity C B tends to the capacity for large B. Then our result can be stated as
follows: for any rate less than C B, one has a fraction of mistakes of order 1 over
log B. As B tends to infinity this tends to 0, and the error probability is
exponentially small in L. This quantity delta is the difference between C B and R.
So what this actually tells us is that in certain sparsity regimes it's actually
possible to achieve information theoretic limits using a practical scheme, right,
using a greedy approach. And the error probability is exponentially small. This is
something that I didn't stress, but the existing results on high dimensional
regression, while they are more general in nature, do not have exponentially
small error probabilities. And one of the reasons for that is that they focus on
exact recovery of the nonzero terms, and that is too stringent a criterion -- to
recover all the nonzeros exactly -- so you cannot get the error probability as
small as we can. Since we allow for a small number of mistakes, we can get our
error probability really small.
>>: Is there a latent sparsity if you have log sparsity, would everything become
really hard?
>> Antony Joseph: What is log sparsity?
>>: Basically a linear number -- the number of relevant terms being linear in the
total number of features -- versus the number of features you want to recover
being log. So, for example, in other problems people will kind of flood the feature
space with various transformations of the --
>> Antony Joseph: The number of nonzeros may not be linear. It needs to be
sublinear. I mean --
>>: Because N equals L B, right? If you have N equals something log of L, if
instead of a linear dependence on L, you have a log dependence on L.
>> Antony Joseph: Oh, okay. Okay.
>>: Log, and then the number of total --
>> Antony Joseph: I see. I see. So the number of -- this thing is -- I think it should
still work. I mean, yes, I need to think about it a bit more. But, yeah, I don't see
this -- I need to think about it a bit more, thanks.
And so these results actually provide a theoretically provable and practical
communication scheme. The reason why we are doing all this is that the
communication schemes that are used nowadays perform very well empirically,
but a theoretical understanding of how they perform for any given rate R less
than capacity is not known. So by formulating this as a regression problem,
we are trying to give not only a practical scheme but also a practical scheme for
which we can have some theoretical result on how it performs for any given rate
less than capacity.
So, future work. Analysis of soft decisions instead of hard decisions: here, in
each step, we make hard decisions about which terms to select. What if we made
soft decisions? We're pretty sure that this would improve the algorithm
significantly, but the question then is how do we analyze it theoretically.
Implications for the general high dimensional regression problem: do these
results actually carry over? Can we extend these results to the general high
dimensional regression problem? The other thing I'm interested in is learning
with structured notions of sparsity: when there's something else, other than the
fact that there are L nonzeros, that you know about the sparsity.
And I'm interested in working on other problems in high dimensional statistics like
matrix completion, high dimensional PCA and [inaudible] learning. And so with
that, thank you, and thank you for your time. Thank you.
[applause]
>>: I'm still puzzled.
>> Antony Joseph: Okay. Please.
>>: At the very beginning you make an assumption that the X matrix, the
elements IID.
>> Antony Joseph: Right.
>>: With variance one.
>> Antony Joseph: Right.
>>: I understood you to mean that each element of the matrix is IID with variance
one. You're not talking about the vectors being IID with a unit variance covariance
matrix.
>> Antony Joseph: Right.
>>: If every variable is IID, why does the selection matter -- if you just take
the first L of them, wouldn't they have the same behavior in predicting Y as any
other set of them? See what I'm saying? If it's IID, then the selection thing seems
like, why are you doing it in the first place, if that's your assumption? Am I missing
something huge?
>> Antony Joseph: So the thing is that we need to ensure selection not just for
the first L. So, yes, they're IID. You have Y equals X beta.
>>: Beta is not IID.
>>: It's just the model.
>> Antony Joseph: So there are L nonzeros, of value W, in our case.
>>: But that's choosing a subset of the columns' effects. I mean, we --
>> Antony Joseph: So the thing is that the algorithm should be such that it
guarantees selection for any subset of features selected, not just -- you don't
know the subset that was selected. The algorithm has to detect that no
matter -- maybe --
>>: There's a given beta, only the ones -- the coefficients for beta will --
>>: Trying to find the beta hat.
>>: Or you just run a subset of [inaudible] the X are IID. But that beta you already
specify; those are the ones we need to detect and choose.
They're generated IID.
>>: Right.
>>: But in a sense if beta has a zero coefficient in the first index, then Y is
[inaudible] first column of X.
>>: But who cares? It's distributed the same as any other column. The regression --
>>: We're trying to find something about beta. Not about the Xs.
>>: That's what's puzzling me.
>>: If X is the coding matrix and beta is the thing you're trying to recover from the
signal, the best way to understand it is from the coding, the communication
problem, and you set --
>>: Stuff on the regression.
>>: Yes. It's misleading to think about it as regression at some level, because it's
really a decoding problem. But if one wants to think about it as regression, there's a
natural way of thinking about it in that way.
>>: So you're saying that it's a practical, probably a practical communication
scheme. But basically you still kind of pre-decide how many you need to send,
right? And at the receiver end you still need to use a greedy algorithm to recover
some of this.
But your message is kind of -- you mentioned there's another type of message
based on our --
>> Antony Joseph: Yes.
>>: So do you know the performance of that algorithm?
>> Antony Joseph: Yes. So the thing is that the reason we started
working on this was that existing communication schemes perform well in
practice, but a theoretical understanding is not there.
So we needed to somehow get a theoretical demonstration that an algorithm
can achieve rates up to capacity.
Now, I'm pretty sure that under this setup our algorithm in particular performs
better than the lasso, but the maximum rate you can achieve with the lasso, I'm
pretty sure, must be the same. But how do you prove that theoretically?
>>: You mean algorithmically perform -- see how the similar problem is.
>> Antony Joseph: Right.
>>: Because I think in certain setups the result would be that the lasso solution
is selected even better than with the greedy algorithm.
>> Antony Joseph: Right. But the thing here is that in this greedy algorithm we
have made use of the fact that the nonzeros are W. And also like -- so I told you
that the statistics we use were based on inner product, right?
But in our paper it's actually a statistic that's close to that; the inner product is the
motivation for it, but the statistic that we analyze is actually a close variant.
So what I'm trying to say is that in our analysis of our algorithm we tried to
make use, as much as we can, of this normal structure and of the fact that the
nonzeros are W.
The lasso algorithm is a more general thing. So that's one reason why our
algorithm performs better than the lasso.
>>: What are the comparable techniques that work well in practice but don't have
theoretical bounds?
>> Antony Joseph: You mean used in communication? So the techniques that are
used in communication [inaudible] are these LDPC codes. They use belief
propagation techniques.
Now, belief propagation algorithms -- these are algorithms on graphs; I don't
know much about the details -- have a good theoretical understanding of when
they work, namely when these graphs are trees. But the graphs that are used in
communication are not trees; they have cycles in them. They still use these
algorithms and they work, but in the theoretical understanding there's still a gap:
except in some very special channels, there's still no theoretical understanding of
how the codes that are used nowadays in practice perform.
>> Dengyong Zhou: Let's thank the speaker.
[applause]