
>> Lin Xiao: Today we are very happy to have Adams Wei Yu from CMU. Adams is a PhD student at
CMU. He did a summer internship last year. Today he will talk about something different from his
internship work, okay.
>> Adams Wei Yu: Okay. Thanks for the introduction. I’m Adams from CMU. Today, I will introduce a
very efficient Doubly Stochastic Primal-Dual algorithm for the Empirical Risk Minimization problem. This is
joint work with Qihang Lin and Tianbao Yang. They are both from the University of Iowa.
A lot of machine learning problems can be abstracted as the following empirical risk minimization problem.
It is essentially minimizing a convex objective which contains two parts. The first part is called the
empirical loss, which is computed from the given data. The second part is a convex regularization
function. For example, it can be the [indiscernible], which can incorporate our prior knowledge about
the decision variable.
We further assume that the loss can be decomposed as the summation of different losses on different
data points, okay. If the loss is the square loss then it is a linear regression problem. If the loss is the
logistic loss then it is logistic regression. If the loss is a smooth hinge loss then it is a smooth SVM.
Okay, so it indeed captures a lot of problems in the machine learning community.
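To make the setup concrete, here is a minimal sketch of such an objective, assuming a smoothed hinge loss and a squared-l2 regularizer; the loss choice, constants, and all names below are illustrative, not taken from the slides.

import numpy as np

def smoothed_hinge(z, gamma=1.0):
    """Smoothed hinge loss phi(z): a (1/gamma)-smooth surrogate of max(0, 1 - z)."""
    return np.where(z >= 1.0, 0.0,
           np.where(z <= 1.0 - gamma, 1.0 - z - gamma / 2.0,
                    (1.0 - z) ** 2 / (2.0 * gamma)))

def erm_objective(x, A, b, lam=1e-2, gamma=1.0):
    """Empirical risk: average loss over the n rows of A plus an l2 regularizer g(x)."""
    margins = b * (A @ x)                      # phi_i depends only on data point i
    loss = smoothed_hinge(margins, gamma).mean()
    reg = 0.5 * lam * np.dot(x, x)             # lambda-strongly convex regularizer
    return loss + reg

# Toy usage: n = 100 data points of dimension p = 20.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20))
b = np.sign(rng.standard_normal(100))
x = np.zeros(20)
print(erm_objective(x, A, b))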
We have the following assumptions on the problem. The first is that the regularizer g has a
decomposable structure, which is a summation over the different coordinates. We also assume that each g_j is
lambda-strongly convex. We also assume that each loss function phi_i is (1/gamma)-smooth.
Okay, those are all our assumptions. If we introduce the Fenchel conjugates of the loss functions, which are the
phi_i-stars there, then we can reformulate the original empirical risk minimization problem as a bilinear
saddle-point problem, which introduces a new term, the bilinear saddle term. This A is the data
matrix, where each row is a data point, and each data point is of dimension p, okay.
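For reference, a standard way to write the reformulation being described (the 1/n scaling follows a common convention and may differ from the slides):

\[
\min_{x \in \mathbb{R}^p} \; \max_{y \in \mathbb{R}^n} \;\;
\frac{1}{n}\, y^{\top} A x \;-\; \frac{1}{n}\sum_{i=1}^{n} \phi_i^{*}(y_i) \;+\; g(x),
\qquad g(x) = \sum_{j=1}^{p} g_j(x_j),
\]

where \(\phi_i^{*}\) denotes the Fenchel conjugate of \(\phi_i\) and the rows of \(A \in \mathbb{R}^{n \times p}\) are the data points \(a_i^{\top}\).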
In the following we will work on this formulation rather than the original formulation. Actually there are a
lot of previous works on this empirical risk minimization problem. There are two lines basically. The first
line is the stochastic gradient methods. In each iteration such methods use a
stochastic gradient which in expectation is the same as the full gradient, but it has a
variance. The algorithms include the ones proposed by the following authors.
Another line of algorithms is the stochastic variance reduced gradient methods. These methods add one
term and subtract another term such that the stochastic gradient still has the same expectation
as the full gradient, but it has a lower variance. This is called stochastic variance reduction, okay.
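To illustrate the variance-reduction idea just described, here is a minimal sketch of an SVRG-style gradient estimator for a least-squares loss; the snapshot handling and step size below are illustrative assumptions, not the exact scheme of any specific paper mentioned here.

import numpy as np

def svrg_gradient(A, b, x, x_snap, full_grad_snap, i):
    """Variance-reduced stochastic gradient at x for f(x) = (1/2n)||Ax - b||^2.

    g_i(x) - g_i(x_snap) + full_grad_snap has the same expectation as the full
    gradient, but lower variance when x is close to the snapshot point x_snap.
    """
    a_i = A[i]
    g_i_x = (a_i @ x - b[i]) * a_i          # gradient of the i-th squared loss at x
    g_i_snap = (a_i @ x_snap - b[i]) * a_i  # same component at the snapshot point
    return g_i_x - g_i_snap + full_grad_snap

# Toy usage: one inner SVRG step.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
b = rng.standard_normal(50)
x = np.zeros(10)
x_snap = x.copy()
full_grad_snap = A.T @ (A @ x_snap - b) / A.shape[0]
i = rng.integers(A.shape[0])
x = x - 0.1 * svrg_gradient(A, b, x, x_snap, full_grad_snap, i)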
Another line of algorithms is the coordinate methods, which in each iteration randomly select a few
coordinates and do coordinate descent. This line of algorithms is not
accelerated, but these lines are accelerated, which means that they have a square root dependence on
the condition number [indiscernible]. Recently there are some algorithms proposed as primal-dual
algorithms, which [indiscernible] the saddle-point formulation.
Okay, so all of these algorithms assume that the number of data points n is huge but the dimension p is
relatively small. All of them just focus on sampling one or a batch of data points in each iteration, okay.
>>: [indiscernible] stochastic variance [indiscernible]?
>> Adams Wei Yu: Yeah, Chung has a paper in NIPS two thousand thirteen.
>>: It was different from most of this [indiscernible].
>> Adams Wei Yu: Yeah, basically the general idea would be the same. It’s just that…
>>: Ideas the same but…
>> Adams Wei Yu: Yeah.
>>: [indiscernible].
>> Adams Wei Yu: Yeah.
>>: [indiscernible]?
>> Adams Wei Yu: [indiscernible].
>>: She has a post-doc in [indiscernible].
>> Adams Wei Yu: Yeah, she has a [indiscernible]…
>>: [indiscernible].
>> Adams Wei Yu: [indiscernible] she is an assistant professor at the University of Hong Kong. Okay, so,
yeah, but what if p is relatively larger than n, which would happen in the high-dimensional
case? In that case we can simply sample one or a batch of features in each iteration and apply the dual
version of the approaches mentioned above.
But what if both n and p are large? A natural solution is just to sample one data point and one
feature in each iteration. There are a few previous works along this line. This talk also falls in that
category, okay.
In this talk we care about and answer the following questions. The first is: what is the iteration
complexity of a coordinate method for this problem when we sample both from the primal and dual
spaces per iteration? This essentially asks what the convergence upper bound of the algorithm is.
The second question is: when would this type of algorithm be better than the purely primal or
purely dual coordinate methods? This concerns the overall complexity of the algorithm.
The third question is: what is the fundamental limit of this type of algorithm? This is essentially the
convergence lower bound of this type of algorithm.
Okay, so we will first answer the first question by proposing a new algorithm. Okay, so let’s quickly
review the problem. We would like to solve this empirical risk minimization problem. We formulate it
as a saddle-point problem. A_i is the ith row of the data and A^j is the jth column of the data, okay.
Now, let me review one very famous algorithm which was proposed by [indiscernible] Zhang and Lin Xiao in
two thousand and fifteen.
>>: Sorry, famous.
[laughter]
>> Adams Wei Yu: I think so. I think it’s famous because it has received a lot of citations. Okay…
>>: These two.
[laughter]
>> Adams Wei Yu: Okay, the idea is as follows. In each iteration they sample a subset of data
points, where the number of data points is m. First they use these m data points to do a dual
update. This y is the dual variable.
Okay, so if a data point is sampled then they update the corresponding dual variable; if not, it
remains unchanged. Then they do a momentum update of this dual variable, okay. After that they use this
updated dual variable to update the primal variable. But in this step they do a full primal update, which
means that they update the full vector x, okay. Then again they use the momentum update.
Okay, this is basically SPDC. In their paper they show that if they choose a proper step size, then
their convergence rate looks like this, which has a dependence on log one over epsilon, which
means that the convergence rate is linear. Their dependence on the condition
number kappa is a square root, which means that they are an accelerated algorithm.
This condition number is defined as the maximum [indiscernible] of the data over the strongly convex and
the smooth parameter, okay. It illustrates how ill-posed the problem is. Normally lambda and gamma
would be small. Then the condition number would be very huge. They have a square root of this
condition number, which means that the complexity is relatively low, okay. This is
basically the theoretical result of SPDC.
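For reference, the kind of rate being described takes the following form (this is the single-sample, m equals one, version of the published SPDC result, with R the largest row norm of A; the normalization of the condition number on the slide may differ slightly):

\[
\kappa = \frac{R^{2}}{\lambda\gamma}, \qquad
T = O\!\left(\Big(n + \sqrt{n\kappa}\Big)\,\log\frac{1}{\epsilon}\right),
\qquad R = \max_{i}\|a_i\|_2 .
\]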
The overall complexity of SPDC is just nothing but the number of iterations times the per-iteration cost.
Their per-iteration cost is mp, where m is the number of samples from the data and p is the dimension of
the data; this is due to the inner products. The overall complexity will look like this.
But what if we sample both from the data and from the features? This is why we propose the Doubly
Stochastic Primal Dual Coordinate method, DSPDC for short. The only difference is that
in the primal update we select one coordinate to update, rather than doing a full update. Also the step
size and the momentum step size are rather different.
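A rough sketch of the doubly stochastic update pattern being described, specialized to least squares with a ridge regularizer; the proximal steps are worked out only for that special case, and the step sizes tau, sigma, theta are placeholders rather than the tuned values (and unbiasedness corrections) from the paper.

import numpy as np

def dspdc_like_step(A, b, x, y, x_bar, u, i, j, tau, sigma, theta, lam):
    """One doubly stochastic primal-dual step (sketch) for least squares + ridge.

    Saddle point:  min_x max_y (1/n)[ y^T A x - sum_i (y_i^2/2 + b_i y_i) ] + lam/2 ||x||^2.
    Samples one data index i (dual side) and one coordinate index j (primal side);
    u caches A^T y / n so the primal step touches only coordinate j.
    """
    n = A.shape[0]

    # Dual proximal ascent step on the sampled coordinate y_i, using the extrapolated x_bar.
    y_i_new = (sigma * (A[i] @ x_bar - b[i]) + n * y[i]) / (sigma + n)
    u += (y_i_new - y[i]) * A[i] / n   # keep u = A^T y / n up to date in O(p)
    y[i] = y_i_new

    # Primal proximal descent step on the sampled coordinate x_j only
    # (SPDC would instead update the whole vector x here).
    x_j_old = x[j]
    x[j] = (x[j] - tau * u[j]) / (1.0 + tau * lam)

    # Momentum / extrapolation on the updated primal coordinate
    # (the talk also mentions a dual momentum step, omitted here for brevity).
    x_bar[:] = x
    x_bar[j] = x[j] + theta * (x[j] - x_j_old)
    return x, y, x_bar, u

# Toy usage with placeholder step sizes.
rng = np.random.default_rng(0)
n, p = 200, 50
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)
x, y = np.zeros(p), np.zeros(n)
x_bar, u = x.copy(), A.T @ y / n
for t in range(1000):
    i, j = rng.integers(n), rng.integers(p)
    x, y, x_bar, u = dspdc_like_step(A, b, x, y, x_bar, u, i, j,
                                     tau=0.5, sigma=0.5, theta=0.9, lam=0.1)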
This is our result. After choosing these step sizes tau and sigma, to ensure this gap is smaller than
epsilon, our DSPDC needs this number of iterations, okay. Again, the dependence on epsilon is log
one over epsilon, which means that it is a linearly convergent algorithm. But we have some
additional terms here.
Now, in the following I will discuss the difference between these two algorithms. Okay, first of all we
need to introduce the capital Lambda_{q,m}, which is essentially the largest [indiscernible] over the blocks of the
data. Suppose we divide the data into m times q blocks. This Lambda is just the largest [indiscernible] over
the blocks. Okay, now we will connect this term to the previous condition number
[indiscernible]. If the data is dense and evenly distributed, and has evenly distributed features, then this
term is approximately like this, okay.
In that case DSPDC requires this number of iterations. This result is interesting: if q
equals p and m equals one, which means that we do not do the primal sampling, meaning that we do
the full primal update, then it recovers the optimal rate of SPDC and the accelerated SPDC.
If q equals one and m equals n, which means that we do not do the dual sampling and do the
full dual update, then it recovers the optimal rate in [indiscernible] Lin’s paper, okay. That means our
result captures the previous results.
>>: Is that [indiscernible] paper [indiscernible], the one that, this one, yeah.
>> Adams Wei Yu: I think it’s the one with you.
>>: With [indiscernible].
>> Adams Wei Yu: Yeah.
>>: Okay.
>> Adams Wei Yu: I think so.
>>: Okay.
>> Adams Wei Yu: Okay, so for the following discussion we will assume, without loss of generality, that the
ratio of the dual sampling, n over m, is larger than the ratio of
the primal sampling, p over q. Otherwise we can just apply a dual version of DSPDC.
>>: Here I think you had better explain that n is the number of samples.
>> Adams Wei Yu: Yeah.
>>: Why we use [indiscernible]…
>> Adams Wei Yu: Yes, yes.
>>: P is the dimension of the parameters…
>>: [indiscernible]…
>>: Q is the [indiscernible] feature that we use.
>> Adams Wei Yu: Yeah.
>>: It is composed as a full batch. The batch that [indiscernible] mini batch [indiscernible].
>>: Yeah…
>>: I’m on the Q.
>>: [indiscernible]…
>>: M and the Q can both be one.
>> Adams Wei Yu: Yeah.
>>: [indiscernible] one example or one feature about it. Then you will be m and p left but if you use a
mini batch or block correlates them. That would be kind of an issue.
>>: [indiscernible]
>>: I think that’s the most important part.
>> Adams Wei Yu: Yeah.
>>: Cool.
>> Adams Wei Yu: Yeah, so, yeah, exactly. If p equals q, which means that we do the full primal update,
then the rate reduces to this one, which is nearly optimal for the dual coordinate methods. The
optimality was proved by Alekh Agarwal and Leon Bottou last year, okay.
>>: Can you recite [indiscernible] paper?
>> Adams Wei Yu: Yeah, yeah, I…
>>: He has a…
>> Adams Wei Yu: Yeah.
>>: Probably a better proof.
>>: Correct proof.
>> Adams Wei Yu: And correct proof.
>>: Yeah, correct proof.
>>: Yes, yes.
>> Adams Wei Yu: If n equals m, which means that we do not do the sampling from the dual, then it
becomes this, which is very likely to be optimal. But no one has proved this at this moment.
>>: You’ll very likely [indiscernible] from Empirical results.
>> Adams Wei Yu: Likely means that we do not see how we can accelerate this further. That’s why
it’s likely to be optimal.
>>: Okay, so the asymmetrical part is you do not have the square root, right? If you compare this one to
the first one you have a square root [indiscernible] square root of n divided by [indiscernible].
>> Adams Wei Yu: Because we assume this one is larger than this.
>>: Okay.
>> Adams Wei Yu: That means this part only…
>>: [indiscernible]…
>> Adams Wei Yu: n over m, and if they are equal then it would be one, right. Actually, n equals m together
with this ratio being larger than that one implies p equals q.
>>: Yeah, but it’s not symmetric. It’s still not symmetric.
>> Adams Wei Yu: If p equals to q then actually this should be one, right.
>>: No, I know…
>> Adams Wei Yu: Yeah.
>>: But it’s not p over q. Just assume the ratios were the same, okay.
>> Adams Wei Yu: The ratios are the same?
>>: Suppose your n over m equals p over q.
>> Adams Wei Yu: Yes.
>>: You know, it could be ten, for example, or a hundred.
>> Adams Wei Yu: Okay.
>>: Then your complexity of the, would be, oh no it’s at the bottom.
>> Adams Wei Yu: That’s symmetric. It’s also symmetric.
>>: They have this square root. You do have a square root over…
>> Adams Wei Yu: P over q, because you assume that n over m equals p over q, right. Then this
would also be equal to n over m.
>>: You’re right, look at the first lambda…
>> Adams Wei Yu: That’s what?
>>: You have a square root [indiscernible].
>> Adams Wei Yu: Which one?
>>: There one [indiscernible] the second of the [indiscernible].
>> Adams Wei Yu: This one.
>>: Yeah.
>> Adams Wei Yu: Okay.
>>: Yeah, square, just go to the coefficient on the [indiscernible] the square root [indiscernible].
>> Adams Wei Yu: Okay.
>>: You have a square root of n and m.
>> Adams Wei Yu: If p equal to q yes then this would be…
>>: Don’t look at that, just look at that.
>> Adams Wei Yu: Okay.
>>: You got a square root, [indiscernible] square root over m.
>> Adams Wei Yu: Yes.
>>: The next one.
>> Adams Wei Yu: This one?
>>: [indiscernible] p over q [indiscernible] square root of p of q [indiscernible].
>> Adams Wei Yu: Yes.
>>: Oh, okay…
>> Adams Wei Yu: But this…
>>: [indiscernible], okay.
>> Adams Wei Yu: Yeah.
>>: Okay.
>> Adams Wei Yu: We have an additional assumption. I mean, actually they would be symmetric if you,
yeah. Okay, so this is a comparison between DSPDC and SPDC. For the second part they
are the same, except that DSPDC has an additional factor of p over q, and this part has an additional p
over q. If p equals q then we exactly recover SPDC’s convergence rate, okay.
But on the other hand it shows that DSPDC has a slower convergence rate, because p is always larger
than or equal to q: this term is always larger than this term, and this term is always larger than this term.
But that makes sense, because in SPDC they do the full coordinate update, but we just sample. That
means basically we need more iterations to converge to the same epsilon, okay. That means, in terms
of convergence, DSPDC is slower than SPDC.
For the per-iteration cost, SPDC has m times p, but DSPDC also has m times p. That means the overall
complexity of DSPDC is this. It is still slower than SPDC.
>>: That’s higher.
>> Adams Wei Yu: Yeah, it’s slower.
>>: Oh, slower…
>> Adams Wei Yu: Slower, yeah, yeah, because here we have the additional p over q term. P is
over…
>>: [indiscernible] larger…
>> Adams Wei Yu: Yeah.
>>: The [indiscernible].
>> Adams Wei Yu: Yeah. That means in general this is really slower than SPDC. Okay, so that may not
be good news. Then we may ask: when will DSPDC have some advantage over SPDC, right?
In the following we will consider two cases in which DSPDC is better than SPDC. The
first scenario is when the data is factorized, which means that the data A can be represented as the
product of two matrices, one is U and the other is V. U is an n by d matrix and V is a d by p
matrix, where d is much, much smaller than the minimum of n and p.
This type of factorized data can always be obtained, for example by singular value decomposition,
non-negative matrix factorization, randomized matrix approximation, or some randomized
data/feature reduction. It is often used as a preprocessing step before training a statistical model, to
reduce the noise and also to speed up the training, okay. Now A is represented by the product
of a very tall matrix and a very fat matrix.
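As one way such a factorization might be produced in practice, here is a minimal sketch using a truncated SVD; the rank d and all names are illustrative, and the talk does not prescribe this particular routine.

import numpy as np

def factorize_data(A, d):
    """Rank-d factorization A ~= U @ V via truncated SVD.

    U is n-by-d (tall), V is d-by-p (fat), with d << min(n, p).
    """
    U_full, s, Vt = np.linalg.svd(A, full_matrices=False)
    U = U_full[:, :d] * s[:d]     # absorb the top-d singular values into U
    V = Vt[:d, :]
    return U, V

# Toy usage: factorize a 1000-by-500 data matrix to rank d = 20.
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 20)) @ rng.standard_normal((20, 500))
U, V = factorize_data(A, d=20)
print(np.linalg.norm(A - U @ V) / np.linalg.norm(A))  # small reconstruction error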
Okay, so in this case we could build a prediction model in the reduced low-dimensional
space using this U. Consider that the original model is a_i times x; the reduced model is just
u_i times x-hat, okay. The reduced model can be [indiscernible], but it’s a little bit harder to
explain, because we have already projected the data into a lower-dimensional space. It is also harder to use for
subsequent analysis, such as predicting on new instances and features [indiscernible]. This is a
tradeoff, a computation versus explanation tradeoff.
Okay, the factorized data model can also be formulated as follows. This is the original model, and
after we do the data dimension reduction it becomes this. Now the factorized data is U
times V, as we mentioned before.
Okay, so when we work with such factorized data we can see some positive points and negative points.
For example, we trade the quality of the data for the factorized structure. The structure should accelerate
all the optimization algorithms for the ERM. However, different algorithms benefit from the
factorized structure to different extents, which we will explain in the following table.
Now, let us see the details of how we implement this. The message is that for the factorized
data the per-iteration cost of DSPDC is dm plus dq, which is much smaller than the cost on the original
data, because d is much smaller than n and p, okay. This is the implementation detail: we just need to
maintain a U bar and a V bar additionally to make it feasible.
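A rough illustration of the kind of bookkeeping being described, under the assumption that what is cached is the product V times x-bar (the exact quantities maintained in the paper may differ): when only coordinate j of x-bar changes, the cache can be refreshed in O(d) instead of recomputed in O(dp).

import numpy as np

# Cached product for factorized data A = U @ V, with U: n-by-d and V: d-by-p.
# For the dual step we need <a_i, x_bar> = <U[i], V @ x_bar>, so cache w = V @ x_bar.
rng = np.random.default_rng(0)
n, p, d = 1000, 500, 20
U = rng.standard_normal((n, d))
V = rng.standard_normal((d, p))
x_bar = rng.standard_normal(p)

w = V @ x_bar                      # O(dp) once, up front

# Suppose a DSPDC-style step changes only coordinate j of x_bar.
j, delta = 7, 0.3
x_bar[j] += delta
w += V[:, j] * delta               # O(d) incremental refresh of the cache

i = 42
inner = U[i] @ w                   # <a_i, x_bar> in O(d) instead of O(p)
print(np.allclose(inner, (U @ V)[i] @ x_bar))   # sanity check: True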
Now this is the comparison between DSPDC and the other algorithms on the factorized data. This
column is the number of iterations. We can see that in this case we assume that m equals q equals
one, which means that in each iteration we only sample one data point and one feature.
The number of iterations of DSPDC is larger than that of SPDC and also larger than that of ASDCA. But the
cost per iteration of DSPDC is the smallest. The overall complexity of DSPDC is the smallest, by a factor of p, okay.
>>: That’s only when the first term [indiscernible].
>> Adams Wei Yu: Yes, so especially when n is large the first term would be dominant. Okay, so,
yes. We can see that for the factorized data scenario DSPDC is indeed faster than the competing
algorithms.
Now, we will talk about the second scenario, which is called matrix risk minimization. In this case the
decision variable X is a matrix. This X matrix is decomposed into p blocks. Each block is a d by d matrix.
Now, the formulation is almost the same as the vector version, except that the inner product is replaced
by the trace [indiscernible], and the [indiscernible] is replaced by the [indiscernible].
Okay, so there are a lot of applications of this formulation, such as matrix trace regression and
distance metric learning. But for most of these scenarios we need to impose an additional
constraint that each block of the decision variable is positive semidefinite, okay.
That poses a computational challenge. That is, in each iteration we need to project the
variable X onto the PSD cone. We need to do an eigenvalue decomposition per iteration, which has
complexity cubic in d, okay.
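For concreteness, the projection step being described can be implemented as follows; this is a standard sketch, with illustrative dimensions, not code from the paper.

import numpy as np

def project_psd(X):
    """Project a symmetric d-by-d matrix onto the positive semidefinite cone.

    Costs one eigenvalue decomposition, O(d^3): clip negative eigenvalues at zero.
    """
    X_sym = 0.5 * (X + X.T)                 # symmetrize against numerical noise
    eigvals, eigvecs = np.linalg.eigh(X_sym)
    eigvals = np.clip(eigvals, 0.0, None)
    return (eigvecs * eigvals) @ eigvecs.T

# Toy usage on a single 200-by-200 block.
rng = np.random.default_rng(0)
B = rng.standard_normal((200, 200))
B_psd = project_psd(B)
print(np.linalg.eigvalsh(B_psd).min() >= -1e-10)   # True: no negative eigenvalues left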
Now, we will skip most of the details and just show the theoretical result. For this matrix risk minimization
the number of iterations is the same, but we only need to do one eigenvalue decomposition per
iteration, while all the other competitors need to do p eigenvalue decompositions, so in overall complexity
DSPDC again wins.
Okay, so again the advantage lies in the first term compared to SPDC. Actually, in practical
scenarios DSPDC is much faster. The reason is that although the number of iterations looks larger than
SPDC’s, all of these are upper bounds, which means that they might be loose. But the cost per
iteration is concrete, which means in each iteration DSPDC indeed saves a factor of p in computation. The
practical time saving by DSPDC is more significant, as we will see in the experiments.
Now, we will show some empirical studies. The first is on the factorized data. In this case we choose the
loss to be the smooth hinge loss. We generated data as follows, which I will skip. We compare our
algorithm with SDCA and SPDC. In terms of the number of arithmetic operations we can see that DSPDC
is faster than the two state-of-the-art competitors, okay.
>>: Which Y axis?
>> Adams Wei Yu: The Y axis is this one. This is the objective value, yeah. The primal objective,
okay, this is the primal objective gap.
>>: Exactly.
>> Adams Wei Yu: Okay.
>>: I think for this [indiscernible] you’re using a [indiscernible] implementation of SPDC.
>> Adams Wei Yu: Why?
>>: Because without one or two there’s a, [indiscernible] operation with SPDC.
>> Adams Wei Yu: Yes.
>>: But you can make it the same as SPDC.
>> Adams Wei Yu: SPDC?
>>: Yeah, per iteration will be same because SPDC [indiscernible] times two.
>> Adams Wei Yu: Okay.
>>: This means there’s no [indiscernible] at all. This is all sparse update.
>> Adams Wei Yu: Yeah, we also use this update. I mean, we use it because it works on the
factorized data.
>>: Okay, factorized data.
>> Adams Wei Yu: Yeah, it’s factorized data not the original formulation.
>>: Yeah, because I’ve seen that in the original formulation SPDC would be [indiscernible] instead.
>> Adams Wei Yu: Yeah, yeah, in the original formulation I think SPDC would be the fastest. Okay, and we
also use lambda equal to ten to the minus two, which means that the condition number is not that large, okay.
But even for a small lambda, which means that the condition number is large, DSPDC is still comparable
with SPDC; actually it’s a little bit faster, yeah. Yeah, and SDCA is the slowest.
Okay, as for the real data, we run the three algorithms on three real data sets: Covtype, RCV1, and Real-sim.
We manually project the original data into a low-dimensional space which only contains
twenty features. We choose the sparse regularization parameter lambda one to be ten to the minus four and
the [indiscernible] lambda two to be ten to the minus two, okay. Again, we can see that DSPDC outperforms
the other two competitors.
Okay, the second experiment is on the matrix risk minimization. We generated data as follows.
>>: [indiscernible] comprised of time [indiscernible].
>> Adams Wei Yu: That’s a good question. Actually, if you compare time, DSPDC might be a little bit
slower. It depends on the implementation, because as I mentioned DSPDC uses more iterations. If you
implement it in MATLAB, for example, you need many more for loops, and then it will be much slower.
Yeah, so we show the arithmetic operations just for the sake of fairness. In that case it is
more fair, independent of how you implement it.
>>: It’s not fair in the sense that you don’t know how much your numerical computation can
be [indiscernible].
>> Adams Wei Yu: Yeah…
>>: For example, yeah, if you, that’s the [indiscernible] that time.
>> Adams Wei Yu: In our case we just enforced the use of [indiscernible] computation, okay. In this matrix
risk minimization the ground truth variable is chosen as the d by d identity matrix. There
are p such identity matrices. In this case we again choose the smooth hinge loss. We set
d to two hundred, and p, which is the number of d by d blocks, to one
hundred. We sample one hundred data points, and lambda is set to zero point zero one.
As we mentioned before, DSPDC in this scenario is even more significantly faster than the
other two competitors under different sampling schemes, no matter whether m is larger than q, equal to q, or
smaller than q, okay. That’s basically because in each iteration we only need one eigenvalue
decomposition, but the other two competitors need p eigenvalue decompositions. That saving of
time is very significant, okay.
This answers the second question, that is, when DSPDC would be much more efficient; we gave two
scenarios. Now, the third question is: what is the fundamental limitation of this type of algorithm? To
answer this question, first we need to formally define what this kind of algorithm is. It is basically that
in each iteration we only sample one coordinate and one data point. Then the points we can add
to the current solution lie in the linear span of either this point, or its gradient, or the inner
product between the data block and this point.
We do it both for the primal and for the dual. This is the abstract description of this type of
doubly randomized coordinate algorithm. Now, in the following we will give a bad case to show
at least how many iterations this type of algorithm would need to solve that
problem.
The bad case is just a quadratic function, both for the primal and for the dual. The data we
use is very sparse; it just looks like a diagonal matrix, okay. In this case we also assume that the
number of dimensions is the same as the number of data points, okay, which means n equals q. Now, we can
easily compute the necessary parameters, for example the lambda, the gamma, and Lambda_{q,m}, and by
the theorem of DSPDC it would need this number of iterations to converge to the optimal
solution. On the other hand, because this formulation is quadratic, we have a closed form solution. It’s
as simple as this, just a few lines of linear algebra.
Now, we discuss why this case is bad for this type of algorithm. The first observation is that the
closed form, I mean the optimal solution, has non-zero entries both in the primal and in the dual.
Okay, but for the algorithms we consider, only when there is a primal-dual interlace in the sampling
sequence can we add one non-zero entry to both the primal and the dual solution. But this is a very rare event.
That means that only after a lot of iterations can we add one more non-zero entry to the solution.
Now, this is an example of the sampling sequence, and I will explain what the interlace is. Each bracket
contains the primal and dual indices sampled; this is for iteration one, this is for iteration two, and
so on and so forth. Only when we first sample an index in the primal, and later on we sample an
index in the dual that coincides with it, can we increase the number of non-zeros of the solution.
We prove that this is non-trivial. We analyze the probability of getting such an interlace; it is a
very rare event. We will skip the technical details here and show the result as follows.
Suppose we apply any algorithm of this type to the saddle-point problem, and the output is x_t and y_t. Then
to make sure that the primal-dual gap is smaller than epsilon, we need this number of
iterations to make it converge, okay. Comparing with our upper bound, unfortunately, we show that
our bound is not tight, by an n factor. If n is small they would be almost the same, but if
n is large the gap is still large.
There is an open question: can we close the gap? Our conjecture is that the upper bound cannot be
made tighter, but maybe we can give a worse case to make the lower bound tighter. Okay, so this answers
our third question: what is the fundamental limitation of this type of algorithm?
>>: [indiscernible] type?
>> Adams Wei Yu: It’s just a conjecture because, I mean, when we constructed this case we felt that
we could still find room to make it worse. Actually, okay, in this case on average we need to sample n
times to increase the number of non-zero components of the solution by one. We think that if we could
construct a case where only after sampling n squared times can we increase the number of non-zeros,
then the bounds would be tight. We think that’s possible. Yeah, okay, so that’s our remaining open question.
I remember that in [indiscernible] paper the bound of SPDC and the lower bound still
have some gap, right? A square root of n or something…
>>: [indiscernible]
>> Adams Wei Yu: Okay.
>>: [indiscernible] exact.
>> Adams Wei Yu: Okay, so the take-away message of this talk is that we propose a doubly stochastic
primal dual coordinate method for the empirical risk minimization problem. This algorithm has a linear
convergence rate. It has the lowest overall complexity when applied to the factorized data scenario or to
the matrix risk minimization scenario.
We also give a lower bound on the iteration complexity for this class of primal dual stochastic coordinate
methods. There is an open problem remaining: can we close the gap between the lower bound
and the proposed upper bound? Thanks.
[applause]
>>: Great talk.
>> Adams Wei Yu: Thank you.
>>: More slides, looks like?
>> Adams Wei Yu: Oh, references.
>>: Ah, references.
>>: That’s an upper bound, lower bound issue. You [indiscernible]
>>: [indiscernible] characters.
>> Adams Wei Yu: Yeah.
>>: [indiscernible] optimal. What about it’s not optimal? But what I know about has problem. This
reoccurs.
>> Adams Wei Yu: Yeah, actually, for this theoretical result, I mean the upper bound, we proved it almost
one year ago. We have also thought about the lower bound for a few months. Our intuition is
what I mentioned: I think the case could be made worse.
As for why the gap, I mean why the bound cannot be improved: because of SPDC. SPDC is
optimal, right.
>>: Yeah, that’s optimal.
>> Adams Wei Yu: So under our condition DSPDC would also be optimal. I mean, we have no
room to improve, right. This is our intuition.
>>: [indiscernible] the same, right?
>> Adams Wei Yu: Yeah.
>>: Yeah, I think the you’re, the upper bound looks more safe for me. But more the [indiscernible]
they’re easy to be times one. But the lower bound can be [indiscernible].
>> Adams Wei Yu: Yeah, yeah, actually in our original construction, when we sample,
one interlace would increase the number of non-zeros by one. Actually, I think it should be one
over n squared, but in our [indiscernible] it is just one over n. That’s why, I mean, yeah.
We finished this a while ago but we did not submit it, because we want to close the gap but we
didn’t manage to do that at this moment, yeah.
>>: [indiscernible] to each. Close the gap [indiscernible]
>>: [indiscernible]
[laughter]
>>: [indiscernible] that facing [indiscernible] is no good. But I like to [indiscernible] the three
[indiscernible].
>>: [indiscernible].
>>: [indiscernible].
>> Adams Wei Yu: Okay.
>>: Yeah, so that’s okay, yeah so the [indiscernible].
>> Adams Wei Yu: Which bound?
>>: That’s the lower bound.
>> Adams Wei Yu: The lower bound. But Lan has already finished that.
>>: Yeah, yeah.
>>: But somehow he [indiscernible]…
>>: [indiscernible]…
>> Adams Wei Yu: Actually Lan’s [indiscernible] is a little bit, I mean, easier than
Agarwal’s. Agarwal, I mean the lower bound that he proves, is for a more general class of algorithms. Lan
only focused on a smaller class of algorithms. That’s why it’s easier to achieve.
>>: Yeah.
>> Adams Wei Yu: Yeah.
>>: [indiscernible] also try to [indiscernible].
>> Adams Wei Yu: Yeah, actually we tried that too. Actually, we translated the method [indiscernible].
But in terms of time it might not be faster. But I mean for the matrix case it would
be faster, because we save a lot of time in doing the eigenvalue decompositions, no [indiscernible] C plus
plus.
>>: [indiscernible] so [indiscernible] simpler [indiscernible]. You can compare the exact time
[indiscernible].
>> Adams Wei Yu: Yeah, yeah, it’s a good suggestion.
>> Lin Xiao: Okay.
[applause]
>> Adams Wei Yu: Thanks.
>>: [indiscernible] again.
>> Adams Wei Yu: Okay.