>> Ofer Dekel: Next speaker is Hariharan Narayanan and the talk is
entitled Testing the Manifold Hypothesis.
>> Hariharan Narayanan: Okay. Hi. Thank you for the invitation to
speak here. This is work with Charles Fefferman and Sanjoy Mitter. So
as we all know, nowadays we often encounter data which has many
dimensions. And often the dimension is comparable to or larger than the
number of samples we can analyze.
And the analysis of such data is associated with some specific
phenomena, one of which is well known: the curse of dimensionality.
Among its aspects is the fact that the sample complexity of function
approximation grows exponentially in the dimension.
And there's also a comparable algorithmic cost as the dimension grows,
which is also exponential. What is also sometimes mentioned is that
there are certain blessings associated with data analysis in high
dimension. But these are actually not so much blessings as ways of
getting around the curse, and they generally come up in the analysis
rather than in the actual bounds.
So it becomes easier to analyze certain phenomena because of laws of
large numbers that come up in asymptotic analysis. And there's also
this phenomenon of concentration of measure where in high dimensions
random quantities start behaving like deterministic quantities.
And that is also beneficial when it comes to analysis. So one way of
getting around the curse of dimensionality is what is called manifold
learning. And this is based on the manifold hypothesis, which is the
hypothesis that this high dimensional data actually lies in the
vicinity of a low dimensional manifold.
And if this were true, we're often able to prove bounds that do not
depend on the ambient dimension of the data but rather on the dimension
of the manifold. This is the hypothesis that has gained a lot of
traction over the past one or two decades. And there's a great deal of
work that is based on this hypothesis; for example, in Isomap you
develop maps that reduce the dimensionality of the data, assuming that
it actually comes from a low dimensional surface embedded in high
dimensions. But the question of whether or not data actually lies near
a manifold is less well understood. And that is what I would like to
talk about today.
I would like to come up with a test that says whether or not high
dimensional data lies in the vicinity of a low dimensional manifold.
So one comment that should be made when addressing this question is
that this is very much dependent on the representation of the data. So
there may be representations where data lie on a manifold and
representations where they don't. For example, if you have a scene and
you move a camera around the scene in a smooth manner, and you
parameterize the image that you see by the position of the camera, then
that is something smooth and it would lie on a manifold; the position
or orientation of the camera lies on a manifold. But if the actual
scene is cluttered and you try to compare two images that are shifted
by a little bit using, say, something like the L2 distance, that is not
going to behave smoothly.
So there may be representations where data is on a manifold and other
representations where it's not. But here we'll be talking about a
fixed representation. And now I'll move on to the statistical and
algorithmic aspects of the question of testing whether data lie near a
manifold.
So before I move on, I need to tell you what I mean by a manifold in
some quantitative sense, because otherwise you can always fit a very
complicated curve to any number of points. So you have to have a certain
trade-off between how complicated the curve is and how many points you
have and things like that. So because of that, let's introduce a few
quantities.
So we'll define the reach of a set to be the largest number tau such
that, for any r less than tau, if you take a point at distance r from
the set -- so here's the set -- then that point has a unique nearest
point in the set. In this case, it's this point. So, for example, if
you had a straight line, then the reach is infinite, because any point
has a unique nearest point on the line. But if you take a circle of
radius tau, then the reach of the circle is tau itself, because the
center of that circle does not have a unique nearest point; it has
several nearest points, and therefore the property fails at tau. So
here is a curve of large reach. Basically the reach is something like
this: it's the radius to which you can thicken the curve before the
boundary of that tube starts behaving badly. And that happens at some
point like this.
And here this has much smaller reach. And we'll define G(d, V, tau) to
be the family of d dimensional submanifolds of the unit ball which have
d dimensional volume less than or equal to V and reach greater than or
equal to tau.
So here's our formalization of the question. We're going to assume that
the data is drawn i.i.d. from some unknown probability distribution P
in a Hilbert space, and x1, x2, et cetera, are i.i.d. samples, and
based on these samples you have to come up with an algorithm that says
yes or no, that there is a manifold or there's not a manifold.
And so, given an error epsilon, dimension d, volume V, reach tau, and
confidence 1 minus delta, our question is: is there some algorithm that
will take this many i.i.d. samples from P and then output yes or no
depending on whether there is a manifold belonging to this class,
namely having bounded volume and reach, such that the average squared
distance of a random point to this manifold is less than epsilon.
So we are asking for the mean squared distance to the manifold to be
less than epsilon. That's our notion of nearness. So this has two
aspects.
First, how much data do you need before you can even answer this
question in a statistical sense?
Because if you have too little data and ask whether this data lies on a
very complicated curve, then just on the basis of that data you would
probably always be able to fit the curve, but it would not tell you
anything about the actual probability distribution. As you get more and
more points, your answer might well come out false.
This is the issue of generalization error. And we have a result that
says that the sample complexity depends only on the intrinsic
dimension, volume, and reach of the manifolds you're fitting, but not
on the ambient dimension of the Hilbert space where the data lies. The
number of samples you need to answer the question really does not
depend on the ambient dimension, and that is good news. And secondly,
there is the algorithmic question. So given a certain number of data
points, how do we test whether or not there is a manifold that lies
close to these data points in the average squared sense?
And here we are allowing ourselves a certain slack; this is to make the
problem easier for us. We don't necessarily require the volume to be
less than or equal to V. Any volume less than or equal to CV is okay
with us, but it would be nice to make that constant 1 plus epsilon.
Now I'll talk about the results we have on the sample complexity and
move on to the algorithm. So here's some more notation. Let L(M, P) be
the expected squared distance to M of a random point drawn from P.
And we also define the empirical loss, L_emp(M), to be the sum over i
from 1 to s of the squared distance of x_i to M, divided by s.
This is parallel to true error and empirical error: the loss
corresponds to the true error, and the empirical loss corresponds to
the empirical error.
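In symbols, this is my rendering of what is on the slide:

    % True loss: expected squared distance from a P-random point to M.
    \[
      L(\mathcal{M}, P) = \mathbb{E}_{x \sim P}\, d(x, \mathcal{M})^2
    \]
    % Empirical loss over the samples x_1, ..., x_s.
    \[
      L_{\mathrm{emp}}(\mathcal{M}) = \frac{1}{s} \sum_{i=1}^{s} d(x_i, \mathcal{M})^2
    \]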
And we define the sample complexity to be the smallest number s such
that there exists some rule A which, when given x1, x2, up to x_s drawn
i.i.d. from P, will produce a manifold (this is the manifold that the
rule produces) whose performance in terms of loss is not much worse
than that of the best possible manifold in the class, with high
probability.
So this is the usual formulation of statistical learning theory. So
that's our sample complexity. And it's well known that in order to get
upper bounds on the sample complexity it's good enough to get uniform
bounds relating the empirical loss and the true loss. The reason is
that suppose I could guarantee that, when I take s samples, the
empirical loss -- here the manifolds are on the x axis -- is uniformly
close to the true loss over all manifolds. Then instead of optimizing
the true loss, which is what I want to do, I could optimize the
empirical loss and get an answer that is pretty close, if I knew that
the two are uniformly close with high probability.
So getting a uniform bound of this kind is a sufficient condition for
proving sample complexity bounds.
And we are able to prove such uniform bounds over the space of
manifolds and show that if s is larger than V times (1 over epsilon
plus 1 over tau) to the d, plus log of 1 over delta, all divided by
epsilon squared, then with probability at least 1 minus delta, the loss
of every manifold in the class of interest is close to the empirical
loss over the samples. So this is a uniform bound over the space of
manifolds.
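As a sketch, one way to render the spoken statement in symbols; the
placement of the exponent d and the constant C are my reading of the
slide, not a verbatim transcription.

    \[
      s \;\ge\; \frac{C\left( V \left( \tfrac{1}{\epsilon} + \tfrac{1}{\tau} \right)^{d} + \log \tfrac{1}{\delta} \right)}{\epsilon^{2}}
      \;\Longrightarrow\;
      \Pr\!\left[ \sup_{\mathcal{M} \in \mathcal{G}(d,V,\tau)}
      \big| L(\mathcal{M},P) - L_{\mathrm{emp}}(\mathcal{M}) \big| \le \epsilon \right] \ge 1 - \delta
    \]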
And the way we prove this bound is by approximating manifolds by point
clouds, then considering a point cloud as the center set of a k-means
problem, and then getting a uniform bound over k-means, which is of
independent interest.
So the idea is that you have a manifold, but you can actually
approximate it, at a sufficiently small scale, by a set of points, and
then declare that the object of interest is not this manifold but the
discrete set of points. And if you then talk about distances to the
discrete set of points, you're actually in the setting of k-means.
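A minimal sketch of that reduction, with hypothetical names and plain
numpy: once the manifold is replaced by a finite point cloud, the
empirical loss is exactly the k-means objective, the mean squared
distance to the nearest center.

    import numpy as np

    def empirical_kmeans_loss(samples, centers):
        """Mean squared distance from each sample to its nearest center.

        Replacing the manifold by the point cloud 'centers' turns the
        empirical loss L_emp into exactly this k-means objective.
        """
        # Pairwise squared distances, shape (num_samples, num_centers).
        diffs = samples[:, None, :] - centers[None, :, :]
        sq_dists = np.sum(diffs ** 2, axis=-1)
        # Distance to the nearest center, averaged over the samples.
        return np.mean(np.min(sq_dists, axis=1))

    # Toy usage: samples in a high-dimensional space, a crude 10-point cloud.
    X = np.random.randn(200, 50)
    C = X[np.random.choice(200, 10), :]
    print(empirical_kmeans_loss(X, C))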
And here, instead of proving a uniform bound over manifolds, you can
just as well prove a uniform bound over these point clouds, and this is
in fact a stronger statement, because every manifold can be
approximated by a point cloud in distance, but the converse is not
true. So one is proving the harder uniform bound when one proves this
for k-means.
And the question for k-means, after a little bit of linear algebra,
becomes a question of proving a uniform bound over functions of the
form minimum over i of the inner product of a_i with x. This is because
the function for k-means is the minimum over i of the norm of a_i minus
x squared -- the squared distance to the nearest center, which is the
k-means loss -- and you just open it out, get rid of the x squared term
because that is common, and get rid of the a_i term because that is
fixed, and you get something like this. This, however, is in high
dimensions, because you're working in a Hilbert space whose dimension
we have no control over.
So in order to bring this down to a dimension we have control over, we
use the Johnson-Lindenstrauss lemma and randomly project down to a much
lower dimension, and by the Johnson-Lindenstrauss lemma all the
essential geometric characteristics are preserved, so we can do our
analysis in this lower dimensional space.
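A minimal sketch of that projection step, assuming a Gaussian random
matrix as the Johnson-Lindenstrauss map; the target dimension k here is
a free parameter, not the exact value used in the proof.

    import numpy as np

    def jl_project(points, k, seed=None):
        """Randomly project points from R^n down to R^k.

        By the Johnson-Lindenstrauss lemma, pairwise distances are
        preserved up to a (1 +/- eps) factor with high probability when
        k is on the order of log(#points) / eps^2.
        """
        rng = np.random.default_rng(seed)
        n = points.shape[1]
        # Gaussian matrix scaled so squared norms are preserved in expectation.
        R = rng.normal(size=(n, k)) / np.sqrt(k)
        return points @ R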
And that's essentially how we get a sample complexity bound of the
order of k over epsilon squared.
And so this is controlled by the complexity of that function class,
which is called the fat-shattering dimension. And we can show that this
fat-shattering dimension is bounded in this way in terms of k and
epsilon for k-means, and this leads to a sample complexity that is k
over epsilon squared, up to some logarithmic terms.
So this uses a number of results in empirical process theory. So for
k-means there was a known lower bound, of the form k over epsilon
squared plus log of 1 over delta over epsilon squared, due to
[inaudible]. What it says is that you really need this many samples
before, if you do k-means on the sample, you can expect to get
something meaningful. If you have fewer than this number of samples,
then just running k-means on them is not going to give you something
good in the long run. So this is a lower bound. An upper bound was
proven by [inaudible], which is of the form k squared over epsilon
squared. This means that k squared over epsilon squared samples
suffice: if you have this many samples and you do k-means on them, you
can expect to get something reasonable. And our bound improves upon
this when k over epsilon is reasonably small, because there's a log in
front of it.
So we get something new for K means as well. Now let me move on to the
algorithmic part. A consequence of this uniform bound is that instead
of fitting a manifold to a probability distribution, which is what we
were asked to do, we can instead just sample that probability
distribution and fit a manifold to the samples, because of this uniform
bound that we have.
And this allows us to do a certain dimension reduction: you can in fact
search for manifolds within the affine span of the samples, and you
don't have to search for manifolds in the entire Hilbert space. So this
uniform bound leads to a dimensionality reduction of that kind.
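A minimal sketch of that reduction, assuming plain numpy: project each
point onto the affine span of the samples by translating by the sample
mean and using an orthonormal basis of the span obtained from an SVD.

    import numpy as np

    def project_to_affine_span(samples, points):
        """Project 'points' onto the affine span of 'samples'.

        Any manifold fit to the samples can be searched for inside this
        span, whose dimension is at most the number of samples, rather
        than in the full ambient Hilbert space.
        """
        mean = samples.mean(axis=0)
        # Orthonormal basis for the linear span of the centered samples.
        U, S, Vt = np.linalg.svd(samples - mean, full_matrices=False)
        basis = Vt[S > 1e-10]               # rows span the centered sample space
        coords = (points - mean) @ basis.T  # coordinates in that basis
        return coords @ basis + mean        # back in the ambient space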
And so now let me tell you something about the algorithmic question.
Here the question is: given n points x1 to xn, is there a manifold with
bounded dimension, volume, and reach such that the average squared
distance of a point from that collection to M is less than epsilon? So
we have to do an optimization over the space of manifolds, given n
explicit points. There's no randomness now.
So we have a theorem that says that, yes, you can do this. It has an
exponential dependence on all the parameters of the manifold. But it
will output yes if there is a manifold in the class of interest, and
it will output no if there's no manifold, not just in the class of
interest but in a somewhat larger class of manifolds.
And here's a rough outline of how the algorithm goes. So first you
observe that any manifold whose dimension is bounded by d, volume by V,
and reach bounded below by tau is almost contained inside an affine
subspace of dimension N_P, where N_P is a packing number of the
manifold. And the reason is that you can take a fine net of this
manifold -- lots of points on it -- and take the affine span of those
points, and now the manifold will almost lie in that affine span.
And this reduces the dimension to N_P. And now we're working in some
relatively low dimensional space, and we're going to exhaustively
search among all candidate manifolds in some sense. Of course the space
of candidate manifolds is a continuum, so you can't do that with a
computer, but you can look at a suitable finite part of it, and that's
going to be the idea. So you first take an evenly spaced set of points;
you're going to have to take lots of these. These are kind of like
coarse approximations of the manifold we hope to find. Then in the
neighborhood of this you define a vector bundle, which for every point
in this neighborhood assigns an affine subspace of dimension n minus d,
where d is the dimension of the manifold and n is the dimension of the
ambient space. These subspaces are meant to be normal to the manifold,
so they have complementary dimension. But there is one subspace for
every point in this neighborhood, so there are lots of these; this is
called a vector bundle over the tubular neighborhood. The base space is
this tube, and the fibers all look like the fibers of a normal bundle.
And then we define the putative manifold as the set of zeros of a
specific section of this vector bundle. A section of this vector bundle
is a function which, for every point, assigns a value in the
corresponding fiber. So the set of all points which are mapped onto
themselves, that is, to the zero vector in their fiber, corresponds to
the zeros of the section.
So this is a specific construction. And then once you have this
putative manifold -- this is the putative or guessed manifold, obtained
from the zeros of the section -- you try to adjust it, in an iterative
way, to get the actual manifold. And there, locally, we're going to use
function approximation, optimization over functions, to get what are
called local sections, that is, things that are defined locally on the
manifold.
And this involves these inequalities, which I won't go into. And
finally you patch all of these local sections together to get a
manifold. So here is one local section, and here is another local
section. Unfortunately, they don't agree, and that's the whole point:
you take this vector bundle and you average them out using the vector
bundle, and then you get a global manifold. So that's the algorithm.
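A toy sketch of the "local section" idea under my own simplifications:
near a base point, estimate a d-dimensional tangent plane by local PCA
and write the nearby data as an approximate graph over that plane. This
is only an illustration; it is not the bundle-section optimization the
talk describes.

    import numpy as np

    def local_section(points, base_point, d, radius):
        """Toy local chart near base_point.

        Returns (tangent coordinates, normal offsets) of the points
        within 'radius' of base_point, i.e. the local data expressed as
        an approximate graph over an estimated d-dimensional tangent
        plane.
        """
        nearby = points[np.linalg.norm(points - base_point, axis=1) < radius]
        centered = nearby - nearby.mean(axis=0)
        # Principal directions of the local neighborhood.
        _, _, Vt = np.linalg.svd(centered, full_matrices=False)
        tangent, normal = Vt[:d], Vt[d:]
        return centered @ tangent.T, centered @ normal.T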
And some concluding remarks. We have an algorithm, which is more of a
theoretical framework, for testing the manifold hypothesis, and we have
improved sample complexity bounds for k-means. So we'd like to make
this practical and test it on real data, and understand how to make
this nonparametric, where you vary the reach and volume in some
intelligent way with the data. And how would one improve the
efficiency, for example through better optimization over sections of
the vector bundle, and how does one understand the role of topology in
this optimization question. Thank you.
[applause]
>> Ofer Dekel: Time for questions.
>>: The condition you use to define reach, how is it different from curvature?
>> Hariharan Narayanan: So the reach involves global aspects also. So,
for example, if you took two parallel lines at a distance of R from
each other, the reach of that set is R over 2, because there are points
at a distance R over 2 that have two nearest points in that set. So
it's not just local; it measures nearness to self-intersection.
>> Ofer Dekel: Any other questions? So I had a question actually. So it
would be really cool -- I mean, you described finding the manifold.
And if I have some classification task or some other decision theoretic
task, and I want to exploit the fact that I assume the data is on the
manifold, can I take this approach and couple it with, let's say, a
classification algorithm and then take advantage of that?
>> Hariharan Narayanan: Hopefully, yeah. At the present moment that
guessed manifold, the putative manifold, is implicitly defined as the
zero set of a function. So in order to actually get points on the
manifold you'd need to do some iterative optimization. But I think -- I
hope so, too, yeah.
>>: So in the question concerning testing the distance to the manifold,
you had both the reach and epsilon. It seems the error and the reach
trade off against each other somehow; you could relax the reach, and
the error could potentially be used to compensate.
>> Hariharan Narayanan: That is true. But that trade-off, I think,
depends on the ambient dimension. And I don't want to bring in the
dimension of the Hilbert space, because actually all of these results
work in countably infinite dimension, too. So I agree with you. But if
you want to try to fill space in a Hilbert space, then you really need
a lot of volume. So, yeah, I agree with you, but the bounds would
depend on the ambient dimension, yeah.
>> Ofer Dekel: Let's thank the speaker. Sorry, one more question.
>>: Just one brief question. Since you have these constants C that
enter into the sample size, can you talk about how big those are? If
they're like 10 to the 30th it's not practical, but if it's 5, then it
is.
>> Hariharan Narayanan: So those constants depend on the intrinsic
dimension of the manifold. And they're exponential in that.
>>: If we go back to some of the original Isomap papers, where they
were showing two dimensional manifolds in 3-D space -- so for d equals
2.
>> Hariharan Narayanan: I hope that it won't be too bad. They're
explicit. They're not -- they're things that you know where they come
from.
>>: Right. It would be nice to know how big that would be.
>> Ofer Dekel: Okay. Let's thank the speaker one more time. [applause]