
>> Yuval Peres: Good morning, everyone. We're very happy to have Professor Gunnar Carlsson from Stanford University, who's here to tell us about topology and data.
>> Gunnar Carlsson: Thanks very much for the invitation.
So what I want to talk about today is methods that we've been working on for the last few years for understanding what we call qualitative properties of data.
So in particular, what am I thinking of in terms of data? I'm thinking of possibly finite metric spaces; that is to say, where we have a data matrix and regard the set of columns, maybe with the L2 metric, maybe with a correlation metric, as a finite metric space.
But actually more generally even networks or any kind of set where there's a notion of similarity
that one can define, a quantitative notion of similarity.
And let me point out that this idea of studying finite metric spaces is actually quite an old idea. It goes back to D'Arcy Thompson, a biologist working in the early part of the last century, who was interested in studying the form, and the degree of variation of forms, in biological organisms as you move from one group to a nearby group.
So here are two examples of fish. You can see that they're slightly different. On the other hand,
they share a lot as well. So you can see the mouth, it looks roughly the same, the fins up here are
very similar.
And so his idea was let's do the following: Let's put a few landmark points, maybe we'll put
something at the tip of the mouth, the tip of the fin, the eye, and so forth, have a list of maybe 10
or 15 of these landmark points, study the distances between those points, and then ask can the
degree of variation in those distances, in those configurations of distances give us a good
summary of the degree of variation among the animals in a particular species, for example.
So statistical shape theory, one form of it, actually attempts to study this, exactly this problem
that D'Arcy Thompson posed; that is to say, we're trying to study what happens when you have
configurations of -- finite configurations of points in a Euclidean space.
So, again, they use a small set of landmarks and the distances between them, and what they then do is they actually build a whole space that consists of all the possible configurations of, say, K points in RN.
Now, if you can understand, then, the set of points in this space of all configurations, if you can
understand how this set of points coming, say, from a particular species of fish fits in there, then
that would be regarded as a way of trying to understand the degree of variation in there.
Now, this space sigma -- actually the whole set of all these configurations -- can be given a metric in various ways. So that's a second-order object here, where we're talking about distances between metric spaces, if you like.
So the space is defined as follows: you take the set of K points in Euclidean N-space and you factor out rigid motions; say two configurations are the same if there's a rigid motion carrying the one to the other. And they then manage to metrize that space.
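In symbols, a minimal sketch of the quotient just described, following the speaker's description and factoring out only the rigid motions E(N) (Kendall's actual shape spaces also quotient by scaling, which isn't shown here):

```latex
% Configurations of K labeled points in R^N, modulo rigid motions E(N).
\[
  \Sigma_K^N \;=\; \bigl(\mathbb{R}^N\bigr)^K \big/\, E(N),
  \qquad x \sim y \iff y = g \cdot x \ \text{for some rigid motion } g \in E(N).
\]
```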
Now, Gromov has recently introduced a new metric which doesn't restrict itself to Euclidean
space. And this is a very complicated object, I would say, but nevertheless it is a notion of a
metric on the set of all metric spaces.
Okay. So now of course many kinds of data are effectively analyzed using a notion of distance or metric. For Euclidean data there's a particular choice that one makes coming from the Euclidean formulas, but, on the other hand, in genomic analysis one studies something more like a Hamming distance, variants on a Hamming distance between sequences of members of an alphabet.
But if you think about it, the theory that we just talked about, the one that David Kendall and then Misha Gromov set up, doesn't really yield useful information for large numbers of data points. The computations become outrageous very fast. So if you're talking even about hundreds of data points, you're now looking at trying to understand properties of a hundred-dimensional space. It's a difficult way to go.
And the Gromov-Hausdorff distance, the distance that Gromov introduced, is actually quite difficult to compute. Furthermore, one is often looking in your geometric object for some kinds of qualitative features.
So another comment about these things, though, is that sometimes in this analysis of data the
metrics are very well justified theoretically. If one thinks, for example, of physics, a lot of
physical systems, the metrics really make a lot of sense and you believe that you should take
them very seriously.
But on the other hand sometimes it only reflects an intuitive notion of similarity. So, for
example, in the genomic situation, your Hamming distances, you say, well, I believe something
is similar if there are only a few changes between the two of them.
And so here what one thinks of, one thinks of the distance as reflecting a notion of similarity. So
nearby points are similar, far apart ones are different.
However, I want to say we don't trust the large distances. We don't believe that those are necessarily significant. Because all we are building into the system is the notion of similarity, the things that are very close, where the distances are small.
But, in fact, the large distances, you know, unlike a physical situation, may really not be very significant. In other words, if you think about two genomic situations, one which differs from the other in a hundred slots and another which differs in 150, it's not clear that you could really quantify the notion of difference using that kind of arbitrarily defined notion of distance.
Furthermore, even in small distances we only trust them a little bit. If you have a pair of points over in one part of a metric space that's a distance 1.5 apart and another pair that's 1.8 apart, we really don't think that that necessarily tells us that those two points over there are really much closer than these two.
And it's illustrated here: imagine a sequence where on the left those points are different because the yellow is replaced by black, which is slightly longer over here, so these would look like they're closer. But in terms of a sequence, just the length that's involved in representing a particular phenomenon is not necessarily indicative of the significance of it.
So actually what we're thinking about is something more like a 0/1-valued quantity: alike or not alike. Similar or not similar.
Okay. So when the metrics aren't based on theory like that, it says you don't want to attempt to understand very refined notions concerning the metric, such as curvature, for example. Because if you do that, since you believe that large distances are kind of not to be trusted anyway, it means it's garbage in and you expect to be getting garbage out.
So it says instead that you should try to study some properties which are a bit robust to some
changes in the metrics. And what are some such properties? Topology actually studies the
idealized versions of such properties.
So, in fact, geometry can be defined as the study of metrics and, you know, informally topology studies those properties that remain after you allow yourself to stretch and deform but not to tear. And mathematically what that suggests is: what happens if we permit arbitrary rescalings of metrics? What remains about the geometric object after we do that?
Your initial reaction is that nothing remains, because, after all, we can't really consider distance between pairs of points as a 0/1-notion: of course, points are at distance 0 apart if they're equal and otherwise they're not.
So instead what's done on the theory side here is you say, well, actually what we're going to do is
instead consider the notion of a point and its distance to a set instead of distances between pairs
of points.
So when you do that -- here we've got a point, the black point, that's a distance 0 from the blue open ball -- that's topology. Topology keeps track of distances to sets being equal to 0.
So what are things that remain, then? What are some properties that remain after you permit yourself that? Well, the number of connected components remains. Here, three connected components: no matter what metric you put on this, no matter how you rescale a metric here, you will always retain the fact that there are three connected components. We say two points are in the same component if they're connected by a path.
So the set of connected path components is actually a topological invariant, and we can even
start talking about higher order versions of that now to try to understand what happens. Suppose
that we know that a space consists of a single connected component, what more can you say
about it?
Well, you can say the following. Maybe it's connected in more than one way. So here I've got a
connected space. Think of the gray thing here as the connected space, and I've got two points in
it. It's path connected because I can draw a path from the one point to the other one, but in fact I
can draw a second path like that, and I can even do the following. I can say let me look at those
paths and let me see can I identify when two of those are the same, are connected to each other
as paths.
So before, when we talked about a space being connected, if I draw a path from a point to another point, they're in the same component; here I'm going to say two paths are the same if I can draw a path of paths between them.
So you'll see there that what I'm saying is that that upper orange group, those are all the same, these guys up here, and the reds are all the same, but it turns out that the oranges are not the same as the reds. As I start trying to deform or stretch those orange paths, there's no way I can get past that obstacle in the middle.
>> What does path mean if the space is discrete?
>> Gunnar Carlsson: Okay. I should say at the moment I'm giving you the idealized story. I'm giving you the topology story. And the whole import, then, of the thing is going to be how do you move that over into the other setting.
Okay. So counting or parameterizing this sort of redundancy of connection reflects properties of the space. So here this one is parameterized by the integer 1, for example, because it loops once around. And this one represents 2 because we go around twice, and this one represents minus 1 because it's going around in the other direction.
And there are notions of parameterizing higher order things than paths. So you can talk about Kth order versions, where components are zeroth order. The above calculation, the one for paths, is first order, but then there's second, third, fourth, and so on, where the second order captures two-spheres. And in a sense one can even talk about a notion of a so-called power series expansion of a space, with better and better approximations given as you calculate more and more of these parameterizations.
So why might you want to understand the higher order structure in the space? Well, suppose, for example, we're given a continuous time series, so something moving along a sine wave, for example. And then consider the set of 2-vectors of the form a point and then the one that follows one time unit later. Call that the delay embedding.
So this set will trace out a circle as the points move through this periodic motion. So the set of configurations of a point and its follower will produce a circle. Okay. So the presence of periodic behavior here in the time series is reflected by this first Betti number, as we call it, this count of the number of independent loops, computed on the image of the delay embedding.
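As a minimal sketch of that delay embedding (not the speaker's code), assuming a sampled sine wave and a delay of roughly a quarter period:

```python
# Pair each sample of a periodic signal with the sample one delay later;
# the resulting 2-vectors trace out a circle.
import numpy as np

t = np.linspace(0, 20 * np.pi, 2000)
x = np.sin(t)

delay = 50                                # roughly a quarter period here
points = np.c_[x[:-delay], x[delay:]]     # 2-vectors (x(t), x(t + delay))

# The embedded points lie essentially on the unit circle:
radii = np.linalg.norm(points, axis=1)
print(radii.min(), radii.max())
```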
Now, this could also be detected by Fourier methods, but supposing the motion isn't periodic but
just recurrent, you know, it doesn't necessarily move in a periodic fashion. So the image would
then still be a circle or perhaps an annulus, but the Fourier methods would have difficulties.
So the topological structure of the image of this delay embedding is actually reasonably robust to noise.
Also we might ask how do we go about recognizing letters. Well, we look at these, so we look at
the letter A, and we're trying to recognize it from the letter B. Now, of course, the A can appear
in many different ways. You can stretch and deform the A, and we still recognize it as an A.
You can similarly do the B. We'll recognize it as a letter B.
In fact, we know there are many different fonts for that. So but the thing that remains is that the
A has one loop and the B has two loops, and so in fact the number of loops turns out to be a
topological invariant. And so in fact it distinguishes here.
So, again, we would say the first -- the A has a first Betti number of 1 and that B has a first Betti
number of 2.
So it turns out now there's a formalism, again, on the pure side, for actually defining how many loops there are of the various orders, and it's called homology.
And what I want to just show you here is that, although the subject that does this, called algebraic topology, has a very abstract reputation, nevertheless it reduces to matrix calculations for matrices over a Boolean field, a finite field.
So in this case what you do here, the space that I'm looking at up there is given to me as a
simplicial complex. So you see we have a triangle up here, and then we have this path here too.
So there are two connected components, and then there's a single loop here.
You can encode all that information in this matrix. So here's a matrix which has all the edges -- you'll see AB there, AC, BC, and DE -- and then the vertices A, B, C, D and E. And then what I do is I put a 1 in a slot wherever a vertex over on this side is contained in the corresponding edge up there. It's called the boundary matrix. And it turns out it has rank 3, as you look at it.
And so what you'll find is that the number of connected components is equal to the number of vertices minus the rank, and the number of loops is equal to the number of edges minus the rank; that is, the dimension of the null space of that little Boolean matrix, regarded as a map from edges to vertices.
So, in other words, there's nothing, you know, very far-fetched about the calculations that you
ultimately do to determine these things. It's just real simple linear algebra.
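As a minimal sketch of that linear algebra (not the speaker's code), here is the calculation for this example complex -- a triangle A, B, C plus a segment D, E -- over the two-element field:

```python
# Betti numbers from the boundary matrix over the Boolean (two-element) field.
import numpy as np

def rank_gf2(M):
    """Rank of a 0/1 matrix over GF(2) by Gaussian elimination."""
    M = M.copy() % 2
    rank = 0
    rows, cols = M.shape
    for col in range(cols):
        pivot = next((r for r in range(rank, rows) if M[r, col]), None)
        if pivot is None:
            continue
        M[[rank, pivot]] = M[[pivot, rank]]       # move pivot row up
        for r in range(rows):
            if r != rank and M[r, col]:
                M[r] = (M[r] + M[rank]) % 2       # clear the rest of the column
        rank += 1
    return rank

# Boundary matrix: rows = vertices A..E, columns = edges AB, AC, BC, DE.
d1 = np.array([
    [1, 1, 0, 0],   # A
    [1, 0, 1, 0],   # B
    [0, 1, 1, 0],   # C
    [0, 0, 0, 1],   # D
    [0, 0, 0, 1],   # E
])

r = rank_gf2(d1)                 # rank 3, as in the talk
betti0 = d1.shape[0] - r         # vertices - rank = 2 components
betti1 = d1.shape[1] - r         # edges - rank = 1 loop (nullity of d1)
print(betti0, betti1)            # -> 2 1
```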
Okay. Formally, then, for every space and every integer K you get a Boolean vector space called HK of X. The dimension of H naught is the number of connected components, the dimension of HK of X is a count of the number of K-dimensional holes, and a continuous map induces a linear transformation, or matrix, between these homology groups. And the dimension, as I've mentioned, is called the Betti number.
So here Betti 1 equals 1 on the left and Betti 1 equals 2 -- typo there -- on B.
And just to give you an idea how it works more generally here, if you look at a two-sphere, you can see the Betti numbers here. Betti 1 equals 0 reflects the fact that there are no loops in this space that can't get dragged down to a point. So if you imagine drawing any loop here, it can get dragged, compressed, deformed into a single point. And then Betti 2 equals 1 reflects that it's a two-dimensional surface.
And then just a torus example here: we've got two independent loops, here and here, and then Betti 2 equals 1 again. And then even this situation here, we've got four loops, as you can see, one on each side of the double torus. And even a Klein bottle, which is a non-orientable, hard-to-visualize gadget that we'll come back to in a minute, has the same Betti numbers, as you can see, as the torus.
All right. So now the question is how do we import this kind of idea into a more discrete setting
where we're dealing with finite sets of points, maybe large point clouds.
So what ideas might we want to import? I'm going to try to tell you about three of these; we may only get to two. First, homological shape signatures, the notion of a Betti number -- I want to talk about how that plays out in the case of point clouds, finite metric spaces. Second, mapping methods for getting visual representations and compressed representations of the geometric objects. And finally what I would call diagrams and functoriality, which can help in assessing stability of qualitative properties.
Okay. But so what's the key. The key to moving from this abstract world of sort of, you know,
complete information kind of spaces into the data world where you have, you know, finite sets of
points is through what we call persistence.
Persistence is something which was really introduced by statisticians initially, under the idea of hierarchical clustering. So let me remind you that clustering is what I would call the statistical version of taking connected components of a topological space. Typically a clustering algorithm is something which separates a dataset into conceptually distinct pieces, and hopefully, presumably, your notion of distance reflects that, so that the points within a cluster are closer to each other than points in two different clusters are to each other.
There are many choices of how to do this. As you probably know, there are journals devoted to this, books written about it. Often it requires a scale choice. So an idea about how to do this for a finite metric space, the very simplest thing, is to choose a threshold epsilon for your dataset, which now is a finite set of points with distances, and simply build a graph where you connect up any points whose distance is less than epsilon.
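A minimal sketch of that single-scale construction (not the speaker's code): build the graph with an edge whenever two points are within epsilon of each other, and read off its connected components as clusters.

```python
# Single-scale clustering from the epsilon-neighborhood graph.
import numpy as np
from scipy.spatial.distance import squareform, pdist
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def epsilon_clusters(X, eps):
    """Cluster labels from the graph with an edge whenever d(x_i, x_j) < eps."""
    D = squareform(pdist(X))                 # pairwise Euclidean distances
    A = csr_matrix((D < eps) & (D > 0))      # adjacency matrix of the eps-graph
    n_clusters, labels = connected_components(A, directed=False)
    return n_clusters, labels

X = np.random.rand(30, 2)                    # toy point cloud
print(epsilon_clusters(X, eps=0.15)[0])      # number of clusters at this scale
```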
Hierarchical clustering came from what statisticians observed: that it is artificial to have to choose that threshold. There are some ad hoc methods that one can build in to say, oh, this is the one I want to choose, but basically they said it's much better if I can build a summary, or a tracking, of the behavior of the number of clusters over the entire range of scale values.
And so what they do, then, is they say: if we've got a finite metric space, we're going to build a tree on top of it, or actually a dendrogram, which is a rooted tree with a map to the real line which is keeping track of thresholds.
So the interpretation of this is we initially start out with this dataset down here, but then, oh, these two points are a distance roughly .1 apart, and so we merge them. And then we'll merge this point in here a little bit higher up, and then this group, which has already been merged, and this other group which has been merged, will get merged with each other at this point.
But the dendrogram is a very effective small representation of the entire -- of all possible
clusterings that occur at all the scale values. And so the statistician's observation is that's much
more useful and much more informative than any single choice.
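A minimal sketch of how one might compute such a dendrogram with standard tooling; SciPy's single-linkage routine is used here as an assumed stand-in for whatever software the statisticians would actually use.

```python
# Single-linkage hierarchical clustering; the dendrogram summarizes the merges
# over the whole range of scale values rather than at one chosen threshold.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

X = np.random.rand(20, 2)                 # toy point cloud
Z = linkage(X, method='single')           # single-linkage merge tree
dendrogram(Z)                             # heights on the y-axis are merge scales
plt.show()
```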
So the import of persistence is that if you're trying to look for things that are not just connected
components but loops, there exists a similar summary for homology in all degrees.
So it relies on building a simplicial complex or graph based on the dataset. So we have a finite set X with distance function D and a scale parameter epsilon; the vertex set is the set of all data points, and there's an edge precisely if the distance is less than or equal to epsilon. And, similarly, if you've got three points and all the pairwise distances are less than or equal to epsilon, then you include that triangle.
So this is the Vietoris-Rips complex. And so here we have a picture of it. If we've got a circle, then my dataset there is those red points. I connect them up; roughly speaking, the epsilon here is roughly the interpoint distance, and you'll notice that I recover the structure of the circle from this complex at this point.
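A minimal sketch of the Vietoris-Rips construction at one scale (not the speaker's code), including edges and triangles, checked on a small sample of a circle:

```python
# Build VR(X, eps): every edge with endpoints within eps, every triangle all of
# whose pairwise distances are within eps.
import numpy as np
from itertools import combinations
from scipy.spatial.distance import squareform, pdist

def vietoris_rips(X, eps):
    """Return the vertices, edges, and triangles of VR(X, eps)."""
    D = squareform(pdist(X))
    n = len(X)
    edges = [(i, j) for i, j in combinations(range(n), 2) if D[i, j] <= eps]
    triangles = [(i, j, k) for i, j, k in combinations(range(n), 3)
                 if D[i, j] <= eps and D[i, k] <= eps and D[j, k] <= eps]
    return list(range(n)), edges, triangles

# Sample points from a circle; at this scale the complex recovers its shape.
theta = np.linspace(0, 2 * np.pi, 20, endpoint=False)
circle = np.c_[np.cos(theta), np.sin(theta)]
verts, edges, tris = vietoris_rips(circle, eps=0.4)
print(len(edges), len(tris))          # 20 edges, no triangles: a circle
```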
Now, of course if I had chosen the threshold too small, nothing would get connected and nothing
interesting would happen. If I had chosen it too large, everything would be filled in and it also
wouldn't reflect the structure. So it leads you to think, oh, there should be a sweet spot in here
and I should know how to choose the epsilon which is the sweet spot.
But actually that's a losing battle. Instead I think what one should do is you should study how
this complex -- how the structure of this complex grows with epsilon. And, remember, that's
exactly what the statisticians are doing when they're talking about the connected components.
So here you can see how the complex is evolving. Over here there are no connections. Here I'm
connected up. I'm at the sweet spot. Here I'm at something which, well, I filled in some
triangles, but actually fundamentally it kind of looks like a circle, but as it gets very large, of
course, it will lose that character.
Okay. So single linkage clustering, the most vanilla version of hierarchical clustering, is obtained by forming the connected components of the Vietoris-Rips complex of X at scale epsilon, and then the dendrograms are constructed by studying all values of epsilon at once and understanding the maps between components induced by these inclusions. That's the summary. That contains all the information in the dendrogram.
So what I want to show you now is suppose that instead I'm interested in finding the loops. And
I'm interested in finding the first Betti number. That is to say, the homology.
So what you're looking at here now is a -- this is a dataset. So imagine that the points here are all
the intersections of these edges. And I've built the Vietoris-Rips complex on this, and you'll
notice at the back of the room, if you're in the back of the room, it will look very clear that this is
a -- that this is a circle. Unfortunately, something happened in the sampling or whatever here
that I've got these two little holes that are there too. They're also holes in this complex.
So if I were to go ahead and do the homology calculation, this simple linear algebra calculation, on this, I would find Betti 1 equals 3. Not the right answer. I want to see the rough, the coarse, the essential structure of the geometry, and not these little loops. But homology contains no notion of size as it stands.
But what we do have is maps between the vector space for the homology at one threshold and the homology at the other threshold. Because, remember, I mentioned that when you build the homology, not only does it produce vector spaces, but it produces linear transformations between vector spaces. And the Vietoris-Rips complex for a smaller scale value embeds inside the Vietoris-Rips complex for the larger scale value.
So let's look what happens. You might tell me here, first of all, well, the only problem with this
is that I didn't choose the epsilon right. So maybe I should just enlarge it a little bit. And so
when I do that, yes, I can enlarge it a bit to the point where I fill in these upper two. So that's a
good thing, but now maybe I introduced something down here because there's an edge and a path
down here. So it's sort of conspiring against me down here.
At this scale I also don't get the right answer: I get Betti 1 equals 2, and what I want is Betti 1 equals 1. The answer, though, is to say that we're actually going to study not only the homology of these two complexes but the linear transformation between them. Because in this case the linear transformation carries this big loop to the corresponding big loop on the other one.
But in this case the two little blue loops got filled in. They went to zero, and on the other end
this one did not come from down below. So what we'll call here the homology that persists
across the change in scale values is a single copy. And that is, in fact, the structure that we're
after.
Okay. Yeah. So here's a picture: taking components is functorial. I want to say that taking components respects maps.
But for homology, instead of a dendrogram, we now get what we call a persistence Boolean
vector space by applying HK to the increasing sequences of Vietoris-Rips complexes.
So what this means, persistence vector space simply means a bunch of vector spaces together
with maps between them as I get the increasing scale choices. They can be classified by
barcodes. It turns out a barcode looks like this, just like vector spaces are classified by
barcodes -- sorry, by dimension.
So these persistence vector spaces are classified by barcodes. So here what this represents is a
feature which is born here and then lives for a long scale value -- long range of scale values, and
then here are ones which live for some smaller scale values.
There's a correspondence between the dendrogram and its barcode, because the dendrogram has a barcode too. That loses some information from the dendrogram itself. However, this barcode notion extends to the higher dimensional invariants.
So the left-hand endpoints are thought of as birth times and the right-hand endpoints as death times. And here's another typical example. This is what a barcode might look like for this statistical circle up here. So we would look at it and we would say, look, this looks like a circle. Looking at the barcode we would say, oh, yes, it looks like a circle: we see a single long bar, which is an honest geometric feature, and then the smaller ones, which we would say are, you know, perhaps noise.
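As a hedged sketch of how such a barcode might be computed in practice, here is a small example using the third-party ripser package from the scikit-tda project; this is an assumption about tooling, not the software used in the talk.

```python
# Persistence barcodes for a noisy circle ("statistical circle").
import numpy as np
from ripser import ripser

theta = np.random.uniform(0, 2 * np.pi, 200)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.1 * np.random.randn(200, 2)

dgms = ripser(X, maxdim=1)['dgms']     # persistence diagrams in degrees 0 and 1
intervals = dgms[1]                    # degree-1 intervals (loops)
lengths = intervals[:, 1] - intervals[:, 0]
print(np.sort(lengths)[-3:])           # one long bar should dominate the rest
```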
Okay. So what I want to show you now is how this plays out to study a particular -- an actual
dataset. So this is joint work with my collaborators, Vin de Silva, Tigran Ishkhanov, and Afra
Zomorodian.
So suppose that you start out and we're going to try to study images now. So an image taken by
a black-and-white digital camera can be viewed as a vector with a gray scale value for each
pixel. So it's actually a very large vector. There are thousands, maybe even tens of thousands of
pixels for cameras. And so the images lie in this very high dimensional space. We'll call it pixel
space.
So David Mumford asked the question what can you say about the set of all images that you
obtain when you take images with a digital camera. So ask the -- you know, just sort of a
thought experiment, what happens if I took all the possible images I could take going around the
world and made them all into some huge dataset sitting inside this pixel space.
Okay. So that's an ambitious question. So let's see. Understanding the way this set of images behaves is of interest from the point of view of studying the visual pathways and so forth.
So their observation is that we can't really do this business up here. We can't study this full set of images. We don't have that full set. It's too high dimensional. It's also probably of high co-dimension inside the pixel space P, because we know that the set of things that are images is highly restricted.
So what Lee and -- or what Mumford and his collaborators, Ann Lee and Kim Pedersen, did was
to say let's say we're going to try and instead study the local structure of the images statistically.
That is to say we're going to study 3-by-3 patches.
So here's a 3-by-3 patch. So these are 3-by-3 patches of adjacent slots.
And so each one of those will then be some 9-vector. So each patch gives a vector in R9. The first observation is that, because most images contain some solid patches, the things that will be most frequently occurring are the nearly constant or low-contrast patches. Those will dominate the statistics.
So, in a sense, that's an observation we can make. And once we make that, we ought to then
ignore it because we can't say much more about constant patches.
So the low contrast will dominate the statistics. And so what Lee, Mumford, and Pedersen did was to construct a dataset consisting entirely of high-contrast patches. They collected about 4 1/2 million high-contrast patches from a collection of images obtained by two Dutch neuroscientists. They normalized the mean intensity of each patch -- you've got a 9-vector, subtract the mean from it so that you get it down to something with mean 0; it's now an eight-dimensional object -- and then also normalized the contrast, which effectively corresponds to the L2 norm of that normalized vector, to obtain a dataset on a seven-dimensional ellipsoid.
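A minimal sketch of that normalization (not the speaker's code), assuming each patch arrives as a flattened 9-vector of gray-scale values; plain L2 contrast normalization is used here for simplicity.

```python
# Mean-center and contrast-normalize a flattened 3x3 patch.
import numpy as np

def normalize_patch(patch9):
    v = np.asarray(patch9, dtype=float)
    v = v - v.mean()                  # mean 0: now effectively 8-dimensional
    norm = np.linalg.norm(v)
    if norm < 1e-12:
        return None                   # constant / low-contrast patch: discarded
    return v / norm                   # unit contrast (the talk describes an
                                      # ellipsoid, reflecting the norm actually
                                      # used in the study)

print(normalize_patch([0, 0, 0, 1, 1, 1, 2, 2, 2]))
```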
Okay. So what one finds when you do this is that when you take this dataset, it actually more or less fills out the seven-sphere. That is to say, it's big enough that no matter where you go in the seven-sphere you can find some points. And so in a sense we could take that dataset, start to run things, and prove somehow that the dataset was a seven-sphere, and at the end of the day we would have nothing of interest there, because of course we ourselves made it into something lying on the seven-sphere.
But so, on the other hand, the density of the points, though, varies quite a bit. And so we're
going to study the dataset of most frequently occurring high-contrast patches. So here is a -- here
is what we get.
Now, I'll tell you -- all right. This was subsampled at this point because when we did this the software wasn't mature enough to compute on very large sets.
So, in any case, what we've got here is 50,000 points. These are the 25 percent densest points as measured by a particular density estimator that we chose. And I'll tell you that density estimator has a scale parameter built into it, like the variance in a kernel density estimator.
So this is one that's chosen with a rather tight such variance. And so the barcodes that come out
now are -- it's not just that there's a bunch of long ones, it's that there's always five. There's
always five. So, in other words, you resample here, you take a different 50,000 points, and you
take the 25 percent densest. You always get five of these long bars. So what that suggests is that
we've got a space which has first Betti number of 5. Yeah.
>> This is [inaudible] three patches.
>> Gunnar Carlsson: Correct.
>> Oh. Never mind.
>> Gunnar Carlsson: I mean, this is a space -- this is a subset of space which is a priori
seven-dimensional. And now we're seeing some homological behavior here down at the very
bottom.
So the question is what does that tell you? What sort of interpretation does one have of that
output of this dataset? Well, you can hunt around for a while and sort of look for the simplest
possible explanation. Then there's more than one in this case.
So in this case, here's a picture that works. This is what we called the three-circle model. So here the picture is that there's a single primary circle, and this primary circle consists of patches which are essentially an edge between a dark and a light region. The angle of that edge is what parameterizes the primary circle.
Then there are these two secondary circles as well, red and green, and they don't touch each
other. So the red and green circles don't touch each other, but each touches the black circle.
So does the data fit with this model is the question. And in fact it does. So here's the picture. So
you can see the primary circle up here is exactly what we're talking about. Here there's a -- if
you like, there's a transition from dark to white. There's -- at an angle of about -- what is that,
135 degrees, and here that angle is changing. So here is the vertical patches, here are the
horizontal patches, and then here is something in between.
But now we have these secondary things as well. So you'll notice this one is the same as that one
up there. This one is the same as that one, and there's a typo here, I guess. Yeah. The typo, of
course, is that these two are the same. So this one should be like this one down here. Sorry
about that.
Similarly, this secondary one here, these green ones intersect at those two points. But now what
we're looking at here is we're looking at patches that are entirely horizontal in nature. But notice
that they transition from light to dark. This one starts to gray out. This one gets darker until I
get to a point up here where I've got a single dark line in the middle of a white region.
>> Can you explain again the primary circle?
>> Gunnar Carlsson: Yes. So the primary circle is a discrete representation of what we would have if we took a function -- think of a function on a square, an intensity function -- and supposed it were actually linear in the two variables. So it would define an angle, the angle of the lines of constancy, if you like, of those shades. And the angle of that line is what is giving you the theta that's describing this circle. So you can see here, like this patch here, if I smoothed it out and rotated it I would get from here to here and then from there to there.
>> The colors, the red and green here, have no significance?
>> Gunnar Carlsson: The color is only to point out to you that this one is the same as this one.
I'm trying to show you that the three-circle model, you know, the one circle fits with -- you
know, overlaps with the other one.
>> Visualizing the difference in color [inaudible] dominated.
>> Gunnar Carlsson: Yes. Well, fair enough. That's right. But, no, they're supposed to be -- you know, this primary circle is supposed to be entirely uniform. Really these things should be just gray, and it's just a matter of the rotations.
So a question is: is there a two-dimensional surface inside which this data fits? And so you can now run a much bigger calculation here, with different density thresholds and so on, and look at the barcodes that come out -- and here there's a Betti 2 barcode as well. So here this one says Betti 0 is 1; it says the space is connected.
Here this says Betti 1 is equal to 2. Here this says Betti 2 is equal to 1. Okay. So this is sort of
the fingerprint for this object. And an object which has that -- exactly that fingerprint is the
so-called Klein bottle. So the Klein bottle is a surface, a two-dimensional surface, which doesn't fit inside R3. You can't embed it without tearing it. It's what happens if you build a Möbius band out of paper and then also try to connect points from the top to the bottom. When you do that you'll find yourself tearing the paper.
So this is the identification space model. That's the description of that space. So think here of this space as being a rectangle, even a piece of paper, where I'm going to glue Q to Q and P to P. That's the Möbius band identification. And I'm going to glue R to R and S to S, which is the other identification, from the one horizontal edge to the other one.
So we think, then, this space is just a rectangle, but where we have to keep in mind that there are
some things that are glued to each other. A more useful way of looking at the space than that
picture that I showed you before.
Do the three circles naturally fit inside? Well, yeah, they do. Here's the primary circle. This is a
circle, you can see, because Q is glued to Q and P is glued to P. But we've also got the two
secondary circles there. Notice they intersect the primary circle in two points, and then they
form their own circles.
Okay. So that's a picture of how the three-circle model fits in. And, in fact, now what you can
see is how the patches -- and now I've rounded them off -- fit, are parameterized by that Klein
bottle.
Now, I should tell you that I jumped -- when I said that the Klein bottle has the Betti number of
1, 2, and 1, the torus also has Betti numbers 1, 2 and 1. So in principle the space could have
been a torus.
But it turns out that there's a way of doing the homology calculation not over the Boolean field but with a mod 3 field or even the rational field. And when you do that, you find that you can distinguish between the two, and, in fact, it turns out that it is a Klein bottle. That is then also indicated by the fact that you can find a very natural, very sensible parameterization of the patches by the Klein bottle and not by the torus -- at least we weren't able to for the torus.
The Klein bottle also makes sense in that you can now produce from this what I would call a platonic model for the patches. So think of the patches now as being functions on the unit square, and we'll think about quadratic functions, polynomials in two variables. Take the subset of all those quadratic functions in two variables which have the following properties. Let me look at conditions 3 and 4 first: the integral here is 0, that's the mean value of 0, that's the mean centering that Mumford and company did.
This here was the contrast normalization, so you take the L2 norm of F. And now you put the additional condition that F is written as a composite of a single-variable quadratic with a linear map. That space is something which one can compute directly is homeomorphic to, is the same as, a Klein bottle. So it gives you kind of that platonic model for it.
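In symbols, a minimal sketch of that platonic model as just described; the unit-square domain and the unit-length normalization of the linear map are assumptions made here for concreteness.

```latex
% Patches as quadratic functions of a single linear direction, mean-centered
% and contrast-normalized; this space is homeomorphic to the Klein bottle.
\[
  K \;=\; \Bigl\{\, f(x,y) = q(ax+by) \;:\; q \text{ a single-variable quadratic},\;
  a^2+b^2 = 1,\;
  \int_{[0,1]^2} f = 0,\;
  \int_{[0,1]^2} f^2 = 1 \,\Bigr\}
  \;\cong\; \text{Klein bottle}.
\]
```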
So this homology -- so I've sort of shown you how it can sort of suggest, you know, a very pretty
description of parameterization of a family of dense patches, dense image patches. But actually
they can be adapted to do many different things, to do many different kinds of shape recognition.
So, for example, they can be adapted to capture something that we would think of as less
geometric than being a sphere or a torus or a Klein bottle. I might instead be thinking of some
property of -- like a network or a metric space being built up out of modular components and
even into a hierarchy.
And so the barcodes can actually capture the presence of hierarchy here. So you can see there I've got a metric space which clearly has a hierarchical decomposition, and what that's reflected in is the fact that, if you were computing the lengths of the bars here, you'd see that the lengths were peaked, had three different peaks to them.
And so we've actually done this, adapted this to work on graphs and sort of random graphs
models to see that it can -- under a certain definition of what's meant by hierarchy, that it can
actually detect the presence of hierarchy.
And it can be used as -- the homological signatures can then also be used as sort of a search tool
if you're trying to understand a large family of geometric objects, for example. So you could use
them and say, look, I'm looking for something with particular features, maybe I will look for
them using -- seeing which ones have the right homological signatures.
Okay. So I want to switch now. I've just talked about signatures that produce ways of looking at and recognizing certain kinds of geometric objects. But another thing that you might ask for is: can you produce a sort of compressed map of a dataset? So by compressed here I'll mean a simplicial complex.
So rather than trying to take a dataset and coordinatizing it as with, you know, linear regression,
I'm going to try to coordinatize it by mapping it to a simplicial complex.
So the answer is that one can build such methods. Let me tell you how that goes. And let me also tell you roughly where we believe this sits: it sits as a method that's kind of intermediate between straight clustering on the one hand and more analytic, linear algebraic methods, such as principal component analysis and multidimensional scaling, on the other hand.
So the starting point is just a space equipped with a covering. So it's got a collection of sets that make up its union. They're not necessarily disjoint -- that would be a clustering, or a decomposition like that -- these sets might have an overlap.
So the sets can overlap. And for each of these U alphas, I'm going to let pi naught of U alpha denote its set of path components.
So what I can do then is I can form the disjoint union of all these sets and I can then form the
graph by connecting all pairs of components which overlap.
So what's the picture. Here's a circle. And it's covered by three sets: red, yellow, and blue.
Bluish, greenish.
So what do I do? I take each of these three sets, I break them apart like this, so I've got three different sets. And now I compute the connected components. You can see the red and the blue one each have one connected component, and the yellow has two. And so what I build, then, is a collection of four nodes: the red node for the single component up at the top, the blue node at the bottom, and then the two yellow nodes here in the middle.
And now I connect the ones that overlap to get this complex. And what you'll see I've done is I've recovered the original circle, up to its topology.
Suppose I have point cloud data instead now and a covering of that point cloud data. We build
this simplicial complex in the same way but with the pi naught operation replaced by
single-linkage clustering and possibly with a fixed parameter epsilon.
So here's that -- so here's the picture of the coverings in that case. You'll see there's some orange
happening up there and some green happening down below.
So how do we choose the coverings? We build a reference map to a well-understood metric space, like the real line or the plane or the circle. And whenever I've got coverings of the well-known reference space, I can pull those back by taking the inverse images and get a covering. So here's a picture that describes it. You'll see a reference map down there to the real line, you'll see a covering of the real line, and you'll see then that I've got a covering of my whole space. So in this case the blue one, for example, will break into two components.
So what might the reference functions be? Well, we might use density estimators as a map to R. We might use measures of so-called data depth, for example the sum of the squared distances to a given point; that's a measure of centrality, of how close to the center of the space you are. Eigenfunctions of a graph Laplacian could even be used. But also user-defined, data-dependent filter functions.
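Here is a minimal sketch of the whole construction, pulling a cover back along a one-dimensional filter and clustering within each piece. The interval cover, the epsilon-graph clustering, and the data-depth style filter are simplifying assumptions; this is not the speaker's implementation.

```python
# Minimal mapping construction: cover the filter range by overlapping intervals,
# cluster each preimage, and connect clusters that share data points.
import numpy as np
from scipy.spatial.distance import squareform, pdist
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def mapper_graph(X, filter_values, n_intervals=10, overlap=0.3, eps=0.2):
    lo, hi = filter_values.min(), filter_values.max()
    width = (hi - lo) / n_intervals
    nodes, members = [], []
    for i in range(n_intervals):
        a = lo + i * width - overlap * width
        b = lo + (i + 1) * width + overlap * width
        idx = np.where((filter_values >= a) & (filter_values <= b))[0]
        if len(idx) == 0:
            continue
        if len(idx) == 1:
            labels = np.array([0])
        else:
            D = squareform(pdist(X[idx]))
            _, labels = connected_components(
                csr_matrix((D < eps) & (D > 0)), directed=False)
        for c in np.unique(labels):          # one node per cluster in this bin
            nodes.append((i, c))
            members.append(set(idx[labels == c]))
    # Connect two nodes whenever their clusters share data points.
    edges = [(p, q) for p in range(len(nodes)) for q in range(p + 1, len(nodes))
             if members[p] & members[q]]
    return nodes, edges

X = np.random.rand(300, 2)                   # toy point cloud
depth = squareform(pdist(X)).sum(axis=1)     # a data-depth style filter
print(mapper_graph(X, depth)[1][:5])         # a few edges of the output graph
```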
So what I want to show you now is some examples of what happens when you carry this out. So
this is a sanity check for us. This is -- this right here was a dataset done -- a diabetes study done
at Stanford in the 1970s.
On the left there you'll see -- so this was a dataset. It had I believe five coordinates: four
metabolic numbers and also a normalized notion of size to it.
So what was done in the '70s, then, was to use a projection pursuit method, which is a method that finds sort of the best linear projection of the dataset. And when they did that, they found a three-dimensional projection that looks like this. And somebody drew this by hand, drew all these points in by hand and kind of filled it all in. So that's where people were 40 years ago. Or 35 years ago.
But you can see that there's a structure here. There's sort of a central blob or a central core and
then two flares coming out. And so the qualitative observation that there are these two flares
coming out, that is -- turns out to be the observation that there are two kinds of diabetes: there's
the adult onset and juvenile onset, Type 1 and Type 2. And then the central core here are the
people who are near normal, normal or near normal.
So when we apply our construction to this, we get these graphs coming out. So what you'll see
here is there's a central core in both of them, and then there are some flares going out.
And the fact that I'm showing you two of them is saying -- telling -- is reminding you or pointing
out, I should say, that the construction that we wrote down is actually -- has a multiscale aspect
to it, because you can choose your coverings to be more or less refined. So you can actually get
to see the dataset at lower or higher resolution.
And so what this says is at both levels of resolution here there are clearly the two flares coming
out, which correspond to, then, the Type 1 and Type 2.
Here's another -- one other sanity check that we did was to study some gene expression data
coming out of a cell cycle example where there's periodic motion happening in the gene
expression profiles over time.
And the way that's reflected -- remember I talked about finding -- when you find loops in the
data, that may very well correspond somehow to the presence of periodic or recurrent behavior.
And so indeed here you'll see that the representation contains a loop, and that loop is
representing this periodicity that's going on here.
Okay. Now, this one here, this is an example from RNA hairpin folding. So this is an example of data coming from a very high dimensional conformation space for a complicated biochemical molecule. The conformation space is built by studying bond angles and things like that.
And so in this case, what did this picture correspond to? This turned out to correspond to the fact that in this particular dataset -- so this is a simulated dataset for RNA hairpin folding -- there were two different trajectories to the folded state. And they are reflected by these two points in here.
Now, initially, when we look at this, it's quite possible to say, you know, a little feature like this,
a little loop like this might very well be artifact. You know, that does certainly happen here.
And, you know, maybe if you were to change your scale choices or something it would change.
But another approach is to go back to the practitioners and ask, look, can you tell me the
difference between these two. Are we capturing something for you. And indeed the answer is
that we did, because, as you can see here, the description of the different trajectories to the folded
state look like that.
So the moral of this is that a PCA or an MDS method is going to have a lot of trouble with a dataset like this, because this feature is rather small in the dataset and the larger aspects of the distance are going to wash it out. And so, you know, that's the value of the technique in this setting.
So what I want to show you now, this is a joint work with Monica Nicolau. This is gene
expression microarray dataset from breast cancer. One of the bigger and, you know, more highly
regarded datasets, so-called NKI breast cancer dataset.
So in this case, you know, gene expression, what you have is you have a matrix where the
columns are the tumors or the samples and the rows are genes and the number entry at any point
is the expression level of the given gene in the given tumor.
And so that for us is Euclidean data, so it's rather easy to build one of these constructions. In this case there are two filters that work and produce this. One is a specifically well-understood notion called DSGA, disease-specific genomic analysis, which measures, so to speak, the distance from normal tissue. Because there are normal samples up here. In fact, they're up here.
So now the picture -- let me interpret the picture for you. So up here we have the normal tissue
samples. And then as you're moving away here you're finding tumors which are near normal,
which are -- whose -- indeed there is tumor tissue here, but it's rather close to normal tissue.
Gets worse out here, and then there's a break here between these two. So going out to the left are
the so-called basal tumors. This is a collection of tumors which have very bad prognosis. It's a
well-understood piece of the taxonomy of breast cancer.
On the other hand, this flare coming out here had not been observed before. In fact, the way this dataset has been approached, a lot of people have gone at it with their favorite clustering methods, and they come up with different numbers of clusters. There are families called luminal A and luminal B, but there's always a question of how many there are. Some would say there are five classes, some would say seven, some would say four.
So this -- but this class here has not been seen before. And the amazing thing that happened is
that at the outer part of this flare, the outer 22 patients in this flare out of nearly 300, but the
outer 22, all survived the length of the study.
So, in other words, that this flare, the end of this flare represents a very high survival group of
breast cancer patients. Let me point out that no information about survival went into building the
picture. So the picture is straight from the gene expression profile.
So what you're seeing here on the right is actually a dendrogram. So that's the hierarchical
clustering that we talked about before, hierarchical clustering description of the same dataset.
And what you'll see is that the points on this flare, they kind of go all over the place among the
different clusters here.
I would argue that it would be very difficult to extract this very high survival group from this clustering. And I would say that it's not even so surprising that that would be the case. Because supposing you believe our picture here, which says this space is basically a connected object, then if you're doing something like clustering, where all you're able to do is tear things apart -- that's all you do with clustering, you break things into pieces -- you are in some cases very quickly going to separate things from other things to which they belong.
And so that's the idea, that clustering is sometimes too blunt an instrument, and sometimes you
want to understand and keep track of the actual geometry of the dataset.
>> A question?
>> Gunnar Carlsson: Yeah.
>> I know of [inaudible] people taking [inaudible] taking the kind of latest and greatest of
statistical [inaudible] modeling and you created topological features that are input along with
traditional kinds of atomic data as evidence, and then built in classifiers that actually understand
the shape and structure of the topological space [inaudible].
>> Gunnar Carlsson: I don't know of any such work. But I -- I feel like that's absolutely
something one should try. So you're thinking of a neural net or something like that, but feeding
in information.
>> We've done some work where we've taken, for example, structure of [inaudible] properties
like Web structure and used them as evidence. And we do pretty well with [inaudible]
classifying things like the goodness of results, a set of certain results, for example.
>> Gunnar Carlsson: Um-hmm. Yeah.
>> And the rich features are the actual topological features [inaudible].
>> Gunnar Carlsson: Right.
>> [inaudible] supervised learning where [inaudible] and instead of computing [inaudible] they'll
use the notion on distance [inaudible]. They're using like [inaudible]. So some of it goes in
there, but I think that's [inaudible].
>> Gunnar Carlsson: Well, there is -- so that's -- it's an interesting point. So some of the
topology is captured in analysis. And we talked about these eigenvalues of Laplacians. And so,
you know, there's a notion of Hodge theory which says that there's a bit of topology, what's
called the rational part, that would be captured by, you know -- precisely by eigenfunctions.
However, if you look back at the example of this Klein bottle, for example, it's actually got -- some of the homology has so-called torsion in it, which means that it cannot be captured by the analytic method. So the analytic methods, you know, would in that case not see the difference between, you know, the Klein bottle and the torus.
So I think, you know, it would be interesting to take some of the analytic things but then also some of the discrete-type things, feed those discrete-type things in as well, and then work with that.
Now, let me see. Do I have ten more minutes? I'm seeing here that I'm at 11:30. Was it
scheduled for an hour?
>> Yuval Peres: Ten more minutes.
>> Gunnar Carlsson: Ten more minutes, okay. Let me talk real quickly, then, about -- yeah,
here's another picture, which I won't go into the detail here. Okay.
So let me say -- so the topological methods can produce signatures for recognizing a wide variety
of shapes. They can also be used to provide useful visual representations of data. That's what
we just saw in this mapper.
Methods are flexible enough to operate on many kinds of unstructured data. So, in other words,
the method we're talking about here, it doesn't need to be Euclidean. It just has to have a notion
of distance. And it doesn't even need to be triangle inequality kind of distance. It just needs to
be a measure of dissimilarity.
But now what I want to show you here is what -- something that we haven't sort of fully
implemented yet as a method but which I think is kind of -- you might find kind of interesting.
So let me remind you the bootstrap method. And I'm going to give you the very simplest version
of that, because I'm sure there are those of you who know it much better than I.
So you study statistics that measure the central tendency across different samples within a dataset, and that can give an assessment of the reliability of conclusions to be drawn from the statistics of the dataset.
And let me just say, the idea is that in order to understand the dataset and sampling strategy, you don't want to take a dataset and simply compute the mean of a statistic on it. It's much more informative to take samples, compute the means of those samples, and study those means themselves as a dataset. That's really the core of the bootstrap philosophy, and I want to carry this over to questions not about numerical statistics but about decomposing a set into pieces or finding whether there is a loop inside a space.
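As a minimal sketch of the classical version just described (resample, recompute the statistic, and study the resampled statistics as a dataset); the topological analogue would replace the mean with, say, a Betti number computation.

```python
# Simplest bootstrap: distribution of the sample mean over resamples.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=500)

boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(1000)                      # 1000 resampled means
])
print(boot_means.mean(), boot_means.std())    # spread assesses reliability
```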
So I want to talk -- so let me -- so here's an example, then. So this is a -- imagine that the
datasets -- so the data points in this case are just the yellow background here. So think of a very
densely [inaudible]. Suppose that I do an experiment and I repeat the experiment over and over
again. That is to say I sample points from it and I build one of these Vietoris-Rips complexes
and I compute the Betti 1. That is to say, the loop.
So in this case over here I'm going to build -- I have a red sample and a green sample, and they're
both going to report back, ah, Betti 1 is equal to 1. And so of course if I have something that's
reporting that back, I would say, oh, you know, if I'm reporting Betti 1 a lot, probably I believe
that there's a single big loop in there.
But it might not be because this example over here could also be the case, this more Swiss
cheese texture kind of example of the data. It would also -- could in principle also report back,
you know, a loop, but they are not the same loop. So in this case the green is not the same as the
red. Whereas over here these two are the same because you notice I can deform the one into the
other one.
So a question is: how do you build a methodology that sees the difference between those two? This is called zig-zag persistence. So suppose you have a finite set of samples from the point cloud data. You're going to construct new samples, each the union of the i-th one and the (i+1)-st. And now we have not just a family of samples but actually a diagram of samples: a collection of samples together with relationships between them, maps including the one into the other.
So you can see here S_i gets included in the union of S_i and S_{i+1}, and S_{i+1} gets included in there as well, and so forth.
So now we actually have, again, kind of like persistence, a diagram with maps, but in this case the maps are changing directions, alternating. So we're going to build a Vietoris-Rips complex on these and apply K-dimensional homology to get a diagram of vector spaces of this same shape.
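Written out, a sketch of the zig-zag diagram just described, under the indexing used above (inclusions of each sample into the union with its neighbor, followed by degree-k homology of the Vietoris-Rips complexes):

```latex
% Zig-zag diagram of samples and the induced diagram of vector spaces.
\[
  S_0 \hookrightarrow S_0 \cup S_1 \hookleftarrow S_1 \hookrightarrow
  S_1 \cup S_2 \hookleftarrow S_2 \hookrightarrow \cdots
\]
\[
  H_k\bigl(VR(S_0)\bigr) \to H_k\bigl(VR(S_0 \cup S_1)\bigr) \leftarrow
  H_k\bigl(VR(S_1)\bigr) \to H_k\bigl(VR(S_1 \cup S_2)\bigr) \leftarrow \cdots
\]
```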
Now, the idea is if the homology classes, if it's the case that I build a homology class here,
homology class here, which agree when I map into here, that's more consistent with the first
model that we had there, the one which had a single circle. If they map to different things, that's
more like the Swiss cheese model.
Okay. So we want to understand, then, the degree to which you have, so to speak, long vectors
in here where you -- a class here which goes in here which comes from something here, which
goes to something here and comes from something here. Do we have long such vectors.
So in order to carry out that analysis, just as for the persistence vector spaces we had a classification in terms of barcodes, we want an analogous classification of diagrams of this shape.
It turns out that such a classification exists. And in fact the main theorem is that the classification is manageable, doable through simple linear algebra methods, exactly when the underlying shape of the diagram is one of these Dynkin diagrams. The one we're interested in is this A_n up here, because notice this is the underlying graph, with the arrows going in alternating directions along it.
So it turns out, then, that those things can be classified by barcodes as well. Each diagram has a decomposition into a direct sum of cyclic diagrams, and so we get a barcode classification there. And the idea now is that barcodes with long bars indicate phenomena that are stable over many different samples.
And the others, the short bars, would be phenomena that appear in particular samples and then map to something which is either zero or distinct from anything coming from any other sample.
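For reference, the standard form of that decomposition, stated here as a textbook fact rather than taken from the slides: a zig-zag diagram of finite-dimensional vector spaces
\[
V_1 \leftrightarrow V_2 \leftrightarrow \cdots \leftrightarrow V_n,
\]
with each arrow pointing in either direction, decomposes as a direct sum of interval summands
\[
V \;\cong\; \bigoplus_{j} I[b_j, d_j],
\]
where \(I[b,d]\) is one-dimensional in positions \(b\) through \(d\), zero elsewhere, with identity maps between consecutive nonzero positions. The multiset of intervals \([b_j, d_j]\) is the barcode, and the long intervals are the features that persist across many consecutive samples.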
Various uses can be made of this, actually. It turns out it even allows you to compute ordinary persistence much more quickly.
Okay. So I think I will stop there and thank you for your attention.
[applause]
>> Yuval Peres: Any questions?
>> Gunnar Carlsson: Yeah.
>> [inaudible] very simple programs, for example, have a bunch of addresses [inaudible]
mistakes.
>> Gunnar Carlsson: Yes.
>> And you want to find just clusters, but you want to do it very quickly. So does your technique lend itself to effective, fast computing?
>> Gunnar Carlsson: Let's see. So let me understand the question, then. So you've got addresses and now you need some way of recognizing when one is a mistake. Am I right? I mean, that's --
>> Right. So --
>> The cluster, then, [inaudible] same address really is just mistakes.
>> Gunnar Carlsson: Oh, I see. Oh, yes, yes. Oh, right. So I'm sure under those circumstances you could devise a similarity measure, some measure of similarity which says, you know, you've mistaken a street for a road or something like that, but roughly speaking it's the same address. So, yes, I think that was the point, that we can take sort of any kind of metric, any kind of similarity notion.
So if we can formulate it correctly, then in that case it sounds like some variant of a Hamming distance, but one that takes the words into account, I suppose, would make it reasonable.
>> [inaudible]
>> Gunnar Carlsson: No, I'm not saying -- no, no. But I'm saying in principle, yes. I'm guessing that maybe the difficulty, then, is building the similarity measure. Is that a fair statement?
>> That's what most approaches do.
>> Gunnar Carlsson: Yeah.
>> So I was intrigued that you have this epsilon [inaudible].
>> Gunnar Carlsson: Yes, right, right.
>> So it's interesting to consider. But then it takes a lot of time, if you change your metric.
>> Gunnar Carlsson: If you change the metric, yes. It's not the scale, though. Actually, what's interesting is that calculating the persistent homology across all scales is not much harder than doing it for a single scale value. That's a surprising fact.
And so you can get that summary, the summary of what happens over all scale values, in reasonable time. But now you mentioned changing the actual underlying measure of dissimilarity.
>> Right. What do you consider close.
>> Gunnar Carlsson: Right. Exactly.
>> You really cannot use a Hamming metric in these cases, because you have insertions and deletions [inaudible] more sophisticated metrics that take longer to compute.
>> Gunnar Carlsson: Uh-huh. I see. So is that the problem, that basically the distances that
work are kind of -- are too difficult to compute or too time-consuming?
>> Or work for some cases but not for others.
>> Gunnar Carlsson: I see.
>> Nobody has [inaudible] probably the reason [inaudible].
>> Gunnar Carlsson: Right.
>> [inaudible] to search.
>> Gunnar Carlsson: So we have thought about doing some things in search, and we have done some work. I mentioned that we have applied this to networks, in particular to text corpora and so on. So indeed we've done some work in that direction.
You know, we can talk about those in more detail, if you like. Yeah.
>> So in dimension 0 you have this dendrogram tree structure, [inaudible] you only have this barcode. Is there any analog to --
>> Gunnar Carlsson: There is. This is the statistical version of what in topology is called the difference between homotopy and homology. So in topology you have homotopy groups and homology groups, and there's actually a homomorphism from homotopy to homology.
Now, when you go to this setting, in dimension 0 the homotopy is the dendrogram and the homology is the barcode. And so if you did homotopy persistently, that would perhaps be the most natural analog. There are some obstacles that come up, because it's much harder to work with directed families of groups that aren't vector spaces, and the homotopy groups are not vector spaces.
>> [inaudible] one dimensional [inaudible] most interesting.
>> Gunnar Carlsson: Well, it depends on your point of view. You mean because it's nonabelian
and so on.
>> Yeah.
>> Gunnar Carlsson: That's true. That also makes it the hardest. But, you know, the higher
dimensional homotopy in the topology world is quite interesting.
On the other hand, I think your point is well taken here that real life typically isn't 50-dimensional. The things one can understand are typically zero, one, two, or three dimensional. And so you would be looking at low dimensional things.
>> In your last example when you had this sequence of samples and you connected adjacent ones --
>> Gunnar Carlsson: Yeah.
>> That would be very reasonable if these samples came in some time-structured way. But since they're arbitrary samples taken in a bootstrap, it seems arbitrary to just connect the adjacent ones, putting them into this linear graph structure as input [inaudible].
>> Gunnar Carlsson: Absolutely. That's where this remark, or I should say the theorem of Gabriel, comes in, because in order to have a good classification for the diagrams, we need them to be linearly arranged. You'll notice there is no two-dimensional array in there, and that's because those diagrams are not classified by any kind of discrete invariant; they actually have continuous families of moduli.
But actually, I think you're right in principle. In other words, one should invent a coarser invariant that doesn't impose an ordering on things. But as a first cut at studying it, it's not so bad. And what we would simply do here is take many samples, find the ones that produce a Betti 1, put them all together in one of these zig-zag diagrams, and then see whether they match up.
>> Yuval Peres: Any other questions? Let's thank Gunnar again.
>> Gunnar Carlsson: Thank you.
[applause]