>> Yuval Peres: Good morning, everyone. We're very happy to have Professor Gunnar Carlsson from Stanford University who's here to tell us about topology and data. >> Gunnar Carlsson: Thanks very much for the invitation. So what I want to talk about today is methods that we've been working on for the last few years for trying to understand what we call qualitative properties of data. So and in particular what am I thinking of in terms of data? I'm thinking of possibly finite metric spaces; that is to say where we have a data matrix and regard the set of columns, maybe with the L2 metric, maybe with a correlation metric, as a finite metric space. But actually more generally even networks or any kind of set where there's a notion of similarity that one can define, a quantitative notion of similarity. So and let me point out that this idea of studying finite metric spaces is actually quite an old idea, goes back to D'Arcy Thompson, who was a biologist, I believe, in the early part of the last century, who was interested in studying the form and degree of variation of forms in biological systems or biological organisms as you change from one group to a nearby group. So here are two examples of fish. You can see that they're slightly different. On the other hand, they share a lot as well. So you can see the mouth, it looks roughly the same, the fins up here are very similar. And so his idea was let's do the following: Let's put a few landmark points, maybe we'll put something at the tip of the mouth, the tip of the fin, the eye, and so forth, have a list of maybe 10 or 15 of these landmark points, study the distances between those points, and then ask can the degree of variation in those distances, in those configurations of distances, give us a good summary of the degree of variation among the animals in a particular species, for example. So statistical shape theory, one form of it, actually attempts to study exactly this problem that D'Arcy Thompson posed; that is to say, we're trying to study what happens when you have finite configurations of points in a Euclidean space. So, again, they use a small set of landmarks and the distances between them, and what they then do is they actually build a whole space that consists of all the possible configurations of, say, K points in R^N. Now, if you can understand, then, the set of points in this space of all configurations, if you can understand how this set of points coming, say, from a particular species of fish fits in there, then that would be regarded as a way of trying to understand the degree of variation in there. Now, and so this space sigma can be given a metric -- actually the whole set of all these configurations can be given a metric in various ways. So that's a second-order object here where we're talking about distances between metric spaces, if you like. So the space is defined as you take the set of K points in Euclidean N space and you factor out -- say two points are the same if there's a rigid motion carrying the one to the other -- and they then manage to metrize that space. Now, Gromov has recently introduced a new metric which doesn't restrict itself to Euclidean space. And this is a very complicated object, I would say, but nevertheless it is a notion of a metric on the set of all metric spaces. Okay. So now of course many kinds of data are effectively analyzed using a notion of distance or metric. 
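To make the landmark idea concrete, here is a minimal sketch of comparing two landmark configurations after factoring out rigid motions, using scipy's Procrustes alignment. The fish landmark coordinates below are made up for illustration, and Procrustes disparity is just one standard way to compute such a shape distance, not necessarily the metric used in the work being described.

import numpy as np
from scipy.spatial import procrustes

# Hypothetical landmark coordinates (tip of mouth, fin, eye, ...) for two fish;
# each row is one landmark point in the plane.
fish_a = np.array([[0.0, 0.0], [2.0, 0.5], [4.0, 0.2], [3.0, 1.5], [1.0, 1.2]])
fish_b = np.array([[0.1, 0.0], [2.2, 0.7], [4.1, 0.1], [2.8, 1.8], [1.1, 1.0]])

# Procrustes analysis removes translation, scale, and rotation, then reports the
# residual sum of squares ("disparity") between the aligned configurations.
_, _, disparity = procrustes(fish_a, fish_b)
print("shape disparity:", disparity)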
Euclidean data, always there's a particular choice there that one makes coming from the Euclidean formulas, but, on the other hand, in genomic analysis one studies something more like a Hamming distance, variants on a Hamming distance between sequences of members of an alphabet. But if you think about it, the theory that we just talked about, the one that David Kendall and then Misha Gromov set up, doesn't really yield useful information for large numbers of data points. The computations become outrageous very fast. So if you're talking even about hundreds of data points, you're now looking at trying to understand properties of a hundred dimensional space. It's a difficult way to go. And the Gromov-Hausdorff, the distance that Gromov introduced, is actually quite difficult to compute and often, furthermore, one is looking in your object, in your geometric object, for some kinds of qualitative features. So another comment about these things, though, is that sometimes in this analysis of data the metrics are very well justified theoretically. If one thinks, for example, of physics, a lot of physical systems, the metrics really make a lot of sense and you believe that you should take them very seriously. But on the other hand sometimes it only reflects an intuitive notion of similarity. So, for example, in the genomic situation, your Hamming distances, you say, well, I believe something is similar if there are only a few changes between the two of them. And so here what one thinks of, one thinks of the distance as reflecting a notion of similarity. So nearby points are similar, far apart ones are different. However, I want to say we don't trust the large distances. We don't believe that those are necessarily significant. Because all we are building into the system is the notion of similarity, the things that are very close, where the distances are small. But, in fact, the large distances, you know, unlike a physical situation, may really not be very significant. In other words, if you think about two genomic situations, one which differs from the other, you know, in a hundred slots and another which differs by 150, it's not clear that you would think that those would be very -- that you could really quantify the notion of difference using that kind of arbitrarily defined notion of distance. Furthermore, even the small distances we only trust a little bit. If you have a pair of points over in one part of a metric space that's a distance 1.5 apart and another where it's 1.8 apart, we really don't think that that necessarily tells us that, well, those two points over there are really much closer than these two. And it's illustrated here if we're imagining a sequence here where on the left those points are different because the yellow is replaced by black, which is slightly longer over here, so these would look like they're closer. But in terms of a sequence, just the length that's involved in representing a particular phenomenon is not necessarily indicative of the significance of it. So actually what we're thinking about is something more like a 0/1-valued quantity, alike or not alike. Similar or not similar. Okay. Now, okay. So when the metrics aren't based on theory like that, it says you don't want to attempt to try to understand very refined notions concerning the metrics, such as curvature, for example. Because if you do that, since you believe that large distances are kind of not to be trusted anyway, it means that you're putting garbage in and you expect to be getting garbage out. 
So it says instead that you should try to study some properties which are a bit robust to some changes in the metrics. And what are some such properties? Topology actually studies the idealized versions of such properties. So, in fact, geometry can be defined as the study of metrics and, you know, informally topology studies those properties that remain after you allow yourself to stretch and deform but not to tear. And mathematically what that suggests is what happens if we sort of permit arbitrary rescalings of metrics. What remains about the geometric object after we do that. Your initial reaction is that nothing remains, because, after all, all that survives of a distance is whether it's 0 or not, and we can't really consider distance between pairs of points as a 0/1 notion because, of course, points are only at distance 0 apart if they're equal, and otherwise they're not. So instead what's done on the theory side here is you say, well, actually what we're going to do is instead consider the notion of a point and its distance to a set instead of distances between pairs of points. So when you do that here, we've got a point that's a distance 0, the black point is a distance 0 from the blue open ball, and that's topology. Topology keeps track of distances to sets being equal to 0. So what are things that remain, then? What are some properties that remain after you permit yourself that? Well, the number of connected components remains. Here three connected components remain after -- no matter what metric you put on this, no matter how you rescale a metric here, you will always retain the fact that there are three connected components. We say two points are in the same component if they're connected by a path. So the set of connected path components is actually a topological invariant, and we can even start talking about higher order versions of that now to try to understand what happens. Suppose that we know that a space consists of a single connected component, what more can you say about it? Well, you can say the following. Maybe it's connected in more than one way. So here I've got a connected space. Think of the gray thing here as the connected space, and I've got two points in it. It's path connected because I can draw a path from the one point to the other one, but in fact I can draw a second path like that, and I can even do the following. I can say let me look at those paths and let me see can I identify when two of those are the same, are connected to each other as paths. So before, when we talked about a space being connected, if I draw a path from a point to another point, it's in the same component; here I'm going to say two paths are the same if I can draw a path of paths between them. So you'll see there that what I'm saying is that that upper orange group, those all are the same, these guys up here, and the reds are all the same, but it turns out that the oranges are not the same as the reds. As I start trying to deform or stretch those orange paths, there's no way I can get past that obstacle in the middle. >> What does path mean if the space is discrete? >> Gunnar Carlsson: Okay. I should say at the moment I'm giving you the idealized story. I'm giving you the topology story. And the whole import, then, of the thing is going to be how do you move that over into the other setting. Okay. So counting or parameterizing sort of redundancy of connection reflects properties of the space. So here we've got this one is parameterized by the integer 1, for example, because it loops once around. 
And this one represents 2 because we go around twice, and this one represents minus 1 because it's going around in the other direction. And there are notions of parameterizing higher order things than paths. So you can talk about Kth order versions where components are zeroth order. The above calculation, the one for paths, is first order, but then there's second, third, fourth, and so on higher order stuff where the second order captures 2-spheres. And in a sense one can even talk about a notion of so-called power series expansion of a space with better and better approximations given as you calculate more and more of these parameterizations. So why might you want to understand the higher order structure in the space? Well, suppose, for example, we're given a continuous time series, so something moving along a sine wave, for example. And then consider the set of 2-vectors of the form, you know, a point and then the one that follows at one time unit later. Call that the delay embedding. So this set will trace out a circle as the points move through this periodic motion. So the set of configurations of a point and its follower will produce a circle. Okay. So the presence of periodic behavior here in the time series is reflected by this first -- this Betti number, as we call it, this count of the first -- of the number of paths, different paths, computed on this image of the delay embedding. Now, this could also be detected by Fourier methods, but supposing the motion isn't periodic but just recurrent, you know, it doesn't necessarily move in a periodic fashion. So the image would then still be a circle or perhaps an annulus, but the Fourier methods would have difficulties. So the topological structure of the image of this delay embedding is actually reasonably robust to noise. Also we might ask how do we go about recognizing letters. Well, we look at these, so we look at the letter A, and we're trying to distinguish it from the letter B. Now, of course, the A can appear in many different ways. You can stretch and deform the A, and we still recognize it as an A. You can similarly do the B. We'll recognize it as a letter B. In fact, we know there are many different fonts for that. So but the thing that remains is that the A has one loop and the B has two loops, and so in fact the number of loops turns out to be a topological invariant. And so in fact it distinguishes them here. So, again, we would say the first -- the A has a first Betti number of 1 and the B has a first Betti number of 2. So it turns out now there's a formalism, again, on the pure side for actually defining how many loops there are of the various orders, and it's called homology. And what I want to just show you here is it's -- you know, although the subject has a very abstract reputation, the subject that does this, called algebraic topology, nevertheless, it reduces to matrix calculations for matrices over a Boolean field, a finite field. So in this case what you do here, the space that I'm looking at up there is given to me as a simplicial complex. So you see we have a triangle up here, and then we have this path here too. So there are two connected components, and then there's a single loop here. You can encode all that information in this matrix. So here's a matrix which has all the edges. You'll see AB there, AC, BC, and DE, and then the vertices A, B, C, D and E. And then now what I do is I put a 1 in a slot wherever a point over on this side is contained in the corresponding edge up there. 
It's called the boundary matrix. And it turns out it has rank 3, as you look at it. And so what you'll find is that the number of connected components is in this case equal to the number of vertices minus the rank, and the number of loops is equal to the number of edges minus the rank -- that is, the dimension of the null space of the boundary map. So, in other words, there's nothing, you know, very far-fetched about the calculations that you ultimately do to determine these things. It's just real simple linear algebra. Okay. Formally, then, for every space and every integer K you get a Boolean vector space called HK of X; the dimension of H naught is the number of connected components, the dimension of HK of X is a count of the number of K-dimensional holes, and a continuous map induces a linear transformation or matrix between these homology groups. And the dimension then, as I've mentioned, is called the Betti number. So here Betti 1 equals 1 on the left and Betti 1 equals 2 -- typo there -- on the B. And just to give you an idea how it works more generally here, if you look at a two-sphere, you can see Betti 1 is 0 here. This reflects the fact that, you know, there are no loops in this space that can't get dragged down to a point. So if you imagine drawing any loop here, it can get dragged, compressed, deformed into a single point, and then Betti 2 equals 1 reflects the fact that it's a closed two-dimensional surface. And then just a torus example here, we've got two independent loops here and then here. And then a Betti 2 equals 1 again. And then even this situation here, we've got four loops, as you can see, one on each side of the double torus, and even a Klein bottle, which is an unorientable, hard-to-visualize gadget which we'll come back to here in a minute. But it has the same Betti numbers, as you can see, as the torus. All right. So now the question is how do we import this kind of idea into a more discrete setting where we're dealing with finite sets of points, maybe large point clouds. So what ideas might we want to import. I'm going to try to tell you about three of these. We may only get to two. These homological shape signatures, the notion of a Betti number, I want to talk about how that plays out in the case of point clouds, finite metric spaces. I want to talk about mapping methods for getting visual representations and compressed representations of the geometric objects, and finally what I would call diagrams and functoriality, which can help in assessing stability of qualitative properties. Okay. But so what's the key. The key to moving from this abstract world of sort of, you know, complete information kind of spaces into the data world where you have, you know, finite sets of points is through what we call persistence. Persistence is something which was really introduced by statisticians initially under the idea of hierarchical clustering. So let me remind you that clustering is what I could call the statistical version of taking connected components of a topological space. Typically a clustering algorithm is something which separates a dataset into conceptually distinct pieces, and hopefully, presumably, your notion of distance reflects that so that the points within a cluster are closer to each other than points in two different clusters are to each other. Many choices of how to do this. As you probably know, there are journals devoted to this, books written about it. Often it requires a scale choice. 
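As a minimal sketch of the linear algebra just described: the boundary matrix below is a reading of the example on the slide (a hollow triangle on A, B, C plus a separate edge DE, with no filled-in triangles), with vertices as rows and edges as columns, and the rank computed over the two-element Boolean field.

import numpy as np

def rank_gf2(mat):
    # Gaussian elimination over the two-element field {0, 1}.
    m = mat.copy() % 2
    rank = 0
    rows, cols = m.shape
    for col in range(cols):
        pivot = next((r for r in range(rank, rows) if m[r, col]), None)
        if pivot is None:
            continue
        m[[rank, pivot]] = m[[pivot, rank]]
        for r in range(rows):
            if r != rank and m[r, col]:
                m[r] = (m[r] + m[rank]) % 2
        rank += 1
    return rank

# Boundary matrix read off the slide's example: vertices A..E as rows,
# edges AB, AC, BC, DE as columns; a 1 wherever the vertex lies on the edge.
d1 = np.array([[1, 1, 0, 0],   # A
               [1, 0, 1, 0],   # B
               [0, 1, 1, 0],   # C
               [0, 0, 0, 1],   # D
               [0, 0, 0, 1]])  # E

r = rank_gf2(d1)                # rank 3
betti0 = d1.shape[0] - r        # vertices minus rank: 5 - 3 = 2 components
betti1 = d1.shape[1] - r        # edges minus rank (the nullity): 4 - 3 = 1 loop
print(betti0, betti1)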
So an idea about how to do this for a finite metric space, the very simplest thing is to choose a threshold epsilon in your dataset, which now is a finite set of points with distances, and just simply connect up, build a graph where you connect up any points whose distance is less than epsilon. Hierarchical clustering -- and this is what statisticians observed -- starts from the fact that it's artificial to have to choose that threshold. There are some ad hoc methods that one can build in to say, oh, this is the one I want to choose, but basically they said it's much better if I can build a summary or a tracking of the behavior of the number of clusters over the entire range of scale values. And so what they do, then, is they say if we've got a finite metric space we're going to build a tree on top of there, or actually a dendrogram, which is a rooted tree with a map to the real line which is keeping track of thresholds. So the interpretation of this is we initially start out with this dataset down here, but then, oh, these two points are a distance, you know, roughly .1 apart, and so we merge them. And then we'll merge this point in here a little bit higher up, and then this group, which has already been merged, and this group, which has been merged, will get merged with each other at this point. But the dendrogram is a very effective small representation of the entire -- of all possible clusterings that occur at all the scale values. And so the statistician's observation is that's much more useful and much more informative than any single choice. So the import of persistence is that if you're trying to look for things that are not just connected components but loops, there exists a similar summary for homology in all degrees. So it relies on building a simplicial complex or graph based on the dataset. So a finite set X with distance function D and a scale parameter epsilon, so the vertex set is the set of all data points, and there's an edge precisely if the distance is less than or equal to epsilon. And, similarly, if you've got three points, if all the pairwise distances are less than or equal to epsilon, then you include that triangle. So this is the Vietoris-Rips complex. And so here we have a picture of it. If we've got a circle, then my dataset there is those red points. I connect them up, roughly speaking, the epsilon here is roughly the interpoint distance, and you'll notice that I recover the structure of the circle from this complex at this point. Now, of course if I had chosen the threshold too small, nothing would get connected and nothing interesting would happen. If I had chosen it too large, everything would be filled in and it also wouldn't reflect the structure. So it leads you to think, oh, there should be a sweet spot in here and I should know how to choose the epsilon which is the sweet spot. But actually that's a losing battle. Instead I think what one should do is you should study how this complex -- how the structure of this complex grows with epsilon. And, remember, that's exactly what the statisticians are doing when they're talking about the connected components. So here you can see how the complex is evolving. Over here there are no connections. Here I'm connected up. I'm at the sweet spot. Here I'm at something which, well, I filled in some triangles, but actually fundamentally it kind of looks like a circle, but as it gets very large, of course, it will lose that character. Okay. 
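Here is a minimal sketch of that construction: given a matrix of pairwise distances and a threshold epsilon, list the vertices, the edges, and the filled-in triangles of the Vietoris-Rips complex. The eight points sampled around a circle are made up for illustration, with epsilon chosen near the interpoint spacing so that the complex that comes out is just the circle.

from itertools import combinations
import numpy as np

def vietoris_rips(dist, eps):
    # dist: symmetric matrix of pairwise distances; eps: the scale threshold.
    n = dist.shape[0]
    vertices = list(range(n))
    edges = [(i, j) for i, j in combinations(range(n), 2) if dist[i, j] <= eps]
    # A triangle is filled in exactly when all three pairwise distances are within eps.
    triangles = [(i, j, k) for i, j, k in combinations(range(n), 3)
                 if dist[i, j] <= eps and dist[i, k] <= eps and dist[j, k] <= eps]
    return vertices, edges, triangles

# Toy example: eight points around a circle, eps near the interpoint spacing,
# so we get eight edges and no triangles -- the circle is recovered.
theta = np.linspace(0, 2 * np.pi, 8, endpoint=False)
pts = np.c_[np.cos(theta), np.sin(theta)]
dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
vertices, edges, triangles = vietoris_rips(dist, eps=0.8)
print(len(edges), "edges,", len(triangles), "triangles")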
So single linkage clustering, the most vanilla version of hierarchical clustering, is obtained by forming the connected components of V(X, epsilon), and then the dendrograms are constructed by studying all values of epsilon at once and understanding the maps between components induced by these inclusions. That's the summary. That contains all the information in the dendrogram. So what I want to show you now is suppose that instead I'm interested in finding the loops. And I'm interested in finding the first Betti number. That is to say, the homology. So what you're looking at here now is a -- this is a dataset. So imagine that the points here are all the intersections of these edges. And I've built the Vietoris-Rips complex on this, and if you're in the back of the room, it will look very clear that this is a circle. Unfortunately, something happened in the sampling or whatever here, so that I've got these two little holes that are there too. They're also holes in this complex. So if I were to go ahead and do the homology calculation, this simple linear algebra calculation on this, I would find Betti 1 equals 3. Not the right answer. I want to see the rough, the coarse, the essential structure of the geometry and not just -- and not these little loops. But homology contains no notion of size as it stands. But what we do have is we have maps between the vector space for the homology at one threshold and the homology at the other threshold. Because, remember, I mentioned that when you build the homology it has the property that not only does it produce vector spaces, but it produces linear transformations between vector spaces. And the Vietoris-Rips, for a smaller scale value, embeds inside the Vietoris-Rips for the larger scale value. So let's look what happens. You might tell me here, first of all, well, the only problem with this is that I didn't choose the epsilon right. So maybe I should just enlarge it a little bit. And so when I do that, yes, I can enlarge it a bit to the point where I fill in these upper two. So that's a good thing, but now maybe I introduced something down here because there's an edge and a path down here. So it's sort of conspiring against me down here. With this calculation I also don't get the right answer. I get Betti 1 equals 2. And what I want is Betti 1 equals 1. The answer, though, is to say actually that we're going to study not only the homology of these two complexes but the linear transformation between them. Because in this case the linear transformation carries this big loop to the corresponding big loop on the other one. But in this case the two little blue loops got filled in. They went to zero, and on the other end this one did not come from down below. So what we'll call here the homology that persists across the change in scale values is a single copy. And that is, in fact, the structure that we're after. Okay. Yeah. So here's a picture. Taking components is functorial -- I want to say that taking components respects maps. But for homology, instead of a dendrogram, we now get what we call a persistence Boolean vector space by applying HK to the increasing sequence of Vietoris-Rips complexes. So what this means, persistence vector space simply means a bunch of vector spaces together with maps between them as I go through the increasing scale choices. They can be classified by barcodes. 
It turns out a barcode looks like this, just like vector spaces are classified by barcodes -- sorry, by dimension. So these persistence vector spaces are classified by barcodes. So here what this represents is a feature which is born here and then lives for a long scale value -- a long range of scale values, and then here are ones which live for some smaller scale values. There's a correspondence between the dendrogram and its barcode, because it has a barcode too. That loses some information from the dendrogram itself. However, this barcode notion extends to the higher dimensional invariants. So the left-hand endpoints are thought of as birth times and the right-hand endpoints are death times. And so here's another typical example. This is what a barcode might look like for this statistical circle up here. So we would look at it and we would say, look, this looks like a circle. From the barcode we would say, oh, yes, it looks like a circle. We see a single long bar, which is an honest geometric feature, and then the smaller ones, which we would say are, you know, sort of more -- perhaps noise. Okay. So what I want to show you now is how this plays out to study a particular -- an actual dataset. So this is joint work with my collaborators, Vin de Silva, Tigran Ishkhanov, and Afra Zomorodian. So suppose that you start out and we're going to try to study images now. So an image taken by a black-and-white digital camera can be viewed as a vector with a gray scale value for each pixel. So it's actually a very large vector. There are thousands, maybe even tens of thousands of pixels for cameras. And so the images lie in this very high dimensional space. We'll call it pixel space. So David Mumford asked the question what can you say about the set of all images that you obtain when you take images with a digital camera. So ask the -- you know, just sort of a thought experiment, what happens if I took all the possible images I could take going around the world and made them all into some huge dataset sitting inside this pixel space. Okay. So that's an ambitious question. So let's see. So understanding the way this set of images behaves is of interest from the point of view of studying the visual pathways and so forth. So their observation is that we can't really do this business up here. We can't study this full set of images up here. We don't have that full set. It's too high dimensional. It's also probably of high co-dimension inside P, because we know that the set of things that are images is highly restricted. So what Lee and -- or what Mumford and his collaborators, Ann Lee and Kim Pedersen, did was to say let's say we're going to try and instead study the local structure of the images statistically. That is to say we're going to study 3-by-3 patches. So here's a 3-by-3 patch. So these are 3-by-3 patches of adjacent slots. And so each one of those will then be some 9-vector. So each vector -- each patch gives a vector in R^9. First observation is that, because most images contain some solid patches, the things that will be most frequently occurring are the nearly constant or low-contrast patches. That will dominate the statistics. So, in a sense, that's an observation we can make. And once we make that, we ought to then ignore it because we can't say much more about constant patches. So the low contrast will dominate the statistics. And so what Lee, Mumford, and Pedersen did was to construct a dataset consisting entirely of high-contrast patches. 
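As an illustration of what such a barcode computation looks like in practice, here is a minimal sketch using the third-party ripser.py package (an assumption; any persistent homology library would do), run on a noisy sample from a circle like the statistical circle just described. The single long bar in degree 1 is the honest geometric feature; the short bars are the noise.

import numpy as np
from ripser import ripser   # third-party package, assumed installed (pip install ripser)

# A noisy sample from a circle, standing in for the statistical circle on the slide.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
pts = np.c_[np.cos(theta), np.sin(theta)] + 0.1 * rng.normal(size=(200, 2))

# Persistence diagrams in degrees 0 and 1; each row is a (birth, death) pair,
# i.e. the two endpoints of one bar in the barcode.
dgms = ripser(pts, maxdim=1)['dgms']
lifetimes = np.sort(dgms[1][:, 1] - dgms[1][:, 0])
print("H1 bar lengths, shortest to longest:", lifetimes)
# Expect one bar much longer than the rest: the circle. The short bars are the noise.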
So they collected about 4 1/2 million high-contrast patches from a collection of images obtained by two Dutch neuroscientists. They normalized the mean intensity of the patch, so you've got a 9-vector, subtract the mean from it so that you get it down to something with mean 0. It's now an eight-dimensional object. And then they also normalize the contrast, which effectively corresponds to the L2 norm of that normalized vector, to obtain a dataset on a seven-dimensional ellipsoid. Okay. So what we then want to study now, so what one finds when you do this is that when you take this dataset, it actually more or less fills out the seven-sphere. That is to say it's big enough that no matter where you go in the seven-sphere you can find some points. And so in a sense we can take that dataset and start to run things and prove somehow that the dataset was a seven-sphere, and at the end of the day we have nothing of interest there because of course we made it into a seven-dimensional thing, into a seven-sphere, ourselves. But so, on the other hand, the density of the points, though, varies quite a bit. And so we're going to study the dataset of most frequently occurring high-contrast patches. So here is a -- here is what we get. Now, I'll tell you -- all right. This was subsampled at this point because when we did this the software wasn't sort of mature enough to compute on very large sets. So, in any case, what we've got here is we've got 50,000 points. These are the 25 percent densest points as measured by a particular density estimator that we chose. And I'll tell you that density estimator has a scale parameter built into it, like the variance in sort of a kernel density estimator. So this is one that's chosen with a rather tight such variance. And so the barcodes that come out now are -- it's not just that there's a bunch of long ones, it's that there's always five. There's always five. So, in other words, you resample here, you take a different 50,000 points, and you take the 25 percent densest. You always get five of these long bars. So what that suggests is that we've got a space which has first Betti number of 5. Yeah. >> This is [inaudible] three patches. >> Gunnar Carlsson: Correct. >> Oh. Never mind. >> Gunnar Carlsson: I mean, this is a space -- this is a subset of a space which is a priori seven-dimensional. And now we're seeing some homological behavior here down at the very bottom. So the question is what does that tell you? What sort of interpretation does one have of that output of this dataset? Well, you can hunt around for a while and sort of look for the simplest possible explanation. There's more than one in this case. So in this case what we get is that here's a picture, though, that works. This is called -- what we called the three-circle model. So here the picture is that there's a single primary circle. And this primary circle consists of patches which are essentially an edge between a dark and a light region. And the angle of that edge is the parameter on that primary circle. Then there are these two secondary circles as well, red and green, and they don't touch each other. So the red and green circles don't touch each other, but each touches the black circle. So does the data fit with this model is the question. And in fact it does. So here's the picture. So you can see the primary circle up here is exactly what we're talking about. Here there's a -- if you like, there's a transition from dark to white. There's -- at an angle of about -- what is that, 135 degrees, and here that angle is changing. 
So here are the vertical patches, here are the horizontal patches, and then here is something in between. But now we have these secondary things as well. So you'll notice this one is the same as that one up there. This one is the same as that one, and there's a typo here, I guess. Yeah. The typo, of course, is that these two are the same. So this one should be like this one down here. Sorry about that. Similarly, this secondary one here, these green ones intersect at those two points. But now what we're looking at here is we're looking at patches that are entirely horizontal in nature. But notice that they transition from light to dark. This one starts to gray out. This one gets darker until I get to a point up here where I've got a single dark line in the middle of a white region. >> Can you explain again the primary circle? >> Gunnar Carlsson: Yes. So the primary circle, so it is a discrete representation of what we would have if we took a function -- think of a function on a square, an intensity function, and suppose it were actually linear in the two variables. And so it would be -- it would define an angle between the -- of the line that's sort of -- the lines of constancy, if you like, of those shades. And so the angle of that line is what is giving you the theta that's describing this circle. So you can see here it's just -- like this patch here, if I smoothed it out and rotated it I would get from here to here and then from there to there. >> The colors, the red and green here, have no significance? >> Gunnar Carlsson: The color is only to point out to you that this one is the same as this one. I'm trying to show you that the three-circle model, you know, the one circle fits with -- you know, overlaps with the other one. >> Visualizing the difference in color [inaudible] dominated. >> Gunnar Carlsson: Yes. Well, fair enough. That's right. But, no, they're supposed to be -- you know, this primary circle is supposed to be entirely uniform. Really these things should be just gray and it's just a matter of the rotations. So a question is: is there a two-dimensional surface inside which this data fits? And so you can now run a much bigger calculation here with different density thresholds and so on and find that the barcodes that come out -- and here there's a Betti 2 barcode as well -- so here this one says Betti 0 is 1, it says the space is connected. Here this says Betti 1 is equal to 2. Here this says Betti 2 is equal to 1. Okay. So this is sort of the fingerprint for this object. And an object which has that -- exactly that fingerprint is the so-called Klein bottle. So the Klein bottle is a surface, a two-dimensional surface which doesn't fit inside R^3. You can't embed it without tearing it. It's what happens if you do a Möbius band, build a Möbius band out of paper and now also try to connect points from the top to the bottom. When you do that you'll find yourself tearing the paper. So this is the identification space model. So that's the description of what -- of that space. So think here of this space as being a rectangle, even a piece of paper where I'm going to glue Q to Q and P to P. That's the Möbius band identification. And I'm going to glue R to R and S to S, which is the -- which is the other identification across the -- for the one horizontal line to the other one. So we think, then, this space is just a rectangle, but where we have to keep in mind that there are some things that are glued to each other. 
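In symbols, one standard set of identifications consistent with this description (the labels P, Q, R, S on the slide aren't reproduced here, so this is just the usual presentation) is

K \;\cong\; [0,1]^2 \,\big/\, \bigl( (0,y) \sim (1,\,1-y),\ \ (x,0) \sim (x,1) \bigr),

where the first identification, with the flip, is the Möbius-band gluing of one pair of sides, and the second glues the other pair of sides straight across.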
A more useful way of looking at the space than that picture that I showed you before. Do the three circles naturally fit inside? Well, yeah, they do. Here's the primary circle. This is a circle, you can see, because Q is glued to Q and P is glued to P. But we've also got the two secondary circles there. Notice they intersect the primary circle in two points, and then they form their own circles. Okay. So that's a picture of how the three-circle model fits in. And, in fact, now what you can see is how the patches -- and now I've rounded them off -- fit, are parameterized by that Klein bottle. Now, I should tell you that I jumped -- when I said that the Klein bottle has Betti numbers of 1, 2, and 1, the torus also has Betti numbers 1, 2 and 1. So in principle the space could have been a torus. But it turns out that there's a way of doing the homology calculation, not over the Boolean field, but maybe over a mod 3 field or even the rational field. And when you do that, you find that you distinguish between the two and, in fact, it turns out that it is a Klein bottle, as is then indicated by the fact that you can find a very rational, very sensible parameterization of the patches by the Klein bottle and not by the torus. At least we weren't able to. The Klein bottle also makes sense in that you can now produce from this what I would call a platonic model for the patches. So think of the patches now as being functions on the unit square. So we'll think about quadratic functions, polynomials in two variables. And if I take the subset of all those -- of all those quadratic functions in two variables which have the property that -- well, let me look at conditions 3 and 4 first -- that the integral here is 0, this is the mean value being 0, that's the mean centering that Mumford and company did. This here was the contrast normalization, so when you take the L2 norm of F to be 1. And now you put the additional condition that F is written as a composite of a single-variable quadratic with a linear map, and that space is something which one can compute directly is homeomorphic to, is the same as, a Klein bottle. So it gives you kind of that platonic model for it. So this homology -- so I've sort of shown you how it can sort of suggest, you know, a very pretty description, a parameterization, of a family of dense patches, dense image patches. But actually they can be adapted to do many different things, to do many different kinds of shape recognition. So, for example, they can be adapted to capture something that we would think of as less geometric than being a sphere or a torus or a Klein bottle. I might instead be thinking of some property of -- like a network or a metric space being built up out of modular components and even into a hierarchy. And so the barcodes can actually capture the presence of hierarchy here. So you can see there I've got a metric space which has clearly got a hierarchical decomposition, and what it's reflected in is the fact that, you know, there's this kind of -- if you were computing the length of the bars here, you'd see that the lengths were sort of peaked, had three different peaks to them. And so we've actually done this, adapted this to work on graphs and sort of random graph models to see that it can -- under a certain definition of what's meant by hierarchy, that it can actually detect the presence of hierarchy. And it can be used as -- the homological signatures can then also be used as sort of a search tool if you're trying to understand a large family of geometric objects, for example. 
So you could use them and say, look, I'm looking for something with particular features, maybe I will look for them using -- seeing which ones have the right homological signatures. Okay. So I want to switch now to -- so I've just talked about signatures that produce ways of looking at and sort of recognizing certain kinds of geometric objects. But another thing that you might ask for is can you produce sort of compressed maps of a dataset. So by compressed here I'll mean a simplicial complex. So rather than trying to take a dataset and coordinatizing it as with, you know, linear regression, I'm going to try to coordinatize it by mapping it to a simplicial complex. So the answer is that one can build such methods. Let me tell you how that goes. And let me tell you also roughly where we believe this sits: it sits as a method that's kind of intermediate between straight clustering on the one hand and more analytic, linear algebraic methods, such as principal component analysis and multidimensional scaling, on the other hand. So the starting point is just a space and it's equipped with a covering. So it's got a collection of sets whose union is the whole space. They're not necessarily disjoint. That would be a clustering or would be a decomposition like that. But these sets might have an overlap. So the sets can overlap. And for each of these U alphas, I'm going to let pi naught of U alpha denote the set of path components. So what I can do then is I can form the disjoint union of all these sets and I can then form the graph by connecting all pairs of components which overlap. So what's the picture. Here's a circle. And it's covered by three sets: red, yellow, and blue. Bluish, greenish. So what do I do. I take each of these three sets, I break them apart like this, so I've got three different sets. And now I compute the connected components. You can see the red one and the blue one have one connected component each, and the yellow has two. And so what I build then is a collection of four nodes: the red node for the single component up at the top, the blue node at the bottom, and then the two yellow nodes here in the middle. And now I connect the ones that overlap to get this complex. And what you'll see I've done is I've recovered the original circle up to topology. Suppose I have point cloud data instead now and a covering of that point cloud data. We build this simplicial complex in the same way but with the pi naught operation replaced by single-linkage clustering and possibly with a fixed parameter epsilon. So here's that -- so here's the picture of the coverings in that case. You'll see there's some orange happening up there and some green happening down below. So how do we choose the coverings. We build a reference map to a well-understood metric space, like the real line or the plane or the circle. And whenever I've got coverings of the well-known reference space, I can pull those back by taking the inverse images and get a covering. So here's a picture that describes it. You'll see a reference map down there to the real line, you'll see a covering of the real line, and you'll see then that I've got a covering of my whole space. So in this case the blue one, for example, will break into two components. So what might be the reference functions. Well, we might use density estimators as a map to R. We might use measures of so-called data depth, which is, for example, the sum of the squared distances to a given point. 
That's a measure of centrality, so it's how close to the center of the space you are. Eigenfunctions of a graph Laplacian could even be used. But also user-defined, data-dependent filter functions. So what I want to show you now is some examples of what happens when you carry this out. So this is a sanity check for us. This is -- this right here was a dataset done -- a diabetes study done at Stanford in the 1970s. On the left there you'll see -- so this was a dataset. It had I believe five coordinates: four metabolic numbers and also a normalized notion of size to it. So what was done in the '70s then was to use a projection pursuit method, which is a method which finds sort of the best linear projection of the dataset. And when they did that, they found a three-dimensional projection that looks like this. And somebody drew this by hand, drew all these points in by hand and kind of filled it all in. So that's where people were 40 years ago. Or 35 years ago. But you can see that there's a structure here. There's sort of a central blob or a central core and then two flares coming out. And so the qualitative observation that there are these two flares coming out, that is -- turns out to be the observation that there are two kinds of diabetes: there's juvenile onset and adult onset, Type 1 and Type 2. And then the central core here are the people who are near normal, normal or near normal. So when we apply our construction to this, we get these graphs coming out. So what you'll see here is there's a central core in both of them, and then there are some flares going out. And the fact that I'm showing you two of them is saying -- telling -- is reminding you or pointing out, I should say, that the construction that we wrote down actually has a multiscale aspect to it, because you can choose your coverings to be more or less refined. So you can actually get to see the dataset at lower or higher resolution. And so what this says is at both levels of resolution here there are clearly the two flares coming out, which correspond to, then, the Type 1 and Type 2. Here's another -- one other sanity check that we did was to study some gene expression data coming out of a cell cycle example where there's periodic motion happening in the gene expression profiles over time. And the way that's reflected -- remember I talked about finding -- when you find loops in the data, that may very well correspond somehow to the presence of periodic or recurrent behavior. And so indeed here you'll see that the representation contains a loop, and that loop is representing this periodicity that's going on here. Okay. Now so this one here, this is an example from RNA hairpin folding. So this is an example of data coming from a very high dimensional conformation space for a complicated biochemical molecule. So the conformation space is built by studying bond angles and things like that. And so in this case what did this picture correspond to? This turned out to correspond to the fact that in this particular dataset -- so this is a simulated dataset for RNA hairpin folding -- there were two different trajectories to the folded state. And they are reflected by these two points in here. Now, initially, when we look at this, it's quite possible to say, you know, a little feature like this, a little loop like this might very well be an artifact. You know, that does certainly happen here. And, you know, maybe if you were to change your scale choices or something it would change. 
But another approach is to go back to the practitioners and ask, look, can you tell me the difference between these two? Are we capturing something for you? And indeed the answer is that we did, because, as you can see here, the descriptions of the different trajectories to the folded state look like that. So the moral of this is that, because this feature is rather small in the dataset, a PCA or an MDS method is going to have a lot of trouble with a dataset like this. The larger aspects of the distance are going to wash it out. And so, you know, that's the value of the technique in this setting. So what I want to show you now, this is joint work with Monica Nicolau. This is a gene expression microarray dataset from breast cancer. One of the bigger and, you know, more highly regarded datasets, the so-called NKI breast cancer dataset. So in this case, you know, gene expression, what you have is you have a matrix where the columns are the tumors or the samples and the rows are genes and the entry at any point is the expression level of the given gene in the given tumor. And so that for us is Euclidean data. So it's rather -- it's rather easy to build one of these constructions for it. So in this case the filter that we used in this case -- well, there are two filters that work and produce this. One is a specifically well-understood notion called DSGA, so disease-specific genomic analysis, which measures, so to speak, the distance from normal tissue. Because there are normal samples up here. In fact, they're up here. But even that eccentricity or data depth filter that we looked at would also work. So now the picture -- let me interpret the picture for you. So up here we have the normal tissue samples. And then as you're moving away here you're finding tumors which are near normal, which are -- whose -- indeed there is tumor tissue here, but it's rather close to normal tissue. Gets worse out here, and then there's a break here between these two. So going out to the left are the so-called basal tumors. This is a collection of tumors which have very bad prognosis. It's a well-understood piece of the taxonomy of breast cancer. On the other hand, this flare coming out here had not been observed before, so, in fact, the way this dataset has been approached, a lot of people have gone at it with their favorite clustering methods, and they sort of come up with different numbers of clusters occasionally. There are families called luminal A and luminal B, but there's always a question of how many there are. Some would say there's five classes, some would say seven, some would say four. So this -- but this class here has not been seen before. And the amazing thing that happened is that at the outer part of this flare, the outer 22 patients in this flare out of nearly 300, but the outer 22, all survived the length of the study. So, in other words, this flare, the end of this flare, represents a very high survival group of breast cancer patients. Let me point out that no information about survival went into building the picture. So the picture is straight from the gene expression profile. So what you're seeing here on the right is actually a dendrogram. So that's the hierarchical clustering that we talked about before, a hierarchical clustering description of the same dataset. And what you'll see is that the points on this flare, they kind of go all over the place among the different clusters here. 
I would argue that it would be very difficult to extract this very high survival group from this clustering. And I would say that it's not even so surprising that that would be the case. Because supposing you -- supposing you believe our picture here which says, look, this space is basically connected, space is a connected object, if you're doing something like clustering, which is where all you're able to do is to tear things apart, that's all you do with clustering, you break things into pieces, you are very quickly in some cases going to separate things from other things to which they belong, things that belong together. And so that's the idea, that clustering is sometimes too blunt an instrument, and sometimes you want to understand and keep track of the actual geometry of the dataset. >> A question? >> Gunnar Carlsson: Yeah. >> I know of [inaudible] people taking [inaudible] taking the kind of latest and greatest of statistical [inaudible] modeling and you created topological features that are input along with traditional kinds of atomic data as evidence, and then built in classifiers that actually understand the shape and structure of the topological space [inaudible]. >> Gunnar Carlsson: I don't know of any such work. But I -- I feel like that's absolutely something one should try. So you're thinking of a neural net or something like that, but feeding in information. >> We've done some work where we've taken, for example, structure of [inaudible] properties like Web structure and used them as evidence. And we do pretty well with [inaudible] classifying things like the goodness of results, a set of certain results, for example. >> Gunnar Carlsson: Um-hmm. Yeah. >> And the rich features are the actual topological features [inaudible]. >> Gunnar Carlsson: Right. >> [inaudible] supervised learning where [inaudible] and instead of computing [inaudible] they'll use the notion of distance [inaudible]. They're using like [inaudible]. So some of it goes in there, but I think that's [inaudible]. >> Gunnar Carlsson: Well, there is -- so that's -- it's an interesting point. So some of the topology is captured in analysis. And we talked about these eigenvalues of Laplacians. And so, you know, there's a notion of Hodge theory which says that there's a bit of topology, what's called the rational part, that would be captured by, you know -- precisely by eigenfunctions. However, if you look back at the example of this Klein bottle, for example, it's actually got -- some of the homology has so-called torsion in it, which means that it cannot be captured by the analytic method. So the analytic methods, you know, would in that case not see the difference between, you know, the Klein bottle and the torus. So I think, you know, it would be interesting to take some of the -- you know, some of the analytic things but then also some of the discretized things, feed those discretized things in as well, and then work with that. Now, let me see. Do I have ten more minutes? I'm seeing here that I'm at 11:30. Was it scheduled for an hour? >> Yuval Peres: Ten more minutes. >> Gunnar Carlsson: Ten more minutes, okay. Let me talk real quickly, then, about -- yeah, here's another picture, which I won't go into in detail here. Okay. So let me say -- so the topological methods can produce signatures for recognizing a wide variety of shapes. They can also be used to provide useful visual representations of data. That's what we just saw with this mapper. 
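For concreteness, here is a minimal sketch of the mapper-style construction described above: a filter function to the real line, a covering by overlapping intervals pulled back to the data, single-linkage clustering at a fixed scale inside each piece, and an edge whenever two clusters share data points. The filter, the interval parameters, and the toy circle below are placeholders for illustration, not the choices used in the studies just discussed.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def mapper_graph(X, filter_vals, n_intervals=6, overlap=0.3, eps=0.5):
    lo, hi = filter_vals.min(), filter_vals.max()
    length = (hi - lo) / n_intervals
    nodes, edges = [], []
    for i in range(n_intervals):
        # One overlapping interval of the covering of the real line, pulled back to the data.
        a = lo + i * length - overlap * length
        b = lo + (i + 1) * length + overlap * length
        idx = np.where((filter_vals >= a) & (filter_vals <= b))[0]
        if len(idx) == 0:
            continue
        if len(idx) == 1:
            labels = np.array([1])
        else:
            # Single-linkage clustering at the fixed scale eps, within this piece of the cover.
            labels = fcluster(linkage(X[idx], method='single'), t=eps, criterion='distance')
        for c in np.unique(labels):
            nodes.append(set(idx[labels == c].tolist()))
    # One edge whenever two clusters share data points.
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            if nodes[i] & nodes[j]:
                edges.append((i, j))
    return nodes, edges

# Toy run: points on a circle with the height (second coordinate) as the filter;
# the graph that comes out should contain a loop, as in the circle example earlier.
theta = np.linspace(0, 2 * np.pi, 100, endpoint=False)
X = np.c_[np.cos(theta), np.sin(theta)]
nodes, edges = mapper_graph(X, filter_vals=X[:, 1])
print(len(nodes), "nodes and", len(edges), "edges")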
The methods are flexible enough to operate on many kinds of unstructured data. So, in other words, the method we're talking about here, it doesn't need Euclidean data. The data just has to have a notion of distance. And it doesn't even need to be a triangle inequality kind of distance. It just needs to be a measure of dissimilarity. But now what I want to show you here is what -- something that we haven't sort of fully implemented yet as a method but which I think is kind of -- you might find kind of interesting. So let me remind you of the bootstrap method. And I'm going to give you the very simplest version of that, because I'm sure there are those of you who know it much better than I. So you study statistics that measure the central tendency across different samples within a dataset, and it can give an assessment of the reliability of conclusions to be drawn from the statistics of the dataset. And let me just say, so the idea is that in order to understand the dataset and sampling strategy, you don't want to take a dataset and simply compute the mean of a statistic on it. It's much more informative to take samples and study those means and study those means themselves as a dataset. That's really the bit of the bootstrap philosophy that I want to carry over to questions not about numerical statistics but about decomposing a set into pieces or finding whether there is a loop inside a space. So I want to talk -- so let me -- so here's an example, then. So this is a -- imagine that the datasets -- so the data points in this case are just the yellow background here. So think of a very densely [inaudible]. Suppose that I do an experiment and I repeat the experiment over and over again. That is to say I sample points from it and I build one of these Vietoris-Rips complexes and I compute the Betti 1. That is to say, the loops. So in this case over here I'm going to build -- I have a red sample and a green sample, and they're both going to report back, ah, Betti 1 is equal to 1. And so of course if I have something that's reporting that back, I would say, oh, you know, if I'm getting Betti 1 equal to 1 a lot, probably I believe that there's a single big loop in there. But it might not be, because this example over here could also be the case, this more Swiss cheese texture kind of example of the data. It would also -- could in principle also report back, you know, a loop, but they are not the same loop. So in this case the green is not the same as the red. Whereas over here these two are the same because you notice I can deform the one into the other one. So a question is how do you build a methodology that sees the difference between those two. So this is called zig-zag persistence. So suppose you have a finite set of samples from the point cloud data. You're going to construct new samples, each one the union of the Ith one and the I plus 1st. And now not -- we have not just a family of samples but we actually have a diagram of samples, a collection of samples with relationships between them, maps including the one into the other. So you can see here S I gets included in S I union S I plus 1, and S I plus 1 gets included in there as well, and so forth. So now we actually have -- again, kind of like persistence, we've got a diagram with maps, but in this case the maps are changing directions, alternating. So we're going to build a Vietoris-Rips on these and apply K-dimensional homology to get a diagram of vector spaces of this same shape. 
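Here is a minimal sketch of that sampling setup, reusing the ripser.py package from the earlier sketch (again an assumption): bootstrap-style samples from a point cloud filling an annulus, the unions of consecutive samples, and the Betti 1 that each of them reports at a fixed scale. Whether the loops actually agree under the inclusions into the unions is what zig-zag persistence itself decides; that bookkeeping is not done here.

import numpy as np
from ripser import ripser   # same third-party package as in the earlier sketch

def betti1_at_scale(points, t):
    # Number of degree-1 bars alive at scale t in the Vietoris-Rips barcode.
    dgm = ripser(points, maxdim=1)['dgms'][1]
    return int(np.sum((dgm[:, 0] <= t) & (dgm[:, 1] > t)))

# A dense point cloud filling an annulus, standing in for the yellow region on the slide.
rng = np.random.default_rng(2)
radii = rng.uniform(0.8, 1.2, 2000)
angles = rng.uniform(0, 2 * np.pi, 2000)
cloud = np.c_[radii * np.cos(angles), radii * np.sin(angles)]

# Bootstrap-style samples S_i, and the unions of consecutive samples joining them.
samples = [cloud[rng.choice(len(cloud), 200, replace=False)] for _ in range(3)]
for i in range(len(samples) - 1):
    union = np.vstack([samples[i], samples[i + 1]])
    # Each sample reports a Betti 1, and so does the union it includes into; whether the
    # loops agree under those inclusions is exactly what the zig-zag machinery tracks.
    print("S%d: %d   S%d u S%d: %d" % (i, betti1_at_scale(samples[i], 0.5),
                                       i, i + 1, betti1_at_scale(union, 0.5)))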
Now, the idea is, if it's the case that I have a homology class here and a homology class here which agree when I map into here, that's more consistent with the first model that we had there, the one which had a single circle. If they map to different things, that's more like the Swiss cheese model. Okay. So we want to understand, then, the degree to which you have, so to speak, long vectors in here where you -- a class here which goes in here, which comes from something here, which goes to something here and comes from something here. Do we have long such vectors? So in order to carry out that analysis, it turns out that just like for the persistence vector spaces, where we had a classification in terms of barcodes, we'd want an analogous classification of diagrams of this shape. Turns out that there exists such a classification. And in fact the main theorem is that the classification is manageable, doable through simple linear algebra methods, exactly if the underlying shape of the diagram is one of these Dynkin diagrams. So the one we're interested in is this AN up here, because notice this is the underlying directed graph, so the arrows are going in alternating directions here. So it turns out, then, that those things can be classified by barcodes as well. So each diagram has a decomposition into a direct sum of the cyclic diagrams. And so we get a barcode classification there. And so the idea is now barcodes for those with long bars would indicate phenomena that are stable over many different samples. And so the others, the small ones, would be ones that appear in particular samples and then map to something which is either zero or is distinct from anything coming from any other sample. Various uses can be made of this, actually. It turns out it's going to allow you to compute even ordinary persistence much more effectively, much more quickly. Okay. So I think I will stop there and thank you for your attention. [applause] >> Yuval Peres: Any questions? >> Gunnar Carlsson: Yeah. >> [inaudible] very simple programs, for example, have a bunch of addresses [inaudible] mistakes. >> Gunnar Carlsson: Yes. >> And you want to find just clusters, but you want to do it very quickly. So does your technique lend itself to effective, fast computing? >> Gunnar Carlsson: Let's see. So let me understand the question, then. Let's see. So you've got addresses and now you need some way of recognizing when it's a mistake. Am I right? I mean, that's -- >> Right. So -- >> The cluster, then, [inaudible] same address really is just mistakes. >> Gunnar Carlsson: Oh, I see. Oh, yes, yes. Oh, right. So I'm sure under those circumstances you could devise a similarity measure, some measure of similarity which says, yeah, you know, you haven't -- you know, you mixed up the street for a road or something like that, but roughly speaking it's the same address. So, yes, I think that was the point, that we can sort of take any kind of metric, any kind of similarity notion. So if we can formulate it correctly, then in that case it sounds like some variant of a Hamming distance, or -- but taking into account the words I suppose would be -- would make it reasonable. >> [inaudible] >> Gunnar Carlsson: No, I'm not saying -- no, no. But I'm saying in principle -- in principle, yes, I mean, I'm guessing that maybe the difficulty, then, is building the similarity measure. Is that -- is that a fair statement? >> That's what most approaches do. >> Gunnar Carlsson: Yeah. 
>> So I was intrigued that you have this epsilon [inaudible]. >> Gunnar Carlsson: Yes, right, right. >> So it's interesting to consider. But then it takes a lot of time, if you change your metric. >> Gunnar Carlsson: If you change the metric. It's not the scale, though. Actually, what's interesting is that, you know, calculating the persistent homology is not much harder than doing it for a single case. That's a surprising fact. I mean, it's not much harder. And so you can get that summary in reasonable time, the summary for what happens over all scale values. But now you mentioned changing the actual underlying measure of dissimilarity. >> Right. What do you consider close. >> Gunnar Carlsson: Right. Exactly. >> You really cannot use a Hamming metric in these cases because you have insertions and deletions [inaudible] more sophisticated metrics that are -- take longer to compute. >> Gunnar Carlsson: Uh-huh. I see. So is that the problem, that basically the distances that work are kind of -- are too difficult to compute or too time-consuming? >> Or work for some cases but not for others. >> Gunnar Carlsson: I see. >> Nobody has [inaudible] probably the reason [inaudible]. >> Gunnar Carlsson: Right. >> [inaudible] to search. >> Gunnar Carlsson: So we have thought about doing some things in search, and we have done some work. I mentioned that we have, you know, applied it to networks, in particular to corpora of text and so on. So indeed we've done some work in that direction. You know, we can talk about those in more detail, if you like. Yeah. >> So in dimension 0 you have this dendrogram tree structure [inaudible] you only have this barcode. Is there any analog to -- >> Gunnar Carlsson: There is. This is the difference -- so this is the statistical version of what is -- in topology is called the difference between homotopy and homology. So in topology you have homotopy groups and homology groups, and there's actually a homomorphism from homotopy to homology. Now, when you go to this setting, in dimension 0 the homotopy is the dendrogram and the homology is the barcode. And so if you did homotopy persistently, you know, that would be perhaps the most natural analog to that. There are some obstacles that come up because it's much harder to do directed families of groups that aren't necessarily vector spaces. And the -- you know, the homotopy groups are not vector spaces. >> [inaudible] one dimensional [inaudible] most interesting. >> Gunnar Carlsson: Well, it depends on your point of view. You mean because it's nonabelian and so on. >> Yeah. >> Gunnar Carlsson: That's true. That also makes it the hardest. But, you know, the higher dimensional homotopy in the topology world is quite interesting. On the other hand, you know, I think your point is well taken here that, you know, real life typically isn't 50-dimensional. You know, real life typically -- the things one can understand are zero, one, two, three dimensional. And so you would be looking at low dimensional things. >> In your last example when you had this sequence of samples and you connected adjacent ones -- >> Gunnar Carlsson: Yeah. >> That would be very reasonable if these samples came in some time-structured way, but if they're arbitrary samples taken in a bootstrap, it seems arbitrary to just connect the adjacent ones, putting them into this linear graph structure [inaudible]. >> Gunnar Carlsson: Absolutely. 
That's where this remark, or I should say the theorem of Gabriel, comes in, because in order to have a good classification for the diagrams, we need them to be linearly arranged. You'll notice he doesn't have a two-dimensional array in there, and that's because those things are classified by more than any kind of discrete invariant. They actually have sort of continuous families of moduli for them. So but actually, I mean, I think you're right, though, in principle. In other words, one should invent a coarser invariant that, you know, doesn't impose an ordering on things. But as a first cut at studying it, it's not -- it's not so bad. And, you know, the way -- what we would simply do here is we would take, you know, a big number of samples, find the ones that produce a Betti 1, put them all together in one of these things and now see, do they match up. >> Yuval Peres: Any other questions? Let's thank Gunnar again. >> Gunnar Carlsson: Thank you. [applause]