>> Yuval Peres: All right. Good morning. We're happy to welcome Richard Peng, who will tell us about fast regression algorithms using spectral graph theory.

>> Richard Peng: Thanks, Yuval. So the talk will be structured as follows. I'll first start off by talking about some regression algorithms and why and how we solve them, so why do we solve them and how do we solve them. Then I'll move on to talk about fast solvers for linear systems, and then I'll finish by describing some of the graph theoretic tools needed for such solvers, such as tree embeddings.

So let me start off with a problem that you have probably all seen in one way or another, which is the learning/inference problem. For this problem, you're given a signal that's potentially noisy, but you know there is some very nice underlying structure to it. You want to remove the noise, so you want to extract the real signal, which is the step function on the right, and you want to be able to find the hidden pattern as well. Because this is an important problem, there have been a lot of tools developed to solve it, and one of the most commonly used tools is the idea of a regression objective. This kind of objective is: I put some constraints on a vector x, and I want to minimize some norm of this vector, usually a p-norm. p is usually picked to be at least one for this to be a convex problem, and the constraints on x are usually convex constraints. The simplest form of these convex constraints are linear equalities, so x lies in some subspace. Usually, x can be either the underlying structure or the de-noised signal.

So this kind of problem has many applications, and let me just describe a few of them very quickly. The first one is probably more well known than the objective itself, which is the lasso. This is from the area known as compressive sensing, and one key objective there is the lasso objective by Tibshirani, which aims to minimize the one-norm of x subject to Ax equal to s, the signal. This was shown to give extremely structured and sparse output, and it also is very robust: it's extremely resilient to noise. So this has a lot of applications.

The other application I want to talk about is image processing. Here, this is an example of a process known as Poisson image processing. The idea is that I have two images that really don't belong to each other, for example a polar bear and a picture of Mars taken from a space rover, and I blend them into each other. So I want to somehow merge the boundaries of these two things. The objective to be solved is defined on some underlying graph on these pixels, and I try to minimize some energy function depending on the difference of neighboring pixels and the ideal value that I want. So the objective up there is a sum over edges ij of x_i minus x_j, minus the desired difference for that edge, and by solving this, you get the picture on the right. So you can create these fairly strange scenes.

And the third application of this is a problem that perhaps the more theoretically oriented of us are more familiar with, which is the minimum cut problem. This problem asks, given a graph and two special vertices, s and t, to remove the fewest edges so that s and t become disconnected in the graph.
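[A minimal sketch of the lasso / basis pursuit style objective mentioned above -- minimize the one-norm of x subject to Ax = s -- written as a linear program and handed to a generic LP solver. This is not code from the talk; the matrix, signal, and sizes are made up, and real compressive sensing instances would be far larger and use specialized solvers.]

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    n_meas, dim = 12, 20                      # 12 measurements of a length-20 signal
    A = rng.standard_normal((n_meas, dim))
    x_true = np.zeros(dim)
    x_true[[3, 11]] = [1.0, -2.0]             # sparse ground truth
    s = A @ x_true                            # the observed signal

    # Variables z = (x, t) with |x_i| <= t_i; minimizing sum_i t_i minimizes ||x||_1.
    c = np.concatenate([np.zeros(dim), np.ones(dim)])
    A_eq = np.hstack([A, np.zeros((n_meas, dim))])          # encodes A x = s
    A_ub = np.block([[np.eye(dim), -np.eye(dim)],           #  x - t <= 0
                     [-np.eye(dim), -np.eye(dim)]])         # -x - t <= 0
    b_ub = np.zeros(2 * dim)
    bounds = [(None, None)] * dim + [(0, None)] * dim       # x free, t >= 0

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=s, bounds=bounds)
    x_hat = res.x[:dim]
    print(np.round(x_hat, 3))                 # typically recovers the sparse x_true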
So minimum cut is a classic problem in combinatorial optimization, and it's not very clear that this is a regression problem. But as it turns out, you can formulate it as one. So consider the following variation. I want to label the vertices of the graph so that s gets labeled zero and t gets labeled one, and I want to minimize the sum over all edges of the difference of the labels at the end points of that edge. The connection is still not as clear as I would like to show, but one direction of it I think I can convince you of: I can label everything on one side of the cut zero and everything on the other side of the cut one, and the size of the cut is simply the sum over all edges. Any edge that crosses the cut has a difference of one, and any other edge has a difference of zero, so summing over all these edges gives me the size of the cut.

So hopefully I've convinced you that these problems exist in a variety of contexts, which means that there is reason to study very efficient algorithms for them.

>>: Sorry, for the last problem, aren't the Xs restricted to zero and one?

>> Richard Peng: So you can actually relax them to real numbers, and there are results that show that the optimal fractional solution is equal to the optimal integer solution.

>>: All right.

>> Richard Peng: So algorithms for minimizing these types of objectives have been studied for a very long time in computer science, and the history of these algorithms can essentially be divided into 20-year intervals. Between the '40s and '60s, work on simplex methods and the KKT conditions showed these problems are tractable, so there exist algorithms for them. Between the '60s and '80s, there was work on the ellipsoid algorithm and a variety of other methods that showed these problems can be solved in polynomial time, which is kind of the gold standard for theoretical efficiency. And between the '80s and '90s, there were further developments which led to interior point algorithms, which showed that these problems can be solved very efficiently. The general idea of interior point algorithms is that they solve one of these problems using about square root M Newton steps. To describe efficiency, I let M be the number of non-zeros, the size of the problem, and I will use this tilde-O notation to hide the log factors, consistently throughout this talk as well.

So before continuing talking about algorithms, let me first explain why we need fast algorithms. The reason is that a lot of the data that come from these regression problems are big. For example, this picture, which is roughly a cell phone camera picture, has about ten to the six pixels, so M is at least ten to the six. And if you go to videos or 3D medical data, you can easily get to data that's about ten to the nine or bigger. And if you look at the key subroutine that's used in interior point algorithms, it's this idea of a Newton step. Interior point algorithms solve one of these problems using about square root M Newton steps, and what is a Newton step? There's a fairly simple characterization of it, which is that I'm solving a linear system. So the key bottleneck in getting fast algorithms for these regression problems, at least the way we know them today, is that we need fast linear system solvers.
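[To make the "a Newton step is just a linear system solve" point concrete, here is a minimal sketch -- a toy example, not the talk's code -- of damped Newton steps on a small log-barrier objective of the kind interior point methods work with. The data, sizes, and barrier parameter are all made up; the thing to notice is that each iteration's work is dominated by one solve of H d = -g.]

    import numpy as np

    rng = np.random.default_rng(1)
    m, n, t = 40, 5, 10.0                     # constraints, variables, barrier parameter
    A = rng.standard_normal((m, n))
    b = 1.0 + rng.random(m)                   # b > 0, so x = 0 is strictly feasible
    c = rng.standard_normal(n)

    def f(x):                                 # barrier objective  t*c^T x - sum_i log(b_i - a_i^T x)
        r = b - A @ x
        return np.inf if np.any(r <= 0) else t * (c @ x) - np.sum(np.log(r))

    x = np.zeros(n)
    for _ in range(25):
        r = b - A @ x
        g = t * c + A.T @ (1.0 / r)           # gradient
        H = A.T @ ((1.0 / r**2)[:, None] * A) # Hessian
        d = np.linalg.solve(H, -g)            # the Newton step: one linear system solve
        step = 1.0                            # backtracking line search (stays feasible, decreases f)
        while step > 1e-10 and f(x + step * d) > f(x) + 0.25 * step * (g @ d):
            step *= 0.5
        x = x + step * d
    print(np.round(x, 4))                     # approximate minimizer of the barrier problem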
To give further reason that we actually want fast linear system solvers, I'm going to use this quote from Boyd and Vandenberghe, which says that if you look at the number of iterations for a randomly generated typical SDP instance, it only grows very slowly as the problem size grows. The chart on the right is the figure that plots the number of iterations against the log of the problem size, and as you can see, it doesn't really grow much. But the more sharp-eyed of you may notice that the problem size only tops out at about ten to the three, so it's only about a thousand. And there's a good reason for this, which is that we really don't have good ways of solving general linear systems very fast.

So linear system solvers are actually one of the earliest studied algorithmic problems. People worked on this as early as the first century, and also, I think, Newton worked on this. It also happens to be one of the most rediscovered algorithms: the algorithm that was rediscovered is known as Gaussian elimination, and in modern terminology it is an n-cubed algorithm. It was shown by Strassen that the exponent can be lowered to about 2.8, and much work followed, to the state of the art today, which is about 2.3727. But if you think about this number, even if you analyze it in conjunction with interior point algorithms, we're still looking at roughly a quadratic time algorithm. And what's more problematic about this approach is that these algorithms also need about N squared space, which is even more problematic when that's around ten to the twelve.

So what's often used in practice today are actually methods that get around this issue, namely first order methods: some type of coordinate descent, gradient descent or subgradient method. These methods have the nice property that the cost per iteration is very low, but they do trade this for slow convergence. And often what happens is that they run for a number of iterations and then, because of time constraints, they're cut off. So one way to think about these algorithms is that they are essentially trading solution quality for time, because time is such a crucial resource for these minimization problems.

On the other hand, there has been some work recently which showed that at least for some instances, we can solve linear systems extremely fast. This is due to a work by Spielman and Teng in 2004, who showed that for linear systems that are related to graphs, you can solve them in nearly linear time, roughly M poly log N. I will talk more about this algorithm in 12 slides, but first let me explore some of the consequences of it. The consequence is that we now have, first, that these second order methods that use linear systems have a low complexity, so a low number of iterations. On the other hand, these linear system solvers give us a low complexity per step as well. So if you combine these two together, you can get extremely efficient algorithms with very good guarantees as well. And in the study of graph algorithms, this led to the idea of the Laplacian paradigm.
So some representative works there are a result by Daitch and Spielman, who showed a minimum cost flow algorithm that runs in about M to the 1.5 time, and a more recent work by Christiano, Kelner, Madry, Spielman and Teng, who showed the state of the art algorithm for approximating maximum flow and minimum cut in graphs. That algorithm runs in about M to the four-thirds time.

Some of the work that I've done in this area is related to extending these algorithms to the more general setting of graph-based regression. The first extension that I worked on was in a joint work with Chin, Madry and Miller, where we showed that it's possible to extend their algorithm to the more general classes of regression objectives that come out of image processing. Specifically, we showed that it's possible to solve the grouped L2 objective, which is a close cousin of the group lasso objective, and this turns out to capture almost any convex image processing objective that we could find. So here is an example of denoising an image. By setting up the objective correctly, you can first remove some of the small noise, and if you overdo it, you can also get to the image on the far right, which is a cartoon-like version, where the person's ear is considered noise as well and also removed.

On a more theoretical side, in a joint work with Kelner and Miller, we showed that it's possible to extend this algorithm to solving K-commodity flow, which is where you're trying to route goods through a network, but these goods aren't exactly exchangeable, so think about trying to route apples and oranges through the same graph. This problem actually is very relevant to a lot of the practical regression objectives, which are related to multivariate labelings of graphs. So instead of labeling every vertex with a single number, as in the minimum cut setting, you're assigning K labels to each vertex and minimizing some objective depending on their differences.

And in more recent work joint with my advisor, Gary Miller, we showed that for more structured graphs, such as the graphs that you get out of images or videos, so basically any graph with a good separator structure, you can actually speed up the Christiano et al. framework to even faster running times. This is probably also relevant for a lot of these image-related applications, because images do tend to have fairly good separator structures.

So the implication of fast solvers, if you take these generalizations into account, is that they give fast algorithms for solving a variety of these graph-based regression objectives. And at the same time, as I will talk about later, they also lead to parallel algorithms. There are actually a variety of other applications of fast solvers as well, such as planar embedding, solving finite element systems and obtaining good separators for graphs, so finding good balanced cuts in graphs. Due to time constraints, I will not be able to talk much about them, but I'll be happy to talk about these offline.

Okay. So for the next bit, I'll try to talk about how these fast solvers work. The problem is that given a matrix A and a vector b, I want to find the vector x such that Ax equals b. To describe running time, I will let N denote the dimension of A, so A is an N by N matrix, and it has M non-zero entries. So that's the size of the problem.
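[A small aside, not from the talk, on why M, the number of non-zeros, is the right measure of problem size: in a compressed sparse row representation, both the storage and the cost of a matrix-vector product scale with M rather than with N squared. The matrix below is a made-up example with about 3N non-zeros.]

    import numpy as np
    import scipy.sparse as sp

    n = 1_000_000                             # a million-by-million matrix...
    rows = np.arange(n - 1)
    cols = np.arange(1, n)
    vals = np.ones(n - 1)
    A = sp.csr_matrix((vals, (rows, cols)), shape=(n, n))
    A = A + A.T + 3.0 * sp.identity(n, format="csr")   # ...with only about 3n non-zeros

    x = np.ones(n)
    y = A @ x                                 # touches each stored entry once: O(m) work
    print(A.nnz, y[:3])                       # ~3n stored entries; dense storage would need n^2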
But because A is related to a graph, it actually has some very specialized structure, and the structure is that it's a graph Laplacian. There are many possible formal definitions of a graph Laplacian. The simplest one is that it's the degree matrix, the diagonal matrix containing the degrees, minus the adjacency matrix of the graph. The other way to think about it is that each entry A_ij is the degree of the vertex if it's on the diagonal; otherwise, it's the negation of the weight of the edge, so it's negative w_ij. So consider this example of a graph Laplacian on the left; the graph below is the graph corresponding to it. Here the first vertex has degree three, three edges coming into it, so it corresponds to an entry of three in the matrix. The first edge corresponds to those two negative one entries, because the matrix is symmetric, so you put one on both sides as well. The next edge corresponds to these two negative ones, and this one corresponds to the third pair.

There is actually an extension of graph Laplacians in numerical analysis known as symmetric diagonally dominant (SDD) matrices. It was shown in work by Gremban and Miller that solving graph Laplacians is essentially equivalent to solving SDD matrices. So in terms of stating the result, this is often stated as solvers for SDD linear systems, but in terms of algorithms, it suffices to just think about graph Laplacians. So what are the issues with dealing with graphs?

>>: Sorry, on the previous one, are the w_ij's ones, or can they be different for every entry?

>> Richard Peng: They can be different for every entry. These are weighted graphs.

>>: In this case, it's not the degree, it's just the row sum?

>> Richard Peng: Yeah, it's a weighted degree. So I think at various points in this talk, I will actually go to unweighted graphs for simplicity; these all generalize to the weighted case.

So one of the biggest challenges in dealing with graphs is that they can be very unstructured. For example, this is a graph corresponding to a social network. These are graphs that at the top level look almost like complete graphs, in that there's very little good partitioning that happens, but we know that there are some underlying structures in there. The other problem is that if you're using this kind of algorithm inside other, more sophisticated routines, the intermediate linear systems that are being solved, because there are so few of them, almost capture the entire difficulty of that optimization problem. We know that these regression problems are difficult, which is why we want to solve them, but the difficulty has to go somewhere, so in some sense a lot of the difficulty of the problem actually goes into solving the linear systems.

So the result by Spielman and Teng formally can be stated as: given an N by N graph Laplacian with M non-zero entries, such that I know the exact answer is x, so that I know b equals Ax for some vector x, this algorithm is able to find an approximate solution to within some epsilon, and it runs in nearly linear time. Notice that this algorithm is approximate in nature, in that there is an error guarantee there, but if you look at the running time, the dependency on epsilon is only log one over epsilon, so it's a very strong approximation. And by nearly linear, I mean M poly log N, so M times log N to the power of C for some constant C.
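[To make the graph Laplacian and the solver guarantee concrete, here is a minimal sketch on a small made-up weighted graph (not the graph from the slide): build L = D - A from an edge list, check the quadratic form x^T L x = sum over edges of w_ij (x_i - x_j)^2, and solve a system in it with an off-the-shelf conjugate gradient routine standing in for the fast solvers being discussed.]

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import cg

    edges = [(0, 1, 1.0), (0, 2, 2.0), (0, 3, 1.0), (2, 3, 3.0)]   # (u, v, weight)
    n = 4
    L = sp.lil_matrix((n, n))
    for u, v, w in edges:
        L[u, u] += w; L[v, v] += w            # weighted degrees on the diagonal
        L[u, v] -= w; L[v, u] -= w            # minus the edge weight off the diagonal
    L = L.tocsr()

    x = np.array([0.5, -1.0, 2.0, 0.0])
    print(np.isclose(x @ (L @ x),             # the Laplacian quadratic form...
          sum(w * (x[u] - x[v]) ** 2 for u, v, w in edges)))   # ...equals this sum: True

    # Laplacians are singular (constant vectors are in the null space), so pick a
    # right-hand side that sums to zero, as any b = L x_exact does; CG is fine here
    # because b is orthogonal to that null space.
    b = np.array([1.0, -1.0, 2.0, -2.0])
    x_hat, info = cg(L, b)
    print(info, np.linalg.norm(L @ x_hat - b))   # info == 0 means CG converged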
Also, the error is being measured in the A norm, so you've got to think about this as the norm induced by the matrix: it's the square root of x transpose A x. (This G on the slide should be an A, by the way.) So this is a great algorithm from a theoretical point of view, but from a practical point of view, there is a caveat to it. Notice that there's this log N to the power of C in the running time. Personally, I've never been able to track down what exactly the value of C is, so I just resort to authority for some estimates. The estimate by Spielman is 70. By my advisor, Gary Miller, 32. Co-author Koutis says 15, Teng says 12, and the lowest I've heard is from Lorenzo Orecchia, which is 6. So consider the case where M is about ten to the six, so you're dealing with a million by million linear system. Here, log N to the power of six is more than a million. So this algorithm, for a lot of the practical instances that we're dealing with, is essentially a quadratic time algorithm.

In two joint works with Koutis and Miller, we showed that essentially the same kind of guarantees can be obtained with a running time of M log squared N, and also, in work in 2011, we showed that we can get to M log N. So we essentially improved the log exponent in the running time to a single log. For the next bit of the talk, I'll try to describe briefly how this algorithm works.

So the algorithm has three key components: iterative methods, graph sparsifiers and low stretch embeddings. I'll talk mostly about the last two, which means I'll try to give you the two-minute version of iterative methods next. An iterative method is one of these great ideas from numerical analysis, which states that if I want to solve a linear system in a matrix A, I can solve it by solving linear systems in B and iterating. So I can solve it by making a number of calls to a system that's similar, and solving that system instead. The idea of similarity is defined through a spectral condition, which is written using this spectral less-than-or-equal-to notation. Formally, A spectrally less than or equal to B means that for any vector x, the A norm of x is less than or equal to the B norm of x. But for all the uses that I will make of this spectral notation in this talk, I think I can appeal to your intuition about scalars, in that the operations I will do with it are almost identical to manipulating the less-than-or-equal-to for scalars.

One of the key ideas in these solvers is that because A is a matrix that's obtained from a graph, we might as well pick B to be a graph, and analyze this kind of spectral less-than-or-equal-to condition using graph theoretic notions. So I will replace A with G, B with H, and deal with graphs from this point on. Because we want to find an easier H and we're dealing with graphs, there are two obvious candidates for what makes an easier graph: fewer vertices, or fewer edges. As it turns out, we can reduce the vertex count if the edge count is small, so I will just focus on getting an H with a smaller number of edges, such that H is similar to G.
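[A minimal dense toy sketch of the iterative-method idea just described; this is my own construction with made-up matrices, not the actual solver, which uses graph preconditioners, Chebyshev-style acceleration and recursion rather than a direct factorization. The point is only the mechanism: to solve in A, repeatedly solve in a spectrally similar B; under the normalization A spectrally below B and B within a factor kappa of A, the error contracts by roughly 1 - 1/kappa per step.]

    import numpy as np
    from scipy.linalg import eigh, cho_factor, cho_solve

    rng = np.random.default_rng(2)
    n = 80
    M = rng.standard_normal((n, n))
    A = M @ M.T + n * np.eye(n)               # the system we actually want to solve
    Q = rng.standard_normal((n, 5))
    B = A + Q @ Q.T                           # a spectrally similar "preconditioner", A <= B

    w = eigh(B, A, eigvals_only=True)         # generalized eigenvalues: B v = w * A v
    kappa = w.max()                           # so A <= B <= kappa * A

    b = rng.standard_normal(n)
    x_true = np.linalg.solve(A, b)
    solve_in_B = cho_factor(B)                # "solving in B" is the cheap primitive here

    x = np.zeros(n)
    for _ in range(40):
        x = x + cho_solve(solve_in_B, b - A @ x)   # one solve in B per iteration
    rel_err = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
    print(round(kappa, 2), rel_err)           # error is roughly (1 - 1/kappa)^40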
So what I want are graph sparsifiers. These are sparse equivalents of graphs that preserve some property, and they have been studied extensively in computer science in various forms. Examples are spanners, which preserve distances or the diameter of the graph, and cut sparsifiers, which preserve the values of all the cuts up to some small factor. What we actually need are spectral sparsifiers, which preserve the eigenvalues and eigenvectors, so the spectrum of the graph.

Formally, what we need is an object known as an ultra-sparsifier, which is: for a graph with N vertices and M edges, along with a parameter K, I want a graph with N minus one plus M poly log N over K edges, such that these two graphs are within a factor K of each other. So I essentially want to reduce the number of edges by the same factor as the similarity factor, but I'm willing to lose some poly log factors. As it turns out, this poly log factor determines the running time: it was shown by Spielman and Teng that getting ultra-sparsifiers with this type of quality implies solvers with essentially the same exponent of log in the running time.

>>: What is the order of K that you take?

>> Richard Peng: So the order of K that's used is slightly larger than this poly log. The value of K that's used for these algorithms in the Spielman-Teng framework is set to a poly log factor. So it's mildly sparser from a theoretical point of view, but it's still more than a constant factor reduction in the problem size. And the Spielman-Teng work can essentially be viewed as obtaining ultra-sparsifier constructions for a fairly large value of P, so ultra-sparsifiers for a fairly large poly log factor.

Before we continue further, let me just do a sanity check and consider one example of a graph that we want to build an ultra-sparsifier for. This graph is the complete graph. For those of you who are familiar with solving linear systems on complete graphs, you are probably aware this is a fairly easy problem, in the sense that you can just read off the answer. But let's just say we want to get an ultra-sparsifier for it. I claim that here random sampling works, in that I just pick about N log N random edges, but rescale these edges so that in expectation the total weight is roughly the same, so rescale them up by a factor of N over log N. And it can be shown that with high probability, the resulting graph is going to be an ultra-sparsifier. So it's going to have N log N edges, which is far fewer, and it's also a constant factor approximation.

But this only works for complete graphs, because all the edges are roughly the same. For general graphs, we do something slightly different, which is we want to have different probabilities for keeping each edge, because the edges aren't exactly the same as each other anymore. So the general graph sampling mechanism, which is due to an earlier framework, is that I want to sample edges such that each edge is kept with probability p_e. So I want to associate a probability value with each edge, and I want to sample these edges with that probability, and if I do pick an edge, I want to scale it up by a factor of roughly one over p_e to maintain the expectation, so that in expectation I get the same edge back. There are a few quick observations that can be made about this framework. The first one is that the expected number of edges kept is equal to the sum of the probabilities, just because in expectation each edge is kept with probability p_e. And what we need to do is prove the concentration of this sampling.
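[A quick sketch of the sampling framework just described, with made-up weights and probabilities -- not a real sparsifier construction, since the probabilities below are arbitrary rather than resistance-based: keep each edge independently with probability p_e, scale a kept edge's weight by 1/p_e so each edge is preserved in expectation, and note that the expected number of survivors is the sum of the p_e.]

    import numpy as np

    rng = np.random.default_rng(3)
    m = 100_000
    weights = 0.1 + rng.random(m)             # made-up edge weights
    p = np.clip(rng.random(m), 0.05, 1.0)     # made-up keep-probabilities p_e

    keep = rng.random(m) < p                  # keep edge e with probability p_e
    resampled = np.where(keep, weights / p, 0.0)   # kept edges get scaled up by 1/p_e

    print(keep.sum(), p.sum())                # edges kept vs. sum of p_e: close
    print(resampled.sum(), weights.sum())     # total weight preserved up to sampling noise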
So before I try to describe the concentration result, I need to take a slight detour and describe a key quantity that's needed for measuring this concentration, which is the effective resistance between vertices in a graph. This is another one of those notions that can be defined in a bunch of ways. The physics definition is that I'm going to view the graph as an electrical network, so each edge is a conductor with conductance equal to its weight. And the way you measure the effective resistance between two vertices is you literally take a voltmeter, plug its two end points into those two vertices, and you try to run one unit of current from one vertex to the other. The effective resistance between these two vertices is just the number you read off your voltmeter. In general, the way you compute an effective resistance is that you solve Gx equal to an indicator vector, and you read off the difference in the solution between these two vertices. This is something that I think is taught in [indiscernible], but I don't really remember it. And from what I can remember, this leads to fairly complicated calculations, because once you're asked to do Gaussian elimination on a five by five matrix on a test, there are some horror stories.

So let's just consider a simpler version, which is: I have a single edge. Here, the effective resistance of this edge is one over the weight of the edge. And I will also use a rule about resistors in series: if I have two resistors chained together in series, one after the other, the resistance of the combination is just the sum of their resistances. So the two rules that are going to be helpful are that the resistance is inversely proportional to the weight of the edge, and that if I chain resistors together, the resistance of them as a single component is just the sum of their resistances.

So now that I've reviewed a little bit of this machinery, I can state the result by Spielman and Srivastava. What they showed is that by letting the sampling probability equal the weight of the edge times its effective resistance, times an extra log N factor, this will give a graph that's within a factor of two of the original graph, with high probability. There are some probabilistic issues I'm hiding under the rug, but what's more crucial is this fact about the sum of these probabilities. This is an amazing fact, which is that the sum over all edges of the weight times the effective resistance, in a connected graph, is exactly N minus one. It's a bit of an odd result, but let's consider some simple cases. Consider the case where the graph is a tree and every edge has unit weight. For each edge, the only path between its two end points is the edge itself, so each edge has effective resistance one, and if you're dealing with a graph that's a tree, you have N minus one edges, so you have a sum total of N minus one. What's amazing is that this is actually true for general graphs. And the implication of it is that every graph has a spectral sparsifier with N log N edges.

So we constructed spectral sparsifiers, but we haven't really gotten to ultra-sparsifiers or solvers. The reason is that there's actually a slight caveat, which is essentially a chicken and egg problem. The issue is, how do we compute effective resistances? As I mentioned earlier, to compute effective resistances you need to solve a linear system in general, for general graphs.
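[A minimal sketch, on a small made-up graph, of the two facts in play here: an effective resistance is read off from a Laplacian solve, R(u, v) = (e_u - e_v)^T L^+ (e_u - e_v), and summing weight times effective resistance over the edges of a connected graph gives exactly N - 1. The pseudoinverse below stands in for "solve a linear system".]

    import numpy as np

    edges = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 1.0), (3, 0, 3.0), (0, 2, 1.5)]
    n = 4
    L = np.zeros((n, n))
    for u, v, w in edges:
        L[u, u] += w; L[v, v] += w
        L[u, v] -= w; L[v, u] -= w

    L_pinv = np.linalg.pinv(L)                # in place of solving L x = e_u - e_v

    def effective_resistance(u, v):
        chi = np.zeros(n)
        chi[u], chi[v] = 1.0, -1.0            # inject one unit of current at u, extract at v
        return chi @ L_pinv @ chi             # the voltage gap read off the "voltmeter"

    total = sum(w * effective_resistance(u, v) for u, v, w in edges)
    print(round(total, 10))                   # 3.0, i.e. exactly n - 1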
And this was actually the case in the Spielman-Srivastava work as well. The way they obtain the effective resistances is they use a solver to solve the linear systems. But if you look at the solver, what it needs is spectral sparsifiers, so it comes back to this very quickly.

So most of our work can be considered as a way to work around this issue, and there were two things that we did to do this. The first thing is that instead of using effective resistances, we use fairly crude upper bounds for them. And the second thing is that we actually modify the problem to one where we can get fairly good upper bounds. To get upper bounds, we use a result from electrical engineering known as Rayleigh's monotonicity law, otherwise known as "I just unplugged my computer": if I remove edges from a graph, or if I remove components from an electrical circuit, the thing will stop working eventually, which means I've made the resistance go to infinity at some point. The slower version of that is, if I remove components from an electrical circuit, the resistance between any two points can only increase. So to get upper bounds for effective resistances, we can actually measure effective resistances with respect to only a part of the graph. And what we do in the most aggressive version of this is to calculate the effective resistance with respect to a tree. We don't want infinite values, because that's really bad for sampling, but what we can do is measure them with respect to a tree. So we toss out all of the edges except a small part of the graph, and here, to measure the effective resistance between two vertices, only the path between them matters. These two edges hang off the path, they don't really help for effective resistance, so all we need to do to measure the effective resistance is to sum up the resistances along the path.

Turns out this is actually a known quantity: it's known as the stretch of an edge with respect to the tree. And what we need is a tree for the graph such that the total stretch of all the edges is small. What's even more amazing is that for any graph, it turns out there exists a good tree. So for any graph, there exists a spanning tree for which the total stretch of all the edges is roughly M log N. This is hiding a log log N factor, and I'll get to more details about this in 12 slides.

>>: How are you going to define the stretch?

>> Richard Peng: I'm just summing over all edges.

>>: But only like, I guess, [indiscernible] every edge, right?

>> Richard Peng: So there is this expectation-type guarantee, which is a stronger guarantee, and it turns out that from a metric embedding point of view, these notions are interchangeable. But in terms of getting a fast algorithm, for this M log N algorithm it's actually crucial that we use the weaker notion of total stretch; it's open to get the expectation guarantee in that time, but it's probably possible to get the expectation as well.

So what does this imply? Our upper bounds for effective resistance, so our upper bounds for weight times effective resistance, are going to sum to M log N, which means that if you combine this with sampling, you get about M log squared N edges back, which is more than the graph we started with. So this is a problem. And this situation is actually most evident in the case of an unweighted graph, where every edge has unit weight. So consider this tree in red.
I'm not sure even how clear it is. But if you use this tree, the stretch of the edge that I'm pointing to, because it takes two tree edges to connect its end points, is a length-two path, so the resistance value is two. And because every edge has length one, the shortest path you can get between any two vertices is one, which means that the lowest sampling probability we're going to get, if we do this to an unweighted graph, is at least one. Which really doesn't mean anything, because if you sample everything with probability one, you're going to get the same thing back. So the question is, is this even a good tree?

So now let's just step back a bit and see what we're missing with this kind of approach. What we needed was an ultra-sparsifier: something that has a factor K difference from the original graph but has about N minus one plus M poly log N divided by K edges. What we just generated has about N minus one plus M log squared N edges. The N minus one I put there to make them look about the same. But if you recall the Spielman-Srivastava guarantee, it says that G and H are within a factor of two of each other, so it's a very good approximation. And the other thing to notice is that we have not used K at all. The one we needed has this factor of K; we have not introduced K in our construction. So we're going to use K somehow, and I propose we use it in the simplest way possible, which is going with this notion that the tree we got is good. It's a good part of the graph, so let's make it heavier by a factor of K.

By doing this, we're going to get a different graph, G prime, and this graph is going to be a factor K approximation of our original graph. And now I claim something good happens. An edge in the tree, instead of having weight one, now has weight K, which means, because the resistances are inversely proportional to the edge weight, that these two resistors now have resistance one over K each. So you look at the same edge: its stretch now went down by a factor of K. This is actually true for any edge, and what you get is that the tree-based effective resistance between any two points now decreases by a factor of K.

So what we need to do now is the accounting. We are going to pay for the tree edges, because the tree edges didn't really change relative to their own paths: they got heavier, but their tree paths also got heavier, so they're still going to be there. So I've got N minus one tree edges. But for the off-tree edges, all the probabilities went down by a factor of K, so we're looking at M log squared N over K edges now. And once you sample, you're going to get a sparser graph. But the issue is that we kind of cheated when we modified the problem: we didn't really sparsify G, we sparsified G prime. But notice that we have these two qualities: G and G prime are within a factor of K of each other, and G prime and H are within a factor of two. So --

>>: Just so my intuition keeps track, what is the typical size of K? Is it smaller than two, bigger than two? A very big number?

>> Richard Peng: It's like poly log N. So think about it as about a thousand or so.

>>: So it's a big number, but it's small compared to M.

>> Richard Peng: It's very small compared to M, but it's much bigger than most constants. So there are actually more gradual versions where smaller values of K are used, but to describe the algorithm, I think the easiest description uses a reasonably big but not too big value of K.
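[A back-of-envelope sketch of the accounting just described, with all numbers made up -- the "stretches" below are random placeholders, not computed from a real tree: the off-tree sampling estimates scale like stretch times log N, and making the tree K times heavier divides every off-tree stretch, and hence the expected number of sampled off-tree edges, by K.]

    import numpy as np

    rng = np.random.default_rng(4)
    n = 100_000
    m = 4 * n
    K = 1_000                                 # "about a thousand or so"
    log_n = np.log2(n)

    stretch = rng.exponential(scale=log_n, size=m)       # placeholder stretches, total ~ m log n
    p_plain = np.minimum(1.0, stretch * log_n)           # without the K trick
    p_scaled = np.minimum(1.0, stretch * log_n / K)      # after making the tree K times heavier

    print(round(p_plain.sum() / m, 3))        # ~1: we would keep essentially every edge
    print(int(p_scaled.sum()), m)             # roughly m log^2 n / K off-tree edges: genuinely fewer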
So if you combine these two inequalities, you get that G and H are within a factor of 2K of each other, appealing to the intuition about scalars. And if you look at the one we needed: we needed ultra-sparsifiers, so we need about N minus one plus M log-to-the-P N over K edges and a factor K difference. What we just generated is an H that's within a factor of 2K of G, with about N minus one plus M log squared N over K edges. So if you look at those two conditions, up to a constant, they're roughly the same condition with P equal to two. And what you obtain, if you combine this with the framework given by Spielman and Teng, is an algorithm that runs in about M log squared N time. It turns out you can actually do even better by taking the notion that the tree is good even further. The idea there is that on the smaller graph, instead of computing another tree from scratch, you can reuse the tree from the current iteration. Using this trick, you can get to an M log N time algorithm.

There were some extensions that we have done to this work as well. In joint work with Koutis and Levin, we showed that mildly dense graphs can essentially be sparsified in linear time. Also, in a joint work with Gary Miller, we showed that it's possible to extend some of these ideas to general matrices. The idea there is that you want to find simpler versions of general matrices, and you want to do this kind of sparsification in conjunction with a solve. So you're dealing with a class of matrices formed by an outer product of [indiscernible] matrices, which happen quite a bit in machine learning.

So to summarize solvers: what happened here is that by using ideas from spectral graph theory, we're able to construct similar graphs, and by using these similar graphs, we're able to obtain extremely fast algorithms for solving linear systems. But if you think about this algorithm, there's a key backbone in it, which is that we need these good trees for graphs. So to get these fast solvers, a critical component are these algorithms from combinatorial graph theory which aim to generate these good trees. And for the rest of this talk, I'll talk about algorithms for getting these good tree embeddings.

These trees are known in the theory literature as low stretch spanning trees, and what they are is: recall that the sampling probability of an edge equals the weight times the effective resistance of the tree path. For simplicity, let's just consider the unit weight case again. Here, the stretch of an edge is simply the length of the tree path connecting its two end points, and what we want from the spanning tree is that if you sum over all edges the tree distance between their end points, the total is fairly small. This is fairly different from most of the definitions of spanning trees that we're used to. A good example of this is the square mesh, the square root N by square root N square mesh. The first tree to consider is the hair comb, which is: take the first column and every single row. Because the edges are unit weighted, this is a good tree by a lot of the measures that we're used to: it's a maximum weight spanning tree, and it's the shortest path tree if you start from the top left vertex. But I claim that the stretch of this isn't that good. Actually, for quite a bit of the edges, the stretch is good.
So for all the edges in the second column from the left, the stretch of these edges is three: you just take one step left, go one up and go around. But the problem is what happens once you cross the midpoint. Once you cross the midpoint, you've got to go all the way to the left, go one down and come back. So the stretch of half the edges in the graph is about square root of N, and you get a total of roughly N to the three halves. As it turns out, the tree that's good for the grid is this kind of recursive "C" construction: you want to partition the graph up into smaller diameter pieces and combine the trees on those pieces together.

There has been a lot of work done on this, starting with the work by Alon, Karp, Peleg and West, and the state of the art results combine two works, by Elkin, Emek, Spielman and Teng, and by Abraham, Bartal and Neiman, who showed that it's possible to get a total stretch of about M log N for any graph. This was the state of the art when we were working on the solver, and it is also hiding the log log N factor. But if you look closely at the algorithms, you realize there is a slight problem with them from an efficiency perspective, which is that their running time is roughly M log squared N. This M log squared N has a simple explanation: you're essentially running about log N shortest path computations on this graph. You're taking a part of the graph, running shortest path, and repeating, and you're doing this for log N levels, so the running time is essentially the shortest path time times log N.

But to get a faster algorithm -- because our solver ran in about M log N time -- we need to get around this bottleneck as well. The way we did this is we drew some ideas from a work by Orlin, Madduri, Subramani and Williamson, who showed that if a graph has only K distinct edge lengths, so a small number of edge lengths -- recall that shortest path on an unweighted graph takes about linear time -- then the running time of shortest path can be reduced to about M log K. And the way we used this fact is that we rounded all the edge weights to powers of two. So if you think of polynomially bounded edge weights, your graph now has about log N distinct edge weights, so each shortest path runs in about M log log N time, and you get a total work that's roughly M log N, ignoring the log log factors. We actually made some improvements to this shortest path result as well: we improved the M log K to M plus N log K, so almost a Fibonacci-heap-style bound.

Generating these trees is actually even more interesting from a parallel perspective. The reason it is interesting from a parallel perspective is due to a work joint with Blelloch, Gupta, Koutis, Miller and Tangwongsan, where we showed that the Spielman-Teng ultra-sparsifier framework can be readily parallelized to roughly M to the one third depth and nearly linear work. If you combine this with the Laplacian paradigm for getting graph algorithms shown earlier, the result is that you get theoretically fast parallel algorithms for many graph problems, and, somewhat to our surprise, getting such parallel graph algorithms is actually a very difficult thing.
And the state of parallel graph algorithms before our work was that the state of the art parallel algorithm for max flow actually had a higher total work than the running time of the Goldberg-Rao algorithm, which is the state of the art sequential max flow algorithm. What we obtained were parallel algorithms whose work is close to the state of the art sequential algorithms and which have reasonable depth, so the parallel depth of these algorithms is roughly M to the two-thirds. In terms of parallel graph algorithms, this is actually fairly interesting in that our current existing frameworks for getting parallel algorithms -- for example, the most used one is the MapReduce framework -- to quote Vassilvitskii from Google, these frameworks are actually seemingly not very good for graph computation. So this actually leads to some ideas for getting interesting ways of doing parallel graph problems on a massively parallel scale. And this quote was actually made in the context of, more specifically, one problem, which is the parallel shortest path problem in directed graphs, or BFS on directed graphs. Actually, even for computing reachability, s-t reachability, on directed graphs, I don't think there are methods known to give speedups over sequential algorithms. But if you look at the framework by Elkin, Emek, Spielman and Teng, the first step of that algorithm, as I mentioned earlier, is a shortest path computation. So generating these trees in parallel turns out to be not exactly easy.

The way we were able to get these embeddings was that we did several things. We first went and looked at an earlier idea by Alon, Karp, Peleg and West -- so actually the first low stretch spanning tree construction -- and then we combined that with the idea of repeated local clustering of the graph. So you repeatedly cluster out small pieces of your graph, combine them, and cluster again. Also, you sample at different densities to find these clusterings. These were ideas that were used in approximate parallel shortest path algorithms by Cohen, and the pictorial version of this algorithm essentially looks like the following: we have a graph, find the centers, chop off some pieces of the graph, get rid of them, find more centers, chop off the rest of the graph. With this, we were able to get a good parallel algorithm for solvers, and therefore for a lot of these parallel graph algorithms as well.

So, the big picture, to sum things up: to solve a lot of these graph-based regression problems, and even combinatorial graph-based problems, we need fast linear system solvers -- or at least fast linear system solvers are a good way of getting these algorithms. But to get these fast linear system solvers, we also need a variety of combinatorial tools from the graph embedding literature. And I think some interesting questions for future work are: can we get better regression algorithms, and can we get a faster and more parallel solver? One way to view a parallel solver is as a sparse representation of an inverse of a matrix; this is actually not the case in the current parallelization. So the question is, can we get a sparse and approximate pseudoinverse of the matrix? And also, can we extend solvers to other types of systems? That's all. Any questions?
>>: We can take this offline because we're going to meet separately, but I'm very interested in the kinds of things that arise in image processing, which are quad meshes -- you know, you think of them as planar graphs, things like that. You mentioned you have some recent results, so I can ask you about them, unless you already have slides. The other thing is that the theoretical result for a regular mesh is the order N multigrid solver, so it seems like there's still a gap -- what's the current gap between these tree-based approaches and multigrid?

>> Richard Peng: So for planar graphs there was a work by my two co-authors, Koutis and Miller, who showed a linear time algorithm for these regular planar meshes. And the multigrid algorithms are shown to run extremely fast because they run what's called a V-cycle, so the recursive structure is much more compact than our algorithms. So Yiannis and Gary actually have some code which combines some of the ideas from multigrid solvers with some of these ideas, and my impression is that to get fast algorithms for these well structured graphs, ideas from both of these probably need to come in at some point. And also, I think we have a fairly efficient solver package.

>>: Is that the combinatorial multigrid work, or is it newer than that?

>> Richard Peng: It's an updated version of the CMG solver.

>>: Okay, because I did not get order N on those graphs.

>> Yuval Peres: Any other questions? Thank you, Richard.

>> Richard Peng: Thank you.