>>: It's my pleasure today to welcome Saeed Amizadeh. Saeed is a PhD student at the University of Pittsburgh, and he's working on machine learning and data mining, especially on large-scale data. Today he's going to teach us how to do fast approximation of the transition matrix for large data sets. This is also related to his internship work last year with Bo Thiesson. >> Saeed Amizadeh: Okay. Thank you, Scott. Hello. So I'm not going to introduce myself; Scott already did. So this work is actually related to my internship work last year here with Bo Thiesson. My PhD dissertation is about working with large-scale data sets, especially with graph-based methods. And today I'm going to talk about this work that I presented last week at UAI, which is basically about approximating the transition matrix for a random walk on a graph when your data set is large, meaning the number of data points is huge. So I'm going to start with just a brief introduction of graph-based methods and why they are useful in machine learning, and then talk about some challenges that we have when we work with these methods. Then I'll present the variational dual-tree framework in general, which was originally introduced for density estimation, and then I'll talk about how we can apply this method for estimating the random walk. And then some experiments and some concluding remarks. So why do we care about graph-based methods? Usually in graph-based methods we have a bunch of data points and we can build a similarity graph, meaning that we have our data points as nodes, and the edges basically show the similarity between two nodes. This similarity graph captures the geometry of the data, so it's closely related to the distribution of the data. And actually there are some works that focus on the connection between density estimation and the similarity graph. So the formal definition is that we define a graph where the nodes are the data points, and we can have an edge between each pair of data points. If we don't have an edge, the weight for that edge is going to be zero. So the edges are weighted, and a higher weight means a higher similarity between those data points. So just to give you a brief motivation for why these methods are important and why they help us: the reason is that when we have some data points in some space, we usually have a metric in that space. And sometimes our data has some weird shape, some manifold structure, meaning that the distance or the metric that we have in that space is not meaningful globally. It is meaningful locally. However, for example, for these two points A and B, their Euclidean distance here is not meaningful, because what we have in mind as the distance between A and B is this red curve here, not this straight line. So the global distance is not necessarily meaningful, but in the locality we have a meaningful distance. And the goal is to somehow use the graph structure to aggregate these local distances and infer a meaningful global similarity metric, or dissimilarity distance, depending on how you define it. And that's probably one of the most important philosophies behind these frameworks.
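A minimal sketch of the weighted similarity graph just described, assuming Euclidean distances and the Gaussian kernel that is introduced next (the bandwidth value and function name are illustrative placeholders). This is the exact O(N^2) construction that the rest of the talk works to avoid:

```python
import numpy as np

def similarity_graph(X, sigma=1.0):
    """Dense similarity graph: W[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)).

    X is an (N, d) array of data points; a higher weight means a higher
    similarity. This brute-force version costs O(N^2) time and memory.
    """
    sq_norms = (X ** 2).sum(axis=1)
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    D2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(D2, 0.0)
    W = np.exp(-D2 / (2.0 * sigma ** 2))
    return W
```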
So usually we need to define a similarity function, which basically transforms the distance -- well, here in this talk we are talking about Euclidean distance -- into a similarity. And here we use the very popular Gaussian kernel, which has a bandwidth parameter. We'll talk about the bandwidth later in this talk and how we can actually set it. Because the bandwidth is a design parameter, meaning that if you decrease your bandwidth towards zero you're going to end up with a very sparse graph, and if you increase it, you will get a very dense graph. So it's very important where to set sigma. A very famous matrix that people define is the Laplacian matrix, which turns out to be very useful in graph-based methods. We have different Laplacians. This is probably the most famous one, the unnormalized Laplacian matrix: you define a diagonal matrix where each element is the degree of a node, and the Laplacian is that diagonal degree matrix minus the weight matrix. Or you have the normalized Laplacian, which is symmetric, or the random walk Laplacian, which is basically the identity matrix minus the random walk transition matrix -- if you write it down, this term here is the transition matrix of the random walk. So why do we care about the Laplacian matrix in general? Well, the reason is that the eigenvectors of this matrix actually contain the cluster structure of the data. Basically, the eigenvectors associated with smaller eigenvalues encode the coarser structure in the data. So because they encode the geometry of the data, the eigenvectors of this matrix are very important. We can use these eigenvectors and eigenvalues to embed our data points in a new space where the Euclidean distance is globally meaningful. Remember we talked about how the Euclidean distance, or the distance in the input space, is not globally meaningful? Now we want to embed our data set in a new space such that the distance, whether local or global, is meaningful. So one way to do it is this very general form of embedding: for each point XI we can define ZI, and these are the coordinates of ZI, the transformation of XI. This UKI is basically the Ith element of the Kth eigenvector, and phi is a decreasing function of the eigenvalues. So we can define different phis -- it's up to you -- and different phis result in different global distance metrics. For example, if you define phi as the exponential function, then you get the diffusion distance, which has exactly the meaning of physical diffusion. If you define phi as one over lambda, you get the resistance distance in electrical networks. So with different choices of phi, you can have different global similarity metrics. Yes, please? >>: Don't you want to -- so usually when people do [inaudible] paper, where they do spectral clustering, there's a crucial step after you take the eigenvectors where you normalize each point so that each of these -- each of the U, UJIs has unit [inaudible]. And then they do [inaudible] clustering in that space. >> Saeed Amizadeh: Yes. >>: So don't you -- do you want to do some kind of normalization here as well, per I? >> Saeed Amizadeh: Yes.
So the thing is, these UIs, these eigenvectors, are assumed to be normalized already, so -- >>: No, no, no, you normalize -- that's the eigen -- each eigenvector -- >> Saeed Amizadeh: Yes. >>: -- across all the Is -- >> Saeed Amizadeh: Yes. >>: That norm is constant. I'm talking about -- >> Saeed Amizadeh: The rows. >>: -- for each I [inaudible] across the eigenvectors, however many eigenvectors you take, you want to normalize that so that the resulting points -- they do this so that the resulting points lie along, you know, a K-dimensional sphere. >> Saeed Amizadeh: Yes. >>: For better K-means clustering. >> Saeed Amizadeh: Yes. So you can summarize that into your phi function. In fact -- >>: [inaudible]. >> Saeed Amizadeh: Go ahead, sorry. >>: So then the phi would depend on I, right, because the normalization would be different for each I? >> Saeed Amizadeh: Yes. >>: Okay. >> Saeed Amizadeh: Yes. I mean, phi doesn't need to be the same thing for -- as I said, this is just one form of transformation. This one is based on the paper by Ghahramani. But you can have different forms -- actually, in the very early form of spectral clustering, the phi was a step function: you have phi equal to one for the first K eigenvectors and zero for the rest. And the phi can, as in the paper you mention, be dependent on I. So the thing is, you have some sort of transformation here, but in order to make it an embedding, the most important thing is that it should be a decreasing function -- and by decreasing I mean non-increasing, basically. Because as you go from left to right you are moving from the coarser structure to the finer structure, and you want to preserve the coarser clusters. Okay. So what are the applications? The very first application is dimensionality reduction: if your data is already lying on some sort of manifold, you can take only a few eigenvectors to represent your data in the new space, so effectively you reduce the dimension. In spectral clustering, you can find non-spherical clusters because the eigenvectors represent the structure. In semi-supervised learning, you can propagate labels from your labeled points to the unlabeled data, again using, for example, the random walk kernel. Function approximation: for example, you have a reinforcement learning problem where each state is a node, and you can try to approximate the value function of your reinforcement learner on the graph, which is basically function approximation. So there are many applications of graph-based methods. So what are the challenges? In this talk we are going to focus on large-scale data. Large scale can mean large dimension. What's the challenge with dimension? Well, in this talk I'm not going to focus on dimension, but this is part of the large-scale problem, and actually part of my thesis work. The problem is the curse of dimensionality. You can show that if you let the bandwidth parameter change with your sample size, then the error between the eigenvectors of your sample and the eigenfunctions of the true population depends exponentially on the dimension. And that's the curse of dimensionality.
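To make the spectral embedding discussed above concrete, here is a minimal sketch, assuming the random-walk Laplacian built from the Gaussian similarity graph, a handful of eigenvectors, and phi(lambda) = exp(-lambda) as one example of a decreasing phi (these specific choices, and the function names, are illustrative assumptions, not the talk's prescription):

```python
import numpy as np

def spectral_embedding(W, k=3, phi=lambda lam: np.exp(-lam)):
    """Embed points with eigenvectors of the random-walk Laplacian L_rw = I - D^{-1} W.

    Coordinates are z_i[k] = phi(lambda_k) * u_k[i]; smaller eigenvalues carry
    the coarser cluster structure, and phi down-weights the finer ones.
    """
    d = W.sum(axis=1)
    P = W / d[:, None]                   # random-walk transition matrix D^{-1} W
    L_rw = np.eye(W.shape[0]) - P        # random-walk Laplacian
    lam, U = np.linalg.eig(L_rw)         # L_rw is not symmetric in general
    order = np.argsort(lam.real)         # sort by eigenvalue, coarsest first
    lam, U = lam.real[order], U.real[:, order]
    # Skip the trivial constant eigenvector (eigenvalue ~ 0) and keep the next k.
    Z = U[:, 1:k + 1] * phi(lam[1:k + 1])[None, :]
    return Z
```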
Basically, your error from the true eigenfunctions of the population depends exponentially on the dimension. One way to solve this problem is to use independence structure. Either you have a data set where the features are independent, or the features are divided into groups, or you can impose the independence, with the cost that you may incur some approximation error but at least you decrease the estimation error. So if we have this independence structure, or we impose it, then we can decompose our problem into subproblems, each of which has a reduced dimensionality -- basically a divide-and-conquer approach. The challenge that we are going to talk about today is large N, where the number of data points is huge. The first challenge, large dimension, is more of a statistical challenge; this one is more of a computational challenge. In general, if we want to build this kernel matrix -- and by kernel matrix I mean the Laplacian or the random walk kernel matrix -- it needs N squared time and memory, and therefore it's not applicable to large-scale problems. You can see this as an instance of the general N-body problem, where you have N particles in space and you want to compute the mutual effect between these particles. It's a famous problem in physics, and there are many solutions to it. In machine learning we have different classes of solutions. One solution is sparsifying the nodes. This can be subsampling -- just take a subsample of the nodes -- or building a backbone graph, making supernodes. Or it can be sparsification of the edges: for example, building KNN graphs, which gives you a sparse matrix, or epsilon graphs, or b-matching. These are all from the same family of sparsification of the edges. The third approach is to try to go around actually building this kernel matrix. One example: if I want to compute the eigenvectors or eigenfunctions of the kernel, can I somehow directly do it without building the matrix? There are works and papers on how to do this. The other idea is parameter sharing: basically, instead of sparsification, we share parameters. And our method belongs to this group of parameter-sharing approaches. So now I'm going to talk about the variational dual-tree framework as the baseline for our framework. And please stop me whenever things are not clear or you have any question. Okay. So forget about random walks for now; we are not concerned with random walks or graph-based methods. Let's say we want to do kernel density estimation for N points. For each point I want to compute this kernel density estimator, and for all points it requires me to do N squared computations. Here, this is basically your Gaussian kernel -- this is the likelihood of XI being generated by this kernel, and MJ is the center of the kernel. And P of MJ, for simplicity, we can just set to one over N, the same weight for all kernels. If you look at this problem, each data point plays two roles: one as a data point XI where we want to compute the density, the other as the center of a Gaussian kernel, MI. Okay. So we can go ahead and compute that, but we can actually reformulate our problem as a variational problem. So I can basically start from the logarithm of the likelihood.
I can introduce the variational parameter and use the Jensen inequality to get this lower bound on my likelihood. This is basically the KL divergence between my variational distribution and the true distribution, and this is the likelihood of one data point. I can sum over all data points here, and I can try to maximize this lower bound with the constraint that the sum of the QIs should be one. Then, if I solve this variational problem, this is going to be my solution for QIJ. So if we look at these QIJs -- by the true distribution here I mean the reverse of this one, basically P of MJ given XI -- QIJ approximates P of MJ given XI. And P of MJ given XI is the membership probability: the probability that XI is generated by the kernel MJ, the membership probability of XI to kernel MJ. If we look at this solution here, this is an exact solution, because the KL divergence is going to be zero. So the question is why we bother to do this at all, and the answer is: we'll see later. Basically, we want to compute, or approximate, this posterior P of MJ given XI using QIJ, the variational parameter. And if we approximate this, then we can compute the likelihood. This has a benefit over working with the likelihood directly, because this one is a likelihood and this one is a probability distribution: this one sums up to one, that one doesn't. So this gives us a very easy way to use the variational approximation, because we can easily set up our constraint; with the likelihood, we cannot easily set up our constraints. Okay. What's the basic idea? If I want to compute the membership probability -- suppose these are my kernel centers, these are Gaussians, and this is my data point -- I need to do it for every pair here. The idea is parameter sharing: just group these kernels together and approximate this membership probability with one parameter. So in this example we would reduce the number of parameters from four to one. You can see these are two different approaches toward reducing the number of parameters. One is parameter sharing; the other one is sparsification. In sparsification you just zero out some of the edges: you say, okay, I don't look at this, just assume that it's zero. So these are two different mentalities toward reducing the number of parameters. Our method belongs to parameter sharing, while, for example, K-nearest neighbor is in the sparsification group. Okay. So single-tree parameter sharing means: I have this data point and I can build a hierarchy over my kernels, a cluster hierarchy over the kernels. I can approximate the effect of all of my kernels with only one number, or I can make it a little bit finer by going one level down and approximating with two numbers, and so forth. We assume all the hierarchies are binary trees here, but in general it can be any hierarchy; it doesn't need to be binary, necessarily. So if we do this -- and while this is a huge saving -- suppose I have N data points. Then I need to repeat this process for all of the data points. This is just for one data point.
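For reference, a minimal sketch of the exact, unshared solution just described: the Q matrix whose rows are the membership probabilities P(m_j | x_i). This is also the O(N^2) computation that the tree-based sharing is meant to avoid (the uniform prior 1/N and the Gaussian kernel follow the talk; the function name is mine):

```python
import numpy as np

def exact_membership_matrix(X, sigma):
    """Exact variational solution: Q[i, j] = P(m_j | x_i), a row-stochastic N x N matrix.

    With a uniform prior over kernels, Q[i, j] is proportional to
    exp(-||x_i - m_j||^2 / (2 sigma^2)), normalized over j. This is exactly
    the row-normalized Gaussian similarity, i.e. the random-walk transition
    matrix that reappears later in the talk.
    """
    sq = (X ** 2).sum(axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    logK = -D2 / (2.0 * sigma ** 2)
    # Normalize each row in a numerically stable way (log-sum-exp trick).
    logK -= logK.max(axis=1, keepdims=True)
    Q = np.exp(logK)
    Q /= Q.sum(axis=1, keepdims=True)
    return Q
```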
Then the next idea kicks in, which says: okay, why don't we do the same thing on the data points as well, build the same hierarchy, and say we want to approximate the effect of all kernels over all data points with one number. Of course this is a very coarse, rough approximation, but you can go further down and refine it. The general idea is to use two trees -- that's where the dual-tree methods kick in -- one for the data points, and the other one for the kernels. And of course, here in density estimation the kernels and the data points are the same set, so this is going to be the same tree; we don't actually need to build two trees, we have only one tree pointing to itself. Okay. Then if this is the matrix where I compute P of MJ given XI, the membership matrix, then I can say: okay, this is my data tree, this is my kernel tree. I have this subtree and this subtree -- these are two subtrees -- and I want to approximate the effect of this subtree on this subtree with only one number. So I can pair them together and represent the effect with only one parameter, one variational parameter QAB. And then this is going to be a blocked matrix of the membership probabilities. Okay. Any questions so far? >>: So [inaudible] you get to this later, but I'm just wondering about the computational savings of this approach versus the version of just computing all the N squared similarities. >> Saeed Amizadeh: Yeah. Good question. We'll get to it. Okay. So now suppose I give you these trees -- for now you don't have to build the trees, I give you the trees -- and using these trees you come up with a block partitioning of your matrix. Then you can reformulate the variational problem that we had before using the blocks. This is the block-partitioned version of the variational optimization function that we were talking about. Before, every element was its own block; there was no approximation. Now we have blocks, meaning that the parameters inside each block are shared: each block corresponds to one parameter. So with some simple math we can reorganize our optimization function, and this is going to be our lower bound on the log-likelihood for the optimization. We're going to get back to this when we talk about the random walk interpretation. So we can solve this problem, maximize this lower bound, with the constraint that the sum of each row should be one. Why is that the constraint? Because we are approximating P of MJ given XI, and the sum over all J should be one -- this is the membership probability. And with some math we can translate the constraint in terms of blocks. >>: Given the tree, I'm just wondering how you determine which subtrees form a block? Do you have some -- >> Saeed Amizadeh: Yeah. We'll get to that point. I didn't talk about how to build the tree and how to do the block partitioning yet, but we'll get there; we just assume that we have them for now. What we want to do is find the Q values; we want to fill the matrix. So this problem can be solved -- as in the previous paper by Bo, this problem has a closed-form solution, and the closed-form solution can be found in order of the number of blocks.
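As an illustration of what block-wise parameter sharing looks like, here is a minimal sketch -- not the paper's two-pass closed form. It assumes a block partition is given as, for each data cluster A, a list of disjoint kernel clusters B; it uses the average squared distance between A and B as the per-block statistic and normalizes so that each implied row of Q sums to one (all function and variable names are mine):

```python
import numpy as np

def block_shared_Q(X, blocks, sigma):
    """One shared parameter per block (A, B): q_AB for every i in A, j in B.

    `blocks` maps each data cluster A (tuple of point indices) to a list of
    kernel clusters B (tuples of indices) that together cover all kernels.
    Rows are normalized so that sum_B |B| * q_AB = 1.
    """
    N = X.shape[0]
    Q = np.zeros((N, N))
    for A, Bs in blocks.items():
        A = list(A)
        muA = X[A].mean(axis=0)
        sqA = (X[A] ** 2).sum(axis=1).mean()
        logq, sizes = [], []
        for B in Bs:
            B = list(B)
            muB = X[B].mean(axis=0)
            sqB = (X[B] ** 2).sum(axis=1).mean()
            # Average squared distance between A and B from per-cluster moments:
            # E||x_i - x_j||^2 = E||x_i||^2 + E||x_j||^2 - 2 mu_A . mu_B
            avg_d2 = sqA + sqB - 2.0 * muA @ muB
            logq.append(-avg_d2 / (2.0 * sigma ** 2))
            sizes.append(len(B))
        logq = np.array(logq)
        w = np.exp(logq - logq.max())
        w /= (np.array(sizes) * w).sum()          # enforce the row-sum constraint
        for q, B in zip(w, Bs):
            Q[np.ix_(A, list(B))] = q
    return Q
```

With `blocks` built by pairing each subtree with its sibling, the number of parameters drops from N squared to roughly 2(N - 1), matching the coarsest level the talk discusses later.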
And the number of blocks is the number of parameters here. And this is a closed-form solution, meaning that it's not an iterative algorithm: it just makes two passes over the tree and you have your solution. Okay. Now, building the hierarchy -- the first step. We assumed we had the hierarchy, but this is not the case in reality. There are many methods that we can use to build the hierarchy. The first one is bottom-up agglomerative clustering, which takes order of N squared. We can use KD-trees, cover trees, ball trees, or anchor trees. For obvious reasons we want to avoid the first one, because it's order of N squared -- that's kind of antithetical to our motivation. Except for that one, you can use any of them. But, as we'll discuss later in this talk, this is a very crucial step in the whole framework. And all of these methods, although they have these very neat orders -- when you look at them, it's linear -- what is this constant C? It really depends on the structure of your data, and it can be as high as N squared. >>: [inaudible] dimension of the data for cover trees? >> Saeed Amizadeh: The cover trees -- >>: The C is you just somehow measure -- >> Saeed Amizadeh: Yes. Yes. >>: [inaudible] exponentially with -- >> Saeed Amizadeh: Yes. Yes. >>: [inaudible]. >> Saeed Amizadeh: Yes. It's two to the power of your intrinsic dimensionality. But that's given that your [inaudible] is low; if it is not, then you're in trouble. With KD-trees it depends on the dimension. Anchor trees have their own problems. But here in this paper I use anchor trees -- this is the framework that we actually used last summer in my internship. You can replace it with any tree that you want. So just a very quick demonstration of the anchor tree: you have N data points, and you build square root of N anchors. I won't get into the details, but you can think of each anchor as a cluster. Then you merge them in an agglomerative way; because there are square root of N of them, the merging is going to take you linear time in N -- square root of N times square root of N. Then you recursively repeat the process for each anchor. The construction time in the original paper is said to be N log N, but we did some analysis, and in the worst case it can be larger -- though still less than N squared. And by worst case I mean there is no structure in the data: all the data points are equally distant from each other in the space, so the intrinsic dimensionality is the same as the original dimensionality; there is no structure. Okay. Any question on this part before we move on to the random walk section? Okay. Now, random walk. What's the connection between the random walk on the graph and this variational method that we talked about? Take the original optimization problem without blocking: we wanted to maximize this lower bound on the likelihood given that the sum of the Q values is equal to one. This is the original variational problem, and if you solve it, this is your result, the QIJs -- and the error is zero; this is the exact method. But if you look at this term, what is this? This is the Gaussian kernel, right?
The Gaussian kernel -- this is the similarity between XI and MJ -- and this is the normalization. This is what you do if you want to compute the random walk: you compute the similarity and you normalize it. So basically the Q matrix can be seen as the block-partitioned approximation of the transition matrix. The P matrix is the transition matrix, and the Q matrix is just its block-partitioned approximation. So what is the interpretation then? In this new context, this new view of the Q matrix, we have a new interpretation of the blocked version of our optimization function. This is our blocked version of the lower-bound log-likelihood. The first term is just a normalization, and it's constant in terms of Q. However, the second term is a term that people use when they want to learn similarity on graphs, meaning that if I want to maximize this -- there's a negative sign here -- then if the distance between two clusters, A and B, is huge, I want to assign smaller Q values. The higher the distance, the lower the similarity. This is a very common term when people want to learn similarity. This term tends to connect each node to its closest neighbor with probability one, Q equal to one, and disconnect it from the rest of the nodes with Q equal to zero: find the closest neighbor and connect to it with all the probability mass. What is this other term? If you look at it, it is actually the sum of the entropies of the outgoing probabilities from each node -- just the Shannon entropy of the distribution. This one tends to connect each node to all of the nodes with equal probability, because to maximize the entropy you need a uniform distribution. So these two terms work against each other, and you can look at the entropy term as a regularization term: if we don't use it, then we're going to end up with a disconnected graph where every node is connected only to its closest neighbor. And this coefficient here is just the trade-off between these two terms. If you look at this coefficient, it is actually the bandwidth. As a sanity check, the lower the bandwidth, the sparser your graph, because the distance term becomes stronger; and the higher the sigma, the stronger the entropy term, so your graph will be denser. So this random walk view gives us this new interpretation of what the objective function does exactly. Okay. Any question? Sorry. >>: So [inaudible] how does your lower bound compare in approximation quality to KNN with an added small uniform connection to all nodes? Because it sounds like this is what -- >> Saeed Amizadeh: Yes. >>: -- you're doing. >> Saeed Amizadeh: Well, we didn't perform that experiment. We did compare it with KNN because we wanted to compare these two different ideas of sparsification and parameter sharing. But what you are saying is basically: instead of using a sparse matrix, use epsilon instead of zero in KNN. >>: [inaudible]. >> Saeed Amizadeh: Yeah. We didn't do that experiment, that framework, because we wanted to compare sharing versus sparsification, and that is already a sharing idea.
But, again, it depends how you implement your KNN, because if you want to relate your KNN to a likelihood, you need to weight the edges. KNN by itself is not enough; you need to weight the edges with the Gaussian similarity. Then you can show that as you increase K, you converge to the exact method, which is -- >>: I guess here you're also clustering the samples, the core [inaudible] KNN. >> Saeed Amizadeh: Yes. >>: So you're [inaudible]. >> Saeed Amizadeh: Yes. >>: And then [inaudible]. >> Saeed Amizadeh: Yes. I mean, again, it depends how you implement it. Even if you want to implement the brute-force KNN, you'll have a problem at large scale, because the brute-force construction of the KNN graph is N squared -- actually more than N squared, N squared log N, because you need the sorting -- and you want to avoid that. So you want to use a tree even for KNN, and in our experiments we did that: we used the tree for KNN. >>: I'm not arguing the KNN, but I wonder -- so in order to solve for the optimal Q, both methods have complexity of the size of -- the largest block? >> Saeed Amizadeh: The number of blocks. >>: [inaudible] blocks. >> Saeed Amizadeh: The number of blocks, basically, not the size of the blocks -- just the number of blocks, which is the number of your parameters. >>: Right. >> Saeed Amizadeh: Yeah. And these are all independent problems. So as soon as you give me a partitioning, no matter how you got it -- using KNN or using our method -- the rest of it, solving for the Qs, is the same. >>: Yeah. But now I'm wondering whether you can replace the step where you solve for the optimal Q with this approximation where you just pick maybe the approximate top K and then set a large weight for that Q and small epsilon weights for the rest of them. >> Saeed Amizadeh: Yes. We didn't do it, but that's definitely a valid method, because instead of computing P of MJ given XI, as you mention, you can just compute the likelihood directly for the first K and take the average or something. The thing is, it becomes very ad hoc how you collapse that into one parameter, how you share it. Here we have a very formal method to do it. But, yeah, that can be [inaudible]. Okay. So this parameter here, the bandwidth, adjusts the trade-off between these two terms. Now, the question is how to adjust this sigma. There are many heuristics in the literature for how to do it, and we are not claiming that our method is the best way, but at least it has a good interpretation of what it actually means. As we said before, the bandwidth adjusts the decay rate of your similarity as a function of distance. So if we look at our objective again -- as I said, we use this formula, this objective, over and over in these slides -- this is the block version of our lower-bound objective. This first term is constant in terms of the bandwidth, but these two terms together are a quasi-concave function of the bandwidth. >>: [inaudible]. >> Saeed Amizadeh: Okay. So -- >>: [inaudible].
>> Saeed Amizadeh: My bad. Sorry. >>: That's all right. >> Saeed Amizadeh: I should have put it in the slides. For a concave function you have one maximum. In a quasi-concave function you still have one maximum, but the second derivative can be positive. >>: You just [inaudible]. >> Saeed Amizadeh: Yes. Yes. Exactly. So this means that, in terms of the bandwidth, our objective function has one maximum, and we can find that maximum in closed form. If we solve this for sigma, this is our solution, and it also has a nice interpretation. If we look at this term here -- if you take the dimension out of it -- this is basically the average expected distance that the random walker traverses after one time step. The numerator is the sum of the expected distances that the random walker traverses, and when you normalize it by N, it's the average expected distance you traverse. And before you ask me this question, I'm going to answer it right now: how come -- because intuitively, as we decrease sigma, as we decrease the bandwidth, our likelihood should go up, and there's no limit on it, right? As we go towards zero, our likelihood goes to infinity, becomes larger and larger. The catch here is that our objective function is not the log-likelihood itself; it's the lower bound on the log-likelihood. And this lower bound gives us this nice interpretation, a very straightforward way to find the sigma. Question? Quickly, yes. You can use the same kind of idea -- not necessarily the same, but in the same vein -- and say, okay, in the exact method, no blocking, nothing, just the exact method, I can always use the Jensen inequality, and this is going to be a lower bound on my likelihood. Then I can maximize this lower bound in terms of the bandwidth, and I get this optimal solution for sigma, because this is again quasi-concave. But this is not a special case of our block-partitioned version. The reason is that this is not a tight bound anyway, whereas our bound, as you increase the number of parameters, as you refine it, gets closer and closer to the actual likelihood. This one is just a heuristic, just that. But we use this heuristic in our experiments to set the sigma for the exact method. Okay. Fast multiplication. In many applications, if Q is my transition matrix, I need to compute the multiplication of Q by an arbitrary vector Y. In general this is an N squared operation, a matrix times a vector, and we want to do it efficiently. So we have this simple algorithm on our tree: you assign all the elements of Y to their corresponding leaves in the tree, and as a first step there's a collect-up step where you sum up over your children and keep this statistic at each node, which is basically the sum over the children. This step takes linear time, O of N. Then, using this statistic, we can have a distribute-down step where you compute, along each path from the root to the leaves, this sum.
And you can basically use dynamic programming to save the summation at each step. We can easily show that this is order of the number of blocks in your tree. So the whole process takes order of the number of blocks, because the number of blocks, as we'll see later, is always greater than or equal to N. This gives us a very fast algorithm for multiplication by an arbitrary vector. And why is this important? Because of applications like label propagation: if this is your initial vector of labels for some labeled data, in the limit the propagation gives you this vector of labels for the whole graph, and that involves the inversion of this matrix. But if you don't want to do the inversion, you can approximate it with a finite number of iterations using this process here, and in this process you need to multiply Q by a vector over and over. Another application is the eigen-decomposition of Q: you want to find the eigenvectors of Q, which we know by now are useful, and if we use the power method or Arnoldi iteration, we need to compute these multiplications by a vector iteratively. So this gives us a very fast multiplication algorithm for these methods. Now to your question: computational complexity. This is the computational complexity for our method. The construction time in the worst case, if you don't have any structure, is whatever it takes to make the anchor tree; but on average, because we always have some sort of structure, it is N log N, plus the number of blocks for the estimation of the Q values. The multiplication time is order of the number of blocks. So as we see here, the number of blocks plays a very crucial role in the complexity of the whole framework -- everything depends on the number of blocks, or the number of parameters. So the question is what this number of blocks is and how we can actually change it or fix it. We can show that it ranges between linear order and N squared; at N squared you get the exact method, no approximation. At the coarsest level, the number of blocks is two times N minus one. That's the minimal number of blocks you can have, not less than this value, and we'll see why. So, the coarsest level of approximation: to get a valid block partitioning, the subtrees that we pair together to form a block need to be non-overlapping. If two subtrees, two clusters, overlap -- share the same points -- the approximation is not valid. So we need the subtrees to be non-overlapping. But for any given subtree in the tree, the largest non-overlapping subtree is its sibling. So for this subtree, this other subtree is a non-overlapping subtree, right? But the largest non-overlapping one is just its sibling. Therefore, at the coarsest level of approximation we just pair each subtree with its sibling, as here. The number of blocks then is going to be the number of internal nodes, which is N minus one, times two, because we have two directions. So it's going to be two times N minus one. This is the minimal number of blocks. So a rational approach is to start with the coarsest level of approximation and refine our model as we need more accuracy, or as we can afford more computational power. And what does refinement mean?
One step of refinement means splitting your block into two, either horizontally or vertically. What does this mean? If these are two clusters and this arc represents a block -- I paired these two, and this is a block -- then I can refine this arc by pointing it to the children of B, which is a vertical refinement, or do a horizontal refinement from the children of A to B. So these are two different refinements I can make. Every time I split a block, I introduce a new parameter. And of course, by introducing a new parameter I relax the constraint, meaning that the lower bound on the log-likelihood is going to be higher, because I basically remove one of the constraints. So you can show that refinement always increases the lower bound, and mathematically it's easy to see, because you can always assign the previous parameters in your optimization problem. So now the question is which block to split. Of course all of them will give us some gain in terms of likelihood, so the obvious answer is: give me the one that gives the most gain in terms of likelihood, and that is the block I want to split. But to find this block, I would need to split all the blocks one at a time, solve the optimization, compute the log-likelihood gain, and take the maximum. That is very expensive; we don't want to do this for all the blocks. So our solution is to locally solve the optimization for each possible split: we assume that all other parameters are fixed, so every time we just solve a local optimization problem, and this local problem takes order of one computation, not order of the number of blocks. Then we pick the split with the maximum local gain. Of course this result is suboptimal, but we can implement it very efficiently using a priority queue. Okay. So is this clear? Any question? We can move to the experiments. All right. In the experiments we tried to solve a semi-supervised learning problem: given a small set of labeled data points, we want to find the labels for the rest of the unlabeled data points using label propagation on the graph. This is the label propagation, and we don't want to compute the inverse of this matrix, so we do it iteratively. The performance metrics that we measure are construction time, propagation time -- basically the propagation is the multiplication time -- and classification accuracy, given that we know the true class labels. And the baselines: the very first baseline is the exact method, where we compute the similarity between each pair and then normalize the similarities to get the transition matrix, the transition probabilities. The other is fast K-nearest neighbor. As I mentioned before, we don't want to use the straight K-nearest neighbor, because the construction time for the exact K-nearest neighbor is large, so we again use trees to find the K nearest neighbors -- we use the same anchor tree for K-nearest neighbor. So for the first comparison we use the same cluster hierarchy for both approaches: for both of them it's an anchor tree. And I don't want to get into the details.
But the way the tree helps K-nearest neighbor is basically that you can cut some computations in the tree if you already know the K nearest neighbors found so far -- that's what makes the K-nearest neighbor search fast. And these are the theoretical orders from the paper for the three methods -- the exact method, fast KNN, and our framework, the variational dual-tree framework -- in terms of construction time, the memory they use, and the multiplication time. Okay. In the first experiment the goal is to measure the construction time as the size of the problem increases, where the size of the problem is the number of data points, and also to measure the accuracy, to see how much we lose in terms of accuracy by doing this approximation. The data set that we use is SecStr, a standard benchmark in semi-supervised learning. The dimension is 315. The number of data points we increase -- that's why I didn't mention it here, because it changes -- and there are two classes. Ten percent of the data points are labeled for each problem size, and we try to find the labels for the remaining 90 percent. The bandwidth for each method is computed separately using the techniques we already talked about. The approximation level for our framework is the coarsest level, where the number of blocks is two times N minus one. And for the K-nearest neighbor we define the approximation level as K, so K equal to two is the roughest approximation level for the fast KNN approach, and that is what we use. And these are the results. The time and the size are in log scale here, these are the three methods, and this is the classification accuracy. So as we see -- the red is the exact method -- we didn't lose that much in terms of accuracy. However, we gain orders of magnitude in terms of construction time compared to the two other methods. And in terms of multiplication, we are much better than the exact method, of course, and comparable to the KNN, because as soon as you build a KNN graph you have a sparse matrix and the multiplication can be very fast. But the question is still how to build the KNN graph, and that can be as slow as we showed here. Okay. Second experiment: we want to study the effect of refinement in the model. The goal is to compute the accuracy as the model is being refined and, of course, to measure the refinement time -- how much time we need to refine each model. Again, the data set is a benchmark data set, digit1, which has 1500 data points and 241 dimensions. We have two results here: once when we have 10 labeled data points and once when we have 100 labeled data points; these are the standard splits in the data set. And again, the bandwidth is computed using the method that we explained before. As I just mentioned, refinement for the K-nearest neighbor is defined as increasing K, because if you increase K and assume that each edge in the KNN graph is weighted with the Gaussian similarity, then you can show that as K approaches N, your method converges to the exact method. This makes KNN a consistent approximator of the exact method.
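To spell out that baseline, here is a minimal sketch of a Gaussian-weighted KNN transition matrix (written brute-force for clarity; the talk's experiments find the neighbors with an anchor tree, and the function name and the exclusion of self-edges are my own choices):

```python
import numpy as np

def knn_transition_matrix(X, k, sigma):
    """Row-stochastic transition matrix restricted to each point's k nearest neighbors.

    Edges are weighted with the Gaussian similarity and each row is normalized;
    as k grows toward N this approaches the exact (dense) transition matrix.
    The neighbor search here is brute force, O(N^2 log N), for clarity only.
    """
    N = X.shape[0]
    sq = (X ** 2).sum(axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(D2, np.inf)              # exclude self-edges
    P = np.zeros((N, N))
    for i in range(N):
        nbrs = np.argsort(D2[i])[:k]          # k nearest neighbors of point i
        w = np.exp(-D2[i, nbrs] / (2.0 * sigma ** 2))
        P[i, nbrs] = w / w.sum()
    return P
```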
And because we want to compare the effect of refinement, we want to make sure that for both methods the number of parameters is the same. So we make sure that the number of blocks in our framework is equal to K times N at each step, because the KNN method has K times N parameters; for a fair comparison we want the same number of parameters for both methods. So this is the result. This is the construction time again, and this is in log scale -- I think I cropped the figure, so the numbers are not here, but this is orders of magnitude faster, of course. And this is the refinement time of our method compared to KNN. And this is the result of refinement: this axis here is the number of parameters. As we increase the number of parameters, we show that our method gets better compared to KNN. This is when the size of the labeled data is 10, and this is when the size of the labeled data is 100. For the 100 case, as we see here, KNN beats our method and also the exact method. Our explanation for this result is that this data must have a very clear-cut manifold, such that KNN, because it is sparse, can catch it very quickly. But in the other case, when we have less labeled data, KNN had high variance; it couldn't really be improved. Our method, on the other hand, improved as we increased the number of parameters. Of course, if we increase the number of parameters further, we're going to see more improvement, but that's always a computational issue; that's a trade-off, and we want to stop at some point, because otherwise the computational complexity is going to be higher. So the next experiment: so far all the experiments were on medium-sized data sets. Here we want to see whether our method is really scalable to really large-scale data sets. The data sets that we chose are from the Pascal large-scale learning challenge -- you can actually go and look at these data sets. This one is massive, a million data points; the other one is three and a half million data points, and pretty high dimensional as well. We didn't compare to other methods, because we couldn't run the other methods on these data sets -- it took forever, and there were also memory issues. But we ran our method at the coarsest level, and this is the result. Here we just want to show the computational cost: for the first data set it takes about four and a half hours to build the model and 11 minutes for the propagation -- for the multiplication, basically. For the other data set it takes almost two days to build the graph on two and a half million data points, which translates into seven million parameters, and about 93.3 minutes to do the propagation. These are all on a serial computer. Of course we can make it even faster if we parallelize the whole framework, and this is possible because the underlying data structure that we use is all based on trees, so we can always decompose these trees across different machines, do the computations in parallel, and make it even faster. But these are the results on a serial computer. And I think that was it.
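As an illustration of the label propagation used throughout these experiments -- the iterative form that avoids the matrix inverse by repeatedly multiplying the (approximate) transition matrix Q by the label vector -- here is a minimal sketch; the damping factor alpha, the iteration count, and the function names are my own illustrative choices, not the talk's:

```python
import numpy as np

def propagate_labels(Q_matvec, y0, alpha=0.9, n_iter=50):
    """Iterative label propagation: y <- alpha * Q y + (1 - alpha) * y0.

    Q_matvec is a function computing Q @ y; with the dual-tree block structure
    this is where the O(number of blocks) fast multiplication would plug in
    (a dense Q.dot works too, at O(N^2) cost). y0 holds the initial labels,
    e.g. +1 / -1 for labeled points and 0 for unlabeled ones. Iterating
    approximates the closed form (1 - alpha) * (I - alpha Q)^{-1} y0
    without any matrix inversion.
    """
    y = y0.copy()
    for _ in range(n_iter):
        y = alpha * Q_matvec(y) + (1.0 - alpha) * y0
    return y

# Example usage with a dense transition matrix Q standing in for the fast tree-based matvec:
# labels = propagate_labels(lambda v: Q @ v, y0)
```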
If there are no questions about the experiments, I can move quickly to the conclusions. Okay. So as we showed, on average the construction time can be as low as N log N using our framework, instead of N squared -- and the N log N is basically for building the tree; if you already have the tree, it's actually linear. The multiplication can be as low as linear, because the number of blocks can be as low as linear, so we can have a very fast multiplication, and multiplication is very useful for label propagation and eigen-decomposition. Memory usage can again be as low as linear order, because the number of blocks can be as low as linear order. The framework also gives us a straightforward method to find the optimal bandwidth, with a nice interpretation. And the whole framework is a multi-level approximation framework, meaning that we can approximate at different levels and refine our model on demand, depending on how much accuracy we need. From the other point of view, you can have a maximum CPU budget: you say, okay, this is my maximum CPU resource, this is the maximum number of blocks I can afford, give me the best refinement for the matrix. And we developed this technique to find the best -- actually suboptimal -- block partition. As I said before, the framework does not depend on the choice of tree that we use for the cluster hierarchy; therefore we can easily substitute this tree with trees that have some theoretical guarantees. We couldn't find any work on theoretical guarantees for anchor trees, but if you use cover trees, for example, cover trees do have some theoretical guarantees. However, as I said, all of these tree methods give you a bound, an order with some constant in it, and that constant can kill you in practice, because it depends on the dimension, on many factors -- basically on the geometry of the data. So in theory, yes, you can improve the order if you, for example, replace the anchor tree with a cover tree, and you can have some theoretical guarantees. But in order to ensure that theoretical guarantee, the constant that shows up in the computational complexity can kill you. And I think that was it. Thank you. [applause]. >> Saeed Amizadeh: Any question? >>: [inaudible] computation in parallel? >> Saeed Amizadeh: Yes. >>: So does -- is the -- so what [inaudible] is the communication time [inaudible] so now you have [inaudible]. >> Saeed Amizadeh: Yes. >>: So I'm not sure if in this case the communication time is actually trivial, so you don't have to worry about that [inaudible]. >> Saeed Amizadeh: Well, I'm not an expert in parallel computing, but all I can say is that in this framework all the computations are done hierarchically. So if you're doing the computation at, let's say, the coarsest level, the left subtree is going to be independent of the right subtree, and this recursively goes down. There is some overhead for communication, but if you can keep the communication inside each subtree in a recursive fashion, then probably you can decrease the communication time to the order of N, basically -- with some constant, of course.
Because it's going to be the number of internal nodes, the number of sibling pairs, and the number of internal nodes is basically N minus one or so. So of course there is some constant, and that probably depends on -- >>: [inaudible]. >> Saeed Amizadeh: Okay. >>: All right. Thanks. >> Saeed Amizadeh: Thank you.