
>>: It's my pleasure today to welcome Saeed Amizadeh. Saeed is a PhD student at the
University of Pittsburgh, and he is working on machine learning and data mining,
especially on large-scale data.
Today he's going to teach us how to quickly approximate the transition matrix for large
data sets. This is also related to his internship work last year with Bo Thiesson.
>> Saeed Amizadeh: Okay. Thank you, Scott. Hello. So I'm not going to introduce
myself; Scott already did. So this work is actually related to my internship work last
year here with Bo Thiesson. And my PhD dissertation is about working with large-scale
data sets, especially with graph-based methods. Today I'm going to talk about this work,
which I presented last week at UAI, on approximating the transition matrix for a random
walk on the graph when your data set is large, meaning the number of data points is huge.
So I'm going to start with just a brief introduction of graph-based methods and why
they are useful in machine learning, and then talk about some challenges that we have
when we work with these methods. Then I'll present the variational dual-tree framework
in general, which was originally introduced for density estimation, and then talk about
how we can apply this method for approximating the random walk, and finally some
experiments and some concluding remarks.
So why do we care about graph-based methods? Usually in graph-based methods we have a
bunch of data points and we can build a similarity graph, meaning that we have our data
points as nodes, and the edges basically show the similarity between two nodes.
This similarity graph captures the geometry of the data, so it's closely related to the
distribution of the data. And actually there are some works that focus on the connection
between density estimation and the similarity graph.
So the formal definition is that we define a graph where the nodes are the data points,
and we can have an edge between every two data points. If we don't have an edge, the
weight for that edge is going to be zero. So the edges are weighted, and the higher the
weight, the higher the similarity between those data points.
Just to give you a brief motivation for why these methods are important and why they
help us: usually when we have data points in some space, we have a metric in that space.
And sometimes our data has some, you know, weird shape, some manifold structure, meaning
that the distance or metric that we have in that space is not meaningful globally. It is
meaningful locally. However, for example, for these two points A and B, their Euclidean
distance here is not meaningful, because what we have in mind as the distance between A
and B is this red curve here, not this straight line. So the global distance is not
necessarily meaningful.
But locally we have a meaningful distance. And the goal is to somehow use the graph
structure to aggregate these local distances and infer a meaningful global similarity or
dissimilarity metric, depending on how you define it.
And that's probably one of the most important philosophies behind these frameworks.
So usually we need to define a similarity function, which basically transforms the
distance -- here in this talk we are talking about Euclidean distance -- into a
similarity. And here we use the very popular Gaussian kernel, which has a bandwidth
parameter.
We'll talk about the bandwidth later in this talk and how we can actually set it,
because the bandwidth is a design parameter: if you decrease your bandwidth towards zero,
you're going to end up with a very sparse graph; if you increase it, you will get a very
dense graph. So it's very important where to set sigma.
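As a concrete reference, this is the standard Gaussian similarity the talk is describing
(a minimal sketch in the speaker's notation; x_i, x_j are data points and sigma is the
bandwidth):

```latex
% Gaussian (RBF) similarity between data points x_i and x_j with bandwidth \sigma:
w_{ij} \;=\; \exp\!\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right)
% \sigma \to 0 gives an effectively sparse graph; large \sigma gives a dense graph.
```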
So a very famous and very useful matrix that people define is the Laplacian matrix, which
turns out to be very useful in graph-based methods. We have different Laplacians. This is
probably the most famous one, the unnormalized Laplacian matrix.
So you can define this diagonal matrix where each element is the degree of each node, and
you can define your Laplacian matrix as the diagonal degree matrix minus the weight
matrix. Or you have the symmetric normalized Laplacian, which is symmetric, or we have
the random walk Laplacian, which is basically the identity matrix minus the random walk
transition matrix.
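Written out, these are the standard definitions being referred to (assuming W is the
weighted adjacency matrix built from the Gaussian similarity above):

```latex
% D is diagonal with D_{ii} = \sum_j w_{ij} (node degrees).
L        \;=\; D - W                                    % unnormalized Laplacian
L_{sym}  \;=\; I - D^{-1/2} W D^{-1/2}                  % symmetric normalized Laplacian
L_{rw}   \;=\; I - D^{-1} W \;=\; I - P                 % random-walk Laplacian,
% where P = D^{-1} W is the random-walk transition matrix.
```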
If you basically write it down, this term here is the transition matrix of the random
walk. So why do we care about the Laplacian matrix in general? Well, the reason is that
the eigenvectors of this matrix actually contain the cluster structure of the data.
Basically the eigenvectors that are associated with smaller eigenvalues encode the
coarser structure in the data. So because it encodes the geometry of the data, the
eigenvectors of this matrix are very important.
So we can use these eigenvectors and eigenvalues to embed our data points in a new space
where the Euclidean distance is globally meaningful. Remember we talked about how the
distance in the input space is not globally meaningful? Now we want to embed our data set
in a new space such that the distance, whether local or global, is meaningful.
So one way to do it is this very general form of embedding. Basically for each point XI
we can define ZI, whose coordinates are given by this transformation. And this UKI is the
Ith element of the Kth eigenvector, and phi is a decreasing function of the eigenvalues.
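Written out, the embedding being described has roughly this form (a sketch in the
notation above, with eigenpairs lambda_k, u_k of the Laplacian):

```latex
% Spectral embedding of point x_i:
z_i \;=\; \big(\,\varphi(\lambda_1)\,u_1(i),\;\varphi(\lambda_2)\,u_2(i),\;\dots,\;\varphi(\lambda_K)\,u_K(i)\,\big)
% \varphi(\lambda) = e^{-t\lambda}  -> diffusion distance
% \varphi(\lambda) = 1/\lambda      -> resistance distance (electrical networks)
% \varphi = step function on the first K eigenvalues -> classic spectral clustering
```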
So we can define different phis -- it's up to you -- and different phis result in
different global distance metrics. For example, if you define phi as the exponential
function, then you get the diffusion distance, which has exactly the meaning of physical
diffusion. If you define phi as one over lambda, you get the resistance distance in
electrical networks. So with different choices of phi, you get different global
similarity metrics. Yes, please?
>>: Don't you want to -- so usually when people do [inaudible] paper, where they do
spectral clustering, there's a crucial step after you take the eigenvectors where you
should normalize each point so that each of these UJIs has unit [inaudible]. And then
they do [inaudible] clustering in that space.
>> Saeed Amizadeh: Yes.
>>: So don't you -- don't -- do you want to do some kind of normalization here as well
per I?
>> Saeed Amizadeh: Yes. So the thing is these UIs, these eigenvectors are assumed to be
normalized already, so --
>>: No, no, no, you normalize -- that's the eigen -- each eigenvector --
>> Saeed Amizadeh: Yes.
>>: -- across all the Is --
>> Saeed Amizadeh: Yes.
>>: That norm is constant. I'm talking about --
>> Saeed Amizadeh: The rows.
>>: -- for each I, [inaudible] across the eigenvectors, however many eigenvectors you
take, you want to normalize that. They do this so that the resulting points lie along,
you know, a K-dimensional sphere.
>> Saeed Amizadeh: Yes.
>>: Or for better K-means clustering.
>> Saeed Amizadeh: Yes. So you can summarize that into your phi function. In fact --
>>: [inaudible].
>> Saeed Amizadeh: Go ahead, sorry.
>>: So then the phi would depend on I, right? Because the normalization would be
different for each I?
>> Saeed Amizadeh: Yes.
>>: Okay.
>> Saeed Amizadeh: Yes. That's -- I mean, phi doesn't need to be the same thing for --
as I said, this is just one form of transformation. This is based on the paper by
Ghahramani. But, yeah, you can have different forms. Actually, in the very early form of
spectral clustering, phi is actually a step function: you have phi equal to one for the
first K eigenvectors and zero for the rest. But phi can also, as in that paper, be
dependent on I to do that transformation.
So the thing is you have some sort of transformation here. But in order to make it an
embedding, the most important thing is that it should be a decreasing function -- and by
decreasing I mean non-increasing, basically. Because as you go from left to right you are
moving from the coarser structure to the finer structure, and you want to preserve the
coarser clusters.
Okay. So what are the applications? The very first application is dimensionality
reduction: if your data is already lying on some sort of manifold, you can take only a
few eigenvectors to represent your data in the new space. So effectively you reduce the
dimension.
In spectral clustering you can find non-spherical clusters because the eigenvectors
represent the cluster structure.
In semi-supervised learning you can propagate the labels from your labeled points to the
unlabeled data, again using, for example, the random walk kernel.
And function approximation: for example, you have a reinforcement learning problem where
each state is a node, and you can try to approximate the value function for your
reinforcement learner on the graph, which is basically function approximation.
So there are many applications of graph-based methods.
So what are the challenges? Well, in this talk we are going to focus on large-scale data,
so basically our challenge is large-scale data. Large scale can mean large dimension. So
what's the challenge with the dimension?
Well, in this talk I'm not going to focus on dimension, but this is part of the
large-scale problem, and actually part of my thesis work.
So the problem is the curse of dimensionality. You can show that if you let the bandwidth
parameter change with your sample size, then the error between the eigenvectors of your
sample and the eigenfunctions of the true population depends exponentially on the
dimension. And that's the curse of dimensionality.
One way to solve this problem is using the independence structure. Either you have a data
set where the features are independent, or the features are divided into groups, or you
can impose the independence, with the cost that you may have some approximation error but
at least you can decrease the estimation error.
So if we have this independence structure, or we impose it, then we can decompose our
problem into subproblems, each of which has a reduced dimensionality -- basically a
divide-and-conquer approach.
The challenge that we are going to talk about in this talk today is large N, where the
number of data points is huge. The first challenge, large dimension, is more of a
statistical challenge; this one is more of a computational challenge. So in general, if
we want to build this kernel matrix -- and by kernel matrix I mean the Laplacian or the
random walk transition matrix -- it needs, in general, N squared time and memory, and
therefore it's not applicable for large-scale problems.
So we can see this as an instance of the general N-body problem where
you have, you know, N particles in the space and you want to compute the mutual effect
between these particles. It's a famous problem in physics. And there are many
solutions to this problem.
So in machine learning we have different classes of solutions. One solution is
sparsifying the nodes: this can be subsampling -- just take a subsample of the nodes --
or building a backbone graph, you know, making supernodes. Or it can be sparsification of
the edges, for example building KNN graphs, which gives you a sparse matrix, or epsilon
graphs, or b-matching. These are all from the same family of sparsification of the edges.
The third approach is basically, okay, try to go around actually building this kernel
matrix. So one example is: if I want to compute the eigenvectors or eigenfunctions of the
kernel, can I somehow do it directly without building the matrix? And there are works and
papers on how to do this.
The other idea is parameter sharing. Basically instead of sparsification we share
parameters. And our method basically belongs to this group of parameter sharing
approaches.
So now, I'm going to actually talk about the variational dual-tree framework as the
baseline for our framework. And please stop me whenever things are not clear or you
have any question.
Okay. So forget about random walks for now. We are not concerned with random walks or
graph-based methods. Let's say we want to do kernel density estimation for N points. So
for each point I want to compute this kernel density estimator, and for all points it
requires me to do N squared computations.
Here, this is basically your Gaussian kernel; this is the likelihood of XI being
generated by this kernel, and MJ is the center of the kernel. And P of MJ, for
simplicity, we can just set to one over N -- the same weight for all kernels.
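For reference, the estimator being described is (a sketch in the speaker's notation, with
kernel centers m_j at the data points and uniform kernel weights):

```latex
% Kernel density estimate at x_i, using N Gaussian kernels centered at the data points m_j:
\hat{p}(x_i) \;=\; \sum_{j=1}^{N} p(m_j)\, K_\sigma(x_i \mid m_j),
\qquad p(m_j) = \tfrac{1}{N},
\qquad K_\sigma(x_i \mid m_j) \propto \exp\!\left(-\frac{\lVert x_i - m_j \rVert^2}{2\sigma^2}\right)
% Computing this for all i costs O(N^2).
```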
If you look at this problem, each data point plays two roles: one as the data point XI
where we want to compute the density, the other as the center of a Gaussian kernel, MI.
Okay. So we can go ahead and compute that, but we can actually reformulate our problem as
a variational problem. So this is the logarithm of the likelihood. I can introduce the
variational parameters and use the Jensen inequality to get this lower bound on my
likelihood, and this is basically the KL divergence between my variational distribution
and the true distribution.
And this is the likelihood of one data point; I can sum over all data points here. And I
can try to maximize this lower bound with the constraint that the sum of the QIs should
be one. Then if I solve this variational problem, this is going to be my solution for
QIJ.
So if we look at these QIJs -- by P here, the true distribution, I mean the reverse of
this one, so P of MJ given XI. So QIJ approximates P of MJ given XI. And P of MJ given XI
is the membership probability: the probability that XI is generated by the kernel MJ. The
membership probability of XI to kernel MJ.
If we look at this solution here, this is an exact solution, because the KL divergence is
going to be zero. So the question is why we bother to do this at all, and the answer is
we'll see later. Basically we want to approximate this posterior P of MJ given XI using
QIJ, the variational parameter. And if we approximate this thing, then we can compute the
likelihood.
Okay. And this has a benefit over that one, because that one is a likelihood and this one
is a probability distribution: this one sums up to one, that one doesn't. So this gives
us a very natural way to use the variational approximation, because we can easily set up
our constraint. With the likelihood, we cannot easily set up our constraints.
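A sketch of the variational step just described, in the same notation (this is the
unblocked case, where the bound is tight):

```latex
% Jensen lower bound on the log-likelihood of one point, with variational weights q_{ij}:
\log \hat{p}(x_i)
  \;=\; \log \sum_j q_{ij}\,\frac{p(m_j)\,K_\sigma(x_i \mid m_j)}{q_{ij}}
  \;\ge\; \sum_j q_{ij}\,\log \frac{p(m_j)\,K_\sigma(x_i \mid m_j)}{q_{ij}}
% Maximizing subject to \sum_j q_{ij} = 1 recovers the exact posterior (zero KL gap):
q_{ij} \;=\; \frac{p(m_j)\,K_\sigma(x_i \mid m_j)}{\sum_{j'} p(m_{j'})\,K_\sigma(x_i \mid m_{j'})}
       \;=\; P(m_j \mid x_i)
```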
So okay. What's the basic idea? Suppose these are my kernel centers -- these are
Gaussians -- and this is my data point. If I want to compute P of MJ given XI, I need to
do it for every pair here. So the idea is parameter sharing: just group these kernels
together and approximate the membership probability with one parameter.
So here in this example we would reduce the number of parameters from four to one. You
can see these are two different approaches toward reducing the number of parameters. One
is parameter sharing; the other one is sparsification. In sparsification you just zero
out some of the edges: you say, okay, I don't look at this, just assume that it's zero.
So these are two different mentalities toward reducing the number of parameters. Our
method belongs to the parameter sharing group, while, for example, K-nearest neighbor is
in the sparsification group.
Okay. So single-tree parameter sharing means that, okay, I have this data point and I can
build a hierarchy over my kernels -- a cluster hierarchy over the kernels. So I can
approximate the effect of all of my kernels with only one number, or I can make it a
little bit finer by going one level down and approximating with two numbers, and so
forth.
And we assume all the hierarchies are binary trees here, but in general it can be any
hierarchy; it doesn't need to be binary.
So if we do this -- while this is a huge saving -- suppose I have N data points. Then
basically I need to repeat this process for all of the data points; this was just for one
data point.
Then the next idea kicks in, which says, okay, why don't we do the same thing on the data
points as well: build the same hierarchy, and let's say we want to approximate the effect
of all kernels over all data points with one number. Of course this is a very coarse,
rough approximation, but you can go further down and refine it.
The general idea is to use two trees -- that's where the dual-tree methods kick in -- one
for the data points and the other one for the kernels. And, of course, here in density
estimation the kernels and the data points are the same set, so this is going to be the
same tree. We don't actually need to build two trees; we have only one tree referring to
itself, pointing to itself.
Okay. Then if this is the matrix where I compute P of MJ given XI, the membership matrix,
then I can basically say, okay, this is my data tree and this is my kernel tree. Then I
can say, okay, I have this subtree and this subtree -- two subtrees -- and I want to
estimate the effect of this subtree on this subtree with only one number. So I can pair
them together and represent the effect with only one variational parameter, QAB.
And then this is going to be a blocked matrix of the membership probabilities. Okay. Any
questions so far?
>>: So [inaudible] you get to this later, but I'm just wondering about the computational
savings of this approach versus the version of just computing all the N squared
similarities.
>> Saeed Amizadeh: Yeah. Good question. We'll get to it.
Okay. So now suppose I give you these trees -- for now you don't have to build the trees,
I give you the trees -- and using these trees, you come up with a block partitioning of
your matrix. Then you can reformulate the variational problem that we had before using
the blocks. So this is the block-partitioned version of the variational optimization
function that we were talking about.
Before, there was no approximation and there were no blocks; every element was its own
parameter. Now we have blocks, meaning that the parameters inside each block are shared
-- each block corresponds to one parameter.
So with some simple math we can reorganize our optimization function, and this is going
to be our lower bound on the log-likelihood for the optimization. We're going to get back
to this when we talk about the random walk interpretation.
So we can actually solve this problem, try to maximize this lower bound with the
constraint that the sum of each row should be one. Why is that the constraint?
Because we are approximating P of MJ given XI. And this sum over all J should be
one. This is the membership probability.
So with some math we can easily translate the constraint in terms of blocks.
>>: Given the tree, I'm just wondering how you determine which subtrees form a block? Do
you have some --
>> Saeed Amizadeh: Yeah. We'll get to that point, yeah. I didn't talk about how to build
the tree and how to do the block partitioning yet, but we'll get there. We just assume
that we have them for now.
What we want to do is find the Q values; we want to fill the matrix. So this problem can
be solved -- as in the previous paper by Bo, this problem has a closed-form solution, and
the closed-form solution can be found in order of the number of blocks. And the number of
blocks is the number of parameters here.
And it's a closed-form solution, meaning that it's not an iterative algorithm. It just
makes two passes over the tree and you have your solution.
Okay. Now, building the hierarchy -- the first step. We assumed we have the hierarchy,
but this is not the case in reality. So there are many methods that we can use for
building the hierarchy. The first one is bottom-up agglomerative clustering, which takes
order of N squared.
We can use KD-trees, cover trees, ball trees, or anchor trees.
For obvious reasons we want to avoid the first one, because it's order of N squared --
that's kind of antithetical to our motivation.
Except for that one, you can use any of them. But we'll talk later in this talk about how
this is a very crucial step in the whole framework. And all of these methods, although
they have these very neat orders -- when you look at them, it's linear -- what is this
constant C? It really depends on the structure of your data, and it can be as high as N
squared.
>>: [inaudible] dimension of the data for cover trees?
>> Saeed Amizadeh: The cover trees --
>>: The C is you just somehow measure --
>> Saeed Amizadeh: Yes. Yes.
>>: [inaudible] exponentially with --
>> Saeed Amizadeh: Yes. Yes.
>>: [inaudible].
>> Saeed Amizadeh: Yes. It's two to the power of your intrinsic dimensionality. That's
fine given that your [inaudible] is low; if it is not, then you're in trouble.
With KD-trees it depends on the dimension. Anchor trees have their own problems. But
here, specifically in this paper, I use anchor trees. This is the framework that we
actually used last summer in my internship. But you can replace it with any tree that you
want.
So just to give a very quick demonstration of the anchor tree: you have N data points and
you build square root of N anchors. I won't get into the details, but you can think of
each anchor as a cluster. Then you merge them in an agglomerative way; because there are
square root of N of them, the merging is going to take you order N -- square root of N
times square root of N. Then you recursively repeat the process inside each anchor.
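A rough sketch of that construction idea in Python. This is only illustrative: the real
anchors hierarchy uses smarter anchor selection and merge rules, and the node structure
and function names here are hypothetical, not the paper's API.

```python
import numpy as np

def build_anchor_tree(points, min_size=2):
    """Illustrative anchor-tree construction: build ~sqrt(N) anchors (clusters
    around pivot points), recurse inside each anchor, then merge the anchor
    subtrees into a binary hierarchy. With sqrt(N) anchors, merging them costs
    on the order of sqrt(N) * sqrt(N) = N per level."""
    n = len(points)
    if n <= min_size:
        return {"points": points, "children": None}

    k = max(2, int(np.sqrt(n)))
    # Pick k pivot points as anchors (here: simple random choice, for brevity).
    pivots = points[np.random.choice(n, k, replace=False)]
    # Assign every point to its nearest anchor.
    dists = np.linalg.norm(points[:, None, :] - pivots[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    anchors = [points[assign == a] for a in range(k) if np.any(assign == a)]

    # Recurse inside each anchor, then pair subtrees up into a binary tree.
    # (The real method merges the closest anchors first; here the order is arbitrary.)
    nodes = [build_anchor_tree(a, min_size) for a in anchors]
    while len(nodes) > 1:
        left, right = nodes.pop(0), nodes.pop(0)
        nodes.append({"points": np.vstack([left["points"], right["points"]]),
                      "children": (left, right)})
    return nodes[0]
```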
The construction time in the original paper is stated as N log N. But we did some
analysis, and in the worst case it can be larger -- though still less than N squared. By
worst case I mean there is no structure in the data: all the data points are equally
distant from each other in the space, so the intrinsic dimensionality is the same as the
original dimensionality and there is no structure.
Okay. Any question on this part before we move on to the random walk section? Okay.
Okay.
Now, random walks. Okay. What's the connection between the random walk on the graph and
this variational method that we talked about?
So take the original optimization problem without blocking: we wanted to maximize this
lower bound on the likelihood given that the sum of the Q values is equal to one. This is
the original variational problem.
And if you solve this problem, this is your result -- these QIJs -- and the error is
zero; it's the exact method. But if you look at this term, what is it? This is the
Gaussian kernel, right? It's the similarity between XI and MJ, and this is the
normalization.
This is what you actually do if you want to compute the random walk: you compute the
similarity and you normalize it.
So the Q matrix can be seen as the block-partitioned approximation of the transition
matrix. The P matrix is the transition matrix, and the Q matrix is just its
block-partitioned approximation.
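In other words (a small restatement of the connection, with uniform kernel weights as
assumed earlier):

```latex
% The exact (unblocked) variational solution is exactly the random-walk transition probability:
q_{ij} \;=\; \frac{K_\sigma(x_i, x_j)}{\sum_{j'} K_\sigma(x_i, x_{j'})} \;=\; P_{ij},
\qquad P \;=\; D^{-1} W
% so the block-partitioned Q is a parameter-shared, block-wise approximation of P.
```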
So what is the interpretation then? In this new context, this new view of the Q matrix,
we have a new interpretation for the blocked version of our objective function. This is
our blocked version of the lower-bound log-likelihood. The first term is just a
normalization, and it's constant in terms of Q.
However, the second term is a term that people use when they want to learn similarity on
graphs, meaning that if I want to maximize this -- there's a negative sign here -- then
if the distance between two clusters A and B is huge, I want to assign smaller Q values.
So the higher the distance, the lower the similarity. This is a very common term when
people want to learn a similarity.
This term tends to connect each node to its closest neighbor with probability one and
disconnect it from the rest of the nodes with Q equal to zero: find the closest neighbor
and connect to it with all of the probability mass. And what is this term? If you look at
it, this is actually the sum of the entropies of the outgoing probabilities from all
nodes -- just the Shannon entropy of the distribution.
This one tends to -- because to maximize this entropy we need a uniform distribution --
connect each node to all of the nodes with equal probability.
So these two terms work against each other, and you can look at the entropy term as a
regularization term: if we don't use it, then we're going to end up with a disconnected
graph where every node is just connected to its closest neighbor.
And this coefficient here is just a trade-off between these two terms. If you look at it,
this is actually the bandwidth. And as a sanity check, the lower the bandwidth, the
stronger the distance term becomes, so your graph will be sparser; and the higher the
sigma, the stronger the entropy term becomes, so your graph will be denser.
So this random walk view gives us this new interpretation of the objective function, of
what it does exactly. Okay. Any question? Sorry.
>>: So [inaudible] how does your lower bound compare the approximation quality to KNN
with an added small uniform connection to all nodes? Because it sounds like this is what
--
>> Saeed Amizadeh: Yes.
>>: -- you're doing.
>> Saeed Amizadeh: So -- well, we didn't perform that experiment. We did compare with KNN
because we wanted to compare these two different ideas of sparsification and parameter
sharing. But what you are saying is basically, instead of using a sparse matrix, use
epsilon instead of zero in KNN.
>>: [inaudible].
>> Saeed Amizadeh: Yeah. We didn't do that experiment, that framework. Because
we wanted to compare, you know, sharing versus sparsification. And that is already a
sharing idea. But, again, it depends how you implement your KNN because if you
wanted to relate your KNN to likelihood, you need to weight the edges. So KNN by
itself is not enough. You need to weight them with the similarity, the Gaussian
similarity.
Then you can show that as you increase K, you basically converge to the exact method,
which is --
>>: I guess here you're also clustering the samples, the core [inaudible] KNN.
>> Saeed Amizadeh: Yes.
>>: So you're [inaudible].
>> Saeed Amizadeh: Yes.
>>: And then [inaudible].
>> Saeed Amizadeh: Yes. I mean, again, it depends how you implement it. Even if you want
to implement the very brute-force KNN, you'll have a problem at large scale, because
brute-force construction of the KNN graph is N squared -- actually more than N squared, N
squared log N, because you need the sorting -- and you want to avoid that. So again you
want to use the tree, even for KNN. And in our experiments we did that: we used the tree
for KNN.
>>: I'm not arguing the KNN, but I wonder -- so in order to solve for the optimal Q, both
methods have complexity of the size of -- the largest block?
>> Saeed Amizadeh: The number of blocks.
>>: [inaudible] blocks.
>> Saeed Amizadeh: The number of blocks, basically, not the size of the blocks -- just
the number of blocks, which is the number of your parameters.
>>: Right.
>> Saeed Amizadeh: Yeah. And I assume in your KNN -- well, the thing is these are all
independent problems. So as soon as you give me a partitioning -- no matter how you get
that partitioning, whether you get it using KNN or using our method -- the rest of it is
the same: just solving for the Qs.
>>: Yeah. But now I'm wondering whether you can replace the step where you solve
for the optimal Q with this approximation where you just -- where you pick the -- maybe
the approximate top K and then, you know, set a large weight for that for Q and then the
small epsilon weights for the rest of them.
>> Saeed Amizadeh: Yes. We didn't do it specifically, but that's definitely a valid
method, because instead of actually computing P of MJ given XI, as you mention, you can
just compute the likelihood directly, compute it for the first K, and just take the
average or something.
The thing is that it becomes very ad hoc how you reduce it to one parameter, how you
share it. Here we have a very formal method to do it. But, yeah, that could be
[inaudible].
Okay. So this bandwidth parameter basically adjusts the trade-off between these two
terms. Now, the question is how to set this sigma. There are many methods -- in the
literature there are many heuristics for how to do it. And we are not claiming that our
method is the best way to do it, but at least it has a nice interpretation of what it
means.
So as we said before, the bandwidth adjusts the decay rate of your similarity as a
function of distance. And if we look at our objective -- again, we use this formula over
and over in these slides; this is the block version of our lower-bound objective -- it is
a quasi-concave function of the bandwidth. This term is constant in terms of the
bandwidth, but these two terms together form a quasi-concave function of the bandwidth.
>>: [inaudible].
>> Saeed Amizadeh: Okay. So --
>>: [inaudible].
>> Saeed Amizadeh: My bad. Sorry.
>>: That's all right.
>> Saeed Amizadeh: I should have put it in the slides. So with a concave function you
have one maximum. With a quasi-concave function you still have one maximum, but the
second derivative can be positive.
>>: You just [inaudible].
>> Saeed Amizadeh: That -- yes.
>>: Okay.
>> Saeed Amizadeh: Yes. Yes. Exactly. So this means that in terms of the bandwidth our
objective function has one maximum, and we can find that maximum in closed form. If we
solve this for sigma, this is our solution.
And this solution also has a nice interpretation. If we look at this term here -- if you
take the dimension out -- this is basically the average expected distance that the random
walker traverses after one time step. The numerator is the sum of the expected distances
that the random walker traverses, and when you normalize it by N, it's the average
expected distance.
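I don't have the exact expression from the paper here, but a form consistent with that
verbal description (average expected squared step length, normalized by the dimension d)
would be something like the following; treat it as an assumption rather than the paper's
formula:

```latex
% Plausible closed-form bandwidth, assumed from the verbal description:
\hat{\sigma}^2 \;\approx\; \frac{1}{N\,d}\sum_{i,j} q_{ij}\,\lVert x_i - m_j \rVert^2
% i.e. sigma^2 is set to the average expected squared distance a random walker
% moves in one step, divided by the dimension d.
```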
And before you ask me this question, I'm going to answer it right now: intuitively, as we
decrease sigma, as we decrease the bandwidth, our likelihood should go up, so there's no
limit on it, right? As we go towards zero, our likelihood becomes larger and larger.
The catch here is that our objective function is not the log-likelihood itself; it's the
lower bound on the log-likelihood. And this lower bound gives us this nice, very
straightforward way to find the sigma.
Question? Quickly, yes. You can use the same kind of idea -- in the same vein you can
say, okay, in the exact method, no blocking, nothing, just the exact method, I can always
use the Jensen inequality, and this is going to be a lower bound on my likelihood.
Then I can always try to maximize this lower bound in terms of the bandwidth, and I get
this basically optimal solution for sigma, because this is again quasi-concave. But this
is not a special case of our block-partitioned version. The reason is that this is not a
tight bound, whereas our bound, as you increase the number of parameters, as you refine
it, gets closer and closer to the actual likelihood.
This one is just a heuristic. But we use this heuristic in our experiments to set the
sigma for the exact method.
Okay. Fast multiplication. In many applications -- imagine Q is my transition matrix -- I
need to compute the multiplication of Q by an arbitrary vector Y. And in general this is
an N squared operation, a matrix times a vector.
We want to do this efficiently, of course. So we have this simple algorithm on our tree:
you assign all the elements of Y to their corresponding leaves in the tree, and as a
first step there's a collect-up step where you sum up all your children and keep this
statistic at each node, which is basically the sum over the children.
So this step takes linear time, O of N. Then, using this statistic, we have a
distribute-down step where you compute this sum along each path from the root to the
leaves, and you can use dynamic programming to save the summation at each step. And we
can easily show that this is order of the number of blocks in your tree.
So the whole process takes order of the number of blocks, because the number of blocks,
as we see later, is always greater than or equal to N.
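A minimal sketch of that two-pass multiplication, assuming a binary tree where every node
carries a (possibly empty) list of blocks, each block being a shared parameter q_AB paired
with the column cluster it points to; the node layout and names here are hypothetical,
not the paper's actual data structure.

```python
def collect_up(node, y, sums):
    """Bottom-up pass: sums[id(node)] = sum of y over the leaves under node. O(N)."""
    if node["children"] is None:
        s = y[node["index"]]
    else:
        left, right = node["children"]
        s = collect_up(left, y, sums) + collect_up(right, y, sums)
    sums[id(node)] = s
    return s

def distribute_down(node, sums, acc, out):
    """Top-down pass: accumulate the contribution of every block whose row cluster
    contains this subtree, then write the result at the leaves. Each block is
    touched once, so this pass is O(#blocks)."""
    # node["blocks"] is a list of (q_AB, col_node): all rows under this node share
    # the parameter q_AB toward all columns under col_node.
    acc = acc + sum(q * sums[id(col)] for q, col in node["blocks"])
    if node["children"] is None:
        out[node["index"]] = acc          # (Q y)_i for leaf i
    else:
        for child in node["children"]:
            distribute_down(child, sums, acc, out)

def fast_multiply(root, y):
    """Compute Q @ y from the blocked representation, never forming the N x N matrix."""
    sums, out = {}, [0.0] * len(y)
    collect_up(root, y, sums)
    distribute_down(root, sums, 0.0, out)
    return out
```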
So this gives us a very fast algorithm for multiplication by an arbitrary vector. And why
is this important? Because of applications like label propagation: if this is your
initial vector of labels for some labeled data, then in the limit the propagation gives
you this vector of labels for the whole graph.
This involves the inversion of a matrix. But if you don't want to do the inversion, you
can approximate it with a finite number of iterations using this process here, and in
this process you need to do the multiplication of Q by a vector over and over.
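For example, an iterative label propagation of the kind being described might look like
this sketch, reusing the fast_multiply routine above; the damping factor alpha and the
number of iterations are assumptions, and this is the usual iterative scheme rather than
necessarily the exact variant in the paper.

```python
def propagate_labels(root, y_init, alpha=0.9, n_iters=50):
    """Approximate f = (1 - alpha) * (I - alpha * Q)^{-1} * y_init without inverting
    anything: iterate f <- alpha * Q f + (1 - alpha) * y_init, where each Q f product
    uses the fast blocked multiplication (O(#blocks) per iteration)."""
    f = list(y_init)
    for _ in range(n_iters):
        qf = fast_multiply(root, f)
        f = [alpha * q + (1 - alpha) * y for q, y in zip(qf, y_init)]
    return f
```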
Another application is the eigendecomposition of Q: you want to find the eigenvectors of
Q, which we know by now are important. And if we use the power method or Arnoldi
iteration, we need to compute these multiplications by a vector iteratively. So this
gives us a very fast multiplication algorithm for these methods.
Now, to the question of computational complexity. This is the computational complexity of
our method. The construction time in the worst case, if you don't have any structure, is
whatever it takes to make the anchor tree; but on average, because we always have some
sort of structure, this is N log N, plus the number of blocks for the estimation of the Q
values.
The multiplication time is order of the number of blocks. So as we see here, the number
of blocks plays a very crucial role in the complexity of the whole framework; everything
depends on the number of blocks, or the number of parameters.
So the question is what this number of blocks is and how we can actually change it or fix
it. We can show that it ranges between linear order and N squared; at N squared you
basically get the exact method, with no approximation.
At the coarsest level, the number of blocks is two times N minus one. That's the minimal
number of blocks that you can have -- not less than this value -- and we'll see why.
So, the coarsest level of approximation. To get a valid block partitioning, the subtrees
that we pair together to form a block need to be nonoverlapping. If two subtrees, two
clusters, overlap -- have the same points -- the approximation is not a valid
approximation.
So we need the subtrees to be nonoverlapping. But for any given subtree in the tree, the
largest nonoverlapping subtree is its sibling. For this subtree, that subtree is a
nonoverlapping subtree, right? But the largest nonoverlapping one is just its sibling.
Yes.
Therefore, at the coarsest level of approximation we just pair each subtree with its
sibling, as here. The number of blocks then is going to be the number of internal nodes,
which is N minus one, times two because we have two directions. So it's going to be two
times N minus one.
This is the minimal number of blocks.
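As a quick sanity check on that count (a small worked example, not from the talk):

```latex
% Minimal block count at the coarsest level:
\#\text{blocks}_{\min} \;=\; 2\,(N-1)
% e.g. N = 8 points: a binary tree has 7 internal nodes, giving 14 blocks,
% versus the 64 entries of the full transition matrix.
```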
So a rational approach is to start with the coarsest level of approximation and refine
our model as we need more accuracy, or as we can afford more computational power.
And what does refinement mean? One step of refinement means splitting a block in two,
either horizontally or vertically. What does this mean? If these are two clusters and
this arc represents a block -- I have paired these two, and this is a block -- then I can
refine this arc by pointing it to the children of B, which is a vertical refinement, or
from the children of A to B, which is a horizontal refinement. So these are two different
refinements that I can have.
Every time I split a block, I introduce a new parameter -- so I have a new parameter
here. And of course by introducing a new parameter I relax the constraint, meaning that
the lower bound on the log-likelihood is going to be higher, because I basically remove
one of the constraints. So you can show that refinement always increases the lower bound
on the log-likelihood. Mathematically it's easy to see, because you can always reuse the
previous parameters in your optimization problem.
So now the question is which block to split. Of course all of them will give us some gain
in terms of the likelihood. The obvious answer is: give me the one that gives me the most
gain in terms of likelihood -- that's the block that I want to split.
But to find this block, I would need to split all the blocks one at a time, solve the
optimization, compute the log-likelihood gain, and take the maximum. That's very
expensive; we don't want to do this for all the blocks.
So our solution here is to locally solve the optimization for each possible split: we
assume that all other parameters are fixed, so every time we just solve a local
optimization problem, and this local optimization problem takes order-one computation --
it doesn't take order of the number of blocks.
Then we pick the split with the maximum local gain. Of course this result is suboptimal,
but we can implement it very efficiently using a priority queue.
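A sketch of that refinement loop with a priority queue. The local_gain and split_block
routines stand in for the paper's local optimization and the horizontal/vertical split
step, which are not reproduced here; they and the block objects are hypothetical.

```python
import heapq
import itertools

def refine(blocks, local_gain, split_block, budget):
    """Greedy refinement: repeatedly split the block whose local gain in the lower
    bound is largest. Each candidate is scored once (assumed O(1) by local_gain)
    and kept in a max-heap, implemented as a min-heap on the negated gain."""
    counter = itertools.count()  # tie-breaker so heap never compares block objects
    heap = [(-local_gain(b), next(counter), b) for b in blocks]
    heapq.heapify(heap)
    while len(blocks) < budget and heap:
        _, _, block = heapq.heappop(heap)
        new_blocks = split_block(block)        # horizontal or vertical split -> two blocks
        blocks.remove(block)
        blocks.extend(new_blocks)
        for nb in new_blocks:                  # score and enqueue the new candidates
            heapq.heappush(heap, (-local_gain(nb), next(counter), nb))
    return blocks
```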
Okay. So is this clear? Any question? We can move to the experiments. All right. So in
the experiments we tried to solve a semi-supervised learning problem: given a small set
of labeled data points, we want to find the labels for the rest of the unlabeled data
points using label propagation on the graph. So this is the label propagation. And we
don't want to compute the inverse of this matrix, so we do it iteratively.
The performance metrics that we measure are the construction time, the propagation time
-- which is basically the multiplication time -- and the classification accuracy, given
that we know the true class labels.
Okay. And the baselines: the very first baseline is the exact method, where we compute
the similarity between each pair and then normalize the similarities to find the
transition matrix, the transition probabilities.
The other is fast K-nearest neighbor. As I mentioned before, we don't want to use the
straight K-nearest neighbor, because the construction time for the exact K-nearest
neighbor is large. So we again use trees to find the K-nearest neighbors -- we use the
same anchor tree for K-nearest neighbor. So for the first comparison we use the same
cluster hierarchy for both approaches: for both of them it's the anchor tree.
I don't want to get into details, but the way the tree helps K-nearest neighbor is
basically that you can prune some computations in the tree once you have candidate
nearest neighbors. That's what makes the K-nearest neighbor fast.
And these are the theoretical orders from the paper for the three methods -- the exact
method, fast KNN, and our variational dual-tree framework -- showing the construction
time, the memory usage, and the multiplication time.
So okay. In the first experiment the goal is to measure the construction time as the size
of the problem increases, where the size of the problem is the number of data points. The
other goal is to measure the accuracy, to see how much we lose in terms of accuracy when
we do this approximation.
The data set that we use is SecStr, a standard benchmark in semi-supervised learning. The
dimensionality is 315. The number of data points increases, which is why I didn't mention
it here. There are two classes; 10 percent of the data points are labeled for each
problem size, and we try to find the labels for the remaining 90 percent.
The bandwidth for each method is computed separately using the techniques that we already
talked about. The approximation level for our framework is going to be the coarsest
level, where the number of blocks is two times N minus one. And for K-nearest neighbor we
define the approximation level as K, so K equal to two is the roughest approximation
level for the fast KNN approach.
So we use K equal to two, and these are the results. The time and the size are in log
scale here, and these are the three methods; this is the classification accuracy. As we
see -- the red is the exact method -- we didn't lose that much in terms of accuracy.
However, we gain orders of magnitude in terms of construction time compared to the two
other methods.
In terms of multiplication, we are much better than the exact method, of course, and
comparable to KNN, because as soon as you build the KNN graph you have a sparse matrix
and the multiplication can be very fast. But the question is still how to build the KNN
graph, and that can be as slow as we showed here.
Okay. Second experiment: we want to study the effect of refinement on the model. The goal
is to compute the accuracy as the model is being refined and, of course, to measure the
refinement time -- how much time we need to refine each model.
Again, the data set is a benchmark data set, digit1, which has 1500 data points and 241
dimensions. We have two results here: one with 10 labeled data points and one with 100
labeled data points; these are the standard splits in the data set.
And, again, the bandwidth is computed using the method that we explained before. As I
just mentioned, refinement for K-nearest neighbor is defined as increasing K, because if
you increase K, and assume that each edge in the KNN graph is weighted with the Gaussian
similarity, then you can show that as K approaches N your method converges to the exact
method. This makes KNN a consistent approximator of the exact method.
And because we want to measure the effect of refinement, we want to make sure that for
both methods the number of parameters is the same. So we make sure that the number of
blocks in our framework is equal to K times N at each step, because the KNN method has K
times N parameters. So for a fair comparison we have the same number of parameters for
both methods.
So this is the result. This is the construction time again, and this is in log scale. I
don't know why -- I think I cropped the figure, so the numbers are not here. But this is,
of course, orders of magnitude faster.
And this is the refinement time of our method compared to KNN, and this is the result of
refinement. This axis here is the number of parameters. So as we increase the number of
parameters, we show that our method gets better compared to KNN. This is when the size of
the labeled data is 10, and this is when the size of the labeled data is 100.
But for the 100 case, as we see here, KNN basically beats our method and also the exact
method. Our explanation for this result is that this data must have a very clear-cut
manifold, such that KNN, because it is sparse, can capture it very quickly. But in the
other case, when we have less labeled data, KNN had very high variance and didn't really
improve, while our method improved as we increased the number of parameters.
Of course, if we increase the number of parameters more, we're going to have more
improvement. But that's always a computational issue. That's a tradeoff. We want to
stop at some point. Because otherwise the computational complexity is going to be
higher.
So the next experiment. So far all the experiments that we did were on, you know, kind of
medium-sized data sets. Here we want to see whether our method is really scalable to
really large-scale data sets. The data sets that we chose are from the Pascal large-scale
learning challenge; you can actually go and look at these data sets. One is about a
million data points; the other one is three and a half million data points, and pretty
high dimensional as well.
We didn't compare with the other methods here, because we couldn't run them on these data
sets -- they took forever, and there were also memory issues. But we did the experiment
with our method at the coarsest level.
And I believe this is the result. Here we just want to show the computational complexity.
For the first data set it takes about four and a half hours to build the model and 11
minutes for the propagation -- for the multiplication, basically. For the other data set
it takes almost two days to build the graph on three and a half million data points,
which translates into seven million parameters, and about 93.3 minutes to do the
propagation.
These are all on a serial computer. Of course we can make it even faster if we
parallelize the whole framework, and this is possible because the underlying data
structure that we use is all based on trees. So we can always decompose these trees
across different machines, do the computations in parallel, and make it even faster. But
these are the results on a serial computer.
And I think that was it. If there's no question from experiments, I can move quickly to
conclusions. Okay.
So as we showed, on average the construction time can be as low as N log N using our
framework, instead of N squared. And the N log N is basically for building the tree; if
you already have the tree, it's actually going to be linear.
The multiplication can be as low as linear, because the number of blocks can be as low as
linear, so we can have a very fast multiplication -- and multiplication is very useful
for label propagation and eigendecomposition.
Memory usage can again be as low as linear order, because the number of blocks can be as
low as linear order.
And the framework provides a straightforward method to find the optimal bandwidth, with a
nice interpretation.
Also, the whole framework is a multi-level approximation framework, meaning that we can
approximate at different levels and refine our model on demand, depending on how much
accuracy we need. From the other point of view, you can have a maximum CPU budget: you
say, okay, this is my maximum CPU resource, this is the maximum number of blocks that I
can afford, give me the best refinement for the matrix. And we developed this technique
to find the best -- actually suboptimal -- block partition. And as I said before, the
framework does not depend on the choice of tree that we use for the cluster hierarchy;
therefore, we can easily substitute this tree with trees that have theoretical
guarantees.
We couldn't find any work on theoretical guarantees for anchor trees. But if you use
cover trees, for example, cover trees have some theoretical guarantees. However, as I
said, all of these tree methods give you a bound, an order with some constant in it, and
that constant can kill you in practice, because it depends on many factors -- basically
on the geometry of the data.
So in theory, yes, you can improve the order if you, for example, replace the anchor tree
with a cover tree, and you can have some theoretical guarantees. But in order to ensure
that theoretical guarantee, the constant that shows up in the computational complexity
can kill you. And, yes, I think that was it.
Thank you.
[applause].
>> Saeed Amizadeh: Any question?
>>: [inaudible] computation in parallel?
>> Saeed Amizadeh: Yes.
>>: So does -- is the -- so what [inaudible] is the communication time [inaudible] so now
you have [inaudible].
>> Saeed Amizadeh: Yes.
>>: So I'm not sure if in this case the communication time is actually trivial, so you
don't have to worry about that [inaudible].
>> Saeed Amizadeh: Well, I mean, I'm not an expert in parallel computing, but all I can
say is that in this framework all the computations are done hierarchically. So if you're
doing the computation at the coarsest level, the left subtree is going to be independent
of the right subtree, and this recursively goes down. There is some overhead for
communication, but if you can keep the communication inside each subtree in a recursive
fashion, then probably you can, I don't know, decrease the communication time to the
order of N, basically, with some constant, of course. Because it's going to be the number
of internal nodes, the number of sibling pairs, and the number of internal nodes is
basically N, N minus one or something.
So of course there is some constant, and that probably depends on --
>>: [inaudible].
>> Saeed Amizadeh: Okay.
>>: All right. Thanks.
>> Saeed Amizadeh: Thank you.