>> Li Deng: Okay. Welcome to this lecture. We thank Moontae for flying all the way from Cornell to come here for this week. He will talk to us about topic modeling, and this will be the paper he will present at the upcoming NIPS, so we get a preview of his work. He's an extremely productive researcher and intern, and we appreciate that he spent one more week with us. Thank you. Bye.
>> Moontae Lee: Thanks for the introduction. So today I will present Robust Spectral Inference for Joint Stochastic Matrix Factorization and Topic Modeling. This is joint work with David Bindel and David Mimno, professors at Cornell University.
Before going into the outline, I will briefly talk about the difference between several ways of learning models. In machine learning, in order to learn the model parameters, we have roughly two different methods. One is likelihood-based training. We first choose a proper likelihood estimator -- of course, there are a lot of different estimators, like pseudo-likelihood, maximum likelihood, and MAP -- and then we form a likelihood function in terms of the model parameters and find the best parameters, usually via optimization.
But there is another class of methods called the method of moments: first relate population moments to the model parameters, and then estimate the population moments from samples. It usually involves solving multiple equations. If I compare these two approaches to learning parameters, solving for the likelihood usually uses optimization, which requires multiple iterations and makes the algorithm slow, whereas with the method of moments we are solving closed-form equations and the speed is relatively fast.
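As a toy illustration that is not from the talk: for a Gamma distribution, the mean and variance give two moment equations in the two parameters, so the estimates come from solving them in closed form rather than from iterative optimization:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.gamma(shape=3.0, scale=2.0, size=100_000)   # synthetic sample

    # population moments: mean = shape * scale, variance = shape * scale**2
    mean, var = x.mean(), x.var()
    scale_hat = var / mean          # closed-form solution of the moment equations
    shape_hat = mean ** 2 / var     # no likelihood, no iterative optimization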
In terms of estimation quality, unless the likelihood function is designed to be strictly convex, it's not always optimal, whereas the method of moments is statistically consistent. What I mean by statistically consistent is that if we get more and more samples, it always converges to the right estimate. But whenever there's a mismatch between our model and the real data, likelihood-based methods have intrinsic power to manage the model mismatch, whereas it's unclear how to do that with the method of moments.
So one of the biggest focuses of this talk is how to handle model mismatch in a specific problem related to topic modeling. And the method I'm interested in in this paper uses both matrix algebra and probabilistic inference together in the same framework.
Then I will briefly explain latent Dirichlet allocation. We have two Dirichlet hyperparameters for the multinomial distributions, and for each document and position we first sample a topic from the topic distribution of that document. Based on that topic, we sample the word. This is how the topic model explains document generation.
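A minimal sketch of that generative story in Python with numpy; the hyperparameter values, sizes, and names below are illustrative assumptions, not anything from the slides:

    import numpy as np

    rng = np.random.default_rng(0)
    V, K, doc_len = 1000, 5, 50            # vocabulary size, number of topics, words per document
    alpha, beta = 0.1, 0.01                # assumed Dirichlet hyperparameters

    topic_word = rng.dirichlet(beta * np.ones(V), size=K)   # word distribution for each topic

    def generate_document():
        doc_topics = rng.dirichlet(alpha * np.ones(K))      # topic distribution of this document
        words = []
        for _ in range(doc_len):
            z = rng.choice(K, p=doc_topics)                 # sample a topic for this position
            words.append(rng.choice(V, p=topic_word[z]))    # sample a word given that topic
        return words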
If I propose the corresponding view in terms of matrix factorization, it looks like this. We have a word distribution for each topic, and a topic distribution for each document -- the nth document. And because these are distributions, each matrix is column stochastic, which means the entries of every column sum to one.
So let's say we currently have only one document on the right-hand side, and in this first document the first topic is the biggest topic, so it contributes a lot. Then what we expect to observe at the word level is a duplication of these words: the first word and the fourth and fifth words. And if the second document contains the second and fourth topics a lot, then the expected document looks like the sum of these words and those words. But in reality the model never coincides exactly with the real data, so we are observing this noise here.
Now I'm explaining the generative basis for joint stochastic matrix factorization. Compared to the previous LDA model, we now have a pair of topics and a pair of words here. The pair of topics is sampled from the topic-topic distribution, and then based on that pair of topics we sample the words; the words are of course related to the word-topic distribution here. The corresponding matrix factorization view looks like this. Now we have a topic-topic matrix, where the sum of all entries is equal to one, so this is a joint stochastic matrix, and we have the word-topic matrix here and its transpose here. What we observe is this word-word co-occurrence matrix.
So in reality what we observe contains some degree of noise, like this. The goal in this paper is to decompose what we observe in this fashion. Okay, so I will explain why we are using a second-order method rather than a first-order method. The reason, which is proven in Arora et al. 2012, is that unfortunately the first-order statistics -- simple word occurrences in each document -- are far from the ideal ones. But fortunately the second-order word-word co-occurrence matrix converges well to the ideal word-word co-occurrence. This is proved mathematically.
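A hedged sketch of how such a word-word co-occurrence matrix can be formed from documents, following the general recipe in Arora et al. (2012); the exact normalization used in the paper may differ from this simplified version:

    import numpy as np

    def cooccurrence(docs, V):
        """docs: list of documents, each a list of word ids in [0, V)."""
        C = np.zeros((V, V))
        used = 0
        for doc in docs:
            counts = np.bincount(doc, minlength=V).astype(float)
            n = counts.sum()
            if n < 2:
                continue                        # a single-word document has no pairs
            # expected pair counts: off-diagonal c_i*c_j, diagonal c_i*(c_i - 1)
            pairs = np.outer(counts, counts) - np.diag(counts)
            C += pairs / (n * (n - 1))          # each document contributes total mass 1
            used += 1
        return C / max(used, 1)                 # joint stochastic: all entries sum to one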
So the goal is to decompose the noisy observation of our co-occurrence matrix into that B A B-transpose form, and what is called inference here is to recover the word-topic matrix, B. In order to do that, we are going to utilize something called the separability assumption. What I mean by that: in typical non-negative matrix factorization the goal is to minimize the difference between C and B A B-transpose, usually in terms of the Frobenius norm. But merely doing that will not give us a great result, and it usually produces unrecognizable topics, because there is no identifiability guarantee. In some sense the separability assumption, which I'll explain on the next slide, will provide that identifiability for this problem. So again, just minimizing the difference between C and B A B-transpose is not the goal of this paper.
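For reference, a tiny sketch of that plain objective in Python (the names are assumptions); minimizing only this, with no separability assumption, is exactly what the talk says is not enough:

    import numpy as np

    def approximation_error(C, B, A):
        # Frobenius-norm difference between the observed co-occurrence C
        # and its factorized approximation B A B^T
        return np.linalg.norm(C - B @ A @ B.T, ord="fro")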
So what is separability? The separability assumption: each topic k has a specific anchor word s_k which is exclusive to that topic. What I mean by exclusive is described here: given that topic, we have a positive probability of observing its anchor word, whereas given a different topic there is zero probability of observing that specific word. This implies that not every document about topic k must contain that anchor word, but every document which does contain that anchor word tells us at least something about that topic.
Then the anchor word defined on the previous slide corresponds to this red block here. This red block is entirely dedicated to the first topic, because there is no probability of seeing this word -- the fifth word -- given any other topic. So we have three anchor words in this picture.
>>: So this assumption is valid?
>> Moontae Lee: This assumption is valid.
>>: Is this assumption valid or not?
>> Moontae Lee: So in real data it's, of course, not always valid. But without this assumption, as I said, the decomposition doesn't have identifiability. Usually in all this unsupervised learning we would like to identify the mixtures -- different mixtures, topic mixtures, for example, in this problem -- and those mixtures are not sufficiently separated from each other without this assumption.
>>: But this will group some different topics into one, right?
>>: No, this actually separates different topics more distinctively.
>>: But if the assumption doesn't hold, then it's possible that you see the same anchor word with several different topics, and those different topics -- you may group those together as well, right?
>> Moontae Lee: It's different, because the topic is not observed; the topic is a hidden variable. So if we have this assumption, what we would like to do in the inference is to create topics which satisfy this assumption as much as we can. And because these anchor words are exclusive to one topic, it actually tries to learn topics that are as separated as possible. The topic is not something we can observe.
So if I reorder these and push all these red blocks to the top, then the B matrix contains a diagonal matrix here, which is called D in this notation. Basically, this decomposition can be rewritten into this block diagonal matrix form, and there is already one interesting correspondence: if I look at this block after the renumbering, then D A D-transpose corresponds to a certain submatrix of the co-occurrence matrix.
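A minimal sketch of that correspondence, under the ideal model: the anchor rows of B form a diagonal block D, so the anchor-anchor block of C equals D A D-transpose, and A can be read off by rescaling that block. The function and variable names are assumptions for illustration:

    import numpy as np

    def recover_topic_topic(C, anchors, B):
        """C: V x V co-occurrence, anchors: K anchor word ids, B: V x K word-topic matrix."""
        d = np.array([B[s, k] for k, s in enumerate(anchors)])  # diagonal entries of D
        C_SS = C[np.ix_(anchors, anchors)]                      # anchor-anchor block of C
        return C_SS / np.outer(d, d)                            # undo the D ... D^T scaling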
>>: So this looks like an extension of the original topic model, two branches? Or is it the same topic model with a different mode?
>> Moontae Lee: It's not exactly the same topic model, because we no longer have the prior for the word-topic matrix part. But it actually subsumes, to some degree, those older LDA variants. So you will see.
>>: I see. Okay. Okay.
>>: I want to know if I missed something. So can't there be two
anchor words in the same column?
>> Moontae Lee: Two anchor words in the same column, in here?
>>: Yeah.
>> Moontae Lee: No, there's no way to get two anchor words in the same column, because the anchor word assumption means that in those rows there's only one activated cell.
>>: Right.
>> Moontae Lee: And the number of those rows is also a hyperparameter.
>>: But I mean the condition says that you have to be the only
non-white thing in your row. Doesn't say anything about your column.
So I don't see why you can't have two red ones. The last row could be
identical to the next to last row.
>> Moontae Lee: So, again, this is not something we can observe; we cannot observe B. What we can observe is only C. Then, as a user input -- for example, in this picture -- the user says, I'd like to learn a topic model with three topics. Then we are going to decompose this matrix assuming there are three certain blocks in A.
>>: Okay. So it's just another requirement, the anchor word?
>> Moontae Lee: It's not another requirement. That's how the
inference goes in terms of the directionality. Now, you will see the
details.
>>: I don't expect them to help me but I will see it.
>>: Does that mean that even if you have two anchor words for a topic you can ignore one -- is one anchor word enough?
>> Moontae Lee: No. Because if you would like to separate the dataset
into 10 different mixtures, you need 10 anchor words.
>>: Basically that's all I'll ask for. Because you're assuming you
have a diagonal [indiscernible] at the top.
>> Moontae Lee: Yep.
>>: Which means for each column you only have one as well.
>> Moontae Lee: Yep.
>>: The parameter, is it possible that for some topic, for some
topics, you have more than one anchor word?
>> Moontae Lee: That is not allowed in this model.
>>: So this is not allowed.
>> Moontae Lee: Yep. So every topic must have only one anchor word.
In 2014 there was a theoretical extension in which every topic has multiple different anchor words. I haven't seen any real implementation of that paper, but the extension exists. So if I have anchor words, this is one of the main inference steps. C-bar, the row-normalized version of the word-word co-occurrence matrix, is now a conditional probability: given that we observe one word, what is the probability of observing another word? Because of this equation, these relations will hold in the end: each row of the normalized co-occurrence matrix will be a convex combination of the rows corresponding to anchor words, the sum of these coefficients will be equal to one, and the coefficients correspond to the word-topic matrix that we've seen on the previous slide.
So assuming we somehow know the anchor words, the rest of the inference is just to figure out these coefficients. Assume we have somehow learned those anchor words; then solving the previous equation is just solving a non-negative least squares problem under a simplex constraint, and there are many different methods -- exponentiated gradient, among others -- and it is easily parallelizable over words. That's one of the biggest benefits. And then the rest of the inference becomes a very simple Bayes rule. This will be one entry in the B matrix, the (i, k) entry, and once we know the convex coefficients from those methods, all these entries can be rewritten based on Bayes' rule.
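A hedged sketch of these two steps -- simplex-constrained least squares by exponentiated gradient, then Bayes' rule -- in Python. The step size, iteration count, and function names are assumptions, and this is a simplified rendering rather than the authors' implementation:

    import numpy as np

    def exp_grad_simplex(x, S, n_iter=500, eta=50.0):
        """Minimize ||x - p @ S||^2 over the probability simplex (p >= 0, sum p = 1).
        x: one row of the normalized co-occurrence (length V); S: K x V anchor rows."""
        K = S.shape[0]
        p = np.full(K, 1.0 / K)
        for _ in range(n_iter):
            grad = 2.0 * S @ (p @ S - x)
            g = -eta * grad
            p = p * np.exp(g - g.max())   # multiplicative update, shifted for stability
            p /= p.sum()                  # renormalize onto the simplex
        return p

    def recover_word_topic(C_bar, anchors, word_prob):
        """C_bar: row-normalized co-occurrence; word_prob: unigram p(w).
        Returns B with B[i, k] = p(w = i | z = k)."""
        S = C_bar[anchors]                # K x V rows for the anchor words
        coeff = np.array([exp_grad_simplex(row, S) for row in C_bar])  # p(z = k | w = i)
        B = coeff * word_prob[:, None]    # Bayes' rule: p(w|z) proportional to p(z|w) p(w)
        return B / B.sum(axis=0, keepdims=True)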
So finding anchor words really matters. At the beginning, Arora et al. 2012 tried to solve a lot of LPs: basically pick one row of the co-occurrence matrix and see whether it can be reconstructed by all the other rows. That is a pretty exhaustive method, and it empirically doesn't work at all. Then people used QR with row pivoting, which is a very famous method in matrix algebra: pick one extreme point of the row-normalized co-occurrence matrix as the initial anchor, project every other row down to the orthogonal complement of that vector, choose the farthest point, and repeat this process again and again until we find K anchors. Of course, these K anchors will never recover the rest of the rows perfectly. What we are doing is just approximately finding the best K rows, and this is a greedy method, so it's not even the best, but it's something we can do in a manageable fashion.
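A minimal sketch of this greedy step in Python, in the spirit of QR with row pivoting: repeatedly take the row farthest from the span of the anchors found so far and project the rest onto its orthogonal complement. This is a simplified version under stated assumptions, not the paper's exact code:

    import numpy as np

    def find_anchors(C_bar, K):
        """Greedily pick K extreme rows of the row-normalized co-occurrence C_bar."""
        Q = C_bar.astype(float).copy()       # working copy; rows get projected each round
        anchors = []
        for _ in range(K):
            norms = np.linalg.norm(Q, axis=1)
            s = int(np.argmax(norms))        # farthest remaining point is the next anchor
            anchors.append(s)
            u = Q[s] / norms[s]              # unit vector along the chosen row
            Q -= np.outer(Q @ u, u)          # project all rows onto its orthogonal complement
        return anchors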
So the benefits of this anchor word algorithm: as you could imagine, the whole inference process is deterministic. There's no random initialization, and no funky behavior at all. Once we construct this noisy co-occurrence matrix from the real data, we no longer touch the documents at all; those are the only statistics we need for the inference. And we produce an anchor word exclusively dedicated to each topic, which might bring some interpretability. So what's the problem then?
This always happens in machine learning: the real data never follows the model. In reality, co-occurrences with rare words make sparse rows. If a very rare word appears one or two times throughout the documents, the co-occurrences with that word are extremely rare, which makes a sparse row. But in terms of matrix algebra, those rows look like strange, eccentric points of the co-occurrence space, so the QR with row pivoting algorithm prefers to select those rows. So anchor words get selected as too-rare words, and the co-occurrences with those anchor words become noisy statistics. Even worse, the co-occurrence between those anchor words is usually diagonally dominant, and since one anchor word corresponds to one topic, if this becomes diagonally dominant we cannot capture any interaction between topics. This is a serious problem. As a result, all the previous work usually used a manually crafted document frequency cutoff: they just set some threshold -- in order for a word to be an anchor word it must appear in at least five different documents, or ten, or sometimes 100 different documents. Then they measure the held-out likelihood again, although in order to measure the held-out likelihood we need to finish the inference process entirely, and then, oh, it doesn't look right, so they change the threshold and measure the likelihood again. It's an extremely painful process. Even with the document frequency cutoff, if the number of topics is very small, they learn garbage topics and cannot capture interaction between topics at all because of this problem, and even if K is high, the topic quality is poor. This was discovered only recently because the original two papers both used synthetic datasets which follow the model pretty well.
But in reality, on real data, the topic quality is very poor, and compared to probabilistic inference like Gibbs sampling, the entire inference quality is far inferior. I'll show sample topics from the original algorithm, which is called greedy. This is the New York Times corpus, which is popular in this field. As you can see, none of the anchor words are, yeah, understandable. So we had a toy experiment in a previous [indiscernible] paper in which we just compressed this co-occurrence matrix using either PCA or t-SNE (t-distributed stochastic neighbor embedding), and we realized it gives us much better anchor words and even increases the held-out likelihood. But we still couldn't explain well why these work, and explaining that is one of the purposes of this new paper.
So I will show some visual examples of those anchor words. This is a 2-D or 3-D projection of a small Yelp corpus. It illustrates the word co-occurrence space, and the anchor words correspond to the vertices of the convex hull, because, as you might remember, after getting the anchor words the goal is to learn the coefficients that express all the other words as convex combinations of the anchor words. And interestingly, these anchor words correspond to certain topics of the Yelp data. We can also do a 3-D projection -- while it looks messy, there's "homeless" and "enchilada" here.
So what we've done in the new paper coming to NIPS is to study extensively the mathematical structure of the co-occurrence matrix. While I will skip it entirely here, it has a lot of probabilistic and statistical structure; you can see that in the paper. What I'd like to articulate in this presentation is the geometric structure of this C. C must be low rank and at the same time doubly nonnegative. A doubly nonnegative matrix is a category of matrix which is entrywise nonnegative and positive semidefinite. By definition C must be joint stochastic, which means the sum of all entries is equal to one, and it is low rank because documents are believed to be generated from a small number of topics. So C needs to satisfy these four different structures at the same time, on top of the probabilistic and statistical structure -- obviously, C must satisfy a lot of different conditions at once. And this is a very rough proof of why C must be positive semidefinite; I will skip it.
So C needs to be this, this, and this. In order to make C satisfy all those conditions, we perform alternating projection: project C down to the low-rank positive semidefinite space first, then project down to the cone of normalized matrices, then the cone of nonnegative matrices, and repeat this process again and again. The first projection is easily achievable by a truncated eigenvalue decomposition. Rather than doing the full eigenvalue decomposition, which is clearly painful if the vocabulary size is high, we just get the K largest eigenvalues -- for example, by using the power method -- and reconstruct C based on those eigenvalues. The orthogonal projection onto the cone of normalized matrices is given by this; briefly, the intuition is that the sum must be equal to one, so we compute the difference between the current sum and the ideal sum, take the entrywise average, and basically penalize or reward each entry based on that.
And the projection onto the cone of nonnegative matrices is pretty easy. So the algorithm alternates these three projections again and again until convergence.
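A hedged sketch of one sweep of those three projections in Python; the exact form of the normalization projection in the paper may differ from the uniform shift used here, and a power method would replace the full eigendecomposition at scale:

    import numpy as np

    def rectify_step(C, K):
        # (1) nearest rank-K positive semidefinite matrix: keep the K largest
        #     eigenvalues, clipped at zero (full eigh used here for clarity)
        vals, vecs = np.linalg.eigh((C + C.T) / 2)
        top = np.argsort(vals)[-K:]
        vals_k = np.clip(vals[top], 0.0, None)
        C = (vecs[:, top] * vals_k) @ vecs[:, top].T
        # (2) projection onto matrices whose entries sum to one:
        #     spread the missing mass uniformly over all entries
        V = C.shape[0]
        C = C + (1.0 - C.sum()) / (V * V)
        # (3) projection onto entrywise nonnegative matrices
        return np.clip(C, 0.0, None)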
However, this alternating projection algorithm no longer guarantees convergence to the global optimum. That's because, whereas the first two cones -- normalized matrices and nonnegative matrices -- are convex cones, the positive semidefinite set with low rank is no longer convex, even though the plain positive semidefinite cone is a convex cone.
However, AP instead enjoys local linear convergence. There's a mathematical proof in the paper; roughly speaking, the intuition is that the set of rank-K matrices still forms a not-bad shape, namely a smooth manifold, and the intersection of a convex cone and a smooth manifold is still a smooth manifold almost everywhere. So as long as our estimator is not too far from the convergence point, it is guaranteed to converge. This is one of the theorems from Adrian Lewis's 2009 paper.
So the rectified anchor word algorithm that we propose in the paper consists of five different procedures. First, construct the noisy but unbiased estimator of the word-word co-occurrence. Then rectify it using alternating projection, so that C will satisfy all the structure of the ideal C. Then find the anchor words in the rectified co-occurrence. Recover the word-topic matrix by probabilistic inference with Bayes' rule, and recover the topic-topic matrix, A, based on the block diagonal matrix decomposition you saw in the earlier slides. So it contains pretty simple procedures, and all of these procedures are deterministic, which is of course a clear benefit of this method.
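Putting the five steps together, a minimal end-to-end sketch that reuses the helper functions sketched earlier in this transcript; the number of projection rounds and the names are assumptions:

    import numpy as np

    def rectified_anchor_words(docs, V, K, ap_rounds=30):
        C = cooccurrence(docs, V)                          # 1. noisy but unbiased estimator
        for _ in range(ap_rounds):                         # 2. rectify by alternating projection
            C = rectify_step(C, K)
        word_prob = C.sum(axis=1)                          # unigram p(w) implied by C
        C_bar = C / np.maximum(C.sum(axis=1, keepdims=True), 1e-12)  # row-normalize
        anchors = find_anchors(C_bar, K)                   # 3. anchors on the rectified space
        B = recover_word_topic(C_bar, anchors, word_prob)  # 4. word-topic matrix via Bayes' rule
        A = recover_topic_topic(C, anchors, B)             # 5. topic-topic matrix from the anchor block
        return B, A, anchors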
So I will show the results: what the rectified co-occurrence looks like. This is a two-dimensional visualization of the co-occurrence space. This is from the original algorithm; by the way, the dataset is the NIPS dataset, and the goal is to find five different anchors for this dataset. Each dot corresponds to a word, and this is the word-word co-occurrence space. If I run the original greedy method with row pivoting, it chooses five extreme vertices like this.
Whereas if I rectify the space like this and then choose five different anchors on the rectified space, it corresponds to these five points. And if I map these five points back into the original space, it looks like this.
So as you can see, the coverage is much higher, and it can explain the rest of the words as convex combinations better. This is a typical topic result. The dataset is NIPS, and again I have five different topics.
With the original Arora et al. algorithm, if I run the code based on their algorithm, this is the resulting set of topics. Everything looks like neuron, layer, hidden, and basically each topic agrees with the unigram distribution of the corpus. In general, in the world of topic modeling, if the topic inference does not work well, the topics simply mimic the unigram distribution; that's typical behavior. And if I run Gibbs sampling, it gives pretty good topics: one is about the neuron and cell topic, or control theory and reinforcement learning, or speech recognition and neural network stuff, or Gaussian approximation stuff.
And if we run the code with our algorithm, it's pretty similar to probabilistic LDA even with this small number of topics. If you actually increase the number of topics drastically, like 200, there's still a chance to cover all these vertices eventually, so the topic inference quality becomes better. But if the number of topics is not large enough, there's a large chance that the original algorithm fails because it selects too-eccentric, rare words, which are not statistically stable. And this illustrates the topic-topic interaction, which is the second part of the inference.
This comes from the original Arora et al. method. What they did -- which I haven't described -- is, after finding the word-topic matrix B, they multiply the pseudo-inverse of B on the left-hand side and the right-hand side of the co-occurrence matrix (on the right-hand side, the transpose of the pseudo-inverse), and it gives this result. Of course, this is wrong, because some entries are even negative, and there's no way for a probability to become negative, and some entries are even beyond one.
The sum is close to one; that's because there's an algebraic property that forces it. And this is the other method that we use in our paper, multiplying those diagonal submatrices on the left-hand side and the right-hand side. Of course, that is such a simple method that there's no way the original authors didn't try it. But if we actually try that diagonal matrix method to recover the topic-topic interaction without rectification, it looks like this: it's entirely diagonally dominant because of the reasons I explained before. Again, anchor words are likely to be selected as very rare words, the co-occurrences between anchor and anchor are extremely rare, and statistically they are not good statistics at all. That makes these diagonal matrices, which cannot capture the topic interaction at all. Whereas this is the result from the rectified anchor algorithm: it captures the topic interaction pretty reasonably if you actually match each topic with the previous topics we learned here; these values turn out to be reasonable. And this plot draws the overall quality in just one big image. So we are not only testing our method on document collections -- namely NIPS and New York Times.
We can also apply this model to movie and song data. In the movie data, each word corresponds to a movie, and each document corresponds to the collection of movies that a user watched. In the song data, each song corresponds to a word, and the playlist that each [indiscernible] plays frequently is the document. And we currently have six different measures.
Recovery indicates how well the rows corresponding to anchor words reconstruct the rest of the rows. The approximation error is the Frobenius norm difference between the original co-occurrence matrix and the decomposed, factorized matrices; basically, the approximation error is the traditional metric for measuring how successful matrix decompositions are. Dominancy is how diagonally dominant the topic-topic matrix is. Specificity is how much each topic differs from the unigram distribution, dissimilarity is how well each topic is separated from the others, and coherence is how well each topic coincides with the documents. So while this graph may look a little bit messy, the things we have to focus on are AP, Gibbs, and the original baseline method.
If you look at the original baseline method, for example in approximation, the approximation error is pretty high, but for AP, the alternating projection method, or for the Gibbs sampling method, the errors are pretty low, and this behavior agrees across different datasets. Also in the recovery error, as you have already seen in the word-word co-occurrence figure, after the alternating projection the recovery becomes far better. And if you look at the other measures, AP and Gibbs follow pretty similar trajectories -- they are comparable to each other, which means we finally achieve results comparable to probabilistic inference. If you look at the original baseline method, it is far from the probabilistic inference or the AP method.
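For concreteness, a hedged guess at how a dominancy-style diagnostic could be computed from the recovered topic-topic matrix A -- the fraction of probability mass sitting on the diagonal; the paper's exact definition may differ:

    import numpy as np

    def dominancy(A):
        # close to 1 means A is nearly diagonal and captures no topic interaction
        return np.trace(A) / A.sum()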
>>: So what is the right criterion to judge --
>> Moontae Lee: Which topic is better.
>>: Yeah, which is better.
>> Moontae Lee: That's one of the questions which always pops up in topic modeling. This is unsupervised clustering; there's no clear way to judge which is better. So people tend to introduce all these different metrics, like the ones I suggested, and also do human validation at the same time.
>>: So [indiscernible].
>> Moontae Lee: So basically you look at this; it's not that hard. And some researchers have designed other metrics based on this result -- how frequently each word occurs across different topics, and so on. But those are usually subsumed by the metrics that I illustrated in the plots.
>>: So in other tasks, I saw that people used perplexity.
>> Moontae Lee: Yes, perplexity is a held-out likelihood measure.
>>: Did you use it?
>> Moontae Lee: We actually did that, though we didn't include it in this paper. Because the recovery error drastically goes down, the held-out likelihood increases a lot.
So, the conclusion. We studied the various mathematical properties of the co-occurrence statistics, and this might be exciting because, as all of you know, word embedding methods are all based on word co-occurrence. While they don't come from these topic modeling assumptions, if we assume there are some clusters of words in natural language, there might be exciting mathematical structures which are desirable for certain embeddings; of course, those will be different based on which tasks we're tackling. We developed a principled way to rectify the noisy co-occurrence rather than exploring the document frequency cutoff again and again exhaustively. And based on this method we can learn quality topics even if K is very small -- you've already seen the K equals 5 example. Another example in our paper is on the movie data, with K equal to 15. If you run the original anchor word algorithm, I think Pulp Fiction appears across all 15 topics as a top movie, and the second one -- I forgot the name of that movie -- those two movies are always top items across every topic. Whereas what we learned has exciting clusters, like a Lord of the Rings cluster, a Star Wars cluster, and a Walt Disney cluster, and if you run the Gibbs sampling method, it gives pretty comparable results. And we quickly learned the topic interactions in a stable and efficient manner, and as I said we achieved comparable results. So this talk is based on these two papers, one published last year and one coming soon this year.
And while I haven't prepared slides for it, we are doing several exciting extensions in multiple directions. As some of you might already realize, topic modeling contains two different inferences. One is the word-topic inference, which is of course the main inference: how each topic is represented by a distribution over words. That part is included in this algorithm. But the secondary inference -- what's the proportion of topics for each document -- is entirely missing from this algorithm. That is a currently ongoing experiment, and interestingly the original authors at Princeton, like Sanja [indiscernible] and [indiscernible], and the authors at Cornell are all collaborating on it. Another exciting extension will be anchored author-topic inference. Rather than viewing everything as a hidden variable, note that a document sometimes has a footprint of its authors -- for example, all these papers have a list of authors -- and another assumption, which adds one more layer to the generative story, is that each author is interested in certain topics. So an author has a topic distribution, and based on that the observed words are decided. Basically, some authors are more interested in certain topics; for me, I'm more interested in probabilistic methods, and based on that, the words I use frequently are decided -- for example, MAP estimation or Bayesian. That is another ongoing extension. And there is an entirely different direction around privacy issues: the co-occurrence matrix C is large. If the vocabulary size is just 10,000, it's going to be a 10,000 by 10,000 matrix, and saving that in memory is painful; usually a natural language vocabulary is 100,000. So how to store it and do the rectification step without explicitly storing all those entries is an exciting question, and even more, how to store it efficiently and do the rectification without violating privacy. Those are all exciting extensions and
future work. Yep. So, again, wrapping up the presentation: this is a new inference combining the probabilistic method and the spectral method. After forming the co-occurrence matrix, which is a second-order moment matrix, we find the anchor words on that matrix, and then, based on those anchor words, the whole inference process is deterministic and transparent. And the exciting part, I think, which you can take home for future work: the method of moments is pretty sensitive to the sample. If the sample is not enough, the estimation result is pretty bad, and there is also no intrinsic ability to handle model mismatch. However, if those are solved, the result from the method of moments is easy to compute efficiently, and we can plug that result into the original probabilistic inference. For example, Gibbs sampling needs a burn-in process, and no one knows how many iterations we need to run at the beginning to get a good result. But if we actually plug this result in as the initial value for Gibbs sampling, it shows amazing results which never appeared even in hundreds of thousands of iterations, which is very exciting. Usually all these likelihood functions in likelihood-based methods are never convex and there are a lot of modes inside, and of course we cannot find a good initialization point because the parameter space is high dimensional. This is a really good way to give a good initialization, so combining those two methods seems highly promising. As for other people's work, there are several other variations. Rather than using the second moment matrix, some people use a third-order tensor, which is word-word-word co-occurrence, and that tensor decomposition stuff is also another way to do topic inference, though they need to assume each topic is independent of the others -- but, yeah, that's another direction, which is done by Anandkumar. So this is the end of the talk.
[applause]
>> Li Deng: Questions? Okay.
>> Moontae Lee: Pretty short. I finished in 30 minutes. Thanks.