Latent Semantic Indexing and probabilistic Latent Semantic Indexing

Ingmar Schuster
Patrick Jähnichen
Institut für Informatik
Structure of text corpora
A corpus consists of a set of documents
Every document consists of a set of words
The whole corpus has a vocabulary: the set of different words it contains
The length of the corpus is the total number of words in all its documents
The vector space model
Documents are represented by vectors
Each component in the vector represents a separate term
Vectors are frequency vectors, i.e. if a document contains term t,
component t of the vector is the frequency of t in the document, 0 otherwise
Size of the vector is the size of the vocabulary
We can use vector similarity measures to compare
documents, e.g. the cosine similarity
often, tf-idf weights are used instead of raw frequencies
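As a minimal sketch of comparing two documents in the vector space model, the following computes the cosine of the angle between two frequency vectors. The vocabulary and the three toy documents are made up for illustration, and returning 0 for an empty document is a convention chosen here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_u * norm_v)

# Hypothetical vocabulary: ["latent", "semantic", "index", "matrix"]
doc_a = [2, 1, 0, 0]
doc_b = [4, 2, 0, 0]  # same term proportions as doc_a, twice as long
doc_c = [0, 0, 3, 1]  # shares no term with doc_a

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
```

Because the measure depends only on the direction of the vectors, doc_a and doc_b come out as maximally similar although their lengths differ, while doc_a and doc_c share no term and get similarity 0.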
Aside: tf-idf measure
reflects the importance of a word for a document
commonly used weighting factor for terms in a document
Result: a weighted frequency vector
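tf is the word frequency in a document, often a normalized
variant is used, e.g. the raw count divided by the document length
idf is the inverse document frequency, i.e. the ratio of the
number of documents in the corpus and the number of
documents containing this term, usually log-scaled
the tf-idf weight of a term in a document is the product tf · idf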
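The weighting can be sketched as follows. This is one common variant and an assumption of this example, not necessarily the one meant on the slide: tf is the count divided by the document length, and idf is the natural logarithm of the document ratio. The function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # normalized tf times log-scaled idf (assumed variant)
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [["latent", "semantic", "indexing"],
        ["latent", "variable", "model"],
        ["latent", "semantic", "analysis"]]
w = tf_idf(docs)
```

A term that occurs in every document ("latent") gets weight 0, while a term unique to one document ("variable") gets the highest weight in that document.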
Multinomial distribution
generalisation of the binomial distribution
binomial distribution: probability distribution of the number of
“successes” in n independent Bernoulli experiments
Bernoulli experiment: experiment with two complementary
outcomes, often termed “success” and “failure”, with probability
p for success and 1-p for failure
the distribution of a single such experiment is termed the Bernoulli distribution
in each of the n Bernoulli experiments, “success” has the same
probability p
Multinomial distribution
categorical distribution is the analogue of the Bernoulli
distribution for more than two outcomes
an experiment results in exactly one of K possible outcomes
outcomes have probabilities $p_1, \dots, p_K$, where $\sum_{k=1}^{K} p_k = 1$
let $X_k$ be a random variable counting how many of n
experiments have outcome k, then the vector $X = (X_1, \dots, X_K)$
follows a multinomial distribution with parameters n and $p = (p_1, \dots, p_K)$
in natural language processing tasks, we often speak of
multinomial distributions when actually referring to categorical
distributions
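The relation between the three distributions can be checked numerically. The sketch below evaluates the standard multinomial probability mass function $\frac{n!}{x_1! \cdots x_K!} \prod_k p_k^{x_k}$; the function name `multinomial_pmf` and the example numbers are chosen here for illustration.

```python
import math

def multinomial_pmf(counts, probs):
    """P(X_1 = counts[0], ..., X_K = counts[K-1]) for n = sum(counts) experiments."""
    n = sum(counts)
    coeff = math.factorial(n)
    for c in counts:
        coeff //= math.factorial(c)  # stays an integer at every step
    p = float(coeff)
    for c, pk in zip(counts, probs):
        p *= pk ** c
    return p

# K = 2 recovers the binomial: 2 successes in 3 fair coin flips
binomial_case = multinomial_pmf([2, 1], [0.5, 0.5])

# n = 1 recovers the categorical: a single draw with outcome 2 of 3
categorical_case = multinomial_pmf([0, 1, 0], [0.2, 0.5, 0.3])
```

With n = 1 the count vector has a single 1, and the probability is just the probability of that outcome, which is why the two names are often used interchangeably in language modelling.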
Conditional probabilities
the probability of observation x under the condition that
observation y has occurred before is the conditional probability
of x given y: $P(x \mid y) = \frac{P(x, y)}{P(y)}$
if x and y are independent then $P(x, y) = P(x)\,P(y)$
and therefore $P(x \mid y) = P(x)$
Bayes' theorem
Bayes' theorem states that $P(x \mid y) = \frac{P(y \mid x)\,P(x)}{P(y)}$
in words: posterior = likelihood × prior / evidence
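A small numeric sketch of the theorem: given a prior over two topics and the probability of seeing a particular word under each topic, it computes the posterior over topics after the word has been observed. The topics, the word and all numbers are invented for this example.

```python
# Prior P(topic), hypothetical values
prior = {"sports": 0.5, "politics": 0.5}
# Likelihood P(word = "goal" | topic), hypothetical values
likelihood = {"sports": 0.08, "politics": 0.02}

# Evidence P(word), obtained by summing over all topics
evidence = sum(likelihood[t] * prior[t] for t in prior)

# Posterior P(topic | word) = likelihood * prior / evidence
posterior = {t: likelihood[t] * prior[t] / evidence for t in prior}
```

The posterior sums to 1 by construction, and the topic under which the word is more likely gains probability mass relative to the prior.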
Bag-of-words assumption
we neglect the order of words in a document
i.e. a document can be interpreted as a “bag“ full of words
for each word, only its frequency is stored
information on type and frequency of words allows
conclusions about the structure of text
underlying assumption: exchangeability of the words in a document
de Finetti's theorem: an infinite sequence of exchangeable
random variables follows a mixture distribution, often an
infinite one
Singular Value Decomposition
a method from linear algebra
is a method of factorization of a matrix: $A = U \Sigma V^T$
the diagonal elements of $\Sigma$
are the singular values of $A$
these are specific properties of a matrix, comparable to eigenvalues
for every matrix, at least one singular value decomposition exists
Singular Value Decomposition - Applications
Principal Component Analysis (PCA)
structuring data sets by
approximating many variables
by just a few linear combinations
(the principal components)
also called Karhunen-Loève transformation
image compression
image (a matrix of color values) is factorized
only singular values significantly higher than 0 are kept
the image is reconstructed from the kept singular values and their singular vectors
this is a lossy compression
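The compression idea can be sketched with NumPy's `numpy.linalg.svd`. A random matrix stands in for a grayscale image here, and the number k of kept singular values is picked arbitrarily for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 6))  # stand-in for a matrix of gray values

# Thin SVD: image = U @ diag(s) @ Vt, with s sorted in decreasing order
U, s, Vt = np.linalg.svd(image, full_matrices=False)

k = 2  # number of singular values kept (arbitrary choice)
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The information lost is exactly what the discarded singular values carried:
# the Frobenius error equals the root of their summed squares.
error = np.linalg.norm(image - approx)
expected_error = np.sqrt(np.sum(s[k:] ** 2))
```

Storing only the k kept singular values and the matching columns of U and rows of Vt needs k · (8 + 6 + 1) numbers instead of 8 · 6, which is where the compression comes from.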
Latent Semantic Indexing – LSI
Latent Semantic Analysis – LSA
LSA (Deerwester et al.)
a linear factorization of term-document matrices
these are factorized into three matrices using singular value
decomposition
all but the n highest singular values are set to 0
the term-document matrix is reconstructed
the reconstructed matrix has a lower rank now (rank = n)
LSA – an example
LSA – term-document matrix
LSA – singular value decomposition
LSA – reconstructed matrix
there is no zero/one decision
we reduce the dimensionality to n dimensions (“semantic space”)
this is a lossy method, some information is lost
no underlying statistical model, so how to justify this?
the least-squares fit of the SVD implicitly assumes Gaussian noise on
term frequencies, but term frequencies are counts, which are better
described by a Poisson distribution
how to choose n?
problems with polysemous words (each word gets one vector, regardless of its meanings)
LSA – geometric interpretation
we reduced to n dimensions, these span the “semantic space”
we have word-dimension matrix U and document-dimension
matrix V
in U
angle between word vectors
interpreted as semantic similarity
allows semantic clustering of words
in V
angle between document vectors
allows clustering of similar documents
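The geometric reading can be tried out on a toy term-document matrix. Everything in this sketch is invented for illustration: the five terms, the four documents and the choice n = 2. Scaling the vectors by the singular values before comparing them is one common convention, not the only one.

```python
import numpy as np

terms = ["car", "automobile", "engine", "flower", "petal"]
# Rows are terms, columns are documents d0..d3, entries are term frequencies
A = np.array([
    [1, 0, 1, 0],  # car
    [0, 1, 1, 0],  # automobile
    [1, 1, 1, 0],  # engine
    [0, 0, 0, 2],  # flower
    [0, 0, 0, 1],  # petal
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
n = 2  # dimensions of the semantic space (arbitrary choice)

term_vecs = U[:, :n] * s[:n]    # one row per term
doc_vecs = Vt[:n, :].T * s[:n]  # one row per document
A_n = U[:, :n] @ np.diag(s[:n]) @ Vt[:n, :]  # reconstructed rank-n matrix

def cos(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

raw_car_auto = cos(A[0], A[1])                    # never co-occur directly
latent_car_auto = cos(term_vecs[0], term_vecs[1])
latent_d0_d3 = cos(doc_vecs[0], doc_vecs[3])
```

"car" and "automobile" never appear in the same document except together with "engine", so their raw row vectors are far less similar than their vectors in the semantic space, where both fall on the same dimension. The vehicle document d0 and the flower document d3 stay orthogonal.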
probabilistic Latent Semantic
Indexing – pLSI
pLSI (Hofmann)
is NOT a method from linear algebra but linguistically motivated
assumes mixture distributions and a model of latent classes
based on the aspect model
each observation (an occurrence of a term in a document) is associated with one latent class variable
a joint probability of documents and terms is defined
based on two assumptions:
the bag-of-words assumption
a conditional independence of words and documents, they
are only coupled through the latent variable
Aside: latent variables
latent variables are theoretical constructs, they are defined in the
model
they are NOT directly measurable
but they can be determined based on measurable variables
(which are called observables)
example: IQ
cannot be measured directly
is determined by lots of test results (questions, answers are
the observables)
IQ is a latent variable, so a result is only as good
as the underlying model defining the latent variable → a lot
of criticism
pLSI – formal definition
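joint probability over documents and terms: $P(d, w) = P(d)\,P(w \mid d)$ with $P(w \mid d) = \sum_{z} P(w \mid z)\,P(z \mid d)$
now use Bayes' theorem: $P(d, w) = \sum_{z} P(z)\,P(d \mid z)\,P(w \mid z)$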
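The slides do not say how the parameters are estimated. The sketch below assumes the usual expectation-maximization (EM) procedure for the symmetric form of the model, with parameters P(z), P(d|z) and P(w|z) fitted to a document-term count matrix. All function names and the toy counts are hypothetical, and the guard against division by zero is an implementation choice of this example.

```python
import numpy as np

def _normalize(x, axis=None):
    """Scale x to sum to 1 along axis, guarding against 0/0."""
    total = x.sum(axis=axis, keepdims=True)
    return x / np.maximum(total, np.finfo(float).tiny)

def plsi_em(counts, n_classes, n_iter=50, seed=0):
    """Fit P(z), P(d|z), P(w|z) to counts[d, w] with EM (a sketch)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_z = np.full(n_classes, 1.0 / n_classes)
    p_d_z = _normalize(rng.random((n_docs, n_classes)), axis=0)
    p_w_z = _normalize(rng.random((n_words, n_classes)), axis=0)
    for _ in range(n_iter):
        # E-step: P(z | d, w) is proportional to P(z) P(d|z) P(w|z)
        joint = p_z[None, None, :] * p_d_z[:, None, :] * p_w_z[None, :, :]
        post = _normalize(joint, axis=2)
        # M-step: re-estimate from expected counts n(d, w) P(z | d, w)
        expected = counts[:, :, None] * post
        p_d_z = _normalize(expected.sum(axis=1), axis=0)
        p_w_z = _normalize(expected.sum(axis=0), axis=0)
        p_z = _normalize(expected.sum(axis=(0, 1)))
    return p_z, p_d_z, p_w_z

def joint_matrix(p_z, p_d_z, p_w_z):
    """P(d, w) = sum_z P(z) P(d|z) P(w|z) as a documents x terms matrix."""
    return (p_d_z * p_z) @ p_w_z.T

def log_likelihood(counts, model):
    """Sum of n(d, w) log P(d, w) over all observed pairs."""
    mask = counts > 0
    return float(np.sum(counts[mask] * np.log(model[mask])))

# Toy counts: documents 0-1 use words 0-1, documents 2-3 use words 2-3
counts = np.array([[4, 3, 0, 0],
                   [3, 4, 0, 0],
                   [0, 0, 5, 2],
                   [0, 0, 2, 5]], dtype=float)
p_z, p_d_z, p_w_z = plsi_em(counts, n_classes=2)
model = joint_matrix(p_z, p_d_z, p_w_z)
```

Each EM iteration cannot decrease the log-likelihood, so comparing `log_likelihood` after 1 and after 50 iterations with the same seed gives a quick sanity check. EM finds a local optimum only, so the result depends on the random initialization.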
pLSI – relation to LSI
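define 3 matrices: $\hat{U} = (P(d_i \mid z_k))_{i,k}$, $\hat{V} = (P(w_j \mid z_k))_{j,k}$, $\hat{\Sigma} = \mathrm{diag}(P(z_k))_k$
joint probability model: $P = \hat{U} \hat{\Sigma} \hat{V}^T$
we see:
outer products of corresponding columns of $\hat{U}$ and $\hat{V}$ show conditional
independence
K factors are mixture components as in the aspect model
the mixing proportions $P(z_k)$ replace the singular values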
pLSI – relation to LSI
approximation in LSI uses the L2 / Frobenius norm
this implicitly assumes additive Gaussian noise on term frequencies
pLSI uses the likelihood function to explicitly maximize the
predictive quality of the model
in particular: minimize the Kullback-Leibler divergence
between the empirical distribution and the model distribution
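For discrete distributions the Kullback-Leibler measure is $\mathrm{KL}(p \,\|\, q) = \sum_i p_i \log \frac{p_i}{q_i}$. The sketch below evaluates it for an invented empirical word distribution and three candidate models; the numbers are illustrative only.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    # Terms with p_i = 0 contribute nothing; q_i must be > 0 wherever p_i > 0
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

empirical = [0.5, 0.3, 0.2]  # hypothetical relative word frequencies
model_a = [0.5, 0.3, 0.2]    # reproduces the data exactly
model_b = [0.4, 0.4, 0.2]    # slightly off
model_c = [0.2, 0.3, 0.5]    # far off

kl_a = kl_divergence(empirical, model_a)
kl_b = kl_divergence(empirical, model_b)
kl_c = kl_divergence(empirical, model_c)
kl_b_reversed = kl_divergence(model_b, empirical)
```

The value is 0 only when the model matches the data and grows as the model gets worse. It is not symmetric in its two arguments, so it is not a distance in the strict sense.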
pLSI – geometric interpretation
we have K multinomial distributions P(w|z) that are specific to the K
semantic classes
those span a (K-1)-dimensional simplex
approximation of P(w|d) is given by a convex combination of
these K distributions
pLSI – results
approximating P gives a well defined probability distribution for
every word
factors have a clear probabilistic meaning
LSI does not use probabilities, even negative values are possible
there is no apparent interpretation of the semantic space in LSI,
in pLSI the directions of the semantic space can be interpreted as
multinomial distributions over words, one per semantic class
as pLSI is a probabilistic model, we can use model selection to
find an optimal K (number of semantic classes)
