Latent Semantic Indexing and probabilistic Latent Semantic Indexing

Ingmar Schuster
Patrick Jähnichen
Institut für Informatik
This lecture covers
● Semantic spaces
● Bag-of-words assumption
● Singular value decomposition
● Latent Semantic Indexing
● Latent variables
● probabilistic Latent Semantic Indexing
Foundations
Structure of text corpora
● A corpus consists of a set of documents
● Every document consists of a set of words
● The whole corpus has a vocabulary of V different words
● The length of the corpus is N words
The vector space model
● Documents are represented by vectors
● Each component in the vector represents a separate term
● Vectors are frequency vectors, i.e. if a document contains term t, component t of the vector is its frequency, 0 otherwise
● The size of the vector is V (the vocabulary size)
● We can use vector similarity measures to compare documents, e.g. the cosine distance
● Often, tf-idf weights are used instead of indicator vectors
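A minimal sketch of the vector space model described above; the toy corpus, variable names and use of NumPy are illustrative assumptions, not part of the lecture:

import numpy as np

# Toy corpus; in practice these would be the documents of the corpus.
docs = [
    "human machine interface for computer applications",
    "a survey of user opinion of computer system response time",
    "the generation of random binary trees",
]

# One vector component per term in the vocabulary.
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

# Frequency vectors: component i holds the frequency of term i, 0 otherwise.
X = np.zeros((len(docs), len(vocab)))
for d_id, d in enumerate(docs):
    for w in d.split():
        X[d_id, index[w]] += 1

def cosine_similarity(a, b):
    # Cosine of the angle between two document vectors.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(X[0], X[1]))  # documents 0 and 1 share "computer"
print(cosine_similarity(X[0], X[2]))  # no common terms -> 0.0

The cosine distance mentioned above is simply 1 minus this similarity.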
Aside: tf-idf measure
● reflects the importance of a word for a document
● commonly used weighting factor for terms in a document
● Result: a weighted frequency vector with components tf · idf
● tf is the word frequency in a document; often a normalized variant is used, e.g. tf(t, d) = n(t, d) / Σ_t' n(t', d)
● idf is the inverse document frequency, based on the ratio of the number of documents in the corpus to the number of documents also containing the term, usually log-scaled: idf(t) = log(|D| / |{d ∈ D : t ∈ d}|)
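A possible sketch of the tf-idf weighting just described; the length-normalized tf and the log-scaled idf are common choices assumed here, and the count matrix is made up:

import numpy as np

def tf_idf(X):
    # X: documents x terms count matrix.
    tf = X / X.sum(axis=1, keepdims=True)   # word frequency, normalized by document length
    df = (X > 0).sum(axis=0)                # number of documents containing each term
    idf = np.log(X.shape[0] / df)           # inverse document frequency
    return tf * idf                         # weighted frequency vectors

X = np.array([[2.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 2.0]])
print(tf_idf(X))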
Multinomial distribution
● generalisation of the binomial distribution
● binomial distribution: probability distribution of the number of “successes“ in n independent Bernoulli experiments
● Bernoulli experiment: experiment with two complementary outcomes, often termed “success“ and “failure“, with probability p for success and 1 − p for failure
● its outcome distribution is also termed the Bernoulli distribution
● in each of the n Bernoulli experiments, “success“ has the same probability p
Multinomial distribution
● the categorical distribution is the analogue of the Bernoulli distribution for more than two outcomes
● an experiment results in exactly one of K possible outcomes
● the outcomes have probabilities p_1, ..., p_K, where Σ_i p_i = 1
● let X_i be the random variable counting the number of experiments (out of n) with outcome i; then the vector (X_1, ..., X_K) follows a multinomial distribution with parameters n and p = (p_1, ..., p_K)
● in natural language processing tasks, we often speak of multinomial distributions when actually referring to categorical distributions
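A short NumPy illustration of the categorical/multinomial distinction; the probabilities and n are arbitrary example values:

import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])      # probabilities of the K = 3 outcomes, summing to 1

# Categorical: a single experiment yields exactly one of the K outcomes.
single_outcome = rng.choice(len(p), p=p)

# Multinomial: the vector of outcome counts over n = 100 independent experiments.
counts = rng.multinomial(100, p)
print(single_outcome, counts, counts.sum())  # the counts always sum to n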
Conditional probabilities
● the probability of observation x under the condition that observation y has occurred before is the conditional probability of x given y: P(x|y) = P(x, y) / P(y)
● if x and y are independent then P(x|y) = P(x)
● and therefore P(x, y) = P(x) · P(y)
Bayes' theorem
● Bayes' theorem states that P(y|x) = P(x|y) · P(y) / P(x)
● in words: posterior = likelihood × prior / evidence
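A tiny numeric check of the two slides above, using a made-up joint distribution P(x, y):

import numpy as np

P_xy = np.array([[0.10, 0.20],    # rows: values of x
                 [0.30, 0.40]])   # columns: values of y
P_x = P_xy.sum(axis=1)
P_y = P_xy.sum(axis=0)

P_x_given_y = P_xy / P_y                 # P(x|y) = P(x, y) / P(y)
P_y_given_x = P_xy / P_x[:, None]        # P(y|x) = P(x, y) / P(x)

# Bayes' theorem: P(y|x) = P(x|y) P(y) / P(x)
print(np.allclose(P_y_given_x, P_x_given_y * P_y / P_x[:, None]))  # True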
Bag-of-words assumption
● we neglect the order of words in a document
● i.e. a document can be interpreted as a “bag“ full of words
● for each word, only its frequency is stored
● Assumption:
  ● information on type and frequency of words allows conclusions about the structure of text
  ● underlying assumption: de Finetti's theorem
    – assumption of exchangeability
    – exchangeable random variables follow a mixture distribution, often an infinite one
Singular Value Decomposition
● a method from linear algebra
● a method of factorizing a matrix
● formally: A = U Σ V^T, with U and V orthogonal and Σ diagonal
● the diagonal elements of Σ are the singular values of A
● these are specific properties of a matrix, comparable to eigenvalues
● for every matrix, at least one singular value decomposition exists
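A minimal NumPy sketch of the decomposition; the example matrix is arbitrary:

import numpy as np

A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Vt
print(s)                                           # the singular values of A
print(np.allclose(A, U @ np.diag(s) @ Vt))         # True: the factorization reproduces A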
Singular Value Decomposition - Applications
● Statistics
  ● Principal Component Analysis (PCA)
    – structuring data sets by approximating many variables by just a few linear combinations (the principal components)
    – also called Karhunen-Loève transformation
● Image compression
  – the image (a matrix of color values) is factorized
  – only singular values significantly higher than 0 are preserved
  – the image is reconstructed from the truncated factorization
  – this is a lossy compression
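A sketch of the image-compression idea via a truncated SVD; a random low-rank-plus-noise matrix stands in for the matrix of color values:

import numpy as np

rng = np.random.default_rng(0)
image = rng.random((100, 5)) @ rng.random((5, 100)) + 0.01 * rng.random((100, 100))

U, s, Vt = np.linalg.svd(image, full_matrices=False)

k = 5                                              # keep only the clearly non-zero singular values
compressed = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Storing U[:, :k], s[:k], Vt[:k, :] needs far fewer values than the full matrix;
# the reconstruction is lossy.
print(np.abs(image - compressed).max())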
Latent Semantic Indexing – LSI
Latent Semantic Analysis – LSA
LSA (Deerwester et al.)
● a linear factorization of term-document matrices
● these are factorized into three matrices using singular value decomposition
● all but the n highest singular values are set to 0
● the term-document matrix is reconstructed
● the reconstructed matrix now has a lower rank (rank = n)
LSA – an example (slides: a small term-document matrix, its singular value decomposition, and the reconstructed lower-rank matrix)
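In place of the original example slides, a rough sketch of the LSA steps on a small, made-up term-document matrix:

import numpy as np

# rows = terms, columns = documents (counts are illustrative)
X = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 0.0, 1.0]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)

n = 2                                    # number of "semantic categories" to keep
s_trunc = np.zeros_like(s)
s_trunc[:n] = s[:n]                      # all but the n highest singular values set to 0

X_hat = U @ np.diag(s_trunc) @ Vt        # reconstructed term-document matrix
print(np.linalg.matrix_rank(X_hat))      # n = 2
print(X_hat.round(2))                    # entries are no longer 0/1 counts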
LSA
● Benefits:
  ● there is no zero/one decision
  ● we reduce the dimensionality to n dimensions (“semantic categories“)
● But:
  ● this is a lossy method, some information is lost
  ● there is no underlying statistical model, so how to justify this?
    – SVD assumes Gaussian noise on term frequencies, but term frequencies follow a Poisson distribution
  ● how to choose n?
  ● problems with polysemous words
LSA – geometric interpretation
● we reduced to n dimensions, these span the “semantic space“
● we have a word-dimension matrix U and a document-dimension matrix V
● in U:
  ● the angle between vectors is interpreted as semantic similarity
  ● this allows semantic clustering
● in V:
  ● clustering of similar documents
probabilistic Latent Semantic Indexing – pLSI
pLSI (Hofmann)
● is NOT a method from linear algebra but linguistically motivated
● assumes mixture distributions and a model of latent classes
● based on the aspect model:
  ● each observation (term) is assigned to one latent variable (class/dimension)
  ● a joint probability of documents and terms is defined
● based on two assumptions:
  ● the bag-of-words assumption
  ● conditional independence of words and documents; they are only coupled through the latent variable
Aside: latent variables
● latent variables are theoretical constructs, they are defined in the model
● they are NOT directly measurable
● but they can be determined based on measurable variables (which are called observables)
● example: IQ
  ● cannot be measured directly
  ● is determined by lots of test results (questions and answers are observable)
  ● IQ is a latent variable, so the quality of a result is only as good as the underlying model defining the latent variable → a lot of criticism
pLSI – formal definition
● joint probability over documents and terms: P(d, w) = P(d) · P(w|d), with P(w|d) = Σ_z P(w|z) · P(z|d)
● now use Bayes' theorem to obtain the symmetric form: P(d, w) = Σ_z P(z) · P(d|z) · P(w|z)
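A rough EM sketch for fitting the aspect model in its symmetric form P(d, w) = Σ_z P(z) P(d|z) P(w|z); the toy count matrix, the initialization and the number of iterations are assumptions, and Hofmann's original procedure additionally uses tempered EM:

import numpy as np

rng = np.random.default_rng(0)
n_dw = np.array([[2.0, 1.0, 0.0, 0.0],   # n(d, w): term counts, rows = documents
                 [1.0, 0.0, 2.0, 1.0],
                 [0.0, 0.0, 1.0, 3.0]])
D, W = n_dw.shape
K = 2                                     # number of latent classes z

# Random, normalized initialization of P(z), P(d|z), P(w|z).
p_z = np.full(K, 1.0 / K)
p_d_z = rng.random((K, D)); p_d_z /= p_d_z.sum(axis=1, keepdims=True)
p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)

for _ in range(100):
    # E-step: responsibilities P(z|d, w) proportional to P(z) P(d|z) P(w|z).
    joint = p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]   # shape (K, D, W)
    p_z_dw = joint / joint.sum(axis=0, keepdims=True)

    # M-step: re-estimate the parameters from the expected counts.
    expected = n_dw[None, :, :] * p_z_dw
    p_w_z = expected.sum(axis=1); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_d_z = expected.sum(axis=2); p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_z = expected.sum(axis=(1, 2)); p_z /= p_z.sum()

p_dw = (p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]).sum(axis=0)
print(p_dw.round(3))                      # approximated joint distribution P(d, w)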
pLSI – relation to LSI
● define 3 matrices: U with entries u_ik = P(d_i|z_k), V with entries v_jk = P(w_j|z_k), and Σ = diag(P(z_k))
● joint probability model: P = U Σ V^T
● we see:
  ● the outer products of the rows of U and V reflect the conditional independence of words and documents
  ● the K factors are mixture components as in the aspect model
  ● the mixture components replace the singular values
pLSI – relation to LSI
● differences:
  ● the approximation in LSI uses the L2 / Frobenius norm
  ● this implicitly assumes Gaussian noise on the term frequencies
  ● pLSI uses a likelihood function to explicitly maximize the predictive quality of the model
    – in particular: it minimizes the Kullback-Leibler divergence between the true and the approximated probability distribution
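A tiny numeric contrast of the two objectives, with made-up distributions P (true) and Q (approximation):

import numpy as np

P = np.array([0.5, 0.3, 0.2])   # "true" distribution
Q = np.array([0.4, 0.4, 0.2])   # approximation

l2 = np.linalg.norm(P - Q)         # L2 / Frobenius-type error, the quantity LSI-style approximation minimizes
kl = np.sum(P * np.log(P / Q))     # Kullback-Leibler divergence KL(P || Q), the quantity pLSI's likelihood targets
print(l2, kl)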
pLSI – geometric interpretation
● we have K multinomial distributions P(w|z) that are specific to the K semantic classes
● these span a (K−1)-simplex
● the approximation of P(w|d) is given by a convex combination of the P(w|z)
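A small sketch of this convex-combination view, P(w|d) as a mixture of the class-specific word distributions P(w|z); the numbers are arbitrary:

import numpy as np

p_w_z = np.array([[0.7, 0.2, 0.1],    # P(w|z=1)
                  [0.1, 0.3, 0.6]])   # P(w|z=2)
p_z_d = np.array([0.25, 0.75])        # mixing weights P(z|d): non-negative, summing to 1

p_w_d = p_z_d @ p_w_z                 # convex combination of the rows of p_w_z
print(p_w_d, p_w_d.sum())             # again a probability distribution over words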
pLSI – results
● approximating P gives a well-defined probability distribution for every word
● the factors have a clear probabilistic meaning
● LSI does not use probabilities; even negative values are possible
● there is no apparent interpretation of the semantic space in LSI; in pLSI the semantic space can be interpreted as a multinomial distribution of words over semantic classes
● as pLSI is a probabilistic model, we can use model selection to find an optimal K (number of semantic classes)
This lecture covered
● Semantic spaces
● Bag-of-words assumption
● Singular value decomposition
● Latent Semantic Indexing
● Latent variables
● probabilistic Latent Semantic Indexing
Pictures/References
● Wikipedia: Singular Value Decomposition
● Deerwester, Dumais, Furnas, Landauer, Harshman: Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 1990
● Hofmann: Probabilistic Latent Semantic Indexing. Proceedings of the Twenty-Second SIGIR, 1999