Latent Semantic Indexing and probabilistic Latent Semantic Indexing
Ingmar Schuster, Patrick Jähnichen
Institut für Informatik

This lecture covers
● Semantic spaces
● Bag-of-words assumption
● Singular value decomposition
● Latent Semantic Indexing
● Latent variables
● probabilistic Latent Semantic Indexing

Foundations

Structure of text corpora
● A corpus consists of a set of documents
● Every document consists of a set of words
● The whole corpus has a vocabulary of V different words
● The length of the corpus is N words in total

The vector space model
● Documents are represented by vectors
● Each component in the vector represents a separate term
● The vectors are frequency vectors, i.e. if a document contains term t_i, component i of the vector is its frequency, 0 otherwise
● The size of the vector is the vocabulary size V
● We can use vector similarity measures to compare documents, e.g. the cosine distance
● Often, tf-idf weights are used instead of plain frequency vectors

Aside: the tf-idf measure
● reflects the importance of a word for a document
● commonly used weighting factor for terms in a document
● tf-idf(t, d) = tf(t, d) · idf(t)
● tf(t, d) is the frequency of word t in document d; often a normalized variant is used, e.g. divided by the document length
● idf(t) is the inverse document frequency, i.e. the ratio of the number of documents in the corpus to the number of documents also containing this term, usually taken on a logarithmic scale
● Result: a weighted frequency vector (a small NumPy sketch follows after the bag-of-words slide)

Multinomial distribution
● generalisation of the binomial distribution
● binomial distribution: probability distribution of the number of "successes" in n independent Bernoulli experiments
● Bernoulli experiment: an experiment with two complementary outcomes, often termed "success" and "failure", with probability p for success and 1 - p for failure; the corresponding distribution is also termed the Bernoulli distribution
● in each of the n Bernoulli experiments, "success" has the same probability

Multinomial distribution
● the categorical distribution is the analogue of the Bernoulli distribution for more than two outcomes
● an experiment results in exactly one of K possible outcomes
● the outcomes have probabilities p_1, ..., p_K with p_1 + ... + p_K = 1
● let X_k be a random variable counting the number of outcomes of category k in n experiments; then the vector (X_1, ..., X_K) follows a multinomial distribution with parameters n and (p_1, ..., p_K)
● in natural language processing tasks, we often speak of multinomial distributions when actually referring to categorical distributions

Conditional probabilities
● the probability of observation x under the condition that observation y has occurred before is the conditional probability of x given y: P(x | y) = P(x, y) / P(y)
● if x and y are independent then P(x | y) = P(x)
● and therefore P(x, y) = P(x) P(y)

Bayes' theorem
● Bayes' theorem states that P(x | y) = P(y | x) P(x) / P(y)
● in words: the posterior probability of x given y is the likelihood of y given x times the prior probability of x, divided by the probability of y

Bag-of-words assumption
● we neglect the order of words in a document
● i.e. a document can be interpreted as a "bag" full of words
● for each word, only its frequency is stored
● Assumption: information on the type and frequency of words allows conclusions about the structure of the text
● underlying assumption: de Finetti's theorem – assumption of exchangeability – exchangeable random variables follow a mixture distribution, often an infinite one
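As a small illustration of the vector space model with tf-idf weighting and cosine similarity, here is a sketch in plain NumPy. The toy corpus, the whitespace tokenization and the log-based idf variant are illustrative assumptions, not part of the lecture.

import numpy as np

# toy corpus (hypothetical example documents)
docs = ["human computer interaction",
        "graph of trees",
        "graph minors and trees"]

# bag-of-words: keep only term frequencies, ignore word order
vocab = sorted({w for d in docs for w in d.split()})
tf = np.array([[d.split().count(w) for w in vocab] for d in docs], dtype=float)

# idf: ratio of corpus size to the number of documents containing the term, on a log scale
df = (tf > 0).sum(axis=0)
idf = np.log(len(docs) / df)

tfidf = tf * idf  # weighted frequency vectors

def cosine_similarity(a, b):
    # cosine of the angle between two document vectors
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# documents 2 and 3 share "graph" and "trees", document 1 shares nothing with document 2
print(cosine_similarity(tfidf[1], tfidf[2]), cosine_similarity(tfidf[0], tfidf[1]))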
Singular Value Decomposition
● a method from linear algebra
● a method for factorizing a matrix
● formally: M = U Σ V^T
● the diagonal elements of Σ are the singular values of M
● these are specific properties of a matrix, comparable to eigenvalues
● for every matrix, at least one singular value decomposition exists

Singular Value Decomposition – Applications
● Statistics
  – Principal Component Analysis (PCA): structuring data sets by approximating many variables by just a few linear combinations (the principal components)
  – also called the Karhunen-Loève transformation
● Image compression
  – the image (a matrix of color values) is factorized
  – only singular values significantly higher than 0 are preserved
  – the image is reconstructed from the truncated factorization
  – this is a lossy compression

Latent Semantic Indexing – LSI / Latent Semantic Analysis – LSA

LSA (Deerwester et al.)
● a linear factorization of term-document matrices
● these are factorized into three matrices using singular value decomposition
● all but the n highest singular values are set to 0
● the term-document matrix is reconstructed from the truncated factorization
● the reconstructed matrix now has a lower rank (rank = n)
● (a NumPy sketch of this procedure is shown below)

LSA – an example (example slides: term-document matrix, singular value decomposition, reconstructed matrix)

LSA
● Benefits:
  – there is no zero/one decision
  – we reduce the dimensionality to n dimensions ("semantic categories")
● but:
  – this is a lossy method, some information is lost
  – there is no underlying statistical model, so how can the approach be justified? SVD assumes Gaussian noise on the term frequencies, but term frequencies actually follow a Poisson distribution
  – how to choose n?
  – problems with polysemes
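To make the LSA procedure concrete, here is a minimal NumPy sketch of a truncated SVD on a toy term-document matrix. The matrix and the choice n = 2 are illustrative assumptions.

import numpy as np

# toy term-document matrix: rows = terms, columns = documents (raw frequencies)
X = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 1]], dtype=float)

# singular value decomposition X = U Sigma V^T
U, s, Vt = np.linalg.svd(X, full_matrices=False)

n = 2                                         # number of retained "semantic" dimensions
X_n = U[:, :n] @ np.diag(s[:n]) @ Vt[:n, :]   # all but the n highest singular values dropped

print(np.linalg.matrix_rank(X_n))             # the reconstructed matrix has rank n
# rows of U[:, :n] place terms and columns of Vt[:n, :] place documents in the
# n-dimensional semantic space; angles between these vectors can be interpreted
# as semantic similarity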
LSA – geometric interpretation
● we reduced to n dimensions, these span the "semantic space"
● we obtain a word-dimension matrix U and a document-dimension matrix V
● in U:
  – the angle between vectors is interpreted as semantic similarity
  – this allows semantic clustering of terms
● in V:
  – clustering of similar documents

probabilistic Latent Semantic Indexing – pLSI

pLSI (Hofmann)
● is NOT a method from linear algebra but linguistically motivated
● assumes mixture distributions and a model of latent classes
● based on the aspect model:
  – each observation (term) is assigned to one latent variable (class/dimension)
  – a joint probability of documents and terms is defined
● based on two assumptions:
  – the bag-of-words assumption
  – conditional independence of words and documents: they are coupled only through the latent variable

Aside: latent variables
● latent variables are theoretical constructs, they are defined within the model
● they are NOT directly measurable
● but they can be determined from measurable variables (which are called observables)
● example: IQ
  – cannot be measured directly
  – is determined from many test results (the questions and answers are observable)
  – IQ is the latent variable, so the quality of a result is only as good as the underlying model defining the latent variable → a lot of criticism

pLSI – formal definition
● joint probability over documents and terms: P(d, w) = P(d) Σ_z P(w | z) P(z | d)
● now use Bayes' theorem: P(d, w) = Σ_z P(z) P(d | z) P(w | z)
● (a minimal EM sketch for fitting this model is given after the references)

pLSI – relation to LSI
● define 3 matrices: U with entries P(d | z_k), V with entries P(w | z_k), and Σ = diag(P(z_k))
● joint probability model: P = U Σ V^T
● we see:
  – the outer products of the rows of U and V reflect the conditional independence of words and documents
  – the K factors are the mixture components of the aspect model
  – the mixture components replace the singular values

pLSI – relation to LSI
● differences:
  – the approximation in LSI uses the L2 / Frobenius norm
  – this implicitly assumes Gaussian noise on the term frequencies
  – pLSI uses the likelihood function to explicitly maximize the predictive quality of the model, in particular it minimizes the Kullback-Leibler divergence between the true and the approximated probability distribution

pLSI – geometric interpretation
● we have K multinomial distributions P(w | z) that are specific to the K semantic classes
● these span a (K - 1)-simplex
● the approximation of P(w | d) is given by a convex combination of the P(w | z)

pLSI – results
● the approximation of P gives a well defined probability distribution for every word
● the factors have a clear probabilistic meaning
● LSI does not use probabilities, even negative values are possible
● there is no obvious interpretation of the semantic space in LSI; in pLSI the semantic space can be interpreted as a multinomial distribution of words over semantic classes
● as pLSI is a probabilistic model, we can use model selection to find an optimal K (the number of semantic classes)

This lecture covered
● Semantic spaces
● Bag-of-words assumption
● Singular value decomposition
● Latent Semantic Indexing
● Latent variables
● probabilistic Latent Semantic Indexing

Pictures/References
● Wikipedia: Singular Value Decomposition
● Deerwester, Dumais, Furnas, Landauer, Harshman: Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science, 1990
● Hofmann: Probabilistic Latent Semantic Indexing, Proceedings of the Twenty-Second Annual International ACM SIGIR Conference, 1999
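To make the aspect model and its formal definition concrete, here is a minimal sketch of plain EM for pLSI on a toy count matrix. The initialisation, the fixed number of iterations and the count matrix itself are illustrative assumptions, and this is a plain EM sketch rather than Hofmann's original implementation (which uses tempered EM).

import numpy as np

rng = np.random.default_rng(0)

# toy term counts n(d, w): rows = documents, columns = words (hypothetical data)
N = np.array([[2, 1, 0, 0],
              [1, 0, 3, 1],
              [0, 2, 1, 4]], dtype=float)
D, W = N.shape
K = 2                                  # number of latent semantic classes z

# random initialisation of the multinomials P(z), P(d|z), P(w|z)
P_z = np.full(K, 1.0 / K)
P_d_z = rng.random((K, D)); P_d_z /= P_d_z.sum(axis=1, keepdims=True)
P_w_z = rng.random((K, W)); P_w_z /= P_w_z.sum(axis=1, keepdims=True)

for _ in range(100):
    # E-step: P(z | d, w) is proportional to P(z) P(d|z) P(w|z)
    post = P_z[:, None, None] * P_d_z[:, :, None] * P_w_z[:, None, :]
    post /= post.sum(axis=0, keepdims=True)           # shape (K, D, W)

    # M-step: re-estimate the multinomials from expected counts n(d,w) P(z|d,w)
    expected = N[None, :, :] * post
    P_w_z = expected.sum(axis=1); P_w_z /= P_w_z.sum(axis=1, keepdims=True)
    P_d_z = expected.sum(axis=2); P_d_z /= P_d_z.sum(axis=1, keepdims=True)
    P_z = expected.sum(axis=(1, 2)) / N.sum()

# joint probability P(d, w) = sum_z P(z) P(d|z) P(w|z)
P_dw = np.einsum('k,kd,kw->dw', P_z, P_d_z, P_w_z)
print(P_dw.sum())   # close to 1: a well defined probability distribution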