Singular Value Decomposition in Text Mining
Ram Akella
University of California Berkeley / Silicon Valley Center/SC
Lecture 4b, February 9, 2011

Class Outline
- Summary of last lecture
- Indexing
- Vector Space Models
- Matrix Decompositions
- Latent Semantic Analysis
  - Mechanics
  - Example

Summary of previous class
- Principal Component Analysis
- Singular Value Decomposition
  - Uses
  - Mechanics
  - Example: swap rates

Introduction
How can we retrieve information using a search engine? We can represent the query and the documents as vectors (the vector space model). However, to construct these vectors we must first perform some preliminary document preparation. Documents are then retrieved by finding the closest distance between the query vector and the document vectors. Which distance is the most suitable for retrieving documents?

Search engine

Document File Preparation: Manual Indexing
- Relationships and concepts between topics can be established.
- It is expensive and time consuming.
- It may not be reproducible if it is destroyed.
- The huge amount of information suggests a more automated system.

Document File Preparation: Automatic Indexing
To build an automatic index, we need to perform two steps:
- Document analysis: decide what information or which parts of the document should be indexed.
- Token analysis: decide which words should be used in order to obtain the best representation of the semantic content of the documents.

Document Normalization
After this preliminary analysis we need to perform further preprocessing of the data:
- Remove stop words
  - Function words: a, an, as, for, in, of, the, ...
  - Other frequent words
- Stemming: group morphological variants
  - Plurals: "streets" -> "street"
  - Adverbs: "fully" -> "full"
  - Current algorithms can make mistakes: "police", "policy" -> "polic"

File Structures: Document File
Once we have eliminated the stop words and applied the stemmer to the documents, we can extract the terms that should be used in the index and assign a number to each document.

File Structures: Dictionary
We construct a searchable dictionary of terms by arranging them alphabetically and indicating the frequency of each term in the collection:

Term        Global Frequency
banana      1
cranb       2
Hanna       2
hunger      1
manna       1
meat        1
potato      1
query       1
rye         2
sourdough   1
spiritual   1
wheat       2

File Structures: Inverted List
For each term we record the documents in which it appears, together with its position in each:

Term        (Doc, Position)
banana      (5,7)
cranb       (4,5); (6,4)
Hanna       (1,7); (8,2)
hunger      (9,4)
manna       (2,6)
meat        (7,6)
potato      (4,3)
query       (3,8)
rye         (3,3); (6,3)
sourdough   (5,5)
spiritual   (7,5)
wheat       (3,5); (6,6)

Vector Space Model
The vector space model can be used to represent terms and documents in a text collection. A collection of n documents indexed by m terms can be represented with an m x n matrix in which the rows represent the terms and the columns the documents. Once we construct the matrix, we can normalize its columns so that each document is a unit vector.

Vector Space Models: Query Matching
If we want to retrieve a document we should:
- transform the query into a vector, and
- look for the document vector most similar to the query.
One of the most common similarity measures is the cosine between vectors, defined as

    cos θ_j = (a_j^T q) / (||a_j||_2 ||q||_2)

where a_j is the j-th document vector and q is the query vector.

Example: using the book titles, we want to retrieve books about "Child Proofing". The query vector is (0 1 0 0 0 1 0 0). The resulting cosines are cos θ_2 = cos θ_3 = 0.4082 and cos θ_5 = cos θ_6 = 0.500. With a threshold of 0.5, the 5th and 6th books would be retrieved.
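As a small illustration of this query-matching step, here is a minimal Python/NumPy sketch; the 4 x 4 term-document matrix, query, and threshold below are made-up stand-ins, not the book-title data from the slides.

```python
import numpy as np

# rows = terms, columns = documents (raw term frequencies);
# this small matrix is an illustrative stand-in, not the book-title collection
A = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 1]], dtype=float)

# query vector over the same term vocabulary
q = np.array([1, 0, 1, 0], dtype=float)

# cos(theta_j) = a_j^T q / (||a_j||_2 ||q||_2) for every document column a_j
cosines = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))

# retrieve the documents whose cosine exceeds a threshold, as in the slides
threshold = 0.5
retrieved = np.where(cosines >= threshold)[0]
print(np.round(cosines, 4), retrieved)
```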
Term Weighting
In order to improve retrieval, we can give some terms more weight than others, combining a local term weight (within a document) and a global term weight (across the collection).

Local and Global Term Weights
With f_ij the frequency of term i in document j, the weighting formulas use

    χ(r) = 1 if r > 0, 0 if r = 0        and        p_ij = f_ij / Σ_j f_ij

Synonymy and Polysemy
Consider three documents:
  D1: auto, engine, bonnet, tyres, lorry, boot
  D2: car, emissions, hood, make, model, trunk
  D3: make, hidden Markov model, emissions, normalize
- Synonymy: D1 and D2 will have a small cosine, yet they are related.
- Polysemy: D2 and D3 will have a large cosine, yet they are not truly related.

Matrix Decomposition
To produce a reduced-rank approximation of the document matrix, we first need to identify the dependence between columns (documents) and rows (terms). Two factorizations are useful:
- QR factorization
- SVD decomposition

QR Factorization
The document matrix A can be decomposed as

    A = Q R

where Q is an m x m orthogonal matrix and R is an m x n upper triangular matrix. This factorization can be used to determine the basis vectors for any matrix A and to describe the semantic content of the corresponding text collection.

Example
The slides display the numerical Q and R factors computed for the book-title matrix A.

Query Matching
We can rewrite the cosine using this decomposition. Since a_j = Q_1 r_j,

    cos θ_j = (a_j^T q) / (||a_j||_2 ||q||_2)
            = ((Q_1 r_j)^T q) / (||Q_1 r_j||_2 ||q||_2)
            = (r_j^T (Q_1^T q)) / (||r_j||_2 ||q||_2)

where r_j refers to column j of the matrix R.

Singular Value Decomposition (SVD)
This decomposition provides reduced-rank approximations in both the column space and the row space of the document matrix. It is defined as

    A = U Σ V^T

where U is m x m, Σ is m x n, and V is n x n. The columns of U are orthonormal eigenvectors of A A^T, the columns of V are orthonormal eigenvectors of A^T A, and the singular values σ_1 ≥ ... ≥ σ_r on the diagonal of Σ are the square roots of the nonzero eigenvalues of A^T A (equivalently, of A A^T).
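A short NumPy sketch of the factorization just defined; the matrix A here is an arbitrary small example, and numpy.linalg.svd returns the singular values as a vector rather than as the diagonal matrix Σ.

```python
import numpy as np

# arbitrary small term-document matrix, used only to illustrate the factorization
A = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [1., 1., 0.],
              [0., 0., 1.]])

# full SVD: A = U * Sigma * V^T
U, sigma, Vt = np.linalg.svd(A, full_matrices=True)

# rebuild the m x n Sigma and verify the reconstruction
Sigma = np.zeros(A.shape)
Sigma[:len(sigma), :len(sigma)] = np.diag(sigma)
assert np.allclose(A, U @ Sigma @ Vt)

# the squared singular values are the nonzero eigenvalues of A^T A (and of A A^T)
eig = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
assert np.allclose(sigma**2, eig)
```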
Latent Semantic Analysis (LSA)
LSA is the application of the SVD to text mining. We decompose the term-document matrix A into three matrices,

    A = U Σ V^T

where the rows of U correspond to terms and the rows of V correspond to documents.

Latent Semantic Analysis
Once we have decomposed the document matrix A we can reduce its rank, which allows us to account for synonymy and polysemy in the retrieval of documents. We select the vectors associated with the largest singular values σ in each matrix and reconstruct a low-rank approximation A_k of A.

Query Matching
The cosines between the query vector q and the n document vectors can be written as

    cos θ_j = ((A_k e_j)^T q) / (||A_k e_j||_2 ||q||_2)
            = ((U_k Σ_k V_k^T e_j)^T q) / (||U_k Σ_k V_k^T e_j||_2 ||q||_2)

where e_j is the j-th canonical vector of dimension n. This formula can be simplified to

    cos θ_j = (s_j^T (U_k^T q)) / (||s_j||_2 ||q||_2),   with s_j = Σ_k V_k^T e_j,   j = 1, 2, ..., n.

Example
Apply the LSA method to the following technical memo titles:

c1: Human machine interface for ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user perceived response time to error measurement
m1: The generation of random, binary, ordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey

Example
First we construct the term-document matrix (rows = terms, columns = documents):

            c1  c2  c3  c4  c5  m1  m2  m3  m4
human        1   0   0   1   0   0   0   0   0
interface    1   0   1   0   0   0   0   0   0
computer     1   1   0   0   0   0   0   0   0
user         0   1   1   0   1   0   0   0   0
system       0   1   1   2   0   0   0   0   0
response     0   1   0   0   1   0   0   0   0
time         0   1   0   0   1   0   0   0   0
EPS          0   0   1   1   0   0   0   0   0
survey       0   1   0   0   0   0   0   0   1
trees        0   0   0   0   0   1   1   1   0
graph        0   0   0   0   0   0   1   1   1
minors       0   0   0   0   0   0   0   1   1

Example
The resulting decomposition is the following.

U (rows = terms) =
human      0.22 -0.11  0.29 -0.41 -0.11 -0.34  0.52 -0.06 -0.41
interface  0.20 -0.07  0.14 -0.55  0.28  0.50 -0.07 -0.01 -0.11
computer   0.24  0.04 -0.16 -0.59 -0.11 -0.25 -0.30  0.06  0.49
user       0.40  0.06 -0.34  0.10  0.33  0.38  0.00  0.00  0.01
system     0.64 -0.17  0.36  0.33 -0.16 -0.21 -0.17  0.03  0.27
response   0.27  0.11 -0.43  0.07  0.08 -0.17  0.28 -0.02 -0.05
time       0.27  0.11 -0.43  0.07  0.08 -0.17  0.28 -0.02 -0.05
EPS        0.30 -0.14  0.33  0.19  0.11  0.27  0.03 -0.02 -0.17
survey     0.21  0.27 -0.18 -0.03 -0.54  0.08 -0.47 -0.04 -0.58
trees      0.01  0.49  0.23  0.03  0.59 -0.39 -0.29  0.25 -0.23
graph      0.04  0.62  0.22  0.00 -0.07  0.11  0.16 -0.68  0.23
minors     0.03  0.45  0.14 -0.01 -0.30  0.28  0.34  0.68  0.18

Σ = diag(3.34, 2.54, 2.35, 1.64, 1.50, 1.31, 0.85, 0.56, 0.36)

V (rows = documents) =
c1   0.20 -0.06  0.11 -0.95  0.05 -0.08  0.18 -0.01 -0.06
c2   0.61  0.17 -0.50 -0.03 -0.21 -0.26 -0.43  0.05  0.24
c3   0.46 -0.13  0.21  0.04  0.38  0.72 -0.24  0.01  0.02
c4   0.54 -0.23  0.57  0.27 -0.21 -0.37  0.26 -0.02 -0.08
c5   0.28  0.11 -0.51  0.15  0.33  0.03  0.67 -0.06 -0.26
m1   0.00  0.19  0.10  0.02  0.39 -0.30 -0.34  0.45 -0.62
m2   0.01  0.44  0.19  0.02  0.35 -0.21 -0.15 -0.76  0.02
m3   0.02  0.62  0.25  0.01  0.15  0.00  0.25  0.45  0.52
m4   0.08  0.53  0.08 -0.03 -0.60  0.36  0.04 -0.07 -0.45

Example
We perform a rank-2 reconstruction: we keep the first two singular vectors of U and V and the two largest singular values, set the rest to zero, and reconstruct the document matrix A_2 = U_2 Σ_2 V_2^T.

Example
            c1    c2    c3    c4    c5    m1    m2    m3    m4
human      0.16  0.40  0.38  0.47  0.18 -0.05 -0.12 -0.16 -0.09
interface  0.14  0.37  0.33  0.40  0.16 -0.03 -0.07 -0.10 -0.04
computer   0.15  0.51  0.36  0.41  0.24  0.02  0.06  0.09  0.12
user       0.26  0.84  0.61  0.70  0.39  0.03  0.08  0.12  0.19
system     0.45  1.23  1.05  1.27  0.56 -0.07 -0.15 -0.21 -0.05
response   0.16  0.58  0.38  0.42  0.28  0.06  0.13  0.19  0.22
time       0.16  0.58  0.38  0.42  0.28  0.06  0.13  0.19  0.22
EPS        0.22  0.55  0.51  0.63  0.24 -0.07 -0.14 -0.20 -0.11
survey     0.10  0.53  0.23  0.21  0.27  0.14  0.31  0.44  0.42
trees     -0.06  0.23 -0.14 -0.27  0.14  0.24  0.55  0.77  0.66
graph     -0.06  0.34 -0.15 -0.30  0.20  0.31  0.69  0.98  0.85
minors    -0.04  0.25 -0.10 -0.21  0.15  0.22  0.50  0.71  0.62
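The rank-2 reconstruction shown above can be reproduced with a few lines of NumPy. This is a minimal sketch that builds the term-document matrix listed in the example and truncates its SVD to the two largest singular values; only the variable names are my own.

```python
import numpy as np

# term-document matrix from the technical-memo example
# (rows = terms, columns = c1..c5, m1..m4)
terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]
A = np.array([[1, 0, 0, 1, 0, 0, 0, 0, 0],    # human
              [1, 0, 1, 0, 0, 0, 0, 0, 0],    # interface
              [1, 1, 0, 0, 0, 0, 0, 0, 0],    # computer
              [0, 1, 1, 0, 1, 0, 0, 0, 0],    # user
              [0, 1, 1, 2, 0, 0, 0, 0, 0],    # system
              [0, 1, 0, 0, 1, 0, 0, 0, 0],    # response
              [0, 1, 0, 0, 1, 0, 0, 0, 0],    # time
              [0, 0, 1, 1, 0, 0, 0, 0, 0],    # EPS
              [0, 1, 0, 0, 0, 0, 0, 0, 1],    # survey
              [0, 0, 0, 0, 0, 1, 1, 1, 0],    # trees
              [0, 0, 0, 0, 0, 0, 1, 1, 1],    # graph
              [0, 0, 0, 0, 0, 0, 0, 1, 1]],   # minors
             dtype=float)

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

# rank-2 reconstruction: keep only the two largest singular values/vectors
k = 2
A2 = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]

for term, row in zip(terms, np.round(A2, 2)):
    print(f"{term:10s}", row)   # matches the reconstructed matrix above, up to rounding
```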
In the reconstructed matrix, the word user now has nonzero weight in documents where the word human appears (such as c4), even though the two terms never co-occur in the original titles; the reduced-rank representation has captured this latent association.
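To close the example, a sketch of query matching in the reduced space using cos θ_j = (s_j^T (U_k^T q)) / (||s_j||_2 ||q||_2) with s_j = Σ_k V_k^T e_j. It continues the previous sketch (reusing terms, U, sigma, and Vt), and the query "human computer interaction" is an illustrative choice, not part of the original slides.

```python
# continues the previous sketch: terms, U, sigma, Vt are assumed to be in scope
import numpy as np

k = 2
Uk, Sigk, Vtk = U[:, :k], np.diag(sigma[:k]), Vt[:k, :]

# illustrative query: "human computer interaction" -> indicator vector over the vocabulary
q = np.zeros(len(terms))
q[terms.index("human")] = 1.0
q[terms.index("computer")] = 1.0

# s_j = Sigma_k V_k^T e_j is simply the j-th column of Sigma_k V_k^T
S = Sigk @ Vtk                     # k x n matrix; column j is s_j
proj_q = Uk.T @ q                  # U_k^T q, the query folded into the reduced space

cosines = (S.T @ proj_q) / (np.linalg.norm(S, axis=0) * np.linalg.norm(q))
print(np.round(cosines, 2))        # the c-documents score noticeably higher than the m-documents
```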