EDLSI with PSVD Updating
April Kontostathis, Ursinus College
Erin Moulding, Raymond J. Spiteri, University of Saskatchewan

Outline
• Overview of Latent Semantic Indexing (LSI)
• Updating methods for PSVD
• Essential Dimensions of LSI (EDLSI)
• Description of our experiments
• Results
• Conclusions

Vector Space Retrieval
• Documents represented by vectors
• Entries represent importance of term (word)
  ▫ Binary or raw frequencies
  ▫ May be weighted, locally or globally
  ▫ Normalized
  ▫ Common words (stop-words) and infrequent words removed

Vector Space Retrieval (continued)
• Vectors combined into term-document matrix A
• Query q also represented as a vector, built by the same rules
• Scores computed by: w = q^T A
  ▫ Entries of w give the relevance of each document to q
• Multiple queries are handled the same way
• Problems:
  ▫ Synonymy: a query that uses the wrong synonym misses relevant documents
  ▫ Polysemy: a term with multiple meanings retrieves documents about the wrong one

Latent Semantic Indexing (LSI)
• Approximate the term-document matrix by a rank-k partial singular value decomposition (PSVD):
  A = U Σ V^T ≈ A_k = U_k Σ_k V_k^T
• U_k, V_k have orthonormal columns; Σ_k is the diagonal matrix of the k largest singular values
• A_k is the closest rank-k approximation to A in the 2-norm
• Captures term relationship information
• k chosen empirically, usually 100-300

Updating methods for PSVD
• Computation of the PSVD is expensive
  ▫ May be reused for many queries if the collection is stable
  ▫ When the collection changes, new information must be added
  ▫ Recomputing the PSVD is very expensive
• Methods of adding new information:
  ▫ Folding-in, updating, folding-up

Folding-in
• Project new documents D into the k-dimensional space, then add to the bottom of V_k:
  D_k = D^T U_k Σ_k^{-1}
  [ A, D ] ≈ U_k Σ_k [ V_k^T, D_k^T ]
• Fast and easy
• Generally corrupts orthogonality of U_k, V_k
• Not recommended if the collection changes often

Updating
• Finds the exact PSVD of [ A_k, D ] to roundoff error
• Uses a smaller QR decomposition and PSVD calculations
• Slower than folding-in, but much more accurate
• Still faster than recomputing, and gives the same result to roundoff error

Folding-up
• Hybrid method of folding-in and updating
• New documents are folded-in until a threshold is reached, then updated with all new documents
• Two threshold methods:
  ▫ Number of documents added reaches a pre-selected percentage of the current term-document matrix
  ▫ Error threshold based on the accumulated loss of orthogonality in V_k

Essential Dimensions of LSI (EDLSI)
• As k approaches the rank r of A, LSI approaches traditional vector-space retrieval
  ▫ But LSI outperforms vector-space retrieval for some collections, even for small k
• Hypothesis: LSI captures term relationship information in the first few dimensions, then continues to add dimensions to capture data from vector-space methods

Essential Dimensions of LSI (EDLSI) (continued)
• Idea: use both vector-space retrieval and LSI with very small k
• Score is a weighted sum of the two scores:
  w = x (q^T A_k) + (1 - x) (q^T A)
• Optimal k with this method usually under 50
• Optimal x small, usually 0.2 or less
• Outperforms LSI (both run time and retrieval performance)

Our experiments
• Combining EDLSI with PSVD updating methods
  ▫ Each provides an improvement over LSI alone; will the combination provide further improvement?
• Collections:
  ▫ Small SMART datasets, used often for LSI
  ▫ Two subsets of TREC AQUAINT, of 15,000 and 30,000 documents (referred to as HARD-1 and HARD-2 respectively)

Our experiments (continued)
• Metrics for evaluation:
  ▫ Precision: number of retrieved relevant documents divided by total number of retrieved documents
  ▫ Recall: number of retrieved relevant documents divided by total number of relevant documents in the dataset
  ▫ 11-point precision: average of precision at 11 standard recall levels (0%, 10%, …, 100%)
  ▫ Mean Average Precision (MAP): average of 11-point precision over all queries

Our experiments (continued)
• Partition each dataset, starting from an initial set of 50% of the documents
• Add the remaining documents incrementally, in increments of 3% of the documents
• For each dataset and each method, determine the optimal k for LSI and the optimal k and x for EDLSI
  ▫ Optimal in terms of MAP
  ▫ If multiple runs give the same MAP, the smallest k and x values are chosen

Our experiments (continued)
• EDLSI: tested k from 5 to 50 for small datasets and 5 to 100 for HARD-1, in increments of 5
• EDLSI: tested x from 0.1 to 0.9 in increments of 0.1
• LSI: tested k from 25 to 200 for small datasets and 25 to 500 for HARD-1, in increments of 25
• For HARD-2, the same parameters as for HARD-1 were used
• Each dataset tested with recomputing, folding-in, updating, and both folding-up methods

Results
• EDLSI generally matched or outperformed LSI in terms of MAP
• EDLSI always outperformed LSI in terms of run time and memory requirements
• EDLSI reaches its optimal MAP at small k, and MAP changes little as k increases beyond that point
• EDLSI MAP does not change much near the optimal x value of 0.1

Results (three figure slides; plots not present in the text)

Conclusions
• EDLSI in combination with PSVD updating techniques provides an improvement in MAP over both EDLSI alone and LSI with PSVD updating
• EDLSI improves on LSI in terms of run time and memory considerations
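
The formulas on the LSI, Folding-in, and EDLSI slides can be tried out on a toy matrix. The following is a minimal sketch in Python with NumPy, not the code used for the experiments: the function names (truncated_svd, fold_in, edlsi_scores), the random toy data, and the choice k = 2 are all assumptions made for illustration, and the PSVD is obtained here by truncating a full dense SVD, which is only sensible at toy sizes.

```python
import numpy as np


def truncated_svd(A, k):
    """Rank-k PSVD of A: returns U_k, s_k, V_k with A ≈ U_k diag(s_k) V_k^T."""
    # Assumption: a full dense SVD followed by truncation, fine for toy sizes only.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :].T


def fold_in(U_k, s_k, V_k, D):
    """Fold new document columns D into the existing k-dimensional space."""
    D_k = (D.T @ U_k) / s_k          # D_k = D^T U_k Σ_k^{-1}
    return np.vstack([V_k, D_k])     # new rows appended to the bottom of V_k


def edlsi_scores(q, A, U_k, s_k, V_k, x=0.2):
    """EDLSI score vector w = x (q^T A_k) + (1 - x) (q^T A)."""
    lsi = ((q @ U_k) * s_k) @ V_k.T  # q^T A_k without forming A_k explicitly
    vsm = q @ A                      # plain vector-space scores q^T A
    return x * lsi + (1 - x) * vsm


# Toy data (hypothetical): 4 terms, 6 initial documents, 2 new documents.
rng = np.random.default_rng(0)
A = rng.random((4, 6))
D = rng.random((4, 2))
q = rng.random(4)

U_k, s_k, V_k = truncated_svd(A, k=2)
w = edlsi_scores(q, A, U_k, s_k, V_k, x=0.2)   # one score per document

V_new = fold_in(U_k, s_k, V_k, D)              # 8 rows: 6 old + 2 folded-in
w_new = edlsi_scores(q, np.hstack([A, D]), U_k, s_k, V_new, x=0.2)

# Folding-in leaves the columns of V_new no longer orthonormal.
orth_loss = np.linalg.norm(V_new.T @ V_new - np.eye(2))
```

In this sketch the scores of the original six documents are unchanged by folding-in, while orth_loss is nonzero. That quantity is the kind of accumulated orthogonality loss an error-based folding-up threshold could monitor, although the exact measure used in the experiments is not stated on the slides.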

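The evaluation metrics listed under "Our experiments" can be sketched as follows. This is an illustrative Python version under one stated assumption: precision at each standard recall level is interpolated (the best precision at any recall at or above that level), which is the usual convention but is not spelled out on the slides. The function names and the toy rankings are hypothetical.

```python
def eleven_point_precision(ranked, relevant):
    """Average interpolated precision at recall 0.0, 0.1, ..., 1.0 for one query."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    points = []  # (recall, precision) after each retrieved document
    hits = 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))
    total = 0.0
    for i in range(11):
        level = i / 10
        # Assumed interpolation: best precision at any recall >= this level.
        candidates = [p for r, p in points if r >= level - 1e-12]
        total += max(candidates, default=0.0)
    return total / 11


def mean_average_precision(runs):
    """MAP as defined on the slides: mean 11-point precision over all queries."""
    return sum(eleven_point_precision(r, rel) for r, rel in runs) / len(runs)


# Toy example (hypothetical document ids): two queries.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # relevant documents at ranks 1 and 3
    (["a", "b"], {"a", "b"}),                  # perfect ranking
]
map_score = mean_average_precision(runs)
```

For the first toy query, precision is 1.0 at recall levels 0% to 50% and 2/3 at levels 60% to 100%, so its 11-point precision is 28/33; the second query scores 1.0.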