Search and Retrieval: More on Term Weighting and Document Ranking
Prof. Marti Hearst, SIMS 202, Lecture 22, Fall 1997

Today
- Document ranking: term weights, similarity measures, the vector space model, probabilistic models
- Multi-dimensional spaces
- Clustering

Finding Out About
Three phases, part of an iterative process:
- Asking of a question
- Construction of an answer
- Assessment of the answer

Ranking Algorithms
- Assign weights to the terms in the query.
- Assign weights to the terms in the documents.
- Compare the weighted query terms to the weighted document terms.
- Rank order the results.

[Diagram: the retrieval pipeline — an information need is parsed into a query; the collection's text input is pre-processed and indexed; the query is run against the index and the results are ranked.]

Vector Representation (revisited; see the Salton article in Science)
- Documents and queries are represented as vectors.
- Position 1 corresponds to term 1, position 2 to term 2, ..., position t to term t.
- The weight of the term is stored in each position:
  D_i = (w_{d_{i1}}, w_{d_{i2}}, \ldots, w_{d_{it}})
  Q = (w_{q_1}, w_{q_2}, \ldots, w_{q_t})
  with w = 0 if a term is absent.

Assigning Weights to Terms
- Raw term frequency
- tf x idf
  - Recall the Zipf distribution.
  - We want to weight terms highly if they are frequent in relevant documents BUT infrequent in the collection as a whole.
- Automatically derived thesaurus terms

Assigning Weights
- The tf x idf measure combines:
  - term frequency (tf)
  - inverse document frequency (idf)
- Goal: assign a tf x idf weight to each term in each document.

tf x idf
  w_{ik} = tf_{ik} \cdot \log(N / n_k)
where
- T_k = term k in document D_i
- tf_{ik} = frequency of term T_k in document D_i
- idf_k = inverse document frequency of term T_k in the collection C
- N = total number of documents in the collection C
- n_k = number of documents in C that contain T_k
- idf_k = \log(N / n_k)

tf x idf Normalization
- Normalize the term weights so longer documents are not unfairly given more weight.
- To normalize usually means to force all values to fall within a certain range, usually between 0 and 1, inclusive.
  w_{ik} = \frac{tf_{ik} \cdot \log(N / n_k)}{\sqrt{\sum_{k=1}^{t} (tf_{ik})^2 \, [\log(N / n_k)]^2}}

Vector Space Similarity (use the weights to compare the documents)
Now the similarity of two documents is
  sim(D_i, D_j) = \sum_{k=1}^{t} w_{ik} \, w_{jk}
This is also called the cosine, or normalized inner product. (Normalization was done when weighting the terms.)

Vector Space Similarity Measure
Combine tf x idf into a similarity measure, with
  D_i = (w_{d_{i1}}, w_{d_{i2}}, \ldots, w_{d_{it}}),  Q = (w_{q_1}, w_{q_2}, \ldots, w_{q_t}),  and w = 0 if a term is absent.
If the term weights are normalized:
  sim(Q, D_i) = \sum_{j=1}^{t} w_{q_j} \, w_{d_{ij}}
Otherwise, normalize in the similarity comparison:
  sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{q_j} \, w_{d_{ij}}}{\sqrt{\sum_{j=1}^{t} (w_{q_j})^2} \; \sqrt{\sum_{j=1}^{t} (w_{d_{ij}})^2}}

To Think About
- How does this ranking algorithm behave?
- Make a set of hypothetical documents consisting of terms and their weights.
- Create some hypothetical queries.
- How are the documents ranked, depending on the weights of their terms and of the queries' terms?

Computing Similarity Scores
[Figure: D1 = (0.8, 0.3), D2 = (0.2, 0.7), and Q = (0.4, 0.8) plotted in a two-dimensional term space; \cos\theta_1 \approx 0.74 for the angle between Q and D1, and \cos\theta_2 \approx 0.98 for the angle between Q and D2.]

Computing a Similarity Score
Say we have the query vector Q = (0.4, 0.8) and the document D_2 = (0.2, 0.7). What does their similarity comparison yield?
  sim(Q, D_2) = \frac{(0.4)(0.2) + (0.8)(0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \cdot [(0.2)^2 + (0.7)^2]}} = \frac{0.64}{\sqrt{0.42}} \approx 0.98
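As a concrete illustration of the formulas above, here is a minimal Python sketch (my own, not from the slides) of the cosine measure applied to the example vectors from the figure; the tf_idf and cosine helper names are mine, and the printed values come out close to the cos values shown above.

```python
# Minimal sketch of the cosine (normalized inner product) measure described above,
# applied to the example vectors Q, D1, D2 from the figure.
import math

def tf_idf(tf, N, n_k):
    """Unnormalized weight w_ik = tf_ik * log(N / n_k)."""
    return tf * math.log(N / n_k)

def cosine(u, v):
    """Normalized inner product of two equal-length weight vectors."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norms = math.sqrt(sum(ui * ui for ui in u)) * math.sqrt(sum(vi * vi for vi in v))
    return dot / norms if norms else 0.0

Q, D1, D2 = [0.4, 0.8], [0.8, 0.3], [0.2, 0.7]
print(round(cosine(Q, D1), 2))   # ~0.73; the figure rounds this to 0.74
print(round(cosine(Q, D2), 2))   # ~0.98, matching the worked example above
```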
Other Major Ranking Schemes: Probabilistic Ranking
- Attempts to be more theoretically sound than the vector space (v.s.) model: try to predict the probability of a document's being relevant, given the query.
- There are many, many variations.
- Usually more complicated to compute than v.s.; usually many approximations are required.
- Usually can't beat v.s. reliably using standard evaluation measures.

Other Major Ranking Schemes: Staged Logistic Regression
- A variation on probabilistic ranking.
- Used successfully here at Berkeley in the Cheshire II system.

Staged Logistic Regression
- Pick a set of feature types:
  - x_1 = sum of frequencies of all terms in the query
  - x_2 = sum of frequencies of all query terms in the document
  - x_3 = query length
  - x_4 = document length
  - x_5 = sum of idf's for all terms in the query
- Determine weights c_i that indicate how important each feature type is (use training examples).
- To assign a score to the document, multiply each feature value by its weight and add them up:
  score(D, Q) = \sum_{i=1}^{5} c_i x_i

Multi-Dimensional Space
- Documents exist in a multi-dimensional space. What does this mean?
- Consider a set of objects with features. In what ways can they be grouped? Different shapes, different sizes, different colors.
- The features define an abstract space that the objects can reside in.
- Generalize this to terms in documents: there are more than three kinds of terms!

Text Clustering
Clustering is "the art of finding groups in data." -- Kaufman and Rousseeuw
[Figure: documents plotted against two term axes (Term 1, Term 2); the points fall into visible groups.]

Pair-wise Document Similarity (no normalization, for simplicity)

        nova  galaxy  heat  h'wood  film  role  diet  fur
   A     1      3      1
   B     5      2
   C                           2      1     5
   D                           4      1

How to compute document similarity? With D_1 = (w_{11}, w_{12}, \ldots, w_{1t}) and D_2 = (w_{21}, w_{22}, \ldots, w_{2t}):
  sim(D_1, D_2) = \sum_{i=1}^{t} w_{1i} \, w_{2i}
For the table above:
- sim(A, B) = (1 \cdot 5) + (3 \cdot 2) = 11
- sim(A, C) = 0
- sim(A, D) = 0
- sim(B, C) = 0
- sim(B, D) = 0
- sim(C, D) = (2 \cdot 4) + (1 \cdot 1) = 9

Using Clustering
- Cluster the entire collection; find the cluster centroid that best matches the query.
- This has been explored extensively: it is expensive, and it doesn't work well.

Using Clustering
- Alternative (Scatter/Gather): cluster the top-ranked documents and show cluster summaries to the user.
- Seems useful: experiments show relevant docs tend to end up in the same cluster, and users seem able to interpret and use the cluster summaries some of the time.
- More computationally feasible.

Clustering
- Advantage: see some of the main themes.
- Disadvantage: many of the ways documents could group together are hidden.

Using Clustering
- Another alternative: cluster the entire collection, force the results into a 2D space, and display them graphically to give an overview.
- Looks neat, but hasn't been shown to be useful.
- Kohonen feature maps can be used instead of clustering to produce a display of documents in 2D regions.
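To connect the pair-wise similarity example to the idea of finding groups, here is a small sketch (my own illustration, not part of the lecture): it takes the A-D documents from the table above and merges any two documents whose inner product exceeds a threshold into the same group, a crude single-link clustering; the sim helper and the merge scheme are assumptions for illustration.

```python
# Sketch: group the A-D example documents using the unnormalized inner-product
# similarity from the slides.  Documents whose similarity exceeds the threshold
# are merged into one cluster (a crude single-link scheme, for illustration only).
from itertools import combinations

docs = {
    "A": {"nova": 1, "galaxy": 3, "heat": 1},
    "B": {"nova": 5, "galaxy": 2},
    "C": {"h'wood": 2, "film": 1, "role": 5},
    "D": {"h'wood": 4, "film": 1},
}

def sim(d1, d2):
    """sim(D1, D2) = sum over shared terms of the product of their weights."""
    return sum(w * d2[t] for t, w in d1.items() if t in d2)

threshold = 0
clusters = [{name} for name in docs]          # start with singleton clusters
for a, b in combinations(docs, 2):
    if sim(docs[a], docs[b]) > threshold:     # e.g. sim(A,B) = 11, sim(C,D) = 9
        ca = next(c for c in clusters if a in c)
        cb = next(c for c in clusters if b in c)
        if ca is not cb:                      # merge the two clusters
            ca |= cb
            clusters.remove(cb)

print(clusters)   # A and B end up together, as do C and D; all other pairs have sim = 0
```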
Clustering a Multi-Dimensional Document Space
[Figure: visualization of a clustered multi-dimensional document space; image from Wise et al., 1995.]

Concept "Landscapes" from Kohonen Feature Maps (X. Lin and H. Chen)
[Figure: a Kohonen feature map divided into labeled concept regions such as Disease, Pharmacology, Anatomy, Hospitals, Legal.]

Graphical Depictions of Clusters
Problems:
- Either too many concepts, or too coarse.
- Only one concept per document.
- Hard to view titles.
- Browsing without search.

Another Approach to Term Weighting: Latent Semantic Indexing
Try to find words that are similar in meaning to other words by:
- computing a document-by-term matrix (a matrix is a two-dimensional vector), and
- processing the matrix to pull out the main themes.

Document/Term Matrix

         T_1     T_2    ...   T_t
  D_1   d_{11}  d_{12}  ...  d_{1t}
  D_2   d_{21}  d_{22}  ...  d_{2t}
  ...
  D_n   d_{n1}  d_{n2}  ...  d_{nt}

where d_{ij} is the value of term T_j in document D_i.

Finding Similar Tokens
Using the same document/term matrix, two terms are considered similar if they co-occur often in many documents:
  sim(T_j, T_k) = \sum_{i=1}^{n} d_{ij} \, d_{ik}

Document/Term Matrix
- This approach doesn't work well. Problems: word contexts are too large; polysemy.
- Alternative approaches:
  - Use smaller contexts: machine-readable dictionaries; local syntactic structure.
  - LSI (Latent Semantic Indexing): find the main themes within the matrix.
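As a closing illustration (my own sketch, not from the lecture), the snippet below builds a toy document/term matrix adapted from the A-D example (dropping heat, role, diet, and fur to keep it small), computes all pairwise term similarities at once as M^T M, and then takes a truncated SVD of the matrix, the kind of decomposition LSI uses to pull out the main themes.

```python
# Sketch of the two ideas above on a toy document/term matrix: term-term
# similarity via co-occurrence, and an SVD to extract the dominant "themes".
# The matrix is adapted from the earlier A-D example; the numbers are illustrative.
import numpy as np

terms = ["nova", "galaxy", "h'wood", "film"]
M = np.array([
    [1, 3, 0, 0],   # document A
    [5, 2, 0, 0],   # document B
    [0, 0, 2, 1],   # document C
    [0, 0, 4, 1],   # document D
], dtype=float)

# sim(Tj, Tk) = sum_i d_ij * d_ik, computed for every pair of terms at once:
term_sim = M.T @ M
print(term_sim)             # nova/galaxy co-occur, as do h'wood/film; other pairs are 0

# LSI-style step: keep only the strongest singular vectors ("main themes").
U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
print(np.round(Vt[:k], 2))  # each of the top-k rows mixes the terms of one theme
```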