Text and Web Search

Text Databases and IR
- Text (document) databases: large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, Web pages, library databases, etc.
- Information retrieval (IR): a field that developed in parallel with database systems. Information is organized into (a large number of) documents; the information retrieval problem is to locate relevant documents based on user input, such as keywords or example documents.

Information Retrieval
- Typical IR systems: online library catalogs, online document management systems.
- Information retrieval vs. database systems: some DB problems are not present in IR (e.g., updates, transaction management, complex objects); some IR problems are not addressed well in a DBMS (e.g., unstructured documents, approximate search using keywords and relevance).

Basic Measures for Text Retrieval
(Venn diagram: the retrieved set and the relevant set overlap inside the set of all documents.)
- Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses):
  precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
- Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved:
  recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|

Information Retrieval Techniques
- Index term (attribute) selection: stop list, word stemming, index-term weighting methods.
- Term-document frequency matrices.
- Information retrieval models: Boolean model, vector model, probabilistic model.

Problem - Motivation
- Given a database of documents, find the documents containing "data", "retrieval".
- Applications: Web, law + patent offices, digital libraries, information filtering.
- Types of queries: boolean ('data' AND 'retrieval' AND NOT ...); additional features ('data' ADJACENT 'retrieval'); keyword queries ('data', 'retrieval').
- How do we search a large collection of documents?

Full-text scanning
- For a single term, the naive algorithm is O(N*M) for a text of length N and a pattern of length M. Example: text ABRACADABRA, pattern CAB; the pattern is slid along the text one position at a time.
- Knuth, Morris and Pratt ('77): build a small FSA and visit every text letter only once, by carefully shifting the pattern more than one step at a time.
- Boyer and Moore ('77): preprocess the pattern; compare from right to left and skip ahead.

Text - Detailed outline
- text problem
- full text scanning
- inversion
- signature files
- clustering
- information filtering and LSI

Text – Inverted Files
(figure: a dictionary of terms, each pointing to a postings list of document ids)
- Q: space overhead? A: mainly, the postings lists.
- How to organize the dictionary? B-trees, hashing, tries, PATRICIA trees, ...
- Stemming – yes or no? Keep only the root of each word (e.g., 'inverted', 'inversion' -> 'invert').
- Insertions?
- Postings lists follow the Zipf distribution, e.g., the rank-frequency plot of the Bible (log(freq) vs. log(rank)): freq ≈ 1 / (rank · ln(1.78 V)), where V is the vocabulary size.
- Postings lists: Cutting+Pedersen (keep the first 4 entries in the B-tree leaves); how to allocate space: geometric progression [Faloutsos+92]; compression (Elias codes) [Zobel+] – down to 2% overhead!
- Conclusions: inversion needs space overhead (2%-300%), but it is the fastest approach (a toy sketch of such an index follows below).
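Inversion is easy to prototype. The following is a minimal sketch (Python; the three-document toy collection and the helper name and_query are made up for illustration) of building postings lists and answering a conjunctive keyword query such as 'data' AND 'retrieval'. A real system would add stop lists, stemming, a B-tree or trie dictionary, and postings compression, as noted above.

```python
# Minimal inverted-file sketch: dictionary -> postings lists (sorted doc ids).
# The toy collection and query terms are made up for illustration.
from collections import defaultdict

docs = {
    1: "data retrieval and data mining",
    2: "query processing in database systems",
    3: "information retrieval with inverted files",
}

index = defaultdict(set)                  # term -> set of doc ids (the postings)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

postings = {t: sorted(ids) for t, ids in index.items()}

def and_query(*terms):
    """Boolean AND: intersect the postings lists of all query terms."""
    result = None
    for t in terms:
        ids = set(postings.get(t, []))
        result = ids if result is None else (result & ids)
    return sorted(result or [])

print(postings["retrieval"])              # [1, 3]
print(and_query("data", "retrieval"))     # [1]
```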
Vector Space Model and Clustering
- Keyword (free-text) queries (vs. Boolean): map each document to a vector (HOW?), map each query to a vector, and search for 'similar' vectors.
- Main idea: each document is a vector of size d, where d is the number of different terms in the database (the vocabulary size).
(figure: a document containing "...data..." is 'indexed' into a vector with one position per vocabulary term, from 'aaron' to 'zoo'.)

Document Vectors
- Documents are represented as "bags of words", and as vectors when used computationally.
- A vector is like an array of floating-point numbers: it has direction and magnitude.
- Each vector holds a place for every term in the collection; therefore, most vectors are sparse.
(example term-document matrix: one row per document A-I, one column per word: nova, galaxy, heat, h'wood, film, role, diet, fur; a blank means 0 occurrences. E.g., "nova" occurs 10 times in text A, "galaxy" 5 times and "heat" 3 times in text A; in text I, "film" occurs 5 times, "diet" once, and "fur" 3 times.)

We Can Plot the Vectors
(figure: documents plotted in a 2-D space with axes "star" and "diet": a document about movie stars, one about astronomy, one about mammal behavior.)

Vector Space Model and Clustering
- Then, group nearby vectors together. Q1: cluster search? Q2: cluster generation?
- Two significant contributions: ranked output and relevance feedback.
- Cluster search: visit the (k) closest superclusters; continue recursively. (figure: two clusters of technical reports, medical TRs and CS TRs, with the query compared against cluster representatives.)
- Ranked output: easy!
- Relevance feedback (a brilliant idea) [Rocchio '73]. How? By adding the 'good' vectors to the query and subtracting the 'bad' ones (a small sketch follows below).
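A small sketch of that query-modification step, in the spirit of Rocchio's method: the query vector is pulled toward the mean of the 'good' (relevant) vectors and pushed away from the mean of the 'bad' ones. The toy vectors and the alpha/beta/gamma weights below are illustrative assumptions, not values from [Rocchio '73].

```python
# Rocchio-style relevance feedback: move the query toward relevant vectors
# and away from non-relevant ones.  alpha/beta/gamma are illustrative.
import numpy as np

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Return the modified query vector."""
    q = alpha * np.asarray(query, dtype=float)
    if len(relevant):
        q += beta * np.mean(relevant, axis=0)
    if len(non_relevant):
        q -= gamma * np.mean(non_relevant, axis=0)
    return np.clip(q, 0.0, None)          # negative weights are usually dropped

# toy 3-term vectors (e.g., terms: data, retrieval, lung)
q0       = np.array([1.0, 1.0, 0.0])
relevant = np.array([[0.9, 0.8, 0.0], [1.0, 0.6, 0.1]])
non_rel  = np.array([[0.0, 0.1, 0.9]])
print(rocchio(q0, relevant, non_rel))     # the query moves toward the relevant docs
```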
Cluster generation
- Problem: given N points in V dimensions, group them (typically k-means or AGNES is used).

Assigning Weights to Terms
- Binary weights; raw term frequency; tf x idf.
- Recall the Zipf distribution: we want to weight terms highly if they are frequent in relevant documents BUT infrequent in the collection as a whole.

Binary Weights
- Only the presence (1) or absence (0) of a term is recorded in the vector:

  docs  D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11
  t1     1  1  0  1  1  1  0  0  0   0   1
  t2     0  0  1  0  1  1  1  1  0   1   0
  t3     1  0  1  0  1  0  0  0  1   1   1

Raw Term Weights
- The frequency of occurrence of the term in each document is recorded in the vector:

  docs  D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11
  t1     2  1  0  3  1  3  0  0  0   0   4
  t2     0  0  4  0  6  5  8 10  0   3   0
  t3     3  0  7  0  3  0  0  0  1   5   1

Assigning Weights
- The tf x idf measure combines term frequency (tf) and inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution.
- Goal: assign a tf x idf weight to each term in each document.

tf x idf
- w_ik = tf_ik * log(N / n_k), where
  T_k = term k,
  tf_ik = frequency of term T_k in document D_i,
  idf_k = log(N / n_k) = inverse document frequency of term T_k in the collection C,
  N = total number of documents in the collection C,
  n_k = number of documents in C that contain T_k.

Inverse Document Frequency
- IDF provides high values for rare words and low values for common words. For a collection of 10,000 documents:
  log(10000 / 10000) = 0
  log(10000 / 5000) = 0.301
  log(10000 / 20) = 2.698
  log(10000 / 1) = 4

Similarity Measures for document vectors
- Simple matching (coordination level match): |Q ∩ D|
- Dice's coefficient: 2 |Q ∩ D| / (|Q| + |D|)
- Jaccard's coefficient: |Q ∩ D| / |Q ∪ D|
- Cosine coefficient: |Q ∩ D| / (|Q|^(1/2) · |D|^(1/2))
- Overlap coefficient: |Q ∩ D| / min(|Q|, |D|)

tf x idf normalization
- Normalize the term weights so that longer documents are not unfairly given more weight ("normalize" usually means forcing all values to fall within a certain range, usually between 0 and 1, inclusive):
  w_ik = tf_ik · log(N / n_k) / sqrt( Σ_{j=1..t} (tf_ij)^2 · [log(N / n_j)]^2 )

Vector space similarity (use the weights to compare the documents)
- Now the similarity of two documents is
  sim(D_i, D_j) = Σ_{k=1..t} w_ik · w_jk,
  which for the normalized weights equals (v_i · v_j) / (||v_i|| ||v_j||). This is also called the cosine, or normalized inner product.

Computing Similarity Scores
- Example: D1 = (0.8, 0.3), D2 = (0.2, 0.7), Q = (0.4, 0.8); then cos θ1 = 0.74 (angle between Q and D1) and cos θ2 = 0.98 (angle between Q and D2).
(figure: the query and document vectors plotted in the unit square, with angles θ1 and θ2.)

Vector Space with Term Weights and Cosine Matching
- With D_i = (d_i1, w_di1; d_i2, w_di2; ...; d_it, w_dit) and Q = (q_1, w_q1; q_2, w_q2; ...; q_t, w_qt):
  sim(Q, D_i) = Σ_{j=1..t} w_qj · w_dij / [ sqrt(Σ_{j=1..t} (w_qj)^2) · sqrt(Σ_{j=1..t} (w_dij)^2) ]
- For Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7):
  sim(Q, D2) = (0.4·0.2 + 0.8·0.7) / sqrt[ (0.4^2 + 0.8^2) · (0.2^2 + 0.7^2) ] = 0.64 / sqrt(0.42) ≈ 0.98
  sim(Q, D1) = (0.4·0.8 + 0.8·0.3) / sqrt[ (0.4^2 + 0.8^2) · (0.8^2 + 0.3^2) ] = 0.56 / sqrt(0.58) ≈ 0.74
  (a small code sketch reproducing these numbers follows below)
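These computations are easy to check numerically. The sketch below (Python/NumPy; the helper names are ours) implements w_ik = tf_ik · log(N / n_k) and the cosine (normalized inner product), and reproduces the IDF and cosine numbers above.

```python
# tf.idf weighting and cosine similarity (normalized inner product).
import numpy as np

def tf_idf(tf, df, n_docs):
    """w_ik = tf_ik * log10(N / n_k), for one document's raw term counts."""
    tf = np.asarray(tf, dtype=float)
    df = np.asarray(df, dtype=float)
    return tf * np.log10(n_docs / df)

def cosine(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# IDF example from above: a collection of 10,000 documents
print(np.log10(10000 / 5000))    # ~0.301
print(np.log10(10000 / 1))       # 4.0

# tf.idf for document D1 of the raw-term-weight table above:
# raw counts (2, 0, 3); each term appears in 6 of the 11 documents.
print(tf_idf([2, 0, 3], [6, 6, 6], n_docs=11))   # ~[0.53, 0.00, 0.79]

# Cosine example from above (term weights already assigned)
Q, D1, D2 = [0.4, 0.8], [0.8, 0.3], [0.2, 0.7]
print(round(cosine(Q, D1), 2))   # 0.73 (rounded to 0.74 in the slide)
print(round(cosine(Q, D2), 2))   # 0.98
```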
Text - Detailed outline
- text databases: the problem
- full text scanning
- inversion
- signature files (a.k.a. Bloom filters)
- vector model and clustering
- information filtering and LSI

Information Filtering + LSI
- [Foltz+ '92] Goal: users specify their interests (= keywords); the system alerts them about suitable new documents.
- Major contribution: LSI = Latent Semantic Indexing, based on latent ('hidden') concepts.
- Main idea: map each document into some 'concepts', and map each term into some 'concepts'.
- 'Concept': roughly, a set of terms with weights, e.g., "data" (0.8), "system" (0.5), "retrieval" (0.6) -> DBMS_concept.
- Pictorially, the term-document matrix (BEFORE):

            data  system  retrieval  lung  ear
  TR1        1      1        1        0     0
  TR2        1      1        1        0     0
  TR3        0      0        0        1     1
  TR4        0      0        0        1     1

- ...becomes a concept-document matrix:

         DBMS-concept  medical-concept
  TR1         1              0
  TR2         1              0
  TR3         0              1
  TR4         0              1

- ...and a concept-term matrix:

              DBMS-concept  medical-concept
  data             1              0
  system           1              0
  retrieval        1              0
  lung             0              1
  ear              0              1

- Q: How to search, e.g., for 'system'? A: find the corresponding concept(s), and then the corresponding documents.
- Thus LSI works like an (automatically constructed) thesaurus: we may retrieve documents that DON'T contain the term 'system' but contain almost everything else ('data', 'retrieval').

SVD
- LSI: find the 'concepts' via the Singular Value Decomposition.

SVD - Definition
- A[n x m] = U[n x r] Λ[r x r] (V[m x r])^T, where
  A: n x m matrix (e.g., n documents, m terms),
  U: n x r matrix (n documents, r concepts),
  Λ: r x r diagonal matrix (the 'strength' of each concept; r is the rank of the matrix),
  V: m x r matrix (m terms, r concepts).

SVD - Example
- A = U Λ V^T for a 7-document, 5-term collection (terms: data, inf., retrieval, brain, lung; the first four documents are CS technical reports, the last three are medical):

  A   = [ 1 1 1 0 0 ; 2 2 2 0 0 ; 1 1 1 0 0 ; 5 5 5 0 0 ; 0 0 0 2 2 ; 0 0 0 3 3 ; 0 0 0 1 1 ]
  U   = [ 0.18 0 ; 0.36 0 ; 0.18 0 ; 0.90 0 ; 0 0.53 ; 0 0.80 ; 0 0.27 ]
  Λ   = diag(9.64, 5.29)
  V^T = [ 0.58 0.58 0.58 0 0 ; 0 0 0 0.71 0.71 ]

- The two concepts are a 'CS-concept' and an 'MD-concept'.
- U is the document-to-concept similarity matrix; the diagonal of Λ gives the 'strength' of each concept (9.64 for the CS-concept, 5.29 for the MD-concept); V is the term-to-concept similarity matrix.

SVD for LSI
- 'Documents', 'terms' and 'concepts': U maps documents to concepts, V maps terms to concepts, and the diagonal elements of Λ give the strength of each concept.
- Do we need to keep all the eigenvectors? NO, just keep the first k (concepts), as checked numerically below.
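The example decomposition can be verified numerically. The sketch below (Python/NumPy, not part of the original slides) computes the SVD of the 7x5 term-document matrix and keeps only the first k = 2 concepts; up to sign, the factors match the U, Λ and V^T shown above.

```python
# SVD of the 7-document x 5-term example matrix; keep the first k = 2 concepts.
import numpy as np

A = np.array([            # columns: data, inf., retrieval, brain, lung
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                    # keep only the strongest concepts
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

print(np.round(s_k, 2))                  # [9.64 5.29]  strength of each concept
print(np.round(U_k, 2))                  # document-to-concept similarities (up to sign)
print(np.round(Vt_k, 2))                 # concept-to-term similarities (up to sign)
print(np.allclose(U_k * s_k @ Vt_k, A))  # True: the rank-2 reconstruction is exact here
```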
Web Search
- What about web search? First you need to collect all the documents of the web: crawlers.
- Then you have to index them (inverted files, etc.).
- Find the web pages that are relevant to the query.
- Report the pages, with their links, in sorted order.
- Main difference from classical IR: web pages have links, so it may be possible to exploit the link structure for ranking the relevant documents.

Kleinberg's Algorithm (HITS)
- Main idea: in many cases, when you search the web with some terms, the most relevant pages may not contain those terms (or contain them only a few times). Examples: 'Harvard' -> www.harvard.edu; 'search engines' -> yahoo, google, altavista.
- Two kinds of useful pages: authorities and hubs.
- Problem definition: given the web and a query, find the most 'authoritative' web pages for this query.
- Step 0: find all pages containing the query terms (the root set).
- Step 1: expand by one move forward and backward along the links (the base set).
- On the resulting graph, give a high score (= 'authority') to nodes that many important nodes point to, and a high importance score (= 'hub') to nodes that point to good 'authorities'.
- Observation: this definition is recursive! Each node i has both an authoritativeness score a_i and a hubness score h_i.
- Let E be the set of edges and A the adjacency matrix: entry (i, j) is 1 if the edge from i to j exists. Let h and a be [n x 1] vectors of the 'hubness' and 'authoritativeness' scores. Then:
  (figure: nodes k, l, m point to node i)
  a_i = h_k + h_l + h_m, that is, a_i = Σ h_j over all j such that the edge (j, i) exists, or a = A^T h.
  (figure: node i points to nodes n, p, q)
  Symmetrically, for the hubness: h_i = a_n + a_p + a_q, that is, h_i = Σ a_j over all j such that the edge (i, j) exists, or h = A a.
- In conclusion, we want vectors h and a such that h = A a and a = A^T h. Start with a and h set to all 1's and apply the following trick:
  h = A a = A (A^T h) = (A A^T) h = ... = (A A^T)^k h, and similarly a = (A^T A)^k a.
- In short, the solutions of h = A a, a = A^T h are the left and right singular vectors of the adjacency matrix A (i.e., eigenvectors of A A^T and A^T A). Starting from a random a' and iterating, we eventually converge. (Q: to which of all the eigenvectors, and why?)
- A: to the ones of the strongest eigenvalue, because of the property (A^T A)^k v' ≈ (constant) v1.
- So we can compute the a and h vectors, and the pages with the highest a values are reported! (A small sketch of this iteration follows below.)

Kleinberg's algorithm - results
- E.g., for the query 'java':
  0.328  www.gamelan.com
  0.251  java.sun.com
  0.190  www.digitalfocus.com ("the java developer")

Kleinberg's algorithm - discussion
- The 'authority' score can be used to find pages 'similar' to a page p.
- Closely related to 'citation analysis' and to social networks / 'small world' phenomena.
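A compact sketch of the iteration h = A a, a = A^T h described above (Python/NumPy, with a made-up 4-page adjacency matrix). The scores are renormalized after every step, so the vectors converge to the strongest eigenvectors of A A^T and A^T A rather than growing without bound.

```python
# HITS: power iteration for hub (h) and authority (a) scores.
import numpy as np

def hits(A, iters=50):
    """A[i, j] = 1 if page i links to page j.  Returns (hubs, authorities)."""
    n = A.shape[0]
    h = np.ones(n)
    a = np.ones(n)
    for _ in range(iters):
        h = A @ a                   # h_i = sum of a_j over out-links (i, j)
        a = A.T @ h                 # a_i = sum of h_j over in-links (j, i)
        h /= np.linalg.norm(h)      # renormalize so the values stay bounded
        a /= np.linalg.norm(a)
    return h, a

# toy base set: pages 0 and 1 act as hubs pointing to pages 2 and 3
A = np.array([[0, 0, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 0, 0],
              [0, 0, 0, 0]], dtype=float)
hubs, auth = hits(A)
print(np.round(auth, 2))            # pages 2 and 3 get the highest authority scores
print(np.round(hubs, 2))            # pages 0 and 1 get the highest hub scores
```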
google / PageRank algorithm
- A closely related idea: the Web is a directed graph of connected nodes. Imagine a particle randomly moving along the edges (*) and compute its steady-state probabilities. That gives the PageRank of each page (the importance of the page).
  (*) with occasional random jumps.

PageRank Definition
- Assume a page A and pages T1, T2, ..., Tm that point to A. Let d be the damping factor, PR(A) the PageRank of A, and C(A) the out-degree of A. Then:
  PR(A) = (1 - d) + d · ( PR(T1)/C(T1) + PR(T2)/C(T2) + ... + PR(Tm)/C(Tm) )
- Computing the PR of each page is essentially the same problem as: given a Markov chain, compute the steady-state probabilities p1 ... p5.
  (figure: an example graph of 5 pages, nodes 1-5.)

Computing PageRank
- Iterative procedure. Equivalently: navigate the web by randomly following links, or (with probability 1 - d) jump to a random page.
- Let A be the adjacency matrix (n x n) and c_i the out-degree of page i. Then the transition probability is
  Prob(A_i -> A_j) = (1 - d) · n^(-1) + d · c_i^(-1) · A_ij,
  and the transition matrix is A'[i, j] = Prob(A_i -> A_j).
- Let A' be this transition matrix (each row sums to 1; in the 5-node example it is simply the adjacency matrix, row-normalized by the out-degrees, so a node with two out-links gives each probability 1/2). The steady-state vector p = (p1, ..., p5) satisfies A'^T p = p.
  (figure: the 5-node example graph with its transition matrix and the equation for p.)
- Thus p is the eigenvector that corresponds to the highest eigenvalue (= 1, since the matrix is row-normalized, i.e., stochastic).

Kleinberg / google - conclusions
- SVD helps in graph analysis:
  hub/authority scores are the strongest left and right singular vectors of the adjacency matrix;
  for a random walk on a graph, the steady-state probabilities are given by the strongest eigenvector of the transition matrix.

References
- Brin, S. and Page, L. (1998). The Anatomy of a Large-Scale Hypertextual Web Search Engine. Proc. 7th International World Wide Web Conference.