Web search engines
Rooted in Information Retrieval (IR) systems
• Prepare a keyword index for the corpus
• Respond to keyword queries with a ranked list of documents
ARCHIE
• Earliest application of rudimentary IR systems to the Internet
• Title search across sites serving files over FTP

Boolean queries: Examples
Simple queries involving relationships between terms and documents
• Documents containing the word Java
• Documents containing the word Java but not the word coffee
Proximity queries
• Documents containing the phrase Java beans or the term API
• Documents where Java and island occur in the same sentence

Mining the Web (Chakrabarti and Ramakrishnan)

Document preprocessing
Tokenization
• Filtering away tags
• Tokens regarded as nonempty sequences of characters excluding spaces and punctuation
• Each token represented by a suitable integer, tid, typically 32 bits
• Optional: stemming/conflation of words
• Result: document (did) transformed into a sequence of integers (tid, pos)

Storing tokens
Straightforward implementation using a relational database
• Example figure
• Space scales to almost 10 times
Accesses to the table show a common pattern
• Reduce the storage by mapping tids to a lexicographically sorted buffer of (did, pos) tuples
• Indexing = transposing the document-term matrix

Figure: Two variants of the inverted index data structure, usually stored on disk. The simpler version in the middle does not store term offset information; the version to the right stores term offsets. The mapping from terms to documents and positions (written as “document/position”) may be implemented using a B-tree or a hash table.
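
The inverted index with term offsets described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the on-disk structure from the slides: it assumes a plain dict in place of a B-tree or hash table on disk, keys the index by the term string rather than an integer tid, and the helper names (tokenize, build_index, docs_with, phrase_query) are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Filter away tags, lowercase, and split into alphanumeric tokens."""
    text = re.sub(r"<[^>]*>", " ", text)
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each term to {did: [pos, ...]}: an inverted index with offsets."""
    index = defaultdict(dict)
    for did, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(did, []).append(pos)
    return index

def docs_with(index, term):
    """Set of document ids that contain the term."""
    return set(index.get(term, {}))

def phrase_query(index, first, second):
    """Documents where `second` occurs immediately after `first`."""
    hits = set()
    for did in docs_with(index, first) & docs_with(index, second):
        following = set(index[second][did])
        if any(p + 1 in following for p in index[first][did]):
            hits.add(did)
    return hits

docs = {
    1: "<p>Java beans and the Java API</p>",
    2: "Java is an island; coffee grows there.",
    3: "Coffee beans from Java",
}
index = build_index(docs)
# Boolean query: Java but not coffee
java_not_coffee = docs_with(index, "java") - docs_with(index, "coffee")
# Proximity query: the phrase "Java beans"
java_beans = phrase_query(index, "java", "beans")
```

The Boolean query only needs the document sets, so it would also work on the simpler index variant without offsets; the phrase query is what requires the stored positions.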
Storage
For dynamic corpora
• Berkeley DB2 storage manager
• Can frequently add, modify and delete documents
For static collections
• Index compression techniques (to be discussed)

Stopwords
Function words and connectives
• Appear in a large number of documents and are of little use in pinpointing documents
Indexing stopwords
• Stopwords not indexed, to reduce index space and improve performance
• Replace stopwords with a placeholder (to remember the offset)
Issues
• Queries containing only stopwords are ruled out
• Polysemous words that are stopwords in one sense but not in others, e.g., can as a verb vs. can as a noun

Stemming
Conflating words to help match a query term with a morphological variant in the corpus
• Remove inflections that convey parts of speech, tense and number
• E.g.: university and universal both stem to universe
Techniques
• Morphological analysis (e.g., Porter's algorithm)
• Dictionary lookup (e.g., WordNet)
Stemming may increase recall but at the price of precision
• Abbreviations, polysemy and names coined in the technical and commercial sectors
• E.g.: stemming “ides” to “IDE”, “SOCKS” to “sock”, “gated” to “gate” may be bad!

Batch indexing and updates
Incremental indexing
• Time-consuming due to random disk I/O
• High level of disk block fragmentation
Simple sort-merges
• To replace the indexed update of variable-length postings
For a dynamic collection
• A single document-level change may need to update hundreds to thousands of records
• Solution: create an additional “stop-press” index

Figure: Maintaining indices over dynamic collections.
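
The “stop-press” idea can be sketched as a small side table of signed (document, term) records that is consulted together with the main index and periodically merged into it. The sketch below is an illustration under stated assumptions: both indices live in memory, the stop-press table is a dict keyed by (did, term) so that the latest sign wins, a query is a single term, and the function names (apply_update, search, rebuild) are hypothetical.

```python
def apply_update(stop_press, did, old_terms, new_terms):
    """Record a modification as deletion of old terms, then insertion of new ones."""
    for t in old_terms:
        stop_press[(did, t)] = "-"
    for t in new_terms:
        stop_press[(did, t)] = "+"

def search(main_index, stop_press, term):
    """Combine the main index answer with the signed stop-press records."""
    d0 = set(main_index.get(term, ()))
    d_plus = {d for (d, t), s in stop_press.items() if t == term and s == "+"}
    d_minus = {d for (d, t), s in stop_press.items() if t == term and s == "-"}
    return (d0 | d_plus) - d_minus

def rebuild(main_index, stop_press):
    """Merge signed records, visited in (term, did, sign) order, into a new main index."""
    merged = {t: set(ds) for t, ds in main_index.items()}
    ordered = sorted(stop_press.items(), key=lambda kv: (kv[0][1], kv[0][0], kv[1]))
    for (d, t), s in ordered:
        if s == "+":
            merged.setdefault(t, set()).add(d)
        else:
            merged.get(t, set()).discard(d)
    stop_press.clear()  # the stop-press index can be emptied out after the merge
    return merged

main_index = {"java": {1, 2}, "coffee": {2}}
stop_press = {}
# Document 2 is rewritten: it no longer mentions java, and now mentions tea.
apply_update(stop_press, 2, old_terms=["java", "coffee"], new_terms=["coffee", "tea"])
# Document 3 is new.
apply_update(stop_press, 3, old_terms=[], new_terms=["java"])

java_hits = search(main_index, stop_press, "java")
coffee_hits = search(main_index, stop_press, "coffee")
new_index = rebuild(main_index, stop_press)
```

Keeping one sign per (did, term) pair is a design choice of this sketch: it keeps a term that survives a modification from appearing as both deleted and inserted at query time.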
Stop-press index
Collection of documents in flux
• Model document modification as deletion followed by insertion
• Documents in flux represented by a signed record (d, t, s)
• “s” specifies whether “d” has been deleted or inserted
Getting the final answer to a query
• Main index returns a document set D0
• Stop-press index returns two document sets
  D+: documents matching the query that are not yet indexed in D0
  D-: documents matching the query that have been removed from the collection since D0 was constructed
• Final answer: $(D_0 \cup D_+) \setminus D_-$
Stop-press index getting too large
• Rebuild the main index: signed (d, t, s) records are sorted in (t, d, s) order and merge-purged into the master (t, d) records
• Stop-press index can be emptied out

Relevance ranking
Keyword queries
• In natural language
• Not precise, unlike SQL
Boolean decision for response unacceptable
• Solution: rate each document for how likely it is to satisfy the user's information need, sort in decreasing order of the score, and present results in a ranked list
No algorithmic way of ensuring that the ranking strategy always favors the information need
• Query: only a part of the user's information need

Responding to queries
Set-valued response
• Response set may be very large (e.g., by recent estimates, over 12 million Web pages contain the word java)
• Demanding a selective query from the user
• Guessing the user's information need and ranking responses
• Evaluating rankings

Evaluating procedure
Given benchmark
• Corpus of n documents D
• A set of queries Q
• For each query $q \in Q$, an exhaustive set of relevant documents $D_q \subseteq D$ identified manually
Query submitted to the system
• Ranked list of documents $(d_1, d_2, \ldots, d_n)$ retrieved
• Compute a 0/1 relevance list $(r_1, r_2, \ldots, r_n)$, where $r_i = 1$ iff $d_i \in D_q$ and $r_i = 0$ otherwise

Recall and precision
Recall at rank $k \ge 1$
• Fraction of all relevant documents included in $(d_1, d_2, \ldots, d_k)$
• $\mathrm{recall}(k) = \frac{1}{|D_q|} \sum_{1 \le i \le k} r_i$
Precision at rank k
• Fraction of the top k responses that are actually relevant
• $\mathrm{precision}(k) = \frac{1}{k} \sum_{1 \le i \le k} r_i$

Other measures
Average precision
• Sum of precision at each relevant hit position in the response list, divided by the total number of relevant documents
• $\mathrm{avg.precision} = \frac{1}{|D_q|} \sum_{1 \le k \le |D|} r_k \cdot \mathrm{precision}(k)$
• avg.precision = 1 iff the engine retrieves all relevant documents and ranks them ahead of any irrelevant document
Interpolated precision
• To combine precision values from multiple queries
• Gives a precision-vs.-recall curve for the benchmark
• For each query, take the maximum precision obtained for the query for any recall greater than or equal to the given recall level $\rho$, then average these values over all queries

Precision-Recall tradeoff
Interpolated precision cannot increase with recall
• Interpolated precision at recall level 0 may be less than 1
At level k = 0
• Precision (by convention) = 1, Recall = 0
Inspecting more documents
• Can increase recall
• Precision may decrease, since we will start encountering more and more irrelevant documents
A search engine with a good ranking function will generally show a negative relation between recall and precision
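
The evaluation measures above can be computed directly from the 0/1 relevance list. The following sketch assumes the list r is given in rank order and that num_relevant stands for $|D_q|$; the function names and the example relevance vector are made up for illustration.

```python
def recall_at(r, k, num_relevant):
    """Fraction of all relevant documents found in the top k responses."""
    return sum(r[:k]) / num_relevant

def precision_at(r, k):
    """Fraction of the top k responses that are relevant (1 by convention at k = 0)."""
    return sum(r[:k]) / k if k > 0 else 1.0

def average_precision(r, num_relevant):
    """Sum of precision at each relevant hit, divided by the number of relevant documents."""
    hits = (precision_at(r, k) for k in range(1, len(r) + 1) if r[k - 1])
    return sum(hits) / num_relevant

def interpolated_precision(r, num_relevant, rho):
    """Maximum precision over ranks k >= 1 whose recall is at least rho."""
    candidates = [precision_at(r, k) for k in range(1, len(r) + 1)
                  if recall_at(r, k, num_relevant) >= rho]
    return max(candidates, default=0.0)

# Hypothetical response: 5 documents returned, 3 relevant documents exist,
# and only 2 of them were retrieved (at ranks 1 and 3).
r = [1, 0, 1, 0, 0]
num_relevant = 3

ap = average_precision(r, num_relevant)                 # (1/1 + 2/3) / 3
ip_half = interpolated_precision(r, num_relevant, 0.5)  # best precision once recall >= 0.5
ip_full = interpolated_precision(r, num_relevant, 1.0)  # recall 1.0 is never reached
```

Relevant documents that never appear in the response contribute nothing to the sum but still count in the denominator, which is why the average precision here stays below 1 even though the top-ranked hit is relevant.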
Figure: Precision and interpolated precision plotted against recall for the given relevance vector. Missing $r_k$ are zeroes. The higher the curve, the better the engine.

The vector space model
Documents represented as vectors in a multi-dimensional Euclidean space
• Each axis = a term (token)
Coordinate of document d in the direction of term t determined by
• Term frequency TF(d, t): number of times term t occurs in document d, scaled in a variety of ways to normalize document length
• Inverse document frequency IDF(t): scales down the coordinates of terms that occur in many documents

Term frequency
With n(d, t) the number of times term t occurs in document d
• $\mathrm{TF}(d, t) = \frac{n(d, t)}{\sum_{\tau} n(d, \tau)}$
• $\mathrm{TF}(d, t) = \frac{n(d, t)}{\max_{\tau} n(d, \tau)}$
Cornell SMART system uses a smoothed version
• $\mathrm{TF}(d, t) = 0$ if $n(d, t) = 0$
• $\mathrm{TF}(d, t) = 1 + \log(1 + n(d, t))$ otherwise

Inverse document frequency
Given
• D is the document collection and $D_t$ is the set of documents containing t
Formulae
• Mostly dampened functions of $|D| / |D_t|$
• SMART: $\mathrm{IDF}(t) = \log\left(\frac{1 + |D|}{|D_t|}\right)$

Vector space model
Coordinate of document d in axis t
• $d_t = \mathrm{TF}(d, t)\,\mathrm{IDF}(t)$
• Transformed to $\vec{d}$ in the TFIDF-space
Query q
• Interpreted as a document
• Transformed to $\vec{q}$ in the same TFIDF-space as d

Measures of proximity
Distance measure
• Magnitude of the vector difference, $|\vec{d} - \vec{q}|$
• Document vectors must be normalized to unit length; otherwise shorter documents dominate (since queries are short)
Cosine similarity
• Cosine of the angle between $\vec{d}$ and $\vec{q}$
• Shorter documents are penalized

Relevance feedback
Users learning how to modify queries
• Response list must have at least some relevant documents
• Relevance feedback: `correcting' the ranks to the user's taste automates the query refinement process
Rocchio's method
• Folding in user feedback
• To the query vector $\vec{q}$, add a weighted sum of the vectors for relevant documents D+ and subtract a weighted sum of the vectors for irrelevant documents D-
• $\vec{q}\,' = \alpha \vec{q} + \beta \sum_{\vec{d} \in D_+} \vec{d} - \gamma \sum_{\vec{d} \in D_-} \vec{d}$

Relevance feedback (contd.)
Pseudo-relevance feedback
• D+ and D- generated automatically
• E.g.: in the Cornell SMART system, the top 10 documents reported by the first round of query execution are included in D+
• $\gamma$ typically set to 0; D- not used
Not a commonly available feature
• Web users want instant gratification
• System complexity: executing the second-round query is slower and more expensive for major search engines

Bayesian Inferencing
Figure: Bayesian inference network for relevance ranking. A document is relevant to the extent that setting its corresponding belief node to true lets us assign a high degree of belief in the node corresponding to the query.
• Manual specification of mappings between terms to approximate concepts

Bayesian Inferencing (contd.)
Four layers
1. Document layer
2. Representation layer
3. Query concept layer
4. Query
Each node is associated with a random Boolean variable, reflecting belief
Directed arcs signify that the belief of a node is a function of the belief of its immediate parents (and so on)
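
Returning to the vector-space slides above, the TFIDF weighting, cosine ranking and Rocchio update can be sketched together as follows. This is an illustrative sketch under assumptions, not the SMART implementation: vectors are sparse dicts, the three-document corpus is made up, the values of alpha and beta are arbitrary, and gamma defaults to 0 as in the pseudo-relevance setting.

```python
import math
from collections import Counter

def tf(count):
    """Smoothed term frequency: 0 if absent, else 1 + log(1 + n(d, t))."""
    return 1.0 + math.log(1.0 + count) if count > 0 else 0.0

def idf(term, corpus):
    """IDF(t) = log((1 + |D|) / |D_t|); 0 for terms unseen in the corpus."""
    n_t = sum(1 for tokens in corpus if term in tokens)
    return math.log((1.0 + len(corpus)) / n_t) if n_t else 0.0

def tfidf_vector(tokens, corpus):
    """Sparse TFIDF vector {term: TF * IDF} for a document or a query."""
    return {t: tf(c) * idf(t, corpus) for t, c in Counter(tokens).items()}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rocchio(q, relevant, irrelevant, alpha=1.0, beta=0.5, gamma=0.0):
    """q' = alpha * q + beta * sum(D+) - gamma * sum(D-)."""
    new_q = {t: alpha * w for t, w in q.items()}
    for weight, docs in ((beta, relevant), (-gamma, irrelevant)):
        for d in docs:
            for t, w in d.items():
                new_q[t] = new_q.get(t, 0.0) + weight * w
    return new_q

corpus = [
    "java coffee beans".split(),
    "java api beans class".split(),
    "island travel java".split(),
]
vectors = [tfidf_vector(doc, corpus) for doc in corpus]
query = tfidf_vector("java api".split(), corpus)

# Rank documents by decreasing cosine similarity to the query.
ranking = sorted(range(len(corpus)),
                 key=lambda i: cosine(query, vectors[i]), reverse=True)

# Pseudo-relevance feedback: treat the top hit as D+ and leave D- empty.
refined = rocchio(query, relevant=[vectors[ranking[0]]], irrelevant=[])
```

After the update the refined query carries weight on terms such as beans and class that the user never typed, which is how feedback can pull in documents that share vocabulary with the top-ranked hit.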
Bayesian Inferencing systems
Layers 2 & 3 are the same for basic vector-space IR systems
Verity's Search97
• Allows administrators and users to define hierarchies of concepts in files
Estimation of relevance of a document d w.r.t. the query q
• Set the belief of the corresponding node to 1
• Set all other document beliefs to 0
• Compute the belief of the query
• Rank documents in decreasing order of the belief that they induce in the query

Other issues
Spamming
• Adding popular query terms to a page unrelated to those terms
• E.g.: adding “Hawaii vacation rental” to a page about “Internet gambling”
• Little setback due to hyperlink-based ranking
Titles, headings, meta tags and anchor-text
• TFIDF framework treats all terms the same
• Meta search engines: assign weightage to text occurring in tags and meta-tags
• Using anchor-text on pages u which link to v: anchor-text on u offers valuable editorial judgment about v as well

Other issues (contd.)
Including phrases to rank complex queries
• Operators to specify word inclusions and exclusions
• With operators and phrases, queries/documents can no longer be treated as ordinary points in vector space
Dictionary of phrases
• Could be cataloged manually
• Could be derived from the corpus itself using statistical techniques
• Two separate indices: one for single terms and another for phrases

Corpus derived phrase dictionary
Two terms $t_1$ and $t_2$
• Null hypothesis = occurrences of $t_1$ and $t_2$ are independent
• To the extent the pair violates the null hypothesis, it is likely to be a phrase
• Measuring violation with the likelihood ratio of the hypothesis
• Pick phrases that violate the null hypothesis with large confidence
Contingency table built from statistics
• $k_{00} = k(\bar{t}_1, \bar{t}_2)$, $k_{01} = k(\bar{t}_1, t_2)$, $k_{10} = k(t_1, \bar{t}_2)$, $k_{11} = k(t_1, t_2)$

Corpus derived phrase dictionary
Hypotheses
• Unconstrained (alternative) hypothesis: $H(p_{00}, p_{01}, p_{10}, p_{11}; k_{00}, k_{01}, k_{10}, k_{11}) = p_{00}^{k_{00}}\, p_{01}^{k_{01}}\, p_{10}^{k_{10}}\, p_{11}^{k_{11}}$
• Null hypothesis (independence): $H(p_1, p_2; k_{00}, k_{01}, k_{10}, k_{11}) = ((1 - p_1)(1 - p_2))^{k_{00}}\, ((1 - p_1) p_2)^{k_{01}}\, (p_1 (1 - p_2))^{k_{10}}\, (p_1 p_2)^{k_{11}}$
• Likelihood ratio: $\lambda = \frac{\max_{p \in \Pi_0} H(p; k)}{\max_{p \in \Pi} H(p; k)}$, where $\Pi_0$ is the parameter space of the null hypothesis and $\Pi$ is the entire parameter space

Approximate string matching
Non-uniformity of word spellings
• Dialects of English
• Transliteration from other languages
Two ways to reduce this problem
1. Aggressive conflation mechanism to collapse variant spellings into the same token
2. Decompose terms into a sequence of q-grams, or sequences of q characters

Approximate string matching
1. Aggressive conflation mechanism to collapse variant spellings into the same token
• E.g.: Soundex, which takes phonetics and pronunciation details into account
• Used with great success in indexing and searching last names in census and telephone directory data
2. Decompose terms into a sequence of q-grams, or sequences of q characters
• Check for similarity in the q-grams ($2 \le q \le 4$)
• Looking up the inverted index is a two-stage affair: a smaller index of q-grams is consulted to expand each query term into a set of slightly distorted query terms, and these terms are then submitted to the regular index
• Used by Google for spelling correction
• Idea also adopted for eliminating near-duplicate pages

Meta-search systems
Take the search engine to the document
• Forward queries to many geographically distributed repositories
• Each has its own search service
• Consolidate their responses
Advantages
• Perform non-trivial query rewriting
• Suit a single user query to many search engines with different query syntax
• Surprisingly small overlap between crawls
Consolidating responses
• Function goes beyond just eliminating duplicates
• Search services do not provide standard ranks which can be combined meaningfully

Similarity search
Cluster hypothesis
• Documents similar to relevant documents are also likely to be relevant
Handling “find similar” queries
• Replication or duplication of pages
• Mirroring of sites

Document similarity
Jaccard coefficient of similarity between documents $d_1$ and $d_2$
• T(d) = set of tokens in document d
• $r'(d_1, d_2) = \frac{|T(d_1) \cap T(d_2)|}{|T(d_1) \cup T(d_2)|}$
• Symmetric, reflexive, not a metric
• Forgives any number of occurrences and any permutations of the terms
• $1 - r'(d_1, d_2)$ is a metric

Estimating Jaccard coefficient with random permutations
1. Generate a set of m random permutations $\pi$
2. for each $\pi$ do
3.   compute $\pi(T(d_1))$ and $\pi(T(d_2))$
4.   check if $\min \pi(T(d_1)) = \min \pi(T(d_2))$
5. end for
6. if equality was observed in k cases, estimate
$r'(d_1, d_2) = \frac{k}{m}$

Fast similarity search with random permutations
1. for each random permutation $\pi$ do
2.   create a file $f_\pi$
3.   for each document d do
4.     write out $\langle s = \min \pi(T(d)), d \rangle$ to $f_\pi$
5.   end for
6.   sort $f_\pi$ using key s; this results in contiguous blocks with fixed s containing all associated ds
7.   create a file $g_\pi$
8.   for each pair $(d_1, d_2)$ within a run of $f_\pi$ having a given s do
9.     write out a document-pair record $(d_1, d_2)$ to $g_\pi$
10.  end for
11.  sort $g_\pi$ on key $(d_1, d_2)$
12. end for
13. merge $g_\pi$ for all $\pi$ in $(d_1, d_2)$ order, counting the number of $(d_1, d_2)$ entries

Eliminating near-duplicates via shingling
• “Find-similar” algorithm reports all duplicate/near-duplicate pages
Eliminating duplicates
• Maintain a checksum with every page in the corpus
Eliminating near-duplicates
• Represent each document as a set T(d) of q-grams (shingles)
• Find the Jaccard similarity $r(d_1, d_2)$ between $d_1$ and $d_2$
• Eliminate the pair from step 9 if it has similarity above a threshold

Detecting locally similar sub-graphs of the Web
Similarity search and duplicate elimination on the graph structure of the web
• To improve quality of hyperlink-assisted ranking
• Detecting mirrored sites
Approach 1 [Bottom-up Approach]
1. Start process with textual duplicate detection
   • Cleaned URLs are listed and sorted to find duplicates/near-duplicates
   • Each set of equivalent URLs is assigned a unique token ID
   • Each page is stripped of all text, and represented as a sequence of outlink IDs
2. Continue using link sequence representation
3. Until no further collapse of multiple URLs is possible
Approach 2 [Bottom-up Approach]
1. Identify single nodes which are near duplicates (using text-shingling)
2. Extend single-node mirrors to two-node mirrors
3. Continue on to larger and larger graphs which are likely mirrors of one another

Detecting mirrored sites (contd.)
Approach 3 [Step before fetching all pages]
• Uses regularity in URL strings to identify host-pairs which are mirrors
Preprocessing
• Hosts are represented as sets of positional bigrams
• Convert host and path to all lowercase characters
• Let any punctuation or digit sequence be a token separator
• Tokenize the URL into a sequence of tokens (e.g., www6.infoseek.com gives www, infoseek, com)
• Eliminate stop terms such as htm, html, txt, main, index, home, bin, cgi
• Form positional bigrams from the token sequence
Two hosts are said to be mirrors if
• A large fraction of paths are valid on both web sites
• These common paths link to pages that are near-duplicates
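
The URL preprocessing steps for Approach 3 can be sketched as below. Several details are assumptions made for illustration and are not stated on the slides: a positional bigram is encoded as (token, next token, position), only ASCII letters count as token characters, two bigram sets are compared with the Jaccard coefficient, and the example URLs other than www6.infoseek.com are made up.

```python
import re

STOP_TERMS = {"htm", "html", "txt", "main", "index", "home", "bin", "cgi"}

def url_tokens(url):
    """Lowercase, split on punctuation or digit runs, and drop stop terms."""
    tokens = re.split(r"[^a-z]+", url.lower())
    return [t for t in tokens if t and t not in STOP_TERMS]

def positional_bigrams(url):
    """Adjacent token pairs tagged with their position (assumed encoding)."""
    tokens = url_tokens(url)
    return {(a, b, i) for i, (a, b) in enumerate(zip(tokens, tokens[1:]))}

def jaccard(a, b):
    """Jaccard coefficient of two sets; 0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

host_tokens = url_tokens("www6.infoseek.com")

a = positional_bigrams("www6.infoseek.com/Docs/index.html")
b = positional_bigrams("www2.infoseek.com/docs/main.htm")
c = positional_bigrams("www.example.org/docs/")

likely_mirror = jaccard(a, b)   # same tokens in the same positions
unrelated = jaccard(a, c)       # no bigram shared at the same position
```

A high score only marks a host pair as a candidate; the two conditions on the slide (paths valid on both sites, and common paths leading to near-duplicate pages) would still have to be checked against fetched pages.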