Web Searching & Ranking
Zachary G. Ives, University of Pennsylvania
CIS 455/555 – Internet and Web Systems
March 23, 2016
Some content based on slides by Marti Hearst, Ray Larson

Recall Where We Left Off
- We were discussing information retrieval ranking models.
- The Boolean model captures some intuitions of what we want – AND, OR.
- But it's too restrictive, and has no real ranking among the returned answers.

Vector Model
- sim(q, d_j) = cos(θ) = (vec(d_j) · vec(q)) / (|d_j| · |q|) = (Σ_i w_ij · w_iq) / (|d_j| · |q|)
- (Diagram: query q and document d_j as vectors separated by angle θ.)
- Since w_ij > 0 and w_iq > 0, we have 0 ≤ sim(q, d_j) ≤ 1.
- A document is retrieved even if it matches the query terms only partially.

Weights in the Vector Model
- sim(q, d_j) = (Σ_i w_ij · w_iq) / (|d_j| · |q|)
- How do we compute the weights w_ij and w_iq?
- A good weight must take into account two effects:
  - quantification of intra-document content (similarity): the tf factor, the term frequency within a document
  - quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency
- w_ij = tf(i,j) * idf(i)

TF and IDF Factors
- Let:
  - N be the total number of docs in the collection
  - n_i be the number of docs which contain term k_i
  - freq(i,j) be the raw frequency of k_i within d_j
- A normalized tf factor is given by f(i,j) = freq(i,j) / max_l freq(l,j), where the maximum is computed over all terms which occur within the document d_j.
- The idf factor is computed as idf(i) = log(N / n_i). The log is used to make the values of tf and idf comparable; it can also be interpreted as the amount of information associated with the term k_i.

Vector Model Example II
- (Diagram: documents d1–d7 plotted in the space of terms k1, k2, k3.)
- Binary term weights, query q = (1, 2, 3):

          d1  d2  d3  d4  d5  d6  d7
   k1      1   1   0   1   1   1   0
   k2      0   0   1   0   1   1   1
   k3      1   0   1   0   1   0   0
   q·dj    4   1   5   1   6   3   2

Vector Model Example III
- (Diagram: documents d1–d7 plotted in the space of terms k1, k2, k3.)
- Term-frequency weights, query q = (1, 2, 3):

          d1  d2  d3  d4  d5  d6  d7
   k1      2   1   0   2   1   1   0
   k2      0   0   1   0   2   2   5
   k3      1   0   3   0   4   0   0
   q·dj    5   1  11   2  17   5  10

Vector Model, Summarized
- The best term-weighting schemes use tf-idf weights: w_ij = f(i,j) * log(N / n_i)
- For the query term weights, a suggestion is w_iq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N / n_i)
- This model is very good in practice:
  - tf-idf works well with general collections
  - Simple and fast to compute
  - The vector model is usually as good as the known ranking alternatives

Pros & Cons of Vector Model
- Advantages:
  - term weighting improves the quality of the answer set
  - partial matching allows retrieval of docs that approximate the query conditions
  - the cosine ranking formula sorts documents according to their degree of similarity to the query
- Disadvantages:
  - assumes independence of index terms; not clear whether this is a good or bad assumption

Comparison of Classic Models
- The Boolean model does not provide for partial matches and is considered the weakest classic model.
- Experiments indicate that the vector model generally outperforms the third alternative, the probabilistic model.
- Most text search systems therefore use some variation of the vector model.

Switching Our Sights to the Web
- Web information retrieval is more heterogeneous in nature:
  - No editor to control quality
  - Deliberately misleading information ("web spam")
  - Great variety in types of information: phone books, catalogs, technical reports, news, slide shows, …
  - Many languages; partial duplication; jargon
  - Diverse user goals
  - Very short queries: ~2.35 words on average (Aug 2000; Google results)
- And much larger scale!
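As a concrete illustration of the tf-idf weighting and cosine ranking described above, here is a minimal Python sketch (not from the slides). It assumes a toy in-memory corpus of already-tokenized documents; the function names and the example collection are illustrative, not any real search engine's API.

```python
import math
from collections import Counter

def document_frequencies(docs):
    """n_i: the number of documents in which each term i appears."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df

def doc_weights(doc, df, N):
    """w_ij = f(i,j) * log(N / n_i), with f(i,j) = freq(i,j) / max_l freq(l,j)."""
    freq = Counter(doc)
    max_freq = max(freq.values())
    return {t: (freq[t] / max_freq) * math.log(N / df[t]) for t in freq}

def query_weights(query, df, N):
    """w_iq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N / n_i)."""
    freq = Counter(t for t in query if t in df)   # ignore terms unseen in the collection
    if not freq:
        return {}
    max_freq = max(freq.values())
    return {t: (0.5 + 0.5 * freq[t] / max_freq) * math.log(N / df[t]) for t in freq}

def cosine_sim(wq, wd):
    """sim(q, d_j) = (sum_i w_ij * w_iq) / (|d_j| * |q|)."""
    dot = sum(w * wd[t] for t, w in wq.items() if t in wd)
    nq = math.sqrt(sum(w * w for w in wq.values()))
    nd = math.sqrt(sum(w * w for w in wd.values()))
    return dot / (nq * nd) if nq and nd else 0.0

# Rank a toy collection against a query; highest-scoring document first.
docs = [["olympics", "sports", "medal"],
        ["sports", "team", "scores", "sports"],
        ["cooking", "recipes"]]
N, df = len(docs), document_frequencies(docs)
wq = query_weights(["olympics", "sports"], df, N)
print(sorted(((cosine_sim(wq, doc_weights(d, df, N)), i) for i, d in enumerate(docs)),
             reverse=True))
```

Note that a term appearing in every document gets idf = log(N/N) = 0 and therefore contributes nothing to the score, which is exactly the inter-document "separation" effect described above.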
Handling Short Queries & Mixed-Quality Information
- Human processing:
  - Web directories: Yahoo, Open Directory, …
  - Human-created answers: about.com, Search Wikia
  - (Still not clear that automated question-answering works)
- Capitalism: "paid placement" – advertisers pay to be associated with certain keywords
- Clicks / page popularity: the pages visited most often
- Link analysis: use the link structure to determine credibility
- … or a combination of all of these?

Link Analysis for Starting Points: HITS (Kleinberg), PageRank (Google)
- Assumptions:
  - Credible sources will mostly point to credible sources
  - Names of hyperlinks suggest meaning
- Ranking is a function of the query terms and of the hyperlink structure.
- An example of why this makes sense:
  - The official Olympics site will be linked to by most high-quality sites about sports, the Olympics, etc.
  - A spammer who adds "Olympics" to his/her web site probably won't have many links to it.
- Caveat: "search engine optimization"

Google's PageRank (Brin/Page 98)
- Mine the structure of the web graph independently of the query!
- Each web page is a node; each hyperlink is a directed edge.
- Assumes a random walk (surf) through the web:
  - Start at a random page.
  - At each step, the surfer proceeds to a randomly chosen successor of the current page with probability d, and to a randomly chosen web page with probability 1 − d.
- The PageRank of a page p is the fraction of steps the surfer spends at p in the limit.

Link Counts Aren't Everything…
- (Diagram: a small link graph with pages such as the "A-Team" page, Mr. T's Hollywood page, a "Series to Recycle" page, a Cheesy TV Shows page, a Team Sports page, the Yahoo Directory, and Wikipedia – who links to a page matters, not just how many links it receives.)

PageRank
- The importance of page i is governed by the pages linking to it:
  x_i = Σ_{j ∈ B_i} x_j / N_j
  where x_i is the rank of page i, B_i is the set of pages j that link to i, x_j is the rank of page j, and N_j is the number of links out from page j.

Computing PageRank (Simple version)
- Initialize so the total rank sums to 1.0: x_i^(0) = 1/n
- Iterate until convergence: x_i^(k+1) = Σ_{j ∈ B_i} x_j^(k) / N_j

Computing PageRank (Step 0)
- Initialize so the total rank sums to 1.0: x_i^(0) = 1/n
- (Diagram: a three-page example; every page starts with rank 0.33.)

Computing PageRank (Step 1)
- Propagate weights across out-edges: x_i^(k+1) = Σ_{j ∈ B_i} x_j^(k) / N_j
- (Diagram: each page sends its rank, divided by its out-degree, along its out-edges – shares of 0.33 and 0.17 in the example.)

Computing PageRank (Step 2)
- Compute weights based on in-edges: x_i^(1) = Σ_{j ∈ B_i} x_j^(0) / N_j
- (Diagram: the example pages now have ranks 0.50, 0.33, and 0.17.)

Computing PageRank (Convergence)
- x_i^(k+1) = Σ_{j ∈ B_i} x_j^(k) / N_j
- (Diagram: the example converges to ranks 0.40, 0.40, and 0.20.)

Naïve PageRank Algorithm Restated
- Let:
  - N(p) = the number of outgoing links from page p
  - B(p) = the set of pages that link to p (its back-links)
- PageRank(p) = Σ_{b ∈ B(p)} PageRank(b) / N(b)
- Each page b distributes its importance to all of the pages it points to (so we scale by N(b)).
- Page p's importance is increased by the importance of its back set.
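Here is a small Python sketch of this naïve iteration, assuming the graph is given as an adjacency list of out-links; the graph and function name are illustrative, and dead ends and rank sinks are deliberately ignored here (they are handled by the improved formulation later in the slides).

```python
def naive_pagerank(out_links, iterations=50):
    """Iterate x_i = sum over j in B_i of x_j / N_j.
    out_links maps each page to the list of pages it links to;
    every page is assumed to have at least one out-link."""
    pages = list(out_links)
    rank = {p: 1.0 / len(pages) for p in pages}      # total rank sums to 1.0
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for j, successors in out_links.items():
            share = rank[j] / len(successors)        # page j splits its rank over its N_j out-links
            for i in successors:
                new_rank[i] += share                 # page i collects shares from its back-links
        rank = new_rank
    return rank

# A three-page example graph (illustrative, not necessarily the one drawn in the slides).
graph = {"google": ["yahoo", "amazon"],
         "yahoo":  ["amazon"],
         "amazon": ["google", "yahoo"]}
print(naive_pagerank(graph))
```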
In Linear Algebra Terms
- Create an m × m matrix M to capture the links: M(i, j) = 1/n_j if page i is pointed to by page j and page j has n_j outgoing links (and 0 otherwise).
- Initialize all PageRanks to 1, then multiply by M repeatedly until all values converge:

   [PageRank(p1')]       [PageRank(p1)]
   [PageRank(p2')]  = M  [PageRank(p2)]
   [     ...     ]       [     ...    ]
   [PageRank(pm')]       [PageRank(pm)]

- (This computes the principal eigenvector of M via power iteration.)

A Brief Example
- Three pages: Google links to Yahoo and Amazon, Yahoo links to Amazon, and Amazon links to Google and Yahoo:

   [g']   [0    0    0.5]   [g]
   [y'] = [0.5  0    0.5] * [y]
   [a']   [0.5  1    0  ]   [a]

- Running for multiple iterations:
  (g, y, a) = (1, 1, 1), (0.5, 1, 1.5), (0.75, 1, 1.25), …
- The total rank always sums to the number of pages.

Oops #1 – PageRank Sinks: Dead Ends
- Same three pages, but now Yahoo has no outgoing links:

   [g']   [0    0  0.5]   [g]
   [y'] = [0.5  0  0.5] * [y]
   [a']   [0.5  0  0  ]   [a]

- Running for multiple iterations:
  (g, y, a) = (1, 1, 1), (0.5, 1, 0.5), (0.25, 0.5, 0.25), …, (0, 0, 0)
- All of the rank eventually leaks out through the dead end.

Oops #2 – Hogging all the PageRank
- Now Yahoo links only to itself:

   [g']   [0    0  0.5]   [g]
   [y'] = [0.5  1  0.5] * [y]
   [a']   [0.5  0  0  ]   [a]

- Running for multiple iterations:
  (g, y, a) = (1, 1, 1), (0.5, 2, 0.5), (0.25, 2.5, 0.25), …, (0, 3, 0)
- Yahoo ends up with all of the rank.

Improved PageRank
- Remove out-degree-0 nodes (or consider them to refer back to the referrer).
- Add a decay factor to deal with sinks:
  PageRank(p) = d · Σ_{b ∈ B(p)} PageRank(b) / N(b) + (1 − d)
- Intuition, in the idea of the "random surfer": the surfer occasionally stops following the link sequence and jumps to a new random page, with probability 1 − d.

Stopping the Hog
- With d = 0.8, the hog example becomes:

   [g']         [0    0  0.5]   [g]   [0.2]
   [y'] = 0.8 * [0.5  1  0.5] * [y] + [0.2]
   [a']         [0.5  0  0  ]   [a]   [0.2]

- Running for multiple iterations:
  (g, y, a) = (0.6, 1.8, 0.6), (0.44, 2.12, 0.44), (0.38, 2.25, 0.38), (0.35, 2.30, 0.35), …
- … though does this seem right?

Summary of Link Analysis
- Use back-links as a means of adjusting the "worthiness" or "importance" of a page.
- Use an iterative process over matrix/vector values to reach a convergence point.
- PageRank is query-independent and considered relatively stable.
- But it is vulnerable to SEO.

Can We Go Beyond?
- PageRank assumes a "random surfer" who starts at any node, and estimates the likelihood that the surfer will end up at a particular page.
- A more general notion: label propagation.
  - Take a set of start nodes, each with a different label.
  - Estimate, for every node, the distribution of arrivals from each label.
- In essence, this captures the relatedness or influence of nodes.
- Used in YouTube video matching, schema matching, …

Overall Ranking Strategies in Web Search Engines
- Everybody has their own "secret sauce" that uses:
  - the vector model (TF/IDF)
  - proximity of terms
  - where terms appear (title vs. body vs. link)
  - link analysis
  - info from directories
  - page popularity
- gorank.com, a "search engine optimization" site, compares these factors.
- Some alternative approaches:
  - Some newer engines (Vivisimo, Teoma, Clusty) try to do clustering.
  - A few engines (Dogpile, Mamma.com) try to do meta-search.
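To tie the link-analysis part together, here is a small Python sketch of the improved PageRank with the decay factor d, following the formula PageRank(p) = d · Σ_b PageRank(b)/N(b) + (1 − d) used above, so the total rank sums to the number of pages. Dead ends are handled here by spreading their rank over all pages, which is only one of the options mentioned on the "Improved PageRank" slide; the graph below mirrors the "hog" example, and d = 0.8 matches the "Stopping the Hog" slide.

```python
def pagerank(out_links, d=0.8, iterations=100):
    """Damped PageRank: PageRank(p) = d * sum_{b in B(p)} PageRank(b)/N(b) + (1 - d)."""
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 for p in pages}                    # total rank sums to n
    for _ in range(iterations):
        new_rank = {p: 1.0 - d for p in pages}        # the "random jump" share
        for j, successors in out_links.items():
            if not successors:                        # dead end: spread its rank everywhere
                share, targets = rank[j] / n, pages
            else:                                     # otherwise split over its N_j out-links
                share, targets = rank[j] / len(successors), successors
            for i in targets:
                new_rank[i] += d * share
        rank = new_rank
    return rank

# The "hog" graph: Yahoo links only to itself.
hog = {"google": ["yahoo", "amazon"],
       "yahoo":  ["yahoo"],
       "amazon": ["google", "yahoo"]}
print(pagerank(hog))
```

With the damping in place, the ranks settle at roughly 0.33 for Google and Amazon and 2.33 for Yahoo: Yahoo still benefits from its self-link, but it can no longer absorb all of the rank, consistent with the trend of the "Stopping the Hog" iteration above.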