Web Searching & Ranking
Zachary G. Ives, University of Pennsylvania
CIS 455/555 – Internet and Web Systems
March 23, 2016
Some content based on slides by Marti Hearst, Ray Larson

Recall Where We Left Off
- We were discussing information retrieval ranking models.
- The Boolean model captures some intuitions of what we want – AND, OR.
- But it's too restrictive, and has no real ranking among the returned answers.

Vector Model
- sim(q, d_j) = cos(θ) = (vec(d_j) · vec(q)) / (|d_j| · |q|) = (Σ_i w_ij · w_iq) / (|d_j| · |q|)
- (Diagram: query q and document d_j as vectors separated by angle θ.)
- Since w_ij > 0 and w_iq > 0, we have 0 ≤ sim(q, d_j) ≤ 1.
- A document is retrieved even if it matches the query terms only partially.

Weights in the Vector Model
- sim(q, d_j) = (Σ_i w_ij · w_iq) / (|d_j| · |q|)
- How do we compute the weights w_ij and w_iq?
- A good weight must take into account two effects:
  - quantification of intra-document content (similarity): the tf factor, the term frequency within a document
  - quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency
- w_ij = tf(i,j) * idf(i)

TF and IDF Factors
- Let:
  - N be the total number of docs in the collection
  - n_i be the number of docs which contain term k_i
  - freq(i,j) be the raw frequency of k_i within d_j
- A normalized tf factor is given by f(i,j) = freq(i,j) / max_l freq(l,j), where the maximum is computed over all terms which occur within the document d_j.
- The idf factor is computed as idf(i) = log(N / n_i). The log is used to make the values of tf and idf comparable; it can also be interpreted as the amount of information associated with the term k_i.

Vector Model Example II
- (Diagram: documents d1–d7 plotted in the space of terms k1, k2, k3.)
- Binary term weights, query q = (1, 2, 3):

          d1  d2  d3  d4  d5  d6  d7
   k1      1   1   0   1   1   1   0
   k2      0   0   1   0   1   1   1
   k3      1   0   1   0   1   0   0
   q·dj    4   1   5   1   6   3   2

Vector Model Example III
- (Diagram: documents d1–d7 plotted in the space of terms k1, k2, k3.)
- Term-frequency weights, query q = (1, 2, 3):

          d1  d2  d3  d4  d5  d6  d7
   k1      2   1   0   2   1   1   0
   k2      0   0   1   0   2   2   5
   k3      1   0   3   0   4   0   0
   q·dj    5   1  11   2  17   5  10

Vector Model, Summarized
- The best term-weighting schemes use tf-idf weights: w_ij = f(i,j) * log(N / n_i)
- For the query term weights, a suggestion is w_iq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N / n_i)
- This model is very good in practice:
  - tf-idf works well with general collections
  - Simple and fast to compute
  - The vector model is usually as good as the known ranking alternatives

Pros & Cons of Vector Model
- Advantages:
  - term weighting improves the quality of the answer set
  - partial matching allows retrieval of docs that approximate the query conditions
  - the cosine ranking formula sorts documents according to their degree of similarity to the query
- Disadvantages:
  - assumes independence of index terms; not clear whether this is a good or bad assumption

Comparison of Classic Models
- The Boolean model does not provide for partial matches and is considered the weakest classic model.
- Experiments indicate that the vector model generally outperforms the third alternative, the probabilistic model.
- Most text search systems therefore use some variation of the vector model.

Switching Our Sights to the Web
- Web information retrieval is more heterogeneous in nature:
  - No editor to control quality
  - Deliberately misleading information ("web spam")
  - Great variety in types of information: phone books, catalogs, technical reports, news, slide shows, …
  - Many languages; partial duplication; jargon
  - Diverse user goals
  - Very short queries: ~2.35 words on average (Aug 2000; Google results)
- And much larger scale!
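As a concrete illustration of the tf-idf weighting and cosine ranking described above, here is a minimal Python sketch (not from the slides). It assumes a toy in-memory corpus of already-tokenized documents; the function names and the example collection are illustrative, not any real search engine's API.

```python
import math
from collections import Counter

def document_frequencies(docs):
    """n_i: the number of documents in which each term i appears."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df

def doc_weights(doc, df, N):
    """w_ij = f(i,j) * log(N / n_i), with f(i,j) = freq(i,j) / max_l freq(l,j)."""
    freq = Counter(doc)
    max_freq = max(freq.values())
    return {t: (freq[t] / max_freq) * math.log(N / df[t]) for t in freq}

def query_weights(query, df, N):
    """w_iq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N / n_i)."""
    freq = Counter(t for t in query if t in df)   # ignore terms unseen in the collection
    if not freq:
        return {}
    max_freq = max(freq.values())
    return {t: (0.5 + 0.5 * freq[t] / max_freq) * math.log(N / df[t]) for t in freq}

def cosine_sim(wq, wd):
    """sim(q, d_j) = (sum_i w_ij * w_iq) / (|d_j| * |q|)."""
    dot = sum(w * wd[t] for t, w in wq.items() if t in wd)
    nq = math.sqrt(sum(w * w for w in wq.values()))
    nd = math.sqrt(sum(w * w for w in wd.values()))
    return dot / (nq * nd) if nq and nd else 0.0

# Rank a toy collection against a query; highest-scoring document first.
docs = [["olympics", "sports", "medal"],
        ["sports", "team", "scores", "sports"],
        ["cooking", "recipes"]]
N, df = len(docs), document_frequencies(docs)
wq = query_weights(["olympics", "sports"], df, N)
print(sorted(((cosine_sim(wq, doc_weights(d, df, N)), i) for i, d in enumerate(docs)),
             reverse=True))
```

Note that a term appearing in every document gets idf = log(N/N) = 0 and therefore contributes nothing to the score, which is exactly the inter-document "separation" effect described above.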
Handling Short Queries & Mixed-Quality Information
- Human processing:
  - Web directories: Yahoo, Open Directory, …
  - Human-created answers: about.com, Search Wikia
  - (Still not clear that automated question-answering works)
- Capitalism: "paid placement" – advertisers pay to be associated with certain keywords
- Clicks / page popularity: the pages visited most often
- Link analysis: use the link structure to determine credibility
- … or a combination of all of these?

Link Analysis for Starting Points: HITS (Kleinberg), PageRank (Google)
- Assumptions:
  - Credible sources will mostly point to credible sources
  - Names of hyperlinks suggest meaning
- Ranking is a function of the query terms and of the hyperlink structure.
- An example of why this makes sense:
  - The official Olympics site will be linked to by most high-quality sites about sports, the Olympics, etc.
  - A spammer who adds "Olympics" to his/her web site probably won't have many links to it.
- Caveat: "search engine optimization"

Google's PageRank (Brin/Page 98)
- Mine the structure of the web graph independently of the query!
- Each web page is a node; each hyperlink is a directed edge.
- Assumes a random walk (surf) through the web:
  - Start at a random page.
  - At each step, the surfer proceeds to a randomly chosen successor of the current page with probability d, and to a randomly chosen web page with probability 1 − d.
- The PageRank of a page p is the fraction of steps the surfer spends at p in the limit.

Link Counts Aren't Everything…
- (Diagram: a small link graph with pages such as the "A-Team" page, Mr. T's Hollywood page, a "Series to Recycle" page, a Cheesy TV Shows page, a Team Sports page, the Yahoo Directory, and Wikipedia – who links to a page matters, not just how many links it receives.)

PageRank
- The importance of page i is governed by the pages linking to it:
  x_i = Σ_{j ∈ B_i} x_j / N_j
  where x_i is the rank of page i, B_i is the set of pages j that link to i, x_j is the rank of page j, and N_j is the number of links out from page j.

Computing PageRank (Simple version)
- Initialize so the total rank sums to 1.0: x_i^(0) = 1/n
- Iterate until convergence: x_i^(k+1) = Σ_{j ∈ B_i} x_j^(k) / N_j

Computing PageRank (Step 0)
- Initialize so the total rank sums to 1.0: x_i^(0) = 1/n
- (Diagram: a three-page example; every page starts with rank 0.33.)

Computing PageRank (Step 1)
- Propagate weights across out-edges: x_i^(k+1) = Σ_{j ∈ B_i} x_j^(k) / N_j
- (Diagram: each page sends its rank, divided by its out-degree, along its out-edges – shares of 0.33 and 0.17 in the example.)

Computing PageRank (Step 2)
- Compute weights based on in-edges: x_i^(1) = Σ_{j ∈ B_i} x_j^(0) / N_j
- (Diagram: the example pages now have ranks 0.50, 0.33, and 0.17.)

Computing PageRank (Convergence)
- x_i^(k+1) = Σ_{j ∈ B_i} x_j^(k) / N_j
- (Diagram: the example converges to ranks 0.40, 0.40, and 0.20.)

Naïve PageRank Algorithm Restated
- Let:
  - N(p) = the number of outgoing links from page p
  - B(p) = the set of pages that link to p (its back-links)
- PageRank(p) = Σ_{b ∈ B(p)} PageRank(b) / N(b)
- Each page b distributes its importance to all of the pages it points to (so we scale by N(b)).
- Page p's importance is increased by the importance of its back set.
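Here is a small Python sketch of this naïve iteration, assuming the graph is given as an adjacency list of out-links; the graph and function name are illustrative, and dead ends and rank sinks are deliberately ignored here (they are handled by the improved formulation later in the slides).

```python
def naive_pagerank(out_links, iterations=50):
    """Iterate x_i = sum over j in B_i of x_j / N_j.
    out_links maps each page to the list of pages it links to;
    every page is assumed to have at least one out-link."""
    pages = list(out_links)
    rank = {p: 1.0 / len(pages) for p in pages}      # total rank sums to 1.0
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for j, successors in out_links.items():
            share = rank[j] / len(successors)        # page j splits its rank over its N_j out-links
            for i in successors:
                new_rank[i] += share                 # page i collects shares from its back-links
        rank = new_rank
    return rank

# A three-page example graph (illustrative, not necessarily the one drawn in the slides).
graph = {"google": ["yahoo", "amazon"],
         "yahoo":  ["amazon"],
         "amazon": ["google", "yahoo"]}
print(naive_pagerank(graph))
```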
In Linear Algebra Terms
- Create an m × m matrix M to capture the links: M(i, j) = 1/n_j if page i is pointed to by page j and page j has n_j outgoing links (and 0 otherwise).
- Initialize all PageRanks to 1, then multiply by M repeatedly until all values converge:

   [PageRank(p1')]       [PageRank(p1)]
   [PageRank(p2')]  = M  [PageRank(p2)]
   [     ...     ]       [     ...    ]
   [PageRank(pm')]       [PageRank(pm)]

- (This computes the principal eigenvector of M via power iteration.)

A Brief Example
- Three pages: Google links to Yahoo and Amazon, Yahoo links to Amazon, and Amazon links to Google and Yahoo:

   [g']   [0    0    0.5]   [g]
   [y'] = [0.5  0    0.5] * [y]
   [a']   [0.5  1    0  ]   [a]

- Running for multiple iterations:
  (g, y, a) = (1, 1, 1), (0.5, 1, 1.5), (0.75, 1, 1.25), …
- The total rank always sums to the number of pages.

Oops #1 – PageRank Sinks: Dead Ends
- Same three pages, but now Yahoo has no outgoing links:

   [g']   [0    0  0.5]   [g]
   [y'] = [0.5  0  0.5] * [y]
   [a']   [0.5  0  0  ]   [a]

- Running for multiple iterations:
  (g, y, a) = (1, 1, 1), (0.5, 1, 0.5), (0.25, 0.5, 0.25), …, (0, 0, 0)
- All of the rank eventually leaks out through the dead end.

Oops #2 – Hogging all the PageRank
- Now Yahoo links only to itself:

   [g']   [0    0  0.5]   [g]
   [y'] = [0.5  1  0.5] * [y]
   [a']   [0.5  0  0  ]   [a]

- Running for multiple iterations:
  (g, y, a) = (1, 1, 1), (0.5, 2, 0.5), (0.25, 2.5, 0.25), …, (0, 3, 0)
- Yahoo ends up with all of the rank.

Improved PageRank
- Remove out-degree-0 nodes (or consider them to refer back to the referrer).
- Add a decay factor to deal with sinks:
  PageRank(p) = d · Σ_{b ∈ B(p)} PageRank(b) / N(b) + (1 − d)
- Intuition, in the idea of the "random surfer": the surfer occasionally stops following the link sequence and jumps to a new random page, with probability 1 − d.

Stopping the Hog
- With d = 0.8, the hog example becomes:

   [g']         [0    0  0.5]   [g]   [0.2]
   [y'] = 0.8 * [0.5  1  0.5] * [y] + [0.2]
   [a']         [0.5  0  0  ]   [a]   [0.2]

- Running for multiple iterations:
  (g, y, a) = (0.6, 1.8, 0.6), (0.44, 2.12, 0.44), (0.38, 2.25, 0.38), (0.35, 2.30, 0.35), …
- … though does this seem right?

Summary of Link Analysis
- Use back-links as a means of adjusting the "worthiness" or "importance" of a page.
- Use an iterative process over matrix/vector values to reach a convergence point.
- PageRank is query-independent and considered relatively stable.
- But it is vulnerable to SEO.

Can We Go Beyond?
- PageRank assumes a "random surfer" who starts at any node, and estimates the likelihood that the surfer will end up at a particular page.
- A more general notion: label propagation.
  - Take a set of start nodes, each with a different label.
  - Estimate, for every node, the distribution of arrivals from each label.
- In essence, this captures the relatedness or influence of nodes.
- Used in YouTube video matching, schema matching, …

Overall Ranking Strategies in Web Search Engines
- Everybody has their own "secret sauce" that uses:
  - the vector model (TF/IDF)
  - proximity of terms
  - where terms appear (title vs. body vs. link)
  - link analysis
  - info from directories
  - page popularity
- gorank.com, a "search engine optimization" site, compares these factors.
- Some alternative approaches:
  - Some newer engines (Vivisimo, Teoma, Clusty) try to do clustering.
  - A few engines (Dogpile, Mamma.com) try to do meta-search.
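To tie the link-analysis part together, here is a small Python sketch of the improved PageRank with the decay factor d, following the formula PageRank(p) = d · Σ_b PageRank(b)/N(b) + (1 − d) used above, so the total rank sums to the number of pages. Dead ends are handled here by spreading their rank over all pages, which is only one of the options mentioned on the "Improved PageRank" slide; the graph below mirrors the "hog" example, and d = 0.8 matches the "Stopping the Hog" slide.

```python
def pagerank(out_links, d=0.8, iterations=100):
    """Damped PageRank: PageRank(p) = d * sum_{b in B(p)} PageRank(b)/N(b) + (1 - d)."""
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 for p in pages}                    # total rank sums to n
    for _ in range(iterations):
        new_rank = {p: 1.0 - d for p in pages}        # the "random jump" share
        for j, successors in out_links.items():
            if not successors:                        # dead end: spread its rank everywhere
                share, targets = rank[j] / n, pages
            else:                                     # otherwise split over its N_j out-links
                share, targets = rank[j] / len(successors), successors
            for i in targets:
                new_rank[i] += d * share
        rank = new_rank
    return rank

# The "hog" graph: Yahoo links only to itself.
hog = {"google": ["yahoo", "amazon"],
       "yahoo":  ["yahoo"],
       "amazon": ["google", "yahoo"]}
print(pagerank(hog))
```

With the damping in place, the ranks settle at roughly 0.33 for Google and Amazon and 2.33 for Yahoo: Yahoo still benefits from its self-link, but it can no longer absorb all of the rank, consistent with the trend of the "Stopping the Hog" iteration above.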