CIS 430, November 6, 2008
Emily Pitler

Web queries:
◦ Often named entities
◦ 1 or 2 words
◦ Ambiguous meaning
◦ Ambiguous intent

Mei and Church, WSDM 2008

Query length (Beitzel et al., SIGIR 2004; America Online, one week in December 2003):
◦ Popular queries: 1.7 words
◦ Overall: 2.2 words

Query frequencies (Lempel and Moran, WWW 2003; AltaVista, summer 2001):
◦ 7,175,151 queries
◦ 2,657,410 distinct queries
◦ 1,792,104 queries (63.7%) occurred only once
◦ Most popular query: occurred 31,546 times

Saraiva et al., SIGIR 2001; Lempel and Moran, WWW 2003

Ambiguity: American Airlines? or Alcoholics Anonymous?

Clarity score ~ low ambiguity (Cronen-Townsend et al., SIGIR 2002):
◦ Compare a language model over the relevant documents for a query with a language model over all possible documents
◦ The more different these are, the clearer the query is
◦ "programming perl" vs. "the"

Query language model:
  P(w \mid Q) = \sum_{D \in R} P(w \mid D) \, P(D \mid Q)
Collection language model (unigram):
  P(w \mid \text{collection}) = \frac{\sum_{D \in \text{collection}} C_D(w)}{\sum_{D \in \text{collection}} C_D(\text{all words})}

Relative entropy between the two distributions: the cost in bits of coding with Q when the true distribution is P.
  H(P) = -\sum_i P(i) \log P(i)
  D_{KL}(P \| Q) = -\sum_i P(i) \log Q(i) - \Big(-\sum_i P(i) \log P(i)\Big) = \sum_i P(i) \log \frac{P(i)}{Q(i)}

Clarity score:
  \text{Clarity}(Q) = \sum_{w \in V} P(w \mid Q) \log_2 \frac{P(w \mid Q)}{P_{\text{coll}}(w)}

Query types (Broder, SIGIR 2002):
◦ Navigational: "greyhound bus", "compaq"
◦ Informational: "San Francisco", "normocytic anemia"
◦ Transactional: "britney spears lyrics", "download adobe reader"

Link analysis:
◦ The more webpages that point to you, the more important you are
◦ The more important the webpages that point to you are, the more important you are
◦ These intuitions led to PageRank (Page et al., 1998), and PageRank led to…

Example link graph: washingtonpost.com, cnn.com, nytimes.com, mtv.com, vh1.com

Random surfer: assume our surfer is on a page. In the next time step she can either:
◦ Choose a link on the current page uniformly at random, or
◦ Go somewhere else in the web uniformly at random
After a long time, what is the probability she is on a given page?

Pages spread their probability out over their outgoing links:
  P(v) = \sum_{u \in B_v} \frac{P(u)}{\deg(u)}
where B_v is the set of pages that point to v.

The surfer could also "get bored" with probability d and jump somewhere else completely:
  P(v) = \frac{d}{N} + (1 - d) \sum_{u \in B_v} \frac{P(u)}{\deg(u)}
(A small power-iteration sketch of this update appears below, after the precision/recall discussion.)

Uses of PageRank:
◦ Google, obviously
◦ Given objects and links between them, it measures importance
◦ Summarization (Erkan and Radev, 2004): nodes = sentences, edges = thresholded cosine similarity
◦ Research (Mimno and McCallum, 2007): nodes = people, edges = citations
◦ Facebook?

Other features for ranking a page:
◦ Words on the page
◦ Title
◦ Domain
◦ Anchor text: what other sites say when they link to that page

Example:
◦ Title: Ani Nenkova - Home
◦ Domain: www.cis.upenn.edu

Ontology of webpages:
◦ Over 4 million webpages are categorized
◦ Like WordNet for webpages
◦ Search engines use this
◦ Where is www.cis.upenn.edu? Computers → Computer Science → Academic Departments → North America → United States → Pennsylvania

Anchor text:
◦ What OTHER webpages say about your webpage
◦ Very good descriptions of what's on a page
◦ Example: pages that link to www.cis.upenn.edu/~nenkova use "Ani Nenkova" as the anchor text for that page

Why not accuracy? Suppose there are 10,000 documents and 10 of them are relevant. What happens if you decide to return absolutely nothing? 99.9% accuracy.

Standard metrics in Information Retrieval:
◦ Precision: of what you return, how many are relevant?
    \text{Precision} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Retrieved}|}
◦ Recall: of what is relevant, how many do you return?
    \text{Recall} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Relevant}|}

Complications:
◦ Not always a clear-cut binary classification into relevant vs. not relevant
◦ How do you measure recall over the whole web?
◦ How many of the 2.7 billion results will get looked at? Which ones actually need to be good?
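To make the random-surfer update above concrete, here is a minimal Python power-iteration sketch of the slides' formulation P(v) = d/N + (1 - d) Σ_{u∈B_v} P(u)/deg(u), where d is the probability of getting bored and jumping to a random page. The graph, function name, and parameter values below are illustrative assumptions, not from the lecture, and the sketch assumes every page has at least one outgoing link.

```python
# Power iteration for the "bored surfer" PageRank update from the slides:
#   P(v) = d/N + (1 - d) * sum over pages u linking to v of P(u) / deg(u)
# Assumes every page has at least one outgoing link (dangling pages need extra handling).

def pagerank(links, d=0.15, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}        # start from the uniform distribution
    for _ in range(iterations):
        new_rank = {p: d / n for p in pages}  # "get bored" mass, spread uniformly
        for u, outlinks in links.items():
            share = (1 - d) * rank[u] / len(outlinks)
            for v in outlinks:                # u passes its probability along its outgoing links
                new_rank[v] += share
        rank = new_rank
    return rank

# Toy graph loosely inspired by the news/music sites named on the slides.
toy_web = {
    "washingtonpost.com": ["cnn.com", "nytimes.com"],
    "cnn.com": ["nytimes.com"],
    "nytimes.com": ["washingtonpost.com", "cnn.com"],
    "mtv.com": ["vh1.com", "cnn.com"],
    "vh1.com": ["mtv.com"],
}

for page, score in sorted(pagerank(toy_web).items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```

Setting d = 0.15 here corresponds to the usual choice of following a link with probability 0.85.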
Graded relevance: very relevant > somewhat relevant > not relevant. We want the most relevant documents to be ranked first.
  DCG_p = rel_1 + \sum_{i=2}^{p} \frac{rel_i}{\log_2 i}
NDCG = DCG divided by the DCG of the ideal ordering; NDCG ranges from 0 to 1.

Example: proposed ordering with relevance grades 4, 2, 0, 1
◦ DCG = 4 + 2/log2(2) + 0/log2(3) + 1/log2(4) = 6.5
◦ IDCG (ideal ordering 4, 2, 1, 0) = 4 + 2/log2(2) + 1/log2(3) + 0/log2(4) = 6.63
◦ NDCG = 6.5/6.63 = .98
(A short code check of this example appears at the end of these notes.)

Relevance feedback:
◦ Documents are hundreds of words; queries are 1 or 2, often ambiguous, words
◦ It would be much easier to compare documents with documents
◦ How can we turn a query into a document? Just find ONE relevant document, then use that to find more

New Query = Original Query + Terms from Relevant Docs - Terms from Irrelevant Docs
◦ Original query = "train"
◦ Relevant: www.dog-obedience-training-review.com
◦ Irrelevant: http://en.wikipedia.org/wiki/Caboose
◦ New query = train + .3*dog - .2*railroad

Kinds of feedback:
◦ Explicit feedback: ask the user to mark relevant versus irrelevant, or to grade on a scale (like we saw for NDCG)
◦ Implicit feedback: users see the list of top 10 results and click on a few; assume the clicked-on pages were relevant and the rest weren't
◦ Pseudo-relevance feedback: do the search, assume the top results are relevant, repeat

Query logs:
◦ We have query logs for millions of users
◦ "hybrid car" -> "toyota prius" is more likely than "hybrid car" -> "flights to LA"
◦ Find statistically significant pairs of queries (Jones et al., WWW 2006) using a likelihood ratio test:
    H_1: P(q_2 \mid q_1) = P(q_2 \mid \neg q_1)
    H_2: P(q_2 \mid q_1) \neq P(q_2 \mid \neg q_1)
    LLR = -2 \log \frac{L(H_1)}{L(H_2)}

Query clustering:
◦ Make a bipartite graph of queries and URLs
◦ Cluster it (Beeferman and Berger, KDD 2000)
◦ Suggest queries in the same cluster

Personalization:
◦ A lot of ambiguity is removed by knowing who the searcher is
◦ There are lots of Fernando Pereiras; I (Emily Pitler) only know one of them
◦ Location matters: "Thai restaurants" from me means "Thai restaurants Philadelphia, PA"

Mei and Church, WSDM 2008:
◦ H(URL | Q) = H(URL, Q) - H(Q) = 23.88 - 21.14 = 2.74
◦ H(URL | Q, IP) = H(URL, Q, IP) - H(Q, IP) = 27.17 - 26 = 1.17

Powerset is trying to apply NLP to Wikipedia.

Hard problems:
◦ Descriptive searches: for "pictures of mountains", I don't want just any document with the words {"picture", "of", "mountains"}
◦ Link farms: trying to game PageRank
◦ Spelling correction: a huge portion of queries are misspelled
◦ Ambiguity

Concepts from this course used in search: text normalization, documents as vectors, document similarity, log likelihood ratio, relative entropy, precision and recall, tf-idf, machine learning…
◦ Choosing relevant documents/content
◦ Snippets = short summaries
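As a quick check of the NDCG example above (proposed ordering 4, 2, 0, 1), here is a short Python sketch of the DCG and NDCG formulas from the slides. The function names are mine, and the ideal ordering is taken to be the relevance grades sorted in decreasing order, as in the slide's example.

```python
from math import log2

def dcg(relevances):
    # DCG_p = rel_1 + sum_{i=2..p} rel_i / log2(i)
    return relevances[0] + sum(rel / log2(i) for i, rel in enumerate(relevances[1:], start=2))

def ndcg(relevances):
    # Normalize by the DCG of the ideal (sorted) ordering.
    return dcg(relevances) / dcg(sorted(relevances, reverse=True))

proposed = [4, 2, 0, 1]
print(dcg(proposed))                        # 4 + 2/1 + 0 + 1/2 = 6.5
print(dcg(sorted(proposed, reverse=True)))  # ideal ordering 4, 2, 1, 0 -> about 6.63
print(round(ndcg(proposed), 2))             # about 0.98, matching the slide
```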