Information Retrieval
CSE 8337 (Part B) Spring 2009

Some material for these slides obtained from:
- Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, http://www.sims.berkeley.edu/~hearst/irbook/
- Data Mining: Introductory and Advanced Topics by Margaret H. Dunham, http://www.engr.smu.edu/~mhd/book
- Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze, http://informationretrieval.org

CSE 8337 Outline
- Introduction
- Simple Text Processing
- Boolean Queries
- Web Searching/Crawling
- Indexes
- Vector Space Model
- Matching
- Evaluation

Web Searching TOC
- Web Overview
- Searching
- Ranking
- Crawling

Web Overview
- Size: > 11.5 billion pages (2005)
- Grows at more than 1 million pages a day
- Google indexes over 3 billion documents
- Diverse types of data
- http://www.google.com/support/websearch/bin/topic.py?topic=8996

Web Data
- Web pages
- Intra-page structures
- Inter-page structures
- Usage data
- Supplemental data: profiles, registration information, cookies

Zipf's Law Applied to Web
- Distribution of frequency of occurrence of words in text
- "The frequency of the i-th most frequent word is 1/i^q times that of the most frequent word"
- http://www.nslij-genetics.org/wli/zipf/

Heaps' Law Applied to Web
- Measures the size of the vocabulary in a text of size n: O(n^b)
- b is normally less than 1
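The two laws above give growth formulas without worked numbers. The short Python sketch below plugs in illustrative constants to show what they predict; the Zipf exponent q = 1 and the Heaps parameters k = 44 and b = 0.49 are assumed example values, not measurements from any particular collection.

```python
# Illustrative back-of-the-envelope use of Zipf's and Heaps' laws.
# The constants (q, k, b) are assumed example values, not measured ones.

def zipf_frequency(rank, top_frequency, q=1.0):
    """Zipf: the i-th most frequent word occurs ~ 1/i^q times as often
    as the most frequent word."""
    return top_frequency / (rank ** q)

def heaps_vocabulary(n_tokens, k=44.0, b=0.49):
    """Heaps: vocabulary size grows as O(n^b) with text size n (b < 1)."""
    return k * (n_tokens ** b)

if __name__ == "__main__":
    # If the most frequent word occurs 1,000,000 times, Zipf predicts:
    for i in (1, 2, 10, 100):
        print(f"rank {i:>3}: ~{zipf_frequency(i, 1_000_000):,.0f} occurrences")
    # Heaps' law: vocabulary grows sublinearly with collection size.
    for n in (10**6, 10**8, 10**10):
        print(f"{n:>14,} tokens -> ~{heaps_vocabulary(n):,.0f} distinct terms")
```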
Web search basics
(Diagram: users issue queries to the search engine; a web spider crawls the Web; the indexer builds the document indexes and the ad indexes. Illustrated with a sample results page for the query "miele": three sponsored links for appliance and vacuum retailers alongside roughly 7,310,000 organic results, the top hits being miele.com, miele.co.uk, miele.de, and miele.at.)

How far do people look for results?
(Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)

Users' empirical evaluation of results
- Quality of pages varies widely; relevance is not enough
- Other desirable qualities (non-IR!)
  - Content: trustworthy, diverse, non-duplicated, well maintained
  - Web readability: display correctly and fast
  - No annoyances: pop-ups, etc.
- Precision vs. recall
  - On the web, recall seldom matters
  - What matters: precision at 1? Precision above the fold?
  - Comprehensiveness – must be able to deal with obscure queries
  - Recall matters when the number of matches is very small
- User perceptions may be unscientific, but are significant over a large aggregate

Users' empirical evaluation of engines
- Relevance and validity of results
- UI – simple, no clutter, error tolerant
- Trust – results are objective
- Coverage of topics for polysemic queries
- Pre/post-process tools provided
  - Mitigate user errors (auto spell check, search assist, ...)
  - Explicit: search within results, more like this, refine ...
  - Anticipative: related searches
- Deal with idiosyncrasies
  - Web-specific vocabulary: impact on stemming, spell-check, etc.
  - Web addresses typed in the search box
  - ...

Simplest forms
- First generation engines relied heavily on tf/idf
  - The top-ranked pages for the query maui resort were the ones containing the most maui's and resort's
- SEOs (Search Engine Optimizers) responded with dense repetitions of chosen terms
  - e.g., maui resort maui resort maui resort
  - Often the repetitions would be in the same color as the background of the web page
  - Repeated terms got indexed by crawlers, but were not visible to humans in browsers
- Pure word density cannot be trusted as an IR signal

Term frequency tf
- The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
- Raw term frequency is not what we want:
  - A document with 10 occurrences of the term is more relevant than a document with one occurrence of the term, but not 10 times more relevant.
  - Relevance does not increase proportionally with term frequency.

Log-frequency weighting
- The log frequency weight of term t in d is
  w_{t,d} = 1 + log10(tf_{t,d}) if tf_{t,d} > 0, and 0 otherwise
- 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
- Score for a document-query pair: sum over terms t in both q and d:
  score(q, d) = Σ_{t ∈ q ∩ d} (1 + log tf_{t,d})
- The score is 0 if none of the query terms is present in the document.

Document frequency
- Rare terms are more informative than frequent terms (recall stop words)
- Consider a term in the query that is rare in the collection (e.g., arachnocentric)
- A document containing this term is very likely to be relevant to the query arachnocentric
- → We want a high weight for rare terms like arachnocentric.

Document frequency, continued
- Consider a query term that is frequent in the collection (e.g., high, increase, line)
- For frequent terms, we want positive weights for words like high, increase, and line, but lower weights than for rare terms.
- We will use document frequency (df) to capture this in the score.
- df (≤ N) is the number of documents that contain the term

idf weight
- df_t is the document frequency of t: the number of documents that contain t
  - df_t is an inverse measure of the informativeness of t
- We define the idf (inverse document frequency) of t by
  idf_t = log10(N / df_t)
- We use log(N/df_t) instead of N/df_t to "dampen" the effect of idf.
- It will turn out that the base of the log is immaterial.

idf example, suppose N = 1 million
  term        df_t        idf_t
  calpurnia   1           6
  animal      100         4
  sunday      1,000       3
  fly         10,000      2
  under       100,000     1
  the         1,000,000   0
- There is one idf value for each term t in a collection.
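To make the log-frequency weight and the idf formula concrete, here is a minimal Python sketch that computes w_{t,d} = 1 + log10(tf) and idf_t = log10(N/df_t), reproducing the weight examples and the idf table above for N = 1,000,000.

```python
import math

def log_tf_weight(tf):
    """Log-frequency weight: 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def idf(df, N):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(N / df)

if __name__ == "__main__":
    # 0 -> 0, 1 -> 1, 2 -> 1.3, 10 -> 2, 1000 -> 4
    for tf in (0, 1, 2, 10, 1000):
        print(f"tf={tf:<5} w={log_tf_weight(tf):.1f}")

    # idf example from the slide, N = 1 million
    N = 1_000_000
    for term, df in [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
                     ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
        print(f"{term:<10} df={df:<9} idf={idf(df, N):.0f}")
```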
Collection vs. Document frequency
- The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
- Example:
  Word        Collection frequency   Document frequency
  insurance   10440                  3997
  try         10422                  8760
- Which word is a better search term (and should get a higher weight)?

tf-idf weighting
- The tf-idf weight of a term is the product of its tf weight and its idf weight:
  w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)
- Best known weighting scheme in information retrieval
  - Note: the "-" in tf-idf is a hyphen, not a minus sign!
  - Alternative names: tf.idf, tf x idf, tfidf, tf/idf
- Increases with the number of occurrences within a document
- Increases with the rarity of the term in the collection

Search engine optimization (Spam)
- Motives
  - Commercial, political, religious, lobbies
  - Promotion funded by advertising budget
- Operators
  - Search Engine Optimizers for lobbies, companies
  - Web masters
  - Hosting services
- Forums
  - E.g., Web master world (www.webmasterworld.com)
    - Search engine specific tricks
    - Discussions about academic papers

Cloaking
- Serve fake content to the search engine spider
- DNS cloaking: switch IP address; impersonate
- (Diagram: Is this a search engine spider? Yes → serve the SPAM page; No → serve the real doc)
- How do you identify a spider?

More spam techniques
- Doorway pages
  - Pages optimized for a single keyword that re-direct to the real target page
- Link spamming
  - Mutual admiration societies, hidden links, awards – more on these later
  - Domain flooding: numerous domains that point or re-direct to a target page
- Robots
  - Fake query stream – rank checking programs

The war against spam
- Quality signals – prefer authoritative pages based on:
  - Votes from authors (linkage signals)
  - Votes from users (usage signals)
- Policing of URL submissions
  - Anti-robot test
- Limits on meta-keywords
- Robust link analysis
  - Ignore statistically implausible linkage (or text)
  - Use link analysis to detect spammers (guilt by association)
- Spam recognition by machine learning
  - Training set based on known spam
- Family friendly filters
  - Linguistic analysis, general classification techniques, etc.
  - For images: flesh tone detectors, source text analysis, etc.
- Editorial intervention
  - Blacklists
  - Top queries audited
  - Complaints addressed
  - Suspect pattern detection

More on spam
- Web search engines have policies on SEO practices they tolerate/block
  - http://help.yahoo.com/help/us/ysearch/index.html
  - http://www.google.com/intl/en/webmasters/
- Adversarial IR: the unending (technical) battle between SEOs and web search engines
- Research: http://airweb.cse.lehigh.edu

Ranking
- Order documents based on relevance to the query (similarity measure)
- Ranking has to be performed without accessing the text, just the index
- About ranking algorithms, all information is "top secret"; it is almost impossible to measure recall, as the number of relevant pages can be quite large for simple queries

Ranking
- Some of the new ranking algorithms also use hyperlink information
- An important difference between the Web and normal IR databases is that the number of hyperlinks that point to a page provides a measure of its popularity and quality.
- Links in common between pages often indicate a relationship between those pages.
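Before turning to link-based ranking, here is a small sketch of the term-based baseline: scoring documents with the tf-idf weight defined above, (1 + log10 tf_{t,d}) × log10(N/df_t), summed over the query terms present in each document. The three-document collection and the query are made up for illustration.

```python
import math
from collections import Counter

def tf_idf_score(query_terms, doc_terms, df, N):
    """Sum of (1 + log10 tf) * log10(N/df) over query terms present in the doc."""
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        if tf[t] > 0 and df.get(t, 0) > 0:
            score += (1 + math.log10(tf[t])) * math.log10(N / df[t])
    return score

if __name__ == "__main__":
    docs = {                                   # toy collection (assumed)
        "d1": "maui resort maui beach".split(),
        "d2": "cheap resort deals".split(),
        "d3": "maui weather report".split(),
    }
    N = len(docs)
    # document frequency: number of docs containing each term
    df = Counter(t for terms in docs.values() for t in set(terms))
    query = "maui resort".split()
    for d, terms in docs.items():
        print(d, round(tf_idf_score(query, terms, df, N), 3))
```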
Ranking
- Three examples of ranking techniques based on link analysis:
  - WebQuery
  - HITS (Hub/Authority pages)
  - PageRank

WebQuery
- WebQuery takes a set of Web pages (for example, the answer to a query) and ranks them based on how connected each Web page is
- http://www.cgl.uwaterloo.ca/Projects/Vanish/webquery-1.html

HITS
- Kleinberg's ranking scheme depends on the query and considers the set S of pages that point to, or are pointed to by, pages in the answer
- Pages that have many links pointing to them in S are called authorities
- Pages that have many outgoing links are called hubs
- Better authority pages come from incoming edges from good hubs, and better hub pages come from outgoing edges to good authorities

Ranking
- H(p) = Σ_{u ∈ S : p → u} A(u)
- A(p) = Σ_{v ∈ S : v → p} H(v)

PageRank
- Used in Google
- PageRank simulates a user navigating randomly on the Web who jumps to a random page with probability q, or follows a random hyperlink (on the current page) with probability 1 - q
- This process can be modeled with a Markov chain, from which the stationary probability of being at each page can be computed
- Let C(a) be the number of outgoing links of page a, and suppose that page a is pointed to by pages p1 to pn

PageRank (cont'd)
- PR(p) = c (PR(p1)/N1 + ... + PR(pn)/Nn)
  - PR(pi): PageRank of page pi, which points to the target page p
  - Ni: number of links coming out of page pi

Conclusion
- Nowadays search engines use, basically, Boolean or Vector models and their variations
- Link analysis techniques seem to be the "next generation" of search engines
- Indexes: compression and distributed architecture are keys
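A minimal sketch of the HITS updates above (H(p) summed from the authority scores of the pages p links to, A(p) summed from the hub scores of the pages that link to p), iterated with normalization. The four-page link graph and the fixed iteration count are assumptions for illustration.

```python
def hits(graph, iterations=20):
    """graph: dict page -> list of pages it links to (within the base set S).
    Returns (hub, authority) score dicts."""
    pages = set(graph) | {u for targets in graph.values() for u in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A(p) = sum of H(v) over pages v that link to p
        auth = {p: sum(hub[v] for v in pages if p in graph.get(v, []))
                for p in pages}
        # H(p) = sum of A(u) over pages u that p links to
        hub = {p: sum(auth[u] for u in graph.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded
        a_norm = sum(a * a for a in auth.values()) ** 0.5 or 1.0
        h_norm = sum(h * h for h in hub.values()) ** 0.5 or 1.0
        auth = {p: a / a_norm for p, a in auth.items()}
        hub = {p: h / h_norm for p, h in hub.items()}
    return hub, auth

if __name__ == "__main__":
    # Made-up base set: p1 and p2 act as hubs pointing at p3 and p4.
    graph = {"p1": ["p3", "p4"], "p2": ["p3", "p4"], "p3": ["p4"], "p4": []}
    hub, auth = hits(graph)
    print("hubs:", {p: round(s, 3) for p, s in hub.items()})
    print("authorities:", {p: round(s, 3) for p, s in auth.items()})
```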
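And a sketch of the PageRank iteration. It combines the slide's formula PR(p) = c(PR(p1)/N1 + ... + PR(pn)/Nn) with the random-jump probability q from the previous slide, i.e. PR(p) = q/N + (1 - q) Σ PR(pi)/C(pi). The link graph, q = 0.15, and the iteration count are assumptions, and dangling pages are ignored in this sketch.

```python
def pagerank(graph, q=0.15, iterations=50):
    """graph: dict page -> list of pages it links to.
    PR(p) = q/N + (1 - q) * sum(PR(p_i) / C(p_i)) over pages p_i linking to p,
    where C(p_i) is the out-degree of p_i."""
    pages = set(graph) | {u for targets in graph.values() for u in targets}
    N = len(pages)
    pr = {p: 1.0 / N for p in pages}
    for _ in range(iterations):
        new = {}
        for p in pages:
            incoming = sum(pr[v] / len(graph[v])
                           for v in pages if graph.get(v) and p in graph[v])
            new[p] = q / N + (1 - q) * incoming
        pr = new
    return pr

if __name__ == "__main__":
    # Made-up four-page web; every page here has at least one out-link,
    # so dangling-node handling is not needed in this sketch.
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, score in sorted(pagerank(graph).items(), key=lambda x: -x[1]):
        print(page, round(score, 4))
```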
Crawlers
- A robot (spider) traverses the hypertext structure of the Web
  - Collects information from visited pages
  - Used to construct indexes for search engines
- Traditional Crawler – visits the entire Web (?) and replaces the index
- Periodic Crawler – visits portions of the Web and updates a subset of the index
- Incremental Crawler – selectively searches the Web and incrementally modifies the index
- Focused Crawler – visits pages related to a particular subject

Crawling the Web
- The order in which the URLs are traversed is important
- Using a breadth-first policy, we first look at all the pages linked from the current page, and so on. This matches Web sites that are structured by related topics. On the other hand, the coverage will be wide but shallow, and a Web server can be bombarded with many rapid requests.
- In the depth-first case, we follow the first link of a page and do the same on that page, until we cannot go deeper, returning recursively.
- Good ordering schemes can make a difference if crawling better pages first (PageRank)

Crawling the Web
- Because robots can overwhelm a server with rapid requests and can use significant Internet bandwidth, a set of guidelines for robot behavior has been developed.
- Crawlers can also have problems with HTML pages that use frames or image maps.
- In addition, dynamically generated pages cannot be indexed, and neither can password-protected pages.

Focused Crawler
- Only visit links from a page if that page is determined to be relevant
- Components:
  - Classifier: assigns a relevance score to each page based on the crawl topic
  - Distiller: identifies hub pages
  - Crawler: visits pages based on classifier and distiller scores
- The classifier also determines how useful outgoing links are
- Hub pages contain links to many relevant pages; they must be visited even if they do not have a high relevance score

Focused Crawler
(Architecture diagram)

Basic crawler operation
- Begin with known "seed" pages
- Fetch and parse them
- Extract URLs they point to
- Place the extracted URLs on a queue
- Fetch each URL on the queue and repeat

Crawling picture
(Diagram: seed pages, URLs crawled and parsed, the URL frontier, and the unseen Web)

Simple picture – complications
- Web crawling isn't feasible with one machine
  - All of the above steps must be distributed
- Even non-malicious pages pose challenges
  - Latency/bandwidth to remote servers vary
  - Webmasters' stipulations: how "deep" should you crawl a site's URL hierarchy?
  - Site mirrors and duplicate pages
- Malicious pages
  - Spam pages
  - Spider traps
- Politeness – don't hit a server too often

What any crawler must do
- Be Polite: respect implicit and explicit politeness considerations
  - Only crawl allowed pages
  - Respect robots.txt (more on this shortly)
- Be Robust: be immune to spider traps and other malicious behavior from web servers

What any crawler should do
- Be capable of distributed operation: designed to run on multiple distributed machines
- Be scalable: designed to increase the crawl rate by adding more machines
- Performance/efficiency: permit full use of available processing and network resources

What any crawler should do
- Fetch pages of "higher quality" first
- Continuous operation: continue fetching fresh copies of a previously fetched page
- Extensible: adapt to new data formats, protocols

Updated crawling picture
(Diagram: seed pages, URLs crawled and parsed, the URL frontier, crawling threads, and the unseen Web)

URL frontier
- Can include multiple pages from the same host
- Must avoid trying to fetch them all at the same time
- Must try to keep all crawling threads busy

Explicit and implicit politeness
- Explicit politeness: specifications from webmasters on what portions of a site can be crawled
  - robots.txt
- Implicit politeness: even with no specification, avoid hitting any site too often

Robots.txt
- Protocol for giving spiders ("robots") limited access to a website, originally from 1994
  - www.robotstxt.org/wc/norobots.html
- Website announces its request on what can(not) be crawled
  - For a URL, create a file URL/robots.txt
  - This file specifies access restrictions

Robots.txt example
- No robot should visit any URL starting with "/yoursite/temp/", except the robot called "searchengine":

  User-agent: *
  Disallow: /yoursite/temp/

  User-agent: searchengine
  Disallow:

Processing steps in crawling
- Pick a URL from the frontier (which one?)
- Fetch the document at the URL
- Parse the URL
  - Extract links from it to other docs (URLs)
- Check if the URL has content already seen; if not, add to indexes
- For each extracted URL
  - Ensure it passes certain URL filter tests (e.g., only crawl .edu, obey robots.txt, etc.)
  - Check if it is already in the frontier (duplicate URL elimination)

Basic crawl architecture
(Diagram: URL frontier → fetch (WWW, DNS) → parse → content seen? (doc fingerprints) → URL filter (robots filters) → dup URL elim (URL set) → back into the URL frontier)
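A toy, single-threaded sketch of the basic crawler operation and processing steps described above: a FIFO frontier seeded with known pages, a robots.txt check, fetch, crude link extraction with URL normalization, a content-seen test via fingerprints, and duplicate URL elimination. The seed URL, the regex-based link extraction, and the fixed politeness delay are simplifying assumptions; a real crawler would be distributed and far more careful.

```python
import hashlib
import re
import time
import urllib.request
from collections import deque
from urllib import robotparser
from urllib.parse import urljoin, urlparse

USER_AGENT = "toy-crawler"  # assumed name for this sketch

def allowed(url, robots_cache):
    """Check robots.txt for the URL's host, caching one parser per host."""
    host = "{0}://{1}".format(*urlparse(url)[:2])
    if host not in robots_cache:
        rp = robotparser.RobotFileParser()
        rp.set_url(host + "/robots.txt")
        try:
            rp.read()
        except OSError:
            pass  # unreachable robots.txt: fall back to the parser's default
        robots_cache[host] = rp
    return robots_cache[host].can_fetch(USER_AGENT, url)

def crawl(seeds, max_pages=10, delay=1.0):
    frontier = deque(seeds)          # URL frontier (FIFO = breadth-first)
    seen_urls = set(seeds)           # duplicate URL elimination
    seen_fingerprints = set()        # "content seen?" test
    robots_cache = {}
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        if not allowed(url, robots_cache):
            continue
        try:
            html = urllib.request.urlopen(url, timeout=5).read()
        except OSError:
            continue
        fetched += 1
        fp = hashlib.sha1(html).hexdigest()
        if fp in seen_fingerprints:
            continue                 # duplicate content: do not process further
        seen_fingerprints.add(fp)
        # Crude link extraction; a real crawler would use an HTML parser.
        for href in re.findall(rb'href="([^"]+)"', html):
            link = urljoin(url, href.decode("utf-8", "ignore"))
            if link.startswith("http") and link not in seen_urls:
                seen_urls.add(link)
                frontier.append(link)
        time.sleep(delay)            # implicit politeness: pause between fetches
        print(fetched, url)

if __name__ == "__main__":
    crawl(["http://example.com/"])   # assumed seed page
```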
DNS (Domain Name Server)
- A lookup service on the internet
  - Given a URL (host name), retrieve its IP address
  - Service provided by a distributed set of servers – thus, lookup latencies can be high (even seconds)
- Common OS implementations of DNS lookup are blocking: only one outstanding request at a time
- Solutions
  - DNS caching
  - Batch DNS resolver – collects requests and sends them out together

Parsing: URL normalization
- When a fetched document is parsed, some of the extracted links are relative URLs
- E.g., at http://en.wikipedia.org/wiki/Main_Page we have a relative link to /wiki/Wikipedia:General_disclaimer, which is the same as the absolute URL http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer
- During parsing, we must normalize (expand) such relative URLs

Content seen?
- Duplication is widespread on the web
- If the page just fetched is already in the index, do not process it further
- This is verified using document fingerprints or shingles

Filters and robots.txt
- Filters – regular expressions for URLs to be crawled or not
- Once a robots.txt file is fetched from a site, it need not be fetched repeatedly
  - Doing so burns bandwidth and hits the web server
  - Cache robots.txt files

Duplicate URL elimination
- For a non-continuous (one-shot) crawl, test whether an extracted+filtered URL has already been passed to the frontier
- For a continuous crawl – see the details of the frontier implementation

Distributing the crawler
- Run multiple crawl threads, under different processes – potentially at different nodes
- Geographically distributed nodes
- Partition the hosts being crawled among the nodes
  - Hash used for the partition
- How do these nodes communicate?

URL frontier: two main considerations
- Politeness: do not hit a web server too frequently
- Freshness: crawl some pages more often than others
  - E.g., pages (such as news sites) whose content changes often
- These goals may conflict with each other. (E.g., a simple priority queue fails – many links out of a page go to its own site, creating a burst of accesses to that site.)
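The "Content seen?" step above mentions document fingerprints and shingles. The sketch below shows the shingle idea on two made-up snippets: take every k-word window of each document and compare the resulting sets with Jaccard overlap; k = 4 and the 0.9 near-duplicate threshold are arbitrary choices for illustration (real systems hash the shingles and compare compact sketches rather than the full sets).

```python
def shingles(text, k=4):
    """Set of all k-word windows ("shingles") of the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Jaccard overlap of two shingle sets: |a & b| / |a | b|."""
    return len(a & b) / len(a | b) if a | b else 0.0

def near_duplicates(doc1, doc2, threshold=0.9):
    return jaccard(shingles(doc1), shingles(doc2)) >= threshold

if __name__ == "__main__":
    d1 = "a rose is a rose is a rose"
    d2 = "a rose is a rose is a flower"
    print(round(jaccard(shingles(d1), shingles(d2)), 3))
    print(near_duplicates(d1, d2))
```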
Politeness – challenges
- Even if we restrict only one thread to fetch from a host, it can hit that host repeatedly
- Common heuristic: insert a time gap between successive requests to a host that is >> the time of the most recent fetch from that host

URL frontier: Mercator scheme
(Diagram: incoming URLs → prioritizer → K front queues → biased front queue selector / back queue router → B back queues, a single host on each → back queue selector → crawl thread requesting a URL)

Mercator URL frontier
- URLs flow in from the top into the frontier
- Front queues manage prioritization
- Back queues enforce politeness
- Each queue is FIFO
- http://mercator.comm.nsdlib.org/

Front queues
(Diagram: prioritizer feeding front queues 1..K, drained by the biased front queue selector / back queue router)

Front queues
- The prioritizer assigns to each URL an integer priority between 1 and K
  - Appends the URL to the corresponding queue
- Heuristics for assigning priority
  - Refresh rate sampled from previous crawls
  - Application-specific (e.g., "crawl news sites more often")

Biased front queue selector
- When a back queue requests a URL (in a sequence to be described): picks a front queue from which to pull a URL
- This choice can be round robin biased towards queues of higher priority, or some more sophisticated variant
- Can be randomized

Back queues
(Diagram: biased front queue selector / back queue router feeding back queues 1..B, drained by the back queue selector)

Back queue invariants
- Each back queue is kept non-empty while the crawl is in progress
- Each back queue only contains URLs from a single host
  - Maintain a table from hosts to back queues (host name → back queue number)

Back queue heap
- One entry for each back queue
- The entry is the earliest time t_e at which the host corresponding to the back queue can be hit again
- This earliest time is determined from
  - The last access to that host
  - Any time buffer heuristic we choose

Back queue processing
- A crawler thread seeking a URL to crawl:
  - Extracts the root of the heap
  - Fetches the URL at the head of the corresponding back queue q (looked up from the table)
  - Checks if queue q is now empty – if so, pulls a URL v from the front queues
    - If there's already a back queue for v's host, appends v to that queue and pulls another URL from the front queues; repeat
    - Else adds v to q
  - When q is non-empty, creates a heap entry for it

Number of back queues B
- Keep all threads busy while respecting politeness
- Mercator recommendation: three times as many back queues as crawler threads
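To tie the Mercator pieces together, here is a heavily simplified, single-process sketch of the back-queue mechanism: one FIFO queue per host plus a heap keyed by the earliest time each host may be hit again. The front queues are reduced to a single FIFO here, and the politeness gap is a fixed multiple of an assumed last-fetch time; Mercator itself is considerably more elaborate.

```python
import heapq
import time
from collections import deque
from urllib.parse import urlparse

class SimpleMercatorFrontier:
    """Simplified Mercator-style frontier: politeness via per-host back queues."""

    def __init__(self, gap_factor=10.0):
        self.front = deque()      # stand-in for the K prioritized front queues
        self.back = {}            # host -> deque of URLs (one back queue per host)
        self.heap = []            # (earliest next-hit time, host)
        self.gap_factor = gap_factor

    def add(self, url):
        self.front.append(url)

    def _refill(self):
        """Pull URLs from the front until some host gets a new back queue."""
        while self.front:
            url = self.front.popleft()
            host = urlparse(url).netloc
            if host in self.back:            # existing back queue: just append
                self.back[host].append(url)
            else:                            # new back queue: make it schedulable
                self.back[host] = deque([url])
                heapq.heappush(self.heap, (time.time(), host))
                return

    def get(self, last_fetch_seconds=0.1):
        """Return the next URL to crawl, waiting out the politeness gap."""
        if not self.heap:
            self._refill()
        if not self.heap:
            return None
        next_time, host = heapq.heappop(self.heap)
        wait = next_time - time.time()
        if wait > 0:
            time.sleep(wait)                 # respect the earliest-hit time t_e
        url = self.back[host].popleft()
        if self.back[host]:
            # Re-schedule the host: gap >> time of the most recent fetch.
            heapq.heappush(self.heap,
                           (time.time() + self.gap_factor * last_fetch_seconds,
                            host))
        else:
            del self.back[host]              # empty back queue: refill from front
            self._refill()
        return url

if __name__ == "__main__":
    f = SimpleMercatorFrontier()
    for u in ["http://a.com/1", "http://a.com/2", "http://b.com/1"]:
        f.add(u)
    while (u := f.get()) is not None:
        print(u)
```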