Lecture 5: Search Engines
20-751 ECOMMERCE TECHNOLOGY FALL 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS

Outline
• Search engines: key tools for ecommerce
  – Buyers and sellers must find each other
• How do they work?
• How much do they index?
• How are hits ordered?
• Can the order be changed?

Search Engines
• Tools for finding information on the Web
  – Problem: “hidden” databases, e.g. New York Times
• Directory
  – A hand-constructed hierarchy of topics (e.g. Yahoo)
• Search engine
  – A machine-constructed index (usually by keyword)
• So many search engines, we now need search engines to find them. Searchenginecollosus.com

Indexing
• Arrangement of data (data structure) to permit fast searching
• Which list is easier to search?
  – Unsorted: sow fox pig eel yak hen ant cat dog hog
  – Sorted: ant cat dog eel fox hen hog pig sow yak
• Sorting helps. Why?
  – Permits binary search. About log2 n probes into a list of n items
    • log2(1 billion) ≈ 30
  – Permits interpolation search. About log2(log2 n) probes on average, assuming roughly uniformly distributed keys
    • log2 log2(1 billion) ≈ 5

Inverted Files
FILE (word positions 1, 10, 20, 30 and 36 mark the start of each line):
• A file is a list of words by position
  – First entry is the word in position 1 (first word)
  – Entry 4562 is the word in position 4562 (4562nd word)
  – Last entry is the last word
• An inverted file is a list of positions by word!
INVERTED FILE (selected entries for the text above):
  a (1, 4, 40)
  entry (11, 20, 31)
  file (2, 38)
  list (5, 41)
  position (9, 16, 26)
  positions (43)
  word (14, 19, 24, 29, 35, 45)
  words (7)
  4562 (21, 27)

Inverted Files for Multiple Documents
[Figure: a LEXICON of words, each entry pointing to WORD INDEX rows that record OCCUR, POS 1, POS 2, ... for a document]
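The inverted-file idea, extended to several documents, might be sketched as follows. This is only an illustration under stated assumptions: the tokenizer (split on whitespace, lowercase, strip punctuation), the function names `tokenize` and `build_inverted_index`, the document IDs, and the second document are all hypothetical. Document 1 is the FILE text from the Inverted Files slide.

```python
from collections import defaultdict
import string


def tokenize(text):
    """Split on whitespace, lowercase, and strip surrounding punctuation."""
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return [w for w in words if w]


def build_inverted_index(docs):
    """Map each word to {doc_id: [1-based word positions in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text), start=1):
            index[word].setdefault(doc_id, []).append(pos)
    return index


# Document 1 is the slide's FILE text; document 2 and both IDs are made up.
DOCS = {
    1: ("A file is a list of words by position "
        "First entry is the word in position 1 (first word) "
        "Entry 4562 is the word in position 4562 (4562nd word) "
        "Last entry is the last word "
        "An inverted file is a list of positions by word!"),
    2: "a list of words",
}

index = build_inverted_index(DOCS)

print(index["file"][1])    # [2, 38]
print(index["list"])       # {1: [5, 41], 2: [2]}
print(len(index["list"]))  # number of documents containing "list": 2
```

A query for a word is then a single dictionary lookup, and the number of occurrences in a document is the length of its position list. A production index would keep the lexicon sorted or hashed on disk rather than in one in-memory dictionary.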
• LEXICON entries (columns WORD, NDOCS, PTR): jezebel (NDOCS 20), jezer (3), jezerit (1), jeziah (1), jeziel (1), jezliah (1), jezoar (1), jezrahliah (1), jezreel (39)
• WORD INDEX entries (columns DOCID, OCCUR, POS 1, POS 2, ...): “jezebel” occurs 6 times in document 34, 3 times in document 44, 4 times in document 56 . . .

Search Engine Architecture
• Spider
  – Crawls the web to find pages. Follows hyperlinks. Never stops
• Indexer
  – Produces data structures for fast searching of all words in the pages
• Retriever
  – Query interface
  – Database lookup to find hits
    • 2 billion documents
    • 4 TB RAM, many terabytes of disk
  – Ranking

Crawlers (Spiders, Bots)
• Retrieve web pages for indexing by search engines
• Start with an initial page P0. Find URLs on P0 and add them to a queue
• When done with P0, pass it to an indexing program, get a page P1 from the queue and repeat
• Can be specialized (e.g. only look for email addresses)
• Issues
  – Which page to look at next? (Special subjects, recency)
  – Avoid overloading a site
  – How deep within a site to go (drill-down)?
  – How frequently to visit pages?

Query Specification
• Boolean
  – AND, OR, NOT, PHRASE “ ”, NEAR ~
  – But keyword query is artificial
• Question-answering (simulated)
  – “Who offers a master’s degree in ecommerce?”
• Date range
• Relevance specification
  – In Altavista, can specify terms by importance (separate from query specification)
• Content
  – multimedia, MP3, .PPT files
• Stemming: eat, eats, eaten, eating, eater, (ate!)

“Advanced” Query Specification
• Multimedia, e.g.
Google
• Date range
• Relevance specification
  – In Altavista, can specify terms by importance (separate from query specification)
• Content
  – multimedia, MP3, .PPT files
• Stemming
• Language
• Search depth (from site’s front page)

Ranking (Scoring) Hits
• Hits must be presented in some order
• What order?
  – Relevance, recency, popularity, reliability?
• Some ranking methods
  – Presence of keywords in title of document
  – Closeness of keywords to start of document
  – Frequency of keyword in document
  – Link popularity (how many pages point to this one)
• Can the user control? Can the page owner control?
• Can you find out what order is used?
• Spamdexing: influencing retrieval ranking by altering a web page. (Puts “spam” in the index)

Google’s PageRank Algorithm
• Assumption: A link in page A to page B is a recommendation of page B by the author of A (we say B is a successor of A)
• The “quality” of a page is related to the number of links that point to it (its in-degree)
• Apply recursively: Quality of a page is related to
  – its in-degree, and to
  – the quality of pages linking to it
• PageRank Algorithm (Brin & Page, 1998)
SOURCE: GOOGLE

Definition of PageRank
• Consider the following infinite random walk (surfing):
  – Initially the surfer is at a random page
  – At each step, the surfer proceeds
    • to a randomly chosen web page with probability d
    • to a randomly chosen successor of the current page with probability 1-d
• The PageRank of a page p is the fraction of steps the surfer spends at p as the number of steps approaches infinity
SOURCE: GOOGLE
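The random-surfer definition can be approximated numerically by repeatedly applying the update PageRank(p) = d/n + (1-d) × (sum, over pages q that link to p, of PageRank(q)/outdegree(q)). The sketch below is illustrative only and is not Google's implementation. The four-page graph, the function name `pagerank`, the fixed iteration count, and the default d = 0.15 are assumptions. Here d is the probability of a random jump, so links are followed 85% of the time. Every page is assumed to have at least one outgoing link.

```python
def pagerank(links, d=0.15, iterations=100):
    """Iterate the PageRank update on a small graph.

    links maps each page to the list of its successors.
    d is the probability of jumping to a random page (assumed value).
    Assumes every page has at least one successor.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}      # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {p: d / n for p in pages}  # random-jump term d/n
        for q, successors in links.items():
            # q passes (1-d) of its rank, split evenly over its out-links
            share = (1 - d) * rank[q] / len(successors)
            for p in successors:
                new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical graph: A links to B and C, B to C, C to A, D to C.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(GRAPH)

for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In this graph page C collects links from A, B and D and ends up with the highest rank, while D, which nothing links to, keeps only the random-jump share d/n. The ranks sum to 1, consistent with PageRank being a probability distribution over pages.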
PageRank Formula
\[ \mathrm{PageRank}(p) = \frac{d}{n} + (1 - d) \sum_{(q,p) \in E} \frac{\mathrm{PageRank}(q)}{\mathrm{outdegree}(q)} \]
where n is the total number of nodes in the graph and E is the set of links (q, p) from page q to page p
• Google uses d ≈ 0.85. Note: in Brin & Page's paper, 0.85 is the probability of following a link, which corresponds to 1 − d in the notation of this slide
• PageRank is a probability distribution over web pages
• The sum of the PageRanks of all pages is 1
SOURCE: GOOGLE

PageRank Example
[Figure: pages A and B each link to page P; A has 4 outgoing links and B has 3]
• PageRank of P is (1-d)*[(PageRank of A)/4 + (PageRank of B)/3] + d/n
• PAGERANK CALCULATOR
SOURCE: GOOGLE

Link Popularity
• How many pages link to this page?
  – on the whole Web
  – in our database?
• www.linkpopularity.com
• Link popularity is used for ranking
  – Many measures
  – Number of links in
  – Weighted number of links in (by weight of referring page)

Search Engine Sizes (Sept. 2, 2003)
[Chart: index size in BILLIONS OF PAGES and SEARCHES/DAY (MILLIONS) per engine; recoverable labels: 250, 80, 18 and “2900 per second!”]
Legend: ATW = AllTheWeb, AV = Altavista, GG = Google, INK = Inktomi, TMA = Teoma
SOURCE: SEARCHENGINEWATCH.COM

Search Engine Usage
[Charts: SHARE BY SEARCH SITE; SHARE BY ENGINE]
SOURCE: SEARCHENGINEWATCH.COM

Search Engines Disjointness
[Chart: four searches, 10 engines, total of 141 hits on March 6, 2002]
SOURCE: SEARCHENGINESHOWDOWN

Search Engine EKG
[Chart: shows activity of the Lycos crawler at one sample site, calafia.com, by number of pages visited during each crawl]
SOURCE: SEARCHENGINEWATCH.COM

Search Engine EKG Comparison
[Chart: crawler activity compared across search engines]
SOURCE: SEARCHENGINEWATCH.COM
Search Engine Differences
• Coverage (number of documents)
• Spidering algorithms (visit SpiderCatcher)
  – Frequency, depth of visits
• Indexing policies
• Search interfaces
• Ranking
• One solution: use a metasearcher (search agent)

Metasearchers
• All the engines operate differently. Different
  – sizes
  – query languages
  – crawling algorithms
  – storage policies (stop words, punctuation, fonts)
  – freshness
  – ranking
• Submit the same query to many engines and collect the results
• Metacrawler

Clustering
• Viewing large numbers of unstructured hits is not useful
• Answer: cluster them
• Vivisimo
• Kartoo
• iBoogie
• SurfWax

Search Spying
• Peeking at queries as they are being submitted
• AllTheWeb
• Metaspy. Spies on Metacrawler
• AskJeeves
• Epicurious (recipes)
• StockCharts.com
• Yahoo buzz index
• Kanoodle
• IQSeek

Time Spent Per Visitor (minutes) by Search Engine, Jan. 2003
[Chart: minutes per visitor for each search site; annotation: “Up 58% in ONE YEAR!”]
Legend: AJ = Ask Jeeves, AOL = America Online, AV = Altavista, ELNK = EarthLink, GG = Google, ISP = InfoSpace, LS = LookSmart, LY = Lycos, MSN = Microsoft, NS = Netscape, OVR = Overture, YH = Yahoo
SOURCE: SEARCHENGINEWATCH.COM

Audience Reach by Search Site, Jan. 2003
[Chart: audience reach for each search site; same abbreviations as the previous chart]
• Audience Reach = % of active surfers visiting during month. Totals exceed 100% because of overlap
SOURCE: SEARCHENGINEWATCH.COM

Robot Exclusion
• You may not want certain pages indexed, yet they must remain viewable by browsers, so you can’t simply protect the directory.
• Some crawlers conform to the Robots Exclusion Protocol. Compliance is voluntary. One way to enforce: firewall
• They look for file robots.txt at highest directory level in domain. If domain is www.ecom.cmu.edu, robots.txt goes in www.ecom.cmu.edu/robots.txt
• A specific document can be shielded from a crawler by adding the line:
  <META NAME="ROBOTS" CONTENT="NOINDEX">

Robots Exclusion Protocol
• Format of robots.txt
  – Two fields. User-agent to specify a robot
  – Disallow to tell the agent what to ignore
• To exclude all robots from a server:
  User-agent: *
  Disallow: /
• To exclude one robot from two directories:
  User-agent: WebCrawler
  Disallow: /news/
  Disallow: /tmp/
• View the robots.txt specification.

Key Takeaways
• Engines are a critical Web resource
• Very sophisticated, high technology
• They don’t cover the Web completely
• Spamdexing is a problem
• New paradigms needed as Web grows
• What about images, music, video?
  – www.corbis.com, Google images

Q&A
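As a follow-up to the Robots Exclusion Protocol slide, a well-behaved crawler could check the two robots.txt examples with Python's standard `urllib.robotparser` module. This is a minimal sketch, not a full crawler: the page paths (`/news/today.html`, `/index.html`) and the second robot name `OtherBot` are hypothetical, and the rules are parsed from strings rather than fetched from a live server.

```python
from urllib.robotparser import RobotFileParser

# Rules from the slide: exclude one robot from two directories.
ONE_ROBOT_RULES = """\
User-agent: WebCrawler
Disallow: /news/
Disallow: /tmp/
"""

# Rules from the slide: exclude all robots from the whole server.
ALL_ROBOTS_RULES = """\
User-agent: *
Disallow: /
"""


def load_rules(text):
    """Build a parser from robots.txt content given as a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


SITE = "http://www.ecom.cmu.edu"  # domain from the slide; paths are made up
one_robot = load_rules(ONE_ROBOT_RULES)
all_robots = load_rules(ALL_ROBOTS_RULES)

# WebCrawler is barred from /news/ but may fetch other pages.
print(one_robot.can_fetch("WebCrawler", SITE + "/news/today.html"))
print(one_robot.can_fetch("WebCrawler", SITE + "/index.html"))
# A robot not named in the file is unaffected by these rules.
print(one_robot.can_fetch("OtherBot", SITE + "/news/today.html"))
# With "User-agent: *" and "Disallow: /", every robot is excluded.
print(all_robots.can_fetch("OtherBot", SITE + "/index.html"))
```

The check only reports what the file asks for. As the slide notes, compliance is voluntary, so a crawler that skips this step is stopped only by server-side enforcement such as a firewall.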