Behind the Scenes at a Search Engine William Denton <wdenton@yorku.ca> Web Librarian, York University Libraries 20 March 2008 http://www.library.yorku.ca/binaries/frontiers/20080320-denton-search-engine.ppt To be covered The three basic parts of a web search Search engine optimization Advertising, and how to avoid it Library databases and the deep web Ask questions any time. Denton: Search Engines / 20 March 2008 / York 2 Google, Yahoo, Ask, Live, A9 www.google.com: australopithecus search.yahoo.com: australopithecus www.ask.com: australopithecus search.live.com: australopithecus a9.com: australopithecus Denton: Search Engines / 20 March 2008 / York 3 The computing power, bandwidth, and electricity use is mind-blowing David F. Carr, How Google Works Urs Hölze talk on Google’s Linux cluster (2002) Ginger Strand, Keyword: Evil (Harper’s, March 2008) Denton: Search Engines / 20 March 2008 / York 4 What happens when you search? You enter in some words You get good links And sometimes other good stuff Denton: Search Engines / 20 March 2008 / York 5 Three things 1. How does it know about everything? Denton: Search Engines / 20 March 2008 / York 6 Three things 2. How does it decide what’s relevant? Denton: Search Engines / 20 March 2008 / York 7 Three things 3. How does it serve you the results? Denton: Search Engines / 20 March 2008 / York 8 1. How does it know about everything? Crawlers are continually Different search engines moving around the crawl different web, looking for numbers of pages, but whatever they can find. they all do in the billions. Denton: Search Engines / 20 March 2008 / York 9 A visit from a Googlebot 66.249.73.229 - [09/Mar/2008:04:15:47 -0400] "GET /ccm/jsp/homepage.jsp HTTP/1.1" 200 13670 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; http://www.google.com/bot.html) - -" Denton: Search Engines / 20 March 2008 / York 10 robots.txt Polite web crawlers first check the robots.txt file on a web site and obey its rules. Wikipedia’s robots.txt York’s robots.txt www.robotstxt.org Denton: Search Engines / 20 March 2008 / York 11 Crawl, download, scan, repeat Crawlers crawl and crawl and download all the pages (and Word files and PDFs, and spreadsheets) they come across. They scan the page for links and add them to their list of pages to crawl, ad infinitum. Denton: Search Engines / 20 March 2008 / York 12 2. How does it decide what’s relevant? What’s on the page Inbound and outbound links (PageRank) Frequency of updates The kind of site it is Clickthroughs and usage analysis People tweaking the rules Secret other stuff Denton: Search Engines / 20 March 2008 / York 13 Term frequency How many times a word appears in a document Denton: Search Engines / 20 March 2008 / York 14 Inverse document frequency Number of documents / number of documents containing the term (Actually the logarithm of this.) Denton: Search Engines / 20 March 2008 / York 15 TF-IDF TF-IDF of a keyword in a page = TF * IDF Denton: Search Engines / 20 March 2008 / York 16 Example 100 web pages. Keyword: naillie #1 has 8 mentions. TF = 8. #2, 17, 19, 76 have 4 mentions. TF = 4. 20 pages have 1 mention. TF = 1. IDF = log2 (100 / 25) = 2 TF-IDF of naillie in #1 = 8 * 2 = 16 High! TF-IDF of naillie in #2, 17, 19, 76 =4*2 =8 Not so high TF-IDF of naillie in 20 others =1*2 =2 Small TF-IDF of naillie in all the rest =0*2 =0 Irrelevant Denton: Search Engines / 20 March 2008 / York 17 HTML helps a lot <title>The title of the page</title> <h1>The most important heading</h1> <h2>Lesser headings</h2> <a href=“http://www.yorku.ca/”>Hyperlinks</a> Text at the top of the page Denton: Search Engines / 20 March 2008 / York 18 Semantic markup Wrong: say how it should look <b><font size=“26”> <i>Upon the Distinction Between the Ashes of the Various Tobaccos</i> </font></b> Right: say what it is (then apply a look) <h1>Upon the Distinction Between the Ashes of the Various Tobaccos</h1> Denton: Search Engines / 20 March 2008 / York 19 PageRank! Algorithm designed by Larry Page and Sergey Brin when they were at Stanford One of the things Google uses in deciding how important a web page is US Patent 6,285,999 The Anatomy of a Large-Scale Hypertextual Web Search Engine (Brin, Page, 1998) Denton: Search Engines / 20 March 2008 / York 20 Denton: Search Engines / 20 March 2008 / York 21 http://en.wikipedia.org/wiki/PageRank Wikipedia has a good explanation of PageRank Denton: Search Engines / 20 March 2008 / York 22 Other weightings Frequency of updates The kind of site it is: blog? wiki? institutional repository? Who runs the site? Search engine companies tweak the rules They don’t give away their secrets, but lots of people try to reverse engineer the algorithms Denton: Search Engines / 20 March 2008 / York 23 Google’s explanation “Google combines PageRank with sophisticated text-matching techniques to find pages that are both important and relevant to your search. Google goes far beyond the number of times a term appears on a page and examines dozens of aspects of the page's content (and the content of the pages linking to it) to determine if it's a good match for your query.” http://www.google.com/technology/ Denton: Search Engines / 20 March 2008 / York 24 Live’s explanation “Live Search website ranking is completely automated. The Live Search ranking algorithm analyzes factors such as web page content, the number and quality of websites that link to your pages, and the relevance of your website’s content to keywords. The algorithm is complex and is never human-mediated. You can't pay to boost your website’s relevance ranking; however, we do offer advertising options for website owners.” http://help.live.com/help.aspx?mkt=en-us&project=wl_webmasters Denton: Search Engines / 20 March 2008 / York 25 What doesn’t it look at? Crawlers don’t see inside fancy things like Flash plug-ins, so if your home page is a Flash intro, the search engines may go no further But they can infer what’s in an image Most search engines ignore <meta> tags and other metadata Denton: Search Engines / 20 March 2008 / York 26 3. How does it serve you the results? You enter terms, it looks them up in a reverse index, then it formats the results on the way out. Denton: Search Engines / 20 March 2008 / York 27 Reverse index naillie 1, 2, 17, 19, 76, etc. partridge 1, 2, 35, 76, 8, 65 Not showing weights and rankings etc. Denton: Search Engines / 20 March 2008 / York 28 Formatting on the way out Google: toronto shoes Look at the cached copy; get the title, keywords in context, etc. Denton: Search Engines / 20 March 2008 / York 29 Search engine optimization SEO is bringing in people by designing your web site so that search engines will list you high on results pages when people search for certain keywords—without buying ads. Denton: Search Engines / 20 March 2008 / York 30 White hat SEO Improving your organic search results by making your site understandable by search engine crawlers, and by getting people to link to you. Denton: Search Engines / 20 March 2008 / York 31 White hat SEO Have good content Semantic markup again: <title>Good Titles Matter</title> <h1>So do headings</h1> Use text Host on a reliable, trustworthy site Denton: Search Engines / 20 March 2008 / York 32 Black hat SEO Pulling tricks to make search engines think your site is more popular than it is or to mislead them about the content. Denton: Search Engines / 20 March 2008 / York 33 SEO browser extensions Google Toolbar Search Firefox Add-ons for “pagerank” and “seo”, but mind the privacy issues on what’s being reported where about the pages you view Denton: Search Engines / 20 March 2008 / York 34 Advertising Search engines sell ads. You can pay to get your web page listed at the top of the results page for desired keywords. Search engines are advertising companies. Denton: Search Engines / 20 March 2008 / York 35 Google: AdWords and AdSense AdWords is Google’s program for selling ads on its site in results pages. Try their Keyword Tool. AdSense puts little boxes of context-relevant ads on web pages of people who want to make a little money. (Or a lot.) Denton: Search Engines / 20 March 2008 / York 36 Yahoo Video explaining how “sponsored search” works See how much it would cost to buy some keywords Denton: Search Engines / 20 March 2008 / York 37 Avoiding advertising The Firefox extensions Adblock and Customize Google will make your web browsing ad-free and much more pleasant. Denton: Search Engines / 20 March 2008 / York 38 Denton: Search Engines / 20 March 2008 / York 39 Denton: Search Engines / 20 March 2008 / York 40 Invisible web or deep web Search engines miss a lot of web content: It’s behind a login It’s dynamically generated It’s embedded in Flash or Java applets Denton: Search Engines / 20 March 2008 / York 41 2/3 of the web goes unseen He, Patel, Zhang, Chang, Accessing the Deep Web: A Survey (Communications of the ACM, 50: 5, May 2007) Denton: Search Engines / 20 March 2008 / York 42 Library databases Library databases (JSTOR, PsycInfo, Scholars Portal) are part of the deep web. They have enormous amounts of information that’s hidden from the public … except when it’s not, as through Google Scholar. Denton: Search Engines / 20 March 2008 / York 43 Library databases Differences in rankings, algorithms Full use of metadata Usually sorted by date PageRank is based on citation analysis, but these databases don’t use citation analysis to rank relevant papers Scholars Portal: search Natural Sciences for australopithecus Denton: Search Engines / 20 March 2008 / York 44 Final note on privacy Everything you do at a search engine is logged. They track you by IP address and with cookies and logins. Assume all that information is stored forever. See http://blog.searchenginewatch.com/blog/060206-150030 Denton: Search Engines / 20 March 2008 / York 45 Further reading: online searchenginewatch.com John Battelle’s Searchblog Online: Exploring Technology and Resources for Information Professionals Wikipedia (usually quite strong on technical articles) The library’s Computer Science Research Guide Denton: Search Engines / 20 March 2008 / York 46 Further reading: books Battelle, John. 2006. The Search: The Inside Story of How Google and Its Rivals Changed Everything (Portfolio) Berry, Michael W., and Murray Browne. 2005. Understanding Search Engines: Mathematical Modelling and Text Retrieval (SIAM) Levene, Mark. 2006. An Introduction to Search Engines and Web Navigation (Addison-Wesley) Denton: Search Engines / 20 March 2008 / York 47 Denton: Search Engines / 20 March 2008 / York 48