CPS 49S
Google: The Computer Science Within and its Impact on Society
Shivnath Babu
Spring 2008

Outline for Today
• Why is Google so fast and reliable?
  – Use of commodity servers connected together
  – Replication
• Learning from data

Google’s Infrastructure
• Commodity servers in data centers
  – $1000 per server
  – 450,000 of them!
• Linux
• What is the problem with this approach?

Replication
• Make copies of everything
  – Data centers, index, document repository, …
• Even if one copy fails, you have others
• Where should you place each copy?
• How many copies should there be?
• What are problems of having multiple copies?
• What are the advantages of having multiple copies?

Three New Software Systems at Google
• Google File System
  – Does automatic replication
• Map/Reduce
  – Makes it easy to write software that runs on many computers at the same time
  – Ex: Count the number of occurrences of each word in a collection of Web pages
  – Ex: Find the list of pages pointing to each Web page
• Global Work Queue
  – Easy to run software that needs to process a lot of data

Learning from Data
• Misspellings of queries --- suggest alternate spellings for queries
• Triangle area cooking class Duke course page on vegetarian cuisine
  – Find clusters, name them automatically
  – Works when there is a lot of data
  – AdSense ads
  – Google News

Outline for Today
• Why are we interested in Google?
• What we will cover in this class
• Logistics
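
The two Map/Reduce examples listed under "Three New Software Systems at Google" (counting word occurrences, and finding the pages that point to each page) can be sketched on a single machine. This is only an illustration of the map, group, reduce pattern, not Google's actual API: the function names (`run_map_reduce`, `map_word_count`, `map_inbound_links`, `shuffle`) and the tiny sample inputs are all hypothetical. In a real deployment the map and reduce calls would be spread across many servers.

```python
from collections import defaultdict


def map_word_count(page_text):
    """Map step (hypothetical): emit a (word, 1) pair for every word on a page."""
    return [(word.lower(), 1) for word in page_text.split()]


def map_inbound_links(page):
    """Map step (hypothetical): emit (target, source) for each link on a page."""
    source, targets = page
    return [(target, source) for target in targets]


def shuffle(pairs):
    """Group all values by key; a real framework does this between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def run_map_reduce(inputs, map_fn, reduce_fn):
    """Single-process stand-in for a distributed map/reduce run."""
    pairs = []
    for item in inputs:
        pairs.extend(map_fn(item))
    return {key: reduce_fn(values) for key, values in shuffle(pairs).items()}


# Example 1: count occurrences of each word across a collection of pages.
pages = ["the cat sat", "the dog sat down"]
word_counts = run_map_reduce(pages, map_word_count, sum)

# Example 2: for each page, list the pages that link to it.
links = [("a.html", ["b.html", "c.html"]), ("b.html", ["c.html"])]
inbound = run_map_reduce(links, map_inbound_links, sorted)

print(word_counts)
print(inbound)
```

Both examples reuse the same driver and differ only in the map and reduce functions they supply, which is the sense in which Map/Reduce "makes it easy to write software that runs on many computers": the programmer writes two small functions and the system handles distribution.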