CPS 49S Google: The Computer Science Within and its Impact on Society

Shivnath Babu
Spring 2008
Outline for Today
• Why is Google so fast and reliable?
– Use of commodity servers connected together
– Replication
• Learning from data
Google’s Infrastructure
• Commodity servers in data centers
– $1000 per server
– 450,000 of them!
• Linux
• What is the problem with this approach?
Replication
• Make copies of everything
– Data centers, index, document repository, …
• Even if one copy fails, you have others
• Where should you place each copy?
• How many copies should there be?
• What are the problems of having multiple copies?
• What are the advantages of having multiple copies?
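The point that other copies survive when one fails can be made concrete with a little arithmetic. The sketch below assumes every copy fails independently with the same probability, so the data is unavailable only when all copies fail at once. The function names and the failure probability of 0.01 are hypothetical values chosen for illustration, not figures from the lecture.

```python
def loss_probability(p_fail: float, copies: int) -> float:
    """Probability that every copy is unavailable at the same time.

    Assumes copies fail independently with equal probability,
    which is a simplification.
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one copy")
    return p_fail ** copies


def expected_failed_servers(p_fail: float, servers: int) -> float:
    """Expected number of failed servers under the same assumption."""
    return p_fail * servers


if __name__ == "__main__":
    p = 0.01  # hypothetical per-copy failure probability
    for k in (1, 2, 3):
        print(f"{k} copies: loss probability {loss_probability(p, k):.6f}")
    # With many cheap servers, some are always broken.
    print(f"expected failed servers: {expected_failed_servers(p, 450_000):.0f}")
```

The independence assumption is the weak spot: copies stored in the same rack or data center tend to fail together, which is one reason the placement question matters.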
Three New Software Systems at Google
• Google File System
– Does automatic replication
• Map/Reduce
– Makes it easy to write software that runs on many computers at the same time
– Ex: Count the number of occurrences of each word in a collection of Web pages
– Ex: Find the list of pages pointing to each Web page
• Global Work Queue
– Easy to run software that needs to process a lot of data
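The two Map/Reduce examples above can be sketched in a few lines. This is a single-machine illustration of the programming model, not Google's implementation or API: the names run_map_reduce, word_count_mapper, link_mapper and the sample pages are all hypothetical. A real system would run the map and reduce calls on many computers and move the grouped data between them.

```python
from collections import defaultdict


def run_map_reduce(inputs, mapper, reducer):
    """Apply mapper to every (key, value) input, group the emitted
    pairs by key, then apply reducer to each group."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return {k: reducer(k, values) for k, values in groups.items()}


# Example 1: count the occurrences of each word.
def word_count_mapper(url, text):
    for word in text.lower().split():
        yield word, 1


def word_count_reducer(word, counts):
    return sum(counts)


# Example 2: find the list of pages pointing to each page.
def link_mapper(url, outlinks):
    for target in outlinks:
        yield target, url


def link_reducer(target, sources):
    return sorted(sources)


# Hypothetical sample data.
pages = [("a.html", "the cat sat"), ("b.html", "the dog")]
links = [("a.html", ["b.html", "c.html"]), ("b.html", ["c.html"])]

word_counts = run_map_reduce(pages, word_count_mapper, word_count_reducer)
inbound = run_map_reduce(links, link_mapper, link_reducer)
print(word_counts)  # {'the': 2, 'cat': 1, 'sat': 1, 'dog': 1}
print(inbound)      # {'b.html': ['a.html'], 'c.html': ['a.html', 'b.html']}
```

The programmer writes only the mapper and the reducer. Because each mapper call looks at one page and each reducer call looks at one key, the calls can run on different machines without coordinating with each other.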
Learning from Data
• Misspellings of queries: suggest alternate spellings
• Triangle area cooking class → Duke course page on vegetarian cuisine
– Find clusters, name them automatically
– Works when there is a lot of data
– AdSense ads
– Google News
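As a rough illustration of learning spelling suggestions from data, the sketch below treats a log of past queries as the only source of knowledge: if a similar-looking query was typed far more often, it is offered as the alternate spelling. This is an assumption-laden toy, not Google's method. The function name suggest_spelling, the similarity cutoff, and the sample log are hypothetical, and string similarity comes from Python's standard difflib module.

```python
import difflib
from collections import Counter


def suggest_spelling(query, query_log, cutoff=0.8):
    """Return a more popular, similar-looking query from the log, or None."""
    counts = Counter(q.lower() for q in query_log)
    query = query.lower()
    # Logged queries whose text is close to the one typed.
    candidates = difflib.get_close_matches(query, list(counts), n=5, cutoff=cutoff)
    if not candidates:
        return None
    best = max(candidates, key=lambda q: counts[q])
    # Suggest only if the candidate differs and was typed more often.
    if best != query and counts[best] > counts[query]:
        return best
    return None


# Hypothetical query log: the correct spelling dominates.
log = (["britney spears"] * 50
       + ["brittany spears"] * 3
       + ["duke university"] * 10)

print(suggest_spelling("brittany spears", log))
print(suggest_spelling("britney spears", log))
```

No dictionary is involved, so the approach only works when the log is large enough for correct spellings to outnumber the mistakes.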
Outline for Today
• Why are we interested in Google?
• What we will cover in this class
• Logistics