CMSC 100 Google Deconstructed Professor Marie desJardins Tuesday, September 4, 2012

CMSC 100 -- Google
1
Tue 9/4/12
Everyone Knows What Google Is…

Right?

So how does it work?
What’s So Great About Google?

Abstraction (of course!)

Crawl/index/search model

PageRank algorithm and complex ranking model

Scalability and robustness
Let’s Try It…

Google demo/activity:

Find the market capitalization of Google

Find a zero-hit “word”

Let’s add it to the course website

Now does Google notice it?

How long do you think it will take for Google to “learn” it?

Find a satellite picture of this building

Show the PageRank of several different pages
Prehistory of the Web: Computation
• First, there were numbers (early counting systems)
• Then there were “computers” (abaci, Babbage’s Difference Engine, Jacquard loom)
• Then there was the theory of computation (Turing)
• Then there were computers (ENIAC, PCs, Macs; explosion of computing power in the latter half of the 20th century)
History of the Web: Medieval Times
Internet (1960s – government sites and a few universities)
Talk/email and Usenet newsgroups used by a limited population
in the ‘80s
1988: First major Internet worm (created by Robert Morris)
1989: World Wide Web invented by Tim Berners-Lee
1993: World Wide Web Worm (not a virus: an early search engine); 300K documents
1994: Netscape browser released
1995: eBay goes live
1996: Larry Page invents PageRank
1997: Google.com domain registered; search engines index 2M-100M documents; AltaVista handles about 20M queries per day
History of the Web: The Renaissance

Notable quotes:
“It is foreseeable that by the year 2000, a comprehensive index of the Web will contain over a billion documents”
“It is likely that search engines will handle hundreds of millions of queries per day by the year 2000.” [Brin & Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine”]
1998: Google incorporated with initial investment of $1M
1999: blogger.com goes live; blogs explode in popularity
2000: Google AdWords introduced
2003: MySpace.com
2004: Facebook
2004: Google IPO, with a market capitalization of $23B
2004: Mass-market VoIP (Voice over Internet Protocol)
2006: Google buys YouTube
History of the Web: The Modern Era
2007: Google has a 53.6% market share (Yahoo has 20%)
2008: Facebook overtakes MySpace as the #1 social networking site
Today:
 Google has a market capitalization of $221.5B, over 30K employees, and a 67% US search market share
 Google is #73 on the Fortune 500
 The planet has 6.9B people, of whom 2.3B have Internet access
Tomorrow:
 The Semantic Web will enable more meaningful information retrieval
 Service-based computing and intelligent agent technology will let you perform
all sorts of tasks automatically (travel planning, etc.)
 Mobile phones will become the primary Internet platform
 Continuing increased global access to the Web will lead to increased democracy,
freedom of information, and protection of human rights
Sources: [investor.google.com, googlesystem.blogspot.com, internetworldstats.com]
Google: Crawling the Web

Googlebot: Start with a few “well connected” pages, follow the links.
Lather, rinse, repeat.

Deepbot: Crawl the entire web once a month

Freshbot: Visit frequently updated sites more often

As of 2012, Google indexes about 35B documents.

Bing indexes 18B, and Yahoo indexes 3-4B [worldwidewebsize.com]

As of August 2012, Google had identified 30T (that’s thirty trillion!)
unique URLs (but not all of them are interesting enough to index)

Google claims to handle more than 3 billion queries a day
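
The "start with a few pages, follow the links, repeat" idea can be sketched as a breadth-first traversal. This is only an illustration under stated assumptions, not Googlebot's actual implementation: the toy link graph `TOY_WEB`, the `.example` page names, and the `crawl` and `toy_fetch_links` functions are all made up here, and no real network requests are made.

```python
from collections import deque

# ASSUMPTION: a hypothetical in-memory "web"; each page maps to the pages it links to.
TOY_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example", "d.example"],
    "c.example": ["a.example"],
    "d.example": [],
    "island.example": ["a.example"],  # no page links here, so a crawl from a.example misses it
}


def toy_fetch_links(page):
    """Stand-in for downloading a page and extracting its links."""
    return TOY_WEB.get(page, [])


def crawl(seeds, fetch_links):
    """Breadth-first crawl: visit each reachable page once, in discovery order."""
    seen = set(seeds)
    frontier = deque(seeds)
    order = []
    while frontier:
        page = frontier.popleft()
        order.append(page)
        for link in fetch_links(page):
            if link not in seen:  # skip pages already queued or visited
                seen.add(link)
                frontier.append(link)
    return order


visited = crawl(["a.example"], toy_fetch_links)
print(visited)
```

Pages that nothing links to (like `island.example` here) are never discovered, which is one reason a brand-new page can take a while to show up in search results.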
Google Queries and Tools

Demo:

“Advanced Google Operators” –
http://www.google.com/intl/gn/help/operators.html

“Google Cheat Sheet” –
http://www.googleguide.com/advanced_operators_reference.html

Important concepts: stemming, stop words

Parsing

Term reordering for efficiency
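
To make stemming and stop words concrete, here is a minimal sketch of query normalization. The stop-word list, the suffix list, and the function names `crude_stem` and `normalize_query` are assumptions chosen for illustration; real search engines use far more careful rules.

```python
# ASSUMPTION: a tiny illustrative stop-word list, not any search engine's real one.
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is"}

# ASSUMPTION: a deliberately crude suffix list; real stemmers handle many more cases.
SUFFIXES = ("ing", "ed", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize_query(query):
    """Lowercase the query, drop stop words, and stem the remaining terms."""
    terms = [t for t in query.lower().split() if t not in STOP_WORDS]
    return [crude_stem(t) for t in terms]


print(normalize_query("The history of searching engines"))
```

With this kind of normalization, a query for "searching" can match pages that only contain "search", and common words like "the" do not affect the results.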

More apps:

Google products on “Even More” link

Google Maps and Earth

Google Mail, Calendar, Documents, Plus....

...and even a freakin’ self-driving car!! [check it out...]
Ranking the Web


Googlefight: When queries duke it out
Google uses keyword similarity but includes other criteria in
“scoring” documents, most notably PageRank




PageRank assigns a “score” to a web page based on how many other
pages point to it
A page’s PageRank depends on the PageRanks of the “referring”
pages, so it is a recursive definition!
Other criteria in Google’s ranking scheme include “extra credit” for
keywords that appear earlier in the body of the document, in the title
tag, in H1 HTML headers, and in anchor text (on links to the page
being ranked)
“Search engine optimization” (legitimate) vs. “Google bombing”
(not allowed)
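
The recursive definition of PageRank can be approximated by repeating the calculation until the scores settle. The sketch below uses one common textbook formulation, not Google's production ranking: the damping factor of 0.85, the fixed 50 iterations, the toy `LINKS` graph, and the `pagerank` function are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate PageRank for a dict mapping page -> list of linked pages.

    Every linked page must also appear as a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal scores
    for _ in range(iterations):
        # Every page gets a small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its current score evenly among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no links spreads its score over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# ASSUMPTION: a hypothetical four-page site.
LINKS = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["home", "about"],
    "orphan": ["home"],  # links out, but nothing links to it
}

ranks = pagerank(LINKS)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph, "home" ends up with the highest score because every other page points to it, while "orphan" gets only the baseline score because no page refers to it.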
Indexing the Web

Googlewhack: Query that returns exactly one page (can you find
one?)

Efficient data structures are needed!

How do you find a phone number in the yellow pages, or a word in the
dictionary?

Examples of successively more efficient data structures: Unordered
list, ordered list, ordered binary tree, hash table
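
A rough way to see why the choice of data structure matters is to count the comparisons each lookup needs. This is a sketch with made-up data: the word list and the helper names `linear_search` and `binary_search` are assumptions for illustration, and Python's built-in `set` stands in for a hash table.

```python
def linear_search(items, target):
    """Unordered list: check entries one by one. Returns (found, comparisons)."""
    comparisons = 0
    for item in items:
        comparisons += 1
        if item == target:
            return True, comparisons
    return False, comparisons


def binary_search(sorted_items, target):
    """Ordered list: halve the search range each step. Returns (found, comparisons)."""
    lo, hi = 0, len(sorted_items)
    comparisons = 0
    while lo < hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    found = lo < len(sorted_items) and sorted_items[lo] == target
    return found, comparisons


# ASSUMPTION: 10,000 made-up "words", already in sorted order.
words = [f"word{i:05d}" for i in range(10000)]
target = "word09999"  # the last entry: the worst case for a linear scan

found_linear, steps_linear = linear_search(words, target)
found_binary, steps_binary = binary_search(words, target)
found_hash = target in set(words)  # hash table: roughly one step on average

print(steps_linear, steps_binary, found_hash)
```

The linear scan needs 10,000 comparisons here, while the binary search needs at most 14, which is the same trick you use when flipping through a dictionary.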

Indexes: sorted by keyword, with “pointer” to pages containing
that keyword

Document server: store the cached versions of the actual pages
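
An index that maps each keyword to the pages containing it (often called an inverted index) might look like the sketch below. The three tiny documents in `DOCS` and the functions `build_index` and `search` are assumptions for illustration, not Google's actual index format.

```python
from collections import defaultdict

# ASSUMPTION: three tiny made-up documents, keyed by document id.
DOCS = {
    1: "google crawls the web",
    2: "the web is big",
    3: "google ranks pages",
}


def build_index(docs):
    """Map each keyword to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents containing every word in the query."""
    postings = [index.get(word, set()) for word in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


index = build_index(DOCS)
print(search(index, "google web"))
print(search(index, "web"))
```

Answering a query then means looking up a few keywords and intersecting their page lists, rather than scanning every stored document.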
Storing the Web


How much storage is needed for (say) 10B documents?
Server farms
As of 2012, Google had an estimated 1.8M (yes, 1.8 million) servers in
locations around the globe
Issues: Energy consumption, heat generation/dissipation, environmental
impact (server farms are typically located near, and dissipate their heat
into, water sources)
Google claims they have been carbon neutral since 2007 (through energy
savings, green energy sources, and carbon offsets) [google.com/green]
Design philosophy: Use many cheap, redundant machines, plus
failure detection and handling → robust, scalable, affordable
Cloud computing (21st-century computation paradigm: use a very
loosely connected network of many servers to manage enormous
quantities of data and huge numbers of queries)
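
A back-of-the-envelope answer to the 10B-document storage question can be worked out in a few lines. The average page size of 100 KB and the three redundant copies are assumptions picked only to make the arithmetic concrete; they are not figures from Google.

```python
def storage_bytes(num_docs, avg_doc_bytes, replicas=1):
    """Total bytes needed to store num_docs documents, each kept `replicas` times."""
    return num_docs * avg_doc_bytes * replicas


NUM_DOCS = 10_000_000_000   # 10B documents, as in the question above
AVG_DOC_BYTES = 100 * 1000  # ASSUMPTION: about 100 KB per stored page
REPLICAS = 3                # ASSUMPTION: three redundant copies of each page
PETABYTE = 10**15

raw = storage_bytes(NUM_DOCS, AVG_DOC_BYTES)
replicated = storage_bytes(NUM_DOCS, AVG_DOC_BYTES, REPLICAS)

print(f"one copy: {raw / PETABYTE:.1f} PB")
print(f"with {REPLICAS} copies: {replicated / PETABYTE:.1f} PB")
```

Under these assumptions a single copy is about one petabyte (a million gigabytes), which is far more than one machine can hold and hints at why many machines are needed.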
Privacy and Security

Is it good that Google provides ready access to “all” information?
Do you want all of your publicly available data to be readily available to any
random inquirer?
Do you want all of your data to be permanently available via caching, even if
you’ve taken it off your Facebook page?
Do you want “leaked” (legally or not, but against your wishes) private
information to become readily, permanently available public data?
If you create a story/book/painting/drawing/idea, do you want other people to
be able to get it and use it as much as they want, for any purpose, for free?
Do you have any music you didn’t pay for on your iPod?
Do you have movies you didn’t buy?
Google Earth: security and privacy issues
China/censorship – filtering search results
Digitizing books – copyright issues
Cookies/user data
Tabulating and selling search queries
Copyright and privacy law are way behind the technology curve!
Google’s Business Model

Why is it free?



Advertising: $10.6B ad revenue in 2006.
How Google Ads work:
 AdWords: “pay-per-click” for “sponsored
links” using Google’s technology to identify
relevant ads
 AdSense to put sponsored advertising on your own website
Big issue: Click fraud
What Doesn’t Google Do Well?

What doesn’t Google do well?



Non-digitized documents (handwritten/scanned data); non-text documents
(audio, images, video, multimedia, e.g., PowerPoint presentations)
Databases and dynamically generated content
Semantic knowledge (beyond keywords and
PageRank)