2008/3/26
Yuh-Jzer Joung
莊 裕 澤
Dept. of Information Management
National Taiwan University
Mar. 2008
Google
Google: A Brief History
Began as a research project by Larry Page (then a Ph.D. student) at Stanford in
January 1996.
Co-founded on Sep. 7, 1998 by Larry Page and Sergey Brin while they were still students at Stanford University.
Located in Mountain View, California.
The name "Google" originated from a misspelling of "googol", the number 10^100 (a 1 followed by 100 zeros).
"Google" (verb): "to use the Google search engine to obtain information on the Internet."
Definition from the Merriam-Webster Collegiate Dictionary and the Oxford English Dictionary, 2006.
Fortune Magazine's #1 Best Place To Work
Google’s Mission
To organize the world's information and make it universally accessible and useful.
Google's data center in Oregon
Financial Performance
IPO on August 19, 2004
Shares were priced at $85 in the offering and closed the first trading day at $100.34.
Added to Nasdaq 100 on Dec. 19, 2005.
Added to S&P 500 on Mar. 31, 2006.
The largest American company (by market capitalization) that is not part of the Dow Jones Industrial Average (as of Oct 31, 2007).
Impressive Financial Success
Chart: years to reach $2 billion in annual net revenue after generating $80-$100 million
Source: Goldman Sachs Global Investment Research, Nov. 2004.
Profile and Key Statistics
Revenue: US$16.593B [2007] (↑ 56%)
Net income: US$4.203B [2007] (↑ 25%)
Employees: 16,805 (Dec. 31, 2007)
Market Cap: US$145.47B [Feb. 27, 2008]
See updated data.
Competitors: Yahoo
Revenue: US$6.7B [2007], $6.4B [2006]
Net income: $730M [2007], $751M [2006]
Employees: 13,600 (October 18, 2007)
Market Cap: US$39.36B [Feb. 27, 2008]
See updated data.
Recent news: Microsoft offers a $44.6B bid for Yahoo! [2008-02-01]
Competitors: Microsoft
Revenue: US$51.12B [2007]
Net income: $14.06B [2007]
Employees: 79,000 (2007)
Market Cap: US$264.13B [Feb. 27, 2008]
See updated data.
Comparison Summary, updated.
The Core Business: Search Engine
Huge Data Size
By 1998, Google had an index of about 60M pages
Handles 250 M queries per day (as of Feb 2003)
Search Engine Size Wars
− As of 2004
  Google: 8.1B
  MSN: 5.0B
  Yahoo: 4.2B (estimate)
− Yahoo claims 19.2B in 2005
  19.2B web documents
  1.6B images
  over 50M audio and video files
− Google estimated at about 24B in 2005
− Estimated number of web pages on the WWW: 29.7B (as of Feb. 2007)
− Index sizes are now kept top secret by each search engine company
Simplicity Is Beauty
A simple and clean interface for Google's search engine
Copied by several other search engines
Key to the Success
The interface is clear and simple.
Pages load instantly.
Placement in search results is never sold to anyone.
Advertising on the site must offer relevant content and not be a distraction.
Some Fun: Google Doodles
Holiday Logos
Leap Year - February 29, 2008
Fan Logos
Official Logo
Chinese New Year - February 7, 2008
Happy New Year & 25 Years of TCP/IP - January 1, 2008
Directories vs. Search Engines
Directories
Selected sites organized hierarchically into categories (typically by hand)
− Yahoo! Directory
  Changed to crawler-based listings for its main results in Oct. 2002.
  Originally, the actual web crawling and storage/retrieval of data were powered by Inktomi (bought by Yahoo in 2002), and later by Google until 2004.
  Although Yahoo also bought Overture Services, Inc. (which owned the AlltheWeb and AltaVista search engines), it kept using Google until 2004.
− Open Directory Project
Directories vs. Search Engines (contd.)
Search Engines
Search over the contents of the pages themselves
Results are ordered in response to a query by relevance ranking or other scores
How a Search Engine Operates
1. Crawl the web
2. Store the documents
3. Parsing, analysis, and indexing
4. User query
5. Check index
6. Return results
Diagram labels: search engine servers, server, indexing, DB, DocIds
Many technologies are involved in crawling the Web, analyzing and indexing web pages, ranking them, and making the system scalable!
Basic Indexing Concept
Doc 1: Information Technology I: Information Systems and Applications, EMBA, Instructor Yuh-Jzer Joung, …
Doc 2: Yuh-Jzer Joung, Professor, Department of Information Management, National Taiwan University …
Doc 3: Information Retrieval, Basic Index Concept, Distributed Inverted Index, …
…

Term          No. of Doc.   Total Freq.
Information   3             …
Joung         2             2

Term          DocID   Freq   Position
Information   1       2      1,4
Information   2       1      6
Information   3       1      1
Joung         1       1      …
Joung         2       1      2
…
More information needs to be recorded in the index files!
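A minimal sketch of how such an index could be built, assuming words are split on whitespace, lowercased, and stripped of surrounding punctuation (so "Yuh-Jzer" counts as one word). The function names buildInvertedIndex and summarize are hypothetical, and the three documents are the truncated excerpts shown on the slide, so the counts only reflect those excerpts.

```javascript
// Build term -> (docId -> list of 1-based word positions).
function buildInvertedIndex(docs) {
  const index = new Map();
  for (const [docId, text] of Object.entries(docs)) {
    const tokens = text
      .toLowerCase()
      .split(/\s+/)
      // Strip leading/trailing punctuation but keep inner hyphens.
      .map((t) => t.replace(/^[^a-z0-9]+|[^a-z0-9]+$/g, ""))
      .filter((t) => t.length > 0);
    tokens.forEach((term, i) => {
      if (!index.has(term)) index.set(term, new Map());
      const postings = index.get(term);
      const id = Number(docId);
      if (!postings.has(id)) postings.set(id, []);
      postings.get(id).push(i + 1);
    });
  }
  return index;
}

// Dictionary entry for one term: number of documents and total frequency.
function summarize(index, term) {
  const postings = index.get(term) || new Map();
  let totalFreq = 0;
  for (const positions of postings.values()) totalFreq += positions.length;
  return { numDocs: postings.size, totalFreq };
}

const docs = {
  1: "Information Technology I: Information Systems and Applications, EMBA, Instructor Yuh-Jzer Joung",
  2: "Yuh-Jzer Joung, Professor, Department of Information Management, National Taiwan University",
  3: "Information Retrieval, Basic Index Concept, Distributed Inverted Index",
};

const index = buildInvertedIndex(docs);
console.log(summarize(index, "information"));
console.log(index.get("information"));
console.log(index.get("joung"));
```

Storing positions, not just document IDs, is what later allows phrase and proximity matching at query time.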
Google’s Architecture
Sorted barrels = inverted index
PageRank computed from link structure; combined with IR rank
IR rank depends on TF (term frequency), type of "hit", hit proximity, etc.
Billion documents
Hundred million queries a day
Page Rank
PageRank: Ranking Web Pages using link structure of the web
A trademark of Google
Developed at Stanford University by Larry Page and Sergey Brin
− Stanford University holds the patent on it
[Figure: example link graph of pages A to H]
Page Rank (contd.)
\[
PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{L(T_1)} + \frac{PR(T_2)}{L(T_2)} + \frac{PR(T_3)}{L(T_3)} + \cdots \right)
\]
− d: damping factor, normally set to 0.85.
− (1 − d) can also be viewed as the probability of a user hitting page A directly, so d is the probability of reaching A from other pages.
− T_1, …, T_n: pages pointing to page A.
− PR(X): PageRank of page X.
− L(T_i): the number of links going out of page T_i.
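The formula can be evaluated by repeated substitution. The sketch below is one simple way to do that, not Google's implementation: the function name pageRank, the iteration count, and the three-page graph are assumptions made for illustration (the edges of the slide's A to H graph are not recoverable), and every page is assumed to have at least one outgoing link.

```javascript
// links maps each page to the list of pages it links to.
function pageRank(links, d = 0.85, iterations = 50) {
  const pages = Object.keys(links);
  let pr = {};
  for (const p of pages) pr[p] = 1.0; // initial guess for every page

  for (let it = 0; it < iterations; it++) {
    const next = {};
    for (const p of pages) next[p] = 1 - d; // the (1 - d) term
    for (const t of pages) {
      const out = links[t];
      // Page t passes d * PR(t) / L(t) to each page it links to.
      for (const a of out) next[a] += (d * pr[t]) / out.length;
    }
    pr = next;
  }
  return pr;
}

// Hypothetical graph: A links to B and C, B links to C, C links to A.
const links = { A: ["B", "C"], B: ["C"], C: ["A"] };
const ranks = pageRank(links);
console.log(ranks);
```

In this toy graph C collects rank from both A and B, so it ends up ranked highest, while B, which is linked only from A, ranks lowest.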
Search Engine Optimization
A new business to help websites raise their rankings on
Google and on other search engines.
"On-page" factors
− such as body copy, title tags, H1 heading tags, and image alt attributes
"Off-page optimization" factors
− anchor text and PageRank
Search Market Share
As of July 2006, US Market
5.6 billion searches in this month
Search Market Share (contd.)
As of August 2007, US Market
7.8 billion searches in this month
Trends: Number of Searches
Zeitgeist: Search patterns, trends, and surprises
2007 Year-End Zeitgeist
Zeitgeist Archive
Fastest Rising (global), 2007: iphone, badoo, facebook, dailymotion, webkinz, youtube, ebuddy, second life, hi5, club penguin
Google Platform
"Google" circa 1997 (google.stanford.edu)
First Production Server (circa 1999)
Google Data Center (Circa 2000)
Data Centers Now
Major centers in Mountain View, CA; Virginia; Atlanta, Georgia; and Dublin, Ireland; new facilities constructed in The Dalles, Oregon and Saint-Ghislain, Belgium.
An estimated 450,000+ servers.
Plans to open one in Asia
Taiwan? Malaysia? Japan? Korea? India? (as of Feb. 2008)
Google spent $2.4B on data centers in 2007.
− $1.9B in 2006
Google typically invests $600M in each new data center.
Photo: Google's data center in The Dalles
Google AdWords
Launched in 2000, now Google's main source of revenue
Advertisers choose the keywords for which they would like their ads to appear when someone searches Google with those keywords, and bid on the price
Google AdWords (contd.)
CPC (cost-per-click) advertisements
The ordering of paid listings (displayed as "Sponsored Links") depends on
− other advertisers' bids
  A variation of the Vickrey auction.
  The highest bidder wins, but the price paid is the second-highest bid
  a minimum bid of $0.01 per click (as of 2008)
− the "quality score" of all ads shown for a given search
  The quality score is determined by Google based on historical click-through rates and the relevance of an advertiser's ad text and keywords
CPM (cost-per-thousand impressions)
Site-targeted advertisements
Advertisers select the sites where they'd like their ads to appear, and set a CPM bid that applies to all those sites.
CPM ads are likewise ranked for display partly according to their CPM bid, and compete with keyword-targeted CPC ads.
a minimum bid of $0.25 per thousand impressions
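The second-price rule described above can be sketched for a single ad slot as follows. This is a textbook Vickrey auction, not AdWords itself: the quality score is deliberately left out, and the function name secondPriceAuction, the advertiser names, and the use of the minimum bid as the price when there is only one bidder are assumptions made for illustration.

```javascript
// bids: array of { advertiser, bid } with bid in dollars per click.
function secondPriceAuction(bids, minBid = 0.01) {
  // Ignore bids below the minimum, then sort from highest to lowest.
  const eligible = bids
    .filter((b) => b.bid >= minBid)
    .sort((a, b) => b.bid - a.bid);
  if (eligible.length === 0) return null;

  const winner = eligible[0];
  // The winner pays the second-highest bid (or the minimum if unopposed).
  const price = eligible.length > 1 ? eligible[1].bid : minBid;
  return { winner: winner.advertiser, price };
}

const result = secondPriceAuction([
  { advertiser: "shoes-r-us", bid: 0.5 },
  { advertiser: "best-boots", bid: 0.8 },
  { advertiser: "sneaker-hub", bid: 0.3 },
]);
console.log(result); // best-boots wins and pays 0.5 per click
```

Because the price is set by the runner-up, a bidder gains nothing by bidding below what a click is actually worth to them, which is the usual argument for this auction format.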
Google AdSense
Lets web publishers earn more revenue from advertising on their sites while maintaining editorial quality.
Google provides HTML ad code to place on the web pages where you want to display AdWords ads, and then takes care of the rest.
Google AdSense
AdSense for content
automatically crawls the content of your pages and delivers ads that are relevant to your audience and your site content
AdSense for search
allows website publishers to provide Google web and site search to their visitors.
AdSense for mobile content
You Can Make Money Without Doing Evil
Ads are not displayed on results pages unless they are relevant to the results page on which they are shown.
Advertising can be effective without being flashy.
Advertising on Google is always clearly identified as a "Sponsored Link."
No one can buy better PageRank. Rankings are never manipulated to put partners higher in the search results.
Online Ad War
Google Attracting 25% of all US Online Ad Revenue [2006]
Online Ad War (contd.)
Google Earth: Overview
Developed by Keyhole, Inc. (acquired by Google in 2004), and then renamed to Google Earth in 2005.
Free with limited functionality
Google Earth Plus ($20 per year)
Google Earth Pro ($400 per year)
Originally 2D; 3D was added for some scenes in Nov. 2006.
In August 2007, Hamburg became the first city entirely shown in 3D.
Added a Sky tool for viewing stars and astronomical images in Aug. 2007.
Some Specs
Resolutions
Baseline resolutions
− Global: generally 15 m
− U.S.: < 15 m; some states are at 1 m
Typical high resolutions
− U.S.: 1 m, 0.6 m, 0.3 m, 0.15 m (extremely rare; e.g. Cambridge and Google Campus, or Glendale)
− Europe: 0.3 m, 0.15 m (e.g. Berlin, Zürich, Hamburg)
Image ages
Most international urban images date from 2004 and have not been updated.
But cities in the U.S. are kept current.
Google Earth: Demo
Explore sky in Google Earth
View the sky from your location.
National Security and Privacy Issues
Some places are obscured through pixelization in Google Earth and Google Maps due to security concerns.
Nevertheless, high-resolution photos and aerial surveys of such places are readily available elsewhere on the Internet.
An individual's right to privacy receives less attention than the state's right to secrecy.
Competitor: Windows Live Local
Windows Live Local Gets "Virtual Earth" 3-D Cities
Google Maps: Overview
Announced on Feb. 8, 2005; offers street maps, a route planner, and an urban business locator for numerous countries around the world.
Provides high-resolution satellite images for most urban areas in Canada and the United States, as well as parts of New Zealand, Australia, Egypt, France, Germany, Hong Kong, Iran, Iceland, Italy, Ireland, Iraq, Japan, Taiwan, the Bahamas, Bermuda, Kuwait, Mexico, the Netherlands, the United Kingdom, and many other countries.
All the images shown in Google Maps' satellite mode are at least a year old (wiki).
Google Maps for Mobile launched in late 2006.
Street View, a new Google Maps feature offering 360° panoramic street-level views of various U.S. cities, was released on May 25, 2007.
Resolution Can be this High!
Image: deep in Chad, Africa
Resolution Can be this High (contd.)!
Image: people at a well deep in Chad, Africa
Google Maps: Implementation
Uses the AJAX (Asynchronous JavaScript and XML) technique.
Locations are drawn dynamically by positioning a red pin on top of the map images.
GIS (Geographic Information System) data is provided by Tele Atlas and NAVTEQ.
High-resolution satellite imagery is largely provided by DigitalGlobe and its QuickBird satellite, with some imagery also from government sources.
Provides an API for developers to integrate Google Maps into their websites.
Maps Front-end Design
Metaphor: an infinite map under a small window
Accomplished with DHTML by splitting the map into tiles
As the user pans, tiles are rotated to make the map seem infinite
Diagram labels: pad, map tiles, view width
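A rough sketch of the tiling idea, assuming square tiles of 256 pixels and a one-tile pad around the window (both values, and the function name visibleTiles, are assumptions; the slide does not give them). Given how far the user has panned, it lists which tiles are needed and where each should be drawn relative to the window; tiles that fall outside this set are the ones a real implementation could recycle.

```javascript
// panX/panY: pixel offset of the window's top-left corner on the "infinite" map.
function visibleTiles(panX, panY, viewWidth, viewHeight, tileSize = 256, pad = 1) {
  const firstCol = Math.floor(panX / tileSize) - pad;
  const lastCol = Math.floor((panX + viewWidth - 1) / tileSize) + pad;
  const firstRow = Math.floor(panY / tileSize) - pad;
  const lastRow = Math.floor((panY + viewHeight - 1) / tileSize) + pad;

  const tiles = [];
  for (let row = firstRow; row <= lastRow; row++) {
    for (let col = firstCol; col <= lastCol; col++) {
      tiles.push({
        col,
        row,
        // Position of the tile relative to the window's top-left corner.
        left: col * tileSize - panX,
        top: row * tileSize - panY,
      });
    }
  }
  return tiles;
}

// A 512x256 window panned 300 pixels to the right, without padding.
const tiles = visibleTiles(300, 0, 512, 256, 256, 0);
console.log(tiles);
```

The pad keeps a ring of tiles loaded just outside the window so that small pans do not expose blank areas while new tiles are fetched.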
Using Google Maps APIs
Putting a map on a page requires only two lines of JavaScript:
    // Attach a map to the page element with id "map".
    var map = new GMap(document.getElementById("map"));
    // Center on (longitude, latitude) at zoom level 4.
    map.centerAndZoom(new GPoint(-122.141944, 37.441944), 4);
Google Maps: Demo
A video showing the use of location markers.
The Potential of Google Maps
Overlay search results
and much more …
Google Moon
Launched in honor of the 36th anniversary of the Apollo 11 Moon landing of July 20, 1969.
Google Mars
Launched in Mar. 2006.
Competitor: Yahoo! Maps
Yahoo! released its own Maps API
Google Docs & Spreadsheets
A free, Web-based word processor, spreadsheet, and presentation application, launched on Oct. 11, 2006.
Allows users to create and edit documents online while collaborating in real time with other users.
Google Docs & Spreadsheets: Growth Rate
Software As A Service (SaaS)
A software application delivery model where a software vendor develops a web-native software application and hosts and operates (either independently or through a third-party) the application for use by its customers over the Internet.
Gartner says 25 percent of new business software will be delivered as Software as a Service by 2011.
The trend: Application Service Provider (ASP) → On-Demand → SaaS
StartPage
A self-constructed, often customized web page built from small boxes or modules that can house web links, photo feeds, breaking news, search engines, bookmark links, …
Also known as "Web Desktop", "Start-page service", "Personalized Portal", …
iGoogle: Background
Launched in May 2005 as "Google Personalized Homepage"; renamed iGoogle on April 30, 2007.
Customizable AJAX-based start page for accessing the Web.
Supports specially developed "gadgets" that display content on a user's page.
The gadgets interact with the user and use the Google Gadgets API.
Competitors: Netvibes, Pageflakes, My Yahoo, Microsoft's Live.com.
iGoogle: Demo
Google Desktop
First released in Oct. 2004.
Desktop search software.
Can index several types of data
− email, web browsing history, office documents, instant messenger transcripts from AOL, Google, MSN, Skype, and Tencent QQ, and several multimedia file types
− additional file types can be indexed through plug-ins
Provides a number of sidebars
Email, Scratch Pad, Photos, News, Weather, Google Talk, ...
Google Gadgets
Windows sidebars for Vista
Start-Page Service: The Potential
Recall the battle of portals?
Cloud Computing: Function comes from “the Cloud”
A new term for the use of shared computing resources.
Related concepts
Cluster Computing
Grid Computing
Utility Computing/on-demand computing
Who Is in the Arena
Source: BusinessWeek
MapReduce: Simplified Data Processing on Large Clusters
[Google 2004]
MapReduce: a programming model to
Process lots of data (TB)
Parallelize across hundreds/thousands of CPUs
Make this easy
MapReduce: an associated implementation that provides
Automatic parallelization and distribution
Fault-tolerance
I/O scheduling
Status and monitoring
MapReduce Programming model
Two functions inspired by functional programming.
map (in_key, in_value) -> list(out_key, intermediate_value)
− Processes an input key/value pair
− Produces a set of intermediate pairs
[Figure: f applied independently to each input pair (key_1, value_1), (key_2, value_2), …]
Programming model (contd.)
reduce (out_key, list(intermediate_value)) -> list(out_value)
− Combines all intermediate values for a particular key
− Produces a set of merged output values (usually just one)
[Figure: f folds inter_value_1, inter_value_2, … into a running result, starting from an initial value, and returns it]
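The two signatures can be illustrated with the classic word-count example. This is a single-process sketch for intuition only, not Google's C++ library: the driver function mapReduce, the helper names wcMap and wcReduce, and the sample documents are all hypothetical, and the grouping step in the middle stands in for the shuffle that a real cluster performs between the two phases.

```javascript
// Run mapFn over every input pair, group intermediate values by key,
// then run reduceFn once per key.
function mapReduce(inputs, mapFn, reduceFn) {
  const groups = new Map();
  for (const [inKey, inValue] of inputs) {
    for (const [outKey, value] of mapFn(inKey, inValue)) {
      if (!groups.has(outKey)) groups.set(outKey, []);
      groups.get(outKey).push(value);
    }
  }
  const output = new Map();
  for (const [outKey, values] of groups) {
    output.set(outKey, reduceFn(outKey, values));
  }
  return output;
}

// map(in_key, in_value) -> list(out_key, intermediate_value)
const wcMap = (docName, text) =>
  text.toLowerCase().split(/\W+/).filter(Boolean).map((word) => [word, 1]);

// reduce(out_key, list(intermediate_value)) -> list(out_value)
const wcReduce = (word, counts) => [counts.reduce((sum, c) => sum + c, 0)];

const counts = mapReduce(
  [
    ["doc1", "the quick fox"],
    ["doc2", "the lazy dog the end"],
  ],
  wcMap,
  wcReduce
);
console.log(counts.get("the")); // [ 3 ]
```

Because each call to wcMap looks at one document only and each call to wcReduce looks at one word only, both phases can be spread across many machines without the user code changing.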
Putting It Together
Imagine how to use this to search documents in a corpus!
map() functions run in parallel; reduce() functions may also run in parallel, but they must wait for map() to finish.
Source: Google Lab
Parallel Execution
Source: Google Lab
Typical Cluster in MapReduce
100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
Limited bisection bandwidth
Storage is on local IDE disks
GFS: distributed file system manages data
Job scheduling system: jobs made up of tasks, scheduler assigns tasks to machines
Implementation is a C++ library linked into user programs
Fault Tolerance
Handled via re-execution
Master-Slave Model
On worker failure:
− Detect failure via periodic heartbeats
− Re-execute completed and in-progress map tasks
− Re-execute in-progress reduce tasks
− Task completion committed through master
Master failure:
− Could handle, but don't yet (master failure unlikely)
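The worker-failure rules above can be sketched as two small functions. The names findFailedWorkers and tasksToReexecute, the task record layout, and the timeout value are assumptions for illustration. Completed map tasks are redone because their output lives on the failed machine's local disk, whereas completed reduce output has already been written to the distributed file system.

```javascript
// A worker is presumed failed if its last heartbeat is older than timeoutMs.
function findFailedWorkers(lastHeartbeat, now, timeoutMs) {
  return Object.keys(lastHeartbeat).filter(
    (worker) => now - lastHeartbeat[worker] > timeoutMs
  );
}

// Select the tasks of a failed worker that must be scheduled again.
function tasksToReexecute(tasks, failedWorker) {
  return tasks
    .filter(
      (t) =>
        t.worker === failedWorker &&
        (t.state === "in-progress" ||
          (t.kind === "map" && t.state === "completed"))
    )
    .map((t) => t.id);
}

const lastHeartbeat = { w1: 9500, w2: 4000 }; // timestamps in milliseconds
const failed = findFailedWorkers(lastHeartbeat, 10000, 3000);

const tasks = [
  { id: "map-0", kind: "map", worker: "w2", state: "completed" },
  { id: "map-1", kind: "map", worker: "w1", state: "completed" },
  { id: "reduce-0", kind: "reduce", worker: "w2", state: "in-progress" },
  { id: "reduce-1", kind: "reduce", worker: "w2", state: "completed" },
];
const redo = failed.flatMap((w) => tasksToReexecute(tasks, w));
console.log(failed, redo);
```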
Locality Optimization
Master scheduling policy:
− map() task inputs are divided into 64 MB blocks (the Google File System block size)
− Asks GFS for the locations of replicas of input file blocks
− Tries to place map() tasks on the same machine as the physical file data, or at least on the same rack
Effect: thousands of machines read input at local disk speed
− Without this, rack switches limit the read rate
Refinement: Redundant Execution
Slow workers significantly lengthen completion time
Solution: Near end of phase, spawn backup copies of tasks, use results of first copy to finish
Effect: Dramatically shortens job completion time
Refinement: Local Merge
"Combiner" functions can run on the same machine as a mapper
This causes a mini-reduce phase to occur before the real reduce phase, to save bandwidth
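For a word count, a combiner might look like the sketch below, which sums the (word, 1) pairs produced by one mapper before anything is sent over the network. The function name combine and the sample pairs are hypothetical, and the approach assumes the reduce operation is a sum, so partial sums can safely be added together later.

```javascript
// Merge one mapper's intermediate pairs locally: [["the", 1], ["the", 1]] -> [["the", 2]].
function combine(pairs) {
  const partial = new Map();
  for (const [key, value] of pairs) {
    partial.set(key, (partial.get(key) || 0) + value);
  }
  return [...partial.entries()];
}

const mapperOutput = [["the", 1], ["fox", 1], ["the", 1], ["the", 1]];
const combined = combine(mapperOutput);
console.log(combined); // fewer pairs cross the network: [["the", 3], ["fox", 1]]
```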
Applications:
Find content across thousands upon thousands of documents.
Inverted index construction
web link-graph reversal
web access log stats
term-vector per host
document clustering
…
Hadoop
Essentially an open-source version of Google's MapReduce and its file system, GFS.
Yahoo! uses Hadoop extensively in its Web Search and Advertising businesses.
Yahoo! Search Webmap uses Hadoop on a Linux cluster of more than 10,000 cores [as of 2008-02-19]
− Some statistics of Webmap:
  Number of links between pages in the index: roughly 1 trillion links
  Size of output: over 300 TB, compressed!
  Number of cores used to run a single Map-Reduce job: over 10,000
  Raw disk used in the production cluster: over 5 petabytes
Hadoop (contd.)
IBM and Google announced a major initiative to use Hadoop to support university courses in distributed programming. [2007-10-08]
A cluster of more than 1,600 processors running Hadoop
Google Labs
Google's technology playground.
Favorite ideas that aren't quite ready for prime time.
Where Do the Innovations Come From?
Ideas come from everywhere!
All Google engineers are encouraged to spend 20% of their work time (one day per week) on projects that interest them.
− Half of the new product launches originated from the 20% time.
And, more importantly, a pleasant environment to work in.
You are brilliant, we are hiring!
Google Philosophy
Focus on the user and all else will follow.
It's best to do one thing really, really well.
Fast is better than slow.
Democracy on the web works.
You don't need to be at your desk to need an answer.
You can make money without doing evil.
There's always more information out there.
The need for information crosses all borders.
You can be serious without a suit.
Great just isn't good enough.
Discussions
Question 1: In late 2006, Google bought the online video site YouTube for US$1.65 billion in stock. Discuss the strategic implications of this acquisition.
Discussions
Question 2: Present some interesting projects/products from Google Labs.