Six Laws of Computing

Scaleable Computing
Jim Gray
Microsoft Corporation
Gray@Microsoft.com
Thesis: Scaleable Servers

Scaleable Servers
  Commodity hardware allows new applications
  New applications need huge servers
  Clients and servers are built of the same “stuff”
    Commodity software and
    Commodity hardware
Servers should be able to
  Scale up (grow node by adding CPUs, disks, networks)
  Scale out (grow by adding nodes)
  Scale down (can start small)
Key software technologies
  Objects, Transactions, Clusters, Parallelism
1987: 256 tps Benchmark
  14 M$ computer (Tandem)
  A dozen people
  False floor, 2 rooms of machines
  A 32 node processor array
  A 40 GB disk array (80 drives)
  Simulate 25,600 clients
  Staff: manager, admin expert, hardware experts, network expert, performance expert, DB expert, OS expert, auditor
1988: DB2 + CICS Mainframe
65 tps
  IBM 4391
  Simulated network of 800 clients
  2 M$ computer
  Staff of 6 to do benchmark
  2 x 3725 network controllers
  Refrigerator-sized CPU
  16 GB disk farm (4 x 8 x .5 GB)
1997: 10 years later
1 Person and 1 box = 1250 tps
  1 Breadbox ~ 5x 1987 machine room
  23 GB is hand-held
  One person does all the work (hardware, OS, net, DB, and app expert)
  Cost/tps is 1,000x less
  25 micro dollars per transaction
  4 x 200 MHz CPU
  1/2 GB DRAM
  12 x 4 GB disk
  3 x 7 x 4 GB disk arrays
What Happened?

Moore’s law:
Things get 4x better every 3 years
(applies to computers, storage, and networks)

New Economics: Commodity

class           price/mips ($/mips)   software (k$/year)
mainframe       10,000                100
minicomputer    100                   10
microcomputer   10                    1

GUI: Human - computer tradeoff
optimize for people, not computers
What Happens Next




Last 10 years:
1000x improvement
Next 10 years:
????
1985 1995 2005
Today:
text and image servers are free
25 m$/hit => advertising pays for them
Future:
video, audio, … servers are free
“You ain’t seen nothing yet!”
Kinds Of
Information Processing

                Point-to-point        Broadcast
Immediate       Conversation, Money   Lecture, Concert    (Network)
Timeshifted     Mail                  Book, Newspaper     (Database)

It’s ALL going electronic
Immediate is being stored for analysis (so ALL database)
Analysis and automatic processing are being added
Why Put Everything
In Cyberspace?
  Low rent: min $/byte
  Shrinks time: now or later
  Shrinks space: here or there
  Automate processing: knowbots
  Network: point-to-point OR broadcast, immediate OR time-delayed
  Database: locate, process, analyze, summarize
Magnetic Storage
Cheaper Than Paper

File cabinet:
  cabinet (four drawer)     250$
  paper (24,000 sheets)     250$
  space (2x3 @ 10$/ft2)     180$
  total                     700$    3¢/sheet

Disk:
  disk (4 GB)               800$
  ASCII: 2 mil pages        0.04¢/sheet   (80x cheaper)
  Image: 200,000 pages      0.4¢/sheet    (8x cheaper)

Store everything on disk
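The per-sheet figures above can be checked with a few lines of arithmetic. This is a minimal sketch; all dollar amounts and page counts come from the slide, and the helper name `cents_per_sheet` is only illustrative.

```python
# Figures come from the slide; the helper name is illustrative only.
def cents_per_sheet(total_dollars: float, sheets: int) -> float:
    """Return the storage cost per sheet in cents."""
    return 100.0 * total_dollars / sheets

# File cabinet: cabinet + paper + floor space, holding 24,000 sheets.
cabinet = cents_per_sheet(250 + 250 + 180, 24_000)
# One 800$ disk (4 GB) holding 2 million ASCII pages or 200,000 page images.
disk_ascii = cents_per_sheet(800, 2_000_000)
disk_image = cents_per_sheet(800, 200_000)

print(f"file cabinet: {cabinet:.2f} cents/sheet")
print(f"disk, ASCII:  {disk_ascii:.2f} cents/sheet ({cabinet / disk_ascii:.0f}x cheaper)")
print(f"disk, image:  {disk_image:.2f} cents/sheet ({cabinet / disk_image:.0f}x cheaper)")
```

The unrounded ratios come out near 70x and 7x; the slide rounds the cabinet total up to 700$ and 3¢/sheet, which is where its rounder 80x and 8x come from.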
Billions Of Clients



Every device will be “intelligent”
Doors, rooms, cars…
Computing will be ubiquitous
Billions Of Clients
Need Millions Of Servers

All clients networked
to servers



May be nomadic
or on-demand
Fast clients want
faster servers
Servers provide




Shared Data
Control
Coordination
Communication
Clients
Mobile
clients
Fixed
clients
Servers
Server
Super
server
Thesis
Many little beat few big
[Figure: processor classes and the storage hierarchy]
  Price axis: $1 million, $100 K, $10 K
  Classes: Mainframe, Mini, Micro, Nano, Pico Processor (1 MM^3)
  Storage hierarchy:
    1 MB      10 pico-second ram
    100 MB    10 nano-second ram
    10 GB     10 microsecond ram
    1 TB      10 millisecond disc
    100 TB    10 second tape archive
  Disc form factors: 14", 9", 5.25", 3.5", 2.5", 1.8"
Smoking, hairy golf ball
How to connect the many little parts?
How to program the many little parts?
Fault tolerance?
1 M SPECmarks, 1TFLOP
10^6 clocks to bulk ram
Event-horizon on chip
VM reincarnated
Multiprogram cache,
On-Chip SMP
Future Super Server:
4T Machine
  Array of 1,000 4B machines
    1 bps processors
    1 BB DRAM
    10 BB disks
    1 Bbps comm lines
    1 TB tape robot
  A few megabucks
  Challenge:
    Manageability
    Programmability
    Security
    Availability
    Scaleability
    Affordability
  As easy as a single system
Cyber Brick: a 4B machine (CPU, 5 GB RAM, 50 GB Disc)
Future servers are CLUSTERS
of processors, discs
Distributed database techniques
make clusters work
The Hardware Is In Place…
And then a miracle occurs
?



SNAP: scaleable network and platforms
Commodity-distributed OS built on:
  Commodity platforms
  Commodity network interconnect
Enables parallel applications
Scaleable Servers
BOTH SMP And Cluster
SMP super
server
Departmental
server
Personal
system
Grow up with SMP; 4xP6
is now standard
Grow out with cluster
Cluster has inexpensive parts
Cluster
of PCs
SMPs Have Advantages
  Single system image
  easier to manage, easier
  to program threads in
  shared memory, disk, Net
  4x SMP is commodity
  Software capable of 16x
  Problems:
    >4 not commodity
    Scale-down problem
    (starter systems expensive)
    There is a BIGGEST one
[Figure labels: SMP super server, Departmental server, Personal system]

Building the Largest Node

There is a biggest node (size grows over time)
Today, with NT, it is probably 1TB

We are building it (with help from DEC and SPIN2)







1 TB GeoSpatial SQL Server database
(1.4 TB of disks = 320 drives).
30K BTU, 8 KVA, 1.5 metric tons.
1-TB home page
Will put it on the Web as a demo app.
10 meter image of the ENTIRE PLANET.
www.SQL.1TB.com
2 meter image of interesting parts (2% of land)
One pixel per meter = 500 TB uncompressed.

Better resolution in US (courtesy of USGS).
1-TB SQL Server DB
Satellite and aerial
photos
Support
files
What’s a Terabyte?

1 Terabyte:
  1,000,000,000 business letters     150 miles of book shelf
  100,000,000 book pages             15 miles of book shelf
  50,000,000 FAX images              7 miles of book shelf
  10,000,000 TV pictures (mpeg)      10 days of video
  4,000 LandSat images               16 earth images (100m)
  100,000,000 web pages              10 copies of the web HTML

Library of Congress (in ASCII) is 25 TB
1980: $200 million of disc           10,000 discs
      $5 million of tape silo        10,000 tapes
1997: 200 k$ of magnetic disc        48 discs
      30 k$ nearline tape            20 tapes
Terror Byte !
TPC-C Web-Based Benchmarks
  Client is a Web browser (7,500 of them!)
  Submits Order, Invoice, Query to server via Web page interface
  Web server translates to DB
  SQL does DB work
  Net:
    easy to implement
    performance is GREAT!
[Figure: Web browser, HTTP, IIS = Web, ODBC, SQL]
TPC-C Shows How Far SMPs have come
  Performance is amazing:
    2,000 users is the min!
    30,000 users on a 4x12 alpha cluster (Oracle)
  Peak Performance: 30,390 tpmC @ $305/tpmC (Oracle/DEC)
  Best Price/Perf: 6,712 tpmC @ $65/tpmC (MS SQL/DEC/Intel)
  Graphs show UNIX high price & diseconomy of scaleup
[Figure: tpmC & Price Performance (only "best" data shown for each vendor). $/tpmC (0 to 400) versus tpmC (0 to 20,000) for DB2, Informix, MS SQL Server, Oracle, Sybase]
TPC C SMP Performance
• SMPs do offer speedup
  but 4x P6 is better than some 18x MIPSco
[Figure: tpmC vs CPUs, two panels titled "SUN Scaleability", one also showing SQL Server. tpmC (0 to 20,000) versus CPUs (0 to 20)]
The TPC-C Revolution
Shows How Far
NT and SQL Server have Come
  Economy of scale on Windows NT
  Recent Microsoft SQL Server benchmarks are Web-based
[Figure: tpmC and $/tpmC. MS SQL Server: Economy of Scale & Low Price. Price $/tpmC ($0 to $250) versus Performance tpmC (0 to 8,000) for DB2, Informix, Microsoft, Oracle, Sybase]
What Happens To Prices?
  No expensive UNIX front end (20$/tpmC)
  No expensive TP monitor software (10$/tpmC)
  => 65$/tpmC
[Figure: TPC Price/tpmC, split into processor, disk, software, and net, for:
  Informix on SNI
  Oracle on DEC Unix
  Oracle on Compaq/NT
  Sybase on Compaq/NT
  Microsoft on Compaq with Visigenic
  Microsoft on HP with Visigenic
  Microsoft on Intergraph with IIS
  Microsoft on Compaq with IIS]
Grow UP and OUT
1 Terabyte DB
SMP super
server
Departmental
server
Personal
system
Cluster:
•a collection of nodes
•as easy to program
and manage as a
single node
1 billion
transactions
per day
Clusters Have Advantages
  Clients and servers made from the same stuff
  Inexpensive:
    Built with commodity components
  Fault tolerance:
    Spare modules mask failures
  Modular growth
    Grow by adding small modules
  Unlimited growth:
    no biggest one
Windows NT clusters

Key goals:





Easy: to install, manage, program
Reliable: better than a single node
Scaleable: added parts add power
Microsoft & 60 vendors
defining NT clusters



Almost all big hardware and
software vendors involved
No special hardware needed 
but it may help
Enables



Commodity fault-tolerance
Commodity parallelism
(data mining, virtual reality…)
Also great for workgroups!
Initial: two-node failover





Beta testing since December 1996
SAP, Microsoft, Oracle giving
demos.
File, print, Internet, mail, DB, other
services
Easy to manage
Each node can be 4x (or more) SMP
Next (NT5) “Wolfpack” is modest
size cluster


About 16 nodes (so 64 to 128 CPUs)
No hard limit, algorithms designed
to go further
SQL Server Failover
Using “Wolfpack” Windows NT Clusters


Each server “owns” half the database
When one fails…


The other server takes over the shared disks
Recovers the database and serves it
Private
disks
Private
disks
Shared SCSI disk strings
B
A
Clients
How Much Is 1 Billion
Transactions Per Day?
  1 Btpd = 11,574 tps (transactions per second)
         ~ 700,000 tpm (transactions/minute)
  AT&T: 185 million calls (peak day worldwide)
  Visa ~20 M tpd
    400 M customers
    250,000 ATMs worldwide
    7 billion transactions / year (card+cheque) in 1994
[Figure: Millions of transactions per day (Mtpd), log scale from 0.1 to 1,000, with bars for NYSE, BofA, Visa, AT&T, and 1 Btpd]
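The conversion from transactions per day to per-second and per-minute rates is simple division. A minimal sketch, using only figures from the slide (the function and variable names are illustrative):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def per_second(transactions_per_day: float) -> float:
    """Average rate in transactions per second for a daily volume."""
    return transactions_per_day / SECONDS_PER_DAY

tps = per_second(1e9)      # one billion transactions per day
tpm = tps * 60             # the same load per minute
visa_tpd = 7e9 / 365       # Visa: 7 billion transactions per year, as a daily average

print(f"1 Btpd = {tps:,.0f} tps = {tpm:,.0f} tpm")
print(f"Visa   = {visa_tpd / 1e6:.1f} M tpd")
```

These are daily averages; peak rates would be higher.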

Billion Transactions per Day
Project






Building a 20-node Windows NT
Cluster (with help from Intel)
> 800 disks
All commodity parts
Using SQL Server &
DTC distributed transactions
Each node has 1/20th of the DB
Each node does 1/20th of the work
15% of the transactions are
“distributed”
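One way to picture the split: each record has a home node, and a transaction is "distributed" when it touches records on more than one node. The sketch below assumes a simple modulo partitioning on an integer account id. That scheme and the names `home_node` and `is_distributed` are assumptions for illustration; the slide does not say how the data is actually partitioned.

```python
NODES = 20  # the 20-node cluster from the slide

def home_node(account_id: int) -> int:
    """Hypothetical partitioning rule: account ids are spread round-robin over the nodes."""
    return account_id % NODES

def is_distributed(account_ids) -> bool:
    """A transaction is distributed if its accounts live on more than one node."""
    return len({home_node(a) for a in account_ids}) > 1

print(is_distributed([5, 25]))   # both accounts live on node 5
print(is_distributed([5, 6]))    # accounts live on nodes 5 and 6
```

A local transaction commits on one node; a distributed one needs the distributed transaction coordinator (DTC on the slide) to commit across nodes.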
Parallelism
The OTHER aspect of clusters

Clusters of machines
allow two kinds
of parallelism



Many little jobs: online
transaction processing
 TPC-A, B, C…
A few big jobs: data
search and analysis
 TPC-D, DSS, OLAP
Both give
automatic parallelism
Kinds of Parallel Execution
  Pipeline: any sequential program feeds any sequential program
  Partition: outputs split N ways, inputs merge M ways, with a copy of any sequential program on each partition
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
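Both kinds of parallelism can be sketched with ordinary sequential code. In this minimal, hypothetical example the pipeline is a chain of Python generators, and the partitioned version runs that same pipeline on N slices of the input in a thread pool and then merges the partial results. The function names are illustrative and not from the talk.

```python
from concurrent.futures import ThreadPoolExecutor

def scan(records):
    """An ordinary sequential program: emit each record."""
    for r in records:
        yield r

def keep_even(records):
    """Another sequential program: filter its input stream."""
    for r in records:
        if r % 2 == 0:
            yield r

def pipeline(records):
    """Pipeline: each stage consumes the previous stage's output as a stream."""
    return sum(keep_even(scan(records)))

def split(records, n):
    """Split the input N ways (round-robin)."""
    return [records[i::n] for i in range(n)]

def partitioned(records, n=4):
    """Partition: run the same sequential pipeline on each slice, then merge."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        partials = list(pool.map(pipeline, split(records, n)))
    return sum(partials)

data = list(range(1, 101))
print(pipeline(data), partitioned(data))
```

The stage code is unchanged between the two runs; only the plumbing around it differs.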
Data Rivers
Split + Merge Streams
N X M Data Streams
M Consumers
N producers
River
Producers add records to the river,
Consumers consume records from the river
Purely sequential programming.
River does flow control and buffering
does partition and merge of data records
River = Split/Merge in Gamma = Exchange operator in Volcano.
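A river can be approximated with bounded queues: producers and consumers stay purely sequential, while the river splits records by key, merges them per consumer, and applies flow control when a queue fills. This is a small sketch under those assumptions, not the Gamma or Volcano implementation; `run_river` and its parameters are hypothetical names.

```python
import queue
import threading

_DONE = object()  # end-of-stream marker (a detail of this sketch)

def run_river(producers, consumer, n_consumers, key=hash, capacity=8):
    """Feed N producer iterables to M copies of a sequential consumer function."""
    # One bounded queue per consumer: a full queue blocks producers (flow control).
    streams = [queue.Queue(maxsize=capacity) for _ in range(n_consumers)]
    results = [None] * n_consumers

    def produce(source):
        for record in source:
            # Split: route each record to a consumer by its key.
            streams[key(record) % n_consumers].put(record)

    def consume(i):
        def records():
            # Merge: records from every producer arrive on the same stream.
            while True:
                record = streams[i].get()
                if record is _DONE:
                    return
                yield record
        results[i] = consumer(records())

    producer_threads = [threading.Thread(target=produce, args=(p,)) for p in producers]
    consumer_threads = [threading.Thread(target=consume, args=(i,)) for i in range(n_consumers)]
    for t in producer_threads + consumer_threads:
        t.start()
    for t in producer_threads:
        t.join()
    for s in streams:
        s.put(_DONE)  # all producers are finished; close every stream
    for t in consumer_threads:
        t.join()
    return results

# Two producers, three consumers that each run the plain sequential `sum`.
partial_sums = run_river([range(0, 50), range(50, 100)], sum, n_consumers=3)
print(partial_sums, sum(partial_sums))
```

Neither `range` nor `sum` knows it is running in parallel, which is the point of the river abstraction.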
Partitioned Execution
Spreads computation and IO among processors
[Figure: a table partitioned by key range (A...E, F...J, K...N, O...S, T...Z); a Count runs on each partition and a final Count combines the partial counts]
Partitioned data gives
NATURAL parallelism
N x M way Parallelism
[Figure: partitioned data (A...E, F...J, K...N, O...S, T...Z) with five Joins, five Sorts, and three Merges]
N inputs, M outputs, no bottlenecks.
Partitioned and Pipelined Data Flows
The Parallel Law
Of Computing
Grosch's Law: 2x $ is 4x performance
  1 MIPS         1 $
  1,000 MIPS     32 $    (.03 $/MIPS)
Parallel Law: 2x $ is 2x performance
  1 MIPS         1 $
  1,000 MIPS     1,000 $
  Needs: linear speedup and linear scale-up
  Not always possible
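The two price points follow from the scaling rules. Under Grosch's law performance grows with the square of cost, so cost grows with the square root of performance; under the parallel law cost is linear in performance. A minimal sketch, normalized as on the slide so that 1 MIPS costs 1 $ (function names are illustrative):

```python
import math

def grosch_cost(mips: float) -> float:
    """Grosch's law: 2x $ buys 4x performance, so cost = sqrt(performance)."""
    return math.sqrt(mips)

def parallel_cost(mips: float) -> float:
    """Parallel law: 2x $ buys 2x performance, so cost is linear."""
    return float(mips)

for mips in (1, 1_000):
    print(f"{mips:>5} MIPS: Grosch {grosch_cost(mips):7.1f} $, parallel {parallel_cost(mips):7.1f} $")
print(f"Grosch $/MIPS at 1,000 MIPS: {grosch_cost(1_000) / 1_000:.3f}")
```

The square root of 1,000 is about 31.6, which the slide rounds to 32 $ and .03 $/MIPS.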
The BIG Picture
Components and transactions




Software modules are objects
Object Request Broker (a.k.a., Transaction
Processing Monitor) connects objects
(clients to servers)
Standard interfaces allow software plug-ins
Transaction ties execution of a “job” into an
atomic unit: all-or-nothing, durable, isolated
Object Request Broker
ActiveX and COM




COM is Microsoft's model, the engine inside OLE
ALL Microsoft software is based on COM (ActiveX)
CORBA + OpenDoc is equivalent
Heated debate over which is best
Both share same key goals:








Encapsulation: hide implementation
Polymorphism: generic operations
key to GUI and reuse
Versioning: allow upgrades
Transparency: local/remote
Security: invocation can be remote
Shrink-wrap: minimal inheritance
Automation: easy
COM now managed by the Open Group
Linking And Embedding
Objects are data modules;
transactions are execution modules

Link: pointer to object
somewhere else

Think URL in Internet

Embed: bytes
are here

Objects may be active;
can callback to subscribers
Objects Meet Databases
The basis for universal
data servers, access, & integration






object-oriented (COM oriented)
programming interface to data
Breaks DBMS into components
Anything can be a data source
Optimization/navigation “on top
of” other data sources
A way to componentize a DBMS
Makes an RDBMS an O-R
DBMS (assumes optimizer
understands objects)
DBMS
engine
Database
Spreadsheet
Photos
Mail
Map
Document
The Pattern:
Three Tier Computing
Presentation

Clients do presentation, gather input

Clients do some workflow (Xscript)

Clients send high-level requests to
ORB (Object Request Broker)

ORB dispatches workflows and
business objects -- proxies for client,
orchestrate flows & queues

Server-side workflow scripts call on
distributed business objects to
execute task
[Figure labels: Business Objects, workflow, Database]
The Three
Tiers
Web Client
HTML
VB Java
plug-ins
VBScript
JavaScript
Middleware
VB or Java
Script Engine
Object
server
Pool
VB or Java
Virt Machine
Internet
HTTP+
DCOM
ORB
ORB
TP Monitor
Web Server...
Object & Data
server.
DCOM (oleDB, ODBC,...)
IBM
Legacy
Gateways
Why Did Everyone Go To
Three-Tier?
  Manageability
    Business rules must be with data
    Middleware operations tools
  Performance (scaleability)
    Server resources are precious
    ORB dispatches requests to server pools
  Technology & Physics
    Put UI processing near user
    Put shared data processing near shared data
[Figure labels: Presentation, workflow, Business Objects, Database]
What Middleware Does
ORB, TP Monitor, Workflow Mgr, Web Server
  Registers transaction programs,
  workflow and business objects (DLLs)
  Pre-allocates server pools
  Provides server execution environment
  Dynamically checks authority
  (request-level security)
  Does parameter binding
  Dispatches requests to servers
    parameter binding
    load balancing
  Provides Queues
  Operator interface
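The duties above fit in a small dispatcher. The sketch below is a hypothetical toy, not the interface of any real ORB or TP monitor: it registers a request handler, pre-allocates a pool of servers for it, checks the caller's authority on every request, binds the parameters, and load-balances round-robin across the pool.

```python
class Server:
    """One pre-allocated server in a pool; counts the requests it handles."""
    def __init__(self, server_id, handler):
        self.server_id = server_id
        self.handler = handler
        self.calls = 0

    def invoke(self, *args):
        self.calls += 1
        return self.handler(*args)  # parameter binding: pass request args to the handler


class Middleware:
    """Toy dispatcher: registration, server pools, request-level security, load balancing."""
    def __init__(self):
        self._pools = {}   # request name -> list of Server
        self._next = {}    # request name -> index of the next server to use
        self._acl = {}     # request name -> set of authorized users

    def register(self, name, handler, pool_size, allowed_users):
        self._pools[name] = [Server(i, handler) for i in range(pool_size)]
        self._next[name] = 0
        self._acl[name] = set(allowed_users)

    def dispatch(self, user, name, *args):
        if user not in self._acl.get(name, ()):      # checked on every request
            raise PermissionError(f"{user} may not call {name}")
        pool = self._pools[name]
        server = pool[self._next[name]]               # round-robin load balancing
        self._next[name] = (self._next[name] + 1) % len(pool)
        return server.invoke(*args)

    def load(self, name):
        """Operator interface: requests handled by each server in the pool."""
        return [s.calls for s in self._pools[name]]


mw = Middleware()
mw.register("debit", lambda account, amount: f"debit {amount} from {account}",
            pool_size=2, allowed_users={"alice"})
for amount in (10, 20, 30):
    print(mw.dispatch("alice", "debit", "acct-1", amount))
print(mw.load("debit"))
```

Queues and the execution environment (threads, context, connections) are left out to keep the sketch short.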
Server Side Objects
  Easy Server-Side Execution
  Give simple execution environment
  Object gets
    start
    invoke
    shutdown
  Everything else is automatic
  Drag & Drop Business Objects
[Figure: A Server, with Network, Receiver, Queue, Connections, Context, Security, Thread Pool, Service logic, Synchronization, Shared Data, Configuration, Management]
A new programming paradigm






Develop object on the desktop
Better yet: download them from the Net
Script work flows as method invocations
All on desktop
Then, move work flows and objects to server(s)
Gives
desktop
development
three-tier deployment
Software Cyberbricks
Transactions Coordinate
Components (ACID)

Transaction properties





Atomic: all or nothing
Consistent: old and new values
Isolated: automatic locking or versioning
Durable: once committed, effects survive
Transactions are built into modern OSs

MVS/TM Tandem TMF, VMS DEC-DTM, NT-DTC
Transactions & Objects




Application requests transaction
identifier (XID)
XID flows with method invocations
Object Managers join (enlist)
in transaction
Distributed Transaction Manager
coordinates commit/abort
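The flow above can be sketched as follows. The slide does not name the commit protocol, so this example assumes a two-phase commit: every enlisted object manager votes in a prepare phase, and the coordinator commits only if all vote yes. Class and method names are hypothetical and do not reflect the DTC API.

```python
import itertools

class ObjectManager:
    """A resource that can join a transaction and vote on its outcome."""
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = {}                      # xid -> status

    def enlist(self, xid):
        self.state[xid] = "active"

    def prepare(self, xid):                  # phase 1: vote
        self.state[xid] = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self, xid):                   # phase 2: outcome
        self.state[xid] = "committed"

    def abort(self, xid):
        self.state[xid] = "aborted"


class TransactionManager:
    """Hands out XIDs, tracks enlisted managers, coordinates commit/abort."""
    def __init__(self):
        self._ids = itertools.count(1)
        self._enlisted = {}                  # xid -> list of ObjectManager

    def begin(self):
        xid = next(self._ids)
        self._enlisted[xid] = []
        return xid

    def enlist(self, xid, manager):
        manager.enlist(xid)
        self._enlisted[xid].append(manager)

    def end(self, xid):
        managers = self._enlisted.pop(xid)
        votes = [m.prepare(xid) for m in managers]   # ask every participant
        if all(votes):
            for m in managers:
                m.commit(xid)
            return "committed"
        for m in managers:
            m.abort(xid)
        return "aborted"


tm = TransactionManager()
accounts, audit = ObjectManager("accounts"), ObjectManager("audit")
xid = tm.begin()                 # application requests a transaction identifier
tm.enlist(xid, accounts)         # the XID flows with each method invocation,
tm.enlist(xid, audit)            # and each object manager joins the transaction
print(tm.end(xid), accounts.state[xid], audit.state[xid])
```

A real coordinator also logs its decision durably so the outcome survives a crash; that is omitted here.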
Distributed Transactions
Enable Huge Throughput




Each node capable of 7 KtpmC (7,000 active users!)
Can add nodes to cluster (to support 100,000 users)
Transactions coordinate nodes
ORB / TP monitor spreads work among nodes
Distributed Transactions
Enable Huge DBs


Distributed database technology
spreads data among nodes
Transaction processing technology
manages nodes
Thesis: Scaleable Servers

Scaleable Servers Built from Cyberbricks


Servers should be able to


Allow new applications
Scale up, out, down
Key software technologies




Clusters (ties the hardware together)
Parallelism (uses the independent cpus, stores, wires)
Objects (software CyberBricks)
Transactions: masks errors.
Computer Industry Laws
(Rules of thumb)











Metcalfe’s law
Moore’s first law
Bell’s computer classes (7 price tiers)
Bell’s platform evolution
Bell’s platform economics
Bill’s law
Software economics
Grove’s law
Moore’s second law
Is info-demand infinite?
The death of Grosch’s law
Metcalfe’s Law
Network Utility = Users^2

How many connections can it
make?





1 user: no utility
100,000 users: a few contacts
1 million users: many on Net
1 billion users: everyone on Net
That is why the Internet is so “hot”

Exponential benefit
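The Users^2 rule follows from counting possible connections: n users can form n(n-1)/2 distinct pairs, which grows like n^2. A minimal sketch (function names are illustrative):

```python
def pairwise_connections(users: int) -> int:
    """Number of distinct user-to-user connections in a network of `users`."""
    return users * (users - 1) // 2

def metcalfe_utility(users: int) -> int:
    """Metcalfe's rule of thumb as stated on the slide: utility = users squared."""
    return users ** 2

for n in (1, 100_000, 1_000_000, 1_000_000_000):
    print(f"{n:>13,} users: {pairwise_connections(n):,} possible connections")
```

Ten times the users gives about one hundred times the possible connections.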
Moore’s First Law
  XXX doubles every 18 months: 60% increase per year
    Micro processor speeds
    Chip density
    Magnetic disk density
    Communications bandwidth
    WAN bandwidth approaching LANs
  Exponential growth:
    The past does not matter
    10x here, 10x there, soon you’re talking REAL change
  PC costs decline faster than any other platform
    Volume and learning curves
    PCs will be the building bricks of all future
[Figure: 1 chip memory size (2 MB to 32 MB), 1970 to 2000. Bits per chip: 1K, 4K, 16K, 64K, 256K, 1M, 4M, 16M, 64M, 256M; size axis from 8KB to 1GB]
Bumps In The Moore’s
Law Road
  DRAM:
    1988: United States anti-dumping rules
    1993-1995: ?price flat
  Magnetic disk:
    1965-1989: 10x/decade
    1989-1996: 4x/3year! 100X/decade
[Figure: $/MB of DRAM (1 to 1,000,000) and $/MB of DISK (.01 to 10,000), 1970 to 2000]
Gordon Bell’s 1975 VAX Planning
Model... He Didn’t Believe
It!
System Price = 5 x 3 x .04 x memory size / 1.26^(t-1972)
  5x: Memory is 20% of cost
  3x: DEC markup
  .04x: $ per byte
He didn’t believe: the projection $500 machine
He couldn’t comprehend the implications
[Figure: system price in K$ (0.01 to 100,000, log scale), 1960 to 2000, for memory sizes 16 KB, 64 KB, 256 KB, 1 MB, 8 MB]
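The planning model is a one-line formula. This sketch assumes memory size is given in bytes and the result is in dollars, since the slide gives .04 as $ per byte; the function name is illustrative.

```python
def vax_system_price(memory_bytes: float, year: int) -> float:
    """System price = 5 x 3 x .04 x memory size / 1.26^(t - 1972).

    5x: memory is 20% of system cost; 3x: DEC markup; .04: $ per byte in 1972;
    1.26 per year: memory price halves about every three years.
    Units (bytes in, dollars out) are an assumption of this sketch.
    """
    return 5 * 3 * 0.04 * memory_bytes / 1.26 ** (year - 1972)

for year in (1972, 1975, 1985, 1995):
    print(year, f"{vax_system_price(64 * 1024, year):,.0f} $ for a 64 KB system")
```

Holding memory size fixed, the modeled price halves roughly every three years, which is the projection the slide says was hard to believe.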
Gordon Bell’s Processing,
Memories, And Comm: 100
Years
[Figure: log scale from 1.E+00 to 1.E+18, 1947 to 2047, with curves for Processing, Pri. Mem, Sec. Mem., POTS(bps), and Backbone]
Gordon Bell’s Seven Price
Tiers
  10$:            wrist watch computers
  100$:           pocket/ palm computers
  1,000$:         portable computers
  10,000$:        personal computers (desktop)
  100,000$:       departmental computers (closet)
  1,000,000$:     site computers (glass house)
  10,000,000$:    regional computers (glass castle)
Super server: costs more than $100,000
“Mainframe”: costs more than $1 million
Must be an array of processors, disks, tapes, comm ports
Bell’s Evolution Of
Computer Classes
Technology enables two evolutionary paths:
1. constant performance, decreasing cost
2. constant price, increasing performance
[Figure: log price versus time for Mainframes (central), Minis (dep’t.), WSs, PCs (personals), and ??]
1.26 = 2x/3 yrs -- 10x/decade; 1/1.26 = .8
1.6 = 4x/3 yrs --100x/decade; 1/1.6 = .62
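The two annual factors can be checked by compounding them. A minimal sketch (the function name is illustrative):

```python
def growth(annual_factor: float, years: int) -> float:
    """Total improvement after compounding an annual factor for `years` years."""
    return annual_factor ** years

for factor in (1.26, 1.6):
    print(f"{factor}/year: {growth(factor, 3):.2f}x in 3 years, "
          f"{growth(factor, 10):.0f}x in a decade, "
          f"price ratio per year {1 / factor:.2f}")
```

The reciprocal is the yearly price multiplier for constant performance: about .8 for the slower curve and .62 for the faster one.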
Gordon Bell’s
Platform Economics
  Traditional computers: custom or semi-custom, high-tech and high-touch
  New computers: high-tech and no-touch
[Figure: Price (K$), Volume (K), and Application price, log scale from 0.01 to 100,000, by computer type: Mainframe, WS, Browser]
Software
Economics
  An engineer costs about $150,000/year
  R&D gets [5%…15%] of budget
  Need [$3 million…$1 million] revenue per engineer

Company (revenue)         Profit   R&D   SG&A   Tax   Product and Service
Microsoft: $9 billion     24%      16%   34%    13%   13%
Intel: $16 billion        22%      8%    11%    12%   47%
IBM: $72 billion          6%       8%    22%    5%    59%
Oracle: $3 billion        15%      9%    43%    7%    26%
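The revenue-per-engineer range follows from dividing the cost of an engineer by the R&D share of revenue. A minimal sketch using the slide's figures (names are illustrative):

```python
ENGINEER_COST = 150_000  # dollars per year, from the slide

def revenue_per_engineer(rd_percent: float) -> float:
    """Revenue needed to fund one engineer when R&D gets `rd_percent` of revenue."""
    return ENGINEER_COST * 100 / rd_percent

for pct in (5, 15, 16):
    print(f"R&D at {pct:>2}% of revenue: {revenue_per_engineer(pct):,.0f} $ per engineer")
```

At 5% the company needs $3 million of revenue per engineer and at 15% it needs $1 million; the 16% line corresponds to the R&D share listed for Microsoft.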
Software Economics: Bill’s
Law
Price = Fixed_Cost / Units + Marginal_Cost
  Bill Joy’s law (Sun):
  don’t write software for less than 100,000 platforms
  @$10 million engineering expense, $1,000 price
  Bill Gates’s law:
  don’t write software for less than 1,000,000 platforms
  @$10 million engineering expense, $100 price
  Examples:
  UNIX versus Windows NT: $3,500 versus $500
  Oracle versus SQL-Server: $100,000 versus $6,000
  No spreadsheet or presentation pack on UNIX/VMS/...
  Commoditization of base software and hardware
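The price formula can be applied to both laws. This sketch assumes a $10 million engineering expense in both cases and ignores marginal cost; both are assumptions made for the comparison, and the function name is illustrative.

```python
def unit_price(fixed_cost: float, units: int, marginal_cost: float = 0.0) -> float:
    """Price = Fixed_Cost / Units + Marginal_Cost."""
    return fixed_cost / units + marginal_cost

ENGINEERING = 10_000_000  # assumed engineering expense in dollars for both cases

joy = unit_price(ENGINEERING, 100_000)       # Bill Joy: at least 100,000 platforms
gates = unit_price(ENGINEERING, 1_000_000)   # Bill Gates: at least 1,000,000 platforms
print(f"fixed cost per unit: Joy {joy:.0f} $, Gates {gates:.0f} $")
```

Under these assumptions the engineering cost per unit is one tenth of the quoted price in both cases ($100 of $1,000 and $10 of $100), so ten times the volume supports a ten times lower price.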
Grove’s Law
The New Computer Industry
  Horizontal integration is new structure
  Each layer picks best from lower layer
  Desktop (C/S) market
    1991: 50%
    1995: 75%

Function           Example
Operation          AT&T
Integration        EDS
Applications       SAP
Middleware         Oracle
Baseware           Microsoft
Systems            Compaq
Silicon & Oxide    Intel & Seagate



Moore’s Second
Law
  The cost of fab lines doubles every generation (three years)
  Money limit hard to imagine:
    $10-billion line
    $20-billion line
    $40-billion line
  Physical limit
    Quantum effects at 0.25 micron now, 0.05 micron seems hard (12 years, three generations)
    Lithography: need X-ray below 0.13 micron
[Figure: $million/Fab Line ($1 to $10,000, log scale) by year, 1960 to 2000]
Constant Dollars Versus
Constant Work

Constant work:

One SuperServer can do
all the world’s computations

Constant dollars:
The world spends 10% on
information processing
  Computers are moving from 5% penetration to 50%
  $300 billion to $3 trillion
  We have the patent on the byte and algorithm

Crossing The Chasm
New
market
Product finds
customers
No product
no customers
Hard
Old
market
Boring
competitive
slow growth
Old
technology
Hard
Customers
find product
New
technology