Scaleable Computing Jim Gray Microsoft Corporation

Gray@Microsoft.com
Thesis: Scaleable Servers

Scaleable Servers
  Commodity hardware allows new applications
  New applications need huge servers
  Clients and servers are built of the same “stuff”
    Commodity software and
    Commodity hardware
Servers should be able to
  Scale up (grow node by adding CPUs, disks, networks)
  Scale out (grow by adding nodes)
  Scale down (can start small)
Key software technologies
  Objects, Transactions, Clusters, Parallelism
1987: 256 tps Benchmark
  14 M$ computer (Tandem)
  A dozen people
  False floor, 2 rooms of machines
  A 32 node processor array
  A 40 GB disk array (80 drives)
  Simulate 25,600 clients
  Figure labels: Admin expert, Hardware experts, Network expert, Manager, Performance expert, DB expert, Auditor, OS expert
1988: DB2 + CICS Mainframe
65 tps
  IBM 4391
  Simulated network of 800 clients
  2 M$ computer
  Staff of 6 to do benchmark
  2 x 3725 network controllers
  Refrigerator-sized CPU
  16 GB disk farm (4 x 8 x .5 GB)
1997: 10 years later
1 Person and 1 box = 1250 tps
  1 Breadbox ~ 5x 1987 machine room
  23 GB is hand-held
  One person does all the work
  Cost/tps is 1,000x less
  25 micro dollars per transaction
  Figure labels: Hardware expert, OS expert, Net expert, DB expert, App expert
  Figure labels: 4 x 200 MHz CPU, 1/2 GB DRAM, 12 x 4 GB disk, 3 x 7 x 4 GB disk arrays
What Happened?

Moore’s law:
Things get 4x better every 3 years
(applies to computers, storage, and networks)

New Economics: Commodity
  class           price/mips ($/mips)   software (k$/year)
  mainframe       10,000                100
  minicomputer    100                   10
  microcomputer   10                    1

GUI: Human - computer tradeoff
  optimize for people, not computers
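
The compounding behind "4x better every 3 years" can be checked with a few lines of Python. This is only a sketch: the function name `improvement_factor` is made up for illustration, and it assumes smooth exponential growth between the three-year steps.

```python
def improvement_factor(years, factor=4.0, period_years=3.0):
    """Compound improvement after `years`, assuming `factor`x every `period_years`."""
    return factor ** (years / period_years)

# Print the compounded gain for a few horizons.
for years in (3, 6, 10):
    print(f"{years:2d} years -> {improvement_factor(years):7.1f}x")
```

Under that assumption, ten years of compounding comes to roughly 100x.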
What Happens Next




Last 10 years: 1000x improvement
Next 10 years: ????
[Chart time axis: 1985, 1995, 2005]
Today:
text and image servers are free
25 m$/hit => advertising pays for them
Future:
video, audio, … servers are free
“You ain’t seen nothing yet!”
Kinds Of Information Processing

               Point-to-point         Broadcast
  Immediate    Conversation, Money    Lecture, Concert     (Network)
  Timeshifted  Mail                   Book, Newspaper      (Database)

It’s ALL going electronic
Immediate is being stored for analysis (so ALL database)
Analysis and automatic processing are being added

Why Put Everything In Cyberspace?
  Low rent: min $/byte
  Shrinks time: now or later
  Shrinks space: here or there
  Automate processing: knowbots
  Figure labels: Immediate OR time-delayed; Point-to-point OR broadcast; Network; Locate, Process, Analyze, Summarize; Database
Magnetic Storage
Cheaper Than Paper

File cabinet:
  cabinet (four drawer)      250$
  paper (24,000 sheets)      250$
  space (2x3 @ 10$/ft2)      180$
  total                      700$
                             3¢/sheet

Disk:
  disk (4 GB)                800$
  ASCII: 2 mil pages         0.04¢/sheet (80x cheaper)
  Image: 200,000 pages       0.4¢/sheet (8x cheaper)

Store everything on disk
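
The per-sheet figures can be reproduced from the prices quoted on the slide. The sketch below uses only those numbers; the constant and function names are illustrative, not from the talk.

```python
# Prices and capacities as quoted on the slide.
CABINET_COST = 250 + 250 + 180   # cabinet + paper + floor space, in dollars
CABINET_SHEETS = 24_000
DISK_COST = 800                  # one 4 GB disk, in dollars
ASCII_PAGES = 2_000_000
IMAGE_PAGES = 200_000

def cents_per_sheet(dollars, sheets):
    """Storage cost of one sheet (or page), in cents."""
    return 100.0 * dollars / sheets

paper = cents_per_sheet(CABINET_COST, CABINET_SHEETS)
ascii_disk = cents_per_sheet(DISK_COST, ASCII_PAGES)
image_disk = cents_per_sheet(DISK_COST, IMAGE_PAGES)

print(f"paper:       {paper:.2f} cents/sheet")
print(f"disk, ASCII: {ascii_disk:.2f} cents/page ({paper / ascii_disk:.0f}x cheaper)")
print(f"disk, image: {image_disk:.2f} cents/page ({paper / image_disk:.0f}x cheaper)")
```

The slide rounds the cabinet total up to 700$ and the paper cost to 3¢/sheet, so the exact ratios come out slightly below its 80x and 8x.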
Billions Of Clients
  Every device will be “intelligent”
  Doors, rooms, cars…
  Computing will be ubiquitous

Billions Of Clients Need Millions Of Servers
  All clients networked to servers
    May be nomadic or on-demand
  Fast clients want faster servers
  Servers provide
    Shared Data
    Control
    Coordination
    Communication
  Figure labels: Clients (mobile clients, fixed clients); Servers (server, super server)
Thesis
Many little beat few big
  Figure labels:
    Machine classes: Mainframe, Mini, Micro, Nano, Pico Processor
    Price points: $1 million, $100 K, $10 K
    Disk form factors: 14", 9", 5.25", 3.5", 2.5", 1.8"
    Storage hierarchy: 1 MB, 10 pico-second RAM; 100 MB, 10 nano-second RAM; 10 GB, 10 microsecond RAM; 1 TB, 10 millisecond disc; 100 TB, 10 second tape archive
    1 MM^3
  Smoking, hairy golf ball
  How to connect the many little parts?
  How to program the many little parts?
  Fault tolerance?
  1 M SPECmarks, 1 TFLOP
  10^6 clocks to bulk RAM
  Event-horizon on chip
  VM reincarnated
  Multiprogram cache, On-Chip SMP
Future Super Server: 4T Machine
  Array of 1,000 4B machines
    1 Bips processors
    1 BB DRAM
    10 BB disks
    1 Bbps comm lines
    1 TB tape robot
  A few megabucks
  Challenge:
    Manageability
    Programmability
    Security
    Availability
    Scaleability
    Affordability
  As easy as a single system
  Figure: Cyber Brick, a 4B machine (CPU, 50 GB Disc, 5 GB RAM)

Future servers are CLUSTERS of processors, discs
Distributed database techniques make clusters work
Performance = Storage Accesses not Instructions Executed
  In the “old days” we counted instructions and IO’s
  Now we count memory references
  Processors wait most of the time
  Where the time goes: clock ticks used by AlphaSort Components
  Figure labels: Disc Wait, Sort, OS, Memory Wait, B-Cache Data Miss, I-Cache Miss, D-Cache Miss
  70 MIPS
  “real” apps have worse Icache misses so run at 60 MIPS if well tuned, 20 MIPS if not
Storage Latency: How Far Away is the Data?

  Clock ticks   Storage level           Analogy       Time
  10^9          Tape / Optical Robot    Andromeda     2,000 years
  10^6          Disk                    Pluto         2 years
  100           Memory                  Sacramento    1.5 hr
  10            On Board Cache          This Campus   10 min
  2             On Chip Cache           This Room
  1             Registers               My Head       1 min
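
The human-scale times follow from treating one clock tick as one minute, which is how the register row (1 tick, 1 min) reads. The sketch below applies that scaling to the tick counts on the slide; the one-tick-per-minute rule and the function name `human_scale` are assumptions made for illustration.

```python
# Tick counts per storage level, as listed on the slide.
LEVELS = [
    ("Registers", 1),
    ("On Chip Cache", 2),
    ("On Board Cache", 10),
    ("Memory", 100),
    ("Disk", 10**6),
    ("Tape / Optical Robot", 10**9),
]

MINUTES_PER_HOUR = 60
MINUTES_PER_YEAR = 60 * 24 * 365

def human_scale(ticks):
    """Render `ticks` as a duration, assuming one clock tick feels like one minute."""
    if ticks < MINUTES_PER_HOUR:
        return f"{ticks} min"
    if ticks < MINUTES_PER_YEAR:
        return f"{ticks / MINUTES_PER_HOUR:.1f} hr"
    return f"{ticks / MINUTES_PER_YEAR:,.0f} years"

for name, ticks in LEVELS:
    print(f"{name:22s} {ticks:>13,d} ticks  ~ {human_scale(ticks)}")
```

The computed values land close to the slide's rounded figures (about 1.7 hr for memory, about 2 years for disk, about 1,900 years for the tape robot).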
The Hardware Is In Place…
And then a miracle occurs?
  SNAP: scaleable network and platforms
  Commodity-distributed OS built on:
    Commodity platforms
    Commodity network interconnect
  Enables parallel applications
Scaleable Servers
BOTH SMP And Cluster
  Grow up with SMP; 4xP6 is now standard
  Grow out with cluster
  Cluster has inexpensive parts
  Figure labels: SMP super server, Departmental server, Personal system, Cluster of PCs
SMPs Have Advantages
  Single system image: easier to manage, easier to program threads in shared memory, disk, Net
  4x SMP is commodity
  Software capable of 16x
  Problems:
    >4 not commodity
    Scale-down problem (starter systems expensive)
    There is a BIGGEST one
  Figure labels: SMP super server, Departmental server, Personal system

TPC-C Web-Based Benchmarks
  Client is a Web browser (7,500 of them!)
  Submits Order, Invoice, Query to server via Web page interface
  Web server translates to DB
  SQL does DB work
  Net:
    easy to implement
    performance is GREAT!
  Figure labels: HTTP, IIS = Web, ODBC, SQL
TPC-C Shows How Far SMPs have come
Performance is amazing:





2,000 users is the min!
30,000 users on a 4x12 alpha cluster (Oracle)
Peak Performance: 30,390 tpmC @ $305/tpmC (Oracle/DEC)
Best Price/Perf: 8,040 tpmC @ $54/tpmC (MS SQL/Compaq)
graphs show UNIX high price & diseconomy of scaleup
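
Since TPC-C price/performance is the priced system's total cost divided by its tpmC, multiplying the two quoted figures gives a rough idea of what each configuration costs. This sketch uses only the numbers on the slide; `system_price` is an illustrative helper, not part of any TPC tooling.

```python
def system_price(tpmc, dollars_per_tpmc):
    """Total priced-system cost implied by a throughput and a $/tpmC figure."""
    return tpmc * dollars_per_tpmc

# (tpmC, $/tpmC) pairs as quoted on the slide.
RESULTS = {
    "Peak performance (Oracle/DEC)": (30_390, 305),
    "Best price/performance (MS SQL/Compaq)": (8_040, 54),
}

for label, (tpmc, price_perf) in RESULTS.items():
    print(f"{label}: about ${system_price(tpmc, price_perf):,}")
```

By this estimate the peak-performance system delivers a bit under 4x the throughput for roughly 21x the price, which is the diseconomy of scale-up the slide refers to.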
[Chart: tpmC & Price Performance (only "best" data shown for each vendor). x-axis: tpmC, 0 to 20,000; y-axis: $/tpmC, 0 to 400; series: DB2, Informix, MS SQL Server, Oracle, Sybase]
TPC C SMP Performance
• SMPs do offer speedup
but 4x P6 is better than some 18x MIPSco
[Charts: tpmC vs CPUs. Panels: SUN Scaleability; SQL Server. x-axis: CPUs, 0 to 20; y-axis: tpmC, 0 to 20,000]
What Happens To Prices?
  No expensive UNIX front end (20$/tpmC)
  No expensive TP monitor software (10$/tpmC)
  => 65$/tpmC
[Chart: TPC Price/tpmC, broken down into processor, disk, software, and net. Systems shown: Informix on SNI; Oracle on DEC Unix; Oracle on Compaq/NT; Sybase on Compaq/NT; Microsoft on Compaq with Visigenic; Microsoft on HP with Visigenic; Microsoft on Intergraph with IIS; Microsoft on Compaq with IIS]
What’s a TeraByte?

1 Terabyte:
  1,000,000,000 business letters     150 miles of book shelf
  100,000,000 book pages             15 miles of book shelf
  50,000,000 FAX images              7 miles of book shelf
  10,000,000 TV pictures (mpeg)      10 days of video
  4,000 LandSat images               16 earth images (100m)
  100,000,000 web pages              10 copies of the web HTML

Library of Congress (in ASCII) is 25 TB
  1980: $200 million of disc (10,000 discs)
        $5 million of tape silo (10,000 tapes)
  1997: 200 k$ of magnetic disc (48 discs)
        30 k$ nearline tape (20 tapes)
Terror Byte !
Building the Largest NT Node
  Build a 1 TB SQL Server database
    Show off NT and SQL Server Scaleability
    Stress test the product
  Demo it on the Internet
    WWW accessible by anyone
  So data must be
    1 TB
    Unencumbered
    Interesting to everyone everywhere
    AND not offensive to anyone anywhere
The Plan
  DEC Alpha + 324 StorageWorks Drives (1.4 TB)
    30K BTU, 8 KW, 1.5 metric tons.
  SQL 7.0
  USGS data (1 meter)
  Russian Space data (2 meter)
  Figure labels: SPIN-2; DEC 4100, 4 x 400 MHz Alpha Processors, 4 GB DRAM; Microsoft BackOffice
Image Data Sources
  DOQ: 300 GB. Src: USGS & UCSB. UCSB missing some DOQs
  Spin-2: 500 GB, WorldWide
  Figure labels: LoB App; New Data Coming
DOQ coverage of the US
  1 Meter images of many places
  Problems:
    most of data not yet published
    interesting places missing (LA, Portland, SD, Anchorage,…)
  Loaded published 130 GB.
  CRDA for unpublished 3 TB
SPIN-2 Coverage
  The rest of the world
  The US Government can’t help, but....
  The Russian Space Agency is eager to cooperate.
  2 Meter Geo Rectified imagery of anywhere
  More data coming, Earth has ~ 500 TeraMeters^2
    => ~30 Tera Bytes of Land at 2x2 Meter
    => we need 3% of the land (Urban World = the red stuff)
Demo Interface
Grow UP and OUT
  1 Terabyte DB
  1 billion transactions per day
  Cluster:
  •a collection of nodes
  •as easy to program and manage as a single node
  Figure labels: SMP super server, Departmental server, Personal system
Clusters Have Advantages
  Clients and servers made from the same stuff
  Inexpensive:
    Built with commodity components
  Fault tolerance:
    Spare modules mask failures
  Modular growth
    Grow by adding small modules
  Unlimited growth: no biggest one
Billion Transactions per Day Project







Built a 45-node Windows NT Cluster
(with help from Intel & Compaq)
> 900 disks
All off-the-shelf parts
Using SQL Server &
DTC distributed transactions
DebitCredit Transaction
Each node has 1/20th of the DB
Each node does 1/20th of the work
15% of the transactions are “distributed”
How Much Is 1 Billion Transactions Per Day?
  1 Btpd = 11,574 tps (transactions per second)
         ~ 700,000 tpm (transactions/minute)
  AT&T
    185 million calls (peak day worldwide)
  Visa ~20 M tpd
    400 M customers
    250,000 ATMs worldwide
    7 billion transactions / year (card+cheque) in 1994
[Chart: Millions of transactions per day (Mtpd), log scale from 0.1 to 1,000. Bars: NYSE, BofA, Visa, AT&T, 1 Btpd]
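
The conversion from a daily total to per-second and per-minute rates is simple enough to check directly. This sketch assumes load spread evenly over 24 hours; the function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_DAY = 24 * 60

def per_second(transactions_per_day):
    """Average transactions per second for a given daily total."""
    return transactions_per_day / SECONDS_PER_DAY

def per_minute(transactions_per_day):
    """Average transactions per minute for a given daily total."""
    return transactions_per_day / MINUTES_PER_DAY

one_btpd = 1_000_000_000
print(f"1 Btpd = {per_second(one_btpd):,.0f} tps = {per_minute(one_btpd):,.0f} tpm")
print(f"Visa at ~20 M tpd = {per_second(20_000_000):,.0f} tps")
```

The per-minute figure works out to about 694,000, which the slide rounds to 700,000 tpm.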

Billion Transactions Per Day Hardware
  45 nodes (Compaq Proliant)
  Clustered with 100 Mbps Switched Ethernet
  140 cpu, 13 GB, 3 TB.

  Type                                  nodes                      CPUs    DRAM      ctlrs   disks                        RAID space
  Workflow MTS                          20 Compaq Proliant 2500    20x 2   20x 128   20x 1   20x 1                        20x 2 GB
  SQL Server                            20 Compaq Proliant 5000    20x 4   20x 512   20x 4   20x (36x4.2GB, 7x9.1GB)      20x 130 GB
  Distributed Transaction Coordinator   5 Compaq Proliant 5000     5x 4    5x 256    5x 1    5x 3                         5x 8 GB
  TOTAL                                 45                         140     13 GB     105     895                          3 TB
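
The TOTAL row can be cross-checked from the per-tier figures. The sketch below assumes the bare DRAM numbers (128, 512, 256) are megabytes per node, which the slide does not state but which matches the 13 GB total; the names `TIERS` and `cluster_totals` are illustrative.

```python
# (tier, nodes, CPUs, DRAM in MB [assumed unit], controllers, disks, RAID GB);
# every figure after `nodes` is per node.
TIERS = [
    ("Workflow MTS", 20, 2, 128, 1, 1, 2),
    ("SQL Server", 20, 4, 512, 4, 36 + 7, 130),
    ("Distributed Transaction Coordinator", 5, 4, 256, 1, 3, 8),
]

def cluster_totals(tiers):
    """Sum per-node figures across all nodes of every tier."""
    totals = {"nodes": 0, "cpus": 0, "dram_mb": 0, "ctlrs": 0, "disks": 0, "raid_gb": 0}
    for _name, nodes, cpus, dram_mb, ctlrs, disks, raid_gb in tiers:
        totals["nodes"] += nodes
        totals["cpus"] += nodes * cpus
        totals["dram_mb"] += nodes * dram_mb
        totals["ctlrs"] += nodes * ctlrs
        totals["disks"] += nodes * disks
        totals["raid_gb"] += nodes * raid_gb
    return totals

print(cluster_totals(TIERS))
```

The sums agree with the slide: 45 nodes, 140 CPUs, 105 controllers, and 895 disks, with DRAM and RAID space rounding to the quoted 13 GB and 3 TB.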
1.2 B tpd







1 B tpd ran for 24 hrs.
Sized for 30 days
Linear growth
5 micro-dollars per transaction
Out-of-the-box software
Off-the-shelf hardware
AMAZING!
Parallelism
The OTHER aspect of clusters
  Clusters of machines allow two kinds of parallelism
    Many little jobs: online transaction processing
      TPC-A, B, C…
    A few big jobs: data search and analysis
      TPC-D, DSS, OLAP
  Both give automatic parallelism
Kinds of Parallel Execution
  Pipeline
  Partition
    outputs split N ways
    inputs merge M ways
  Figure labels: Any Sequential Program (repeated for each box)
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Data Rivers
Split + Merge Streams
  N X M Data Streams
  N producers -> River -> M Consumers
  Producers add records to the river,
  Consumers consume records from the river
  Purely sequential programming.
  River does flow control and buffering
  does partition and merge of data records
  River = Split/Merge in Gamma = Exchange operator in Volcano.
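
The river idea can be sketched with threads and bounded queues. This is a toy illustration, not the Gamma or Volcano implementation: the `River` class, its method names, and the choice of hash partitioning on an integer key are all assumptions made for the example.

```python
import queue
import threading

class River:
    """Toy N x M data river: hash-partitions records onto M bounded queues."""
    _END = object()  # end-of-stream marker

    def __init__(self, n_producers, m_consumers, capacity=16):
        self._queues = [queue.Queue(maxsize=capacity) for _ in range(m_consumers)]
        self._open_producers = n_producers
        self._lock = threading.Lock()

    def put(self, key, record):
        # A full queue blocks the producer: that is the flow control.
        self._queues[hash(key) % len(self._queues)].put(record)

    def close_producer(self):
        # The last producer to finish tells every consumer the stream is done.
        with self._lock:
            self._open_producers -= 1
            last = self._open_producers == 0
        if last:
            for q in self._queues:
                q.put(self._END)

    def records(self, consumer_id):
        """Yield this consumer's merged stream until all producers have closed."""
        while True:
            item = self._queues[consumer_id].get()
            if item is self._END:
                return
            yield item

def run_demo(n=3, m=2, per_producer=100):
    river = River(n, m)
    totals = [0] * m

    def producer(pid):  # plain sequential code
        for i in range(per_producer):
            key = pid * per_producer + i
            river.put(key, key)
        river.close_producer()

    def consumer(cid):  # plain sequential code
        totals[cid] = sum(river.records(cid))

    threads = [threading.Thread(target=producer, args=(p,)) for p in range(n)]
    threads += [threading.Thread(target=consumer, args=(c,)) for c in range(m)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals

print(run_demo())
```

Each producer and consumer is written as an ordinary sequential loop; the partitioning, merging, buffering, and back-pressure all live inside the river object.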
Partitioned Execution
Spreads computation and IO among processors
[Figure: A Table partitioned by key range (A...E, F...J, K...N, O...S, T...Z), with Count operators applied to the partitions]
Partitioned data gives
NATURAL parallelism
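
A partitioned count can be sketched as follows: split the table by key range, count each partition independently, then add up the partial counts. The key ranges are the ones on the slide; the sample names and helper functions are invented for illustration, and Python threads here only show the structure rather than real multi-processor speedup.

```python
from concurrent.futures import ThreadPoolExecutor

# Key ranges from the slide; the sample rows are made up.
RANGES = [("A", "E"), ("F", "J"), ("K", "N"), ("O", "S"), ("T", "Z")]
NAMES = ["Adams", "Baker", "Gray", "Jones", "Knuth", "Olsen", "Stone", "Turing", "Young"]

def partition(rows, ranges):
    """Place each row in the partition whose key range covers its first letter."""
    parts = [[] for _ in ranges]
    for row in rows:
        first = row[0].upper()
        for i, (lo, hi) in enumerate(ranges):
            if lo <= first <= hi:
                parts[i].append(row)
                break
    return parts

def parallel_count(parts):
    """Count each partition independently, then combine the partial counts."""
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        partial_counts = list(pool.map(len, parts))
    return sum(partial_counts), partial_counts

total, partials = parallel_count(partition(NAMES, RANGES))
print(partials, total)
```

Because no partition needs data from another, the per-partition counts can run on separate processors and disks with no coordination until the final sum.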
N x M way Parallelism
  N inputs, M outputs, no bottlenecks.
  Partitioned Data
  Partitioned and Pipelined Data Flows
[Figure: partitions A...E, F...J, K...N, O...S, T...Z feed Join and Sort operators, whose outputs flow into Merge operators]
Clusters (Plumbing)
  Single system image
    naming
    protection/security
    management/load balance
  Fault Tolerance
    Wolfpack
  Hot Pluggable hardware & Software
Windows NT clusters
  Key goals:
    Easy: to install, manage, program
    Reliable: better than a single node
    Scaleable: added parts add power
  Microsoft & 60 vendors defining NT clusters
    Almost all big hardware and software vendors involved
  No special hardware needed but it may help
  Enables
    Commodity fault-tolerance
    Commodity parallelism (data mining, virtual reality…)
    Also great for workgroups!
Initial: two-node failover
  Beta testing since December 96
  SAP, Microsoft, Oracle giving demos.
  File, print, Internet, mail, DB, other services
  Easy to manage
  Each node can be 4x (or more) SMP
Next (NT5) “Wolfpack” is modest size cluster
  About 16 nodes (so 64 to 128 CPUs)
  No hard limit, algorithms designed to go further
So, What’s New?
  When slices cost 50k$, you buy 10 or 20.
  When slices cost 5k$ you buy 100 or 200.
  Manageability, programmability, usability become key issues (total cost of ownership).
  PCs are MUCH easier to use and program
MPP Vicious Cycle
  No Customers!
  [Figure: a cycle of "New MPP & New OS" and "New App" boxes]
CP/Commodity Virtuous Cycle:
Standards allow progress and investment protection
  [Figure: a cycle linking Standard platform, Apps, and Customers]
The BIG Picture
Components and transactions
  Software modules are objects
  Object Request Broker (a.k.a., Transaction Processing Monitor) connects objects (clients to servers)
  Standard interfaces allow software plug-ins
  Transaction ties execution of a “job” into an atomic unit: all-or-nothing, durable, isolated
  Figure label: Object Request Broker
Objects Meet Databases
The basis for universal data servers, access, & integration
  Object-oriented (COM oriented) programming interface to data
  Breaks DBMS into components
  Anything can be a data source
  Optimization/navigation “on top of” other data sources
  A way to componentize a DBMS
  Makes an RDBMS an O-R DBMS (assumes optimizer understands objects)
  Figure labels: DBMS engine; data sources: Database, Spreadsheet, Photos, Mail, Map, Document
A new programming paradigm
  Develop objects on the desktop
  Better yet: download them from the Net
  Script work flows as method invocations
  All on desktop
  Then, move work flows and objects to server(s)
  Gives desktop development, three-tier deployment
  Software Cyberbricks
Transactions & Objects
  Application requests transaction identifier (XID)
  XID flows with method invocations
  Object Managers join (enlist) in transaction
  Distributed Transaction Manager coordinates commit/abort
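
The four steps above can be sketched in a few classes. The slide does not name a commit protocol; this example assumes two-phase commit, a common choice for distributed transactions. `ResourceManager` and `TransactionManager` are hypothetical stand-ins, not the Microsoft DTC API.

```python
import itertools

class ResourceManager:
    """Stand-in for an object manager (database, queue, ...) that can join a transaction."""
    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit   # lets the demo force a "no" vote
        self.state = {}                  # xid -> active | prepared | committed | aborted

    def enlist(self, xid):
        self.state[xid] = "active"

    def prepare(self, xid):
        ok = self.will_commit
        self.state[xid] = "prepared" if ok else "aborted"
        return ok

    def commit(self, xid):
        self.state[xid] = "committed"

    def abort(self, xid):
        self.state[xid] = "aborted"

class TransactionManager:
    """Toy coordinator: hands out XIDs and runs a two-phase commit."""
    def __init__(self):
        self._next_xid = itertools.count(1)
        self._enlisted = {}              # xid -> list of enlisted managers

    def begin(self):
        xid = next(self._next_xid)
        self._enlisted[xid] = []
        return xid

    def enlist(self, xid, manager):
        # Called when a method invocation carrying `xid` first reaches `manager`.
        if manager not in self._enlisted[xid]:
            self._enlisted[xid].append(manager)
            manager.enlist(xid)

    def commit(self, xid):
        managers = self._enlisted.pop(xid)
        votes = [m.prepare(xid) for m in managers]   # phase 1: collect votes
        outcome = "committed" if all(votes) else "aborted"
        for m in managers:                           # phase 2: broadcast outcome
            m.commit(xid) if outcome == "committed" else m.abort(xid)
        return outcome

tm = TransactionManager()
accounts, audit_log = ResourceManager("accounts"), ResourceManager("audit_log")
xid = tm.begin()
tm.enlist(xid, accounts)
tm.enlist(xid, audit_log)
print(tm.commit(xid), accounts.state[xid], audit_log.state[xid])
```

If any enlisted manager votes no during the prepare phase, every participant is told to abort, which is what makes the job all-or-nothing.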
Thesis: Scaleable Servers

Scaleable Servers Built from Cyberbricks
  Allow new applications
Servers should be able to
  Scale up, out, down
Key software technologies
  Clusters (tie the hardware together)
  Parallelism (uses the independent CPUs, stores, wires)
  Objects (software CyberBricks)
  Transactions (mask errors)