Scaleable Servers Jim Gray Microsoft

Scaleable Servers
Jim Gray
Microsoft
Gray@Microsoft.com
http://www.research.Microsoft.com/~Gray
Thesis: Scaleable Servers
Scaleable Servers
  Commodity hardware allows new applications
  New applications need huge servers
  Clients and servers are built of the same “stuff”
    Commodity software and
    Commodity hardware
Servers should be able to
  Scale up (grow node by adding CPUs, disks, networks)
  Scale out (grow by adding nodes)
  Scale down (can start small)
Key software technologies
  Objects, Transactions, Clusters, Parallelism
1987: 256 tps Benchmark
14 M$ computer (Tandem)
A dozen people
False floor, 2 rooms of machines
Simulate 25,600 clients
A 32 node processor array
A 40 GB disk array (80 drives)
[Diagram labels: admin expert, hardware experts, network expert, manager, performance expert, DB expert, auditor, OS expert]
1988: DB2 + CICS Mainframe
65 tps
IBM 4391
Simulated network of 800 clients
2 M$ computer
Staff of 6 to do benchmark
[Diagram labels: 2 x 3725 network controllers; refrigerator-sized CPU; 16 GB disk farm (4 x 8 x .5GB)]
1997: 10 years later
1 Person and 1 box = 1250 tps
1 Breadbox ~ 5x 1987 machine room
23 GB is hand-held
One person does all the work
Cost/tps is 1,000x less
25 micro dollars per transaction
[Diagram labels: one person as hardware expert, OS expert, net expert, DB expert, app expert; 4 x 200 MHz CPU; 1/2 GB DRAM; 12 x 4GB disk; 3 x 7 x 4GB disk arrays]
What Happened?
Moore’s law:
  Things get 4x better every 3 years
  (applies to computers, storage, and networks)
New Economics: Commodity
  class           price ($/mips)   software (k$/year)
  mainframe       10,000           100
  minicomputer    100              10
  microcomputer   10               1
GUI: Human - computer tradeoff
  optimize for people, not computers
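The Moore's law rule of thumb on this slide (4x every 3 years) can be compounded over the decade between the 1987 and 1997 benchmarks. The following is a minimal sketch; the function name `growth_factor` and its parameters are illustrative and not from the talk.

```python
def growth_factor(years: float, factor: float = 4.0, period_years: float = 3.0) -> float:
    """Compound improvement after `years`, given `factor`x every `period_years`."""
    return factor ** (years / period_years)


if __name__ == "__main__":
    # 4x every 3 years, compounded over the 10 years from 1987 to 1997.
    print(f"10 years: about {growth_factor(10):.0f}x")  # about 102x
```

Under this rule, hardware improvement alone gives roughly 100x per decade; the slide lists commodity economics as a separate factor on top of that.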
Billions Of Clients
Need Millions Of Servers
All clients networked to servers
  May be nomadic or on-demand
Fast clients want faster servers
Servers provide
  Shared Data
  Control
  Coordination
  Communication
[Diagram: clients (mobile clients, fixed clients) connected to servers (server, super server)]
Thesis
Many little beat few big
[Figure labels]
  Price: $1 million, $100 K, $10 K
  Class: Mainframe, Mini, Micro, Nano, Pico Processor (1 MM^3)
  Disc size: 14", 9", 5.25", 3.5", 2.5", 1.8"
  Storage hierarchy:
    1 MB     10 pico-second ram
    100 MB   10 nano-second ram
    10 GB    10 microsecond ram
    1 TB     10 millisecond disc
    100 TB   10 second tape archive
Smoking, hairy golf ball
  1 M SPECmarks, 1TFLOP
  10^6 clocks to bulk ram
  Event-horizon on chip
  VM reincarnated
  Multiprogram cache, On-Chip SMP
How to connect the many little parts?
How to program the many little parts?
Fault tolerance?
Future Super Server:
4T Machine
Array of 1,000 4B machines
  1 bps processors
  1 BB DRAM
  10 BB disks
  1 Bbps comm lines
  1 TB tape robot
A few megabucks
Challenge:
  Manageability
  Programmability
  Security
  Availability
  Scaleability
  Affordability
As easy as a single system
Cyber Brick: a 4B machine
  CPU
  5 GB RAM
  50 GB Disc
Future servers are CLUSTERS of processors, discs
Distributed database techniques make clusters work
The Hardware Is In Place…
And then a miracle occurs?
SNAP: scaleable network and platforms
Commodity-distributed OS built on:
  Commodity platforms
  Commodity network interconnect
Enables parallel applications
Scaleable Servers
BOTH SMP And Cluster
Grow up with SMP; 4xP6 is now standard
Grow out with cluster
Cluster has inexpensive parts
[Diagram: personal system, departmental server, SMP super server; cluster of PCs]
SMPs Have Advantages
Single system image: easier to manage, easier to program
  threads in shared memory, disk, Net
4x SMP is commodity
Software capable of 16x
Problems:
  >4 not commodity
  Scale-down problem (starter systems expensive)
  There is a BIGGEST one
[Diagram: personal system, departmental server, SMP super server]
TPC-C Web-Based Benchmarks
Client is a Web browser (9,200 of them!)
Submits
  Order
  Invoice
  Query
to server via Web page interface
Web server translates to DB
SQL does DB work
Net:
  easy to implement
  performance is GREAT!
[Diagram: browser, HTTP, IIS (= Web), ODBC, SQL]
TPC-C Shows How Far SMPs have come
Performance is amazing:





2,000 users is the min!
30,000 users on a 4x12 alpha cluster (Oracle)
Peak Performance: 30,390 tpmC @ $305/tpmC (Oracle/DEC)
Best Price/Perf: 7,693 tpmC @ $43/tpmC (MS SQL/Dell)
graphs show UNIX high price & diseconomy of scaleup
[Chart: tpmC & Price Performance (only "best" data shown for each vendor); x-axis: tpmC, 0 to 20,000; y-axis: $/tpmC, 0 to 400; series: DB2, Informix, MS SQL Server, Oracle, Sybase]
TPC C SMP Performance
• SMPs do offer speedup
but 4x P6 is better than some 18x MIPSco
[Charts: "tpmC vs CPUs" and "SUN Scaleability"; x-axis: CPUs, 0 to 20; y-axis: tpmC, 0 to 20,000; series include SUN and SQL Server]
What Happens To Prices?

No expensive UNIX front end
(20$/tpmC)
No expensive TP monitor software (10$/tpmC)

=> 65$/tpmC

[Chart: TPC Price/tpmC, broken down into processor, disk, software, and net, for:
  Informix on SNI
  Oracle on DEC Unix
  Oracle on Compaq/NT
  Sybase on Compaq/NT
  Microsoft on Compaq with Visigenic
  Microsoft on HP with Visigenic
  Microsoft on Intergraph with IIS
  Microsoft on Compaq with IIS]
Building the Largest NT Node
Build a 1 TB SQL Server database
  Show off NT and SQL Server Scaleability
  Stress test the product
Demo it on the Internet
  WWW accessible by anyone
So data must be
  1 TB
  Unencumbered
  Interesting to everyone everywhere
  AND not offensive to anyone anywhere
What’s a TeraByte?
1 Terabyte:
  1,000,000,000 business letters    150 miles of book shelf
  100,000,000 book pages            15 miles of book shelf
  50,000,000 FAX images             7 miles of book shelf
  10,000,000 TV pictures (mpeg)     10 days of video
  4,000 LandSat images              16 earth images (100m)
  100,000,000 web pages             10 copies of the web HTML
Library of Congress (in ASCII) is 25 TB
1980: $200 million of disc          10,000 discs
      $5 million of tape silo       10,000 tapes
1997: 200 k$ of magnetic disc       48 discs
      30 k$ nearline tape           20 tapes
Terror Byte !
The Plan
DEC Alpha + 324 StorageWorks Drives (1.4 TB)
  30K BTU, 8 KW, 1.5 metric tons
DEC 4100
  4 x 400 MHz Alpha Processors
  4 GB DRAM
SQL 7.0
Microsoft BackOffice
Data:
  USGS data (1 meter)
  Russian Space data (2 meter), SPIN-2
Image Data Sources
DOQ: 300 GB (Src: USGS & UCSB)
  UCSB missing some DOQs
SPIN-2: 500 GB, WorldWide
  New Data Coming
[Diagram label: LoB App]
DOQ coverage of the US
1 Meter images of many places
Problems:
  most of data not yet published
  interesting places missing (LA, Portland, SD, Anchorage,…)
Loaded published 130 GB.
CRDA for unpublished 3 TB
SPIN-2 Coverage
The rest of the world
The US Government can’t help, but....
The Russian Space Agency is eager to cooperate.
2 Meter Geo Rectified imagery of anywhere
More data coming, Earth has ~ 500 TeraMeters^2
  => ~30 Tera Bytes of Land at 2x2 Meter
  => we need 3% of the land (Urban World = the red stuff)
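The sizing on this slide can be roughly reproduced with back-of-envelope arithmetic. The sketch below is only an estimate under stated assumptions: the land fraction (about 29% of the Earth's surface) and one byte per 2 x 2 meter pixel are assumptions, not figures from the talk, and the names are illustrative.

```python
EARTH_SURFACE_M2 = 500e12   # from the slide: ~500 tera square meters
LAND_FRACTION = 0.29        # assumption: roughly 29% of the surface is land
BYTES_PER_PIXEL = 1         # assumption: one uncompressed byte per pixel


def land_bytes(pixel_side_m: float) -> float:
    """Bytes needed to image all land at the given pixel size."""
    pixels = EARTH_SURFACE_M2 * LAND_FRACTION / pixel_side_m ** 2
    return pixels * BYTES_PER_PIXEL


def urban_bytes(total_land_bytes: float, urban_fraction: float = 0.03) -> float:
    """Bytes needed for the urban share of the land (3% on the slide)."""
    return total_land_bytes * urban_fraction


if __name__ == "__main__":
    print(f"all land at 2 m: {land_bytes(2) / 1e12:.0f} TB")          # same order as the slide's ~30 TB
    print(f"3% of 30 TB:     {urban_bytes(30e12) / 1e12:.1f} TB")     # close to the 1 TB target
```

With these assumptions the land estimate comes out somewhat above the slide's ~30 TB, and 3% of 30 TB is about 0.9 TB, which matches the 1 TB database goal.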
Demo Interface
Grow UP and OUT
1 Terabyte DB
1 billion transactions per day
Cluster:
•a collection of nodes
•as easy to program and manage as a single node
[Diagram: personal system, departmental server, SMP super server]
Clusters Have Advantages
Clients and servers made from the same stuff
Inexpensive:
  Built with commodity components
Fault tolerance:
  Spare modules mask failures
Modular growth
  Grow by adding small modules
Unlimited growth: no biggest one
Billion Transactions per Day Project
Built a 45-node Windows NT Cluster (with help from Intel & Compaq)
  > 900 disks
All off-the-shelf parts
Using SQL Server & DTC distributed transactions
DebitCredit Transaction
Each node has 1/20th of the DB
Each node does 1/20th of the work
15% of the transactions are “distributed”
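The partitioning described on this slide (20 database nodes, each holding 1/20th of the data, with 15% of transactions touching a second node) can be illustrated with a small simulation. This is a hypothetical sketch, not the project's actual workload generator: partitioning by branch number, the names `home_node` and `simulate`, and the branch count are all assumptions made for illustration.

```python
import random

NODES = 20               # from the slide: each node has 1/20th of the DB
REMOTE_FRACTION = 0.15   # from the slide: 15% of transactions are distributed


def home_node(branch_id: int) -> int:
    """Assumed partitioning rule: a branch lives on node (branch_id mod 20)."""
    return branch_id % NODES


def make_transaction(rng: random.Random, branches: int) -> tuple:
    """Return (teller_branch, account_branch) for one DebitCredit transaction."""
    teller_branch = rng.randrange(branches)
    account_branch = teller_branch
    if rng.random() < REMOTE_FRACTION:
        # Pick an account whose branch lives on a different node.
        while home_node(account_branch) == home_node(teller_branch):
            account_branch = rng.randrange(branches)
    return teller_branch, account_branch


def simulate(n: int, branches: int = 1000, seed: int = 1) -> tuple:
    """Return (transactions per node, fraction that spans two nodes)."""
    rng = random.Random(seed)
    per_node = [0] * NODES
    distributed = 0
    for _ in range(n):
        teller, account = make_transaction(rng, branches)
        per_node[home_node(teller)] += 1
        if home_node(teller) != home_node(account):
            distributed += 1   # would need a two-node (distributed) commit
    return per_node, distributed / n


if __name__ == "__main__":
    per_node, fraction = simulate(100_000)
    print(min(per_node), max(per_node), round(fraction, 3))
```

Each node ends up with close to 1/20th of the work, and about 15% of the transactions need coordination across two nodes.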
How Much Is 1 Billion Transactions Per Day?
1 Btpd = 11,574 tps (transactions per second)
~ 700,000 tpm (transactions/minute)
AT&T
  185 million calls (peak day worldwide)
Visa ~20 M tpd
  400 M customers
  250,000 ATMs worldwide
  7 billion transactions / year (card+cheque) in 1994
[Chart: millions of transactions per day (Mtpd), log scale from 0.1 to 1,000, for NYSE, BofA, Visa, AT&T, and 1 Btpd]
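The rate conversions on this slide are simple unit arithmetic. A minimal sketch follows; the helper names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
MINUTES_PER_DAY = 24 * 60        # 1,440


def per_second(per_day: float) -> float:
    """Average transactions per second for a daily total."""
    return per_day / SECONDS_PER_DAY


def per_minute(per_day: float) -> float:
    """Average transactions per minute for a daily total."""
    return per_day / MINUTES_PER_DAY


if __name__ == "__main__":
    btpd = 1_000_000_000
    print(round(per_second(btpd)))          # 11574
    print(round(per_minute(btpd)))          # 694444, i.e. ~700,000 tpm
    print(round(per_second(20_000_000)))    # Visa at ~20 M tpd: 231
    print(round(per_second(185_000_000)))   # AT&T peak day: 2141
```

These are daily averages; peak rates within a day would be higher.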

Billion Transactions Per Day Hardware



45 nodes (Compaq Proliant)
Clustered with 100 Mbps Switched Ethernet
140 cpu, 13 GB, 3 TB.
Type                                  nodes                     CPUs    DRAM (MB)  ctlrs   disks                     RAID space
Workflow MTS                          20 Compaq Proliant 2500   20x 2   20x 128    20x 1   20x 1                     20x 2 GB
SQL Server                            20 Compaq Proliant 5000   20x 4   20x 512    20x 4   20x (36x4.2GB, 7x9.1GB)   20x 130 GB
Distributed Transaction Coordinator   5 Compaq Proliant 5000    5x 4    5x 256     5x 1    5x 3                      5x 8 GB
TOTAL                                 45                        140     13 GB      105     895                       3 TB
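The totals in the hardware table can be checked by multiplying each per-node figure by its node count. The sketch below assumes the per-node values pair with the rows as laid out in the slide (20 workflow nodes, 20 SQL Server nodes, 5 DTC nodes) and that the DRAM figures are in megabytes; the `Tier` type and field names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    nodes: int
    cpus: int          # per node
    dram_mb: int       # per node; MB is an assumption
    controllers: int   # per node
    disks: int         # per node
    raid_gb: int       # per node


TIERS = [
    Tier("Workflow MTS", 20, 2, 128, 1, 1, 2),
    Tier("SQL Server", 20, 4, 512, 4, 36 + 7, 130),
    Tier("Distributed Transaction Coordinator", 5, 4, 256, 1, 3, 8),
]


def total(field: str) -> int:
    """Sum a per-node field over every node in every tier."""
    return sum(t.nodes * getattr(t, field) for t in TIERS)


if __name__ == "__main__":
    print(sum(t.nodes for t in TIERS))   # 45 nodes
    print(total("cpus"))                 # 140
    print(total("dram_mb") / 1024)       # 13.75, shown as 13 GB
    print(total("controllers"))          # 105
    print(total("disks"))                # 895
    print(total("raid_gb"))              # 2680 GB, shown rounded as 3 TB
```

The CPU, controller, and disk totals match the slide exactly; the DRAM and RAID totals match after rounding.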
1.2 B tpd







1 B tpd ran for 24 hrs.
Sized for 30 days
Linear growth
5 micro-dollars per transaction
Out-of-the-box software
Off-the-shelf hardware
AMAZING!
Other Stunts
100 M Web Hits/day on one server
  (=1,300 hits/sec, Web Mark HTML server)
Email server (Exchange)
  50 GB database (up from 16GB, limit now 16TB)
  50 k POP3 users (1.5 M msg/day)
64-bit addressing SQL Server
SAP Failover
Theme: conventional stuff is easy
Parallelism
The OTHER aspect of clusters
Clusters of machines allow two kinds of parallelism
  Many little jobs: online transaction processing
    TPC-A, B, C…
  A few big jobs: data search and analysis
    TPC-D, DSS, OLAP
Both give automatic parallelism
Kinds of Parallel Execution
Pipeline
  [Diagram: Any Sequential Program feeding Any Sequential Program]
Partition
  outputs split N ways
  inputs merge M ways
  [Diagram: parallel copies of Any Sequential Program]
Data Rivers
Split + Merge Streams
N X M Data Streams
N producers, M consumers
Producers add records to the river,
Consumers consume records from the river
Purely sequential programming.
River does flow control and buffering,
and does partition and merge of data records
River = Split/Merge in Gamma = Exchange operator in Volcano.
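The river idea can be sketched with threads and bounded queues: producers and consumers are each plain sequential loops, and the river handles partitioning, merging, buffering, and flow control. This is an illustrative toy, not the Gamma or Volcano implementation; the `River` class, its methods, and `run_river` are hypothetical names.

```python
import queue
import threading


class River:
    """N producers add records; each of M consumers reads one partition."""

    _DONE = object()  # end-of-stream marker

    def __init__(self, n_producers, m_consumers, partition, capacity=64):
        # One bounded queue per consumer: put() blocks when a queue is full,
        # which gives buffering and flow control.
        self._queues = [queue.Queue(maxsize=capacity) for _ in range(m_consumers)]
        self._partition = partition
        self._open_producers = n_producers
        self._lock = threading.Lock()

    def put(self, record):
        """Route a record to the consumer chosen by the partition function."""
        index = self._partition(record) % len(self._queues)
        self._queues[index].put(record)

    def close_producer(self):
        """Called once by each producer; the last one ends every stream."""
        with self._lock:
            self._open_producers -= 1
            last = self._open_producers == 0
        if last:
            for q in self._queues:
                q.put(self._DONE)

    def consume(self, index):
        """Yield the merged records for one consumer until all producers close."""
        while True:
            record = self._queues[index].get()
            if record is self._DONE:
                return
            yield record


def run_river(n_producers=3, m_consumers=2, records_per_producer=100):
    river = River(n_producers, m_consumers, partition=lambda r: r)
    results = [[] for _ in range(m_consumers)]

    def producer(p):   # purely sequential code
        for k in range(records_per_producer):
            river.put(p * records_per_producer + k)
        river.close_producer()

    def consumer(c):   # purely sequential code
        for record in river.consume(c):
            results[c].append(record)

    threads = [threading.Thread(target=producer, args=(p,)) for p in range(n_producers)]
    threads += [threading.Thread(target=consumer, args=(c,)) for c in range(m_consumers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    even, odd = run_river()
    print(len(even), len(odd))   # 150 150
```

Neither the producer nor the consumer code mentions threads, partitions, or buffers; all of that lives in the river, which is the point of the slide.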
Partitioned Execution
Spreads computation and IO among processors
[Diagram: a table partitioned by key range (A...E, F...J, K...N, O...S, T...Z); a Count runs on each partition and a final Count combines the results]
Partitioned data gives NATURAL parallelism
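The partitioned count in the diagram can be sketched as follows: split the table by the key ranges on the slide, count each partition independently, then combine the partial counts. The function names are illustrative, and threads are used only to show the structure (in CPython a process pool or separate machines would be needed for real CPU parallelism).

```python
from concurrent.futures import ThreadPoolExecutor

# Key ranges from the slide.
RANGES = [("A", "E"), ("F", "J"), ("K", "N"), ("O", "S"), ("T", "Z")]


def partition_of(name: str) -> int:
    """Index of the key range that holds this name."""
    first = name[0].upper()
    for index, (low, high) in enumerate(RANGES):
        if low <= first <= high:
            return index
    raise ValueError(f"no partition for {name!r}")


def partition_table(rows):
    """Split the table into one list of rows per key range."""
    parts = [[] for _ in RANGES]
    for row in rows:
        parts[partition_of(row)].append(row)
    return parts


def parallel_count(parts) -> int:
    """Count each partition independently, then sum the partial counts."""
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        return sum(pool.map(len, parts))


if __name__ == "__main__":
    names = ["Adams", "Baker", "Gray", "Knuth", "Stone", "Turing", "Zhou"]
    parts = partition_table(names)
    print([len(p) for p in parts])   # [2, 1, 1, 1, 2]
    print(parallel_count(parts))     # 7
```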
N x M way Parallelism
[Diagram: partitioned data (A...E, F...J, K...N, O...S, T...Z) feeds five Join operators, then five Sort operators, then three Merge operators]
N inputs, M outputs, no bottlenecks.
Partitioned and Pipelined Data Flows
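A simplified version of the N x M flow in the diagram (leaving out the join stage) is a partitioned sort: N inputs are sorted independently, each sorted run is split by key range into M streams, and each of the M outputs merges its N incoming streams. The sketch below runs the stages sequentially for clarity; the function names and the `boundaries` parameter are illustrative.

```python
import heapq
from bisect import bisect_right


def split_run(run, boundaries):
    """Split one sorted run into len(boundaries) + 1 sorted streams by key range."""
    streams = [[] for _ in range(len(boundaries) + 1)]
    for key in run:
        streams[bisect_right(boundaries, key)].append(key)
    return streams


def n_by_m_sort(parts, boundaries):
    """Sort N input partitions and deliver M range-partitioned sorted outputs."""
    runs = [sorted(p) for p in parts]                          # N independent sorts
    pieces = [split_run(run, boundaries) for run in runs]      # N x M streams
    m = len(boundaries) + 1
    return [
        list(heapq.merge(*(pieces[n][j] for n in range(len(parts)))))  # M merges
        for j in range(m)
    ]


if __name__ == "__main__":
    inputs = [[9, 25, 3], [14, 1, 30], [20, 10, 7]]   # N = 3
    outputs = n_by_m_sort(inputs, boundaries=[10, 20])  # M = 3
    print(outputs)   # [[1, 3, 7, 9], [10, 14], [20, 25, 30]]
```

Every sort and every merge works on its own data, so each stage could run on a separate processor with no single point that all records must pass through.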
Clusters (Plumbing)
Single system image
  naming
  protection/security
  management/load balance
Fault Tolerance
  Wolfpack
Hot Pluggable hardware & Software
Windows NT clusters
Key goals:
  Easy: to install, manage, program
  Reliable: better than a single node
  Scaleable: added parts add power
Microsoft & 60 vendors defining NT clusters
  Almost all big hardware and software vendors involved
No special hardware needed, but it may help
Enables
  Commodity fault-tolerance
  Commodity parallelism (data mining, virtual reality…)
  Also great for workgroups!
Initial: two-node failover
  Beta testing since December 96
  SAP, Microsoft, Oracle giving demos.
  File, print, Internet, mail, DB, other services
  Easy to manage
  Each node can be 4x (or more) SMP
Next (NT5) “Wolfpack” is a modest-size cluster
  About 16 nodes (so 64 to 128 CPUs)
  No hard limit, algorithms designed to go further
So, What’s New?




When slices cost 50k$, you buy 10 or 20.
When slices cost 5k$ you buy 100 or 200.
Manageability, programmability, usability become key issues (total cost of ownership).
PCs are MUCH easier to use and program
MPP Vicious Cycle
No Customers!
[Diagram: New MPP & New OS, New App (repeated for each generation)]
CP/Commodity Virtuous Cycle:
Standards allow progress and investment protection
[Diagram: Standard platform, Customers, Apps]
Objects Meet Databases
The basis for universal data servers, access, & integration
  Object-oriented (COM oriented) programming interface to data
  Breaks DBMS into components
  Anything can be a data source
  Optimization/navigation “on top of” other data sources
  A way to componentize a DBMS
  Makes an RDBMS an O-R DBMS (assumes optimizer understands objects)
[Diagram: DBMS engine over data sources: Database, Spreadsheet, Photos, Mail, Map, Document]
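The idea that anything can be a data source, with query logic layered on top, can be illustrated with a small sketch. This is not the COM interface the slide refers to; `DataSource`, `ListSource`, `CsvTextSource`, and `select` are hypothetical names chosen for illustration.

```python
import csv
import io
from typing import Callable, Dict, Iterable, Iterator, List, Protocol

Row = Dict[str, str]


class DataSource(Protocol):
    """Anything that can hand back rows can act as a data source."""

    def rows(self) -> Iterator[Row]: ...


class ListSource:
    """An in-memory source, standing in for a mail store or a table."""

    def __init__(self, rows: Iterable[Row]):
        self._rows = list(rows)

    def rows(self) -> Iterator[Row]:
        return iter(self._rows)


class CsvTextSource:
    """A spreadsheet-like source backed by CSV text."""

    def __init__(self, text: str):
        self._text = text

    def rows(self) -> Iterator[Row]:
        return iter(csv.DictReader(io.StringIO(self._text)))


def select(source: DataSource, predicate: Callable[[Row], bool],
           columns: List[str]) -> Iterator[Row]:
    """Generic filter and projection that works on top of any source."""
    for row in source.rows():
        if predicate(row):
            yield {c: row[c] for c in columns}


if __name__ == "__main__":
    mail = ListSource([{"sender": "gray", "subject": "TPC"},
                       {"sender": "codd", "subject": "relations"}])
    sheet = CsvTextSource("sender,subject\ngray,clusters\n")
    for source in (mail, sheet):
        print(list(select(source, lambda r: r["sender"] == "gray", ["subject"])))
```

The same `select` runs unchanged over both sources, which is the sense in which query processing sits on top of other data sources.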