BARC Report 5/30/96 - Microsoft Research

Scaleable WindowsNT?
• Jim Gray
Microsoft Research
Gray@Microsoft.com
http://research.Microsoft.com/~Gray
1
Outline
• What is Scalability?
• Why does Microsoft care about ScaleUp
• Current ScaleUp Status?
• NT5 & SQL7 & Exchange
2
Scale Up and Scale Out
• Grow Up with SMP: 4xP6 is now standard
• Grow Out with Cluster: cluster has inexpensive parts
[Figure: Personal System, Departmental Server, SMP Super Server (scale up); Cluster of PCs (scale out)]
Billions Of Clients
• Every device will be “intelligent”
• Doors, rooms, cars…
• Computing will be ubiquitous
Billions Of Clients Need Millions Of Servers
• All clients networked to servers
» May be nomadic or on-demand
• Fast clients want faster servers
• Servers provide
» Shared Data
» Control
» Coordination
» Communication
[Figure: Clients (mobile clients, fixed clients) connected to Servers (server, super server)]
Thesis
Many little beat few big
[Figure labels: price axis $1 million, $100 K, $10 K; machine classes Mainframe, Mini, Micro, Nano, Pico Processor (1 MM 3); disk form factors 14", 9", 5.25", 3.5", 2.5", 1.8"]
Pico Processor storage hierarchy:
1 MB     10 pico-second ram
100 MB   10 nano-second ram
10 GB    10 microsecond ram
1 TB     10 millisecond disc
100 TB   10 second tape archive
Smoking, hairy golf ball
• How to connect the many little parts?
• How to program the many little parts?
• Fault tolerance?
• 1 M SPECmarks, 1 TFLOP
• 10^6 clocks to bulk ram
• Event-horizon on chip
• VM reincarnated
• Multiprogram cache, On-Chip SMP
Outline
• What is Scalability
• Why does Microsoft care about ScaleUp
• Current ScaleUp Status?
• NT5 & SQL7 & Exchange
7
Scalability
• Scale up: to large SMP nodes
• Scale out: to clusters of SMP nodes
[Figure labels: 1 billion transactions; 100 million web hits; 4 terabytes of data; 1.8 million mail messages]
8
“Commercial” NT Clusters
• 16-node Tandem Cluster
» 64 cpus
» 2 TB of disk
» Decision support
• 45-node Compaq Cluster
» 140 cpus
» 14 GB DRAM
» 4 TB RAID disk
» OLTP (Debit Credit)
• 1 B tpd (14 k tps)
9
Tandem Oracle/NT
• 27,383 tpmC
• 71.50 $/tpmC
• 4 x 6 cpus
• 384 disks = 2.7 TB
10
24 cpu, 384 disks (=2.7TB)
11
Billion Transactions per Day Project
• Built a 45-node Windows NT Cluster (with help from Intel & Compaq), > 900 disks
• All off-the-shelf parts
• Using SQL Server & DTC distributed transactions
• DebitCredit Transaction
• Each node has 1/20th of the DB
• Each node does 1/20th of the work
• 15% of the transactions are “distributed”
Billion Transactions Per Day Hardware
• 45 nodes (Compaq Proliant)
• Clustered with 100 Mbps Switched Ethernet
• 140 cpu, 13 GB, 3 TB.

Type                                 nodes                     CPUs   DRAM     ctlrs   disks                      RAID space
Workflow MTS                         20 Compaq Proliant 2500   20x2   20x128   20x1    20x1                       20x2 GB
SQL Server                           20 Compaq Proliant 5000   20x4   20x512   20x4    20x (36x4.2GB, 7x9.1GB)    20x130 GB
Distributed Transaction Coordinator  5 Compaq Proliant 5000    5x4    5x256    5x1     5x3                        5x8 GB
TOTAL                                45                        140    13 GB    105     895                        3 TB
13
How Much Is 1 Billion Tpd?
[Chart: Mtpd (Millions of Transactions Per Day), on linear (0 to 1,000) and log (0.1 to 1,000) scales, for 1 Btpd, Visa, ATT, BofA, NYSE]
• 1 billion tpd = 11,574 tps ~ 700,000 tpm (transactions/minute)
• ATT
» 185 million calls per peak day (worldwide)
• Visa ~20 million tpd
» 400 million customers
» 250K ATMs worldwide
» 7 billion transactions (card+cheque) in 1994
• New York Stock Exchange
» 600,000 tpd
• Bank of America
» 20 million tpd checks cleared (more than any other bank)
» 1.4 million tpd ATM transactions
• Worldwide Airlines Reservations: 250 Mtpd
14
Infinite, Ubiquitous Scaling
Redefining the rules
            Per Sec   Per Min   Per Day
10K TPC     166       10,000    14,400,000
1 BTPD      11,574    694,444   1,000,000,000
1.4 BTPD    16,204    972,222   1,400,000,000
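The per-second and per-minute figures follow from the per-day totals by dividing by 86,400 and 1,440. A minimal sketch of that arithmetic, assuming a uniform load over a 24-hour day; the function name `per_day_to_rates` is hypothetical, and it rounds to the nearest integer, whereas the slide's 166 for the 10K TPC row appears to be truncated from 166.67:

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
MINUTES_PER_DAY = 24 * 60        # 1,440


def per_day_to_rates(tx_per_day):
    """Convert a daily transaction count to (per-second, per-minute) rates.

    Assumes the load is spread evenly over the day; both rates are
    rounded to the nearest whole transaction.
    """
    per_sec = round(tx_per_day / SECONDS_PER_DAY)
    per_min = round(tx_per_day / MINUTES_PER_DAY)
    return per_sec, per_min


if __name__ == "__main__":
    for label, per_day in [
        ("10K TPC", 14_400_000),
        ("1 BTPD", 1_000_000_000),
        ("1.4 BTPD", 1_400_000_000),
    ]:
        per_sec, per_min = per_day_to_rates(per_day)
        print(f"{label:9s} {per_sec:>7,} {per_min:>9,} {per_day:>15,}")
```

Real workloads peak well above the daily average, so these are average rates, not peak requirements.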
[Figure: IIS, MTS, and COM / ActiveX in front of six SQL servers. All Shipping Products!]
15
Microsoft.com: ~150x4 nodes
[Figure: microsoft.com site topology. Sites: Building 11 (staging servers, internal WWW, FTP servers, live SQL servers, SQL consolidators, SQL reporting, DMZ and IDC staging servers), MOSWest, Japan Data Center, European Data Center. Server groups: www.microsoft.com, home.microsoft.com, premium.microsoft.com, register.microsoft.com, register.msn.com, search.microsoft.com, support.microsoft.com, activex.microsoft.com, cdm.microsoft.com, msid.msn.com, FTP.microsoft.com, FTP and HTTP download servers, SQL servers. Typical configurations: 4xP6 or 4xP5, 256 RAM to 1 GB RAM, 12 GB to 160 GB HD; average cost $25K to $83K. Network: FDDI rings (MIS1 to MIS4), switched Ethernet, feeder LAN, admin LAN, primary and secondary Gigaswitch, routers; Internet links: 13 DS3 (45 Mb/Sec each), 2 OC3, 2 Ethernet (100 Mb/Sec each).]
16
NCSA Super Cluster
http://access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html
• National Center for Supercomputing Applications, University of Illinois @ Urbana
• 512 Pentium II cpus, 2,096 disks, SAN
• Compaq + HP + Myricom + WindowsNT
• A Super Computer for 3M$
• Classic Fortran/MPI programming
• DCOM programming model
17
TPC C Improved Fast (250%/year!)
• 40% hardware, 100% software, 100% PC Technology
[Charts: $/tpmC vs time (log scale, $10 to $1,000) and tpmC vs time (log scale, 100 to 100,000), Jan-93 to Jul-98; each annotated "250 %/year improvement!"]
18
Windows NT Versus UNIX
[Chart: tpmC vs Time, Jan-95 to Jan-97, 0 to 35,000 tpmC; series: NT, Unix]
19
Economy Of Scale
[Chart: Transactions/k$ By Vendor; tpmC/k$ (0.0 to 25.0) vs tpmC (0 to 40,000); series: Microsoft/NT, Oracle/Unix, Sybase/Unix, Informix/Unix, DB2/Unix]
20
Microsoft TerraServer: Scaleup to Big Databases
• Build a 1 TB SQL Server database
• Data must be
» 1 TB
» Unencumbered
» Interesting to everyone everywhere
» And not offensive to anyone anywhere
• Loaded
» 1.5 M place names from Encarta World Atlas
» 3 M Sq Km from USGS (1 meter resolution)
» 1 M Sq Km from Russian Space agency (2 m)
• On the web (world’s largest atlas)
• Sell images with commerce server.
21
Microsoft TerraServer Background
• Earth is 500 Tera-meters square
» USA is 10 tm2
• 100 TM2 land in 70ºN to 70ºS
• We have pictures of 6% of it
» 3 tsm from USGS
» 2 tsm from Russian Space Agency
• Compress 5:1 (JPEG) to 1.5 TB.
• Slice into 10 KB chunks
• Store chunks in DB
• Navigate with
» Encarta™ Atlas
• globe
• gazetteer
» StreetsPlus™ in the USA
• Someday
» multi-spectral image
» of everywhere
» once a day / hour
[Figure: image pyramid: 1.8x1.2 km2 tile; 10x15 km2 thumbnail; 20x30 km2 browse image; 40x60 km2 jump image]
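The slide's "compress 5:1 to 1.5 TB, slice into 10 KB chunks" implies a row count for the image table. A rough sketch of that arithmetic, assuming decimal units (1 TB = 10^12 bytes, 1 KB = 10^3 bytes) and that every chunk is exactly 10 KB; the names below are hypothetical and real tiles would vary in size:

```python
TB = 10**12  # decimal terabyte (assumption; the slide does not say)
KB = 10**3   # decimal kilobyte (assumption)


def chunk_count(total_bytes, chunk_bytes):
    """Number of fixed-size chunks needed to hold total_bytes (ceiling division)."""
    return -(-total_bytes // chunk_bytes)


compressed_bytes = 15 * TB // 10          # 1.5 TB after 5:1 JPEG compression
raw_bytes = compressed_bytes * 5          # implied uncompressed size: 7.5 TB
tiles = chunk_count(compressed_bytes, 10 * KB)

if __name__ == "__main__":
    print(f"implied raw imagery: {raw_bytes / TB:.1f} TB")
    print(f"10 KB chunks to store: {tiles:,}")
```

Under these assumptions the database holds on the order of 150 million image rows.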
22
Demo
• navigate by coverage map to White House
• Download image
• buy imagery from USGS
• navigate by name to Venice
• buy SPIN2 image & Kodak photo
• Pop out to Expedia street map of Venice
• Mention that DB will double in next 18
months (2x USGS, 2X SPIN2)
23
The Microsoft TerraServer Hardware
• Compaq AlphaServer 8400
• 8 x 400 MHz Alpha cpus
• 10 GB DRAM
• 324 9.2 GB StorageWorks Disks
» 3 TB raw, 2.4 TB of RAID5
• STK 9710 tape robot (4 TB)
• WindowsNT 4 EE, SQL Server 7.0
24
Software
[Figure labels: Web Client (browser, HTML, Java Viewer); The Internet; TerraServer Web Site (Internet Information Server 4.0, Active Server Pages, MTS, Image Server); TerraServer DB (SQL Server 7, Terra-Server Stored Procedures); Automap Server (Internet Info Server 4.0, Microsoft Automap ActiveX Server); Image Provider Site(s) (Internet Information Server 4.0, Microsoft Site Server EE, Image Delivery Application, SQL Server 7)]
25
Image Delivery and Load
• Incremental load of 4 more TB in next 18 months
[Figure labels: DLT Tape; “tar”; NT; \Drop’N’; \Images; DoJob; LoadMgr; Wait 4 Load; Backup; DB; ImgCutter. Hardware labels: two AlphaServer 4100s; 60 4.3 GB drives; 100mbit EtherSwitch; AlphaServer 8400; Enterprise Storage Array (ESA) with 3 x 108 9.1 GB drives; STK DLT Tape Library]
Numbered steps shown in the figure:
10: ImgCutter
20: Partition
30: ThumbImg
40: BrowseImg
45: JumpImg
50: TileImg
55: Meta Data
60: Tile Meta
70: Img Meta
80: Update Place
26
TerraServer: A Real “World” Example
Traffic over 71 days (Average and Peak are per day):
            Total     Average   Peak
Hits        728.45m   10.26m    29.27m
Queries     565.09m   7.96m     17.76m
Images      212.02m   2.99m     9.23m
PageViews   376.29m   5.30m     9.20m
• Largest DB on the Web
• 1.3TB
• 99.95% uptime since July 1
• No downtime, period, in August
• 70% of downtime for SQL software upgrades
27
NT Clusters (Wolfpack)
• Scale DOWN to PDA: WindowsCE
• Scale UP an SMP: TerraServer
• Scale OUT with a cluster of machines
• Single-system image
» Naming
» Protection/security
» Management/load balance
• Fault tolerance
»“Wolfpack”
• Hot pluggable hardware & software
28
Symmetric Virtual Server Failover Example
[Figure labels: Browser; Server 1 and Server 2, each running a Web site and a Database; Web site files; Database files]
29
Windows NT 5
(scalability features)
• Better SMP support
• Clusters:
» 16x packs (fault tolerant clusters)
» 100x mobs: arrays for manageability
» SAN/VIA support
• 64 bit addressing for data
» Apps like SQL, Oracle, will use it for data
» 64 bit API to NT comes later (in lab now).
• Remote management (scripting and DCOM)
• Active Directory
• Veritas volume manager
• Many 3rd party HSMs
• Batch support
30
Microsoft SQL Server 7.0
• Fixes the famous performance bugs
» dynamic record locking
» online backup, quick recovery….
• 64 bit addressing buffer pool
• SMP parallelism and better SMP support
• Built in OLAP (cubes and MOLAP)
• Scale down to Win9x
• Improved management interfaces
• Data transform services (for warehouses)
31
Outline
• What is Scalability
• Why does Microsoft care about ScaleUp
• Current ScaleUp Status?
• NT5 & SQL7
32
end
Other slides would be interesting,
but...
33
Interesting “other slides”
No time for them but...
• How much information is there?
• IO bandwidth in the Intel world
• Intelligent disks
• SAN/VIA
• NT Cluster Sort
34
Some Tera-Byte Databases
[Scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta]
• The Web: 1 TB of HTML
• TerraServer 1 TB of images
• Several other 1 TB (file) servers
• Hotmail: 7 TB of email
• Sloan Digital Sky Survey: 40 TB raw, 2 TB cooked
• EOS/DIS (picture of planet each week)
» 15 PB by 2007
• Federal Clearing house: images of checks
» 15 PB by 2006 (7 year history)
• Nuclear Stockpile Stewardship Program
» 10 Exabytes (???!!)
35
Info Capture
• You can record everything you see or hear or read.
• What would you do with it?
• How would you organize & analyze it?
Video: 8 PB per lifetime (10GBph)
Audio: 30 TB (10KBps)
Read or write: 8 GB (words)
See: http://www.lesk.com/mlesk/ksg97/ksg.html
[Scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta, with examples: A letter, A novel, A Movie, Library of Congress (text), LoC (image), All Disks, All Tapes]
36
Michael Lesk’s Points
www.lesk.com/mlesk/ksg97/ksg.html
• Soon everything can be recorded and kept
• Most data will never be seen by humans
• Precious Resource: Human attention
• Auto-Summarization and Auto-Search will be key enabling technologies.
37
PAP (peak advertised Performance) vs RAP (real application performance)
• Goal: RAP = PAP / 2 (the half-power point)
[Figure: data path Disk, SCSI, PCI, System Bus, File System Buffers, Application Data
PAP: System Bus 422 MBps; PCI 133 MBps; SCSI 40 MBps; Disk 10-15 MBps
RAP: 7.2 MB/s at each stage]
38
PAP vs RAP
• Reads are easy, writes are hard
• Async write can match WCE.
[Figure: data path Disks, SCSI, PCI, File System, Application Data
PAP / RAP: System Bus 422 MBps / 142 MBps; PCI 133 MBps / 72 MBps; SCSI 40 MBps / 31 MBps; Disks 10-15 MBps / 9 MBps]
39
Bottleneck Analysis
• NTFS Read/Write 12 disk, 4 SCSI, 2 PCI
(not measured, we had only one PCI bus available, 2nd one was “internal”)
~ 120 MBps Unbuffered read
~ 80 MBps Unbuffered write
~ 40 MBps Buffered read
~ 35 MBps Buffered write
[Figure: Memory Read/Write ~150 MBps; two PCI buses at ~70 MBps each; four adapters at ~30 MBps each]
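One way to read the figure is that the slowest aggregate stage caps the end-to-end rate. A small sketch under that assumption, using the per-component rates from the slide (4 SCSI adapters at ~30 MBps, 2 PCI buses at ~70 MBps, memory at ~150 MBps); the stage names and the `bottleneck` helper are hypothetical, and disks are left out because this slide gives no per-disk rate:

```python
def bottleneck(stages):
    """Return (stage name, aggregate MBps) for the slowest stage in the path."""
    name = min(stages, key=lambda s: stages[s])
    return name, stages[name]


# Aggregate capacity per stage = component count x per-component rate (MBps).
stages = {
    "scsi_adapters": 4 * 30,  # four adapters at ~30 MBps each
    "pci_buses": 2 * 70,      # two PCI buses at ~70 MBps each
    "memory": 150,            # memory read/write ~150 MBps
}

if __name__ == "__main__":
    name, mbps = bottleneck(stages)
    print(f"bottleneck: {name} at ~{mbps} MBps")
```

With these numbers the adapters limit the path at about 120 MBps, which agrees with the slide's ~120 MBps unbuffered read estimate.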
40
Year 2002 Disks
• Big disk
(10 $/GB)
» 3”
» 100 GB
» 150 kaps (k accesses per second)
» 20 MBps sequential
• Small disk (20 $/GB)
» 3”
» 4 GB
» 100 kaps
» 10 MBps sequential
• Both running Windows NT™ 7.0?
(see below for why)
41
How Do They Talk to Each Other?
• Each node has an OS
• Each node has local resources: A federation.
• Each node does not completely trust the others.
• Nodes use RPC to talk to each other
» CORBA? DCOM? IIOP? RMI?
» One or all of the above.
• Huge leverage in high-level interfaces.
• Same old distributed system story.
[Figure: two nodes joined by Wire(s); each node's stack: Applications; RPC, streams, datagrams, ?; VIAL/VIPL]
42
SAN: Standard Interconnect
• LAN faster than memory bus?
• 1 GBps links in lab.
• 300$ port cost soon
• Port is computer
[Figure: Gbps Ethernet: 110 MBps; PCI 32: 70 MBps; UW Scsi: 40 MBps; FW scsi: 20 MBps; scsi: 5 MBps]
43
PennySort
• Hardware
» 266 Mhz Intel PPro
» 64 MB SDRAM (10ns)
» Dual Fujitsu DMA 3.2GB EIDE
• Software
» NT workstation 4.3
» NT 5 sort
• Performance
» sort 15 M 100-byte records (~1.5 GB)
» Disk to disk
» elapsed time 820 sec
• cpu time = 404 sec
[Chart: PennySort Machine (1107$) cost breakdown: cpu 32%; Disk 25%; board 13%; Memory 8%; Other 22% (Cabinet + Assembly 7%; Network, Video, floppy 9%; Software 6%)]
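The performance numbers can be turned into rates with simple division. A minimal sketch, assuming "15 M" means 15 million records and decimal megabytes; the variable names are hypothetical:

```python
RECORDS = 15_000_000      # "15 M" records, read as 15 million (assumption)
RECORD_BYTES = 100
ELAPSED_SEC = 820
CPU_SEC = 404

total_bytes = RECORDS * RECORD_BYTES             # 1.5e9 bytes, the slide's ~1.5 GB
mb_per_sec = total_bytes / ELAPSED_SEC / 1e6     # sorted data rate in decimal MB/s
records_per_sec = RECORDS / ELAPSED_SEC
cpu_fraction = CPU_SEC / ELAPSED_SEC             # share of elapsed time on the cpu

if __name__ == "__main__":
    print(f"data sorted:  {total_bytes / 1e9:.1f} GB")
    print(f"throughput:   {mb_per_sec:.2f} MB/s, {records_per_sec:,.0f} records/s")
    print(f"cpu busy:     {cpu_fraction:.0%} of elapsed time")
```

Roughly half of the elapsed time is cpu time, which suggests the rest is spent waiting on the two EIDE disks.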
44
Cluster Sort Conceptual Model
• Multiple Data Sources
• Multiple Data Destinations
• Multiple nodes
• Disks -> Sockets -> Disk -> Disk
[Figure: each source node holds a mix of records (AAA, BBB, CCC); records are sent to destination nodes A, B, and C, so that one destination ends up with all AAA records, one with all BBB, and one with all CCC]
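The conceptual model can be sketched as a partition-then-sort pipeline. This is an illustration only, not the NTClusterSort implementation: it assumes range partitioning on pre-chosen splitter keys (the slide does not say how keys are assigned to destinations), and in-memory lists stand in for disks and sockets. The function name `partition_sort` is hypothetical.

```python
from bisect import bisect_right


def partition_sort(sources, splitters):
    """Route every record to a destination by key range, then sort each destination.

    sources   -- one list of keys per source node
    splitters -- sorted boundary keys; len(splitters) + 1 destinations result
    Returns one sorted list per destination; concatenated in order, they
    form the globally sorted output.
    """
    destinations = [[] for _ in range(len(splitters) + 1)]
    for records in sources:                      # each source scans its own data
        for key in records:
            dest = bisect_right(splitters, key)  # destination index for this key
            destinations[dest].append(key)       # stands in for a socket send
    return [sorted(d) for d in destinations]     # each destination sorts locally


if __name__ == "__main__":
    sources = [
        ["CCC", "AAA", "BBB"],
        ["BBB", "CCC", "AAA"],
        ["AAA", "BBB", "CCC"],
    ]
    for node, records in zip("ABC", partition_sort(sources, ["B", "C"])):
        print(node, records)
```

Because the partitions cover disjoint key ranges, no merge step is needed across destinations; the cost is that skewed keys can overload one node unless the splitters are chosen from a sample of the data.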
45
Cluster Install & Execute
•If this is to be used by others,
it must be:
•Easy to install
•Easy to execute
• Installations of distributed systems take
time and can be tedious. (AM2, GluGuard)
• Parallel Remote execution is
non-trivial. (GLUnix, LSF)
How do we keep this “simple” and “built-in” to
NTClusterSort ?
46
Remote Install
•Add Registry entry to each remote node.
RegConnectRegistry()
RegCreateKeyEx()
47
Cluster Execution
• Setup: MULTI_QI struct, COSERVERINFO struct
• CoCreateInstanceEx()
• Retrieve remote object handle from MULTI_QI struct
• Invoke methods as usual
[Figure: MULTI_QI and COSERVERINFO are passed to CoCreateInstanceEx(); a HANDLE is returned for each of three remote nodes, and Sort() is invoked on each]
48