Jim Gray
Microsoft Research http://www.research.Microsoft.com/~Gray
[Figure: system size pyramid, from Personal System through Departmental Server to SMP Super Server]
Grow Up with SMP
4xP6 is now standard
Grow Out with Cluster
Cluster has inexpensive parts
Cluster of PCs
Teradata: 500 nodes (50k$/slice)
Tandem, VMScluster: 150 nodes (100k$/slice)
Intel: 9,000 nodes @ 55M$ (6k$/slice)
IBM: 512 nodes @ 100M$ (200k$/slice)
PC clusters (bare-handed) at dozens of nodes: web servers (MSN, PointCast,…), DB servers
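The per-slice prices above follow from dividing the system price by the node count. A minimal sketch of that arithmetic, using only the Intel and IBM figures quoted on the slide (the function name `price_per_slice` is made up for illustration):

```python
# Total price (US dollars) and node count, as quoted on the slide.
systems = {
    "Intel": (55_000_000, 9_000),
    "IBM": (100_000_000, 512),
}

def price_per_slice(total_dollars, nodes):
    """Return the average price of one node (slice) in dollars."""
    return total_dollars / nodes

for name, (total, nodes) in systems.items():
    # Report in thousands of dollars, the unit the slide uses.
    print(f"{name}: {price_per_slice(total, nodes) / 1000:.0f} k$/slice")
```

This gives roughly 6 k$ per slice for Intel and 195 k$ per slice for IBM, which the slide rounds to 200 k$.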
KEY TECHNOLOGY HERE IS THE APPS.
– Apps distribute data
– Apps distribute execution
When slices cost 50k$, you buy 10 or 20.
When slices cost 5k$, you buy 100 or 200.
Manageability, programmability, usability become key issues (total cost of ownership).
PCs are MUCH easier to use and program
PCs create virtuous cycle
Vicious Cycle: No Customers!
[Figure: four repetitions of the cycle "New MPP & New OS", "New App"]
Virtuous Cycle:
Standards allow progress and investment protection
[Figure: cycle of Standard OS & Hardware, Apps, Customers]
A consortium of 60 HW & SW vendors
(everybody who is anybody)
A set of APIs for clustering and fault tolerance
An enhancement to NT™ Server (in beta test)
Key concepts
– System: a particular node
– Cluster: a collection of systems working together
– Resource: a hardware or software module
– Resource dependency: one resource needs another
– Resource group: fails over as a unit; dependencies do not cross group boundaries
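One way to picture these concepts is as a small data model. The sketch below is illustrative only and is not the NT cluster API: the class names `Resource` and `ResourceGroup` and the methods `validate` and `online_order` are assumptions. It enforces the rule that dependencies stay inside a group and derives an order in which resources could be brought online.

```python
from dataclasses import dataclass, field

@dataclass
class Resource:
    name: str
    depends_on: list = field(default_factory=list)  # names of resources this one needs

@dataclass
class ResourceGroup:
    name: str
    resources: dict = field(default_factory=dict)  # resource name -> Resource

    def add(self, resource):
        self.resources[resource.name] = resource

    def validate(self):
        """Raise if any dependency points outside this group."""
        for res in self.resources.values():
            for dep in res.depends_on:
                if dep not in self.resources:
                    raise ValueError(
                        f"{res.name} depends on {dep}, outside group {self.name}")

    def online_order(self):
        """Return names ordered so each resource follows its dependencies."""
        self.validate()
        order, done = [], set()

        def visit(name, path=()):
            if name in done:
                return
            if name in path:
                raise ValueError(f"dependency cycle at {name}")
            for dep in self.resources[name].depends_on:
                visit(dep, path + (name,))
            done.add(name)
            order.append(name)

        for name in self.resources:
            visit(name)
        return order

group = ResourceGroup("sales-db")
group.add(Resource("database", depends_on=["disk", "ip-address"]))
group.add(Resource("disk"))
group.add(Resource("ip-address"))
print(group.online_order())
```

Because the whole group fails over as a unit, keeping every dependency inside the group means a surviving node can bring the group online without consulting any other group.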
[Figure: cluster software architecture]
Cluster Management Tools, Cluster API DLL, RPC
Cluster Service: Database Manager, Global Update Manager, Event Processor, Node Manager, Failover Mgr, Communication Manager (to Other Nodes)
Resource Monitors: Physical Resource DLL, Logical Resource DLL, App Resource DLL
Resource Management Interface: Open, Online, IsAlive, LooksAlive, Offline, Close
Non Aware App, Cluster Aware App
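The six entry points named in the figure (Open, Online, LooksAlive, IsAlive, Offline, Close) suggest a simple lifecycle that a resource monitor drives. The following is a sketch under stated assumptions, not the real resource DLL interface: the Python method names, the `DemoResource` class, the `healthy` flag, and the polling schedule are all hypothetical. It assumes LooksAlive is a cheap check made often and IsAlive a more thorough check made less often.

```python
class DemoResource:
    """Hypothetical resource exposing the six entry points from the figure."""

    def __init__(self, name):
        self.name = name
        self.state = "closed"
        self.healthy = True  # stand-in for a real health probe (assumption)

    def open(self):
        self.state = "offline"   # known to the monitor, not yet serving

    def online(self):
        self.state = "online"    # start serving

    def looks_alive(self):
        # Cheap check: only inspects recorded state.
        return self.state == "online"

    def is_alive(self):
        # Thorough check: also consults the health probe.
        return self.state == "online" and self.healthy

    def offline(self):
        self.state = "offline"

    def close(self):
        self.state = "closed"


def poll(resource, ticks, thorough_every=5):
    """Return the tick at which a failure is detected, or None if none is."""
    for tick in range(1, ticks + 1):
        if tick % thorough_every == 0:
            ok = resource.is_alive()
        else:
            ok = resource.looks_alive()
        if not ok:
            return tick
    return None


res = DemoResource("database")
res.open()
res.online()
print(poll(res, ticks=10))   # healthy resource: no failure detected
res.healthy = False
print(poll(res, ticks=10))   # failure only visible to the thorough check
```

In this sketch a fault that only the thorough check can see goes unnoticed until the next IsAlive poll, which is the trade-off between check cost and detection latency.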
Clients and Servers made from the same stuff.
– Inexpensive: Built with commodity components
Fault tolerance:
– Spare modules mask failures
Modular growth
– grow by adding small modules
Parallel data search
– use multiple processors and disks
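Parallel data search can be sketched as scanning each partition independently and merging the results. The example below is a minimal illustration, assuming the data is already split into partitions (one per disk, say); the function names are made up. It uses threads to show the structure only: in CPython, threads overlap I/O waits but do not speed up CPU-bound scans.

```python
from concurrent.futures import ThreadPoolExecutor

def scan(partition, predicate):
    """Sequentially scan one partition and return the matching rows."""
    return [row for row in partition if predicate(row)]

def parallel_search(partitions, predicate):
    """Scan all partitions concurrently and concatenate the matches."""
    workers = max(1, len(partitions))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda p: scan(p, predicate), partitions)
        # map() preserves partition order, so results are deterministic.
        return [row for part in parts for row in part]

partitions = [[1, 8, 15], [22, 4, 30], [9, 41, 2]]
print(parallel_search(partitions, lambda row: row > 10))
```

Adding a partition (and a worker to scan it) is the "modular growth" step: the search code does not change when the data is spread over more modules.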
Yes, if you don’t have it you fail
– parallel MPPs vs Tandem, Teradata, VAXcluster.
NUMA & Cluster:
– some things are farther away.
– Must program in parallel to utilize multiple CPUs, disks, wires
OS, DBMS, TP monitor, Web Server, ORB give transparency: load-balance data and programs.
Administrator, Programmer, User
– do not want to know about program & data location
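Location transparency can be illustrated with a thin routing layer that decides where each record lives, so callers never name a node. This is a sketch under assumptions: the `Router` class, the node names, and the choice of hash partitioning with CRC32 are illustrative and are not taken from any of the systems mentioned.

```python
import zlib

class Router:
    """Hypothetical layer that hides which node stores each key."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        # One in-memory dict per node stands in for that node's storage.
        self.stores = {node: {} for node in self.nodes}

    def _node_for(self, key):
        # Deterministic hash partitioning: the same key always maps
        # to the same node for a fixed node list.
        return self.nodes[zlib.crc32(key.encode("utf-8")) % len(self.nodes)]

    def put(self, key, value):
        self.stores[self._node_for(key)][key] = value

    def get(self, key):
        return self.stores[self._node_for(key)].get(key)

router = Router(["node-a", "node-b", "node-c"])
router.put("customer:42", {"name": "Ada"})
print(router.get("customer:42"))  # caller never mentions a node
```

Simple modulo hashing remaps most keys when the node list changes; a real system would need a scheme for rebalancing, which this sketch leaves out.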
Redundant disk or path: configure around it.
Non-redundant software: restart.
Non-redundant hardware: migrate software to surviving nodes.
Fault detection: 1 ms to 10 sec.
Failover: 0.1 sec to 1 min.
This is standard in Tandem, Teradata, VMScluster.
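The three recovery rules above can be written as a small decision function, with a failover step that moves work to surviving nodes. This is a minimal sketch: the function names, the string labels, and the least-loaded placement policy are assumptions made for illustration, not a description of how Tandem, Teradata, or VMScluster implement failover.

```python
def recovery_action(kind, redundant):
    """Map a failed component to the recovery rule from the slide."""
    if redundant:
        return "configure around it"
    if kind == "software":
        return "restart"
    if kind == "hardware":
        return "migrate software to surviving nodes"
    raise ValueError(f"unknown component kind: {kind}")

def fail_over(placement, failed_node, survivors):
    """Return a new group -> node placement with the failed node's groups moved.

    Assumed policy: each orphaned group goes to the survivor that
    currently hosts the fewest groups (ties broken by node name).
    """
    if not survivors:
        raise RuntimeError("no surviving nodes")
    load = {node: 0 for node in survivors}
    for node in placement.values():
        if node in load:
            load[node] += 1
    new_placement = dict(placement)
    for group, node in sorted(placement.items()):
        if node == failed_node:
            target = min(survivors, key=lambda s: (load[s], s))
            new_placement[group] = target
            load[target] += 1
    return new_placement

placement = {"sql": "A", "web": "A", "file": "B"}
print(recovery_action("hardware", redundant=False))
print(fail_over(placement, failed_node="A", survivors=["B", "C"]))
```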
Cluster lowers support costs by
– masking failures (instant repair via spare modules)
– allowing online maintenance and upgrades.
Commodity parts are much cheaper
– 10$/MIPS vs 10,000$/MIPS
– 1k$/OS vs 30k$/month/OS
Modern OSs are easier to install, configure, manage
– GUI
– Self-tuning
– Online and task-based help
– Built in wizards