FT NT: A Tutorial on Microsoft Cluster Server™ (formerly “Wolfpack”)
©1996, 1997 Microsoft Corp.
Joe Barrera
Jim Gray
Microsoft Research
{joebar, gray} @ microsoft.com
http://research.microsoft.com/barc
Outline
• Why FT and Why Clusters
• Cluster Abstractions
• Cluster Architecture
• Cluster Implementation
• Application Support
• Q&A
DEPENDABILITY: The 3 ITIES
• RELIABILITY / INTEGRITY: does the right thing (also large MTTF).
• AVAILABILITY: does it now (also small MTTR).
  Availability = MTTF / (MTTF + MTTR)
• System Availability: if 90% of terminals are up & 99% of the DB is up,
  then ~89% of transactions are serviced on time.
• Holistic vs. Reductionist view: Security, Integrity/Reliability, and
  Availability together make up dependability.
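As a quick check of the arithmetic above, availability follows directly from MTTF and MTTR, and independent components in series multiply (a sketch; the 90%/99% figures are the slide's example):

```c
#include <assert.h>

/* Availability = MTTF / (MTTF + MTTR). */
static double availability(double mttf, double mttr) {
    return mttf / (mttf + mttr);
}

/* Independent components in series: availabilities multiply. */
static double system_availability(double a_terminals, double a_db) {
    return a_terminals * a_db;
}
```

With 90% of terminals up and 99% of the database up, `system_availability(0.90, 0.99)` gives about 0.89, the "89% of transactions serviced on time" on the slide.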
Case Study - Japan
"Survey on Computer Security", Japan Info Dev Corp., March 1986. (trans: Eiichi Watanabe.)
1,383 institutions reported (6/84 - 7/85):
7,517 outages, MTTF ~ 10 weeks, avg duration ~ 90 MINUTES.

Outage causes and per-cause MTTF:
  Vendor (hardware and software)   42%     5 months
  Application software             25%     9 months
  Communications lines             12%     1.5 years
  Environment                      11.2%   2 years
  Operations                       9.3%    2 years

To Get 10-Year MTTF, Must Attack All These Areas
Case Studies - Tandem Trends
• MTTF improved.
• Shift from Hardware & Maintenance (from 50% to 10%)
  to Software (62%) & Operations (15%).
• NOTE: systematic under-reporting of Environment,
  Operations errors, and Application Software.
Summary of FT Studies
• Current situation: ~4-year MTTF => Fault Tolerance Works.
• Hardware is GREAT (maintenance and MTTF).
• Software masks most hardware faults.
• Many hidden software outages in operations:
  - New software.
  - Utilities.
• Must make all software ONLINE.
• Software seems to define a 30-year MTTF ceiling.
• Reasonable goal: 100-year MTTF. Class 4 today => class 6 tomorrow.
Fault Tolerance vs Disaster Tolerance
• Fault-Tolerance: masks local faults
  - RAID disks
  - Uninterruptible Power Supplies
  - Cluster Failover
• Disaster Tolerance: masks site failures
  - Protects against fire, flood, sabotage, ...
  - Redundant system and service at a remote site.
The Microsoft “Vision”: Plug & Play Dependability
• Transactions for integrity/reliability
• Clusters for availability
• Security
• All built into the OS
Cluster Goals
• Manageability
  - Manage nodes as a single system
  - Perform server maintenance without affecting users
  - Mask faults, so repair is non-disruptive
• Availability
  - Restart failed applications & servers
    (un-availability ~ MTTR / MTBF, so quick repair)
  - Detect/warn administrators of failures
• Scalability
  - Add nodes for incremental
    • processing
    • storage
    • bandwidth
Fault Model
• Failures are independent.
  So, single fault tolerance is a big win.
• Hardware fails fast (blue-screen).
• Software fails fast (or goes to sleep).
• Software often repaired by reboot:
  - Heisenbugs
• Operations tasks: major source of outage
  - Utility operations
  - Software upgrades
Cluster: Servers Combined to Improve Availability & Scalability
• Cluster: A group of independent systems working together as a single
  system. Clients see scalable & FT services (single system image).
• Node: A server in a cluster. May be an SMP server.
• Interconnect: Communications link used for intra-cluster status info
  such as “heartbeats”. Can be Ethernet.
(Diagram: client PCs and printers connect to Server A and Server B; each
server has a disk array; the servers share an interconnect.)
Microsoft Cluster Server™
• 2-node availability Summer 97 (20,000 Beta Testers now)
  - Commoditize fault-tolerance (high availability)
  - Commodity hardware (no special hardware)
  - Easy to set up and manage
  - Lots of applications work out of the box.
• 16-node scalability later (next year?)
Failover Example
(Diagram: a browser connects to Server 1 and Server 2; each server can
host the web site and the database; web site files and database files
reside on shared disks.)
MS Press Failover Demo
• Client/Server
• Software failure
• Admin shutdown
• Server failure
Resource States:
- Pending
- Partial
- Failed
- Offline
Demo Configuration
• Servers “Alice” and “Betty”, each with:
  - SMP Pentium® Pro Processors
  - Windows NT Server with Wolfpack
  - Microsoft Internet Information Server
  - Microsoft SQL Server
  - Local disks
• Interconnect: standard Ethernet
• Shared disks: SCSI disk cabinet shared by the Windows NT Server Cluster
• Administrator: Windows NT Workstation running Cluster Admin and
  SQL Enterprise Mgr
• Client: Windows NT Workstation running Internet Explorer and the
  MS Press OLTP app
Demo Administration
• Servers “Alice” and “Betty” each run SQL Trace; one runs Globe.
• Cluster Admin Console
  - Windows GUI
  - Shows cluster resource status
  - Replicates status to all servers
  - Defines apps & related resources
  - Defines resource dependencies
  - Orchestrates recovery order
• SQL Enterprise Mgr
  - Windows GUI
  - Shows server status
  - Manages many servers
  - Start, stop, manage DBs
Generic Stateless Application: Rotating Globe
• Mplay32 is a generic app.
• Registered with MSCS.
• MSCS restarts it on failure.
• Move/restart ~ 2 seconds.
• Fail over if 4 failures (= process exits) in 3 minutes
  - settable default
Demo: Moving or Failing Over an Application
(Diagram: the AVI application moves from Alice to Betty when Alice fails
or the operator requests a move; both nodes attach to the shared SCSI
disk cabinet.)
Generic Stateful Application: NotePad
• Notepad saves state on the shared disk.
• Failure before save => lost changes.
• Failover or move (disk & state move).
Demo Steps
(Each diagram shows Alice and Betty in a Windows NT Server Cluster sharing
a SCSI disk cabinet; the client reaches IIS over HTTP, and IIS reaches SQL
Server via ODBC on a cluster IP address.)
1. Alice delivering service: SQL activity on Alice; no SQL activity on Betty.
2. Request move to Betty: the group moves; SQL activity shifts to Betty.
3. Betty delivering service: SQL activity on Betty; none on Alice.
4. Power-fail Betty; Alice takes over: SQL activity returns to Alice.
5. Alice delivering service.
6. Reboot Betty: Betty rejoins the cluster and can take over again.
Outline
• Why FT and Why Clusters
• Cluster Abstractions
• Cluster Architecture
• Cluster Implementation
• Application Support
• Q&A
Cluster and NT Abstractions
Cluster abstractions parallel the basic NT abstractions:
  Cluster  ↔ Domain
  Group    ↔ Node
  Resource ↔ Service
Basic NT Abstractions
• Service: program or device managed by a node
  - e.g., file service, print service, database server
  - can depend on other services (startup ordering)
  - can be started, stopped, paused, failed
• Node: a single (tightly-coupled) NT system
  - hosts services; belongs to a domain
  - services on a node always remain co-located
  - unit of service co-location; involved in naming services
• Domain: a collection of nodes
  - cooperation for authentication, administration, naming
Cluster Abstractions
• Resource: program or device managed by a cluster
  - e.g., file service, print service, database server
  - can depend on other resources (startup ordering)
  - can be online, offline, paused, failed
• Resource Group: a collection of related resources
  - hosts resources; belongs to a cluster
  - unit of co-location; involved in naming resources
• Cluster: a collection of nodes, resources, and groups
  - cooperation for authentication, administration, naming
Resources
Resources have...
• Type: what it does (file, DB, print, web, …)
• An operational state (online/offline/failed)
• Current and possible nodes
• Containing Resource Group
• Dependencies on other resources
• Restart parameters (in case of resource failure)
Resource Types
• Built-in types:
  - Generic Application
  - Generic Service
  - Internet Information Server (IIS) Virtual Root
  - Network Name
  - TCP/IP Address
  - Physical Disk
  - FT Disk (Software RAID)
  - Print Spooler
  - File Share
• Added by others:
  - Microsoft SQL Server
  - Message Queues
  - Exchange Mail Server
  - Oracle
  - SAP R/3
  - Your application? (use the developer kit wizard)
(Screenshot slides: property pages for the Physical Disk, TCP/IP Address,
Network Name, File Share, IIS (WWW/FTP) Server, and Print Spooler
resource types.)
Resource States
• Resource states:
  - Offline: exists, not offering service
  - Online: offering service
  - Failed: not able to offer service
  (Transitions pass through Online Pending and Offline Pending states.)
• Resource failure may cause:
  - local restart
  - other resources to go offline
  - the resource group to move
  - (all subject to group and resource parameters)
• Resource failure detected by:
  - polling failure
  - node failure
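The state machine above can be written as a small transition table (a simplified model for illustration, not the MSCS implementation; the state and event names come from the slide):

```c
#include <assert.h>

typedef enum {
    OFFLINE, ONLINE_PENDING, ONLINE, OFFLINE_PENDING, FAILED
} ResState;

typedef enum {
    GO_ONLINE, IS_ONLINE, GO_OFFLINE, IS_OFFLINE, FAILURE
} ResEvent;

/* Apply one event to a resource state; illegal events leave state unchanged. */
static ResState res_step(ResState s, ResEvent e) {
    switch (e) {
    case GO_ONLINE:  return (s == OFFLINE) ? ONLINE_PENDING : s;
    case IS_ONLINE:  return (s == ONLINE_PENDING) ? ONLINE : s;
    case GO_OFFLINE: return (s == ONLINE) ? OFFLINE_PENDING : s;
    case IS_OFFLINE: return (s == OFFLINE_PENDING) ? OFFLINE : s;
    case FAILURE:    return FAILED;   /* failure can strike in any state */
    }
    return s;
}
```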
Resource Dependencies
• Similar to NT Service Dependencies.
• Orderly startup & shutdown
  - A resource is brought online after any resources it depends on
    are online.
  - A resource is taken offline before any resources it depends on.
• Interdependent resources
  - form dependency trees
  - move among nodes together
  - fail over together
  - as per resource group
(Example dependency tree from the slide: File Share and IIS Virtual Root
depend on Network Name, which depends on IP Address; each resource is
managed through a Resource DLL.)
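The startup-ordering rule can be sketched in a few lines (a toy model: the resource names follow the slide's example tree, and the real MSCS resolves dependencies from the cluster registry):

```c
#include <assert.h>

/* Tiny dependency-ordering sketch: bring each resource online only after
 * the resource it depends on is online. */
enum { IP, NETNAME, FILESHARE, IISROOT, NRES };

static const int depends_on[NRES] = {
    /* IP        */ -1,       /* no dependency */
    /* NETNAME   */ IP,
    /* FILESHARE */ NETNAME,
    /* IISROOT   */ NETNAME,
};

static int order[NRES], norder;   /* order resources came online */
static int online[NRES];

static void bring_online(int r) {
    if (online[r]) return;
    if (depends_on[r] >= 0)
        bring_online(depends_on[r]);   /* dependencies first */
    online[r] = 1;
    order[norder++] = r;
}
```

Bringing the File Share online first brings the IP Address and Network Name online, exactly the bottom-up order the slide describes; offline proceeds in the reverse order.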
(Screenshot slide: the Dependencies tab.)
NT Registry
• Stores all configuration information
  - Software
  - Hardware
• Hierarchical (name, value) map
• Has an open, documented interface
• Is secure
• Is visible across the net (RPC interface)
• Typical entry:
  \Software\Microsoft\MSSQLServer\MSSQLServer\
    DefaultLogin = “GUEST”
    DefaultDomain = “REDMOND”
Cluster Registry
• Separate from the local NT Registry
• Replicated at each node
  - algorithms explained later
• Maintains configuration information:
  - cluster members
  - cluster resources
  - resource and group parameters (e.g. restart)
• Stable storage
• Refreshed from the “master” copy when a node joins the cluster
Other Resource Properties
• Name
• Restart policy (restart N times, failover, …)
• Startup parameters
• Private configuration info (resource type specific)
  - per-node as well, if necessary
• Poll intervals (LooksAlive, IsAlive, Timeout)
• These properties are all kept in the Cluster Registry
(Screenshot slides: the General and Advanced resource tabs.)
Resource Groups
• Every resource belongs to a resource group.
• Resource groups move (fail over) as a unit.
• Dependencies NEVER cross groups.
  (Dependency trees are contained within groups.)
• A group may contain a forest of dependency trees.
(Example “Payroll Group”: Web Server and SQL Server resources with an
IP Address, Drive E:, and Drive F:.)
(Screenshot slide: moving a resource group.)
Group Properties
• CurrentState: Online, Partially Online, Offline
• Members: resources that belong to the group
  - members determine which nodes can host the group
• PreferredOwners: ordered list of host nodes
• FailoverThreshold: how many faults cause failover
• FailoverPeriod: time window for the failover threshold
• FailbackWindowStart: when can failback happen?
• FailbackWindowEnd: when can failback happen?
• Everything (except CurrentState) is stored in the registry
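A sketch of how FailoverThreshold and FailoverPeriod might interact (illustrative only; the field names mirror the properties above, not the actual MSCS code): record the time of each failure, and fail the group over once the count inside the period exceeds the threshold — otherwise a local restart suffices.

```c
#include <assert.h>

#define MAX_FAILURES 64

typedef struct {
    int threshold;              /* FailoverThreshold: faults before failover */
    int period;                 /* FailoverPeriod: window in seconds */
    int times[MAX_FAILURES];    /* timestamps of recorded failures */
    int count;
} FailureWindow;

/* Record a failure at time `now`; return 1 if the group should fail over
 * (too many failures inside the period), 0 if a local restart suffices. */
static int on_failure(FailureWindow *w, int now) {
    int i, recent = 0;
    if (w->count < MAX_FAILURES)
        w->times[w->count++] = now;
    for (i = 0; i < w->count; i++)
        if (now - w->times[i] <= w->period)
            recent++;
    return recent > w->threshold;
}
```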
Failover and Failback
• Failover parameters
  - timeout on LooksAlive, IsAlive
  - # local restarts in the failure window
  - after this, offline
• Failback to the preferred node
  - (during the failback window)
• Do resource failures affect the group?
(Diagram: a group with Service, IPaddr, and Name resources fails over from
node \\Alice to node \\Betty, and fails back.)
Cluster Concepts
(Diagram: a cluster contains resource groups; each group contains
resources.)
Cluster Properties
• Defined Members: nodes that can join the cluster
• Active Members: nodes currently joined to the cluster
• Resource Groups: groups in the cluster
• Quorum Resource:
  - stores a copy of the cluster registry
  - used to form a quorum
• Network: which network is used for communication
• All properties are kept in the Cluster Registry
Cluster API Functions
(operations on nodes & groups)
• Find and communicate with a Cluster
• Query/Set Cluster properties
• Enumerate Cluster objects
  - Nodes
  - Groups
  - Resources and Resource Types
• Cluster Event Notifications
  - Node state and property changes
  - Group state and property changes
  - Resource state and property changes
(Screenshot slide: Cluster Management.)
Demo
• Server startup and shutdown
• Installing applications
• Changing status
• Failing over
• Transferring ownership of groups or resources
• Deleting groups and resources
Outline
• Why FT and Why Clusters
• Cluster Abstractions
• Cluster Architecture
• Cluster Implementation
• Application Support
• Q&A
Architecture
• Top tier provides cluster abstractions
  - Failover Manager, Resource Monitor
• Middle tier provides distributed operations
  - Cluster Registry, Global Update, Quorum, Membership
• Bottom tier is NT and drivers
  - Windows NT Server, Cluster Disk Driver, Cluster Net Drivers
Membership and Regroup
• Membership:
  - used for orderly addition to and removal from { active nodes }
• Regroup:
  - used for failure detection (via heartbeat messages)
  - forceful eviction from { active nodes }
Membership
• Defined cluster = all nodes
• Active cluster:
  - subset of the defined cluster
  - includes the Quorum Resource
  - stable (no regroup in progress)
Quorum Resource
• Usually (but not necessarily) a SCSI disk
• Requirements:
  - arbitrates for a resource by supporting the challenge/defense protocol
  - capable of storing the cluster registry and logs
• Configuration Change Logs
  - track changes to the configuration database when any defined member
    is missing (not active)
  - prevent configuration partitions in time
Challenge/Defense Protocol
• SCSI-2 has reserve/release verbs
  - semaphore on the disk controller
• Owner gets a lease on the semaphore
  - renews the lease once every 3 seconds
• To preempt ownership:
  - challenger clears the semaphore (SCSI bus reset)
  - waits 10 seconds
    • 3 seconds for renewal + 2 seconds bus settle time
    • x2 to give the owner two chances to renew
  - if still clear, the former owner loses the lease
  - challenger issues reserve to acquire the semaphore
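The timing argument can be simulated with integer seconds (a toy model of the protocol above, not SCSI code): the defender renews every 3 s; the challenger resets the bus, waits 10 s, and wins only if no renewal lands in between.

```c
#include <assert.h>

#define RENEW_PERIOD   3    /* owner renews its reservation every 3 s */
#define CHALLENGE_WAIT 10   /* 2 * (3 s renewal + 2 s bus settle) */

/* Toy challenge/defense simulation. `defender_alive` says whether the
 * owner is still renewing. Returns 1 if the challenge succeeds. */
static int challenge(int defender_alive) {
    int t, reserved = 0;
    /* t = 0: challenger clears the semaphore with a bus reset. */
    for (t = 1; t <= CHALLENGE_WAIT; t++) {
        /* A live defender renews every RENEW_PERIOD seconds. */
        if (defender_alive && t % RENEW_PERIOD == 0)
            reserved = 1;
    }
    /* After the wait, the challenger reserves only if still clear. */
    return reserved ? 0 : 1;
}
```

The 10-second wait guarantees a live defender at least two renewal opportunities, so a healthy owner always keeps the quorum disk.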
Challenge/Defense Protocol: Timelines
• Successful defense: the defender’s periodic Reserve continues after the
  challenger’s bus reset; at the end of the wait the challenger detects
  the reservation and gives up.
• Successful challenge: no Reserve follows the bus reset; the challenger
  detects no reservation and issues its own Reserve, acquiring ownership.
Regroup
• Invariant: all members agree on { members }.
• Regroup re-computes { members }.
• Each node sends a heartbeat message to a peer (default is one per second).
• Regroup if two heartbeat messages are lost
  - suspicion that the sender is dead
  - failure detection in bounded time
• Uses a 5-round protocol to agree.
  - checks communication among nodes
  - a suspected missing node may survive
• Upper levels (global update, etc.) are informed of the regroup event.
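Heartbeat-based suspicion is simple to sketch (illustrative; the real failure agreement is the 5-round regroup protocol described above): with one heartbeat per second, two consecutive misses trigger a regroup.

```c
#include <assert.h>

#define HEARTBEAT_PERIOD 1   /* seconds between heartbeats */
#define MISS_LIMIT       2   /* lost heartbeats before regroup */

/* Return 1 if we should start a regroup, given the current time and the
 * time the last heartbeat arrived from the peer. */
static int should_regroup(int now, int last_heartbeat) {
    return (now - last_heartbeat) > MISS_LIMIT * HEARTBEAT_PERIOD;
}
```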
Membership State Machine
(Reconstructed from the slide’s diagram:)
• Sleeping --Start Cluster--> Member Search
• Member Search --Found--> Joining; --Search Fails--> Quorum Disk Search
• Joining --Join Succeeds, Synchronize Succeeds--> Online Member
• Quorum Disk Search --Acquire (reserve) Quorum Disk--> Forming;
  --Search or Reserve Fails--> Sleeping
• Forming --Online--> Online Member
• Online Member --Lost Heartbeat--> Regroup
• Regroup --Non-Minority and Quorum--> Online Member;
  --Minority or no Quorum--> Sleeping
Joining a Cluster
• When a node starts up, it mounts and configures only local,
  non-cluster devices.
• It starts the Cluster Service, which
  - looks in the local (stale) registry for members
  - asks each member in turn to sponsor the new node’s membership
    (stop when a sponsor is found)
• Sponsor (any active member)
  - authenticates the applicant
  - broadcasts the applicant to cluster members
  - sends the updated registry to the applicant
  - the applicant becomes a cluster member
Forming a Cluster (when Joining fails)
• Use the registry to find the quorum resource.
• Attach to (arbitrate for) the quorum resource.
• Update the cluster registry from the quorum resource
  - e.g. if we were down when it was in use.
• Form a new one-node cluster.
• Bring other cluster resources online.
• Let others join your cluster.
Leaving a Cluster (Gracefully)
• Pause:
  - Move all groups off this member.
  - Change to paused state (remains a cluster member).
• Offline:
  - Move all groups off this member.
  - Send a ClusterExit message to all cluster members
    • prevents regroup
    • prevents stalls during departure transitions
  - Close cluster connections (now not an active cluster member).
  - The cluster service stops on the node.
• Evict: remove the node from the defined member list.
Leaving a Cluster (Node Failure)
• Node (or communication) failure triggers Regroup.
• If after regroup:
  - Minority group OR no quorum device: group does NOT survive.
  - Non-minority group AND quorum device: group DOES survive.
• Non-Minority rule:
  - Number of new members >= 1/2 of the old active cluster.
  - Prevents a minority from seizing the quorum device at the expense of
    a larger potentially surviving cluster.
• Quorum guarantees correctness
  - prevents “split-brain”
    • e.g. with a newly forming cluster containing a single node
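The survival rule above reduces to a one-line predicate (a sketch using the slide's definitions):

```c
#include <assert.h>

/* Non-minority rule: a partition survives iff it holds the quorum device
 * and contains at least half of the previously active members. */
static int partition_survives(int new_members, int old_active, int has_quorum) {
    return has_quorum && (2 * new_members >= old_active);
}
```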
Global Update
• Propagates updates to all nodes in the cluster.
• Used to maintain the replicated cluster registry.
• Updates are atomic and totally ordered.
• Tolerates all benign failures.
• Depends on membership
  - all are up
  - all can communicate
R. Carr, Tandem Systems Review, V1.2, 1985, sketches the regroup and
global update protocols.
Global Update Algorithm
• The cluster has a locker node that regulates updates.
  - The locker is the oldest active node in the cluster.
• Send the update to the locker node.
• Update the other (active) nodes
  - in seniority order (e.g. locker first)
  - this includes the updating node.
• Failure of all updated nodes:
  - the update never happened
  - updated nodes will roll back on recovery.
• Survival of any updated node:
  - the new locker is oldest, and so has the update if any node does
  - the new locker restarts the update.
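A toy model of the locker-ordered update (a sketch of the idea, not Carr's full protocol): apply the update to nodes in seniority order; if any updated node survives a crash, the new locker — the oldest survivor — still holds the update and can finish propagating it.

```c
#include <assert.h>

#define NNODES 4

/* Nodes indexed by seniority: 0 is oldest (the locker). */
static int value[NNODES];     /* replicated registry value per node */
static int alive[NNODES];

/* Propagate an update in seniority order, stopping after `fail_after`
 * nodes to model a crash mid-update. */
static void global_update(int v, int fail_after) {
    int i, done = 0;
    for (i = 0; i < NNODES && done < fail_after; i++)
        if (alive[i]) { value[i] = v; done++; }
}

/* Recovery: the new locker is the oldest surviving node; because updates
 * go in seniority order, it has the update if any survivor does, and it
 * re-propagates to the other survivors. */
static void recover(void) {
    int i, locker = -1;
    for (i = 0; i < NNODES; i++)
        if (alive[i]) { locker = i; break; }
    if (locker < 0) return;
    for (i = 0; i < NNODES; i++)
        if (alive[i]) value[i] = value[locker];
}
```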
Cluster Registry
• Separate from the local NT Registry.
• Maintains cluster configuration
  - members, resources, restart parameters, etc.
• Stable storage.
• Replicated at each member
  - Global Update protocol
  - the NT Registry keeps a local copy
Cluster Registry Bootstrapping
• Membership uses the Cluster Registry for its list of nodes
  - … a circular dependency.
• Solution:
  - Membership uses the stale local cluster registry.
  - Refresh after joining or forming the cluster.
  - The master is either
    • the quorum device, or
    • the active members.
Resource Monitor
• Polls resources:
  - IsAlive and LooksAlive
• Detects failures
  - polling failure
  - failure event from a resource
• Higher levels tell it
  - Online, Offline
  - Restart
Failover Manager
• Assigns groups to nodes based on
  - failover parameters
  - possible nodes for each resource in the group
  - preferred nodes for the resource group
Failover (Resource Goes Offline)
1. The Resource Manager detects a resource error and notifies the
   Failover Manager.
2. Attempt to restart the resource until the resource retry limit
   is exceeded.
3. The Failover Manager checks the Failover Window and Failover
   Threshold. If the failover conditions are not within constraints,
   leave the group in a partially Online state and wait for the
   Failback Window.
4. Otherwise, switch the resource (and its dependents) Offline.
5. Can another owner be found (arbitration)? If not, leave the group
   in a partially Online state.
6. If so, notify the Failover Manager on the new system to bring the
   resource Online.
Pushing a Group (Resource Failure)
1. The Resource Monitor notifies the Resource Manager of a resource
   failure.
2. The Resource Manager enumerates all objects in the dependency tree of
   the failed resource and takes each depending resource Offline.
3. If no resource has “Affect the Group” = True, leave the group in a
   partially Online state.
4. Otherwise, the Resource Manager notifies the Failover Manager that the
   dependency tree is Offline and needs to fail over.
5. The Failover Manager performs arbitration to locate a new owner for
   the group.
6. The Failover Manager on the new owner node brings the resources Online.
Pulling a Group (Node Failure)
1. The Cluster Service notifies the Failover Manager of the node failure.
2. The Failover Manager determines which groups were owned by the
   failed node.
3. The Resource Manager notifies the Failover Manager that the node is
   Offline and the groups it owned need to fail over.
4. The Failover Manager performs arbitration to locate a new owner for
   the groups.
5. The Failover Manager on the new owner(s) brings the resources Online
   in dependency order.
Failback to Preferred Owner Node
• A group may have a Preferred Owner.
• Failback occurs when the Preferred Owner comes back online, and only
  during the Failback Window (a time slot, e.g. at night).
1. The Preferred Owner comes back Online; is the time within the
   Failback Window?
2. The Resource Manager takes each resource on the current owner Offline.
3. The Resource Manager notifies the Failover Manager that the group is
   Offline and needs to fail over to the Preferred Owner.
4. The Failover Manager performs arbitration to locate the Preferred
   Owner of the group.
5. The Failover Manager on the Preferred Owner brings the resources
   Online.
Outline
• Why FT and Why Clusters
• Cluster Abstractions
• Cluster Architecture
• Cluster Implementation
• Application Support
• Q&A
Process Structure
• Cluster Service contains:
  - Failover Manager
  - Cluster Registry
  - Global Update
  - Quorum
  - Membership
• Resource Monitor: a separate service process
  - communicates with the Cluster Service via private calls
  - loads Resource DLLs, which make private calls to the resources
• Resources
  - services
  - applications
Resource Control
• Commands (from the Cluster Service to the Resource Monitor):
  - CreateResource()
  - OnlineResource()
  - OfflineResource()
  - TerminateResource()
  - CloseResource()
  - ShutdownProcess()
• And resource events (back from the Resource Monitor).
(Diagram: the Cluster Service makes private calls to the Resource
Monitor, whose DLLs make private calls to the resources.)
Resource DLLs
• Calls to a Resource DLL:
  - Open: get handle
  - Online: start offering service
  - Offline: stop offering service
    • as a standby, or
    • pair-is offline
  - LooksAlive: quick check
  - IsAlive: thorough check
  - Terminate: forceful Offline
  - Close: release handle
• The Resource Monitor makes these standard calls into the DLL;
  the DLL makes private calls to the resource.
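A minimal sketch of the shape of a resource DLL (types and names are simplified for illustration; the real entry points and function table are defined in the MSCS developer kit headers): each entry point from the list above, for a resource that just tracks whether it is offering service.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int is_online; } Resource;

static Resource *res_open(Resource *storage) {           /* Open */
    storage->is_online = 0;
    return storage;                                      /* the handle */
}
static void res_online(Resource *r)      { r->is_online = 1; }
static void res_offline(Resource *r)     { r->is_online = 0; }
static void res_terminate(Resource *r)   { r->is_online = 0; } /* forceful */
static int  res_looks_alive(Resource *r) { return r->is_online; } /* quick */
static int  res_is_alive(Resource *r)    { return r->is_online; } /* thorough */
static void res_close(Resource *r)       { (void)r; }    /* release handle */
```

A real DLL would make LooksAlive cheap (e.g. check a flag) and IsAlive thorough (e.g. run a test transaction), as the slide's quick/thorough distinction suggests.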
Cluster Communications
• Most communication is via DCOM/RPC.
• UDP is used for membership heartbeat messages.
• Standard (e.g. Ethernet) interconnects.
(Diagram: management apps talk to the Cluster Services via DCOM; the
Cluster Services talk to each other via DCOM/RPC for administration and
UDP for heartbeats, and to their Resource Monitors via DCOM/RPC.)
Outline
• Why FT and Why Clusters
• Cluster Abstractions
• Cluster Architecture
• Cluster Implementation
• Application Support
• Q&A
Application Support
• Virtual Servers
• Generic Resource DLLs
• Resource DLL VC++ Wizard
• Cluster API
Virtual Servers
• Problem: client and server applications do not want the node name to
  change when a server app moves to another node.
• A Virtual Server simulates an NT node:
  - Resource Group (name, disks, databases, …)
  - NetName and IP address
    (node \\a keeps its name and IP address as it moves)
  - Virtual Registry (registry “moves” (is replicated))
  - Virtual Service Control
  - Virtual RPC service
• Challenges:
  - Limit the app to the virtual server’s devices and services.
  - Client reconnect on failover (easy if connectionless, e.g. web clients).
(Diagram: Virtual Server \\a:1.2.3.4 keeps the same name and address on
either node.)
Virtual Servers (before failover)
• Nodes \\Y and \\Z support virtual servers \\A and \\B.
• Things that need to fail over transparently:
  - client connection
  - server dependencies
  - service names
  - binding to local resources
  - binding to local servers
(Diagram: “SAP on A” runs on \\Y as virtual server \\A with SAP, SQL, and
drive S:\; “SAP on B” runs on \\Z as virtual server \\B with SAP, SQL, and
drive T:\.)
Virtual Servers (just after failover)
• \\Y’s resources and groups (i.e. Virtual Server \\A) moved to \\Z.
• A’s resources bind to each other and to local resources, e.g.:
  - registry
  - physical resources
  - security domain
  - time
• E.g. time must remain monotonic after failover.
• Transactions are used to make DB state consistent.
• To “work”, local resources on \\Y and \\Z have to be similar.
Address Failover and Client Reconnection
• Name and address rebind to the new node
  - details later
• Clients reconnect
  - failure is not transparent
  - must log on again
  - client context is lost (encourages connectionless protocols)
  - applications could maintain context
Mapping Local References to Group-Relative References
• Send client requests to the correct server:
  - \\A\SAP refers to \\.\SQL
  - \\B\SAP refers to \\.\SQL
  - Must remap references:
    • \\A\SAP to \\.\SQL$A
    • \\B\SAP to \\.\SQL$B
• Also handles namespace collision.
• Done via
  - modifying server apps, or
  - DLLs to transparently rename.
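The rename trick amounts to appending the virtual-server suffix to local service names (a sketch; the `$A`/`$B` convention follows the slide's example):

```c
#include <stdio.h>
#include <string.h>

/* Rewrite a local service name like "\\.\SQL" into a group-relative one
 * like "\\.\SQL$A", so two virtual servers on one node don't collide. */
static void remap_local_name(const char *local, char vserver,
                             char *out, size_t outlen) {
    snprintf(out, outlen, "%s$%c", local, vserver);
}
```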
Naming and Binding and Failover
• Services rely on the NT node name and/or IP address to advertise
  shares, printers, and services.
  - Applications register names to advertise services.
  - Example: \\Alice\SQL (i.e. <node><service>)
  - Example: 128.2.2.2:80 (= http://www.foo.com/)
• Binding
  - Clients bind to an address (e.g. name -> IP address).
• Thus the node name and IP address must fail over along with the
  services (preserving client bindings).
Client to Cluster Communications
IP address mobility based on MAC rebinding:
• An IP address rebinds to the failover node’s MAC address.
• Transparent to client and server.
• Low-level ARP (address resolution protocol) rebinds the IP address to
  the new MAC address.
• Cluster clients
  - must use IP (TCP, UDP, NBT, ...)
  - must reconnect or retry after failure
• Cluster servers
  - all cluster nodes must be on the same LAN segment
(Diagram: a WAN client reaches the local network through a router whose
ARP table maps Alice <-> 200.110.120.4, Virtual Alice <-> 200.110.120.5,
Betty <-> 200.110.120.6, Virtual Betty <-> 200.110.120.7; the virtual
addresses rebind to AliceMAC or BettyMAC on failover.)
Time
• Time must increase monotonically
  - otherwise applications get confused
  - e.g. make/nmake/build
• Time is maintained within failover resolution
  - not hard, since failover is on the order of seconds
• Time is a resource, so one node owns the time resource
• Other nodes periodically correct drift from the owner’s time
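One way to keep application-visible time monotonic across a failover (a sketch of the requirement, not the NT time service): never report a time earlier than the last one reported, even if the new node's clock is slightly behind.

```c
#include <assert.h>

/* Clamp a possibly-backward clock reading so callers always see
 * monotonically non-decreasing time (e.g. across a failover to a node
 * whose clock drifted behind the old owner's). */
static long last_reported = 0;

static long monotonic_now(long raw_clock) {
    if (raw_clock < last_reported)
        raw_clock = last_reported;   /* hold until the clock catches up */
    last_reported = raw_clock;
    return raw_clock;
}
```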
Application Local NT Registry Checkpointing
• Resources can request that local NT registry subtrees be replicated.
• Changes are written out to the quorum device
  - uses the registry change notification interface
• Changes are read and applied on fail-over.
(Diagram: the registry subtree for virtual server \\A is checkpointed to
the quorum device and restored when \\A moves to another node.)
(Screenshot slide: registry replication.)
Application Support
• Virtual Servers
• Generic Resource DLLs
• Resource DLL VC++ Wizard
• Cluster API
Generic Resource DLLs
• Generic Application DLL
  - simplest: just starts and stops the application, and makes sure the
    process is alive
• Generic Service DLL
  - translates DLL calls into equivalent NT Server calls
    • Online => Service Start
    • Offline => Service Stop
    • Looks/IsAlive => Service Status
(Screenshot slides: Generic Application and Generic Service resource setup.)
Application Support
• Virtual Servers
• Generic Resource DLLs
• Resource DLL VC++ Wizard
• Cluster API
Resource DLL VC++ Wizard
• Asks for the resource type name.
• Asks for an optional service to control.
• Asks for other parameters (and associated types).
• Generates DLL source code.
• Source can be modified as necessary
  - e.g. additional checks for Looks/IsAlive.
(Screenshot slides: creating a new workspace, specifying the resource type
name, specifying resource parameters, automatic code generation, and
customizing the code.)
Application Support
• Virtual Servers
• Generic Resource DLLs
• Resource DLL VC++ Wizard
• Cluster API
Cluster API
• Allows resources to:
  - examine dependencies
  - manage per-resource data
  - change parameters (e.g. failover)
  - listen for cluster events
  - etc.
• Specs & API became public Sept 1996.
• On all MSDN Level 3.
• On the web site:
  - http://www.microsoft.com/clustering.htm
(Screenshot slide: Cluster API documentation.)
Outline
• Why FT and Why Clusters
• Cluster Abstractions
• Cluster Architecture
• Cluster Implementation
• Application Support
• Q&A
Research Topics?
• Even easier to manage
• Transparent failover
• Instant failover
• Geographic distribution (disaster tolerance)
• Server pools (load-balanced pool of processes)
• Process pair (active/backup process)
• 10,000 nodes?
• Better algorithms
• Shared memory or shared disk among nodes: a truly bad idea?
References
Microsoft NT site: http://www.microsoft.com/ntserver/
BARC site (e.g. these slides): http://research.microsoft.com/~joebar/wolfpack
Inside Windows NT, H. Custer, Microsoft Press, ISBN 155615481.
Tandem Global Update Protocol, R. Carr, Tandem Systems Review, V1.2, 1985.
Sketches the regroup and global update protocols.
VAXclusters: a Closely Coupled Distributed System, Kronenberg, N., Levy, H.,
Strecker, W., ACM TOCS, V4.2, 1986. A (the) shared-disk cluster.
In Search of Clusters: The Coming Battle in Lowly Parallel Computing,
Gregory F. Pfister, Prentice Hall, 1995, ISBN 0134376250. Argues for
shared-nothing.
Transaction Processing: Concepts and Techniques, Gray, J., Reuter, A.,
Morgan Kaufmann, 1994, ISBN 1558601902. Survey of outages and transaction
techniques.