Lampson Sturgis Fault Model

Jim Gray
Microsoft, Gray@Microsoft.com
Andreas Reuter
International University, Andreas.Reuter@i-u.de
        Mon         Tue            Wed           Thur            Fri
 9:00   Overview    TP mons        Log           Files & Buffers B-tree
11:00   Faults      Lock Theory    ResMgr        COM+            Access Paths
 1:30   Tolerance   Lock Techniq   CICS & Inet   Corba           Groupware
 3:30   T Models    Queues         Adv TM        Replication     Benchmark
 7:00   Party       Workflow       Cyberbrick    Party
Rationale
Fault Tolerance Needs a Fault Model
What do you tolerate?
Fault tolerance needs a fault model.
Model needs to be simple enough to understand.
With a model,
can design hardware/software to tolerate the faults.
can make statements about the system behavior.
Byzantine Fault Model
Some modules are fault free (during the period of interest).
Other modules may fail (in the worst way).
Make statements about the behavior of the fault-free modules.
Synchronous
All operations happen within a time limit.
Asynchronous:
No time limit on anything,
No lost messages.
Timed: (used here)
Notion of timeout and retry
Key result: N modules can tolerate fewer than N/3 Byzantine faults (e.g., four modules are needed to tolerate one fault).
Lampson Sturgis Model
Processes:
    Correct: Execute a program at a finite rate.
    Fault:   Reset to null state and "stop" for a finite time.
Messages:
    Correct: Eventually arrive and are correct.
    Fault:   Lost, duplicated, or corrupted.
Storage:
    Correct: Read(x) returns the most recent value of x.
             Write(x, v) sets the value of x to v.
    Fault:   All pages reset to null.
             A page resets to null.
             Read or Write operate on the wrong page.
Other faults (called disasters) are not dealt with.
Assumption: Disasters are rare.
Byzantine vs. Lampson-Sturgis Fault Models
Connections unclear.
Byzantine focuses on bounded-time bounded-faults
(real-time systems)
asynchronous (mostly) or
synchronous (real time)
Lampson/Sturgis focuses on long-term behavior
no time or fault limits
time and timeout heavily used to detect faults
Roadmap of What's Coming
• Lampson-Sturgis Fault Model
• Building highly available processes, messages, and storage from faulty components.
• Process pairs give quick repair.
• Kinds of process pairs:
  – Checkpoint / Restart based on storage
  – Checkpoint / Restart based on messages
  – Restart based on transactions (easy to program).
Model of Storage and its Faults
System has several stores (discs).
Each has a set of pages.
Stores fail independently.
Verbs (on a store of pages):
    status = store_write(store, address, value)
    status = store_read (store, address, value)
Fault rates:
    probability a write has no effect: 1 in a million
    mean time to a page fail: a few days
    mean time to a disc fail: a few years
    wild reads/writes are modeled as a page fail.
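The later slides assume declarations roughly like the following. This is a minimal sketch for orientation only: the struct layout, the MAXSTORES / MAXSTORE / VSIZE sizes, and the pwf constant are assumptions consistent with the fault rates above, not the book's exact code (the slides pass the store itself to the verbs; a pointer is used here so the sketch stands alone).

#define MAXSTORES 100                    /* stores (discs) in the system (illustrative) */
#define MAXSTORE  100000                 /* pages per store (illustrative)              */
#define VSIZE     8192                   /* bytes in a page value (illustrative)        */
#define pwf       1E-6                   /* probability a write has no effect           */
typedef char avalue[VSIZE];                          /* a page value                    */
typedef struct { Boolean status;                     /* TRUE if the page is valid       */
                 avalue  value;   } apage;           /* a page = status + value         */
typedef struct { Boolean status;                     /* TRUE if the store is up         */
                 apage   page[MAXSTORE]; } astore;   /* a store = an array of pages     */
astore stores[MAXSTORES];                            /* the system's stores             */

Boolean store_write(astore *store, Ulong addr, avalue value)    /* write one page       */
{ if (addr >= MAXSTORE || !store->status) return FALSE;   /* bad address or store down  */
  if (randf() >= pwf)                                /* usually the write takes effect  */
    { copy(store->page[addr].value, value, VSIZE);   /* copy value into the page        */
      store->page[addr].status = TRUE; }             /* and mark the page valid         */
  return TRUE;                                       /* (a lost write is not reported)  */
};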
Storage Decay (the demon)
/* There is one store_decay process for each store in the system                       */
#define mttvf 7E5                        /* mean time (sec) to a page fail, a few days  */
#define mttsf 1E8                        /* mean time (sec) to a disc fail, a few years */
void store_decay(astore store)                            /*                            */
{ Ulong addr;                                             /* the random places that will decay */
  Ulong page_fail  = time() + mttvf*randf();              /* time to next page decay    */
  Ulong store_fail = time() + mttsf*randf();              /* time to next store decay   */
  while (TRUE)                                            /* repeat this loop forever   */
  { wait(min(page_fail,store_fail) - time());             /* wait for the next event    */
    if (time() >= page_fail)                              /* if the event is a page decay */
      { addr = randf()*MAXSTORE;                          /* pick a random address      */
        store.page[addr].status = FALSE;                  /* set it invalid             */
        page_fail = time() - log(randf())*mttvf;          /* pick next fault time:      */
      };                                                  /* negative exp distributed, mean mttvf */
    if (time() >= store_fail)                             /* if the event is a storage fault */
      { store.status = FALSE;                             /* mark the store as broken   */
        for (addr = 0; addr < MAXSTORE; addr++)           /* invalidate all pages       */
          store.page[addr].status = FALSE;                /*                            */
        store_fail = time() - log(randf())*mttsf;         /* pick next fault time:      */
      };                                                  /* negative exp distributed, mean mttsf */
  };                                                      /* end of endless while loop  */
};                                                        /*                            */
Simulates (specifies) the system's storage decay behavior: page decay and store failure.
Reliable Write: Write all members of an N-plex set.
#define nplex 2                          /* code works for n>2, but do duplex           */
Boolean reliable_write(Ulong group, address addr, avalue value)      /*                 */
{ Ulong   i;                             /* index on elements of store group            */
  Boolean status = FALSE;                /* true if any write worked                    */
                                         /* each group uses nplex stores                */
  for (i = 0; i < nplex; i++ )           /* write each store in the group               */
    { status = store_write(stores[group*nplex+i],addr,value)   /* write this copy       */
               || status;                /* status indicates if any write worked        */
    };                                   /* loop to write all stores of the group       */
  return status;                         /* return indicates if ANY write worked        */
};                                       /*                                             */
(The store_write call is evaluated before the OR so that short-circuit evaluation cannot skip a copy.)
Reliable Read: read all members of an N-plex set
Problems:
All fail: Disaster
Ambiguity: (N-different answers)
Take majority
Take "newest"
Ulong version(avalue);                   /* returns version of a value                  */
/* read an n-plex group to find the most recent version of a page                      */
Boolean reliable_read(Ulong group, address addr, avalue value)       /*                 */
{ Ulong   i;                             /* index on store group                        */
  Boolean gotone = FALSE;                /* flag says had a good read                   */
  Boolean bad    = FALSE;                /* bad says group needs repair                 */
  avalue  next;                          /* next value that is read                     */
  Boolean status;                        /* read ok                                     */
  for (i = 0; i < nplex; i++ )           /* for each page in the nplex set              */
  { status = store_read(stores[group*nplex+i],addr,next);        /* read value          */
    if (! status ) bad = TRUE;           /* if status bad, ignore value                 */
    else                                 /* have a good read                            */
      if (! gotone)                      /* if it is the first good value               */
        { copy(value,next,VSIZE); gotone = TRUE; }               /* make it best value  */
      else if ( version(next) != version(value) )                /* if new val, compare */
        { bad = TRUE;                    /* if different, repair needed                 */
          if (version(next) > version(value))                    /* if new is best version */
            copy(value, next, VSIZE);    /* copy it to best value                       */
        };
  };                                     /* end of read all copies                      */
  if (! gotone) return FALSE;            /* disaster, no good pages                     */
  if (bad) reliable_write(group,addr,value);                     /* repair any bad pages */
  return TRUE;                           /* success                                     */
};                                       /*                                             */
Reliable read: on a bad read, rewrite the group with the best value.
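A hypothetical usage sketch (not from the slides): group 0 is a duplexed pair of stores; the caller writes a versioned value at page address 5 and reads it back. stamp_version() is an imagined helper that records a version number in the value so reliable_read can pick the newest copy.

{ avalue v, r;                           /* a value to write and a read buffer          */
  Ulong  group = 0;  address addr = 5;   /* a duplex group and a page address in it     */
  stamp_version(v, 7);                   /* hypothetical: mark v as version 7           */
  if (! reliable_write(group, addr, v))  /* write both copies                           */
    panic();                             /* neither copy accepted the write: disaster   */
  if (! reliable_read (group, addr, r))  /* read the copies, repairing as needed        */
    panic();                             /* no readable copy: disaster                  */
};                                       /* r now holds the newest surviving copy       */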
Background Store Repair Process
/* repair the broken pages in an n-plex group.                                          */
/* Group is in 0,...,(MAXSTORES/nplex)-1                                                */
void store_repair(Ulong group)           /*                                             */
{ int    i;                              /* next address to be repaired                 */
  avalue value;                          /* buffer holds value to be read               */
  while (TRUE)                           /* do forever                                  */
  { for (i = 0; i < MAXSTORE; i++)       /* for each page in the store                  */
    { wait(1);                           /* wait a second                               */
      reliable_read(group,i,value);      /* a reliable read repairs the page            */
    }; }; };                             /* if the copies do not match                  */
This "data scrubber" is needed to minimize the chance of all N copies failing.
Repair is important: a reliable read of a bad page rewrites it with the best value.
Optimistic Reads
Most implementations do optimistic reads:
read only one value.
Boolean optimistic_read(Ulong group, address addr, avalue value)     /*                 */
{ if (group >= MAXSTORES/nplex) return FALSE;       /* return false if bad address      */
  if (store_read(stores[nplex*group],addr,value))   /* read one value                   */
    return TRUE;                         /* and if that is ok return it as the true value */
  else                                   /* if reading one value returned bad then      */
    return (reliable_read(group,addr,value));       /* n-plex read & repair             */
};                                       /*                                             */
This is dangerous (especially without a repair process): the other copies are never
examined, so their decay goes undetected until the first copy fails as well.
Storage Fault Summary
• Simple fault model.
• Allows discussion/specification of fault tolerance.
• Uncovers some problems in many implementations:
  • Ambiguous reads
  • Repair process
  • Optimistic reads
Process Fault Model
• Process executes a program and has state.
• Program causes state change plus: send/get message.
• Process fails by stopping (for a while) and then
resetting its data and message state.
[Figure: a sender process (program + data) appends a new message to the queue of input messages of a receiver process (program + data); each queued message carries a status and a value.]
Process Fault Model: The Break/Fix loop
#define MAXPROCESS MANY                  /* the system will have many processes         */
typedef Ulong processid;                 /* process id is an integer index into array   */
typedef struct {char program[MANY/2]; char data[MANY/2];} state;  /* program + data     */
struct { state     initial;              /* process initial state                       */
         state     current;              /* value of the process state                  */
         amessagep messages;             /* queue of messages waiting for process       */
       } process[MAXPROCESS];            /*                                             */

/* Process decay: execute a process and occasionally inject faults into it              */
#define mttpf 1E7                        /* mean time to process failure ~ 4 months     */
#define mttpr 1E4                        /* mean time to repair ~ 3 hours               */
void process_execution(processid pid)    /*                                             */
{ Ulong   proc_fail;                     /* time of the next process fault              */
  Ulong   proc_repair;                   /* delay until the process is repaired         */
  avalue  msg;                           /* buffer for discarded messages               */
  Boolean status;                        /* status of a discarded message               */
  while (TRUE)                           /* the work, break, fix loop                   */
  { proc_fail   = time() - log(randf())*mttpf;    /* the time of the next fail          */
    proc_repair = -log(randf())*mttpr;            /* the delay until repair             */
    while (time() < proc_fail)                    /* (work) execute for about 4 months  */
      { execute(process[pid].current); };         /*                                    */
    (void) wait(proc_repair);                     /* (break) fail! wait ~3 hrs for repair */
    copy(process[pid].current,process[pid].initial,MANY);   /* (fix) reset the state    */
    while (message_get(msg,&status)) {};          /* read and discard all msgs in queue */
  };                                              /* bottom of the work, break, fix loop */
};                                                /*                                    */
Checkpoint/Restart Process (Storage based)
/* A checkpoint-restart process: a server generating unique sequence numbers            */
checkpoint_restart_process()             /*                                             */
{ Ulong disc = 0;                        /* a reliable storage group with the state     */
  Ulong address[2] = {0,1};              /* page addresses of the two states on disc    */
  Ulong old;                             /* index of the disc page with the old state   */
  struct { Ulong ticketno;               /* process reads its state from disc;          */
           char  filler[VSIZE];          /* the newest state has the max ticket number  */
         } value[2];                     /* current state kept in value[0]              */
  struct { processid him;                /* contains requesting process id              */
           char  filler[VSIZE];          /* reply (ticket number) sent to that process  */
         } msg;                          /* buffer to hold input message                */
  /* Restart logic: recover the ticket number from persistent storage                   */
  for (old = 0; old <= 1; old++)         /* read the two states from disc               */
    { if (! reliable_read(disc,address[old],value[old]))        /* if a read fails      */
        panic(); };                      /* then failfast                               */
  if (value[1].ticketno < value[0].ticketno) old = 1;           /* pick max seq number  */
  else { old = 0; copy(value[0],value[1],VSIZE); };             /* which is the old value */
  /* Processing logic: generate the next number, checkpoint it, and reply               */
  while (TRUE)                           /* do forever                                  */
  { while (! get_msg(&msg)) {};          /* get the next request for a ticket number    */
    value[0].ticketno = value[0].ticketno + 1;                  /* bump the ticket number */
    if (! reliable_write(disc,address[old],value[0])) panic();  /* checkpoint to disc   */
    old = (old + 1) % 2;                 /* use the other disc page for state next time */
    message_send(msg.him, value[0]);     /* send the ticket number to the client        */
  };                                     /* endless loop to get messages                */
};                                       /*                                             */
[Figure: at restart, get the ticket number from disc; then loop: get request, bump the ticket number, save it to disc, send it to the client.]
Process Pairs (message-based checkpoints)
[Figure: client processes send "Give me a ticket" requests to the primary server process, which holds the next ticket number and replies with a ticket number. The primary sends state checkpoint messages and "I'm Alive" messages to the backup server process, which also holds the next ticket number.]
Problem             Solution
Detect failure      "I'm Alive" message timeout (no "real" solution)
Continuation        Checkpoint messages
Startup             Backup waits for primary
Process Pairs (message-based checkpoints)
[Flowchart: Primary loop: wait a second; if there is any input (requests), read it, compute the new state, send the new state to the backup, and reply; otherwise send an "I'm alive" message to the backup. Backup loop: wait a second; if any input (a new state) arrived in the last second, read it and, if it is a newer state, set my state to the new state; if nothing arrived and I am the default primary, restart as primary: broadcast "I'm Primary" and reply to the last request.]
• Primary in tight loop sending "I'm alive" or state change
messages to backup
• Backup thinks primary dead if no messages in previous second.
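A minimal sketch of the backup's side of this loop, in the style of the earlier pseudo-C. It assumes the message verbs from the later slides; apply_checkpoint(), i_am_default_primary(), broadcast_i_am_primary(), reply_to_last_request(), and primary_loop() are hypothetical helpers standing in for the logic the flowchart names.

void backup_loop(void)                   /* backup half of a process pair               */
{ avalue  msg;                           /* checkpoint or "I'm alive" message           */
  Boolean ok;                            /* message status                              */
  Ulong   last_heard = time();           /* time of the last message from the primary   */
  while (TRUE)                           /* loop until takeover                         */
  { wait(1);                             /* wait a second                               */
    while (message_get(msg,&ok))         /* drain checkpoint / "I'm alive" messages     */
      { if (ok) { last_heard = time();   /* the primary is alive                        */
                  apply_checkpoint(msg); } };  /* hypothetical: install a newer state   */
    if (time() - last_heard > 1 &&       /* silent for more than a second and           */
        i_am_default_primary())          /* I am the designated survivor:               */
      { broadcast_i_am_primary();        /* hypothetical: tell clients and sessions     */
        reply_to_last_request();         /* resend the most recent reply                */
        primary_loop();                  /* and carry on as the primary                 */
        return; };
  };
};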
What We Have Done So Far
Converted "faulty" processes to reliable ones.
Tolerate hardware and some software faults
Can repair in seconds or milli-seconds.
Unlike checkpoint restart: No process creation/setup time
No client reconnect time.
Operating systems are beginning to provide process pairs.
Stateless process pairs can use transactional servers to
Store their state
Clean up the mess at takeover.
Like storage-based checkpoint/restart
except process setup/connection is instant.
Persistent process pairs
persistent_process()                     /* prototypical persistent process             */
{ wait_to_be_primary();                  /* wait to be told you are primary             */
  while (TRUE)                           /* when primary, do forever                    */
  { begin_work();                        /* start a transaction or subtransaction       */
    read_request();                      /* read a request                              */
    doit();                              /* perform the desired function                */
    reply();                             /* reply                                       */
    commit_work();                       /* finish the transaction or subtransaction    */
  };                                     /* did a step, now get the next request        */
};                                       /*                                             */
Persistent Process Pairs
The ticket server redone as a transactional server.
/* A transactional persistent server process generating unique tickets                  */
persistent_ticket_server()               /* current state kept in an SQL database       */
{ int ticketno;                          /* next ticket number (from the database)      */
  struct { processid him;                /* contains requesting process id              */
           char filler[VSIZE];           /* reply (ticket number) sent to that address  */
         } msg;                          /* buffer to hold input message                */
  /* Restart logic: recover the ticket number from persistent storage                   */
  wait_to_be_primary();                  /* wait to be told you are primary             */
  /* Processing logic: generate the next number, checkpoint it, and reply               */
  while (TRUE)                           /* do forever                                  */
  { begin_work();                        /* begin a transaction                         */
    while (! get_msg(&msg)) {};          /* get the next request for a ticket           */
    exec sql update ticket               /* increment the next ticket number            */
             set ticketno = ticketno + 1;/*                                             */
    exec sql select max(ticketno)        /* fetch the current ticket number             */
             into :ticketno              /* into a program local variable               */
             from ticket;                /* from the SQL database                       */
    commit_work();                       /* commit the transaction                      */
    message_send(msg.him, ticketno);     /* send the ticket number to the client        */
  };                                     /* endless loop to get messages                */
};                                       /*                                             */
Messages: Fault Model
Each process has a queue of incoming messages.
Messages can be
    corrupted: a checksum detects it.
    duplicated: a sequence number detects it.
    delayed arbitrarily long (ack + retransmit).
    lost (ack + retransmit + sequence number).
These techniques give messages fail-fast semantics (see the session sketch below).
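A minimal sketch of the receiving side of such a session, assuming an in / acknowledged / out counter record like the one in the figures on the next slides; checksum() and send_ack() are hypothetical helpers, and the exact field names are illustrative.

typedef struct { Ulong in;               /* sequence number of the last message received */
                 Ulong acknowledged;     /* last of our messages the peer has acked      */
                 Ulong out;              /* sequence number of the last message sent     */
               } session;                /* one session between two processes            */

/* Accept a received message only if it is intact and in sequence; a corrupt,            */
/* duplicate, or out-of-order message is treated exactly like a lost one.                */
Boolean session_accept(session *s, Ulong seq, Ulong cksum, avalue value)
{ if (cksum != checksum(value)) return FALSE;   /* corrupt => looks like a lost message  */
  if (seq <= s->in)             return FALSE;   /* duplicate => discard it               */
  if (seq != s->in + 1)         return FALSE;   /* gap => an earlier message was lost    */
  s->in = seq;                                  /* advance the session counter           */
  send_ack(s, seq);                             /* hypothetical: acknowledge receipt     */
  return TRUE;                                  /* deliver the message to the process    */
};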
Message Verbs: SEND
/* send a message to a process: returns true if the process exists                      */
Boolean message_send(processid him, avalue value)   /*                                  */
{ amessagep it;                          /* pointer to message created by this call     */
  amessagep queue;                       /* pointer to process message queue            */
  if (him > MAXPROCESS) return FALSE;    /* test for valid process                      */
loop:                                    /* (return here to duplicate the message)      */
  it = malloc(sizeof(amessage));         /* allocate space to hold message              */
  it->status = TRUE; it->next = NULL;    /* and fill in the fields                      */
  copy(it->value,value,VSIZE);           /* copy msg data to message body               */
  queue = process[him].messages;         /* look at process message queue               */
  if (queue == NULL) process[him].messages = it;    /* if empty, place message at queue head */
  else                                   /* else place the message at the queue end     */
    { while (queue->next != NULL) queue = queue->next;          /*                      */
      queue->next = it; }                /*                                             */
  if (randf() < pmf) it->status = FALSE; /* sometimes the message is corrupted          */
  if (randf() < pmd) goto loop;          /* sometimes the message is duplicated         */
  return TRUE;                           /*                                             */
};                                       /*                                             */
Message Verbs: GET
/* get the next input message of this process: returns true if a message                */
Boolean message_get(avalue * valuep, Boolean * msg_status)   /*                         */
{ processid me = MyPID();                /* caller's process number                     */
  amessagep it;                          /* pointer to input message                    */
  it = process[me].messages;             /* find caller's message input queue           */
  if (it == NULL) return FALSE;          /* return false if queue is empty              */
  process[me].messages = it->next;       /* take the first message off the queue        */
  *msg_status = it->status;              /* record its status                           */
  copy(valuep,it->value,VSIZE);          /* value = it->value                           */
  free(it);                              /* deallocate its space                        */
  return TRUE;                           /* return status to caller                     */
};                                       /*                                             */
Sessions Make Messages FailFast
[Figure: two processes connected by a session; each end keeps out / acknowledged / in counters. The sender (out 7, acknowledged 6) sends message 7 across the session; the receiver's in counter advances from 6 to 7 and it returns "ack 7"; on receipt of the ack, the sender's acknowledged counter advances from 6 to 7.]
CRC makes corrupt look like lost message
Sequence numbers detect duplicates => lost message
So, only failure is lost message
Timeout/retransmit masks lost messages. => Only failure is delay.
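A minimal sketch of the sending side under these rules, reusing the session record sketched earlier; pack(), wait_for_acks(), and RETRY_TIMEOUT are hypothetical, and real code would piggyback acks and window more than one outstanding message.

#define RETRY_TIMEOUT 1                  /* seconds to wait before retransmitting (illustrative) */
/* Send one message on a session, retransmitting until the peer acknowledges it.         */
/* Corruption, duplication, and loss are all masked; the only remaining failure is delay. */
void session_send(session *s, processid him, avalue value)
{ avalue packet;                         /* sequence number + checksum + body            */
  s->out = s->out + 1;                   /* give the message the next sequence number    */
  pack(packet, s->out, checksum(value), value);   /* hypothetical: build the packet      */
  while (s->acknowledged < s->out)       /* until the peer has acknowledged it           */
  { message_send(him, packet);           /* (re)transmit the packet                      */
    wait_for_acks(s, RETRY_TIMEOUT);     /* hypothetical: absorb acks for up to          */
  };                                     /*   RETRY_TIMEOUT seconds                      */
};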
Sessions Plus Process Pairs
Give Highly Available Messages
[Figure: a client process connected by a session to a primary/backup process pair; each party keeps out / acknowledged / in counters. The client sends message 7 to the primary; the primary checkpoints the message and the session counters to the backup, then sends "ack 7" to the client and advances its counters. Because the backup holds the checkpointed counters, it can resume the session if the primary fails.]
Checkpoint messages and sequence numbers to backup
Backup resumes session if primary fails.
Backup broadcasts new identity at takeover (see book for code).
Highly Available Message Verbs
[Figure: application programs call reliable_send_msg() and reliable_get_msg(); a Listener process manages the output message session, the input message session, and the queue of acknowledged input messages.]
Hide under reliable get/send msg:
  – sequence numbers
  – ack / retransmit logic
  – checkpoint
  – process pair takeover
  – resend of the most recent reply
Uses a Listener process (thread) to do all this async work
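A sketch of how the send verb might sit on top of the earlier pieces; the session argument, checkpoint_session_to_backup(), and the ordering shown are assumptions, not the slides' code (the book gives the real version).

Boolean reliable_send_msg(session *s, processid him, avalue value)   /* application-level verb */
{ checkpoint_session_to_backup(s);       /* hypothetical: backup learns the counters first,     */
                                         /* so it can resume the session at takeover            */
  session_send(s, him, value);           /* sequence number, CRC, ack / retransmit (see 3:25)   */
  return TRUE;                           /*                                                     */
};

In this sketch the checkpoint precedes the send, so the backup never hears an acknowledgment for a sequence number it has not yet recorded.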
Summary
Went from faulty storage, processes, messages
to fault tolerant versions of each.
Simple fault model explains many techniques used
(and mis-used) in FT systems.