Lampson-Sturgis Fault Model

Jim Gray, Microsoft, Gray@Microsoft.com
Andreas Reuter, International University, Andreas.Reuter@i-u.de

          Mon         Tue           Wed          Thur           Fri
 9:00     Overview    TP mons       Log          Files&Buffers  B-tree
11:00     Faults      Lock Theory   ResMgr       COM+           Access Paths
 1:30     Tolerance   Lock Techniq  CICS & Inet  Corba          Groupware
 3:30     T Models    Queues        Adv TM       Replication    Benchmark
 7:00     Party       Workflow      Cyberbrick   Party

Gray & Reuter FT 3: 1

Rationale: Fault Tolerance Needs a Fault Model

What do you tolerate?  Fault tolerance needs a fault model.
The model needs to be simple enough to understand.
With a model, one can:
  design hardware/software to tolerate the faults,
  make statements about the system behavior.

Gray & Reuter FT 3: 2

Byzantine Fault Model

Some modules are fault-free (during the period of interest).
Other modules may fail (in the worst way).
Make statements about the fault-free modules' behavior.
  Synchronous:   all operations happen within a time limit.
  Asynchronous:  no time limit on anything; no lost messages.
  Timed (used here):  notion of timeout and retry.
Key result: N modules can tolerate N/3 faults.

Gray & Reuter FT 3: 3

Lampson-Sturgis Model

Processes:
  Correct:  execute a program at a finite rate.
  Fault:    reset to null state and "stop" for a finite time.
Messages:
  Correct:  eventually arrive and are correct.
  Fault:    lost, duplicated, or corrupted.
Storage:
  Correct:  Read(x) returns the most recent value of x;
            Write(x, v) sets the value of x to v.
  Fault:    all pages reset to null; a page resets to null;
            read or write operates on the wrong page.
Other faults (called disasters) are not dealt with.
Assumption: disasters are rare.

Gray & Reuter FT 3: 4

Byzantine vs. Lampson-Sturgis Fault Models

Connections unclear.
Byzantine focuses on:
  bounded time, bounded faults (real-time systems);
  asynchronous (mostly) or synchronous (real-time) operation.
Lampson/Sturgis focuses on:
  long-term behavior;
  no time or fault limits;
  time and timeout heavily used to detect faults.

Gray & Reuter FT 3: 5

Roadmap of What's Coming

• Lampson-Sturgis fault model.
• Building highly available processes, messages, and storage
  from faulty components.
• Process pairs give quick repair.
• Kinds of process pairs:
  – Checkpoint/restart based on storage.
  – Checkpoint/restart based on messages.
  – Restart based on transactions (easy to program).

Gray & Reuter FT 3: 6

Model of Storage and its Faults

System has several stores (discs).  Each has a set of pages.
Stores fail independently.
Interface:
  status = store_write(store, address, value)   /* write a page */
  status = store_read (store, address, value)   /* read a page  */
Fault rates:
  probability a write has no effect:  1 in a million
  mean time to a page fail:           a few days
  mean time to a disc fail:           a few years
  wild reads/writes are modeled as page fails.

Gray & Reuter FT 3: 7

Storage Decay (the demon)

/* There is one store_decay process for each store in the system        */
#define mttvf 7E5        /* mean time (sec) to a page fail, a few days  */
#define mttsf 1E8        /* mean time (sec) to a disc fail, a few years */
void store_decay(astore store)                /*                        */
{ Ulong addr;                     /* the random places that will decay  */
  Ulong page_fail  = time() + mttvf*randf();  /* time to next page decay */
  Ulong store_fail = time() + mttsf*randf();  /* time to next store decay */
  while (TRUE)                        /* repeat this loop forever       */
  { wait(min(page_fail, store_fail) - time());  /* wait for next event  */
    if (time() >= page_fail)          /* if the event is a page decay   */
    { addr = randf()*MAXSTORE;                 /* pick a random address */
      store.page[addr].status = FALSE;         /* set it invalid        */
      page_fail = time() - log(randf())*mttvf; /* pick next fault time  */
    };                       /* negative exp distributed, mean mttvf    */
    if (time() >= store_fail)        /* if the event is a storage fault */
    { store.status = FALSE;          /* mark the store as
broken */
      for (addr = 0; addr < MAXSTORE; addr++)  /* invalidate all pages  */
        store.page[addr].status = FALSE;       /*                       */
      store_fail = time() - log(randf())*mttsf; /* pick next fault time */
    };                       /* negative exp distributed, mean mttsf    */
  };                         /* end of endless while loop               */
};                           /*                                         */

This demon simulates (specifies) the system's page-decay and
store-failure behavior.

Gray & Reuter FT 3: 8

Reliable Write: write all members of an N-plex set

#define nplex 2              /* code works for n > 2, but do duplex     */
Boolean reliable_write(Ulong group, address addr, avalue value)  /*     */
{ Ulong   i;                 /* index on elements of store group        */
  Boolean status = FALSE;    /* true if any write worked                */
                             /* each group uses nplex stores            */
  for (i = 0; i < nplex; i++)  /* write each store in the group         */
  { status = store_write(stores[group*nplex+i], addr, value)  /*        */
             || status;      /* status indicates if any write worked    */
                             /* (write first, so || cannot skip it)     */
  };                         /* loop to write all stores of group       */
  return status;             /* return indicates if ANY write worked    */
};                           /*                                         */

Gray & Reuter FT 3: 9

Reliable Read: read all members of an N-plex set

Problems:
  All fail:  disaster.
  Ambiguity (N different answers):  take the majority, or take the "newest".

Ulong version(avalue);       /* returns version number of a value       */
/* read an n-plex group to find the most recent version of a page       */
Boolean reliable_read(Ulong group, address addr, avalue value)   /*     */
{ Ulong   i = 0;             /* index on store group                    */
  Boolean gotone = FALSE;    /* flag says had a good read               */
  Boolean bad    = FALSE;    /* bad says group needs repair             */
  avalue  next;              /* next value that is read                 */
  Boolean status;            /* read ok                                 */
  for (i = 0; i < nplex; i++)  /* for each page in the n-plex set       */
  { status = store_read(stores[group*nplex+i], addr, next); /* read value */
    if (!status) bad = TRUE; /* if status bad, ignore value             */
    else                     /* have a good read                        */
      if (!
gotone)                      /* if it is the first good value           */
      { copy(value, next, VSIZE); gotone = TRUE; } /* make it best value */
      else if (version(next) != version(value))  /* if new value, compare */
      { bad = TRUE;          /* if different, repair needed             */
        if (version(next) > version(value))  /* if new is best version  */
          copy(value, next, VSIZE);          /* copy it to best value   */
      };
  };                         /* end of read all copies                  */
  if (!gotone) return FALSE; /* disaster, no good pages                 */
  if (bad) reliable_write(group, addr, value);  /* repair any bad pages */
  return TRUE;               /* success                                 */
};                           /*                                         */

Reliable read: on a bad read, rewrite the page with the best value.

Gray & Reuter FT 3: 10

Background Store Repair Process

/* repair the broken pages in an n-plex group.                          */
/* group is in 0,...,(MAXSTORES/nplex)-1                                */
void store_repair(Ulong group)              /*                          */
{ int    i;                  /* next address to be repaired             */
  avalue value;              /* buffer holds value to be read           */
  while (TRUE)               /* do forever                              */
  { for (i = 0; i < MAXSTORE; i++)  /* for each page in the store       */
    { wait(1);               /* wait a second                           */
      reliable_read(group, i, value);  /* a reliable read repairs the   */
    };                       /* page if the copies do not match         */
  };
};

Repair is needed to minimize the chance of N simultaneous failures.
Repair is important: this is a "data scrubber".

Gray & Reuter FT 3: 11

Optimistic Reads

Most implementations do optimistic reads: read only one value.

Boolean optimistic_read(Ulong group, address addr, avalue value)  /*    */
{ if (group >= MAXSTORES/nplex) return FALSE; /* return false if bad addr */
  if (store_read(stores[nplex*group], addr, value))  /* read one value  */
    return TRUE;        /* and if that is ok, return it as the true value */
  else                  /* if reading one value returned bad, then      */
    return reliable_read(group, addr, value);  /* n-plex read & repair  */
};                           /*                                         */

This is dangerous (especially without repair).

Gray & Reuter FT 3: 12

Storage Fault Summary

• Simple fault model.
• Allows discussion/specification of fault tolerance.
• Uncovers problems in many implementations:
  • ambiguous reads,
  • missing repair process,
  • optimistic reads.
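The storage machinery of the last few slides can be condensed into a toy, in-memory model. Everything below (`Page`, `rel_write`, `rel_read`, int-sized pages) is a hypothetical simplification of the reliable_read/reliable_write pair, not the book's code: write stamps every copy with a version, read returns the newest good copy and repairs decayed or stale copies.

```c
/* Hypothetical in-memory sketch of a duplexed page group with versioned
   reliable read/write and repair (illustrative names, not the book's). */
#define NPLEX 2

typedef struct {
    int ok;       /* page status: 0 means the page decayed to null */
    int version;  /* version number stamped by each reliable write */
    int value;    /* page contents, one int for simplicity         */
} Page;

/* write every copy in the group; returns 1 if any copy was written */
int rel_write(Page group[], int version, int value) {
    int i, any = 0;
    for (i = 0; i < NPLEX; i++) {
        group[i].ok = 1;                    /* a write repairs the copy */
        group[i].version = version;
        group[i].value = value;
        any = 1;
    }
    return any;
}

/* read all copies, return newest good value, repair bad/stale copies */
int rel_read(Page group[], int *value) {
    int i, gotone = 0, bad = 0, best_v = 0, best = 0;
    for (i = 0; i < NPLEX; i++) {
        if (!group[i].ok) { bad = 1; continue; }   /* decayed: ignore  */
        if (!gotone) { best_v = group[i].version; best = group[i].value; }
        else if (group[i].version != best_v) {
            bad = 1;                               /* versions differ  */
            if (group[i].version > best_v)         /* keep the newest  */
            { best_v = group[i].version; best = group[i].value; }
        }
        gotone = 1;
    }
    if (!gotone) return 0;             /* all copies lost: disaster     */
    if (bad) rel_write(group, best_v, best); /* rewrite the best value  */
    *value = best;
    return 1;
}
```

As on the slides, the repair is a side effect of the read: any read that sees a decayed or stale copy rewrites the whole group with the best value.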
Gray & Reuter FT 3: 13

Process Fault Model

• A process executes a program and has state.
• The program causes state changes, plus send/get message.
• A process fails by stopping (for a while) and then resetting its data
  and message state.

[Figure: a sender process (program + data) puts a new message on the
queue of input messages of a receiver process (program + data); a get
returns a status and the next value.]

Gray & Reuter FT 3: 14

Process Fault Model: The Break/Fix Loop

#define MAXPROCESS MANY      /* the system will have many processes     */
typedef Ulong processid;     /* process id is an integer index into array */
typedef struct {char program[MANY/2]; char data[MANY/2];} state; /* program + data */
struct
{ state     initial;         /* process initial state                   */
  state     current;         /* value of the process state              */
  amessagep messages;        /* queue of messages waiting for process   */
} process[MAXPROCESS];       /*                                         */

/* Process decay: execute a process and occasionally inject faults into it */
#define mttpf 1E7            /* mean time to process failure ≈ 4 months */
#define mttpr 1E4            /* mean time to repair is 3 hours          */
void process_execution(processid pid)       /*                          */
{ Ulong     proc_fail;       /* time of next process fault              */
  Ulong     proc_repair;     /* time to repair process                  */
  amessagep msg, next;       /* pointers to process messages            */
  Boolean   status;          /* status of discarded messages            */
  while (TRUE)               /* global execution loop                   */
  { proc_fail   = time() - log(randf())*mttpf; /* the time of next fail */
    proc_repair = -log(randf())*mttpr; /* delay in next process repair  */
    while (time() < proc_fail)         /* Work: execute for ≈ 4 months  */
      { execute(process[pid].current); };      /*                       */
    (void) wait(proc_repair);          /* Break: wait ≈ 3 hrs for repair */
    copy(process[pid].current, process[pid].initial, MANY); /* Fix: reset */
    while (message_get(msg, &status)) {}; /* read and discard all msgs in queue */
  };                         /* bottom of work, break, fix loop         */
};                           /*                                         */

The loop alternates: work (execute ≈ 4 months), then break: Fail!!!
(fix): repair takes ≈ 3 hours, then back to the top of the
work/break/fix loop.

Gray & Reuter FT 3: 15

Checkpoint/Restart Process (storage based)

/* A checkpoint-restart process: a server generating unique sequence numbers */
checkpoint_restart_process() /*                                         */
{ Ulong disc = 0;            /* a reliable storage group with the state */
  Ulong address[2] = {0,1};  /* page addresses of the two states on disc */
  Ulong old;                 /* index of the disc page with the old state */
  struct
  { Ulong ticketno;          /* process reads its state from disc;      */
    char  filler[VSIZE];     /* newest state has max ticket number      */
  } value[2];                /* current state kept in value[0]          */
  struct                     /* buffer to hold input message            */
  { processid him;           /* contains requesting process id          */
    char      filler[VSIZE]; /* reply (ticket number) sent to process   */
  } msg;                     /*                                         */

  /* Restart logic: recover ticket number from persistent storage       */
  for (old = 0; old <= 1; old++)  /* read the two states from disc      */
  { if (!reliable_read(disc, address[old], value[old])) /* if read fails */
      panic();               /* then failfast                           */
  };
  if (value[1].ticketno < value[0].ticketno) old = 1; /* pick max seq no */
  else { old = 0; copy(value[0], value[1], VSIZE); }; /* which is old value */

  /* Processing logic: generate next number, checkpoint, and reply      */
  while (TRUE)               /* do forever                              */
  { while (!get_msg(&msg)) {};  /* get next request for a ticket number */
    value[0].ticketno = value[0].ticketno + 1; /* bump ticket number    */
    if (!reliable_write(disc, address[old], value[0]))  /* checkpoint   */
      panic();               /* state to disc                           */
    old = (old + 1) % 2;     /* use other disc page for state next time */
    message_send(msg.him, value[0]); /* send the ticket number to client */
  };                         /* endless loop to get messages            */
};
Gray & Reuter FT 3: 16

Process Pairs (message-based checkpoints)

[Figure: client processes ("give me a ticket") talk to a primary server
process that holds the next ticket number; the primary sends "I'm alive"
messages and state checkpoint messages to a backup server process that
holds its own copy of the next ticket number.]

Problem            Solution
Detect failure     "I'm alive" message timeout
Continuation       no "real" solution; checkpoint messages
Startup            backup waits for primary

Gray & Reuter FT 3: 17

Process Pairs (message-based checkpoints), continued

[Figure: flowchart of the two loops.
 Primary loop: if there is input, read the request, compute the new
 state, send the new state to the backup, and reply; otherwise send
 "I'm alive" and wait a second.
 Backup loop: if there is input, read it, and if it carries a newer
 state, adopt that state; if nothing arrived in the last second,
 broadcast "I'm primary" and reply to the last request; at restart,
 wait a second and check whether you are the default primary.]

• Primary is in a tight loop sending "I'm alive" or state-change
  messages to the backup.
• Backup assumes the primary is dead if no message arrived in the
  previous second.

Gray & Reuter FT 3: 18

What We Have Done So Far

Converted "faulty" processes to reliable ones:
  tolerate hardware and some software faults;
  can repair in seconds or milliseconds.
Unlike checkpoint/restart:
  no process creation/setup time;
  no client reconnect time.
Operating systems are beginning to provide process pairs.
Stateless process pairs can use transactional servers to
  store their state, and
  clean up the mess at takeover.
Like storage-based checkpoint/restart, except process setup/connection
is instant.
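The two-page ("ping-pong") checkpoint in the storage-based ticket server can be sketched in miniature. The names below (`Ckpt`, `restart_pick`, `checkpoint`) are hypothetical, and plain memory stands in for reliable_read/reliable_write: the state is duplexed on two pages, each checkpoint overwrites the OLDER page, and restart picks the readable page with the larger ticket number.

```c
/* Hypothetical sketch of the ping-pong checkpoint (illustrative names,
   not the book's code). A crash during a checkpoint write can at worst
   destroy the older of the two duplexed states.                        */
typedef struct {
    int  ok;        /* 0 if the page could not be read          */
    long ticketno;  /* the state: last ticket number handed out */
} Ckpt;

/* restart: index of the newest readable state, or -1 on disaster */
int restart_pick(Ckpt page[2]) {
    if (!page[0].ok && !page[1].ok) return -1;  /* both lost: panic    */
    if (!page[0].ok) return 1;                  /* only page 1 is good */
    if (!page[1].ok) return 0;                  /* only page 0 is good */
    return (page[1].ticketno > page[0].ticketno) ? 1 : 0; /* max seq no */
}

/* checkpoint: write the new state over the page holding the OLD state */
void checkpoint(Ckpt page[2], int *old, long ticketno) {
    page[*old].ok = 1;
    page[*old].ticketno = ticketno;
    *old = (*old + 1) % 2;   /* the other page is overwritten next time */
}
```

The alternation is the point of the design: because the newest state is never overwritten, a torn write leaves at least one consistent state for restart to find.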
Gray & Reuter FT 3: 19

Persistent Process Pairs

persistent_process()         /* prototypical persistent process         */
{ wait_to_be_primary();      /* wait to be told you are primary         */
  while (TRUE)               /* when primary, do forever                */
  { begin_work();            /* start transaction or subtransaction     */
    read_request();          /* read a request                          */
    doit();                  /* perform the desired function            */
    reply();                 /* reply                                   */
    commit_work();           /* finish transaction or subtransaction    */
  };                         /* did a step, now get next request        */
};                           /*                                         */

Gray & Reuter FT 3: 20

Persistent Process Pairs: the ticket server redone as a transactional server

/* A transactional persistent server process generating unique tickets  */
persistent_ticket_server()   /* current state kept in SQL database      */
{ int ticketno;              /* next ticket number (from the DB)        */
  struct                     /* buffer to hold input message            */
  { processid him;           /* contains requesting process id          */
    char      filler[VSIZE]; /* reply (ticket number) sent to that addr */
  } msg;                     /*                                         */

  /* Restart logic: recover ticket number from persistent storage       */
  wait_to_be_primary();      /* wait to be told you are primary         */
  /* Processing logic: generate next number, checkpoint, and reply      */
  while (TRUE)               /* do forever                              */
  { begin_work();            /* begin a transaction                     */
    while (!get_msg(&msg)) {};  /* get next request for a ticket        */
    exec sql update ticket   /* increment the next ticket number        */
      set ticketno = ticketno + 1;  /*                                  */
    exec sql select max(ticketno)  /* fetch current ticket number       */
      into :ticketno         /* into program local variable             */
      from ticket;           /* from the SQL database                   */
    commit_work();           /* commit the transaction                  */
    message_send(msg.him, ticketno); /* send the ticket number to client */
  };                         /* endless loop to get messages            */
};                           /*                                         */

Gray & Reuter FT 3: 21

Messages: Fault Model

Each process has a queue of incoming messages.
Messages can be:
  corrupted:   a checksum detects it;
  duplicated:  a sequence number detects it;
  delayed arbitrarily long:  ack + retransmit;
  lost:  ack + retransmit + sequence number.
These techniques give messages fail-fast semantics.

Gray & Reuter FT 3: 22

Message Verbs: SEND

/* send a message to a process: returns true if the process exists      */
Boolean message_send(processid him, avalue value)   /*                  */
{ amessagep it;              /* pointer to message created by this call */
  amessagep queue;           /* pointer to process message queue        */
  if (him >= MAXPROCESS) return FALSE;  /* test for valid process       */
loop:                        /* build & queue the message               */
  it = malloc(sizeof(amessage));  /* allocate space to hold message     */
  it->status = TRUE; it->next = NULL;  /* and fill in the fields        */
  copy(it->value, value, VSIZE);  /* copy msg data to message body      */
  queue = process[him].messages;  /* look at process message queue      */
  if (queue == NULL) process[him].messages = it; /* if queue empty, this */
  else                       /* message becomes the queue head; else    */
  { while (queue->next != NULL) queue = queue->next;  /* place          */
    queue->next = it;        /* the message at the queue end            */
  };
  if (randf() < pmf) it->status = FALSE; /* sometimes message corrupted */
  if (randf() < pmd) goto loop; /* sometimes the message is duplicated  */
  return TRUE;               /*                                         */
};                           /*                                         */

Gray & Reuter FT 3: 23

Message Verbs: GET

/* get the next input message of this process: returns true if a message */
Boolean message_get(avalue * valuep, Boolean * msg_status)  /*          */
{ processid me = MyPID();    /* caller's process number                 */
  amessagep it;              /* pointer to input message                */
  it = process[me].messages; /* find caller's message input queue       */
  if (it == NULL) return FALSE;  /* return false if queue is empty      */
  process[me].messages = it->next; /* take first message off the queue  */
  *msg_status = it->status;  /* record its status                       */
  copy(valuep, it->value, VSIZE);  /* value = it->value                 */
  free(it);                  /* deallocate its space                    */
  return TRUE;               /* return status to caller                 */
};                           /*                                         */

Gray & Reuter FT 3: 24

Sessions Make Messages FailFast

[Figure: two processes connected by a session; each side keeps counters
of messages in, out, and acknowledged (e.g. "in 7, acknowledged 3,
out 3"). Message 7 and its ack flow across the session.]
• CRC makes a corrupt message look like a lost message.
• Sequence numbers detect duplicates => they too become lost messages.
• So the only failure is a lost message.
• Timeout/retransmit masks lost messages => the only failure is delay.

Gray & Reuter FT 3: 25

Sessions Plus Process Pairs Give Highly Available Messages

[Figure: a session between a process and a process pair; the primary
checkpoints each message and its sequence numbers to the backup, and
acks flow back across the session.]

Checkpoint messages and sequence numbers to the backup.
Backup resumes the session if the primary fails.
Backup broadcasts its new identity at takeover (see book for code).

Gray & Reuter FT 3: 26

Highly Available Message Verbs

[Figure: application programs call reliable_send_msg() and
reliable_get_msg(); a Listener process manages the output message
session, the input message session, and the queue of acknowledged
input messages.]

Hide under reliable get/send msg:
  – sequence numbers, ack and retransmit logic,
  – checkpoints to the process pair,
  – takeover,
  – resend of the most recent reply.
Uses a Listener process (thread) to do all this asynchronous work.

Gray & Reuter FT 3: 27

Summary

Went from faulty storage, processes, and messages to fault-tolerant
versions of each.
A simple fault model explains many techniques used (and mis-used) in
fault-tolerant systems.

Gray & Reuter FT 3: 28
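As a closing illustration, the fail-fast session discipline of slide 3: 25 can be sketched in a few lines. The names below (`Verdict`, `cksum`, `classify`) are hypothetical, and the XOR checksum stands in for a real CRC: corruption becomes a detected drop, and per-session sequence numbers turn duplicates and gaps (lost messages) into detected events, leaving delay as the only remaining failure.

```c
/* Hypothetical sketch of the session discipline (illustrative names). */
typedef enum { MSG_CORRUPT, MSG_DUPLICATE, MSG_OK, MSG_LOST } Verdict;

/* toy checksum: XOR of the bytes (a real session would use a CRC) */
unsigned char cksum(const unsigned char *p, int n) {
    unsigned char c = 0;
    while (n-- > 0) c ^= *p++;
    return c;
}

/* classify an arriving message against the session's expected seq no */
Verdict classify(unsigned char sum, const unsigned char *body, int n,
                 long seq, long *expected) {
    if (sum != cksum(body, n))
        return MSG_CORRUPT;        /* corrupt: drop, await retransmit  */
    if (seq < *expected)
        return MSG_DUPLICATE;      /* old copy: discard and re-ack     */
    if (seq > *expected)
        return MSG_LOST;           /* gap: an earlier message was lost */
    (*expected)++;                 /* in order: accept and ack         */
    return MSG_OK;
}
```

Every abnormal outcome collapses onto the same recovery path: the sender's timeout/retransmit loop eventually delivers the missing message, so the receiver never has to handle corruption, duplication, and loss as distinct failures.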