CS514: Intermediate Course in Operating Systems
Professor Ken Birman
Ben Atkin: TA
Lecture 23: Nov. 14

Topics for this lecture
• Adding clocks to distributed systems
– Also, to non-distributed ones
– We won't look at real-time operating systems and scheduling, but this was an active area until recently
– Recent machines are so fast that the importance of the topic is now reduced…
• Using time in broadcast protocols
• Comparison of CASD with gossip

Many systems need real-time
• Air traffic control talks about "when" planes will be at various locations
• Doctors talk about how patients responded after a drug was given, or change therapy after some amount of time
• Process-control software runs factory floors by coordinating what machine tools do, and when

Many systems need real-time
• Video and multimedia systems need isochronous communication protocols that coordinate video, voice, and other data sources
• Telecommunications systems must guarantee real-time response despite failures, for example when switching telephone calls

Real time in real systems
• These real systems combine the need for logical consistency and coordination (for control) with the need for physical coordination and timely behavior
• The issue becomes one of integrating real-time tools and support with the logical tools and support we have already considered
• Not real-time or logical time: we want both, for different aspects of a single application!
Clock Synchronization and Synchronous Systems
• Up to now, we restricted attention to logical notions of time: "a happens before b"
• Today we look at adding real clocks to such a system
• Two views of real-time:
– Supporting clocks that programs can consult as needed
– Making direct use of real-time inside protocols

Clock Synchronization
• The topic was extremely active during the early 1980's
• The best-known algorithms include the one in OSF/1 UNIX (based on one by Marzullo at Xerox), the "optimally accurate" algorithm of Srikanth and Toueg, and the "probabilistic" algorithm of Cristian
• Introduction of the Global Positioning System is eliminating the need for this form of synchronization

Clock synchronization protocols
• Would like to think of the network as having a single "clock" that all processes can consult
• But we need to implement this using many "local" clocks that can:
– Initially be set incorrectly
– Drift over the course of time
• Clock synchronization tries to keep clocks close to true real-time and minimize their tendency to drift

Precision and Accuracy
• Accuracy measures local clocks relative to an external source of accurate real-time. Accurate clocks are close to real-time
• Precision measures local clocks relative to each other. Precise clocks are close to each other
• Skew is the numerical limit on the maximum distance that correct clocks can drift apart. E.g., we could say the "maximum skew is 1 sec" for some system

Thought question
• Suppose that you were an air-traffic controller
• Assume that clock synchronization is by a protocol that runs on your local area network. Each control center runs its own clock sync.
• Would you prefer an algorithm that optimizes for the best possible accuracy, or for the best possible precision?

... and the answer
• The air traffic controller basically says to planes "during time t0 to t1, fly from a to b"
• Planes presumably have accurate clocks... so ATC needs accurate clocks, too.
• Precision is like internal consistency.
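The accuracy/precision/skew distinction can be made concrete with a small numerical sketch (the clock readings below are hypothetical, not from any real system):

```python
# Hypothetical "simultaneous" readings from four local clocks, plus the
# true real-time from an external reference (all in seconds).
true_time = 100.000
local_clocks = [100.002, 99.998, 100.005, 99.997]

# Accuracy: worst distance of any local clock from true real-time.
accuracy = max(abs(c - true_time) for c in local_clocks)

# Precision: worst distance between any pair of local clocks.
precision = max(local_clocks) - min(local_clocks)

print(accuracy)   # ≈ 0.005 s
print(precision)  # ≈ 0.008 s
```

Note that the clocks here are quite precise relative to each other even though two of them are the furthest from true time; a system can be precise without being accurate, which is exactly the trade-off in the thought question.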
• Accuracy is like external consistency (dynamic uniformity)

How clock synchronization used to work
• Periodically, all processors would run a clock sync protocol, for example by broadcasting the readings from their clocks
• Each receives a set of values from the others (sets may differ due to faults!)
• The algorithm would pick a synchronized value from the set; analysis was used to prove properties of the clocks

Things known about clock sync.
• Accuracy of clock synchronization is limited by the end-to-end latencies for messages passing through the network
• Example: you tell me that it is 10:00am, exactly.
• I am unsure how long the message took to pass through the network, hence I conclude that the time is in the range [10:00, 10:00 + max latency]

Better way to express this
• Suppose that I know something about the mean message propagation latency
• The quality of clock synchronization is at best proportional to the possible variation in latency relative to this mean
• The a-posteriori method of Verissimo and Rodriguez uses this to obtain optimal precision (but is less accurate than the algorithm of Srikanth and Toueg)

Global Positioning System
• Satellite system launched by the military in the early 1990's; became public and inexpensive
• Can think of the satellites as broadcasting the time
• A small radio receiver picks up signals from three satellites and triangulates to determine position
• The same computation also yields an extremely accurate clock (accurate to a few milliseconds)

Clock synchronization with GPS
• Put two GPS receivers (or more) on a network
• Periodically, the receivers broadcast the "true time"
• Other machines only need to adjust their clocks to overcome propagation delays for clock sync messages through the network!
• Well matched to the a-posteriori clock synchronization approach

Basic idea
• GPS receiver broadcasts "the time is now 10:00" on a broadcast network (ethernet)
• Receivers note the time when they receive the message: [10:01, 9:58, ....]
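The latency bound above is the heart of Cristian's probabilistic algorithm: the server's timestamp could have been generated anywhere inside the round trip, so the client's best estimate is the timestamp plus half the round trip, with uncertainty proportional to half the round trip. A minimal sketch (function name and the numbers are illustrative assumptions):

```python
def cristian_estimate(t_send, t_recv, server_time, min_latency=0.0):
    """Cristian-style clock reading (illustrative sketch).

    t_send / t_recv: client clock when the request left / reply arrived.
    server_time:     timestamp in the server's reply.
    Returns (estimate, uncertainty): the server stamped the reply
    somewhere inside the round trip, so the best guess is the stamp plus
    half the round trip, accurate to within half the round trip
    (tightened by any known minimum one-way latency).
    """
    rtt = t_recv - t_send
    estimate = server_time + rtt / 2.0
    uncertainty = rtt / 2.0 - min_latency
    return estimate, uncertainty

# Request out at 10.000, reply back at 10.040 on the local clock,
# carrying server time 9.995:
est, err = cristian_estimate(10.000, 10.040, 9.995)
# est ≈ 10.015, err ≈ 0.020
```

This is why variation in latency, not latency itself, limits the achievable quality: a large but fixed delay could be subtracted out, but the ± half-round-trip uncertainty cannot.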
and reply with their values
• The GPS receiver subtracts the median value
• The differences [1, -2, ...] now give the drift of the clock of each destination relative to the median clock

A-posteriori method, adjustment stage
• Now we distribute these drift values back to the processors, which compensate for the rate of drift over the time period to the next synchronization
• Can show that this yields clocks that are optimal both in accuracy and precision
• A processor with a broken clock has a good chance of discovering it during synchronization

Using real-time
• One option is to use a real-time operating system and clock synchronization algorithm, and to design protocols that exploit time
• Example: the MARS system uses pairs of redundant processors to perform actions fault-tolerantly and meet deadlines. It has been applied in process control systems. (Another example: Delta-4)

Real-time broadcast protocols
• Can also implement broadcast protocols that make direct use of temporal information
• Examples:
– A broadcast that is delivered at the same time by all correct processes (plus or minus the clock skew)
– A distributed shared memory that is updated within a known maximum delay
– A group of processes that can perform periodic actions

A real-time broadcast
(diagram: processes p0–p5; message sent at time t, delivered during the window [t+a, t+b])
Message is sent at time t by p0. Later both p0 and p1 fail. But the message is still delivered atomically, after a bounded delay, and within a bounded interval of time (at non-faulty processes)

A real-time distributed shared memory
(diagram: p0 sets x=3 at time t; p1–p5 observe x=3 during [t+a, t+b])
At time t, p0 updates a variable in a distributed shared memory. All correct processes observe the new value after a bounded delay, and within a bounded interval of time.

Periodic process group: Marzullo
(diagram: processes p0–p5 each take an action once per period)
Periodically, all members of a group take some action.
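The broadcast-and-compare stage of the a-posteriori method can be sketched in a few lines (the receive times are hypothetical; a real implementation would also track drift rates between rounds):

```python
# The GPS receiver broadcasts "it is now 10:00"; each machine records
# the local-clock time at which the broadcast arrived.
receive_times = [10.01, 9.98, 10.00, 10.03, 9.97]

# The median receive time serves as the reference "median clock".
ref = sorted(receive_times)[len(receive_times) // 2]

# Offsets of each local clock relative to the median clock; each
# machine later compensates for its offset (and its drift rate) until
# the next synchronization round.
offsets = [t - ref for t in receive_times]
# offsets ≈ [0.01, -0.02, 0.00, 0.03, -0.03]
```

Because the broadcast reaches all receivers over the same medium at nearly the same instant, the spread in receive times reflects clock disagreement rather than network delay, which is why the method achieves such good precision. An offset far from the others is also a cheap hint that a clock is broken.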
Idea is to accomplish this with minimal communication

The CASD protocols
• Also known as the "Δ-T" protocols
• Developed by Cristian and others at IBM; intended for use in the (ultimately, failed) FAA project
• Goal is to implement a timed atomic broadcast tolerant of Byzantine failures

Basic idea of the CASD protocols
• Assumes use of clock synchronization
• Sender timestamps the message
• Recipients forward the message using a flooding technique (each echoes the message to the others)
• Wait until all correct processors have a copy, then deliver in unison (up to the limits of the clock skew)

CASD picture
(diagram: processes p0–p5; message sent at t, delivery window [t+a, t+b])
p0, p1 fail. Messages are lost when echoed by p2, p3

Idea of CASD
• Assume known limits on the number of processes that fail during the protocol, and on the number of messages lost
• Using these and the temporal assumptions, deduce the worst-case scenario
• Now we know that if we wait long enough, either all correct processes will have the message, or none will
• Then schedule delivery using the original time plus a delay computed from the worst-case assumptions

The problems with CASD
• In the usual case, nothing goes wrong, hence the delay can be very conservative
• Even if things do go wrong, is it right to assume that if a message needs between 0 and δ ms to make one hop, it needs [0, n·δ] to make n hops?
• How realistic is it to bound the number of failures expected during a run?

CASD in a more typical run
(diagram: all six processes receive the message well before the delivery window [t+a, t+b])

... leading developers to employ more aggressive parameter settings
(diagram: the same run with a much tighter delivery window)

CASD with over-aggressive parameter settings starts to "malfunction"
(diagram: some deliveries now fall outside the window)
All processes look "incorrect" (red) from time to time

CASD "mile high"
• When run "slowly", the protocol is like a real-time version of abcast
• When run "quickly", the protocol starts to give probabilistic behavior:
– If I am correct (and there is no way to know!)
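The delivery rule above can be sketched numerically. Under the stated assumptions, a message timestamped t is buffered and delivered at t + Δ, where Δ must cover the worst-case flooding time plus the clock skew (the symbols, parameter names, and numbers below are illustrative, not CASD's exact formula, which also accounts for the assumed failure counts):

```python
def casd_delivery_time(send_ts, n_hops, max_hop_latency, clock_skew):
    """Worst-case CASD-style delivery deadline (illustrative sketch).

    A recipient buffers the message until send_ts + delta, where delta
    is chosen large enough that, under the assumed bounds on per-hop
    latency and skew, every correct process is guaranteed to hold a
    copy by then -- so all deliver in unison.
    """
    delta = n_hops * max_hop_latency + clock_skew
    return send_ts + delta

# With conservative assumptions, the delay is large even when nothing
# goes wrong: 3 hops at 10 ms each, plus 5 ms skew => deliver 35 ms
# after the send, although in a typical run every process already had
# the message within one hop.
deadline = casd_delivery_time(send_ts=0.0, n_hops=3,
                              max_hop_latency=0.010, clock_skew=0.005)
```

This makes the dilemma concrete: shrinking `max_hop_latency` or `n_hops` to get aggressive (fast) delivery shrinks exactly the margin that the correctness proof relies on.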
then I am guaranteed the properties of the protocol, but if not, I may deliver the wrong messages

How to repair CASD in this case?
• Gopal and Toueg developed an extension, but it slows the basic CASD protocol down, so it wouldn't be useful in the case where we want speed and also real-time guarantees
• Can argue that the best we can hope to do is to superimpose a process group mechanism over CASD (Verissimo and Almeida are looking at this)

Why worry?
• CASD can be used to implement a distributed shared memory ("delta-common storage")
• But when this is done, the memory consistency properties will be those of the CASD protocol itself
• If the CASD protocol delivers different sets of messages to different processes, the memory will become inconsistent

Why worry?
• In fact, we have seen that CASD can do just this, if the parameters are set aggressively
• Moreover, the problem is not detectable either by "technically faulty" processes or "correct" ones
• Thus, the DSM can become inconsistent and we lack any obvious way to get it back into a consistent state

Using CASD in real environments
• Would probably need to set the parameters close to the range where CASD can malfunction, but rarely does
• Hence would need to add a self-stabilization algorithm to restore a consistent state of memory after it becomes inconsistent
• The problem has not been treated in the papers on CASD
• The pbcast protocol does this

Using CASD in real environments
• Once we build the CASD mechanism, how would we use it?
– Could implement a shared memory
– Or could use it to implement a real-time state machine replication scheme for processes
• The US air traffic project adopted the latter approach
– But stumbled on many complexities…

Using CASD in real environments
• Pipelined computation
• Transformed computation

Issues?
• Could be quite slow if we use conservative parameter settings
• But with aggressive settings, either process could be deemed "faulty" by the protocol
– If so, it might become inconsistent
• Protocol guarantees don't apply
– No obvious mechanism to reconcile states within the pair
• The utility of the method is thus unclear

Similar to MARS
• Research system done in Austria by Hermann Kopetz
– Basic idea is that everything happens twice
– The receiver can suppress duplicates but is guaranteed at least one copy of each message
– Used to overcome faults without loss of real-time guarantees
• MARS is used in the BMW, but gets close to a hardware fault-tolerance scheme

Many more issues…
• What if a process starts to lag?
• What if applications aren't strictly deterministic?
• How should such a system be managed?
• How can a process be restarted?
– If not, the system eventually shuts down!
• How do we measure the timing behavior of components, including the network?

FAA experience?
• It became too hard to work all of this out
• Then they tried a transactional approach, which also had limited success
• Finally, they gave up!
– $6B was lost…
– A major fiasco; ATC is still a mess

Totem approach
• Start with the extended virtual synchrony model
• Analysis used to prove real-time delivery properties
• Enables them to guarantee delivery within about 100-200 ms on a standard broadcast LAN
• Contrast with our 85 μs latency for Horus!
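The "everything happens twice" idea in MARS relies on simple duplicate suppression at the receiver: each message is sent over two redundant channels, and the receiver delivers the first copy of each message id and drops the second, so the loss of any one copy is masked. A minimal sketch (the message format and helper names are hypothetical):

```python
# MARS-style duplicate suppression (illustrative sketch).  Every
# message is transmitted twice; the receiver delivers the first copy
# of each id and silently discards the duplicate, so one lost copy
# costs nothing and no retransmission delay is ever needed.
def make_receiver(deliver):
    seen = set()
    def on_message(msg_id, payload):
        if msg_id in seen:
            return              # duplicate copy: suppress it
        seen.add(msg_id)
        deliver(payload)        # first copy: deliver immediately
    return on_message

delivered = []
recv = make_receiver(delivered.append)
# Two copies of each message arrive; only one of each is delivered.
for msg_id, payload in [(1, "a"), (1, "a"), (2, "b"), (2, "b")]:
    recv(msg_id, payload)
# delivered == ["a", "b"]
```

Because masking is done by blind redundancy rather than detection and retransmission, the real-time delivery bound is preserved even across a fault, which is exactly the property the lecture attributes to MARS.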
Tradeoffs between consistency, time
• Notice that as we push CASD to run faster, we lose consistency
• Contrast with our virtual synchrony protocols: they run as fast as they can (often, much faster than CASD when it is not malfunctioning), but don't guarantee real-time delivery

A puzzle
• Suppose that experiments show that 99.99% of Horus or Ensemble messages are delivered in 85 μs ± 10 μs for some known maximum load
• Also have a theory that shows that 100% of Totem messages are delivered in about 150 ms under reasonable assumptions
• And have the CASD protocols, which work well with around 250 ms for similar LANs

A puzzle
• Question: is there really a difference between these forms of guarantees?
• We saw that CASD is ultimately probabilistic. Since Totem makes assumptions, it is also, ultimately, probabilistic
• But the experimentally observed behavior of Horus is also probabilistic
• ... so why isn't Horus a "real-time" system?

What does real-time mean?
• To the real-time community?
– A system that provably achieves its deadlines under stated assumptions
– Often achieved using delays!
• To the pragmatic community?
– The system is fast enough to accomplish our goals
– Experimentally, it never seems to lag behind or screw up

Some real-time issues
• Scheduling
– Given goals, how should tasks be scheduled?
– Periodic, aperiodic, and completely ad-hoc tasks
• What should we do if a system misses its goals?
• How can we make components highly predictable in terms of their real-time performance profile?

Real-time today
• Slow transition
– Older, special-purpose operating systems and components, carefully hand-crafted for predictability
– Newer systems are simply so fast (and can be dedicated to the task) that what used to be hard is now easy
– In effect, we no longer need to worry about real-time, in many cases, because our goals are so easily satisfied!