CS514: Intermediate Course in Operating Systems
Professor Ken Birman
Ben Atkin: TA
Lecture 13: Oct. 5

Consistency
• How can we relate models of consistency to cost and availability?
• Is it possible to reconcile transactional replication with virtual synchrony replication?

Consistency
• Various models
– Multiple copies of some object, but behavior mimics a single non-faulty object
– ACID: 1-copy serializability plus durability
– FLP style of consensus
– Dynamic uniformity versus static model

Basic “design points”
• Does the model guarantee anything relative to the “last words” of a process that fails?
– Yes for transactions: ACID
– No, in virtual synchrony
• Can do better using the “flush” primitive
• And can mimic transactional replication if we require that the primary partition is also a quorum of some statically specified set of processes (sketched below)

Are actions asynchronous?
• No, in the case of transactions
– We can do things locally
– But at commit time, we need to synchronize
– And most transactional replication schemes are heavily synchronous
• Yes, for virtual synchrony
– But only with cbcast or fbcast

Mixing models
• Virtual synchrony is like “weak transactional serializability”
– In fact, the connection can be made precise
– We use a model called linearizability, due to Herlihy and Wing
• Much recent work on database replication mixes models…

Real systems have varied needs
• Must match the choice of properties to the needs of the application
• We find that multiple models are hard to avoid
– We want the stronger models for database applications
– But where data won’t persist, the cheaper models suffice…

Digression
• Need to strengthen our intuition
• Can we find examples of real systems that might need group communication or data replication?
– Ideally, systems that can’t be built in any other way
– Use this to think about properties required for application correctness

Joint Battlespace Infosphere
[Diagram: robust infrastructure for the joint battlespace infosphere]

Distributed Trading System
[Diagram: trader clients connected to pricing DBs, historical data, market data feeds, current pricing, and analytics servers in Tokyo, London, Zurich, ..., over a long-haul WAN with a spooler]
• Availability for historical data
• Load balancing and consistent message delivery for price distribution
• Parallel execution for analytics

Distributed Service Node
[Diagram: telephone trunk lines and a dumb switch connected by Ethernet (= Isis) to x86/UNIX and RISC/UNIX nodes carrying data and digitized voice paths; handles calls, changes, adds, deletes; goal is one phone number per person]
• Replicated files for digitized voice store
• Redundancy for database availability
• Load balancing for call handling & routing

Shop Floor Process Control Example
[Diagram: WorkStream server (VAX), recipe management server and data collection server (HP), operator clients (HP, PC), and station controllers (HP) connected by Ethernet to factory equipment]

The List Goes On
• Air traffic control system
• Medical decision support in a hospital
• Providing real-time data in support of banking or major risk-management strategies in finance
• Real-time system for balancing power production and consumption in the power grid
• Telephone system for providing services in settings with mobile users and complex requirements

Challenge faced by developers
• We now have multiple notions of consistency:
– Transactional, with persistent data
– Process groups with dynamic uniformity
– Process groups without dynamic uniformity
– Primary partition notion of progress
– Non-primary partitions with merge
• How can we make the right choices for a given situation?
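An earlier slide noted that virtual synchrony can mimic transactional replication if the primary partition is also a quorum of some statically specified process set. A minimal sketch of that membership test (names are hypothetical, not from any real toolkit):

```python
# Minimal sketch (hypothetical names): deciding whether the surviving view
# may act as the primary partition, by requiring it to also be a quorum of
# a statically specified process set.

STATIC_MEMBERSHIP = {"p1", "p2", "p3", "p4", "p5"}  # assumed static set

def is_primary(view: set) -> bool:
    """A view may make progress only if it holds a strict majority
    (a quorum) of the statically specified membership."""
    survivors = view & STATIC_MEMBERSHIP
    return len(survivors) > len(STATIC_MEMBERSHIP) // 2

# After a partition, {p1, p2, p3} is primary and {p4, p5} is not, so at
# most one side of any partition can ever make progress.
assert is_primary({"p1", "p2", "p3"})
assert not is_primary({"p4", "p5"})
```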
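The next slide suggests letting users specify their requirements and having the system configure protocols to match. Purely as an illustration of that idea, the “menu” above might be exposed to applications like this (all names hypothetical):

```python
# Minimal sketch (all names hypothetical): an application declares its
# consistency requirements, and the system picks a protocol stack to match.
from enum import Enum, auto

class Consistency(Enum):
    TRANSACTIONAL = auto()        # ACID, persistent data
    DYNAMICALLY_UNIFORM = auto()  # "safe" delivery, survives partitioning
    NON_UNIFORM = auto()          # cheap virtual synchrony delivery
    PRIMARY_PARTITION = auto()    # progress only in the primary component
    PARTITIONABLE_MERGE = auto()  # all partitions progress, merge later

def make_group(name: str, model: Consistency):
    """Hypothetical factory: configure protocols for the declared model."""
    if model is Consistency.TRANSACTIONAL:
        raise NotImplementedError("would layer 2PC and logging here")
    print(f"group {name!r} configured for {model.name}")

make_group("atc-consoles", Consistency.DYNAMICALLY_UNIFORM)
```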
One size fits all?
• One possibility is that we’ll simply need multiple options
– The user would somehow specify their requirements
– Given this information, the system would configure protocols appropriately
• The alternative is to standardize on one scheme
– Likely to be a strong but more costly option

“Understanding” CATOCS
• Paper by Cheriton and Skeen in 1993
• They argue that the end-to-end approach dictates
– Simplicity in the GCS (group communication system)
– Properties enforced near the end-points
• The paper is full of mistakes, but the point is well taken
– People don’t want to pay for properties they don’t actually require!

French air traffic control
• They wanted to use replication and group communication in a system for high-availability controller consoles
• Issues they faced
– How strong is the consistency need?
– Where should we use groups?
– Where should we use transactions?

Air traffic control
• Much use of computer technologies
– Flight management system (controls airplane)
– Flaps, engine controls (critical subsystems)
– Navigational systems
– TCAS (collision avoidance system)
– Air traffic control system on ground
• In-flight, approach, international “hand-off”
– Airport ground system (runways, gates, etc.)

ATC system components
[Diagram: onboard systems and radar, an X.500 directory, controllers, and an air traffic database (flight plans, etc.)]

Possible uses of groups
• To replicate data in console clusters
• For administration of console clusters
• For administration of the “whole system”
• For communication from the radar to the consoles
• To inform consoles when the flight plan database is updated
• To replicate the database itself

French air traffic control
• Some conclusions
– They use transactions for the flight plan database
• In fact, they would love to find ways to replicate this “geographically”
• But the topic remains a research problem
– They use one process group for each set of 3-5 control consoles
– They use unreliable hardware multicast to distribute radar inputs
• Different groups are treated in different ways

French air traffic control
• Different consistency in different uses
• In some cases, this forced changes to the application itself
– E.g., different consoles may not have identical radar images
• Choices always favored
– Simplicity
– Avoiding technology performance and scaling limits

Air traffic control example
• Controller interacts with the service: “where can I safely route flight TWA 857?”
• Service responds: “sector 17.8.09 is available”
• ... what forms of consistency are needed in order to make this a safe action to perform? (A sketch of the “safe delivery” rule this requires appears below.)

Observations that can help
• Real systems are client-server structured
• Early work on process group computing tended to forget this!
– The Isis system said “RPC can be harmful” but then took the next step and said “so we won’t think in client-server terms”. This was a mistake!
– Beware systems that provide a single API system-wide
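For the TWA 857 interaction to be safe, the routing update must be dynamically uniform: no console may act on it unless every console will also see it. A minimal sketch of that “safe delivery” rule, assuming a fixed view and in-memory bookkeeping (all names hypothetical):

```python
# Minimal sketch (hypothetical, in-memory) of dynamically uniform ("safe")
# delivery: a message is delivered/applied only once every member of the
# current view has acknowledged holding a copy, so no process can act on
# an update that could be lost if the sender fails or is partitioned away.

class SafeDelivery:
    def __init__(self, view):
        self.view = set(view)      # current group membership
        self.acks = {}             # msg id -> set of acknowledging members
        self.delivered = []

    def receive(self, msg_id, member):
        """Record member's ack; deliver once the whole view has the message."""
        self.acks.setdefault(msg_id, set()).add(member)
        if self.acks[msg_id] == self.view:
            self.delivered.append(msg_id)   # safe: everyone holds a copy

g = SafeDelivery(view={"a", "b", "c"})
g.receive("route TWA 857 -> 17.8.09", "a")
g.receive("route TWA 857 -> 17.8.09", "b")
assert not g.delivered                      # not yet uniform
g.receive("route TWA 857 -> 17.8.09", "c")
assert g.delivered                          # now safe to act on it
```

The sketch ignores failures and view changes entirely; in a real toolkit this is exactly where the “flush” primitive and membership protocol would interact.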
A multi-tier API
• Separate concerns:
– The client system wants a simple interface: RPC to servers and a reliable stream connection back; it wants to think of the whole system as a single server
– The server wants to implement a WAN abstraction out of multiple component servers
– The server itself wants replication and load-balancing for fault-tolerance
• Need security and management APIs throughout

Sample but typical issue
• It is very appealing to say
– This server poses a problem
– So I’ll roll it out…
– … and replace it with a high-availability group server
• Often, in practice, the existing code on the client side precludes such upgrades!

Separate concerns
• Consistency goals for the client are different from the goals within the lower levels of many systems
• At the client level, the main issue is dynamic uniformity: does a disconnected client continue to act on the basis of information provided before the partitioning?

Do we care?
• In the ATC example, the answer is yes, so we need dynamic uniformity guarantees

WAN architecture
• The mental model for this level is a network whose component nodes are servers
• Each server initiates updates to data it “owns” and distributes this data to the other servers
• May also have globally owned data, but this is an uncommon case!
• For global data we need dynamic uniformity, but for locally owned data a weaker solution suffices

Consistency approach in a partitionable network
• Free to update your local data
• When the partition ends, state merges by propagation of local updates to the remote sites, which had a safe but stale view of other sites’ local data. (Treated formally by Malki, Dolev, Strong; Keidar; others.) A sketch of this merge-by-propagation appears at the end of this segment.
• Global updates may be done using dynamically uniform protocols, but will obviously be delayed during partitioning events

Within a server
• At this level, we see a server replicated on multiple nodes for
– Fault-tolerance (availability or recoverability)
– Load-balancing
– Replication to improve response time
• The goal is primary-component progress, and there is no need for dynamic uniformity

Worst case for a replicated server?
• If the application wants recoverability, server replication may be costly and counterproductive
• Many real database systems actually sacrifice transactional guarantees to fudge this case:
– Primary/backup approach, with the log sent from primary to backup periodically
– A failure can cause some transactions to “vanish” until the primary recovers and the lost log records are discovered

Observations?
• Complex systems may exhibit multiple needs, superimposed
• Needs are easiest to understand when approached in terms of architectural structure
• The literature on distributed consistency is often confusing because different goals are blurred in papers

Example of a blurry goal
• Essential point of the famous FLP result: we can’t guarantee liveness in a system that also provides an external consistency property such as dynamic uniformity or database atomicity
• Can evade it in settings with accurate failure detectors... but real systems can make mistakes
• But often, we didn’t actually want this form of consistency!

Example of a blurry goal (cont)
• Moreover, the FLP result may require a very “clever” adversary strategy.
• Friedman and Vaysburd have a proof that an adversary that cannot predict the future is arbitrarily unlikely to prevent consensus!
• On the other hand, it is easy to force a system to wait if it wants external consistency. Think about 2PC and 3PC. This is the more serious issue.
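The “forced to wait” point is easy to see even in a toy model of a 2PC participant: once it has voted yes it is in doubt, and if the coordinator then fails it can neither commit nor abort on its own. A minimal sketch (hypothetical names, not a full protocol):

```python
# Minimal sketch of why 2PC can force waiting: a participant that has
# voted YES is "in doubt" -- if the coordinator then crashes, it cannot
# decide unilaterally and must block until the coordinator recovers.

class Participant:
    def __init__(self):
        self.state = "INIT"

    def prepare(self):
        self.state = "PREPARED"    # voted yes; now in doubt
        return True

    def decide_alone(self):
        if self.state == "PREPARED":
            return "BLOCKED"       # committing OR aborting unilaterally
                                   # could contradict the global decision
        return "ABORT"             # never voted yes: safe to abort alone

p = Participant()
p.prepare()
print(p.decide_alone())            # BLOCKED -- the price of atomicity
```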
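And to make the merge-by-propagation idea from the partitionable-network slide concrete, here is a minimal sketch (hypothetical names) in which each site version-stamps its own locally owned data, so merging after a partition heals is trivial:

```python
# Minimal sketch (hypothetical) of merge after a partition heals: each site
# owns its local data and version-stamps its updates, so the merge is just
# "take the higher version of each site's locally owned entry".

def merge(state_a, state_b):
    """Each state maps owner -> (version, value). Since only the owner
    updates its own entry, the higher version is always the newer one."""
    merged = {}
    for owner in state_a.keys() | state_b.keys():
        va = state_a.get(owner, (0, None))
        vb = state_b.get(owner, (0, None))
        merged[owner] = max(va, vb)   # tuples compare by version first
    return merged

# Two sides of a partition each advanced their own data; merging loses nothing.
paris = {"paris": (7, "sector 17 closed"), "nice": (3, "ok")}
nice  = {"paris": (5, "sector 17 open"),   "nice": (4, "storm")}
print(merge(paris, nice))   # {'paris': (7, ...), 'nice': (4, 'storm')}
```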
Much of the theory is misunderstood!
• Theory tends to make sweeping claims: “The impossibility of group membership in asynchronous systems”
• These claims are, strictly speaking, correct
• But they may not be relevant in specific practical settings because, often, the practical situation needs much weaker guarantees!

When do we need FLP-style consistency?
• Few real systems need such strong forms of consistency
• Yet relatively little is understood about the full spectrum of weaker consistency options
• The interplay between the consistency of the fault-tolerance solution and other properties like security or real-time further clouds the picture

Why should this bother us?
• The main problem is that we can’t just pick a single level of consistency that will make all users happy
• The database consistency model is extremely slow: even database vendors don’t respect the model (the primary/backup “window of vulnerability” is accepted because 2PC is too costly)
• Dynamic uniformity costs a factor of 100-1000 compared to non-uniform protocols

... but non-uniform protocols are too weak!
• Usually, non-uniform protocols are adequate
– They capture “all the things that a system can detect about itself”
– They make sense when partitioning can’t occur, as on a cluster of computers working as a server
• But they don’t solve our ATC example, and they are too weak for a wide-area database replicated over many servers

Optimal transactions
• The best known dynamic uniformity solution is actually not the static scheme we’ve examined
• This optimal approach
– Was first developed by Lamport in his Paxos paper, but the paper was very hard to follow
– Later, Keidar, Chockler and Dolev showed that a version of 3-phase commit gives optimal progress; the scheme was very similar to Paxos
– But performance is still “poor”

Long-term prospects?
• Systems in which you pay for what you use
• APIs specialized to particular models of computation, in which an API can make a choice even if the same choice wouldn’t work for other APIs
• Tremendous performance variation depending on the nature of the tradeoffs accepted
• Big difference depending on whether we care about actions by that partitioned-off controller

Theory side
• Beginning to see a solid and well-founded theory of consistency that deals with the non-uniform case
– For example, Lynch has an IOA model of virtual synchrony.
• It remains hard to express the guarantees of such a model in the usual temporal logic style
– Cristian and Fetzer did some nice work on this
• The challenge is that we want to say things “about” the execution, but we often don’t know “yet” whether the process we are talking about will turn out to be a member of the system or partitioned away!

Other directions?
• In subsequent lectures we will look at probabilistic guarantees (a gossip-style sketch appears below)
– These are fairly basic to guarantees of real-time behavior
– They can be integrated into more traditional replication and process group methods, but not easily
• Self-stabilization is another option; we won’t consider it here (Dijkstra)

Self-stabilization
• The idea is that the system, when pushed out of a consistent state, settles back into one after the environment stops “pushing”
• Example: after a failure, if no more failures occur, some system guarantee is restored
• The concern is that this may not bound the degree of deviation from correct behavior while the system is being perturbed.
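A minimal sketch of the self-stabilization idea, in the flavor of Dijkstra’s classic K-state token ring (the constants are arbitrary assumptions): from any corrupted state, once the environment stops pushing, the ring converges to a legal state with exactly one token.

```python
# Minimal sketch of Dijkstra's self-stabilizing token ring: from ANY
# starting state, once failures stop, the ring converges to a legal
# state with exactly one "token" (one enabled process).

N, K = 5, 6                       # ring size; K > N states per process

def enabled(s, i):
    if i == 0:
        return s[0] == s[N - 1]   # process 0 holds the token when it
    return s[i] != s[i - 1]       #   matches its predecessor; the others
                                  #   hold it when they differ

def step(s, i):
    s = list(s)
    s[i] = (s[0] + 1) % K if i == 0 else s[i - 1]
    return s

state = [3, 0, 4, 4, 1]           # arbitrary (corrupted) starting state
for _ in range(200):              # keep firing some enabled process
    i = next(j for j in range(N) if enabled(state, j))
    state = step(state, i)

# Once the perturbation stops, exactly one process is enabled: stabilized.
assert sum(enabled(state, j) for j in range(N)) == 1
```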
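The probabilistic direction mentioned above, and the “90% reliable” question raised on the next slide, are usually illustrated with gossip. A minimal simulation sketch (all parameters are arbitrary assumptions): each round, every process holding the message relays it to a few random peers over lossy links.

```python
# Minimal sketch of a gossip-style, probabilistically reliable 1-to-n
# multicast: no per-message acks or membership agreement, so it scales
# gracefully, but delivery is only probabilistic.
import random

def gossip(n: int, fanout: int = 3, rounds: int = 8, drop: float = 0.2) -> float:
    """Return the fraction of n processes reached, where each relay is
    independently dropped with probability `drop`."""
    have = {0}                                   # the original sender
    for _ in range(rounds):
        for p in list(have):                     # every holder relays
            for q in random.sample(range(n), fanout):
                if random.random() > drop:       # lossy, unreliable links
                    have.add(q)
    return len(have) / n

random.seed(1)
print(gossip(n=1000))    # typically ~1.0: most runs reach nearly everyone
```

The interesting property is the trend: unlike ack-based reliable multicast, the delivery probability of such protocols tends to improve, not degrade, as n grows.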
Curious problems?
• Is reliability/consistency fundamentally “unstable”?
– Many systems tend to thrash if reliability technology is scaled up enough (except if the goals are probabilistic)
– Example: reliable 1-n communication is harder and harder to scale as n gets larger. (But probabilistically reliable 1-n communication may do better as we scale.)
– The underlying theory is completely unknown: a research topic
• ... can we develop a “90% reliable” protocol? Is there a window of stable behavior?

Consistency is ubiquitous
• We often talk about “the” behavior of “a” system, not the “joint” behavior of its components
• The implication is that our specifications implicitly assume that there is a mapping from “the system” to the execution model
• Confusion over consistency thus has fundamental implications. It is one of the hardest problems we face in distributed computing today!

Moving on…
• But enough about replication
• Now we start to think about higher-level system issues
• Next week: briefly look at what is known about how and why systems fail
• Then look at a variety of structuring options and trends…