CS514: Intermediate Course in Operating Systems
Professor Ken Birman
Ben Atkin: TA
Lecture 26: Nov. 28

Scalability
• We’ve heard about many technologies
• Looked at implementations, features, and reliability issues
  – Illustrated with real systems
• But what about scalability?

Dimensions of Scalability
• We often say that we want systems that “scale”
• But what does scalability mean?
• As with reliability and security, the term “scalability” is very much in the eye of the beholder

Scalability
• Marketing perspective:
  – Does a technology potentially reach large numbers of consumers?
  – Can the vendor sell it without deploying human resources in proportion to the size of the market?
  – Is the market accessible with finite resources?
• Not our “worry” here in CS514, but in the real world you’ll need to keep this point of view in mind!

Scalability
• As a reliability question:
  – Suppose a system experiences some rate of disruptions r
  – How does r change as a function of the size of the system?
• If r rises when the system gets larger, we would say that the system scales poorly
• Need to ask what “disruption” means, and what “size” means…

Scalability
• As a management question:
  – Suppose it takes some amount of effort to set up the system
  – How does this effort rise for a larger configuration?
  – Can lead to surprising discoveries
    • E.g. the 2-machine demo is easy, but setup for 100 machines is extremely hard to define

Scalability
• As a question about throughput:
  – Suppose the system can do t operations each second
  – Now I make the system larger
• Does t increase as a function of system size? Decrease?
• Is the behavior of the system stable, or unstable?

Scalability
• As a question about dependency on configuration:
  – Many technologies need to know something about the network setup or properties
  – The larger the system, the less we know!
  – This can make a technology fragile, hard to configure, and hence poorly scalable

Scalability
• As a question about costs:
  – Most systems have a basic cost
    • E.g. 2PC “costs” 3N messages
  – And many have a background overhead
    • E.g. gossip involves sending one message per round, receiving (on average) one per round, and doing some retransmission work (rarely)
• Can ask how these costs change as we make our system larger, make the network noisier, etc.

Scalability
• As a question about environments:
  – Small systems are well-behaved
  – But large ones are more like the Internet
    • Packet loss rates and congestion can be problems
    • Performance gets bursty and erratic
    • More heterogeneity of connections and of the machines on which applications run
  – The larger the environment, the nastier it may be!

Scalability
• As a proactive question:
  – How can we design for scalability?
  – We know a lot about technologies now
  – Are certain styles of system more scalable than others?

Approaches
• Many ways to evaluate systems:
  – Experiments on the real system
  – Emulation environments
  – Simulation
  – Theoretical (“analytic”) analysis
• But we need to know what we want to evaluate

Dangers
• “Lies, damn lies, and statistics”
  – It is much too easy to pick some random property of a system, graph it as a function of something, and declare success
  – We need sophistication in designing our evaluation or we’ll miss the point
• Example: message overhead of gossip
  – Technically, O(n) messages per round across the whole system
  – Does any single process or link see this cost?
  – Perhaps not, if the protocol is designed carefully (a small sketch follows)
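To make the gossip point concrete, here is a minimal sketch, not from the lecture, with made-up parameters (N_NODES, ROUNDS): in push-style gossip, total traffic per round is O(n), yet each individual node sends exactly one message and receives about one on average.

```python
import random
from collections import Counter

# Illustrative-only push-gossip sketch: each node gossips to one randomly
# chosen peer per round.  Aggregate traffic per round is O(n), but the load
# seen by any single node stays roughly constant.
N_NODES = 1000   # assumed cluster size
ROUNDS = 50      # assumed number of gossip rounds

sent = Counter()      # messages sent by each node
received = Counter()  # messages received by each node

for _ in range(ROUNDS):
    for node in range(N_NODES):
        peer = random.randrange(N_NODES - 1)   # pick a peer other than ourselves
        if peer >= node:
            peer += 1
        sent[node] += 1
        received[peer] += 1

total = sum(sent.values())
print("total messages:", total)                              # ROUNDS * N_NODES, i.e. O(n) per round
print("sent per node per round:", total / (N_NODES * ROUNDS))            # exactly 1
print("received per node per round (avg):",
      sum(received.values()) / (N_NODES * ROUNDS))                       # about 1
print("busiest receiver saw:", max(received.values()), "messages")
```

The per-node figures stay near one message per round no matter how large N_NODES is, which is why the aggregate O(n) number, graphed by itself, can mislead.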
Technologies we’ve seen
• TCP/IP and O/S message-passing architectures like U-Net
• RPC and client-server architectures
• Transactions and nested transactions
• Virtual synchrony and replication
• Other forms of multicast
• Object-oriented architectures
• Cluster management facilities

How well do they scale?
• Consider a web server
• Life consists of responding to HTTP requests
  – RPC over TCP
• Where are the costs?
  – TCP: lots of channels, which raises costs
    • For example, the “SO_LINGER” issue
    • Also the uncoordinated-congestion issue
  – Disk I/O bottleneck…

Web server
• Usual solution is to split the server over many machines and spray requests across them
• Jim Gray calls this a cloned architecture
• Works best for unchanging data (so-called soft-state replication)

Web server
• Failure detection: a scaling limit
  – With a very large server, failure detection gets hard
    • If a node crashes, how should we detect this?
    • Need for “witnesses”? A “jury”?
    • Notification architecture…
  – Modes of failure may grow with scale

Web server
• Sample scaling issue
  – Suppose failure detection is “slow”
  – With more nodes, we have
    • More failures
    • More opportunities for anomalous behavior that may disrupt response
  – The load-balancing system won’t know about failures for a while… so it might send work to faulty nodes, creating a perception of downtime!

Porcupine
• This system was a scalable email server built at the University of Washington
• It uses a partitioning scheme to spread load and a simple fault-tolerance scheme
• Papers show near-linear scalability in cluster size

Porcupine
• Does Porcupine scale?
  – The papers say “yes”
  – But in their work we can see a reconfiguration disruption when nodes fail or recover
    • With larger scale, the frequency of such events will rise
    • And the cost is linear in system size
  – Very likely that on large clusters this overhead would become dominant!

Transactions
• Do transactional systems scale?
  – A complicated question
  – Scale in:
    • Number of users and patterns of use
    • Number of participating servers
    • Use or non-use of replication
  – Different mixes give different results

Transactions
• Argus didn’t scale all that well
  – This was the MIT system we discussed, with transactions in the language itself
  – High costs of 2PC and concurrency control motivated a whole series of work-arounds
    • Some worked well, like “lazy commit”
    • But others seemed awkward, like “top-level transactions”
  – But how typical is Argus?

Transactions
• Transactional replicated data
  – Not very scalable
  – Costly quorum reads and writes
    • Reads must touch more than one node, and writes must touch many as well
  – And 2PC gets costly
  – Jim Gray wrote a paper on this
    • Concludes that naïve replication will certainly not scale

Transactions
• Yet transactions can scale in some dimensions
  – Many very, very large data-server architectures have been proposed
  – Partitioned database architectures offer a way to scale on clusters
  – And queries against a database can scale if we clone the database

Transactions
• Gray talks about RAPS, RACS, and Farms
  – RAPS: Reliable Array of Partitioned Servers; RACS: Reliable Array of Cloned Servers
  – A Farm is a set of such clusters
• Little technology exists in support of this model
  – But the model is an attractive one
  – It forces database people to think in terms of scalability from the outset, by structuring their database with scalability as a primary goal!
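The following is a minimal sketch of the partitioned/cloned idea, not taken from Gray’s paper; the names (Partition, partition_for, N_PARTITIONS) are invented for illustration. Keys are hashed to a partition, reads can go to any clone of that partition, and updates are sent to every clone.

```python
import hashlib
import itertools

# Illustrative-only sketch of the RAPS/RACS idea: the key space is split into
# partitions, and each partition is served by a small set of clones.
N_PARTITIONS = 4          # assumed number of partitions
CLONES_PER_PARTITION = 3  # assumed degree of cloning

class Partition:
    def __init__(self, pid):
        self.clones = [f"server-{pid}-{c}" for c in range(CLONES_PER_PARTITION)]
        self._rr = itertools.cycle(self.clones)   # round-robin over clones for reads

    def read(self, key):
        # Any clone can serve a read; spread the load round-robin.
        return ("read", key, next(self._rr))

    def update(self, key, value):
        # Updates must reach every clone so the copies stay consistent.
        return [("update", key, value, clone) for clone in self.clones]

partitions = [Partition(pid) for pid in range(N_PARTITIONS)]

def partition_for(key: str) -> Partition:
    # Hash-partition the key space so data and load are spread evenly.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return partitions[h % N_PARTITIONS]

print(partition_for("mailbox:alice").read("mailbox:alice"))
print(partition_for("mailbox:alice").update("mailbox:alice", "new message"))
```

Reads scale with the number of clones, while every update costs one message per clone, which is one way to see why cloning “works best for unchanging data.”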
Replicated Data
• Common to view reliable multicast as a tool for replicating data
• Here, scalability is tied to the data model
  – Virtual synchrony works best for small groups
  – Pbcast has a weaker replication model but scales much better for large ones
• If our goal were stability of a video stream, or some other application, we might opt for other design points

Air Force JBI
• Goal of anywhere, anytime access to replicated data
• The model selected assumes
  – Publish-subscribe communication
  – Objects represented using XML
  – An XML ontology as the basis of a filtering scheme
    • Basically, think of objects as “forms”
    • A subscriber can fill in values for the fields where specific values are desired
    • Data is delivered wherever an object matches
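A minimal sketch of this “forms” style of filtering, with invented field names and a hypothetical matches helper: a subscription is a partially filled form, and an object is delivered to every subscriber whose filled-in fields all agree with the object’s fields.

```python
# Illustrative-only sketch of form-style content filtering: a subscription is a
# partial form (only the fields the subscriber cares about), and an object,
# reduced here to a dict of fields, matches if it agrees on every filled-in field.
def matches(subscription: dict, obj: dict) -> bool:
    return all(obj.get(field) == value for field, value in subscription.items())

# Hypothetical field names and values, purely for illustration.
subscriptions = {
    "tanker-watch": {"type": "aircraft", "role": "tanker"},
    "all-tracks":   {"type": "aircraft"},
}

obj = {"type": "aircraft", "role": "tanker", "callsign": "TEXACO11", "alt": 28000}

for name, sub in subscriptions.items():
    if matches(sub, obj):
        print(f"deliver object to subscriber {name!r}")
```

The scalability question then becomes how many such partial forms a router can evaluate per published object, and how that cost grows with the number of subscribers.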
JBI has many subproblems
• The name-space issue (the XML ontology used for filtering)
• Online data delivery
• Tracking of real-time state
  – Probably need an in-core database at the user
• Long-term storage
  – Some form of database that can be queried
  – Apparently automatic and self-organized

Air Force JBI
• Does this model scale?
  – Need to ask what properties applications require
    • Real-time? Rapid response? Stability?
    • Guarantees of reliability?
  – Also need to ask what the desired model of use would be
    • Specific applications? Or arbitrary ones?
    • Steady patterns of data need? Or dynamic?
  – And questions arise about the network

Publish-subscribe
• Does publish-subscribe scale?
  – In practice, publish-subscribe does poorly in large networks
    • Fragile, easily disrupted
    • Over-sensitive to network configuration
    • Behavior similar to that of SRM (high overheads in large networks)
  – But IBM’s Gryphon router does better

Use Gryphon in JBI?
• Raises a new problem
  – Not all data is being sent at all times
  – In fact most data lives in a “repository”
• Must the JBI be capable of solving application-specific database-layout problems in a general, automatic way?
• Some data is sent in real time, some in slow motion
• Some data is only sent when requested
  – Gryphon only deals with the online part

JBI Scalability
• Points to the dilemma seen in many scalability settings
  – Systems are complex
  – Use patterns are complex
  – Very hard to pose the question in a way that can be answered convincingly

How do people evaluate such questions?
• Build one and evaluate it
  – But the implementation could limit performance, and the evaluation might reach the wrong conclusion
• Evaluate component technology options (existing products)
  – But these were tuned for other uses, and perhaps the JBI won’t match the underlying assumptions
• Design a simulation
  – But it’s only as good as your simulation…

Two Side-by-Side Case Studies
• American Advanced Automation System
  – Intended as a replacement for the air traffic control system
  – Needed because Pres. Reagan fired many controllers in 1981
  – But the project was a fiasco; it lost $6B
• French Phidias System
  – Similar goals, slightly less ambitious
  – But rolled out, on time and on budget, in 1999

Background
• Air traffic control systems are using 1970s technology
• Extremely costly to maintain and impossible to upgrade
• Meanwhile, the load on controllers is rising steadily
• Can’t easily reduce the load

Politics
• The government wanted to upgrade the whole thing and solve a nagging problem
• Controllers demanded various simplifications and powerful new tools
• Everyone assumed that what you use at home can be adapted to the demands of an air traffic control center

Technology
• IBM bid the project and proposed to use its own workstations
• These aren’t super reliable, so IBM proposed to adopt a new approach to “fault-tolerance”
• The idea is to plan for failure
  – Detect failures when they occur
  – Automatically switch to backups

Core Technical Issue?
• The problem revolves around high availability
• Waiting for a restart was not seen as an option: the goal was 10 seconds of downtime in 10 years
• So IBM proposed a replication scheme much like the “load balancing” approach
• IBM had the primary and the backup simply do the same work, keeping them in the same state

Technology
[Diagram: conceptual flow of the system: radar → find tracks → identify flight → look up record → plan actions → human action]
[Diagram: IBM’s fault-tolerant process-pair concept: the same pipeline with each stage duplicated across a pair of processes]

Why is this Hard?
• The system has many “real-time” constraints on it
  – Actions need to occur promptly
  – Even if something fails, we want the human controller to continue to see updates
• IBM’s technology
  – Based on a research paper by Flaviu Cristian
  – Each “process” is replaced by a pair, and the elements behave identically (duplicated actions are detected and automatically discarded)
  – But it had never been used except for proof-of-concept purposes, on a small scale in the laboratory
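A minimal sketch of the process-pair idea just described, with invented names (Replica, DuplicateFilter, plan_action) and a deliberately trivial failure model: both replicas process every input, and the downstream stage keeps only the first copy of each output it sees, so the loss of one replica is masked without a visible gap.

```python
# Illustrative-only sketch of a process pair: two replicas run the same
# deterministic step on every input; a downstream filter discards the
# duplicate, so output continues even if one replica has crashed.
def plan_action(track_id: int, position: float) -> tuple:
    # Stand-in for one deterministic pipeline stage (e.g. "plan actions").
    return (track_id, position, "advisory")

class Replica:
    def __init__(self, name):
        self.name = name
        self.alive = True

    def process(self, seqno, track_id, position):
        if not self.alive:
            return None
        # Tag the result with a sequence number so duplicates can be recognized.
        return (seqno, plan_action(track_id, position))

class DuplicateFilter:
    def __init__(self):
        self.delivered = set()

    def deliver(self, result):
        if result is None:
            return
        seqno, action = result
        if seqno in self.delivered:       # second copy of the same action
            return
        self.delivered.add(seqno)
        print("deliver to controller console:", action)

primary, backup = Replica("primary"), Replica("backup")
sink = DuplicateFilter()

for seqno, (track, pos) in enumerate([(101, 12.5), (101, 13.0), (101, 13.5)]):
    if seqno == 1:
        primary.alive = False             # simulate a crash mid-stream
    sink.deliver(primary.process(seqno, track, pos))
    sink.deliver(backup.process(seqno, track, pos))
```

This sketch assumes deterministic stages and a perfect duplicate filter; coping with nondeterministic behavior and deciding which replica has actually failed are much harder problems that it does not address.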
Politics
• IBM’s proposal sounded good…
  – … and they were the second-lowest bidder
  – … and they had the most aggressive schedule
• So the FAA selected them over the alternatives
• IBM took on the whole thing all at once

Disaster Strikes
• Immediate confusion: all parts of the system seemed interdependent
  – To design part A, I need to know how part B, also being designed, will work
• Controllers didn’t like the early proposals and insisted on major changes to the design
• The fault-tolerance idea was one of the reasons IBM was picked, but it made the system so complex that it went on the back burner

Summary of Simplifications
• Focus on some core components
• Postpone worrying about fault-tolerance until later
• Try to build a simple version that can be fleshed out later
• … but the simplification wasn’t enough. Too many players kept intruding with requirements

Crash and Burn
• The technical guys saw it coming
  – Probably as early as one year into the effort
  – But they kept it secret (the “bad news diode”)
  – Anyhow, management wasn’t listening (“they’ve heard it all before – whining engineers!”)
• The fault-tolerance scheme didn’t work
  – Many technical issues were unresolved
• The FAA kept out of the technical issues
  – But a mixture of changing specifications and serious technical issues was at the root of the problems

What came out?
• In the USA, nothing
• The entire system was useless: the technology was of an all-or-nothing style, and nothing was ready to deploy
• The British later rolled out a very limited version of a similar technology, late and with many bugs, but it does work…

Contrast with the French
• They took a very incremental approach
  – The early design sought to cut back as much as possible
  – If it isn’t “mandatory,” don’t do it yet
  – Focus was on the console cluster architecture and fault-tolerance
• They insisted on using off-the-shelf technology

Contrast with the French
• Managers intervened in technology choices
  – For example, the vendor wanted to build a home-brew fault-tolerance technology
  – The French insisted on a specific existing technology and refused to bid out the work until vendors accepted
  – A critical “good call,” as it worked out

Virtual Synchrony Model
• Was supported by a commercially available software technology
• Already proven in the field
• Could solve the problem IBM was looking at…
  – But the French decided this was too ambitious
  – They limited themselves to “replicating” the state of the consoles in a controller position
  – This is a simpler problem: basically, it keeps the internal representations of the display consistent

Learning by Doing
• To gain experience with the technology
  – They tested, and tested, and tested
  – They designed simple prototypes and played with them
  – They discovered that a large cluster would perform poorly
  – But they found a “sweet spot” and worked within it
  – This forced the project to cut back on some goals
    • They could easily handle clusters of 3-5 machines
    • But they abandoned hope of treating the entire 150-machine network as a single big cluster, due to “scalability” concerns

Testing
• 9/10ths of the time and expense on any system goes into
  – Testing
  – Debugging
  – Integration
• Many projects overlook this
• The French planned conservatively

Software Bugs
• Figure roughly one bug per 10 lines of new code
• But closer to one bug per 250 lines in old, well-exercised code
• Bugs show up under stress
• The trick is to run a system in an unstressed mode
• The French identified “stress points” and designed to steer far from them
• Their design also assumed that components would fail, and it automated the restart

All of this worked!
• Take-aways from the French project?
  – Break big projects into little pieces
  – Do the critical core first, separately, and focus exclusively on it
  – Test, test, test
  – Don’t build anything you can possibly buy
  – Management was technically sophisticated enough to make some critical “calls”

Scalability Conclusions?
• Does XYZ scale?
  – Not a well-posed question!
• Best to see scalability as a process, not a specific issue
• Can learn a great deal from the French
  – Divide and conquer (limit scale)
  – Test to understand limits, and avoid them by miles (a small measurement sketch follows)
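As a closing illustration of “test to understand limits,” here is a minimal, entirely hypothetical sketch of the kind of harness one might use: sweep the system size, measure a throughput-like metric, and note where it stops improving. The measure_throughput function and its overhead model are invented stand-ins for a real benchmark run.

```python
# Illustrative-only sketch of "scalability as a process": sweep system size,
# measure, and look for the knee where adding nodes stops helping.
def measure_throughput(n_nodes: int) -> float:
    # Stand-in for a real benchmark; a toy model in which per-node coordination
    # overhead grows with cluster size and eventually dominates.
    per_node_capacity = 100.0
    coordination_cost = 0.004 * n_nodes * n_nodes   # assumed overhead model
    return max(0.0, n_nodes * per_node_capacity - coordination_cost * n_nodes)

best_size, best_tput = 0, 0.0
for n in range(1, 201):
    tput = measure_throughput(n)
    if tput > best_tput:
        best_size, best_tput = n, tput

print(f"throughput peaks at about {best_size} nodes ({best_tput:.0f} ops/sec)")
print("a conservative design would stay well below that knee")
```

The point is the process, not the numbers: measure early with real or simulated load, find the knee, and then, as the French project did, design to operate well inside it.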