CS514: Intermediate Course in Operating Systems
Professor Ken Birman
Ben Atkin: TA

Perspectives on Computing Systems and Networks
• CS314: Hardware and architecture
• CS414: Operating Systems with a focus on single-processor and multi-processor systems
• CS513: A course on security for operating systems and networks
• CS514: Emphasis on “middleware”: networks, distributed computing, technologies for building reliable applications over the middleware
• CS614: A survey of current research frontiers in the operating systems and middleware space
• CS444, CS476, CS644, CS676: networks, routers, theory of network protocols, not offered recently

Styles of Course
• CS514 tries to be practical in emphasis:
  – We look at the tools used in real products and real systems
  – The focus is on technology one could build / buy
  – But not specific products
• CS614 emphasis is on research opportunities
  – We try to understand the state of the art
  – Idea is to find good research topics
• Both have projects, but
  – CS514 builds on popular middleware components
  – CS614 tries to break new ground

Recent Trends
• Massive network rollout
• Larger and larger numbers of small devices, web-compatible cell phones
• Object orientation and components emerge as prevailing structural option
• Widespread use of transactions for reliability and atomicity
• XML: The web-ization of everything
• Java/Jini, .NET: code can run on anything
• Client-server yielding to scalable replication

Understanding Trends
• Basically two options
  – Study the fundamentals
  – Then apply to specific tools
• Or
  – Study specific tools
  – Extract fundamental insights from examples

Ken’s bias
• I work on reliable, secure distributed computing
  – Air traffic control systems
  – Stock exchanges
  – Next generation electric power grid
• To me, the question is: How can we
  build systems that do what we need them to do, reliably, accurately, and in a secure manner?

Butler Lampson’s Insight
• Why computer scientists didn’t invent the web
  – CS researchers would have wanted it to “work”
  – The web doesn’t really work
  – But it doesn’t really need to!
• Gives some reason to suspect that Ken’s bias isn’t widely shared!

World Wide Web
• A seductive pastime, but increasingly seen as a serious business model
• Idea would be to put information you need at your fingertips to enable better, more informed, more intelligent actions
• The Web can also replace paper entirely: a world-wide tool for sharing knowledge

Relying on the Web: Banking
• Companies and individuals will need to rely on the Web for this model to work:
  – Broker will rely upon up-to-the-minute stock quotes and investment data and advice
  – Back office will trade stocks based on what the broker currently wants
  – Criminals will try to violate security/privacy to steal funds or manipulate trades

Relying on the Web: Medicine
• Web-style interface in a hospital
• Doctor relies on accuracy of patient status records to make treatment decisions
• Nurse relies on accuracy of drug dosage and frequency data to administer treatment
• Hospital legally obligated to provide for security and privacy of the data

Relying on the Web: Publisher
• More and more publications will go electronic in coming years (so will movies, MTV videos, classical music, etc.)
• Publisher’s edge: quality of authors, quality of material. Will “sell” information
• But for this to work, need reliable ways to charge for access and to limit access to authorized individuals!

Air Traffic Control on the Web
• Web interface could easily show planes, natural for controller interactions
• But clearly need to know that trajectory and flight data is current and consistent
• Also need help with routing options
• Continuous availability is vital.
  Security and privacy also needed

New Air Traffic Control System: AAS
• Started by FAA in 1989 to replace existing ATC system
• Current system has video display of radar for controllers to use
• Database has information about each flight
• Telephones to talk to the planes

ATC systems divide country up

More details on ATC
• Each sector has a control center
• Centers may have few or many (50) controllers
• Data comes from a radar system that broadcasts updates every 10 seconds
• Database keeps other flight data
• Controllers each “own” smaller subsectors

Current System has Problems!
• Overloaded computers that often crash
• Getting slow as volume of air traffic rises
• Inconsistent displays a problem: phantom planes, missing planes, stale information
• Some major outages recently (Newark down for 1/2 hour, LA down for 1 hour in 1995). One near-miss associated with LA outage

Concept of New System
• Replace video terminals with workstations
• Build a highly available real-time system guaranteeing no more than 3 seconds downtime per year
• Offer much better user interface to ATC controllers, with intelligent course recommendations and warnings about future course changes that will be needed

ATC Architecture
[Figure: architecture diagram; surviving labels: NETWORK INFRASTRUCTURE, DATABASE]

Technologies Used
• Base on standard, off-the-shelf workstations (easier to maintain, upgrade, manage)
• IBM proposed software for fault tolerance and consistent system implementation
• Fancy graphical user interface much like the Web, pop-up menus for control decisions, etc.

Project Was a Fiasco!!
• IBM unable to implement a fault-tolerant software architecture! Problem was much harder than they expected.
• Even a non-distributed interface turned out to be very hard, major delays, scaled back goals
• Resulting system is unsatisfactory even before delivery

Free Flight
• Many think this is the next step in aviation
• Planes use GPS receivers to track own location accurately
• Combine radar and a shared database to see each other
• Each pilot makes own routing decisions
• ATC controllers only act in emergencies

Free Flight (cont)
• Now each plane is like an ATC workstation
• Each pilot must make decisions consistent with those of other pilots
• ... but if FAA’s project failed in 1994, why should free flight succeed in 2010?
• Something is wrong with the distributed systems infrastructure!

Other critical applications
• Banking, stock markets, stock brokers
• Health care, hospital automation
• Control of power plants, electric grid
• Telecommunications infrastructure
• Electronic commerce and electronic cash on the Web (very important emerging area)
• Corporate “information” base: a company’s memory of decisions, technologies, strategy
• Military command, control, intelligence systems

We depend on distributed systems!
• If these critical systems don’t work
  – When we need them
  – Correctly
  – Fast enough
  – Securely and privately
• ... then revenue, health and safety, and national security may be at risk!

Signs of a Crisis in Computing
• Highly visible fiascos: ATC project, Denver luggage handling system, London Stock Exchange.
• Hackers pose an increasingly serious threat: disrupted telephone services, break-ins to critical computing systems
• Vendors offering little in the way of reliability (security situation is better)

Critical Needs of Critical Applications
• Security: Can tell who is doing what and can use this to enforce authorization
• Privacy: Intruders can’t see data or user IDs
• Availability: System is continuously “up”
• Recoverability: Can restart failed components
• Consistency: Actions of system at different locations are consistent with each other.
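
The availability requirement can be made concrete with a little arithmetic. As a rough sketch, the code below takes the AAS design goal quoted earlier (no more than 3 seconds of downtime per year) and converts it into an availability fraction and a count of "nines". The function names and the 365-day year are assumptions made for illustration; they do not come from the course material.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a non-leap year


def availability(downtime_seconds_per_year: float) -> float:
    """Fraction of the year the system is up, given its yearly downtime."""
    return 1.0 - downtime_seconds_per_year / SECONDS_PER_YEAR


def nines(downtime_seconds_per_year: float) -> float:
    """Number of 'nines' of availability (99.9% corresponds to 3 nines)."""
    if downtime_seconds_per_year <= 0:
        return math.inf  # no downtime at all: unbounded number of nines
    return -math.log10(downtime_seconds_per_year / SECONDS_PER_YEAR)


if __name__ == "__main__":
    aas_goal = 3.0  # seconds of downtime per year, from the AAS design goal
    print(f"availability: {availability(aas_goal):.10f}")
    print(f"nines: {nines(aas_goal):.2f}")
    # For comparison: a single 1-hour outage in a year, taken on its own.
    print(f"nines for one 1-hour outage: {nines(3600.0):.2f}")
```

Under these assumptions the 3-second goal works out to roughly seven nines, while a single hour-long outage in a year already limits a system to just under four nines.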

Web Brownouts
[Figure: numbered path (1-9) of a Web request. Name service nodes: cafe.org, sf.cafe.org, cornell.edu, cs.cornell.edu. Other components: Local Web Proxy (cached documents), Cornell Web Proxy (cached documents), Cornell Web Server. Captions: “The network name service is structured like an inverted tree.” “Web browser’s system only needs to contact local name and web services.”]
• Domain name service (DNS) can overload (1-3)
• Server or proxies can overload, crash (4-9)
• Communication lines can overload or break
• DNS or proxy can return “stale” data

Infrastructure Needs to Change
• To avoid brownouts need to make more use of replicated (cached) data
• DNS replication: caching of host addresses
• Web proxies: replicate copies of documents
• Creates a new challenge:
  – Coherence: guarantee that a cached copy of an object is up to date

What this course is really about
• Distributed computing is rapidly transforming the way we work and live, and the way that companies do business.
• Increasingly, distributed computing systems are the only ones you can buy.
• The challenge: build distributed systems which can be relied upon in critical settings

What’s the Story Today?
• Few distributed systems or Web applications consider reliability issues
• The ones that do worry about reliability are often naive about what they are getting into, leading to highly visible failures
• But we do have technical answers to many of the basic problems and some exciting initial options

Goals for this course?
• Understand the basic technologies from which distributed systems are constructed
• Maintain a degree of emphasis on reliability issues throughout: how reliable are the standard technologies? Can they be used reliably despite their limitations?
• Look at advanced technologies in context of real systems built in standard ways

Trends are changing
• More and more pressure on industry
  – When the network is down, your company won’t make money
  – Clients want tools they can rely on
• This is creating pressure on vendors who offer middleware
• Result is a new emphasis on scalability and reliability
• We want reliability, as long as we can have performance and scalability too.

Technologies we will cover
• RPC and client-server computing; Streams
• Internet technologies (email, news, msg. bus) and trends (the “next generation Internet”)
• DCE, CORBA, COM: Object-Oriented and Component Environments
• Web technologies (HTTP, XML), how the popular scalable architectures work
• Process group computing and scalability issues
• Transactions and reliability
• Just a Taste of Security
• System Management, Clusters, Real-time

Course Overview: 24 lectures
• Intro + Basic technologies: 4 lectures
• Web and Internet: 2 lectures
• Reliability technologies
  – Distributed “group” solutions: 6 lectures
  – Security options: 2 lectures
  – Real-time issues: 2 lectures
  – Transactional systems: 2 lectures
  – Management: 3 lectures
  – Other topics: 3 lectures

Project
• CS514 has
  – Homeworks, from time to time
  – A reasonably ambitious software project (can be used to satisfy your MEng project requirement)
• Projects can be done in groups
• Usually involve tackling reliability or scalability with some popular technology
• This semester, hoping to use two Java-oriented b2b technologies
  – HP’s eSpeak
  – BEA Systems “WebLogic”
• You’ll teach yourself how to use them

Major Themes?
• Modularity (also known as object orientation). Better structured systems are more reliable.
• Performance. Technologies need to be fast to be perceived as working well
• Exploiting group structures. These are common in reliable distributed systems
• Rigor.
  We want to know why a technique works: ad-hoc solutions often break under stress

Scalability
• Suddenly the hot issue for industry
• Basically, customers expect solutions that
  – Can be developed on a small scale
  – Continue to work during prime-time
  – Scalability and stability: can be considered from many dimensions
• Today, most of the most popular solutions scale poorly!

The Prevailing Mindset
• Many developers believe that reliable systems are clumsy, over-engineered, slow
• Image: a “robust bridge”. Sounds like some sort of ugly, heavy eyesore
• The Web and the Net are about elegant, light-weight, fast systems: “antithesis” of robust ones
• Reliability is also at odds with using standard components and packages

Insights From Course?
• Reliability techniques are often very elegant
• Complexity is a challenge; modularity used to control these costs
• Can achieve high performance in reliable distributed systems ... but they sometimes are hard to combine with standard technologies

Lightweight but Resilient Bridges, Secure Computing Enclaves
• A good way to imagine the technology we seek
• Our job is to build those enclaves
• Trick is to use the technical tools the right way!
• In CS514, we won’t study the security aspects of the problem in more than a shallow manner

Recommended Reading
• Textbook: read the Introduction
• While surfing the Web, think about outages
• Keep a count over half an hour of surfing the net: how often did you have problems? What sorts of problems?
• Find the University of Michigan Web pages on internet availability. What does this data tell you?
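
One way to carry out the counting exercise above is to note each problem as it happens and tally by category afterwards. The sketch below is only an illustration: the category names are hypothetical labels loosely based on the failure types listed on the Web Brownouts slide, and the sample log is made-up placeholder input, not a measurement.

```python
from collections import Counter

# Hypothetical categories, loosely following the Web Brownouts slide.
CATEGORIES = {"dns", "server_or_proxy", "connection", "stale_data"}


def tally(problems, pages_visited):
    """Count problems per category and compute problems per page visited."""
    unknown = set(problems) - CATEGORIES
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    counts = Counter(problems)
    rate = len(problems) / pages_visited if pages_visited else 0.0
    return counts, rate


if __name__ == "__main__":
    # Placeholder log for illustration only; replace with your own notes.
    log = ["dns", "stale_data", "server_or_proxy", "dns"]
    counts, rate = tally(log, pages_visited=40)
    for category, n in counts.most_common():
        print(f"{category}: {n}")
    print(f"problems per page: {rate:.2f}")
```

Keeping the categories fixed makes it easier to compare counts from one half-hour session to another.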