CS514: Intermediate Course in Operating Systems
Professor Ken Birman
Ben Atkin: TA
Lecture 2: August 29

Overview of Lecture
• Fundamentals: terminology and components of a reliable distributed computing system
• Communication technologies and their properties
• Basic communication services
• Internet protocols
• End-to-end argument

Some terminology
• A program is the code you type in
• A process is what you get when you run it
• A message is used to communicate between processes. Arbitrary size.
• A packet is a fragment of a message that might travel on the wire. Variable size but limited, usually to about 1500 bytes (the Ethernet MTU) or less.
• A protocol is an algorithm by which processes cooperate to do something using message exchanges.

More terminology
• A network is the infrastructure that links the computers, workstations, terminals, servers, etc.
  – It consists of routers
  – They are connected by communication links
• A network application is one that fetches needed data from servers over the network
• A distributed system is a more complex application designed to run on a network. Such a system has multiple processes that cooperate to do something.

A network is like a “mostly reliable” post office

Why isn’t it totally reliable?
• Links can corrupt messages
  – Rare on the high-quality links of the Internet “backbone”
  – More common with wireless connections, cable modems, ADSL
• Routers can get overloaded
  – When this happens they drop messages
  – As we’ll see, this is very common
• But protocols that retransmit lost packets can increase reliability

How do distributed systems differ from network applications?
• Distributed systems may have many components but are often designed to mimic a single, non-distributed process running at a single place.
• “State” is spread around in a distributed system
• A networked application is free-standing and centered around the user or computer where it runs (e.g. a web browser). A distributed system is spread out and decentralized (e.g. an air traffic control system).

What about the Web?
• The browser is independent: it fetches the data you request when you ask for it.
• Web servers don’t keep track of who is using them. Each request is self-contained and treated independently of all others.
  – Cookies don’t count: they sit on your machine
  – And the database of account info doesn’t count either… it records “ancient” history, nothing recent
• So the web has two network applications that talk to each other
  – The browser on your machine
  – The web server it happens to connect with… which has a database “behind” it
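Aside: statelessness in code
A minimal sketch (mine, not from the lecture; the host path and cookie values are hypothetical) of the point above: every HTTP request is self-contained, and any “state” rides along in a cookie that the browser stores and resends. The server keeps no memory of the previous exchange.

    import http.client

    # Each request carries everything the server needs, including the
    # cookie; the server treats each one independently of all others.
    conn = http.client.HTTPConnection("www.example.com")

    # Hypothetical path and cookie, purely for illustration.
    conn.request("GET", "/catalog", headers={"Cookie": "user=ken; cart=brie"})
    reply = conn.getresponse()
    reply.read()   # the reply may carry Set-Cookie to update the stashed cookie

    # A second request must repeat the cookie -- the server has already
    # forgotten the first interaction.
    conn.request("GET", "/checkout", headers={"Cookie": "user=ken; cart=brie"})
    conn.getresponse().read()
    conn.close()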
You and the Web
[Three diagrams: (1) the web browser, with stashed cookies, sends an HTTP request; the cookie identifies this user and encodes past preferences; the web servers are kept current by the database but usually don’t talk to it when your request comes in. (2) The reply updates the cookie; the web servers immediately forget the interaction. (3) A purchase is a “transaction” on the database; the web servers have no memory of the interaction.]

Examples of Distributed Systems
• Air traffic control system with workstations for the controllers
• Banking/brokerage trading system that coordinates trading (risk management) at multiple locations
• Factory floor control system that monitors devices and replans work as they go on/offline

This course is about reliability
• We want to build distributed systems that can be relied upon to do the correct thing and to provide services according to the user’s expectations
• Not all systems need reliability
  – If a web site doesn’t respond, you just try again later
  – If you end up with two wheels of brie, well, throw a party!
• Reliability is a growing requirement in “critical” settings, but these remain a small percentage of the overall market for networked computers

Reliability is a broad term
• Fault-tolerance: remains correct despite failures
• High or continuous availability: resumes service after failures, doesn’t wait for repairs
• Performance: provides desired responsiveness
• Recoverability: can restart failed components
• Consistency: coordinates actions by multiple components, so they mimic a single one
• Security: authenticates access to data, services
• Privacy: protects identity, locations of users

“Failure” also has many meanings
• Halting failures: component simply stops
• Fail-stop: halting failures with notifications
• Omission failures: failure to send or receive a message
• Network failures: a network link breaks
• Network partition: network fragments into two or more disjoint subnetworks
• Timing failures: action happens too early or too late; clock fails, etc.
• Byzantine failures: arbitrary malicious behavior

Examples of failures
• My PC suddenly freezes up while running a text-processing program. No damage is done. This is a halting failure.
• A network file server tells its clients that it is about to shut down, then goes offline. This is a fail-stop failure. (The notification can be trusted.)
• An intruder hacks the network and replaces some parts with fakes. This is a Byzantine failure.

More terminology
• A real-world network is what we work on. It has computers, links that can fail, and some problems synchronizing time. But this is hard to model in a formal way.
• An asynchronous distributed system is a theoretical model of a network with no notion of time
• A synchronous distributed system, in contrast, has perfect clocks and bounds the timing of all events, such as message passing.

Model we’ll use?
• Our focus is on real-world networks, halting failures, and extremely practical techniques
• The closest model is the asynchronous one; we use it to reason about protocols
  – Most often, we employ the asynchronous model to illustrate techniques we can actually implement in real-world settings
  – And usually employ the synchronous model to obtain impossibility results
  – Question: why not prove impossibility results in an asynchronous model, or use the synchronous one to illustrate techniques that we might really use?
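Aside: why timeouts fail in the asynchronous model
A minimal sketch (mine, not from the lecture) of what “no notion of time” costs us. In the asynchronous model, message delay has no upper bound, so any finite timeout will sometimes declare a live process crashed. The heavy-tailed delay distribution below is an illustrative stand-in for “unbounded delay”.

    import random

    def deliver_delay():
        # Pareto delays have an unbounded tail: some messages take
        # arbitrarily long, which models the asynchronous assumption.
        return random.paretovariate(1.0)   # "seconds"

    def probe(timeout):
        # Declare the peer crashed if no reply arrives within the timeout.
        return "alive" if deliver_delay() <= timeout else "presumed crashed"

    random.seed(514)
    trials = 10_000
    for timeout in (1, 10, 100):
        wrong = sum(probe(timeout) == "presumed crashed" for _ in range(trials))
        print(f"timeout={timeout:>3}s: {wrong / trials:.1%} of live peers misjudged")

However large the timeout, the misjudged fraction never reaches zero; this is the intuition behind using the asynchronous model to reason carefully about failure detection.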
ISO protocol layers: oft-cited standard
  Application  – the program using a communication connection
  Presentation – software to encode data into messages, and decode on reception
  Session      – logic associated with guaranteeing end-to-end reliability and flow control, if desired
  Transport    – software for fragmenting big messages into small packets
  Network      – routing functionality, limited to small packets
  Data-link    – the protocol that represents packets on the wire
• ISO is tied to a TCP-style of connection
• Match with modern protocols is poor
• We are mostly at the session layer

Internet protocol suite
• Can be understood in terms of ISO
• Defines the “addressing” standard, a basic network layer (IP packets, limited by the link MTU, about 1500 bytes on Ethernet), and session protocols (TCP, UDP, UDP multicast)
• Includes the standard “domain name service” that maps host names to IP addresses
• DNS itself is tree-structured and caches data

Major internet protocols
• TCP, UDP, FTP, Telnet
• Email: Simple Mail Transfer Protocol (SMTP)
• News: Network News Transfer Protocol (NNTP)
• DNS: Domain name service protocol
• NIS: Network information service (a.k.a. “YP”)
• LDAP: Lightweight Directory Access Protocol, for querying a directory database on a computer
• NFS: Network file system protocol for UNIX
• X11: X-server display protocol
• Web: HyperText Transfer Protocol (HTTP), and SSL (one of the widely used security protocols)

Typical hardware options
• Ethernet: 10 Mbit CSMA/CD technology, limited to roughly 1500-byte packets. Uses a single coax cable.
• FDDI: token-ring technology; its dual ring is self-repairing if a cable breaks
• Bridged Ethernet: common in big LANs; multiple Ethernet segments linked by bridges
• Fast Ethernet: 100 Mbit version of Ethernet
• ATM: switching technology for fiber-optic paths. Can run at 155 Mbits/second or more. Very reliable, but mostly used in telephone systems.

Implications for reliability?
• Protocol designers have problems predicting the properties of local-area networks
• Latencies and throughput may vary widely even within a single installation
• Hardware properties differ widely; often, one must assume the least common denominator
• Packet loss is a minor problem in the hardware itself

Technology trends
[Chart: CPU MIPS, memory MB, LAN Mbits, WAN Mbits, and O/S overhead plotted for 1985–1990, 1990–1995, 1995–2000, and 2000–2005. Note the tremendous growth in WAN speeds. Source: Scientific American, Sept. 1995]

Typical latencies (milliseconds)
[Chart, log scale from 0.01 to 1000 ms, 1985–2005: curves for disk I/O, Ethernet RPC, ATM round-trip, and WAN round-trip. WAN and disk latencies are fairly constant due to physical limitations; note the dramatic drop in LAN latencies over ATM.]

O/S latency: the most expensive overhead on LAN communication!
[Chart: O/S overhead as a percentage of LAN communication latency, 1985–1990 versus 1995–2000.]

Broad observations
• A discontinuity is currently occurring in WAN communication speeds!
• The other performance curves are all similar
• Disks have “maxed out” and hence are looking slower and slower
• Memory of remote computers looks “closer and closer”
• O/S-imposed communication latencies have risen in relative terms over the past decade!

Implications?
• The revolution in WAN communication we are now seeing is not surprising and will continue
• Look for a shift from disk storage towards more use of access to remote objects “over the network”
• O/S overhead is already by far the main obstacle to low latency, and this problem will seem worse and worse unless O/S communication architectures evolve in major ways.
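Aside: measuring O/S overhead directly
The claim above is easy to check on any machine (a sketch of mine, not from the lecture): bounce a UDP packet off a local socket. No wire and no router is involved, so essentially the entire round-trip time is operating-system overhead: system calls, buffer copies, and scheduling.

    import socket
    import time

    # "Server" side: a UDP socket on the loopback interface.
    echo = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    echo.bind(("127.0.0.1", 0))          # kernel picks a free port
    addr = echo.getsockname()

    # "Client" side: another UDP socket on the same machine.
    probe = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = b"x" * 64
    rounds = 1000

    start = time.perf_counter()
    for _ in range(rounds):
        probe.sendto(payload, addr)       # client -> server
        pkt, sender = echo.recvfrom(1024) # server receives...
        echo.sendto(pkt, sender)          # ...and echoes it back
        probe.recvfrom(1024)              # client sees the reply
    elapsed = time.perf_counter() - start
    print(f"mean loopback RTT: {elapsed / rounds * 1e6:.0f} microseconds")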
More Implications
• Look for full-motion video to the workstation by around 2005 or 2010
• Low LAN latencies: an unexploited “niche”
• One puzzle: what to do with extremely high data throughput but relatively high WAN latencies
• The O/S architecture, and the whole concept of the O/S, must change to better exploit the “pool of memory” of a cluster of machines; otherwise, disk latencies will loom higher and higher

Reliability and performance
• Some think that more reliable means “slower”
  – Indeed, it usually costs time to overcome failure
  – For example, if a packet is lost we probably need to resend it, and may need to solicit the retransmission
• But for many applications, performance is a big part of the application itself: too slow means “not reliable” for these!
• Reliable systems thus must look for the highest possible performance
• ... but unlike unreliable systems, they can’t cut corners in ways that make them flaky but faster

Back to the internet: IP layer
• Addresses have a machine address and a “port” number. The port selects the application when a packet arrives. (If none is bound to that port, the packet is dropped.)
• Packet size is limited by the link MTU, about 1500 bytes on Ethernet
• Each machine can have multiple addresses
• A special “broadcast” address is delivered to all
• “Class D” addresses are used for multicast groups
• We are running out of addresses, so tricks with addresses are increasingly common

IP gotcha’s
• IP messages are not authenticated in any way. The sender can lie about who it is, and can send to any host or port on the network
• A system can lie about its own machine address
• IP messages are not reliable. They can be (and are) dropped at many stages
• IP routing: basically static these days

IP multicast
[Diagram: a sender transmits to “123.45.87.51”; both machines that have accepted the IP address “123.45.87.51” receive the packet.]
• Key insight: an IP address is just an abstraction.
• Any machine can potentially “spoof” any other!

UDP protocol
• Lives above IP and looks much like IP
• Permits larger packets (fragments them into a burst of IP packets; if any is lost, the whole UDP packet is dropped on the receive side)
• Also can run in a multicast mode
• Most applications use UDP; the IP layer is typically “reserved” for kernel-level applications

UDP loss rates can be very high!
• Hunt experimented with this (the book reproduces some of his data)
• UDP is normally very reliable
• If the sender overruns the receiver, even briefly, 100% of the data may be lost!
• It is easy to provoke this problem even with source and destination on the same machine!!!
• The O/S makes no effort to detect or avoid loss!

TCP protocol
• Implemented over IP, considered “reliable”
• Supports a byte-stream model, like a pipe.
• Implemented using a sliding-window protocol
• Many variations on the protocol exist to optimize performance, tune window size, reduce header size, etc. We’ll focus on the “most basic” TCP protocol here and won’t look at the optimizations

TCP sliding window
[Diagram: a window of k “segments”, m_i ... m_(i+k). The sender provides data; IP packets carry the segments; the receiver replies with acks and nacks and consumes data; the sender resends missing data.]

Observations
• The trick is to have a big enough window that data flow is continuous. This permits sender and receiver to “match” their data rates
• Must retransmit as soon as possible, but not so soon that duplicates get through (undesirable overhead)
• The channel “breaks” after many retries, i.e. after a crash or if the network gets “lossy” for a while
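Aside: the sliding window in code
A toy sender illustrating the observations above (a sketch of mine, not the real TCP state machine; the window size, loss rate, and retry limit are illustrative). The window keeps up to k segments in flight, cumulative acks slide it forward, and the channel “breaks” after many fruitless retries.

    import random

    def transfer(num_segments, k, loss=0.2, max_retries=50):
        base = 0                  # lowest unacknowledged segment number
        retries = 0
        while base < num_segments:
            in_flight = range(base, min(base + k, num_segments))
            # Simulated lossy channel: each segment's ack arrives or not.
            acked = [random.random() > loss for _ in in_flight]
            advance = 0
            for ok in acked:      # cumulative ack: count the acked prefix
                if not ok:
                    break
                advance += 1
            if advance:
                base += advance   # slide the window; missing data resent next round
                retries = 0
            else:
                retries += 1      # nothing got through this round
                if retries > max_retries:
                    raise ConnectionError('channel "breaks" after many retries')
        return "file delivered"

    random.seed(514)
    print(transfer(100, k=8))                  # a normal network: succeeds
    try:
        print(transfer(100, k=8, loss=0.999))  # a very "lossy" network
    except ConnectionError as err:
        print(err)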
TCP failures: not “fail-stop”
• Many applications treat a broken TCP channel as a sign of a remote crash
• But in fact, a TCP channel can easily break under network overload or due to other transient conditions!
• Applications would then believe the destination to have failed when it is actually still running.

A basic insight
• In fact, there is no way to know if a message has reached its destination unless it is explicitly acknowledged
• And if it is, the sender of the ack has no way to know if the ack was received!
• Distributed systems are always slightly out of sync! This will be a very big issue for us later

Most distributed systems use UDP or TCP
• Applications typically lack permission to use raw IP (otherwise they could “break” the TCP protocol)
• UDP multicast is hard to use because it isn’t always available, and information on the hardware layout of a LAN is not available in any standard format or from any standard service
• Under heavy load, both UDP and TCP begin to misbehave (high loss rates, broken channels, etc.)

End-to-End argument
• Suppose an IP packet will take n hops to its destination, and can be lost with probability p on each hop
• Now, say that we want to transfer a file of k records that each fit in one IP (or UDP) packet
• Should we use a retransmission protocol running “end-to-end”, or n TCP protocols in a chain?

End-to-End argument
[Diagram: source → ... → dest, a path of n hops]
• Probability of successful transit: (1-p)^n
• Expected packets lost: k - k(1-p)^n

Saltzer et al. analysis
• If p is very small, the overhead of the n TCP protocols (excess bytes sent, lost CPU time, etc.) will be higher than if we just send the whole file, then retransmit the missing records
• Generalization: low-level transport systems should focus on speed, not reliability; the application layer should worry about the “properties” needed by the application

Example
• Suppose that 2% of the file will need to be resent. We wouldn’t want to impose a 10% overhead on the net, slowing the transfer down by 7%, just for this purpose!
• But the end-to-end argument would not apply if:
  – p or n is large, hence (1-p)^n approaches 0
  – the cost of recovery when a problem occurs is very high
  – the reliability property is hard for users to implement
• Justify complex mechanisms against these points.

For next time
• Read chapters 1–3
• If you were in charge of extending the communication layer, what would you change or add?
• If computers can set their own addresses and fill out packets in any way they like, how can a system like UNIX possibly support user-ids over a network? Or are user-ids a big fake?
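Aside: checking the end-to-end arithmetic
A quick worked check of the formulas on the End-to-End argument slides (mine; the values of p, n, and k are illustrative assumptions chosen to match the 2% example).

    # Per-hop loss rate, number of hops, packets in the file (illustrative).
    p, n, k = 0.002, 10, 1000

    p_transit = (1 - p) ** n           # packet survives all n hops
    expected_lost = k - k * p_transit  # resends an end-to-end protocol pays for
    print(f"P(successful transit) = {p_transit:.4f}")            # ~0.9802
    print(f"expected packets lost = {expected_lost:.1f} of {k}") # ~19.8, i.e. ~2%
    # Resending ~2% of the file end-to-end is cheaper than paying a
    # per-hop reliability overhead on every one of the k packets.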