CS542: Topics in Distributed Systems Diganta Goswami Distributed System • A collection of independent computers that appears to its users as a single coherent system. – Autonomous computers – Many components – connected by a network – sharing resources. Distributed System • A System of networked components that communicate and coordinate their actions only by passing messages – concurrent execution of programs – no global clock – components fail independently of one another Another definition • You know you have a distributed system when the crash of a computer you’ve never heard of stops you from getting any work done. – inter-dependencies – shared state – independent failure of components A working definition for us A distributed system is a collection of entities, each of which is autonomous, programmable, asynchronous and failure-prone, and which communicate through an unreliable communication medium using message passing. • Entity=a process on a device (PC, PDA) • Communication Medium=Wired or wireless network • Our interest in distributed systems involves – design and implementation, maintenance, algorithmics “Important” Distributed Systems Issues • No global clock: no single global notion of the correct time (asynchrony) • Unpredictable failures of components: lack of response may be due to either failure of a network component, network path being down, or a computer crash (failure-prone, unreliable) • Highly variable bandwidth: from 16Kbps (slow modems or Google Balloon) to Gbps (Internet2) to Tbps (in between DCs of same big company) • Possibly large and variable latency: few ms to several seconds • Large numbers of hosts: 2 to several million There are a range of interesting problems for Distributed System designers • • • Real distributed systems – Cloud Computing, Peer to peer systems, Hadoop, distributed file systems, sensor networks, graph processing, … • Classical Problems – Failure detection, Asynchrony, Snapshots, Multicast, Consensus, Mutual Exclusion, Election, … • Concurrency – RPCs, Concurrency Control, Replication Control, … • Security – Byzantine Faults, … • • Others… Typical Distributed Systems Design Goals • Common Goals: – Heterogeneity – can the system handle a large variety of types of PCs and devices? – Robustness – is the system resilient to host crashes and failures, and to the network dropping messages? – Availability – are data+services always there for clients? – Transparency – can the system hide its internal workings from the users? – Concurrency – can the server handle multiple clients simultaneously? – Efficiency – is the service fast enough? Does it utilize 100% of all resources? – Scalability – can it handle 100 million nodes without degrading service? (nodes=clients and/or servers) – Security – can the system withstand hacker attacks? – Openness – is the system extensible? Challenges and Goals of Distributed Systems • • • • • • • Heterogeneity Openness Security Scalability Failure handling Concurrency Transparency Challenges • Heterogeneity (variety and difference) of underlying network infrastructure, • Internet consists of many different sorts of network – their differences are masked by the fact that all of the computers attached to them use the Internet Protocols for communication. – e.g. a computer attached to an Ethernet has an implementation of the Internet Protocols over the Ethernet, whereas a computer on a different sort of network will need an implementation of the Internet Protocols for that network. Heterogeneity • Computer hardware and software – e.g., operating systems, compare UNIX socket and Winsock calls • Programming languages : in particular, data representations Some approaches: Middleware • A S/W layer that provides a programming abstraction as well as masking the heterogeneity of the underlying networks, H/W, O/S and programming languages. – Middleware (e.g., CORBA): transparency of network, hard- and software and programming language heterogeneity. JAVA RMI • In addition to solving the problems of heterogeneity, middleware provides a uniform computational model for use by the programmers of servers and distributed applications. Positioning Middleware • General structure of a distributed system as middleware. 1-22 Openness • Characteristic that determine whether the system can be extended and re-implemented in various ways. – Determined primarily by the degree to which new resource sharing services can be added and be made available for use by a variety of client programs. – Cannot be achieved unless the specification and documentation of the key s/w interfaces are made available to s/w developers (i.e. key interfaces are published) Openness • Designers of the Internet protocols introduced a series of documents called RFCs – Specifications of the Internet communication protocols – Specifications for applications run over them » e.g., email, telnet, file transfer, etc. (by the mid 80’s) • RFCs are not the only means --- e.g. CORBA is published through a series of documents, including a complete specification of the interfaces of its services (www.omg.org) Openness • Offering services according to standard rules that describe the syntax and semantics of those services – e.g., Network protocol rules (RFCs) • Services specified through interfaces – Interface definition languages (IDLs) • specifies names and available functions as well as parameters, return values, exceptions etc. Security • Distributed systems must protect the shared information and resources • The openness of DS makes them vulnerable to security threats • Providing security is a significant challenge for DS Security. Privacy / Confidentiality: protection against disclosure to unauthorized individuals Integrity: protection against alteration or corruption Availability: protection against interference with the means to access the resources Scalability • Scalable system—system that can handle additional number of users/resources without suffering noticeable loss of performance • Three metrics of a scalable system – No of user/resources – Distance between the farthest nodes in the system (network radius) – Number of organizations exerting control over the pieces of the system Challenges in designing scalable DS • Controlling the cost of physical resources: – As the demand for a resource grows, it should be possible to extend the system, at reasonable cost, to meet it. » e.g. it must be possible to add server computers to avoid the performance bottleneck that would arise if a single file server had to handle all file access request when the freq. of file access request grows in an intranet with the increase in users and computers. www.amazon.com is more than one computer Challenges in designing scalable DS • Controlling the performance loss: – Management of a set of data whose size is proportional to the number of users or resources in the system » e.g. the Domain Name System holds the table with the correspondence between domain names of computers and their Internet address » Hierarchic structures scale better than linar structures. Scaling Techniques 1.5 An example of dividing the DNS name space into zones. Challenges in designing scalable DS • Preventing s/w resources running out: – Numbers used as Internet address --- 32 bits was used in the late 70’s but may run out soon. – Change from 32 bits to 128 bits? – Difficult to predict the demand. – Over-compensating for future growth may be worse than adapting to a change when we are forced to - large Internet address occupy extra space in messages and in computer storage. Failure Handling • Failure in a DS is partial – Some components fail while others continue to function – This makes handling of failures difficult. Techniques for dealing with failures • Detecting failures – may be impossible – remote site crash or delay in message transmission? – Some can be. – Ex. - Checksums can be used to detect corrupted data Techniques for dealing with failures • Masking failure – Some can be hidden or made less severe – Retransmission – when messages fail to arrive Techniques for dealing with failures • Tolerating failures – Would not be practical to detect and hide all of the failures. Can be designed to tolerate some of those – e.g. timeouts when waiting for a web resource – clients give up after a predetermined number of attempts and take other actions & inform the user. Failure Handling • Recovery from failures – Rollback – Undo/Redo in transactions • Redundancy – Makes the system more available through replication of resources/data – Redundant routes in the network – Replication of name tables in multiple domain name servers Concurrency • In a distributed system it is possible that multiple machines/processes/users may try to access shared data/resource concurrently – Can potentially lead to incorrect results and/or – Deadlocks • The operations must be synchronized/serialized so that the end result is correct Transparency • Concealing the heterogeneous and distributed nature of the system so that it appears to the user like one system – Making the user believe that there is only a single, undivided system i.e., to hide the notion of distribution completely • What are the challenges of transparency? Transparency Categories • Access transparency - access local and remote resources using identical operations – e.g., users of UNIX NFS can use the same commands and parameters for file system operations regardless of whether the accessed files are on a local or remote disk. Transparency categories • Location Transparency: Access without knowledge of location of a resource – e.g., URLs, email addresses (hostname, IP addresses, etc. not required --- the part of the URL that identifies a web server domain name refers to a computer name in a domain, rather than to an Internet address) Transparency Categories • Concurrency transparency: Allow several processes to operate concurrently using shared resources in a consistent fashion w/o interference between them. – That is, users and programmers are unaware that components request services concurrently. • Replication transparency – Use replicated resource as if there was just one instance. » Increase reliability and performance w/o knowledge of the replicas by users or application programmers. Failure transparency • Enables the concealment of faults, allowing users and application programs to complete their task despite failures of h/w or s/w components. • Retransmit of email messages – eventually delivered even when servers or communication links fail – it may even take several days. Failure transparency • Failure transparency depends on concurrency and replication transparency. • Replication can be employed to achieve failure transparency • Message transmission governed by TCP is a mechanism for providing failure transparency Mobility Transparency • Mobility transparency: allow resources to move around w/o affecting the operation of users or programs • e.g., 700 phone number – but URLs are not, because someone’s personal web page cannot move to their new place of work in a different domain – all of the links in other pages will still point to the original page! • Transparency Categories • Performance transparency: adaptation of the system to varying load situations without the user noticing it. • Scaling transparency: allow system and applications to expand without need to change structure or application algorithms Degree of transparency • There are systems in which attempting to blindly hide all distribution aspects from users is not always a good idea – Requesting your electronic newspaper in your mailbox before 7 am local time – while you are at the other end of the world living in a different time zone – (Your morning paper will not be the morning paper you are used to) Degree of transparency • There is trade-off between a high degree of transparency and the performance of a system – Masking transient server failure by retransmitting the request may slow down the system – If it is necessary to guarantee that several replicas need to be consistent all the time, a single update may take a long time – something that cannot be hidden from the user.