Distributed Systems
-Rise of distributed systems: computing power has increased, network connectivity keeps increasing, and it is easy to connect hardware together.
-Distributed system: a collection of independent computers that appears to its users as a single coherent system. Distributed system = distributed hardware + distributed control + distributed data.
-Why? Big data is growing, apps are data-intensive, individual computers have limited resources, chip multiprocessors are now available, and CPU speeds have grown faster than memory speeds.
-Requirements: programming models & concurrency, communication, synchronization, consistency & replication, fault tolerance, security, virtualization.
-Organized as middleware (between apps and the OS). The middleware layer extends over multiple machines so that users can interact with the system in a consistent way.
-Goals: transparency and scaling. Scaling techniques: hiding communication latency (important for interactive apps; use asynchronous communication), distribution (spread information/processing across more than one location), and replication (copy information to increase availability and decrease centralized load).
-Architectures: clients invoking individual servers, peer-to-peer systems, a service provided by multiple servers, web proxy servers, web applets, thin clients and compute servers.
-DS EX: cloud computing (on demand, dynamic allocation of resources, abstraction of resources, self-managed, billed for what you use, standard interfaces) and XaaS.
-IaaS: user gets access to virtualized hardware and manages the OS, middleware, runtime, and data. EX: Amazon EC2
-PaaS: integrated development environment (app design, testing, deployment, hosting); user develops apps on top and is responsible for managing data. EX: Google App Engine
-SaaS: top layer consumed by the end user; the application software is provided. EX: Gmail, Google Docs, Facebook
-HuaaS: extraction of information from crowds of people (arbitrary). EX: YouTube videos
Networking
-Many different agreements (protocols) are needed at various levels for two OSes to communicate.
-Internet: a network of networks connecting millions of devices (hosts/end systems; links, from fiber to satellite; routers and switches). Also a collection of protocols providing communication services to distributed applications.
-A network can be defined recursively as two or more nodes connected by a link, OR two or more networks connected by a node.
Internet Protocol Stack
-application: protocols designed to meet the communication requirements of specific applications; defines the interface to a service (FTP, HTTP)
-transport: process-to-process data transfer (TCP, UDP)
-network: routing of datagrams from source to destination (IP, OSPF, BGP)
-link: data transfer between neighboring network elements (PPP, Ethernet)
-physical: transmission of bits on a link (electrical signals on cable, light signals on fiber)
**ISO/OSI model: same as above, but presentation and session sit between application & transport**
-presentation: allows applications to interpret the meaning of data (encryption, compression)
-session: synchronization, checkpointing, recovery of data exchange
-Why layering? It allows identification of a complex system's pieces and the relationships between them. Each layer gets service from the one below -> performs a specific task -> provides service to the one above. Modularization eases maintenance and updating.
-IP address: 32-bit (IPv4) unique identifier for a host or router interface. Interface: the connection between a host/router and a physical link (routers usually have multiple interfaces, a host typically has one; an IP address is associated with each interface).
-IP header: version (IPv4), header length (in 32-bit words, usually 5), total length (bytes, header + data), flags (3 bits), time to live (remaining number of hops/links the packet may traverse; decremented by each router), protocol (type of transport: 1=ICMP, 6=TCP, 17=UDP), header checksum (recomputed whenever a node modifies the header, e.g. when a router decrements the TTL), source address, destination address.
-Datagram: every datagram contains the destination's address. If the sender is directly connected to the destination network (network # of dest == my network #), forward to the host on the LAN. If not directly connected, forward to the host's default router.
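The host-side forwarding decision described above can be sketched in a few lines. This is only an illustration: the function name `next_hop`, the addresses, and the use of a CIDR prefix to stand in for the "network number" are assumptions, not part of the notes.

```python
import ipaddress


def next_hop(dest: str, my_interface: str, default_router: str) -> str:
    """Return the address a host should hand a datagram to (hypothetical helper).

    my_interface is given in CIDR form, e.g. "192.168.1.10/24"; the prefix
    plays the role of the "network number" in the notes (an assumption).
    """
    my_network = ipaddress.ip_interface(my_interface).network
    if ipaddress.ip_address(dest) in my_network:
        # Network # of dest == my network #: deliver directly on the LAN.
        return dest
    # Not directly connected: hand the datagram to the default router.
    return default_router


# Made-up example addresses.
local = next_hop("192.168.1.77", "192.168.1.10/24", "192.168.1.1")
remote = next_hop("8.8.8.8", "192.168.1.10/24", "192.168.1.1")
```

Here `local` is the destination itself (same network) and `remote` is the default router's address.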
Each router maintains a forwarding table that maps a network number (rather than a host address) to the next hop, or to an interface number if the network is directly connected.
-Address: unique byte string that identifies a node
-Routing: process of forwarding messages to the destination node based on its address. Types: unicast (node-specific); broadcast (all nodes on the network); multicast (some subset of nodes)
-Address translation in a LAN: maps IP addresses into the physical addresses of the destination host or the next-hop router
-Address Resolution Protocol (ARP): each host caches a table of IP-to-physical address bindings (entries are discarded if not refreshed) -> broadcasts a request if the IP is not in the table -> the target machine replies to the sender with its physical address and adds (or updates) an entry for the source in its own table.
Network Layer
-End-to-end protocols: the underlying best-effort network may drop or reorder messages, deliver duplicate copies, limit messages to some finite size, and deliver messages after an arbitrarily long delay. Common end-to-end services: guarantee delivery, deliver in the same order, deliver at most one copy, support large messages, support synchronization, allow the receiver to apply flow control, support multiple application processes.
-Transport layer: provides logical communication between app processes on different hosts. Runs in end systems: sender side -> breaks app messages into segments, passes them to the network layer. Receiver side -> reassembles segments into messages, passes them to the app layer. *More than one protocol is available to apps; on the Internet: TCP & UDP*
-Transmission Control Protocol (TCP): connection-oriented, reliable transport, flow control. Does not provide timing guarantees, minimum throughput guarantees, or security.
 Segment format: ports (SrcPort, DstPort; together with SrcIPAddr and DstIPAddr they form the 4-tuple that identifies a connection), sliding window + flow control (Acknowledgment, SequenceNum, AdvertisedWindow), flags (SYN, FIN, RESET, PUSH, URG, ACK), checksum.
 Client -> connection initiator. Server -> contacted by client.
 Three-way handshake:
 1. Client host sends a TCP SYN segment to the server.
 2. Server receives the SYN, replies with a SYNACK segment.
 3. Client receives the SYNACK, replies with an ACK segment, which may contain data.
 Closing a connection (clientSocket.close();):
 1. Client end system sends a TCP FIN control segment to the server.
 2. Server receives the FIN, replies with an ACK. Closes the connection, sends a FIN.
 3. Client receives the FIN, replies with an ACK.
 4. Server receives the ACK. Connection closed.
-User Datagram Protocol (UDP): unreliable data transfer; does not provide connection setup, reliability, flow control, congestion control, timing, throughput guarantees, or security. A simple demultiplexer: unreliable and unordered datagram service, no flow control or error control, endpoints identified by ports. The header carries an optional checksum (computed over pseudoheader + UDP header + data).
Interprocess Communication
-Characteristics: messages are sent to (Internet address, local port) pairs; a port has one receiver but can have many senders; a process may use multiple ports. Key properties: VALIDITY, INTEGRITY, ORDERING.
-Sockets: sit between the application and transport layers; a door between the application process and the end-to-end transport protocol (UDP or TCP). Application and middleware layers use the services provided by the network and transport layers through the socket API (interface, gate, door between a process and the transport layer). A socket must be bound to a local port.
 Programming: the client must contact the server (the server process must already be running and must have created a socket (door) that welcomes the client's contact). The client contacts the server by creating a client-local TCP socket and specifying the IP address and port number of the server process; when the client creates the socket, the client TCP establishes a connection to the server TCP. When contacted by a client, the server TCP creates a new socket for the server process to communicate with that client (this allows the server to talk with multiple clients; source port numbers are used to distinguish clients).
-Process: program running within a host. Within the same host, two processes communicate using inter-process communication. Processes in different hosts communicate by exchanging messages using the transport layer.
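The socket programming steps above (welcoming socket, per-client socket, connection setup and close) can be sketched with Python's standard `socket` module. This is a minimal loopback illustration, not a production server: the helper names `serve_once` and `echo_upper` are hypothetical, and running client and server in one process with a thread is an assumption made to keep the example self-contained.

```python
import socket
import threading


def serve_once(listener: socket.socket) -> None:
    """Accept one client, upper-case what it sends, then close."""
    # accept() returns a NEW socket dedicated to this client; the
    # listening ("welcoming") socket stays free for other clients.
    conn, _client_addr = listener.accept()
    with conn:
        data = conn.recv(1024)  # enough for the short demo message
        conn.sendall(data.upper())


def echo_upper(message: bytes) -> bytes:
    """Run a one-shot TCP server and client on the loopback interface."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        listener.listen(1)
        host, port = listener.getsockname()

        server = threading.Thread(target=serve_once, args=(listener,))
        server.start()

        # Connecting makes the client TCP perform the three-way handshake
        # (SYN, SYNACK, ACK) with the server TCP.
        with socket.create_connection((host, port)) as client:
            client.sendall(message)
            reply = client.recv(1024)
        # Leaving the with-block closes the client socket, which starts
        # the FIN/ACK exchange described in the notes.

        server.join()
        return reply
```

Calling `echo_upper(b"hello")` returns `b"HELLO"`. The handshake and teardown segments are generated by the operating system's TCP implementation; the program only sees `connect`, `accept`, and `close`.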
Client process: process that initiates communication. Server process: process that waits to be contacted.
-Addressing processes: to receive messages, a process must have an identifier. A host device has a unique 32-bit IP address; the identifier includes both the IP address and the port number associated with the process.
-Broadcast: sends a single message from one process to all processes or hosts (used for ARP in a LAN; hard and expensive in a WAN)
-Multicast: sends a single message from one process to the members of a group of processes (hosts). USES: fault tolerance based on replicated services, discovery in spontaneous networking, performance from replicated data, propagation of event notifications in a distributed environment.
 IP multicast: a multicast address identifies a group. Processes register for a group with their local router using the Internet Group Management Protocol (IGMP) -> router updates its multicast routing table -> processes send messages to the group -> routers forward the multicast messages.
 Multicast routing problem: the goal is to find a tree (or trees) connecting the routers that have local multicast group members. Tree: not all paths between routers are used. Source-based: a different tree from each sender to the receivers. Shared-tree: the same tree used by all group members.
Indirect Communication
-Indirect communication: communication through an intermediary, with no direct coupling between the sender and the receiver. Space uncoupling: the sender does not know or need to know the identity of the receiver. Time uncoupling: the sender and receiver can have independent lifetimes. It is often used in distributed systems. Main disadvantages: performance overhead introduced by the added level of indirection, and systems that are more difficult to manage.
-Group communication: offers a service where a message is sent to a group and then delivered to all members of the group. The sender is not aware of the identities of the receivers.
Represents an abstraction over multicast communication, adding significant extra value in terms of managing group membership, detecting failures, and providing reliability and ordering guarantees. Applications: reliable dissemination of information, support for collaborative applications, a range of fault-tolerance strategies, and system monitoring and management.
-The programming model: a group with associated group membership, where processes may join or leave the group. Processes can send a message to this group and have it propagated to all members with guarantees in terms of reliability and ordering. The essential feature is that a process issues only one multicast operation to send a message to each of a group of processes. Issues: reliability and ordering in multicast (integrity, validity, and agreement). Group communication services offer ordered multicast (FIFO, causal ordering, total ordering).
-Group membership management: provides an interface for group membership changes, failure detection, notifying members of changes, and performing group address expansion. Issues: most effective in small-scale, static systems; does not operate as well in larger-scale environments or environments with a high degree of volatility.
-Publish-subscribe systems: distributed event-based systems. Publishers publish structured events to an event service -> subscribers express interest through subscriptions, which can be arbitrary patterns over the structured events. The service ensures that events are delivered efficiently to all subscribers whose filters match the event. Subscription (filter) models: channel-, topic-, content-, and type-based. EX: financial information systems. Issues: centralized versus distributed implementations.
-Message queues: point-to-point service using the concept of a message queue as an indirection. EX: Enterprise Application Integration. Extensively used as the basis for commercial transaction processing systems.
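The message queue idea can be illustrated with a toy in-memory broker. This is a sketch under stated assumptions: the class `QueueBroker`, its `send`/`receive` methods, and the queue name "orders" are hypothetical, and a real message queue service would add persistence, blocking receives, and transactions.

```python
from collections import defaultdict, deque


class QueueBroker:
    """Toy broker: named FIFO queues, each message consumed exactly once."""

    def __init__(self) -> None:
        self._queues = defaultdict(deque)

    def send(self, queue_name: str, message: str) -> None:
        # The sender names a queue, not a receiver (space uncoupling).
        self._queues[queue_name].append(message)

    def receive(self, queue_name: str):
        # Non-blocking receive (an assumption): None when the queue is empty.
        queue = self._queues[queue_name]
        return queue.popleft() if queue else None


broker = QueueBroker()
broker.send("orders", "order-1")   # producer runs and finishes first
broker.send("orders", "order-2")
first = broker.receive("orders")   # consumer starts later (time uncoupling)
second = broker.receive("orders")
third = broker.receive("orders")   # queue is now empty
```

The consumer gets "order-1", then "order-2", then `None`: messages come out in FIFO order and each one is delivered to a single receiver, which is what makes the service point-to-point.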
-Shared memory approaches: distributed shared memory (DSM) is an abstraction used for sharing data between computers that do not share physical memory. Processes access DSM by reads and updates to what appears to be ordinary memory within their address space. It is as though the processes access a single shared memory, but in fact the physical memory is distributed. Primarily a tool for parallel applications, or for any distributed application or group of applications in which individual shared data items can be accessed directly. Generally less appropriate in client-server systems.
-Tuple space communication: processes communicate indirectly by placing tuples in a tuple space, from which other processes can read or remove them. Tuples consist of a sequence of one or more typed data fields, such as <"fred", 1958>, <"sid", 1964> and <4, 9.8, "Yes">. OPERATIONS: write, read, take. Tuples are immutable.
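The write, read and take operations can be sketched with a small in-memory class. Several details here are assumptions for illustration: the class name `TupleSpace`, matching by a pattern whose fields are either a concrete value or a Python type, and non-blocking `read`/`take` that return `None` (real tuple spaces typically block until a match appears).

```python
class TupleSpace:
    """Toy single-process tuple space (hypothetical, non-blocking)."""

    def __init__(self) -> None:
        self._tuples = []

    def write(self, tup: tuple) -> None:
        # Tuples are immutable: to "update" one, take it and write a new one.
        self._tuples.append(tuple(tup))

    @staticmethod
    def _matches(pattern: tuple, tup: tuple) -> bool:
        # A pattern field is a concrete value, or a type acting as a wildcard.
        if len(pattern) != len(tup):
            return False
        for wanted, value in zip(pattern, tup):
            if isinstance(wanted, type):
                if not isinstance(value, wanted):
                    return False
            elif wanted != value:
                return False
        return True

    def read(self, pattern: tuple):
        # Return a matching tuple without removing it.
        for tup in self._tuples:
            if self._matches(pattern, tup):
                return tup
        return None

    def take(self, pattern: tuple):
        # Return a matching tuple and remove it from the space.
        found = self.read(pattern)
        if found is not None:
            self._tuples.remove(found)
        return found


space = TupleSpace()
space.write(("fred", 1958))
space.write(("sid", 1964))
peeked = space.read(("fred", int))   # tuple stays in the space
taken = space.take((str, 1964))      # tuple is removed
gone = space.read(("sid", int))      # no longer present
```

After these calls, `peeked` is `("fred", 1958)`, `taken` is `("sid", 1964)`, and `gone` is `None`, showing the difference between reading a tuple and taking it.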