DISTRIBUTED SYSTEM ASSIGNMENT

Assignment
Matthew Musee Muthusi
Rongo University
Distributed Systems
Jane Juma
21/5/2021

i. Introduction to fault tolerance

Fault tolerance, the ability to keep operating when faults occur, is itself a motivation for building distributed systems and is one of the most widely studied topics in the field. In a distributed environment of thousands of machines, some machine will almost always be failing, so failures are the norm rather than the exception. A poorly designed distributed system can be counterproductive and worse than a non-distributed system (Ganesan et al., 2017). Leslie Lamport's well-known remark makes the point: “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable. The current electrical problem in the machine room is not the culprit; it just highlights a situation that has been getting progressively worse.”

Failure models fall into two broad categories. Under a failure model, any behavior that does not comply with the designed protocol or contract counts as a failure.

a) Node failures: failures that occur at an individual node participating in the distributed system.

b) Communication failures: failures caused by unreliable communication channels connecting the nodes.

ii. Process resilience

Processes can be made fault tolerant by arranging them into a group in which every member is identical. A message sent to the group is delivered to all of the “copies” of the process (the group members), and then only one of them performs the required service (Vardas, Ploumidis & Marazakis, 2020). If one of the processes fails, it is assumed that one of the others will still be able to function and service any pending request or operation.

iii. Reliable client-server communication

Remote procedure calls (RPCs) and shared variables are communication abstractions that allow the processes of a distributed program, often modelled as clients and servers, to communicate with one another across machine boundaries. A key requirement of these abstractions is to mask the machine and communication failures that may occur during client-server communication. In practice, many distributed applications can inherently tolerate failures in certain situations. Using such application-layer information, it is often possible to relax the constraints under which the failure-masking algorithms in the client-server communication layer have to operate (Burns, 2018). This relaxation can significantly simplify the algorithms and the underlying message transport layer, and it allows more efficient algorithms to be formulated.

This application-driven approach is incorporated into failure-masking techniques in two ways.

One is orphan handling in remote procedure calls. An orphan is a computation that is still running on behalf of a caller that has failed. A technique is introduced for adopting the orphans caused by failures during RPCs instead of destroying them; adoption eliminates or reduces orphan killing and rollback. Orphan adoption is incorporated into two schemes of server replication:

a) Primary-secondary scheme: one replica of the server acts as the primary and executes RPCs from clients, while the other replicas stand by as secondaries. If the primary fails, one of the secondaries becomes the primary and adopts the orphan.

b) Replicated execution scheme: more than one replica of the server executes each RPC. If any replica fails, the orphan is adopted by the surviving replicas.

Both schemes require servers to re-execute calls, relying on the application-level idempotency properties of those calls.

The other is access to shared variables.
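Before turning to shared variables in detail, the primary-secondary scheme described above can be illustrated with a minimal Python sketch. It is not the orphan adoption algorithm itself; it only shows the two ingredients the text names, promotion of a secondary when the primary fails and re-execution of an idempotent call. The names Replica, ReplicatedServer, ReplicaFailed and the crash_before_reply flag are hypothetical and chosen for illustration only.

```python
class ReplicaFailed(Exception):
    """Raised when a replica crashes before it can reply to a call."""


class Replica:
    """One copy of the server with its own key-value store (hypothetical)."""

    def __init__(self, name, crash_before_reply=False):
        self.name = name
        self.store = {}
        self.crash_before_reply = crash_before_reply

    def put(self, key, value):
        # Overwriting a key is idempotent: running the call twice leaves
        # the same state as running it once.
        self.store[key] = value
        if self.crash_before_reply:
            # Simulate a crash after the work is done but before the reply.
            raise ReplicaFailed(self.name)
        return "ok"


class ReplicatedServer:
    """Primary-secondary group: replicas[0] is the primary, the rest stand by."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    @property
    def primary(self):
        return self.replicas[0]

    def put(self, key, value):
        while self.replicas:
            try:
                return self.primary.put(key, value)
            except ReplicaFailed:
                # Drop the failed primary; the next secondary is promoted
                # and the same call is re-executed on it.
                self.replicas.pop(0)
        raise RuntimeError("all replicas have failed")
```

For example, if a ReplicatedServer is built from three replicas and the first has crash_before_reply=True, a call to put("leader", "r2") is re-executed on the second replica, which becomes the new primary and returns "ok". Re-execution is safe here only because the call is idempotent; a non-idempotent call, such as incrementing a counter, would need extra bookkeeping that this sketch leaves out.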
Contemporary distributed programs deal with a new class of shared variables, such as information on name bindings, distributed load, and leadership within a server group. Since the consistency constraints on such system variables need not be as strong as those for user data, the access operations on them can be made simpler using this application-layer information. Along this direction, an abstraction called the application-driven shared variable is introduced, which enforces consistency of a variable only to the extent required by the application; this allows a weak form of intra-server group communication to be used for access operations on the variable.

iv. Reliable group communication

Communication between two processes in a distributed system is required to exchange data, such as code or a file, between them. When one source process communicates with multiple processes at once, it is called group communication (Li et al., 2020). A group is a collection of interconnected processes treated as a single abstraction. There are three types of group communication: broadcast, multicast and unicast.

a) Broadcast communication: The host process communicates with every process in the distributed system at the same time. Broadcast is useful when a common stream of information must be delivered to every process as efficiently as possible. Because the sender does not have to select or address individual recipients, communication is fast in comparison with the other modes. However, it does not support a large number of processes and cannot treat a specific process individually.

b) Multicast communication: The host process communicates with a designated group of processes in the distributed system at the same time. This technique is mainly used to address the problems of a high workload on the host system and of redundant information being sent to processes in the system. Multicasting can significantly decrease the time taken for message handling.

c) Unicast communication: The host process communicates with a single process in the distributed system at a time, although the same information may be passed to multiple processes in turn. This works best when two processes communicate, since only one specific process has to be handled. However, it leads to overhead because the sender has to find the exact process before exchanging information or data.

v. Distributed commit

Distributed commit is often established by means of a coordinator. In a simple scheme, the coordinator tells all the other processes involved, called participants, whether or not to (locally) perform the operation in question (Shan, Tsai & Zhang, 2017). This scheme is referred to as a one-phase commit protocol. In distributed database and transaction systems, a distributed commit protocol is required to ensure that the effects of a distributed transaction are atomic: either all the effects of the transaction persist or none do, whether or not failures occur.

vi. Recovery

Failure recovery is the process of restoring an erroneous state to an error-free state.

Failure

A system is said to “fail” when it cannot meet its promises. A failure is brought about by the existence of “errors” in the system, and the cause of an error is a “fault”. A system failure concerns the physical and logical units of the processor: the system may freeze, reboot, or stop performing any function and sit idle (Wang et al., 2018). This can be cured by rebooting the system as soon as possible and by identifying and correcting the failure point and the erroneous state.

References

Ganesan, A., Alagappan, R., Arpaci-Dusseau, A. C., & Arpaci-Dusseau, R. H. (2017). Redundancy does not imply fault tolerance: Analysis of distributed storage reactions to single errors and corruptions. In 15th USENIX Conference on File and Storage Technologies (FAST 17) (pp. 149-166).

Vardas, I., Ploumidis, M., & Marazakis, M. (2020). Improving the performance and resilience of MPI parallel jobs with topology and fault-aware process placement. arXiv preprint arXiv:2012.14757.

Burns, B. (2018). Designing distributed systems: Patterns and paradigms for scalable, reliable services. O'Reilly Media.

Li, Y., Mosavat-Jahromi, H., Cai, L., & Lu, L. (2020). GNC-MAC: Grouping and network coding-assisted MAC for reliable group-casting in V2X. In 2020 IEEE 92nd Vehicular Technology Conference (VTC2020-Fall) (pp. 1-6). IEEE.

Shan, Y., Tsai, S. Y., & Zhang, Y. (2017, September). Distributed shared persistent memory. In Proceedings of the 2017 Symposium on Cloud Computing (pp. 323-337).

Wang, X., Jin, M., Feng, W., Shu, G., Tian, H., & Liang, Y. (2018). Cascade energy optimization for waste heat recovery in distributed energy systems. Applied Energy, 230, 679-695.