Uploaded by Muthusi Mathew M

Distributed Systems --- Assignment

DISTRIBUTED SYSTEM ASSIGNMENT
Assignment
Matthew Musee Muthusi
Rongo University
Distributed Systems
Jane Juma
21/5/2021
Assignment
i. Introduction to fault tolerance
Fault tolerance, the ability to handle faults, is itself a motivation for distributed systems
and one of the most widely studied topics in the field. In a distributed environment of
thousands of machines, some will almost always be failing, so failures have become the norm
rather than the exception. A poorly designed distributed system behaves counter-intuitively and
can be worse than a non-distributed system (Ganesan et al., 2017). In Leslie Lamport's
well-known description, a distributed system is one in which the failure of a computer you
didn’t even know existed can render your own computer unusable. The current electrical problem
in the machine room is not the culprit; it just highlights a situation that has been getting
progressively worse.
Fault tolerance distinguishes two broad categories of failure model. Any behavior that does not
comply with the designed protocol or contract can be classified under a failure model. Node
failures are failures that occur at an individual node participating in a distributed system.
Communication failures are failures caused by unreliable communication channels connecting the
nodes.
ii. Process resilience
Processes can be made fault tolerant by arranging them into a group in which every member is
identical.
A message sent to the group is delivered to all of the “copies” of the process (the group
members), and then only one of them performs the required service (Vardas, Ploumidis &
Marazakis, 2020).
If one of the processes fails, it is assumed that one of the others will still be able to
function and service any pending request or operation.
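The group behaviour described above can be sketched in a few lines of Python. This is only an illustrative model, not an implementation from the cited work: the class names `Replica` and `ProcessGroup`, the `alive` flag, and the rule that the first live member serves the request are all assumptions made for the example.

```python
class Replica:
    """One member of a group of identical processes (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.inbox = []  # every message sent to the group lands here

    def deliver(self, message):
        # A crashed replica receives nothing.
        if self.alive:
            self.inbox.append(message)

    def serve(self, message):
        return f"{self.name} handled {message}"


class ProcessGroup:
    """Delivers each message to all members; one live member serves it."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def send(self, message):
        # Step 1: deliver the message to every member of the group.
        for replica in self.replicas:
            replica.deliver(message)
        # Step 2: only one member (assumed here: the first live one) serves it.
        for replica in self.replicas:
            if replica.alive:
                return replica.serve(message)
        raise RuntimeError("all group members have failed")


group = ProcessGroup([Replica("p1"), Replica("p2"), Replica("p3")])
first = group.send("req-1")
group.replicas[0].alive = False  # simulate a crash of p1
second = group.send("req-2")     # p2 takes over transparently
```

Because every live member has received every message, any surviving member can pick up a pending request when the serving member fails.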
iii. Reliable client-server communication
Remote procedure calls and shared variables are communication abstractions which allow
the various processes of a distributed program, often modelled as clients and servers, to
communicate with one another across machine boundaries. A key requirement of the abstractions
is to mask the machine and communication failures that may occur during the client-server
communications. In practice, many distributed applications can inherently tolerate failures under
certain situations. Using such application layer information, it is often possible to relax the
constraints under which the failure masking algorithms in the client-server communication layer
have to operate (Burns, 2018). The relaxation can significantly simplify the algorithms and the
underlying message transport layer and allow formulation of efficient algorithms.
This application-driven approach is incorporated into failure masking techniques in two ways.
One is orphan handling in remote procedure calls (RPCs). A new technique of adopting the
orphans caused by failures during RPCs is introduced. The adoption technique eliminates or
reduces orphan killing and rollback. Orphan adoption is incorporated into two schemes of server
replication:

Primary-secondary scheme in which one of the replicas of the server acts
as the primary executing RPCs from clients while the other replicas standby as
secondaries. If the primary fails, one of the secondaries becomes the primary and adopts
the orphan.

Replicated execution scheme in which more than one replica of the server
executes an RPC. If any of the replicas fails, the orphan is adopted by the surviving
replicas.

Both schemes require call re-executions by servers based on application-level
idempotency properties of the calls.
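As a rough illustration of the primary-secondary scheme, the Python sketch below promotes a secondary when the primary fails and has it re-execute the interrupted call. It is a simplification under stated assumptions: `ServerReplica`, `PrimarySecondary`, and the key-value `execute` call are hypothetical names, failure is simulated with an `alive` flag, and neither state transfer between replicas nor full orphan adoption is modelled. Re-execution is safe here only because the example call (set a key to a value) is idempotent.

```python
class ServerReplica:
    """A server replica holding a small key-value store (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.store = {}

    def execute(self, call_id, key, value):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        # Idempotent call: repeating "set key to value" gives the same state.
        self.store[key] = value
        return (call_id, self.name)


class PrimarySecondary:
    """The replica at index 0 is the primary; the rest stand by."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def call(self, call_id, key, value):
        while self.replicas:
            primary = self.replicas[0]
            try:
                return primary.execute(call_id, key, value)
            except ConnectionError:
                # Drop the failed primary; the next secondary is promoted
                # and re-executes the interrupted call on the next loop pass.
                self.replicas.pop(0)
        raise RuntimeError("no replicas left")


servers = PrimarySecondary([ServerReplica("s1"), ServerReplica("s2")])
r1 = servers.call(1, "x", 10)
servers.replicas[0].alive = False  # simulate failure of the primary
r2 = servers.call(2, "x", 20)      # served by the promoted secondary
```

A non-idempotent call, such as "add 10 to x", could not be re-executed this blindly, which is why both schemes depend on application-level knowledge of the calls.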
The other is access to shared variables. Contemporary distributed programs deal with a
new class of shared variables such as information on name bindings, distributed load and
leadership within a server group. Since the consistency constraints on such system variables need
not be as strong as those for user data, the access operations on the variables may be made
simpler using this application layer information. Along this direction, an abstraction called
application-driven shared variable is introduced which enforces consistency of a variable only to
the extent required by the application; this allows use of a weak form of intra-server group
communication for access operations on the variable.
iv. Reliable group communication
When one source process communicates with multiple processes at once, it is called group
communication. Communication between processes in a distributed system is required to exchange
various data, such as code or a file, between the processes (Li et al., 2020). A group is a
collection of interconnected processes treated as a single abstraction. There are three types
of communication: broadcast, multicast and unicast.
a) Broadcast Communication:
In broadcast communication, the host process communicates with every process in a distributed
system at the same time. Broadcast comes in handy when a common stream of information is to be
delivered to each and every process in the most efficient manner possible. Because the sender
does not need to select recipients, communication is fast compared with other modes of
communication. However, it does not support a large number of processes and cannot treat a
specific process individually.
b) Multicast Communication:
In multicast communication, the host process communicates with a designated group of processes
in a distributed system at the same time. This technique is mainly used to address the problems
of a high workload on the host system and of redundant information from processes in the system.
Multicasting can significantly decrease the time taken for message handling.
c) Unicast Communication:
In unicast communication, the host process communicates with a single process in a distributed
system at a time, although the same information may still be passed to multiple processes. This
works best for two processes communicating, since the host has to deal with only one specific
process. However, it adds overhead, because the host must find the exact process and then
exchange information or data.
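The difference between the three modes comes down to which set of processes receives a message. The following Python sketch makes that concrete with in-memory inboxes; the function names and the dictionary of inboxes are assumptions chosen for illustration, and no real network transport is involved.

```python
def unicast(inboxes, target, message):
    """Deliver the message to exactly one named process."""
    inboxes[target].append(message)


def multicast(inboxes, group, message):
    """Deliver the message to a designated group of processes."""
    for name in group:
        inboxes[name].append(message)


def broadcast(inboxes, message):
    """Deliver the message to every process in the system."""
    multicast(inboxes, list(inboxes), message)


# Four hypothetical processes, each with an empty inbox.
inboxes = {"p1": [], "p2": [], "p3": [], "p4": []}

broadcast(inboxes, "hello all")
multicast(inboxes, ["p1", "p3"], "hello group")
unicast(inboxes, "p2", "hello p2")
```

In this model, broadcast is just multicast to the group of all processes, and unicast is multicast to a group of one.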
v. Distributed commit
Distributed commit is often established by means of a coordinator. In a simple scheme, this
coordinator tells all other processes that are also involved, called participants, whether or not to
(locally) perform the operation in question (Shan, Tsai & Zhang, 2017). This scheme is referred
to as a one-phase commit protocol.
In distributed database and transaction systems, a distributed commit protocol is required to
ensure that the effects of a distributed transaction are atomic, that is, either all the effects of the
transaction persist or none persist, whether or not failures occur.
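A minimal sketch of the one-phase scheme, assuming in-memory participants, might look as follows in Python. The names `Participant`, `stage`, `decide`, and `one_phase_commit` are hypothetical. Note that in this simple scheme a participant has no way to tell the coordinator that it cannot perform the operation, so the sketch shows the all-or-nothing outcome only when no participant fails.

```python
class Participant:
    """Holds staged updates until the coordinator announces a decision."""

    def __init__(self, name):
        self.name = name
        self.pending = {}    # updates staged by the transaction
        self.committed = {}  # updates that have been made durable

    def stage(self, key, value):
        self.pending[key] = value

    def decide(self, commit):
        # Apply the staged updates on commit; discard them on abort.
        if commit:
            self.committed.update(self.pending)
        self.pending.clear()


def one_phase_commit(participants, commit):
    """The coordinator tells every participant the same decision."""
    for participant in participants:
        participant.decide(commit)


a, b = Participant("a"), Participant("b")

# First transaction: the coordinator decides to commit.
a.stage("x", 1)
b.stage("y", 2)
one_phase_commit([a, b], commit=True)

# Second transaction: the coordinator decides to abort.
a.stage("x", 99)
b.stage("y", 99)
one_phase_commit([a, b], commit=False)
```

After both transactions, the effects of the first persist at every participant and the effects of the second persist at none.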
vi. Recovery
Failure recovery is a process that involves restoring an erroneous state to an error-free
state. A system is said to “fail” when it cannot meet its promises. A failure is brought about
by the existence of “errors” in the system, and the cause of an error is a “fault”.
Recovery is concerned with the physical and logical units of the processor. After a failure,
the system may freeze, reboot, or stop functioning and go into an idle state (Wang et al.,
2018). This can be cured by rebooting the system as soon as possible and correcting the failure
point and the wrong state.
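One common way to restore an erroneous state to an error-free one is to save a checkpoint and roll back to it. The text above does not name a specific technique, so the Python sketch below is an assumption made for illustration: `CheckpointedProcess`, its `checkpoint` and `recover` methods, and the `counter` state are all hypothetical.

```python
import copy


class CheckpointedProcess:
    """Keeps a saved copy of its last known error-free state."""

    def __init__(self):
        self.state = {"counter": 0}
        self._saved = copy.deepcopy(self.state)

    def checkpoint(self):
        # Record the current state as error-free.
        self._saved = copy.deepcopy(self.state)

    def recover(self):
        # Replace the (possibly erroneous) state with the saved one.
        self.state = copy.deepcopy(self._saved)


proc = CheckpointedProcess()
proc.state["counter"] = 5
proc.checkpoint()            # state with counter == 5 is known to be good
proc.state["counter"] = -1   # simulate an error corrupting the state
proc.recover()               # roll back to the last checkpoint
```

The deep copy matters: without it, the saved state and the live state would be the same object, and the corruption would reach the checkpoint as well.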
References
Ganesan, A., Alagappan, R., Arpaci-Dusseau, A. C., & Arpaci-Dusseau, R. H. (2017).
Redundancy does not imply fault tolerance: Analysis of distributed storage reactions to
single errors and corruptions. In 15th USENIX Conference on File and Storage
Technologies (FAST 17) (pp. 149-166).
Vardas, I., Ploumidis, M., & Marazakis, M. (2020). Improving the Performance and Resilience
of MPI Parallel Jobs with Topology and Fault-Aware Process Placement. arXiv preprint
arXiv:2012.14757.
Burns, B. (2018). Designing Distributed Systems: Patterns and Paradigms for Scalable, Reliable
Services. O'Reilly Media, Inc.
Li, Y., Mosavat-Jahromi, H., Cai, L., & Lu, L. (2020). GNC-MAC: Grouping and Network
Coding-assisted MAC for Reliable Group-casting in V2X. In 2020 IEEE 92nd Vehicular
Technology Conference (VTC2020-Fall) (pp. 1-6). IEEE.
Shan, Y., Tsai, S. Y., & Zhang, Y. (2017, September). Distributed shared persistent memory.
In Proceedings of the 2017 Symposium on Cloud Computing (pp. 323-337).
Wang, X., Jin, M., Feng, W., Shu, G., Tian, H., & Liang, Y. (2018). Cascade energy
optimization for waste heat recovery in distributed energy systems. Applied Energy, 230,
679-695.