Distributed Computing
EG 3113 CT
Parallel computing structure
Parallel computing is a type of computing architecture in which several processors simultaneously execute
multiple, smaller calculations broken down from an overall larger, complex problem. Parallel computing refers to
the process of breaking down larger problems into smaller, independent, often similar parts that can be executed
simultaneously by multiple processors communicating via shared memory, the results of which are combined
upon completion as part of
an overall algorithm.
The primary goal of parallel computing is to increase available computation power for faster application
processing and problem solving.
Motivation Of Parallelism
The real world runs in a dynamic manner, i.e. many things happen at the same time but at different places, concurrently. The resulting data is extremely large and hard to manage.
Real-world data needs more dynamic simulation and modeling, and parallel computing is the key to achieving this.
Parallel computing provides concurrency and saves time and money.
Complex, large datasets and their management can be organized only by using a parallel computing approach.
It ensures effective utilization of resources: the hardware is guaranteed to be used effectively, whereas in serial computation only some part of the hardware is used and the rest remains idle.
Also, it is impractical to implement real-time systems using serial computing.
Types of Parallelism
1. Bit-level parallelism: increases the processor word size, which reduces the number of instructions the processor must execute in order to perform an operation on variables whose size is greater than the word length.
Example: Consider a scenario where an 8-bit processor must compute the sum of two 16-bit integers. It must
first sum up the 8 lower-order bits, then add the 8 higher-order bits, thus requiring two instructions to perform the
operation. A 16-bit processor can perform the operation with just one instruction.
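As a sketch of the idea (not how the hardware literally executes it), the following Python fragment decomposes a 16-bit addition into two 8-bit additions with an explicit carry, mirroring the two-instruction sequence an 8-bit processor would need:

```python
# Sketch: adding two 16-bit integers using only 8-bit-wide additions.
# A 16-bit processor would perform the same sum in a single instruction.
def add16_on_8bit(x, y):
    lo = (x & 0xFF) + (y & 0xFF)                          # add the 8 lower-order bits
    carry = lo >> 8                                       # carry out of the low byte
    hi = ((x >> 8) & 0xFF) + ((y >> 8) & 0xFF) + carry    # add the 8 higher-order bits
    return ((hi & 0xFF) << 8) | (lo & 0xFF)

assert add16_on_8bit(0x12FF, 0x0001) == 0x1300
```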
2. Instruction-level parallelism:
the hardware approach works upon dynamic parallelism, in which the processor decides at run-time which
instructions to execute in parallel.
the software approach works upon static parallelism, in which the compiler decides which instructions to execute
in parallel
3. Task parallelism:
a form of parallelization of computer code across multiple processors that runs several different tasks at the same time on the same data (a small sketch follows below).
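As an illustrative sketch of task parallelism in Python using the standard-library thread pool (in CPython the global interpreter lock limits true CPU parallelism, so treat this as an illustration of the idea rather than a performance recipe), two different tasks run at the same time on the same data:

```python
# Sketch: task parallelism - two *different* tasks (summing and finding the maximum)
# run concurrently on the *same* data, each in its own thread.
from concurrent.futures import ThreadPoolExecutor

data = list(range(1_000_000))

with ThreadPoolExecutor(max_workers=2) as pool:
    total_future = pool.submit(sum, data)   # task 1
    max_future = pool.submit(max, data)     # task 2
    print(total_future.result(), max_future.result())
```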
Moore's Law
Moore's Law refers to Gordon Moore's perception that the number of transistors on a microchip doubles every
two years, though the cost of computers is halved. Moore's Law states that we can expect the speed and
capability of our computers to increase every couple of years, and we will pay less for them.
Another tenet of Moore's Law asserts that this growth is exponential.
In 1965, Gordon E. Moore, the co-founder of Intel, made this observation that became known as Moore's Law.
Grand Challenge problems
Grand Challenges are defined by the Federal High Performance Computing and Communications (HPCC)
program as fundamental problems in science and engineering with broad economic and scientific impact, whose
solutions require the application of high-performance computing.
The following is a list of "official" Grand Challenge applications:
• Aerospace
• Computer Science
• Energy
• Environmental Monitoring and Prediction
• Molecular Biology and Biomedical Imaging
• Product Design and Process Optimization
• Space Science
Instruction level parallelism (ILP)
Instruction-level parallelism (ILP) is a measure of how many of the operations in a computer program can be
performed simultaneously. The potential overlap among instructions is called instruction-level parallelism. ILP is used to achieve not only instruction overlap but the actual execution of more than one instruction at a time through dynamic scheduling, with the goal of maximizing processor throughput.
Two main approaches:
Dynamic or hardware-based: dynamic parallelism means the processor decides at run time which instructions to execute in parallel. This approach is used in server and desktop processors.
Static or compiler-based: static parallelism means the compiler decides which instructions to execute in parallel.
Consider the following program:
e=a+b
f=c+d
m=e*f
Operation 3 depends on the results of operations 1 and 2, so it cannot be calculated until both of them are
completed. However, operations 1 and 2 do not depend on any other operation, so they can be calculated
simultaneously. If we assume that each operation can be completed in one unit of time then these three
instructions can be completed in a total of two units of time.
thread-level parallelism
TLP is a software capability that allows high-end programs, such as a database or web application, to work with
multiple threads at the same time. Programs that support this ability can do a lot more, even under high
workloads.
Thread-level parallelism used to only be utilized on commercial servers. However, as applications and processes
became more demanding, this technology has trickled down to the masses in the form of multi-core processors
found in today's desktop computers.
ILP can be quite limited. There are limitations in the hardware that we use: the number of virtual registers that we actually have is limited. Similarly, we may not always be able to disambiguate memory addresses. More importantly, aggressively exploiting ILP may lead to an increase in power consumption.
Furthermore, there may be significant parallelism occurring naturally at a higher level in the application that
cannot be exploited with the approaches used to exploit ILP. For example, an online transaction processing
system has natural parallelism among the multiple queries and updates that are presented by requests. These
queries and updates can be processed mostly in parallel, since they are largely independent of one another.
This higher-level parallelism is called thread level parallelism because it is logically structured as separate
threads of execution.
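A minimal sketch of the transaction-processing example above: independent queries are handled concurrently by separate threads. The handle_query function and the query strings are hypothetical placeholders, not part of any real database API.

```python
# Sketch: thread-level parallelism - independent queries serviced by a pool of threads.
from concurrent.futures import ThreadPoolExecutor

def handle_query(query):
    # Each query is largely independent of the others, so no coordination is needed.
    return f"result of {query}"

queries = ["query-1", "query-2", "query-3", "query-4"]

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(handle_query, queries))

print(results)
```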
Data Level parallelism
Data parallelism is parallelization across multiple processors in parallel computing environments. It focuses on
distributing the data across different nodes, which operate on the data in parallel. It can be applied on regular
data structures like arrays and matrices by working on each element in parallel.
A parallel job performing data analysis on an array of n elements can divide the work equally among all the processors. In a multiprocessor system where each processor executes a single set of instructions, data parallelism is achieved
when each processor performs the same task on different pieces of distributed data.
Data parallelism is used by many applications especially data processing applications; one of the examples is
database applications.
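A minimal sketch of data parallelism, assuming a simple squaring operation as the task: the same function is applied to different pieces of the data by a pool of worker processes.

```python
# Sketch: data parallelism - the same operation applied to different pieces of the data.
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    data = list(range(16))
    with Pool(processes=4) as pool:
        print(pool.map(square, data))   # the pool distributes slices of the data to workers
```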
Memory Level Parallelism (MLP)
Memory level parallelism means generating and servicing multiple memory accesses in parallel. Memory-level
parallelism (MLP) is a term in computer architecture referring to the ability to have pending multiple memory
operations, in particular cache misses or translation lookaside buffer (TLB) misses, at the same time.
In a single processor, MLP may be considered a form of instruction-level parallelism (ILP). In general, processors
are fast but memory is slow. One way to bridge this gap is to service the memory accesses in parallel. If the
misses are serviced in parallel, then the processor takes only one long-latency stall for all the parallel misses.
This whole idea of fetching the misses in parallel is termed as Memory Level Parallelism.
Granularity
In parallel computing, granularity (or grain size) of a task is a measure of the amount of work (or computation)
which is performed by that task.
Fine-grain Parallelism
In fine-grained parallelism, the program is divided into a large number of small tasks. These tasks are assigned individually to many processors. The amount of work associated with each parallel task is low, and the work is evenly distributed among the processors. Therefore, fine-grained parallelism facilitates load balancing. Since each task processes less data, the number of processors required to perform the full computation is high. This, in turn, increases communication and synchronization overhead.
Fine-grained parallelism with a large number of independent tasks needs to consider the following:
➢ Compute intensity needs to be moderate so that each independent unit of parallelism has sufficient work to
do.
➢ The cost of communication needs to be low so that each independent unit of parallelism can execute
independently.
➢ Workload scheduling is important in fine-grained parallelism owing to the large number of independent tasks that can execute in parallel. The flexibility provided by an effective workload scheduler can help achieve load balance when a large number of tasks are being executed.
Coarse-grain Parallelism
In coarse-grained parallelism, the program is broken down into a small number of large tasks, so a large amount of computation occurs in each processor. This can lead to load imbalance, in which certain processors handle most of the data while others may remain idle. Further, coarse-grained parallelism fails to exploit much of the parallelism in the program, since most of the computation is executed sequentially on a processor.
The advantage of this type of parallelism is low communication and synchronization.
Coarse-grained parallelism needs to consider the following:
➢ Compute intensity needs to be higher than in the fine-grained case since there are fewer tasks that will
execute independently.
➢ Coarse-grained parallelism would require the developer to identify complete portions of an application that
can serve as a task.
Medium Grained Parallelism
Medium-grained parallelism is defined relative to fine-grained and coarse-grained parallelism. It is a compromise between the two, where the task size and communication time are greater than in fine-grained parallelism and lower than in coarse-grained parallelism.
Most general-purpose parallel computers fall in this category.
Performance of Parallel Processor
Run Time: The parallel run time is defined as the time that elapses from the moment that a parallel
computation starts to the moment that the last processor finishes execution.
Notation: Serial run time Ts , parallel run time Tp .
The speed up is defined as the ratio of the serial run time of the best sequential algorithm for solving a problem
to the time taken by the parallel algorithm to solve the same problem on parallel processors.
S = Ts / Tp
The efficiency is defined as the ratio of speed up to the number of processors. Efficiency measures the fraction
of time for which a processor is usefully utilized.
E = S / P
The cost of solving a problem on a parallel system is defined as the product of the parallel run time and the number of processors.
C = P * Tp
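A small worked example with assumed numbers (Ts = 100 s serial time, Tp = 30 s parallel time on P = 4 processors) showing how the three metrics are computed:

```python
# Worked example of speedup, efficiency and cost (assumed numbers).
Ts, Tp, P = 100.0, 30.0, 4

S = Ts / Tp     # speedup: 3.33
E = S / P       # efficiency: 0.83
C = P * Tp      # cost: 120.0
print(S, E, C)
```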
Amdahl’s Law
Amdahl’s Law was named after Gene Amdahl, who presented it in 1967. In general terms, Amdahl’s Law states that in parallelization, if P is the proportion of a system or program that can be made parallel, and 1 - P is the proportion that remains serial, then the maximum speedup S(N) that can be achieved using N processors is:
S(N) = 1 / ((1 - P) + P / N)
As N grows, the speedup tends to 1/(1 - P): speedup is limited by the total time needed for the sequential (serial) part of the program. For 10 hours of computing, if we can parallelize 9 hours of computing and 1 hour cannot be parallelized, then our maximum speedup is limited to 10 times as fast.
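A small numerical check of Amdahl’s Law using the 10-hour example above (the function name is just for illustration):

```python
# Amdahl's Law: maximum speedup with parallel fraction p on n processors.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

print(amdahl_speedup(0.9, 10))       # ~5.26x on 10 processors
print(amdahl_speedup(0.9, 10**6))    # approaches the 1/(1-p) = 10x limit
```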
Gustafson’s law
This law says that increasing the problem size on large machines can retain scalability with respect to the number of processors.
American computer scientist and businessman, John L. Gustafson (born January 19, 1955) found out that
practical problems show much better speedup than Amdahl predicted.
The computation time is held constant (instead of the problem size): an increasing number of CPUs solves a bigger problem and gets better results in the same time.
• The execution time of the program on a parallel computer is (a + b), where a is the sequential time and b is the parallel time.
• The total amount of work to be done in parallel varies linearly with the number of processors, so b is fixed as P is varied.
• The total serial run time is therefore (a + P*b), and the speedup is S = (a + P*b) / (a + b).
• Define α = a/(a + b), the sequential fraction of the execution time. Then any sufficiently large problem can be efficiently parallelized with a speedup
S = P - α * (P - 1)
where P is the number of processors and α is the serial portion of the problem.
Gustafson proposed a fixed time concept which leads to scaled speed up for larger problem sizes. Basically,
we use larger systems with more processors to solve larger problems.
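A short sketch of the scaled speedup formula, with an assumed serial fraction α = 0.1 on 64 processors:

```python
# Gustafson's scaled speedup: S = P - alpha * (P - 1), where alpha is the serial fraction.
def gustafson_speedup(p, alpha):
    return p - alpha * (p - 1)

print(gustafson_speedup(64, 0.1))    # 57.7: speedup grows nearly linearly with P
```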
Chapter 2 Processor Architecture
2.1 Uniprocessor Architecture
A uniprocessor system is defined as a computer system that has a single central processing unit that is used to execute computer tasks. A uniprocessor is a system with a single processor that has three major components: main memory (the central storage unit), the central processing unit (CPU), and input-output units such as the monitor, keyboard and mouse.
The first computers were all uniprocessor systems. Very simple embedded systems often have only one
processor: Car keys, digital alarm clocks, smoke detectors, etc. are all likely to have only one processor.
Any non-safety-critical system with limited functionality will be a uniprocessor system.
2.2 RISC & CISC Architecture
RISC (reduced instruction set computer)
A reduced instruction set computer is a computer that uses only simple instructions; complex operations are divided into several simple instructions, each of which completes a low-level operation within a single clock (CLK) cycle. The main aim is to reduce instruction execution time by limiting and optimizing the number of instructions. Each instruction cycle uses a single clock cycle, which includes three stages: fetch, decode and execute. This kind of processor executes difficult operations by breaking them down into simpler instructions. A RISC processor needs fewer transistors to design, and it reduces the execution time per instruction.
Advantages
✓ The performance of this processor is good because of its simple and limited instruction set.
✓ This processor uses fewer transistors in the design, which makes it cheaper.
✓ The RISC processor allows instructions to utilize free space on a microprocessor because of its simplicity.
✓ It is very simple compared with other processors; because of this, it can finish a task within a single clock cycle.
CISC (Complex instruction set computer)
A complex instruction set computer is a computer in which a single instruction can perform numerous low-level operations, such as a load from memory, an arithmetic operation and a memory store, or in which such operations are accomplished by multi-step processes or addressing modes within a single instruction.
CISC supports high-level languages, enabling simple compilation and complex data structures. Fewer instructions are required to write an application. The code length is very short, so it needs very little RAM. The emphasis is on implementing instructions in hardware during design, as hardware is faster than software. Instructions are larger than a single word.
Advantages
✓ In the CISC processor, the compiler needs little effort to translate a program or statement from a high-level language into assembly or machine language.
✓ A single instruction can be executed by carrying out several different low-level tasks.
✓ It doesn’t use much memory due to the short length of the code.
✓ CISC uses fewer instructions than RISC to accomplish the same task.
Examples of CISC processors include AMD, VAX, System/360 and Intel x86.
Differences between CISC and RISC architectures
2.3 Parallel processing mechanism for Uni-processor
Parallelism in a uniprocessor means a system with a single processor performing two or more tasks simultaneously. Parallelism can be achieved by two means: hardware and software.
Hardware Approach for Parallelism in Uniprocessor
1. Multiplicity of Functional Unit
In earlier computers, the CPU consisted of only one arithmetic logic unit, which performed only one function at a time. This slows down the execution of long sequences of arithmetic instructions. To overcome this, the number of functional units in the CPU can be increased to perform parallel and simultaneous arithmetic operations.
2. Parallelism and Pipelining within CPU
The term Pipelining refers to a technique of decomposing a sequential process into sub-operations, with each
sub-operation being executed in a dedicated segment that operates concurrently with all other segments. The
most important characteristic of a pipeline technique is that several computations can be in progress in distinct
segments at the same time.
Parallel adders can be implemented using techniques such as carry-lookahead and carry-save. A parallel adder is a digital circuit that adds two binary numbers that are more than one bit in length, operating on corresponding pairs of bits in parallel.
The multiplier can be recoded to eliminate more complex calculations. The various instruction execution phases are pipelined, and to support overlapped instruction execution, techniques such as instruction prefetch and data buffering are used.
2.4 Multiprocessor and Multicomputer Model
Multiprocessor
A multiprocessor is a computer system with two or more central processing units (CPUs) that share full access to a common RAM. The main objective of using a multiprocessor is to boost the system’s execution speed.
There are two types of multiprocessors: shared memory multiprocessors and distributed memory multiprocessors. In a shared memory multiprocessor, all the CPUs share the common memory, but in a distributed memory multiprocessor, every CPU has its own private memory.
Benefits of using a Multiprocessor
➢ Enhanced performance.
➢ Multiple applications.
➢ Multi-tasking inside an application.
➢ High throughput and responsiveness.
➢ Hardware sharing among CPUs.
Shared Memory multiprocessor
Three most common shared memory multiprocessors models are:
1. Uniform Memory Access (UMA)
2. Non-uniform Memory Access (NUMA)
3. Cache Only Memory Architecture (COMA)
UMA (Uniform Memory Access) model
This model shares physical memory uniformly among the processors: all processors have equal access time to all memory words, and each processor may have a private cache memory.
When all the processors have equal access to all the peripheral devices, the system is called a symmetric multiprocessor. When only one or a few processors can access the peripheral devices, the system is called an asymmetric multiprocessor.
UMA is suitable for general-purpose and time-sharing applications. It is slower than NUMA.
Non-uniform Memory Access (NUMA)
NUMA is also a multiprocessor model, in which each processor is connected to a dedicated memory. However, these small parts of the memory combine to make a single address space. In the NUMA multiprocessor model, the access time varies with the location of the memory word. Here, the shared memory is physically distributed among all the processors as local memories.
The collection of all local memories forms a global address space which can be accessed by all the processors. NUMA architecture is intended to increase the available memory bandwidth.
NUMA is suitable for real-time and time-critical applications.
Cache Only Memory Architecture (COMA)
The COMA model is a special case of the NUMA model in which all the distributed main memories are converted into cache memories. The model combines multiprocessing with cache memory: it changes distributed memory into caches, there is no memory hierarchy at each processor node, and the global address space is made up of all the caches combined.
Difference Between UMA and NUMA
Multicomputer
A multicomputer system is a computer system with multiple processors, known as nodes, that are connected together to solve a problem. Each processor has its own memory that is accessible only by that particular processor, and the processors communicate with each other via an interconnection network. Each node acts as an autonomous computer having a processor, a local memory and sometimes I/O devices. In this case, all local memories are private and are accessible only to the local processors.
Because a multicomputer is capable of message passing between the processors, it is possible to divide a task among the processors to complete it. Hence, a multicomputer can be used for distributed computing. It is more cost effective and easier to build a multicomputer than a multiprocessor.
➢ NORMA Model
In a NoRMA (No Remote Memory Access) architecture, the address space is not globally unique and memory is not globally accessible by the processors. Accesses to remote memory modules are possible only indirectly, by sending messages through the interconnection network to other processors, which in turn may deliver the desired data in a reply message.
The advantage of the NoRMA model is the ability to construct extremely large configurations, which is achieved by shifting the problem to the user configuration.
Programs for NoRMA architectures need to evenly partition the data into local memory modules, handle transformations of data identifiers from one processor's address space to another, and implement a message-passing system for remote access to data. The programming model of the NoRMA architecture is therefore extremely complicated.
Multicomputer architecture
Difference between multiprocessor and Multicomputer