Distributed Computing (EG 3113 CT)

Parallel Computing Structure
Parallel computing is a type of computing architecture in which several processors simultaneously execute multiple smaller calculations broken down from a larger, more complex problem. Parallel computing refers to the process of breaking down larger problems into smaller, independent, often similar parts that can be executed simultaneously by multiple processors communicating via shared memory, with the results combined upon completion as part of an overall algorithm. The primary goal of parallel computing is to increase the available computation power for faster application processing and problem solving.

Motivation of Parallelism
• The real world is dynamic in nature: many things happen at the same time in different places, and the resulting data is extremely large to manage. Real-world data requires dynamic simulation and modelling, and parallel computing is the key to achieving this.
• Parallel computing provides concurrency and saves time and money.
• Complex, large datasets and their management can be organized only through a parallel computing approach.
• It ensures the effective utilization of resources. The hardware is used effectively, whereas in serial computation only part of the hardware is used while the rest sits idle.
• It is impractical to implement real-time systems using serial computing.

Types of Parallelism
1. Bit-level parallelism: achieved by increasing the processor word size, which reduces the number of instructions the processor must execute to operate on variables wider than the word length. Example: consider a scenario where an 8-bit processor must compute the sum of two 16-bit integers. It must first add the 8 lower-order bits and then the 8 higher-order bits, thus requiring two instructions to perform the operation; a 16-bit processor can perform the operation with a single instruction (see the sketch after this list).
2. Instruction-level parallelism: the hardware approach works on dynamic parallelism, in which the processor decides at run time which instructions to execute in parallel; the software approach works on static parallelism, in which the compiler decides which instructions to execute in parallel.
3. Task parallelism: a form of parallelization of computer code across multiple processors that runs several different tasks at the same time on the same data.
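The two-step addition described in the bit-level example can be made concrete with a small sketch. The following Python snippet is purely illustrative (the function name and the simplified 8-bit ALU model are assumptions made for this note): it mimics how an 8-bit processor needs two add operations, with a carry, to produce the result that a 16-bit processor obtains in one instruction.

    def add16_on_8bit_alu(a, b):
        """Hypothetical illustration: add two 16-bit values using only 8-bit operations."""
        lo = (a & 0xFF) + (b & 0xFF)                          # first add: low-order bytes
        carry = lo >> 8                                        # carry out of the low byte
        hi = ((a >> 8) & 0xFF) + ((b >> 8) & 0xFF) + carry     # second add: high bytes plus carry
        return ((hi & 0xFF) << 8) | (lo & 0xFF)                # combine, wrapping at 16 bits

    # A 16-bit ALU would compute this in a single operation.
    assert add16_on_8bit_alu(0x1234, 0x0FCD) == (0x1234 + 0x0FCD) & 0xFFFF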
Moore's Law
Moore's Law refers to Gordon Moore's observation that the number of transistors on a microchip doubles roughly every two years while the cost of computers is halved. Moore's Law states that we can expect the speed and capability of our computers to increase every couple of years, and that we will pay less for them. Another tenet of Moore's Law is that this growth is exponential. Gordon E. Moore, the co-founder of Intel, made this observation in 1965, and it became known as Moore's Law.

Grand Challenge Problems
Grand Challenges are defined by the Federal High Performance Computing and Communications (HPCC) program as fundamental problems in science and engineering with broad economic and scientific impact, whose solutions require the application of high-performance computing. The following is a list of "official" Grand Challenge applications:
• Aerospace
• Computer Science
• Energy
• Environmental Monitoring and Prediction
• Molecular Biology and Biomedical Imaging
• Product Design and Process Optimization
• Space Science

Instruction-Level Parallelism (ILP)
Instruction-level parallelism (ILP) is a measure of how many of the operations in a computer program can be performed simultaneously. The potential overlap among instructions is called instruction-level parallelism. ILP is used to achieve not only instruction overlap but the actual execution of more than one instruction at a time through dynamic scheduling, with the goal of maximizing processor throughput. There are two main approaches:
• Dynamic, or hardware-based: the processor decides at run time which instructions to execute in parallel. This approach is used in server and desktop processors.
• Static, or compiler-based: the compiler decides which instructions to execute in parallel.
Consider the following program:
e = a + b
f = c + d
m = e * f
Operation 3 depends on the results of operations 1 and 2, so it cannot be calculated until both of them are completed. However, operations 1 and 2 do not depend on any other operation, so they can be calculated simultaneously. If we assume that each operation can be completed in one unit of time, then these three instructions can be completed in a total of two units of time.

Thread-Level Parallelism (TLP)
TLP is a software capability that allows high-end programs, such as a database or web application, to work with multiple threads at the same time. Programs that support this ability can do much more, even under high workloads. Thread-level parallelism used to be exploited only on commercial servers; however, as applications and processes became more demanding, this technology trickled down to the masses in the form of the multi-core processors found in today's desktop computers. ILP can be quite limited: the hardware we use imposes limits, the number of virtual registers available is finite, and memory-address disambiguation is not always possible. More importantly, aggressive exploitation of ILP may lead to increased power consumption. Furthermore, there may be significant parallelism occurring naturally at a higher level in the application that cannot be exploited with the approaches used to exploit ILP. For example, an online transaction processing system has natural parallelism among the multiple queries and updates presented by requests. These queries and updates can be processed mostly in parallel, since they are largely independent of one another. This higher-level parallelism is called thread-level parallelism because it is logically structured as separate threads of execution.

Data-Level Parallelism
Data parallelism is parallelization across multiple processors in parallel computing environments. It focuses on distributing the data across different nodes, which operate on the data in parallel. It can be applied to regular data structures such as arrays and matrices by working on each element in parallel. For example, a parallel job on an array of n elements can be divided equally among all the processors. In a multiprocessor system where each processor executes a single set of instructions, data parallelism is achieved when each processor performs the same task on different pieces of distributed data. Data parallelism is used by many applications, especially data-processing applications; database systems are one example. A minimal sketch of the idea follows.
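As a minimal sketch of the data-parallel style, the snippet below uses Python's multiprocessing module; the square function, the pool size, and the data are arbitrary choices for the example. An array is split across worker processes that each apply the same operation to their share of the data.

    from multiprocessing import Pool

    def square(x):
        # The same operation is applied to every element of the data.
        return x * x

    if __name__ == "__main__":
        data = list(range(16))            # the array to be processed
        with Pool(processes=4) as pool:   # 4 workers operate on the data in parallel
            results = pool.map(square, data)
        print(results)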
Memory-Level Parallelism (MLP)
Memory-level parallelism means generating and servicing multiple memory accesses in parallel. MLP is a term in computer architecture referring to the ability to have multiple memory operations pending at the same time, in particular cache misses or translation lookaside buffer (TLB) misses. In a single processor, MLP may be considered a form of instruction-level parallelism (ILP). In general, processors are fast but memory is slow. One way to bridge this gap is to service memory accesses in parallel: if the misses are serviced in parallel, the processor takes only one long-latency stall for all of the parallel misses. This idea of fetching the misses in parallel is termed memory-level parallelism.

Granularity
In parallel computing, the granularity (or grain size) of a task is a measure of the amount of work (or computation) performed by that task.

Fine-Grained Parallelism
In fine-grained parallelism, the program is divided into a large number of small tasks. These tasks are assigned individually to many processors. The amount of work associated with each parallel task is low, and the work is evenly distributed among the processors; therefore, fine-grained parallelism facilitates load balancing. Because each task processes only a small amount of data, a large number of processors is required to perform the complete computation, which in turn increases communication and synchronization. Fine-grained parallelism with a large number of independent tasks needs to consider the following (a sketch of the trade-off appears at the end of this granularity discussion):
➢ Compute intensity needs to be moderate so that each independent unit of parallelism has sufficient work to do.
➢ The cost of communication needs to be low so that each independent unit of parallelism can execute independently.
➢ Workload scheduling is important owing to the large number of independent tasks that can execute in parallel. The flexibility provided by an effective workload scheduler can achieve load balance when a large number of tasks are being executed.

Coarse-Grained Parallelism
In coarse-grained parallelism, the program is broken down into a small number of large tasks, so a large amount of computation takes place in each processor. This can lead to load imbalance, in which some processors handle most of the data while others sit idle. Further, coarse-grained parallelism fails to exploit much of the parallelism in the program, since most of the computation is executed sequentially on a processor. The advantage of this type of parallelism is low communication and synchronization overhead. Coarse-grained parallelism needs to consider the following:
➢ Compute intensity needs to be higher than in the fine-grained case, since there are fewer tasks that execute independently.
➢ The developer must identify complete portions of an application that can serve as a task.

Medium-Grained Parallelism
Medium-grained parallelism is defined relative to fine-grained and coarse-grained parallelism. It is a compromise between the two, with a task size and communication time greater than in fine-grained parallelism and lower than in coarse-grained parallelism. Most general-purpose parallel computers fall into this category.
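One way to see the fine-grained versus coarse-grained trade-off in practice is the chunksize parameter of a process pool: small chunks mean many small tasks (better load balance, more scheduling and communication overhead), while large chunks mean a few big tasks (less overhead, but some workers may sit idle). The sketch below is only an illustration under these assumptions; the work function and the sizes are invented for the example.

    from multiprocessing import Pool

    def work(x):
        # Stand-in for one unit of computation on one element.
        return sum(i * x for i in range(1000))

    if __name__ == "__main__":
        data = list(range(10_000))
        with Pool(processes=4) as pool:
            # Fine-grained: many small tasks (chunksize=1) -> good load balance,
            # but more communication and scheduling overhead.
            fine = pool.map(work, data, chunksize=1)
            # Coarse-grained: a few large tasks -> low overhead,
            # but some workers may finish early and sit idle.
            coarse = pool.map(work, data, chunksize=2500)
        assert fine == coarse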
Performance of Parallel Processors
Run time: the parallel run time is defined as the time that elapses from the moment a parallel computation starts to the moment the last processor finishes execution. Notation: serial run time Ts, parallel run time Tp.
Speedup: the ratio of the serial run time of the best sequential algorithm for solving a problem to the time taken by the parallel algorithm to solve the same problem on parallel processors: S = Ts / Tp.
Efficiency: the ratio of the speedup to the number of processors; it measures the fraction of time for which a processor is usefully utilized: E = S / P.
Cost: the cost of solving a problem on a parallel system is defined as the product of the parallel run time and the number of processors.

Amdahl's Law
Amdahl's Law is named after Gene Amdahl, who presented it in 1967. In general terms, Amdahl's Law states that if P is the proportion of a system or program that can be made parallel and 1 - P is the proportion that remains serial, then the maximum speedup S(N) that can be achieved using N processors is:
S(N) = 1 / ((1 - P) + P / N)
As N grows, the speedup tends to 1 / (1 - P): speedup is limited by the total time needed for the sequential (serial) part of the program. For 10 hours of computing, if 9 hours can be parallelized and 1 hour cannot, the maximum speedup is limited to 10 times as fast.

Gustafson's Law
This law says that increasing the problem size on large machines can retain scalability with respect to the number of processors. The American computer scientist and businessman John L. Gustafson (born January 19, 1955) found that practical problems show much better speedup than Amdahl predicted. Here the computation time is held constant (instead of the problem size): an increasing number of CPUs solves a bigger problem and gets better results in the same time.
• The execution time of the program on a parallel computer is (a + b), where a is the sequential time and b is the parallel time.
• The total amount of work to be done in parallel varies linearly with the number of processors P, so b is fixed as P is varied.
• The run time of the same (scaled) work on a serial processor is (a + P*b).
• The speedup is therefore S = (a + P*b) / (a + b).
• Defining α = a / (a + b), the sequential fraction of the execution time, any sufficiently large problem can be efficiently parallelized with a speedup S = P - α * (P - 1), where P is the number of processors and α is the serial portion of the problem.
Gustafson proposed a fixed-time concept, which leads to scaled speedup for larger problem sizes: basically, we use larger systems with more processors to solve larger problems. A short numeric comparison of the two laws follows.
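The two laws can be compared with a short numeric sketch (the 90% parallel fraction and the processor counts are arbitrary example values): Amdahl's fixed-size speedup saturates as N grows, while Gustafson's scaled speedup keeps growing almost linearly with the number of processors.

    def amdahl_speedup(p_parallel, n):
        # Fixed problem size: S(N) = 1 / ((1 - P) + P / N)
        return 1.0 / ((1.0 - p_parallel) + p_parallel / n)

    def gustafson_speedup(alpha, p):
        # Scaled problem size: S = P - alpha * (P - 1), alpha = serial fraction
        return p - alpha * (p - 1)

    for n in (2, 10, 100, 1000):
        print(n, round(amdahl_speedup(0.9, n), 2), round(gustafson_speedup(0.1, n), 2))
    # Amdahl's speedup approaches 1 / (1 - 0.9) = 10; Gustafson's grows with N.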
2.2 RISC & CISC Architecture RISC (reduced instruction set computer) A reduced instruction set computer is a computer that only uses simple commands that can be divided into several instructions that achieve low-level operation within a single CLK cycle. The main function of this is to reduce the time of instruction execution by limiting as well as optimizing the number of commands. So each command cycle uses a single clock cycle where every clock cycle includes three parameters namely fetch, decode & execute. The kind of processor is mainly used to execute several difficult commands by merging them into simpler ones. RISC processor needs a number of transistors to design and it reduces the instruction time for execution. Advantages ✓ The performance of this processor is good because of the easy & limited no. of the instruction set. ✓ This processor uses several transistors in the design so that making is cheaper. ✓ RISC processor allows the instruction to utilize open space on a microprocessor due to its simplicity. ✓ It is very simple as compared with another processor due to this; it can finish its task within a single clock cycle. CISC (Complex instruction set computer) Complex instruction set computer is a computer where single instructions can perform numerous low-level operations like a load from memory, an arithmetic operation, and a memory store or are accomplished by multistep processes or addressing modes in single instructions. CISC supports high-level languages for simple compilation and complex data structure. For writing an application, less instruction is required. The code length is very short, so it needs extremely small RAM. It highlights the instruction on hardware while designing as it is faster to design than the software. Instructions are larger as compared with a single word. Advantages ✓ In the CISC processor, the compiler needs a small effort to change the program or statement from high-level to assembly otherwise machine language. ✓ A single instruction can be executed by using different low-level tasks ✓ It doesn’t use much memory due to a short length of code. ✓ CISC utilizes less instruction set to execute the same instruction as the RISC. examples of the CISC processor include AMD, VAX, System/360 & Intel x86. differences between CISC and RISC architectures 2.3 Parallel processing mechanism for Uni-processor Parallelism in a uniprocessor means a system with a single processor performing two or more than two tasks simultaneously. Parallelism can be achieved by two means hardware and software. Hardware Approach for Parallelism in Uniprocessor 1. Multiplicity of Functional Unit In earlier computers, the CPU consists of only one arithmetic logic unit which used to perform only one function at a time. This slows down the execution of the long sequence of arithmetic instructions. To overcome this the functional units of the CPU can be increased to perform parallel and simultaneous arithmetic operations. 2. Parallelism and Pipelining within CPU The term Pipelining refers to a technique of decomposing a sequential process into sub-operations, with each sub-operation being executed in a dedicated segment that operates concurrently with all other segments. The most important characteristic of a pipeline technique is that several computations can be in progress in distinct segments at the same time. Parallel adders can be implemented using techniques such as carry-lookahead and carry-save. 
Parallel adders can be implemented using techniques such as carry-lookahead and carry-save. A parallel adder is a digital circuit that adds two multi-bit binary numbers by operating on corresponding pairs of bits simultaneously. The multiplier can be recoded to eliminate more complex calculations. The various instruction-execution phases are pipelined, and techniques such as instruction prefetch and data buffering are used to support overlapped instruction execution.

2.4 Multiprocessor and Multicomputer Models
Multiprocessor
A multiprocessor is a computer system in which two or more central processing units (CPUs) share full access to a common RAM. The main objective of using a multiprocessor is to boost the system's execution speed. There are two types of multiprocessors: shared-memory multiprocessors and distributed-memory multiprocessors. In a shared-memory multiprocessor all the CPUs share the common memory, whereas in a distributed-memory multiprocessor every CPU has its own private memory.
Benefits of using a multiprocessor
➢ Enhanced performance.
➢ Multiple applications.
➢ Multi-tasking inside an application.
➢ High throughput and responsiveness.
➢ Hardware sharing among CPUs.

Shared-Memory Multiprocessors
The three most common shared-memory multiprocessor models are:
1. Uniform Memory Access (UMA)
2. Non-Uniform Memory Access (NUMA)
3. Cache-Only Memory Architecture (COMA)

UMA (Uniform Memory Access) model
This model shares physical memory uniformly between the processors: all processors have equal access time to all memory words, and each processor may have a private cache. When all processors have equal access to all peripheral devices, the system is called a symmetric multiprocessor; when only one or a few processors can access the peripheral devices, it is called an asymmetric multiprocessor. UMA is suitable for general-purpose and time-sharing applications and is slower than NUMA.

Non-Uniform Memory Access (NUMA)
NUMA is a multiprocessor model in which each processor is connected to its own dedicated memory; these small parts of memory combine to form a single address space. In the NUMA model the access time varies with the location of the memory word. The shared memory is physically distributed among the processors as local memories, and the collection of all local memories forms a global address space accessible by all processors. NUMA is intended to increase the available memory bandwidth and is suitable for real-time and time-critical applications.

Cache-Only Memory Architecture (COMA)
The COMA model is a special case of the NUMA model in which all the distributed main memories are converted into cache memories. It combines the multiprocessor organization with cache memory: distributed memory becomes cache, there is no conventional memory hierarchy, and the global address space is formed by combining all the caches.

Difference between UMA and NUMA:
• UMA: a single shared memory; every processor sees the same access time to every memory word; limited memory bandwidth; suited to general-purpose and time-sharing applications.
• NUMA: memory is physically distributed among the processors; access time depends on where the memory word is located; higher aggregate memory bandwidth; suited to real-time and time-critical applications.
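A software-level analogy for the shared-memory model is several threads updating one data structure inside a single address space; because the memory is shared, access has to be coordinated with a lock. This is only an illustrative sketch (the counter, the thread count, and the iteration count are made up for the example), not a description of any particular multiprocessor.

    import threading

    counter = 0                     # lives in one shared address space
    lock = threading.Lock()         # coordinates access to the shared data

    def worker(iterations):
        global counter
        for _ in range(iterations):
            with lock:              # every thread sees and updates the same word
                counter += 1

    threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)                  # 40000: every update landed in the shared counter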
Multicomputer
A multicomputer system is a computer system in which multiple processors, known as nodes, are connected together to solve a problem. Each processor has its own memory, accessible only by that processor, and the processors communicate with each other via an interconnection network. Each node acts as an autonomous computer with a processor, a local memory, and sometimes I/O devices. All local memories are private and accessible only to the local processors. Because a multicomputer supports message passing between processors, a task can be divided among the processors in order to complete it; hence a multicomputer can be used for distributed computing. It is more cost-effective and easier to build a multicomputer than a multiprocessor.

➢ NORMA Model
In a NORMA (no remote memory access) architecture, there is no single global address space and memory is not globally accessible by the processors. Accesses to remote memory modules are possible only indirectly, by sending messages through the interconnection network to other processors, which in turn may deliver the desired data in a reply message. The advantage of the NORMA model is the ability to construct extremely large configurations, which is achieved by shifting the burden onto the programmer. Programs for NORMA architectures need to partition the data evenly across local memory modules, handle transformations of data identifiers from one processor's address space to another, and implement a message-passing system for remote access to data. The programming model of the NORMA architecture is therefore quite complicated.

Figure: multicomputer architecture

Difference between a multiprocessor and a multicomputer:
• Multiprocessor: the CPUs share a common memory and a single address space; processors communicate through the shared memory; harder and more expensive to build.
• Multicomputer: each node has its own private memory and there is no shared address space; processors communicate by passing messages over an interconnection network; cheaper and easier to build, and well suited to distributed computing.
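By contrast, the multicomputer/NORMA style can be sketched with processes that keep their data private and exchange results only through explicit messages; here a multiprocessing Queue stands in for the interconnection network, and the partial-sum task is invented for the illustration.

    from multiprocessing import Process, Queue

    def node(node_id, local_data, queue):
        # Each node works only on its own private memory ...
        partial = sum(local_data)
        # ... and communicates its result as a message over the "network".
        queue.put((node_id, partial))

    if __name__ == "__main__":
        data = list(range(100))
        chunks = [data[i::4] for i in range(4)]       # statically partition the data
        queue = Queue()
        procs = [Process(target=node, args=(i, chunks[i], queue)) for i in range(4)]
        for p in procs:
            p.start()
        total = sum(queue.get()[1] for _ in procs)    # combine the reply messages
        for p in procs:
            p.join()
        print(total)                                  # 4950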