TDTS 08 – Lecture 7: Parallel Processing (Zebo Peng, IDA, LiTH, 2015-11-19)
- Introduction and motivation
- Architecture classification
- Performance evaluation
- Interconnection network

Performance Improvement
- Reduction of instruction execution time:
  • increased clock frequency through fast circuit technology;
  • simplified instructions (RISC).
- Parallelism within the processor:
  • pipelining;
  • parallel execution of instructions (ILP): superscalar processors, VLIW architectures.
- Parallel processing: a huge degree of parallelism is possible.

Why Parallel Processing?
- Traditional computers cannot meet the high-performance requirements of many applications:
  • simulation of large, complex systems in physics, economics, biology, ...
  • distributed databases with search functions;
  • computer-aided design;
  • visualization and multimedia;
  • multi-tasking and multi-user systems (e.g., supercomputers).
- Such applications are characterized by a very large amount of numerical computation and/or a very large volume of input data.
- To deliver sufficient performance for such applications, we can put many processors in a single computer.

Why Parallel Processing? (Cont'd)
- Technology development: hardware and silicon technology make it possible to build machines with a huge degree of parallelism cost-effectively. It started with mainframes and supercomputers; now even file servers and regular PCs are often implemented as parallel machines.
- Parallel processing also has the potential to be more reliable: if one processor fails, the system continues to work, at lower performance.
- Parallel processing also provides a platform for building scalable systems with different levels of performance and capability.

Parallel Computer
- A parallel computer is an architecture in which many CPUs run in parallel to implement a given application or set of applications.
- Such computers can be organized in different ways, depending on several key parameters:
  • number and complexity of the individual CPUs;
  • availability of common (shared) memory;
  • interconnection technology and topology;
  • performance of the interconnection network;
  • I/O devices; etc.

Parallel Program
- To fully utilize a parallel computer, one must decompose a problem into sub-problems that can be solved in parallel.
- The results of the sub-problems may have to be combined to obtain the final result of the main problem.
- Due to data dependencies among the sub-problems, it is not easy to decompose some problems so as to obtain a large degree of parallelism.
- Due to data dependencies, the processors may also have to communicate with each other.
- The time taken for communication is usually very high compared with the processing time, so the communication mechanism must be very well designed in order to get good performance.

Parallel Program Example (1)
- Matrix computation C = A + B for N×M matrices: every element is computed independently as C[i,j] = A[i,j] + B[i,j], for i = 1..N and j = 1..M.
- Vector computation with vectors of m elements (each iteration adds one row of m elements and is independent of the others):

  for i := 1 to n do
    C[i, 1:m] := A[i, 1:m] + B[i, 1:m];
  end for;
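Since no iteration of the loop above depends on any other, the rows can be added in parallel. Below is a minimal C sketch of the same computation, assuming a shared-memory machine and using an OpenMP parallel-for directive to distribute the rows over the processors; the sizes N and M, the initial values, and the suggested compile command are illustrative assumptions, not part of the lecture.

    /* A minimal sketch (not part of the lecture): the row-wise vector
     * addition above, written in C with OpenMP for a shared-memory machine.
     * N, M and the initial matrix values are illustrative.
     * Compile e.g. with: gcc -fopenmp vec_add.c
     */
    #include <stdio.h>

    #define N 4   /* number of rows (vectors)      */
    #define M 3   /* number of elements per vector */

    int main(void) {
        double A[N][M], B[N][M], C[N][M];

        /* Fill A and B with some example values. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++) {
                A[i][j] = i + j;
                B[i][j] = i * j;
            }

        /* Each outer iteration handles one independent row, so the rows
         * can be given to different processors; one directive is enough. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++)
                C[i][j] = A[i][j] + B[i][j];

        printf("C[1][2] = %.1f\n", C[1][2]);
        return 0;
    }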
Parallel Program Example (2)
- A vector dot product is common in filtering:

  Y = Σ a(i) · x(i), summed over i = 1 to N

- Parallel sorting:
  [Figure: an UNSORTED list is split into four parts, Unsorted-1 to Unsorted-4; four Sorting units process them in parallel (the parallel part) to produce Sorted-1 to Sorted-4, which are then combined by a sequential Merge step (the sequential part) into the final SORTED list.]

Lecture 7: Parallel Processing
- Introduction and motivation
- Architecture classification
- Performance evaluation
- Interconnection network

Flynn's Classification of Architectures
- Based on the nature of the instruction flow executed by the computer and the data flow on which the instructions operate.
- The multiplicity of instruction streams and data streams gives us four different classes:
  • single instruction, single data stream (SISD);
  • single instruction, multiple data stream (SIMD);
  • multiple instruction, single data stream (MISD);
  • multiple instruction, multiple data stream (MIMD).

Single Instruction, Single Data (SISD)
- The regular computers we have discussed up to now:
  • a single processor;
  • a single instruction stream; and
  • data stored in a single memory.
  [Figure: a CPU, consisting of a control unit and a processing unit, connected to the memory system.]

Single Instruction, Multiple Data (SIMD)
- A single machine-instruction stream.
- Simultaneous execution on different sets of data.
- A large number of processing elements.
- Lockstep synchronization among the processing elements.
- The processing elements can:
  • have their own private data memories; or
  • share a common memory via an interconnection network.
- Array and vector processors are the most common examples of SIMD machines.

SIMD with Shared Memory
[Figure: one control unit broadcasts a single instruction stream (IS) to Processing Unit_1 ... Processing Unit_n; each processing unit exchanges its own data stream (DS1 ... DSn) with the shared memory via an interconnection network.]

Multiple Instruction, Single Data (MISD)
- A single sequence of data is transmitted to a set of processors.
- The processors execute different instruction sequences.
- MISD machines have not been commercially implemented so far.
  [Figure: a single data stream fed to processing elements PE1 ... PEn.]

Multiple Instruction, Multiple Data (MIMD)
- It consists of a set of processors that simultaneously execute different instruction sequences on different sets of data.
- The MIMD class can be further divided:
  • shared memory (tightly coupled): symmetric multiprocessors (SMP) and non-uniform memory access (NUMA) machines;
  • distributed memory (loosely coupled) = clusters.
- A small code sketch of an MIMD-style program is given after the discussion slide below.

MIMD with Shared Memory
[Figure: CPU_1 ... CPU_n, each with its own control unit, processing unit and local memory (LM1 ... LMn), execute independent instruction streams (IS1 ... ISn) and exchange their data streams (DS1 ... DSn) with a shared memory via an interconnection network.]

Discussion
- Very fast development in parallel processing and related areas has blurred concept boundaries, causing a lot of terminological confusion: concurrent computing, multiprocessing, distributed computing, etc.
- There is no strict delimiter for contributors to the area of parallel processing; it includes computer architecture (CA), operating systems (OS), high-level language (HLL) design, compilation, databases, and computer networks.
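To make the MIMD (shared-memory) style concrete, here is a minimal sketch, not part of the lecture, of the parallel-sorting scheme from the program example earlier: each chunk of the array is sorted in parallel, and the sorted chunks are then merged sequentially. The chunk count, chunk size and input values are illustrative assumptions, and OpenMP is assumed for the parallel part; without it the program simply runs sequentially.

    /* A minimal sketch (not part of the lecture): parallel sorting on a
     * shared-memory machine. The chunks are sorted in parallel, then
     * merged sequentially. Sizes and input data are illustrative. */
    #include <stdio.h>
    #include <stdlib.h>

    #define CHUNKS 4        /* number of processors / sub-problems */
    #define CHUNK_SIZE 8
    #define TOTAL (CHUNKS * CHUNK_SIZE)

    static int cmp_int(const void *a, const void *b) {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    int main(void) {
        int data[TOTAL], sorted[TOTAL];
        for (int i = 0; i < TOTAL; i++)
            data[i] = rand() % 100;                 /* unsorted input */

        /* Parallel part: the chunks are independent, so each one can be
         * sorted by a different processor. */
        #pragma omp parallel for
        for (int c = 0; c < CHUNKS; c++)
            qsort(data + c * CHUNK_SIZE, CHUNK_SIZE, sizeof(int), cmp_int);

        /* Sequential part: merge the sorted chunks by repeatedly taking
         * the smallest remaining head element. */
        int pos[CHUNKS] = {0};
        for (int k = 0; k < TOTAL; k++) {
            int best = -1;
            for (int c = 0; c < CHUNKS; c++) {
                if (pos[c] == CHUNK_SIZE)
                    continue;
                if (best < 0 ||
                    data[c * CHUNK_SIZE + pos[c]] < data[best * CHUNK_SIZE + pos[best]])
                    best = c;
            }
            sorted[k] = data[best * CHUNK_SIZE + pos[best]++];
        }

        for (int i = 0; i < TOTAL; i++)
            printf("%d ", sorted[i]);
        printf("\n");
        return 0;
    }

The split mirrors the figure's parallel part (the four independent sorts) and sequential part (the merge); as the next section shows, it is the sequential part that ultimately limits the achievable speedup.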
Lecture 7: Parallel Processing
- Introduction and motivation
- Architecture classification
- Performance evaluation
- Interconnection network

Performance Metrics (1)
- How fast does a parallel computer run at its maximal potential?
- Peak rate: the maximal computation rate that can theoretically be achieved when all processors are fully utilized.
  • Example: the fastest supercomputer in the world (as of this lecture) has a peak rate of 55 PFlop/s.
- The peak rate is of no practical significance for an individual user; it is mostly used by vendors for marketing their computers.

Performance Metrics (2)
- How fast an execution can we expect from a parallel computer for a given application or set of applications?
  • Note the increase of multi-tasking and multi-threaded computing.
- Speedup measures the gain we get by using a parallel computer, instead of a sequential one, to run a given application:

  S = Ts / Tp

  where Ts is the execution time on the sequential computer and Tp the execution time on the parallel computer.

Performance Metrics (3)
- Efficiency relates the speedup to the number of processors used; it therefore measures how efficiently the processors are used:

  E = S / P

  where S is the speedup and P the number of processors.
- In the ideal situation, in theory, S = P, which means E = 1. In practice the ideal efficiency of 1 cannot be achieved.

Speed-Up Limitation
- Let f be the fraction of the computation that, according to the algorithm, has to be executed sequentially (0 ≤ f ≤ 1), and P the number of processors. Then

  Tp = f × Ts + (1 – f) × Ts / P

  S = Ts / Tp = 1 / (f + (1 – f)/P)

  [Figure: speedup S (1 to 10) as a function of f (0 to 1.0) for a parallel computer with 10 processing elements.]

Speed-Up vs. % of Parallel Part (1 – f)
[Figure: speedup versus number of cores (1 to 1024) for parallel fractions (1 – f) of 0.5, 0.75, 0.95, 0.99, and 1.]

Amdahl's Law
- Even a small fraction of sequential computation imposes a limit on the speedup: a speedup higher than 1/f cannot be achieved, regardless of the number of processors, since

  S = 1 / (f + (1 – f)/P) ≤ 1/f

- If there is 20% sequential computation, the speedup will be at most 5, even if you have 1 million processors.
- To exploit a large number of processors efficiently, f must be small (the algorithm has to be highly parallel), since

  E = S / P = 1 / (f × (P – 1) + 1)

Other Factors that Limit Speedup
- Besides the intrinsic sequentiality of parts of an algorithm, other factors also limit the achievable speedup:
  • communication cost;
  • load balancing of the processors;
  • costs of creating and scheduling processes; and
  • I/O operations (mostly sequential in nature).
- There are many algorithms with a high degree of parallelism: their value of f is very small and can be ignored, and they are well suited for massively parallel systems. For such algorithms, the other limiting factors, such as the cost of communication, become critical.

Impact of Communication
- Consider a highly parallel computation, where f is small and can be neglected.
- Let fc be the fractional communication overhead of a processor, with Tcalc the time a processor spends on computation and Tcomm the time it is idle because of communication:

  fc = Tcomm / Tcalc

  Tp = (Ts / P) × (1 + fc)

  S = Ts / Tp = P / (1 + fc)

  E = 1 / (1 + fc) ≈ 1 – fc   (if fc is small)

- With algorithms having a high degree of parallelism, massively parallel computers consisting of a large number of processors can be used efficiently if fc is small: the time a processor spends on communication has to be small compared to its computation time.
- Communication time is strongly affected by the interconnection network.
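As a small numeric illustration, not part of the lecture, the sketch below evaluates the two formulas above: Amdahl's bound with an assumed sequential fraction f = 0.20, and the communication-limited efficiency with assumed times of 100 time units of computation and 5 of communication per processor. The function names are illustrative.

    /* A minimal sketch (not part of the lecture): speedup limits from the
     * formulas above. amdahl_speedup() uses the sequential fraction f;
     * comm_efficiency() uses the fractional communication overhead fc.
     * The example values are illustrative. */
    #include <stdio.h>

    static double amdahl_speedup(double f, double P) {
        return 1.0 / (f + (1.0 - f) / P);       /* S = 1 / (f + (1 - f)/P) */
    }

    static double comm_efficiency(double t_calc, double t_comm) {
        double fc = t_comm / t_calc;            /* fc = Tcomm / Tcalc */
        return 1.0 / (1.0 + fc);                /* E  = 1 / (1 + fc)  */
    }

    int main(void) {
        /* Amdahl's law: 20% sequential code caps the speedup at 1/f = 5. */
        for (int P = 1; P <= 1024; P *= 4)
            printf("f = 0.20, P = %4d: S = %.2f\n", P, amdahl_speedup(0.20, P));

        /* Communication overhead: 5 units of communication per 100 units
         * of computation gives fc = 0.05 and E = 1/1.05, close to 1 - fc. */
        printf("E = %.3f\n", comm_efficiency(100.0, 5.0));
        return 0;
    }

With f = 0.20 the speedup approaches, but never exceeds, 1/f = 5 as P grows, matching the statement on the Amdahl's Law slide; with fc = 0.05 the efficiency is 1/1.05 ≈ 0.95 ≈ 1 – fc.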
Lecture 7: Parallel Processing
- Introduction and motivation
- Architecture classification
- Performance evaluation
- Interconnection network

Interconnection Network
- The interconnection network (IN) is a key component of a parallel architecture. It has a decisive influence on:
  • the overall performance; and
  • the total cost of the architecture.
- The traffic in an IN consists of:
  • data transfers; and
  • transfers of commands and requests (control information).
- The key parameters of an IN are:
  • total bandwidth: bits transferred per second; and
  • implementation cost.

Single Bus
[Figure: nodes Node1 ... Noden attached to one shared bus.]
- Single-bus networks are simple, cheap and relatively flexible, and they provide a broadcast mechanism.
- Only one communication is allowed at a time; the bandwidth is shared by all nodes, so performance is relatively poor.
- To get good performance, the number of nodes has to be limited (to around 16–20). Multiple buses can be used instead, if needed.

Completely Connected Network
- Each node is connected to every other node, which requires N × (N – 1)/2 wires for N nodes.
- Communications can be performed in parallel between any pair of nodes.
- Both performance and cost are high; the cost increases rapidly with the number of nodes.

Crossbar Network
[Figure: nodes Node 1 ... Node n connected through a grid of switches.]
- A dynamic network: the interconnection topology can be modified by configuring the switches.
- It is completely connected: any node can be directly connected to any other.
- Fewer interconnections are needed than for the static completely connected network; however, a large number of switches is needed.
- A large number of communications can be performed in parallel (even though one node can send or receive only one data item at a time).

Mesh Network
[Figure: an n×n mesh; with wrap-around connections it becomes a torus.]
- Cheaper than completely connected networks, while giving relatively good performance.
- To transmit data between two nodes, routing through intermediate nodes is needed (maximum 2×(n – 1) intermediate nodes for an n×n mesh).
- It is possible to provide wrap-around connections: a torus.
- Three-dimensional meshes have also been implemented.

Hypercube Network
[Figure: 2-D, 3-D, 4-D and 5-D hypercubes.]
- 2^n nodes are arranged in an n-dimensional cube.
- Each node is connected to n neighbours.
- To transmit data between two nodes, routing through intermediate nodes is needed (maximum n intermediate nodes); a small routing sketch is given after the summary below.

Summary
- The growing need for high performance cannot always be satisfied by traditional computers.
- In a parallel computer, multiple CPUs run concurrently in order to solve a given problem.
- Parallel programs have to be developed in order to make efficient use of a parallel computer.
- Computers can be classified based on the nature of the instruction flow and the data flow on which the instructions operate.
- Another key component of a parallel architecture is the interconnection network.
- The performance of a parallel computer depends not only on the number of available processors but also on the characteristics of the executed programs.
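As referenced on the hypercube slide, the following is a minimal sketch, not part of the lecture, of dimension-ordered routing in an n-dimensional hypercube: nodes are numbered 0 to 2^n – 1, neighbours differ in exactly one address bit, and each hop corrects one differing bit. The function name and the example source and destination nodes are illustrative.

    /* A minimal sketch (not part of the lecture): dimension-ordered routing
     * in an n-dimensional hypercube. Nodes are numbered 0 .. 2^n - 1 and
     * neighbours differ in exactly one address bit. */
    #include <stdio.h>

    /* Print the path from node src to node dst in an n-cube by correcting
     * one differing address bit per hop (lowest dimension first). */
    static void route(unsigned src, unsigned dst, int n) {
        unsigned cur = src;
        printf("%u", cur);
        for (int d = 0; d < n; d++) {
            unsigned bit = 1u << d;
            if ((cur ^ dst) & bit) {      /* addresses differ in dimension d */
                cur ^= bit;               /* move to the neighbour along d   */
                printf(" -> %u", cur);
            }
        }
        printf("\n");
    }

    int main(void) {
        route(0, 13, 4);   /* prints 0 -> 1 -> 5 -> 13 in a 4-D hypercube */
        return 0;
    }

The number of hops equals the number of address bits in which source and destination differ, so a message never needs more than n hops.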