TDTS 08 – Lecture 7: Parallel Processing (Zebo Peng, IDA, LiTH, 2015-11-19)
- Introduction and motivation
- Architecture classification
- Performance evaluation
- Interconnection network

Performance Improvement
- Reduction of instruction execution time:
  • increased clock frequency through fast circuit technology;
  • simplified instructions (RISC).
- Parallelism within the processor:
  • pipelining;
  • parallel execution of instructions (ILP): superscalar processors, VLIW architectures.
- Parallel processing: a huge degree of parallelism is possible.

Why Parallel Processing?
- Traditional computers cannot meet the high-performance requirements of many applications:
  • simulation of large, complex systems in physics, economics, biology, ...
  • distributed databases with search functions;
  • computer-aided design;
  • visualization and multimedia;
  • multi-tasking and multi-user systems (e.g., supercomputers).
- Such applications are characterized by a very large amount of numerical computation and/or a very large volume of input data.
- To deliver sufficient performance for such applications, we can put many processors in a single computer.

Why Parallel Processing? (Cont'd)
- Technology development: hardware and silicon technology make it possible to build machines with a huge degree of parallelism cost-effectively. It started with mainframes and supercomputers; now even file servers and regular PCs are often implemented as parallel machines.
- Parallel processing also has the potential to be more reliable: if one processor fails, the system continues to work, at lower performance.
- Parallel processing also provides a platform for building scalable systems with different levels of performance and capability.

Parallel Computer
- A parallel computer is an architecture in which many CPUs run in parallel to implement a given application or set of applications.
- Such computers can be organized in different ways, depending on several key parameters:
  • number and complexity of the individual CPUs;
  • availability of common (shared) memory;
  • interconnection technology and topology;
  • performance of the interconnection network;
  • I/O devices; etc.

Parallel Program
- To fully utilize a parallel computer, one must decompose a problem into sub-problems that can be solved in parallel.
- The results of the sub-problems may have to be combined to obtain the final result of the main problem.
- Due to data dependencies among the sub-problems, it is not easy to decompose some problems so as to obtain a large degree of parallelism.
- Due to data dependencies, the processors may also have to communicate with each other.
- The time taken for communication is usually very high compared with the processing time, so the communication mechanism must be very well designed in order to get good performance.

Parallel Program Example (1)
- Matrix computation C = A + B for N×M matrices: every element is computed independently as C[i,j] = A[i,j] + B[i,j], for i = 1..N and j = 1..M.
- Vector computation with vectors of m elements (each iteration adds one row of m elements and is independent of the others):

  for i := 1 to n do
    C[i, 1:m] := A[i, 1:m] + B[i, 1:m];
  end for;
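Since no iteration of the loop above depends on any other, the rows can be added in parallel. Below is a minimal C sketch of the same computation, assuming a shared-memory machine and using an OpenMP parallel-for directive to distribute the rows over the processors; the sizes N and M, the initial values, and the suggested compile command are illustrative assumptions, not part of the lecture.

    /* A minimal sketch (not part of the lecture): the row-wise vector
     * addition above, written in C with OpenMP for a shared-memory machine.
     * N, M and the initial matrix values are illustrative.
     * Compile e.g. with: gcc -fopenmp vec_add.c
     */
    #include <stdio.h>

    #define N 4   /* number of rows (vectors)      */
    #define M 3   /* number of elements per vector */

    int main(void) {
        double A[N][M], B[N][M], C[N][M];

        /* Fill A and B with some example values. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++) {
                A[i][j] = i + j;
                B[i][j] = i * j;
            }

        /* Each outer iteration handles one independent row, so the rows
         * can be given to different processors; one directive is enough. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++)
                C[i][j] = A[i][j] + B[i][j];

        printf("C[1][2] = %.1f\n", C[1][2]);
        return 0;
    }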
Parallel Program Example (2)
- A vector dot product is common in filtering:

  Y = Σ a(i) · x(i), summed over i = 1 to N

- Parallel sorting:
  [Figure: an UNSORTED list is split into four parts, Unsorted-1 to Unsorted-4; four Sorting units process them in parallel (the parallel part) to produce Sorted-1 to Sorted-4, which are then combined by a sequential Merge step (the sequential part) into the final SORTED list.]

Lecture 7: Parallel Processing
- Introduction and motivation
- Architecture classification
- Performance evaluation
- Interconnection network

Flynn's Classification of Architectures
- Based on the nature of the instruction flow executed by the computer and the data flow on which the instructions operate.
- The multiplicity of instruction streams and data streams gives us four different classes:
  • single instruction, single data stream (SISD);
  • single instruction, multiple data stream (SIMD);
  • multiple instruction, single data stream (MISD);
  • multiple instruction, multiple data stream (MIMD).

Single Instruction, Single Data (SISD)
- The regular computers we have discussed up to now:
  • a single processor;
  • a single instruction stream; and
  • data stored in a single memory.
  [Figure: a CPU, consisting of a control unit and a processing unit, connected to the memory system.]

Single Instruction, Multiple Data (SIMD)
- A single machine-instruction stream.
- Simultaneous execution on different sets of data.
- A large number of processing elements.
- Lockstep synchronization among the processing elements.
- The processing elements can:
  • have their own private data memories; or
  • share a common memory via an interconnection network.
- Array and vector processors are the most common examples of SIMD machines.

SIMD with Shared Memory
[Figure: one control unit broadcasts a single instruction stream (IS) to Processing Unit_1 ... Processing Unit_n; each processing unit exchanges its own data stream (DS1 ... DSn) with the shared memory via an interconnection network.]

Multiple Instruction, Single Data (MISD)
- A single sequence of data is transmitted to a set of processors.
- The processors execute different instruction sequences.
- MISD machines have not been commercially implemented so far.
  [Figure: a single data stream fed to processing elements PE1 ... PEn.]

Multiple Instruction, Multiple Data (MIMD)
- It consists of a set of processors that simultaneously execute different instruction sequences on different sets of data.
- The MIMD class can be further divided:
  • shared memory (tightly coupled): symmetric multiprocessors (SMP) and non-uniform memory access (NUMA) machines;
  • distributed memory (loosely coupled) = clusters.
- A small code sketch of an MIMD-style program is given after the discussion slide below.

MIMD with Shared Memory
[Figure: CPU_1 ... CPU_n, each with its own control unit, processing unit and local memory (LM1 ... LMn), execute independent instruction streams (IS1 ... ISn) and exchange their data streams (DS1 ... DSn) with a shared memory via an interconnection network.]

Discussion
- Very fast development in parallel processing and related areas has blurred concept boundaries, causing a lot of terminological confusion: concurrent computing, multiprocessing, distributed computing, etc.
- There is no strict delimiter for contributors to the area of parallel processing; it includes computer architecture (CA), operating systems (OS), high-level language (HLL) design, compilation, databases, and computer networks.
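To make the MIMD (shared-memory) style concrete, here is a minimal sketch, not part of the lecture, of the parallel-sorting scheme from the program example earlier: each chunk of the array is sorted in parallel, and the sorted chunks are then merged sequentially. The chunk count, chunk size and input values are illustrative assumptions, and OpenMP is assumed for the parallel part; without it the program simply runs sequentially.

    /* A minimal sketch (not part of the lecture): parallel sorting on a
     * shared-memory machine. The chunks are sorted in parallel, then
     * merged sequentially. Sizes and input data are illustrative. */
    #include <stdio.h>
    #include <stdlib.h>

    #define CHUNKS 4        /* number of processors / sub-problems */
    #define CHUNK_SIZE 8
    #define TOTAL (CHUNKS * CHUNK_SIZE)

    static int cmp_int(const void *a, const void *b) {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    int main(void) {
        int data[TOTAL], sorted[TOTAL];
        for (int i = 0; i < TOTAL; i++)
            data[i] = rand() % 100;                 /* unsorted input */

        /* Parallel part: the chunks are independent, so each one can be
         * sorted by a different processor. */
        #pragma omp parallel for
        for (int c = 0; c < CHUNKS; c++)
            qsort(data + c * CHUNK_SIZE, CHUNK_SIZE, sizeof(int), cmp_int);

        /* Sequential part: merge the sorted chunks by repeatedly taking
         * the smallest remaining head element. */
        int pos[CHUNKS] = {0};
        for (int k = 0; k < TOTAL; k++) {
            int best = -1;
            for (int c = 0; c < CHUNKS; c++) {
                if (pos[c] == CHUNK_SIZE)
                    continue;
                if (best < 0 ||
                    data[c * CHUNK_SIZE + pos[c]] < data[best * CHUNK_SIZE + pos[best]])
                    best = c;
            }
            sorted[k] = data[best * CHUNK_SIZE + pos[best]++];
        }

        for (int i = 0; i < TOTAL; i++)
            printf("%d ", sorted[i]);
        printf("\n");
        return 0;
    }

The split mirrors the figure's parallel part (the four independent sorts) and sequential part (the merge); as the next section shows, it is the sequential part that ultimately limits the achievable speedup.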
Lecture 7: Parallel Processing
- Introduction and motivation
- Architecture classification
- Performance evaluation
- Interconnection network

Performance Metrics (1)
- How fast does a parallel computer run at its maximal potential?
- Peak rate: the maximal computation rate that can theoretically be achieved when all processors are fully utilized.
  • Example: the fastest supercomputer in the world (as of this lecture) has a peak rate of 55 PFlop/s.
- The peak rate is of no practical significance for an individual user; it is mostly used by vendors for marketing their computers.

Performance Metrics (2)
- How fast an execution can we expect from a parallel computer for a given application or set of applications?
  • Note the increase of multi-tasking and multi-threaded computing.
- Speedup measures the gain we get by using a parallel computer, instead of a sequential one, to run a given application:

  S = Ts / Tp

  where Ts is the execution time on the sequential computer and Tp the execution time on the parallel computer.

Performance Metrics (3)
- Efficiency relates the speedup to the number of processors used; it therefore measures how efficiently the processors are used:

  E = S / P

  where S is the speedup and P the number of processors.
- In the ideal situation, in theory, S = P, which means E = 1. In practice the ideal efficiency of 1 cannot be achieved.

Speed-Up Limitation
- Let f be the fraction of the computation that, according to the algorithm, has to be executed sequentially (0 ≤ f ≤ 1), and P the number of processors. Then

  Tp = f × Ts + (1 – f) × Ts / P

  S = Ts / Tp = 1 / (f + (1 – f)/P)

  [Figure: speedup S (1 to 10) as a function of f (0 to 1.0) for a parallel computer with 10 processing elements.]

Speed-Up vs. % of Parallel Part (1 – f)
[Figure: speedup versus number of cores (1 to 1024) for parallel fractions (1 – f) of 0.5, 0.75, 0.95, 0.99, and 1.]

Amdahl's Law
- Even a small fraction of sequential computation imposes a limit on the speedup: a speedup higher than 1/f cannot be achieved, regardless of the number of processors, since

  S = 1 / (f + (1 – f)/P) ≤ 1/f

- If there is 20% sequential computation, the speedup will be at most 5, even if you have 1 million processors.
- To exploit a large number of processors efficiently, f must be small (the algorithm has to be highly parallel), since

  E = S / P = 1 / (f × (P – 1) + 1)

Other Factors that Limit Speedup
- Besides the intrinsic sequentiality of parts of an algorithm, other factors also limit the achievable speedup:
  • communication cost;
  • load balancing of the processors;
  • costs of creating and scheduling processes; and
  • I/O operations (mostly sequential in nature).
- There are many algorithms with a high degree of parallelism: their value of f is very small and can be ignored, and they are well suited for massively parallel systems. For such algorithms, the other limiting factors, such as the cost of communication, become critical.

Impact of Communication
- Consider a highly parallel computation, where f is small and can be neglected.
- Let fc be the fractional communication overhead of a processor, with Tcalc the time a processor spends on computation and Tcomm the time it is idle because of communication:

  fc = Tcomm / Tcalc

  Tp = (Ts / P) × (1 + fc)

  S = Ts / Tp = P / (1 + fc)

  E = 1 / (1 + fc) ≈ 1 – fc   (if fc is small)

- With algorithms having a high degree of parallelism, massively parallel computers consisting of a large number of processors can be used efficiently if fc is small: the time a processor spends on communication has to be small compared to its computation time.
- Communication time is strongly affected by the interconnection network.
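As a small numeric illustration, not part of the lecture, the sketch below evaluates the two formulas above: Amdahl's bound with an assumed sequential fraction f = 0.20, and the communication-limited efficiency with assumed times of 100 time units of computation and 5 of communication per processor. The function names are illustrative.

    /* A minimal sketch (not part of the lecture): speedup limits from the
     * formulas above. amdahl_speedup() uses the sequential fraction f;
     * comm_efficiency() uses the fractional communication overhead fc.
     * The example values are illustrative. */
    #include <stdio.h>

    static double amdahl_speedup(double f, double P) {
        return 1.0 / (f + (1.0 - f) / P);       /* S = 1 / (f + (1 - f)/P) */
    }

    static double comm_efficiency(double t_calc, double t_comm) {
        double fc = t_comm / t_calc;            /* fc = Tcomm / Tcalc */
        return 1.0 / (1.0 + fc);                /* E  = 1 / (1 + fc)  */
    }

    int main(void) {
        /* Amdahl's law: 20% sequential code caps the speedup at 1/f = 5. */
        for (int P = 1; P <= 1024; P *= 4)
            printf("f = 0.20, P = %4d: S = %.2f\n", P, amdahl_speedup(0.20, P));

        /* Communication overhead: 5 units of communication per 100 units
         * of computation gives fc = 0.05 and E = 1/1.05, close to 1 - fc. */
        printf("E = %.3f\n", comm_efficiency(100.0, 5.0));
        return 0;
    }

With f = 0.20 the speedup approaches, but never exceeds, 1/f = 5 as P grows, matching the statement on the Amdahl's Law slide; with fc = 0.05 the efficiency is 1/1.05 ≈ 0.95 ≈ 1 – fc.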
Lecture 7: Parallel Processing
- Introduction and motivation
- Architecture classification
- Performance evaluation
- Interconnection network

Interconnection Network
- The interconnection network (IN) is a key component of a parallel architecture. It has a decisive influence on:
  • the overall performance; and
  • the total cost of the architecture.
- The traffic in an IN consists of:
  • data transfers; and
  • transfers of commands and requests (control information).
- The key parameters of an IN are:
  • total bandwidth: bits transferred per second; and
  • implementation cost.

Single Bus
[Figure: nodes Node1 ... Noden attached to one shared bus.]
- Single-bus networks are simple, cheap and relatively flexible, and they provide a broadcast mechanism.
- Only one communication is allowed at a time; the bandwidth is shared by all nodes, so performance is relatively poor.
- To get good performance, the number of nodes has to be limited (to around 16–20). Multiple buses can be used instead, if needed.

Completely Connected Network
- Each node is connected to every other node, which requires N × (N – 1)/2 wires for N nodes.
- Communications can be performed in parallel between any pair of nodes.
- Both performance and cost are high; the cost increases rapidly with the number of nodes.

Crossbar Network
[Figure: nodes Node 1 ... Node n connected through a grid of switches.]
- A dynamic network: the interconnection topology can be modified by configuring the switches.
- It is completely connected: any node can be directly connected to any other.
- Fewer interconnections are needed than for the static completely connected network; however, a large number of switches is needed.
- A large number of communications can be performed in parallel (even though one node can send or receive only one data item at a time).

Mesh Network
[Figure: an n×n mesh; with wrap-around connections it becomes a torus.]
- Cheaper than completely connected networks, while giving relatively good performance.
- To transmit data between two nodes, routing through intermediate nodes is needed (maximum 2×(n – 1) intermediate nodes for an n×n mesh).
- It is possible to provide wrap-around connections: a torus.
- Three-dimensional meshes have also been implemented.

Hypercube Network
[Figure: 2-D, 3-D, 4-D and 5-D hypercubes.]
- 2^n nodes are arranged in an n-dimensional cube.
- Each node is connected to n neighbours.
- To transmit data between two nodes, routing through intermediate nodes is needed (maximum n intermediate nodes); a small routing sketch is given after the summary below.

Summary
- The growing need for high performance cannot always be satisfied by traditional computers.
- In a parallel computer, multiple CPUs run concurrently in order to solve a given problem.
- Parallel programs have to be developed in order to make efficient use of a parallel computer.
- Computers can be classified based on the nature of the instruction flow and the data flow on which the instructions operate.
- Another key component of a parallel architecture is the interconnection network.
- The performance of a parallel computer depends not only on the number of available processors but also on the characteristics of the executed programs.
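As referenced on the hypercube slide, the following is a minimal sketch, not part of the lecture, of dimension-ordered routing in an n-dimensional hypercube: nodes are numbered 0 to 2^n – 1, neighbours differ in exactly one address bit, and each hop corrects one differing bit. The function name and the example source and destination nodes are illustrative.

    /* A minimal sketch (not part of the lecture): dimension-ordered routing
     * in an n-dimensional hypercube. Nodes are numbered 0 .. 2^n - 1 and
     * neighbours differ in exactly one address bit. */
    #include <stdio.h>

    /* Print the path from node src to node dst in an n-cube by correcting
     * one differing address bit per hop (lowest dimension first). */
    static void route(unsigned src, unsigned dst, int n) {
        unsigned cur = src;
        printf("%u", cur);
        for (int d = 0; d < n; d++) {
            unsigned bit = 1u << d;
            if ((cur ^ dst) & bit) {      /* addresses differ in dimension d */
                cur ^= bit;               /* move to the neighbour along d   */
                printf(" -> %u", cur);
            }
        }
        printf("\n");
    }

    int main(void) {
        route(0, 13, 4);   /* prints 0 -> 1 -> 5 -> 13 in a 4-D hypercube */
        return 0;
    }

The number of hops equals the number of address bits in which source and destination differ, so a message never needs more than n hops.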