DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING
PARALLEL PROGRAMMING AND PERFORMANCE
PROGRAMMING AND PERFORMANCE ANALYSIS OF PARALLEL COMPUTERS

DHPC Programming and Performance

Programming Models for Parallel Computing

First look at an example of parallelizing a real-world task, taken from Solving Problems on Concurrent Processors, Fox et al.

Hadrian’s Wall was built by the ancient Romans to keep the marauding Scottish barbarians out of Roman England. It was originally 120 km long and 2 meters high.

How would you build such a huge structure in the shortest possible time?

Clearly the sequential approach (a single bricklayer building the entire wall) would be too slow.

Can have modest task parallelism by having different specialist workers concurrently:
• making the bricks
• delivering the bricks
• laying the bricks

But unless we can replicate these workers, this gives us at most a threefold speedup of the process.

[KAH/PDC] DHPC CS7933, Performance, Jul-Oct 2000, 2

Data Parallelism

To really speed up completion of the task, need many workers laying the bricks concurrently. In general there are two ways to do this:
• pipelining (vectorization)
• replication (parallelization)

Concurrent execution of a task requires assigning different sections of the problem (processes and data) to different processors.

We will concentrate on data parallelism, where the processors work concurrently on their own section of the whole data set, or domain.

Splitting up the domain between processors is known as domain decomposition. Each processor has its own sub-domain, or grain, to work on.

Parallelism may be described as fine-grained (lots of very small domains) or coarse-grained (a smaller number of larger domains).

Deciding how the domain decomposition is done is a key issue in implementing efficient parallel processing.
For the most efficient (least time) execution of the task, need to:
• minimize communication of data between processors
• distribute workload equally among the processors (known as load balancing)

Vectorization (Pipelining)

One possible approach is to allocate a different bricklayer to each row of the wall, i.e. a horizontal decomposition of the problem domain.

This is a pipelined approach: each bricklayer has to wait until the row underneath them has been started, so there is some inherent inefficiency.

Once all rows have been started (the pipeline is full), all the bricklayers (processors) are working efficiently at the same time, until the end of the task, when there is some overhead (idle workers) while the upper levels are completed (the pipeline is flushed).

Parallelization (Replication)

Another approach is to do a vertical decomposition of the problem domain, so each bricklayer gets a vertical section of the wall to complete.

In this case, the workers must communicate and synchronize their actions at the edges where the sub-domains meet. In general this communication and synchronization will incur some overhead, so there is some inefficiency.

However, each worker has an inner section of wall within their sub-domain that is completely independent of the others, which they can build just as efficiently as if there were no other workers.

As long as the time taken to build this inner section is much longer than the time taken up by the communication and synchronization overhead, the parallelization will be efficient and give good speedup over using a single worker.

Parallel I/O

For large tasks with lots of data, need efficient means to pass the appropriate data to each processor.
In building a wall, need to keep each bricklayer supplied with bricks.

The simple approach has a single host processor connected to the outside network and handling all I/O. The host passes data to the other processors through the internal communications network of the machine.

The host processor is a sequential bottleneck for I/O. Bandwidth is limited to a single network connection.

A better approach is to do I/O in parallel, so each node (or group of nodes) has a direct I/O channel to a disk array.

Domain Decompositions

Some standard domain decompositions of a regular 2D grid (array) include:

BLOCK: contiguous chunks of rows or columns of data on each processor.

BLOCK-BLOCK: block decomposition in both dimensions.

CYCLIC: data is assigned to processors like cards dealt to poker players, so neighbouring points are on different processors. This can be good for load balancing applications with varying workloads that have certain types of communication, e.g. very little, or a lot (global sums or all-to-all), or strided.

BLOCK-CYCLIC: a block decomposition in one dimension, cyclic in the other.

SCATTERED: points are scattered randomly across processors. This can be good for load balancing applications with little (or lots of) communication. The human brain seems to work this way: neighbouring sections may control widely separated parts of the body.

Domain Decompositions Illustrated

[Figure: four panels labelled BLOCK, BLOCK-BLOCK, CYCLIC and BLOCK-CYCLIC.] The squares represent the 2D data array; the colours represent the processors where the data elements are stored.

Static Load Balancing

For maximum efficiency, domain decomposition should give equal work to each processor. In building the wall, can just give each bricklayer an equal length segment.
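
The equal-length split just described is a one-dimensional BLOCK decomposition, and dealing elements out in turn gives the CYCLIC one. The following is a minimal Python sketch of both index mappings; the function names `block_bounds` and `cyclic_owner` are hypothetical, chosen for illustration, and the sketch assumes all workers are equally fast.

```python
def block_bounds(n, p, rank):
    """Half-open range [lo, hi) owned by `rank` when n items are split
    into p contiguous blocks whose sizes differ by at most one."""
    base, extra = divmod(n, p)
    # The first `extra` ranks each take one additional item.
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi


def cyclic_owner(i, p):
    """Owner of item i when items are dealt out like cards to p players."""
    return i % p


if __name__ == "__main__":
    n, p = 10, 4  # 10 wall segments, 4 bricklayers (illustrative numbers)
    print([block_bounds(n, p, r) for r in range(p)])
    print([cyclic_owner(i, p) for i in range(n)])
```

With BLOCK each worker touches its neighbours only at the two ends of its range, while with CYCLIC every item has neighbours owned by other workers, which is why the choice depends on the communication pattern.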
But things can become much more complicated:
• What if some bricklayers are faster than others? (this is like an inhomogeneous cluster of different workstations)
• What if there are guard towers every few hundred meters, which require more work to construct? (in some applications, more work is required in certain parts of the domain)

If we know in advance
1. the relative speed of the processors, and
2. the relative amount of processing required for each part of the problem
then we can do a domain decomposition that takes this into account, so that different processors may have different sized domains, but the time to process them will be about the same.

This is static load balancing, and can be done at compile-time.

For some applications, maintaining load balance while simultaneously minimising communication can be a very difficult optimisation problem.

Irregular Domain Decomposition

[Figure: airflow over an aeroplane wing modelled on an irregular triangulated mesh.] The grid is finest in areas where there is the most change in the airflow (e.g. turbulent regions) and coarsest where the flow is more regular (laminar). The domain is distributed among processors, which are shown in different colours.

Dynamic Load Balancing

In some cases we do not know in advance one (or both) of:
• the effective performance of the processors: may be sharing the processors with other applications, so the load and available CPU may vary
• the amount of work required for each part of the domain: many applications are adaptive or dynamic, and the workload is only known at runtime (e.g.
dynamic irregular mesh for CFD, or varying convergence rates for PDE solvers in different sections of a regular grid)

In this case we need to dynamically change the domain decomposition by periodically repartitioning the data between processors.

This is dynamic load balancing, and it can involve substantial overheads in:
• Figuring out how to best repartition the data: may need to use a fast method that gives a good (but not optimal) domain decomposition, to reduce computation.
• Moving the data between processors: could restrict this to local (neighbouring processor) moves to reduce communication.

Usually repartition as infrequently as possible (e.g. every few iterations instead of every iteration). There is a tradeoff between performance improvement and repartitioning overhead.

Causes of Inefficiency in Parallel Programs

• Communication Overhead
In some cases communication can be overlapped with computation, i.e. needed data can be prefetched at the same time as useful computation is being done. But often processors must wait for data, causing inefficiency (like a cache miss on a sequential machine, but worse).
• Load Imbalance Overhead
Any load imbalance will cause some processors to be idle some of the time. In some problems the domain will change during the computation, requiring expensive dynamic load balancing.
• Algorithmic Overhead
The best algorithm to solve a problem in parallel is often slightly different (or in some cases very different) to the best sequential algorithm. Any excess cycles caused by the difference in algorithm (more operations or more iterations) give an algorithmic overhead. Note that to calculate speedup honestly, should not use the time for the parallel algorithm on 1 processor, but the time for the best sequential algorithm.
• Sequential Overhead
The program may have parts that are not parallelizable, so that each processor replicates the same calculation.

Parallel Speedup

If N workers or processors are used, ideally expect to be able to finish the task N times faster.

$$\text{Speedup} = \frac{\text{Time on 1 processor}}{\text{Time on } N \text{ processors}}$$

Speedup will (usually) be at most N on N processors. (Question: How can it be more?)

N.B. Usually better to plot speedup vs number of processors, rather than time taken (proportional to 1/speedup) vs number of processors, since it is much harder to judge deviation from a 1/x curve than from a straight line.

Any problem of fixed size will have a maximum speedup, beyond which adding more processors will not reduce (and will usually increase) the time taken.

[Figure: speedup vs number of processors, comparing linear (perfect) speedup with actual speedup.]

Superlinear Speedup

Answer to previous question:

It is possible to get speedups greater than N on N processors if the parallel algorithm is very efficient (no load imbalance or algorithmic overhead, and any communication is overlapped with computation), and if splitting the data among processors allows a greater proportion of the data to fit into cache memory on each processor, resulting in faster sequential performance of the program on each processor.

The same idea applies to programs that would be out-of-core on 1 processor (i.e. require costly paging to disk) but can have all data in core memory if they are spread over multiple processors.

Some parallel implementations of tree search applications (e.g. branch and bound optimisation algorithms) can also have superlinear speedup, by exploring multiple branches in parallel, allowing better pruning and thus finding the solution faster. However, the same effect can be achieved by running multiple processes (or threads) on a single processor.
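
The speedup definition from the last two slides can be applied directly to measured run times. The sketch below is illustrative only: the helper names `speedup` and `classify` are hypothetical, and the timings are made-up numbers, not measurements.

```python
def speedup(t1, tn):
    """Speedup = time on 1 processor / time on N processors."""
    if t1 <= 0 or tn <= 0:
        raise ValueError("run times must be positive")
    return t1 / tn


def classify(n, s):
    """Label a measured speedup s on n processors."""
    if s > n:
        return "superlinear"
    if s == n:
        return "linear"
    return "sublinear"


if __name__ == "__main__":
    # Hypothetical wall-clock times in seconds, keyed by processor count.
    times = {1: 100.0, 2: 52.0, 4: 24.0, 8: 16.0}
    for n in sorted(times):
        s = speedup(times[1], times[n])
        print(f"N={n}: speedup {s:.2f} ({classify(n, s)})")
```

In this invented data set the 4-processor run is superlinear (100/24 is greater than 4), which is the kind of result the cache argument above would explain, while the 8-processor run falls back below linear.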
Parallel Efficiency

Efficiency measures the actual speedup relative to the maximum speedup, given as a fraction or a percentage. It is often quoted instead of speedup.

$$\text{Efficiency} = \frac{\text{Speedup}}{N}$$

Efficiency will (usually) be at most 1 (or 100%). It is still a function of the number of processors N, but varies less strongly than speedup, since it is scaled by N.

Parallel programs are rarely 100% efficient: there is usually some overhead f from the parallel implementation, so that the time taken for the parallel program is

$$\text{Time on } N \text{ procs} = \text{Time on 1 proc} \times \frac{1 + f}{N}$$

$$\text{Speedup} = \frac{N}{1 + f}, \qquad \text{Efficiency} = \frac{1}{1 + f} \approx 1 - f \quad (\text{for small } f)$$

Alternative Definitions of Speedup and Efficiency

The previous definitions of speedup and efficiency assume that times are measured for a fixed problem size.

This is artificial: in practice, often use more processors to solve larger problems in the same time, not the same sized problem in less time.

Also, cannot use this definition to assess scalability of the program or algorithm to larger numbers of processors. Cannot expect to obtain good speedups as the amount of data per processor is reduced to zero, no matter how good the parallel algorithm might be.

Alternative definitions of speedup and efficiency use:
• Constant problem size (standard definition)
• Constant domain size on each processor
• Constant time for solution (adjust size accordingly)

When quoting speedups, be sure to say which of these you are measuring.

Vector Speedup and Efficiency

For vector machines, efficiency is largest when the vector size is largest, so the startup time to fill the pipeline is amortized. For large vectors (i.e.
lots of parallelism), may get close to maximum performance, while for very small vectors, may get little or no speedup.

Performance of vector and parallel machines for a vector of length N is given approximately by the empirical formula

$$r_N = \frac{r_\infty}{n_{1/2}/N + 1} \ \text{flops}$$

where
• $r_\infty$: maximum performance of the machine
• $n_{1/2}$: vector length that gives half the maximum performance
are two parameters that quantify the performance of a given vector or parallel machine for a particular program.

For good performance, N should be much larger than $n_{1/2}$.

Amdahl’s Law

Suppose there is some fraction f of the calculation that cannot be vectorized or parallelized. Then the time $T_N$ on N processors will be

$$T_N = \frac{(1 - f)\,T_1}{N} + f\,T_1$$

$$\text{Speedup} = \frac{T_1}{T_N} = \frac{N}{(1 - f) + N f} \le \frac{1}{f}$$

So if 5% of the computational time in a program is spent doing work that is not parallelizable (f = 0.05), the maximum speedup of the program is 1/0.05 = 20.

This bound on the speedup is known as Amdahl’s Law. It was originally believed that this would mean that massive parallelism would not be viable, since most applications were thought to have some sequential component that would limit the speedup.

However, there are many applications which can be fully parallelised, or have a sequential component that is very small compared to the computation that can be done in parallel, allowing large speedups.

Minimizing Communication Overhead

A crucial problem in parallel programming is to minimize the ratio of communication to computation; otherwise cannot have efficient programs. Want to:
• Minimize the amount of communication
• Overlap communications with computation where possible
• Reduce latency by sending a few large messages, rather than a lot of small messages.

At the hardware level, can reduce latency by using fast (but expensive) communications.
At the systems software level, can reduce latency by using lightweight message passing protocols, such as Active Messages.

But most of the real work needs to be done by the programmer and/or the compiler.
• Keep data local
• Calculate using local data while getting remote data
• Buffer communications to send fewer messages (e.g. send all edge data at once rather than one point at a time)

Benchmarking

The peak performance of a computer is the maximum number of operations it can perform per second. This is usually:

Flops = number of pipes (flops/cycle) * clock speed (cycles/sec) * number of processors

Many applications do not achieve anywhere near the peak performance, particularly on high-performance computers.

Many standard benchmarks have been developed to determine actual performance of a computer over a range of applications, e.g. SPECmarks, transactions per second.

Benchmarks aimed at parallel and vector HPC machines include:
• LINPACK matrix solver (from SCALAPACK parallel linear algebra library), used to rank Top 500 list
• NAS Benchmarks (kernels for some NASA fluid dynamics applications)
• PARKBENCH Benchmarks (PARallel Kernels and BENCHmarks)
• Others at BenchWeb (www.netlib.org/benchweb/)

The best benchmark is of course to run your applications (or the compute-intensive application kernel) on the machine.

Using Benchmarks

Care is needed in obtaining and analyzing results from benchmarks.
• Do runs using different data sizes and numbers of processors.
• May need to repeat each run several times and either average the results or take the best result.
• Analyze performance, speedup, efficiency, scalability as functions of number of procs and different data sizes.
• Should have a dedicated machine: someone else running a job on one node of a cluster could reduce performance by a factor of 2!
• I/O and initialization time may affect results: there may be a large difference between CPU time and total wall clock time.
• Percentage of time spent on I/O and initialization may be very different for benchmark (short run time) vs real application (long run time), so should they be excluded? Or is I/O an important part of the application?
• Need to be careful in comparing results on different machines: may need different optimization flags for compilers, different data decomposition, or even to modify code to perform better on a different architecture.
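
The advice above (repeat runs, take the best or the average, then tabulate speedup and efficiency against processor count) can be sketched as a small harness. This is an illustrative sketch only: `best_of` and `scaling_table` are hypothetical helper names, the workload is a stand-in, and the per-processor times in the example are invented rather than measured.

```python
import time


def best_of(fn, repeats=5):
    """Run fn() `repeats` times; return (best, mean) wall-clock seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return min(samples), sum(samples) / len(samples)


def scaling_table(times_by_procs):
    """Given {processor count: wall-clock time}, including an entry for 1,
    return rows of (N, speedup, efficiency) sorted by N."""
    t1 = times_by_procs[1]
    rows = []
    for n in sorted(times_by_procs):
        s = t1 / times_by_procs[n]   # speedup relative to 1 processor
        rows.append((n, s, s / n))   # efficiency = speedup / N
    return rows


if __name__ == "__main__":
    # Stand-in kernel; a real benchmark would time the application kernel.
    best, mean = best_of(lambda: sum(i * i for i in range(100_000)))
    print(f"best {best:.6f} s, mean {mean:.6f} s")

    # Invented times for one fixed problem size on 1, 2, 4 and 8 processors.
    for n, s, e in scaling_table({1: 80.0, 2: 42.0, 4: 25.0, 8: 16.0}):
        print(f"N={n}: speedup {s:.2f}, efficiency {e:.0%}")
```

The harness measures wall-clock time, so it includes any I/O and initialization inside `fn`; whether that is what should be reported is exactly the question raised in the bullets above.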