CPS522 Parallel Architectures
Review for the First Exam

The exam will be in-class and timed for the last 75 minutes of the class. You may bring two pages of notes if you so desire. I will ask questions pertaining to concepts, such as "Why do we concern ourselves with the diameter of a hypercube?" I may also ask either fill-in-the-word or true-false questions; these types of questions allow me to cover a lot of material. I will ask some questions from your homework and labs. This exam will cover my notes, the first four chapters of the text, information about running an MPICH-based cluster, my notes from Tanenbaum, and the concepts contained in the labs. I also expect that you know something about operating systems, threads, and SMP. The way to study for this is to ask yourself what parts of the material are important and understand them.

Chapter 1

1.1 The demand for computational speed
  Why do we need to run things faster?
  Parallel programming on a parallel computer and the increase in speed.

1.2 Potential for increased computational speed
  Computation/communication ratio, speedup factor, overhead.
  Maximum speedup: Amdahl's law; scalability and Gustafson's law.

1.2.3 Message-passing computations
  The computation/communication ratio as a metric.

1.3.1 Shared memory multiprocessor system
  The typical example of this is SMP. Know how this works: how does an SMP machine handle threads? You can find much more about this in an operating systems text.

1.3.2 Message-Passing Multicomputer
  Static network, nodes, links.
  Network measures such as communication latency, startup time, diameter, bisection width.
  Attributes of: completely connected networks; line and ring (pipeline); mesh; tree; hypercube.
  Embedding as a technique: this is basically what we do when we map an algorithm onto a multicomputer. Dilation is used to indicate the quality of the embedding.
  Communication methods
    Store-and-forward, virtual cut-through.
    Wormhole routing, flits.
    Latency.
    Deadlock.
    The role of the host.

1.3.3 Distributed shared memory
  This is a little off the path.

1.3.4 MIMD and SIMD Classifications
  You should definitely have this section down cold. We will also refer to SPMD.

1.4 Cluster Computing

1.4.1 Interconnected Computers as a Computing Platform
  Different types of interconnections, Ethernets.
  Addressing.

1.4.2 Networked computers as a multicomputer platform
  Existing networks; dedicated clusters; problems.

Chapter 2 Message-Passing Computing

2.1 Basics of Message-Passing Programming

2.1.1 Programming Options
  Programming a message-passing multicomputer can be achieved by:
    designing a special parallel programming language (occam on transputers);
    extending the syntax/reserved words of an existing sequential high-level language to handle message passing (CC++ and Fortran M);
    using an existing sequential high-level language and providing a library of external procedures for message passing.
  We need a method of creating separate processes for execution on different computers and a method of sending and receiving messages.

2.1.2 Process Creation
  The concept of a process.
  Static process creation: all the processes are specified before execution, and the system executes a fixed number of processes.
    Master-slave (farmer-worker) format.
    Single program, multiple data (SPMD) model: the different processes are merged into one program. Within the program are control statements that customize the code, selecting different parts for each process.
  The program is compiled and each processor loads a copy; this is the MPI approach.
  Dynamic process creation: processes can be created and their execution initiated during the execution of other processes, done by library/system calls. The code for the processes has to be written and compiled before execution of any process.
    Different program on each machine: MPMD.
    Usually uses the master/slave approach: spawn(name_of_process).
    PVM uses dynamic process creation; it can also be SPMD.

2.1.3 Message-Passing Routines
  Basic send and receive routines:
    send(&x, destination_id);
    receive(&y, source_id);
  Synchronous message passing: the message is not sent until the receiver is ready to receive (a rendezvous). It does not need a buffer, but usually requires some sort of request-to-send and acknowledgment.
  Blocking and nonblocking message passing: needs a buffer so that the routines can return after their local actions complete.
  Message selection: message tags are used so that messages can be given an id and a source or destination, e.g., receive message 5 from node 3.
  Broadcast, gather, and scatter:
    Broadcast sends the same message to all the processes.
    Multicast sends the same message to a defined group of processes. See Figure 2.6.
    Scatter sends each element of an array of data in the root to a separate process.
    Gather describes having one process collect individual values from a set of processes; it is usually the opposite of scatter.
    Reduce operations combine a gather with an arithmetic/logical operation.

2.2 Using Workstation Clusters

2.2.1 Software tools
  We have the MPICH implementation of MPI in the Sun lab.

2.2.2 MPI
  This section and the MPI tutorial should do it.
  Process creation and execution using the SPMD computational model.
  Message-passing routines: communicators; point-to-point completion; blocking routines; nonblocking routines; send communication modes.
  Collective communication: broadcast and scatter routines; barrier.
  Sample MPI program.

2.2.3 Pseudocode Constructs
  The addition of PVM or MPI constructs to the code detracts from readability, so we use C-type pseudocode to describe the algorithms. We really only need send and receive. The process identification is placed last in the argument list, as in MPI. To send a message consisting of the integer x and the float y from the process called master to the process called slave, assigning them to a and b, we write in the master process

    send(&x, &y, Pslave);

  and in the slave process

    recv(&a, &b, Pmaster);

  Note that this is the format for locally blocking communication. We could use

    ssend(&data1, Pdestination);

  for a synchronous send.

2.3 Evaluating Parallel Programs
  For parallel algorithms we need to estimate communication overhead in addition to determining the number of computational steps.

2.3.1 Parallel Execution Time
  parallel execution time = computation time + communication time
  We can estimate the computation time the same way we estimate time for a sequential algorithm. Communication time depends on the size of the message and some startup time for a workstation, called latency:
  communication time = startup time + (number of data words * time for one data word)
  Latency hiding is having the processor do useful work while waiting for the communication to complete.

2.3.2 Time Complexity
  Have you had an algorithms course? The O notation can be defined as follows: f(x) = O(g(x)) if and only if there exist positive constants c and x0 such that 0 <= f(x) <= c*g(x) for all x >= x0, where f(x) and g(x) are functions of x. For f(x) = 4x**2 + 2x + 12, we could use c = 6 to get f(x) = O(x**2), since 0 <= 4x**2 + 2x + 12 <= 6x**2 for x >= 3. (p. 67)

  Time complexity of a parallel algorithm
  The time complexity of a parallel algorithm is the sum of the time complexities of the computation and the communication. For adding n numbers on 2 processors and adding the two results on one of them:
    time for computation = n/2 + 1
    communication requires time to send out half the numbers and to retrieve the partial result:
    time for communication = (startup time + n/2 * time for one data item) + (startup time + time for one data item)
  Communication is costly: if computation and communication have the same time complexity, increasing the number of data items is unlikely to improve performance. If the computational complexity is greater than the communication complexity, then increasing the problem size lets computation dominate communication.
  A cost-optimal algorithm is one in which the cost to solve a problem is proportional to the execution time on a single processor:
    (parallel time complexity) x (number of processors) = sequential time complexity

2.3.3 Comments on Asymptotic Analysis
  In general we worry about limited-size data sets on a small number of processors, where startup time can dominate. Limits as the problem size approaches infinity are usually not applicable.

2.3.4 Time Complexity of Broadcast/Gather
  Almost all problems require data to be broadcast to processes and data to be gathered from processes.
  Hypercube: a message can be broadcast to all nodes in an n-node hypercube in log n steps. This means the time complexity for broadcast/gather is O(log n), which is optimal because the diameter of a hypercube is log n.
  Broadcast on a mesh network: without wraparound, send from the upper left, across the top row, and down each column. This requires 2(n-1) steps, or O(n), on an n x n mesh.
  Broadcast on a workstation cluster: broadcast on a single Ethernet connection can be done with a single message that is read by all destinations on the network simultaneously. We can also use a 1-to-N fan-out broadcast via daemons, as in the PVM broadcast routine; this will usually result in a tree structure.
  The fan-out broadcast is O(M + N/M) for one data item broadcast to N destinations where there are M daemons.

2.4 Debugging and Evaluating Parallel Programs
  Not really interested in 2.4.1 and 2.4.2.

2.4.1 Low-level Debugging

2.4.2 Visualization Tools

2.4.3 Debugging Strategies
  1. If possible, run the program as a single process and debug it as a normal sequential program.
  2. Execute the program using two to four multitasked processes on a single computer. Now examine actions such as checking that messages are indeed being sent to the correct places. It is very common to make mistakes with message tags and have messages sent to the wrong places.
  3. Execute the program using the same two to four processes, but now across several computers. This step helps find problems caused by network delays related to synchronization and timing.

2.4.4 Evaluating Programs Empirically
  Measuring execution time - we will have to do this.
  Measuring communication by the ping-pong method.
  Profiling.

2.4.5 Comments on Optimizing the Parallel Code

Chapter 3
  You should have a basic understanding of how the two examples we discussed in class work:
    geometric transformations - lots of homework;
    the Mandelbrot set - section 3.2.2 and the code.

Chapter 4
  Review the homework questions.