PARALLEL PROGRAMMING - 1
ALEXANDER CHEFRANOV

The Japanese project on fifth-generation parallel intelligent computing systems (late 1980s to early 1990s) did not succeed [1], but its ideas and achievements have not disappeared. At present, without any fanfare, work is under way in the USA on the Accelerated Strategic Computing Initiative (ASCI) program. Within its framework a computing system of 100 thousand Pentium-based processors with a performance of 100 TFlops ($10^{14}$ floating-point operations per second) is expected to be built by 2004, its main purpose being the development of nuclear weapons without actual testing; a system of 7264 processors with a performance of 1.8 TFlops already exists [2]. Alongside military applications, such technology will find broad use in weather forecasting, the synthesis of new medicines with required characteristics, the protection of banking systems, etc.

The TOP500 list (http://www.top500.org/list/2002/06/) gives the distribution of supercomputers across countries and organizations. The top ten entries of the TOP500 sublist for June 2002 (revised June 20th, 14:00 EST) are:

Rank | Manufacturer    | Computer                            | Rmax (GFlops) | Installation Site                             | Country | Year | Installation Type | Installation Area | Processors
1    | NEC             | Earth Simulator                     | 35860.00      | Earth Simulator Center                        | Japan   | 2002 | Research          |                   | 5120
2    | IBM             | ASCI White, SP Power3 375 MHz       | 7226.00       | Lawrence Livermore National Laboratory        | USA     | 2000 | Research          | Energy            | 8192
3    | Hewlett-Packard | AlphaServer SC ES45/1 GHz           | 4463.00       | Pittsburgh Supercomputing Center              | USA     | 2001 | Academic          |                   | 3016
4    | Hewlett-Packard | AlphaServer SC ES45/1 GHz           | 3980.00       | Commissariat a l'Energie Atomique (CEA)       | France  | 2001 | Research          |                   | 2560
5    | IBM             | SP Power3 375 MHz 16 way            | 3052.00       | NERSC/LBNL                                    | USA     | 2001 | Research          |                   | 3328
6    | Hewlett-Packard | AlphaServer SC ES45/1 GHz           | 2916.00       | Los Alamos National Laboratory                | USA     | 2002 | Research          |                   | 2048
7    | Intel           | ASCI Red                            | 2379.00       | Sandia National Laboratories                  | USA     | 1999 | Research          |                   | 9632
8    | IBM             | pSeries 690 Turbo 1.3GHz            | 2310.00       | Oak Ridge National Laboratory                 | USA     | 2002 | Research          |                   | 864
9    | IBM             | ASCI Blue-Pacific SST, IBM SP 604e  | 2144.00       | Lawrence Livermore National Laboratory        | USA     | 1999 | Research          | Energy            | 5808
10   | IBM             | pSeries 690 Turbo 1.3GHz            | 2002.00       | IBM/US Army Research Laboratory (ARL)         | USA     | 2002 | Vendor            |                   | 768

It turns out that a large part of these machines is used by governmental institutions and banks rather than by research organizations. Thus, the only supercomputer listed for Russia in 2000, which held the 156th place in the worldwide rating, was located in the National Reserve Bank of Russia; in June 2002 the Russian-made MVS-1000 supercomputer held the 64th place, and it was installed in a scientific organization.

Multiprocessor computing systems and parallel programming reach us not only in the form of supercomputers, but also through widespread PCs based on Intel processors (multiprocessor systems, parallel processes). This determines the practical value and relevance of the "Parallel programming" course in the training of specialists in "Software engineering" and adjacent fields.

This textbook considers the organization of parallel programs, i.e. parallel programming. To make the approaches to parallel programming easier to follow, the introduction presents the classical von Neumann model of program control and gives a classification of computing systems based on instruction streams and data streams. Section 1 gives several typical examples of computing systems, to which we shall refer when interpreting the languages and methods of parallel programming. Section 2 discusses known parallel versions of the Fortran language for systems of different classes. Section 3 gives information on programming in Occam-2 and parallel ANSI C for multi-transputer systems. Section 4 presents the ANSI C language for the P-Cube multi-transputer system by Parsytec.
Section 5 discusses the parallel programming facilities of the Win32 operating systems (Windows 95, 98, NT). Section 6 gives information on approaches and methods for solving applied problems on parallel computers with common control (the SIMD-type system PS-2000), which can be used in practical classes on "Parallel programming". Section 7 lists the topics of the laboratory works.

The material presented here is based on the lectures given by the author to students of "Software Engineering" at Taganrog State University of Radio-Engineering in 1996-2001. I am grateful to my students for our joint work in studying and mastering this very interesting field of knowledge, and for preparing an electronic version of the lecture notes and the laboratory works on distributed applications using CORBA, DCOM, and PVM (R. Trotsenko, M. Danilin, A. Loktionov, A. Solovyev, A. Korolev).

Programs created for traditional computing systems, which realize John von Neumann's principle of program control (sequential computers) [3], are now developed so that a program written for one sequential computer can run practically unchanged on others, although there may certainly be architectural differences, for instance in the length of the machine word. Sequential programming is nevertheless sufficiently unified, and porting programs from one machine to another is not too complex, since all such machines are based on one and the same model of the organization of computations, illustrated in Fig. 1. Here the program (operation codes) and the processed data are kept in one Random Access Memory (RAM), a linear address space in which each memory cell (for instance, a byte) has its own unique number, an address.
A program is executed by loading the address of the first byte of its first instruction into the Program Counter (PC), from which the code of the first command of the program is determined. The Control Unit (CU) reads the address of the next instruction to be executed from the PC, fetches the byte stored at this address, and interprets it as an operation code, which defines the format (structure) of the command.

[Fig. 1: von Neumann model: Program Counter (incremented by the command length l), CU, ALU, and RAM holding the program and data]

Knowing the operation code and the format of the command, the CU can determine which command must be executed, on which operands, and where the result must be placed. The format of the command defines how many operands the operation has, where they are located (in memory, in registers, or directly in the command), and how the contents of the respective memory regions must be interpreted, i.e. what their type is. This information is transferred to the Arithmetic-Logical Unit (ALU), which executes the command. The format of the command also determines its length; after fetching the current command the CU increases the PC by this length, so that when the current command completes the PC already holds the address of the next command (unless the current command was a control transfer).

Thus a sequential computer realizes a so-called thread of control (thread), or instruction stream, which processes the data stream (operands) defined by these commands.

Parallel computing systems, whose programming is the subject of our consideration, have architectures that differ greatly from one system to another, so there is no unified concept of parallel programming similar to that of sequential programming. Programming parallel computing systems requires a good knowledge of their architecture in order to use their capabilities effectively. Many hundreds of different parallel architectures are known at present, and considering each of them separately is both unrealistic and pointless.
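The fetch-decode-execute cycle of the sequential model described above can be sketched as a toy interpreter. This is only an illustration under stated assumptions: the four-command instruction set (LOAD, ADD, STORE, HALT), its numeric encoding, and the single accumulator are hypothetical and are not taken from any real machine.

```python
# A toy von Neumann machine: program and data share one linear memory.
# The instruction set and its encoding are hypothetical, chosen for illustration.

LOAD, ADD, STORE, HALT = 1, 2, 3, 0

# Command length in memory cells, determined by the operation code (its "format").
LENGTH = {LOAD: 2, ADD: 2, STORE: 2, HALT: 1}


def run(ram, pc=0):
    """Execute the program stored in `ram`, starting at address `pc`."""
    acc = 0  # single accumulator register used by the ALU
    while True:
        opcode = ram[pc]                      # CU fetches the operation code
        operand = ram[pc + 1] if LENGTH[opcode] == 2 else None
        pc += LENGTH[opcode]                  # PC now points at the next command
        if opcode == LOAD:
            acc = ram[operand]
        elif opcode == ADD:
            acc = acc + ram[operand]          # ALU executes the operation
        elif opcode == STORE:
            ram[operand] = acc
        elif opcode == HALT:
            return ram


# Cells 0..6 hold the program, cells 7..9 hold the data.
memory = [LOAD, 7, ADD, 8, STORE, 9, HALT, 20, 22, 0]
result = run(memory)
print(result[9])  # sum of cells 7 and 8
```

Note that the PC is advanced by the command length before the command is executed, so that after completion it already holds the address of the next command, as in the description above.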
In spite of their differences, parallel computing systems have some common features, which are central to their classification. The most widely used classification was proposed by Flynn [4]; it is rather coarse, but it splits computing systems into four substantially different classes. A more detailed classification of parallel computing systems is given in [5], but it also does not enumerate all possible variants. It should be noted that real systems can have features that place them in different classes, i.e. the assignment of a particular system to a particular class is rather fuzzy.

According to Flynn, all computing systems may be split into four classes:

SISD - single instruction stream, single data stream;
SIMD - single instruction stream, multiple data streams;
MISD - multiple instruction streams, single data stream;
MIMD - multiple instruction streams, multiple data streams.

The first of these classes, SISD, corresponds to a sequential computer with one PC, through which one instruction stream is realized, defining one data stream to be processed.

The second class, SIMD, corresponds to systems that have several ALUs, each with its own data memory, but only one CU and one PC. Just as in SISD systems, the CU reads the next command from the code region, decodes it, and prepares it for execution, but the command is then sent for execution not to one ALU but to several simultaneously. Each ALU reads operands from cells with the specified addresses, but each does so in its own local RAM containing its own data. This can be illustrated by the following situation: the leader of a tourist group instructs the tourists on how to fill in, for instance, customs declarations when crossing a border: "Write your surname in item 1", and so on. Here the leader plays the role of the common CU, each tourist is similar to an ALU, and the tourist's memory and form play the role of the data.
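The SIMD principle described above (one command stream, several ALUs, each with its own local data) can be sketched as follows. This is a minimal model, not a description of any particular machine: the command format `(op, dst, a, b)` and the two operations are assumptions made for the example, and the inner loop only imitates what real hardware does simultaneously.

```python
# Hypothetical model of a SIMD machine: one control unit, several ALUs,
# each ALU with its own local data memory.

def simd_execute(program, local_memories):
    """Broadcast each command of `program` to every processing element."""
    for op, dst, a, b in program:        # single instruction stream (one CU, one PC)
        for mem in local_memories:       # conceptually simultaneous on all ALUs
            if op == "add":
                mem[dst] = mem[a] + mem[b]
            elif op == "mul":
                mem[dst] = mem[a] * mem[b]
            else:
                raise ValueError(f"unknown operation: {op}")
    return local_memories


# Three processing elements with the same addresses but different local data.
memories = [[1, 2, 0], [10, 20, 0], [100, 200, 0]]
program = [("add", 2, 0, 1), ("mul", 2, 2, 2)]
simd_execute(program, memories)
print([mem[2] for mem in memories])
```

Every processing element uses the same addresses 0, 1, and 2, but the values found there differ, just as every tourist fills in item 1 with a different surname.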
SIMD systems are often called vector or matrix systems, since their elementary processors (ALU + local RAM) are organized as one-dimensional (vector) or multidimensional (in particular, two-dimensional) arrays.

The third class, MISD, corresponds to parallel systems of the pipeline type. Here there are several executive processors (EP), each of which runs under its own instruction stream, while the processed data are passed consecutively from one processor to another. Cars, for instance, are assembled in a similar way in modern plants: one product moves consecutively through the steps of the pipeline from the first to the last, undergoing specific transformations at each step. The final product is obtained at the output of the last pipeline step. In pipeline computing systems the result of the calculations is likewise obtained only at the output of the last pipeline step. Processing each individual product (task) takes as much time as sequential processing, but because many processing devices are used, a device (step) that has performed its operation for the given task can proceed to perform the same operation for the next task. For efficient functioning a pipeline requires a flow of uniform tasks; then, if each step of the pipeline executes its operation in the same time, in the stationary mode the outputs of the pipeline will also appear at intervals equal to this step time.

For efficient functioning of SIMD systems the algorithm must allow massive data parallelism, i.e. there must be a large amount of different data processed uniformly. Systems of both these types have found broad use, since many mathematical and physical problems satisfy these restrictions. However, in many situations the processing algorithms are more complex and the data are not uniform, and systems of these types are then inefficient.

MIMD systems assume that each EP runs under its own instruction stream.
Such an architecture is the most flexible of those considered; however, it is also the most complex to implement and to program. Recently the SPMD approach (single program, multiple data streams) has become popular, in which each EP runs under its own CU and the RAM of each EP holds one and the same program (code) but different data.

Before turning to the programming of parallel computing systems, we shall give examples of several computing systems representing the classes specified above. The approaches to programming will be based on the models of these architectures.

1. EXAMPLES OF PARALLEL COMPUTING SYSTEMS

1.1. CRAY-X-MP

CRAY-X-MP is a two-processor vector system [6] (Fig. 2).

[Fig. 2: CRAY-X-MP structure: two processors P, shared RAM (2M), intermediate memory (32M), common registers, real-time clock]

In this system each of the two processors P is a CRAY-1 processor (Fig. 3). The RAM (2M 64-bit words) is shared, and both processors can access it simultaneously. The machine clock cycle is 9.5 ns. The system was designed in 1982-1983. The intermediate memory holds 32M 64-bit words.

Let us consider the structure of the processor. The register block holds 8 common address registers and 8 common scalar registers, as well as 32 single-bit semaphore registers (registers for synchronization). In addition, the processor has 12 functional units (FU), organized in 4 groups:

vector functional units (VFU);
functional units for floating-point numbers (FUFP);
scalar functional units (SFU);
address functional units (AFU).

[Fig. 3: CRAY-1 processor structure: functional units VFU, FUFP, SFU, AFU; registers VR, SR, BSR, AR, BAR; RAM]

All FUs are pipelined and can work in parallel with one another. Between the common RAM and the FUs there are the following groups of registers: 8 address registers (AR), 64 buffer address registers (BAR), 8 scalar registers (SR), 64 buffer scalar registers (BSR), and 8 vector registers (VR) of 64 elements each, every element being a 64-bit word.

1.2. DAP

DAP (1972), the distributed array processor, is a SIMD system [7] (Fig. 4).
[Fig. 4: DAP structure: host, CU, and a k x k array of elementary processors (EP), EP11 to EPkk]

Several variants of the system were implemented in England, one of them with 1024 one-bit processors (a 32x32 mesh); the performance of the system was $25 \cdot 10^6$ op/s on 32-bit floating-point numbers.

1.3. MIMD systems

These systems differ significantly in their methods of communication. Let us consider typical cases.

1.3.1. Alliant FX/8

Alliant FX/8 (1985) uses a common bus for communication between processors [6] (Fig. 5). Motorola 68000 microprocessors are used as computing processors.

[Fig. 5: Alliant FX/8 structure: computational processors (CP) and interface processors (IP), 32 Kbyte caches, memory bus, control bus]

1.3.2. Intel iPSC

Intel iPSC (1985) [6] is a system with a hypercube architecture that has several modifications: d5 is a 5-dimensional cube with 32 nodes; d7 is a similar 7-dimensional cube with 128 nodes. Fig. 6 gives a scheme of the 4-dimensional cube (with 16 nodes). Each processor can work in parallel with the other processors; the binary numbers of adjacent nodes differ in exactly one digit, which simplifies the routing of packets between processors. Each element is an i80286 (i80386) processor. With 32 elements the performance of such a system is estimated at 1 MFlops.

[Fig. 6: 4-dimensional hypercube with 16 nodes numbered 0000 to 1111]

1.3.3. Computing system PASM

The computing system PASM (1985) is a partitionable SIMD-MIMD system [8,9] (Fig. 7). The Motorola 68010 processor is used as the main computing element. The system can function as several independent subsystems; this partitioning property is supported by the connecting network. The control memory is intended for storing the programs executed by the microcontrollers (MC). The parallel processor consists of ensembles of processors working in parallel; they execute the commands issued by the microcontroller (in the SIMD mode).
Since there are several microcontrollers, several SIMD systems can work simultaneously, i.e. a MIMD or a SIMD-MIMD system can be obtained. The parallel processor is shown in Fig. 8.

[Fig. 7: PASM structure: system CU, memory control system, control memory, microcontrollers, system memory, parallel processor]

To ensure quick exchanges of information between the local operating memory of the processors and the system memory, each microprocessor MP has two RAMs (RAM A and RAM B), which can work in parallel: while the processor works with one of them, the system memory can simultaneously work with the other. The connecting network is intended for interprocessor exchanges; the outputs and inputs of each processor are connected to it. The network is multistage and has a hypercube structure. Fig. 9 gives an example of a connecting network for eight elements (8 inputs and 8 outputs). The multistage network is built from switches, each of which has two inputs and two outputs.

[Fig. 8: parallel processor: microcontrollers, microprocessors MP0 to MPn-1, each with two memories (RAM A and RAM B), system memory, connecting network]

Each switch can realize one of the following exchange functions:

1) transfer straight through or with exchange;
2) broadcast of the lower (upper) input to both outputs.

[Fig. 9: connecting network for eight elements: three cascades of two-input, two-output switches, with inputs and outputs numbered 0 to 7 (000 to 111)]

The number of cascades is given by the formula $n = \log_2 N$ (in our case $n = \log_2 8 = 3$); the cascades are numbered from right to left. In the $i$-th cascade of the network the switches realize the following connection function: if the number of one input (output) of the switch is $P_{n-1}P_{n-2} \ldots P_{i+1}P_iP_{i-1} \ldots P_1P_0$, then the number of the other input (output) is $P_{n-1}P_{n-2} \ldots P_{i+1}\overline{P_i}P_{i-1} \ldots P_1P_0$, i.e. the two numbers differ only in bit $i$. Output $P$ of a switch of the $i$-th cascade is connected to input $P$ of a switch of the $(i-1)$-th cascade; input (output) $P$ of the network is connected to input (output) $P$ of a switch of the $(n-1)$-th (zero) cascade.
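The cascade count and the connection function above can be checked with a short sketch. The function names `cascade_count` and `switch_pairs` are hypothetical helpers introduced only for this illustration; the sketch assumes that the number of network inputs is a power of two.

```python
def cascade_count(n_inputs):
    """Number of cascades n = log2(N) for a network with N inputs."""
    if n_inputs < 2 or n_inputs & (n_inputs - 1):
        raise ValueError("the number of inputs must be a power of two")
    return n_inputs.bit_length() - 1


def switch_pairs(n_inputs, cascade):
    """Pairs of line numbers that share one switch in the given cascade.

    The two lines of a switch in cascade i differ only in bit i.
    """
    bit = 1 << cascade
    return [(p, p | bit) for p in range(n_inputs) if not p & bit]


n = cascade_count(8)
for i in range(n - 1, -1, -1):   # cascades are passed from (n-1) down to 0
    print(i, switch_pairs(8, i))
```

For eight inputs this gives four switches per cascade; for example, in cascade 2 line 0 (000) shares a switch with line 4 (100).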
Routing is controlled by traveling (routing) tags located in the message headers. They provide individual control of each switch and therefore decentralized control of the connecting network. The traveling tag $T$ is defined as $T = S \oplus R$, where $S$ is the binary address of the source, $R$ is the binary address of the receiver, and $\oplus$ denotes the "exclusive OR" operation. Let $T = t_{n-1}t_{n-2} \ldots t_0$ be the binary representation of the tag; then a switch of the $i$-th cascade checks the value $t_i$ and, if it is 1, goes to the "exchange" state, otherwise to the "straight" state. For instance, if $S = 010$ and $R = 100$, then $T = 110$; the respective route is shown in Fig. 9 by a bold line.

Networks of this type can be split into independent subsystems of different sizes; moreover, each such subsystem has the same properties as the original network. In PASM the partitioning is realized as follows: to obtain a subsystem of size $2^i$, those inputs and outputs of the original network are chosen whose numbers in binary representation have the same values of the $(n-i)$ low-order bits.

Example: $n = 3$, $i = 2$, i.e. $2^i = 4$. Let us split a network of size 8x8 into two independent subnets of size 4x4; here $n - i = 3 - 2 = 1$ bit (one low-order bit) must coincide.

[Illustration: inputs 0 to 7 divided into subnets A and B according to the value of the low-order bit]

To isolate A and B, the switches of the zero cascade are set to the "straight" state (refer to Fig. 9).

The system has $Q = 2^q$ microcontrollers, addressed within the range $0 \ldots (Q-1)$. Each microcontroller controls $N/Q$ processors, where $N$ is the total number of processors. In the maximum variant $Q = 32$; in the prototype $Q = 4$. Each subsystem of the parallel processor works in the SIMD mode. The memory module of each microcontroller has two RAMs, like the MPs. The physical addresses of the $N/Q$ processors connected to a microcontroller must have coinciding values of the $q$ low-order bits, so that the network can be divided into independent subsystems. The value of these $q$ bits equals the physical address of the microcontroller.
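The traveling-tag routing and the partitioning rule described in this section can be illustrated by the following sketch. The helper names are hypothetical; the sketch assumes that a message enters at cascade $n-1$ and leaves at cascade 0, and that a switch in the "exchange" state moves the message to the line whose number differs in bit $i$.

```python
def traveling_tag(source, receiver):
    """Tag T = S xor R carried in the message header."""
    return source ^ receiver


def route(source, receiver, n):
    """Trace a message through cascades n-1 .. 0.

    Returns a list of (cascade, switch state, line number after the cascade).
    """
    tag = traveling_tag(source, receiver)
    line = source
    path = []
    for i in range(n - 1, -1, -1):
        if (tag >> i) & 1:           # bit t_i = 1: switch goes to "exchange"
            state = "exchange"
            line ^= 1 << i
        else:                        # bit t_i = 0: switch stays "straight"
            state = "straight"
        path.append((i, state, line))
    return path


def subnet_members(n, i, low_bits):
    """Lines of the subnet of size 2**i whose (n - i) low-order bits equal low_bits."""
    mask = (1 << (n - i)) - 1
    return [p for p in range(1 << n) if (p & mask) == low_bits]


print(route(0b010, 0b100, 3))       # the example S = 010, R = 100
print(subnet_members(3, 2, 0), subnet_members(3, 2, 1))
```

For $S = 010$ and $R = 100$ the tag is 110, so the switches of cascades 2 and 1 exchange and the switch of cascade 0 stays straight, which delivers the message to line 100. The second call reproduces the split of the 8x8 network into two 4x4 subnets by the low-order bit.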
Example: $N = 2^n$, $Q = 2^q$, $N/Q = 2^{n-q}$, i.e. $i = n - q$ is the logarithm of the subsystem size, and $n - i = n - (n - q) = q$, so the addresses must coincide in the $q$ low-order bits.

A virtual SIMD or MIMD machine containing $RN/Q$ processors, where $R = 2^r$, $r \le q$, is obtained if $R$ microcontrollers work together and issue one and the same commands to the processor elements (in the SIMD mode) or coordinate the functioning of the MPs (in the MIMD mode). The physical addresses of these microcontrollers must have coinciding $q - r$ low-order bits, so that all processors (MPs) in this partition also have coinciding $q - r$ low-order bits in their own physical addresses.

[Illustration: address layout for $n = 5$, split into $n - q$ high-order bits and $q$ low-order bits]

The microcontrollers are connected by a reconfigurable common bus and are placed on the bus in bit-reversed order of their addresses, so that adjacent microcontrollers can be united into one subsystem (Fig. 10).

The processors of each subsystem of size $RN/Q$ are assigned logical addresses within the range $0 \ldots (RN/Q - 1)$. The logical number of a processor is given by the $(r + n - q)$ high-order bits of its physical address ($RN/Q = 2^r \cdot 2^n / 2^q = 2^{n+r-q}$). Wherever the subsystem is located, the user writes a program with the logical processor numbers $0 \ldots (RN/Q - 1)$, but when this program is executed the processor with logical number zero will not always have the physical number 0. If the logarithm of the subsystem size is $i = n + r - q$, then the physical addresses of the processors belonging to this subsystem must coincide in $n - (n + r - q) = q - r$ low-order bits. The translation from logical addresses to physical ones and back is provided by the operating system, and the user works with logical processor numbers.

When a subsystem works in the SIMD mode, two types of processor masking are used: address masking and conditional masking.
With address masking, those processors are activated whose addresses correspond to an address mask; with conditional masking, each processor checks a condition issued by the microcontroller against its local data and then executes one of two alternative actions.

The control memory keeps the programs for the microcontrollers. The system memory is used for storing files; it consists of $N/Q$ independent memory units. The system has $N$ processors and $Q$ microcontrollers. Each memory unit is connected to $Q$ memory modules of the MPs: the $i$-th memory unit is connected to the memory modules of those MPs whose physical addresses contain the value $i$ in the $(n - q)$ high-order bits.

[Illustration: an $n$-bit physical address; the $n - q + r$ high-order bits form the logical processor number and differ between processors, while the $q - r$ low-order bits coincide in the physical addresses within a subsystem]

With such an organization of communications, all $N/Q$ processors controlled by one microcontroller can be loaded simultaneously. Processors corresponding to different microcontrollers must be loaded consecutively. The transfer of information between the processors and the memory devices is controlled by the memory control system. The system CU is responsible for the overall coordination of all the other components.

[Fig. 10: microcontrollers on the reconfigurable common bus in bit-reversed order of addresses: 000 (0), 100 (4), 010 (2), 110 (6), 001 (1), 101 (5), 011 (3), 111 (7)]
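The addressing rules of this section (bit-reversed placement of microcontrollers on the bus, and the relation between logical and physical processor numbers) can be summarized in a short sketch. It is an illustration only: the function names are hypothetical, and the mapping assumes, as stated above, that the logical number occupies the high-order bits of the physical address while the $q - r$ low-order bits are common to the whole subsystem.

```python
def bit_reverse(value, width):
    """Reverse the order of the `width` low-order bits of `value`."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def bus_order(q):
    """Addresses of the 2**q microcontrollers in their order along the bus."""
    return [bit_reverse(position, q) for position in range(1 << q)]


def physical_address(logical, common_low_bits, q, r):
    """Physical address of a processor in a subsystem joining 2**r microcontrollers."""
    return (logical << (q - r)) | common_low_bits


def logical_number(physical, q, r):
    """Logical number: the high-order bits left after dropping the q - r common bits."""
    return physical >> (q - r)


print(bus_order(3))
# Subsystem with q = 2, r = 1 whose processors share the low-order bit 1.
print(physical_address(0, 1, 2, 1), physical_address(5, 1, 2, 1))
```

With $q = 3$ the bus order is 0, 4, 2, 6, 1, 5, 3, 7, so neighbours such as 0 and 4 share their two low-order bits and can be joined. In the second example the processor with logical number 0 has physical address 1, which shows why the user works only with logical numbers.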