PARALLEL PROGRAMMING - 1
ALEXANDER CHEFRANOV
The Japanese Fifth Generation project on parallel intelligent computing systems (late
1980s to early 1990s) did not succeed [1], but its ideas and achievements did not disappear.
At present, without any pomp, work is developing in the USA under the Accelerated Strategic
Computing Initiative (ASCI) program, within the framework of which a computing system of 100
thousand Pentium-based processors with a performance of 100 TFlops (10^14 floating-point
operations per second) is expected by 2004, its main purpose being the development of nuclear
weapons without their actual testing; a sample with 7264 processors and a performance of 1.8
TFlops is already in operation [2]. Alongside military applications, such technology finds broad
use in weather forecasting, synthesis of new medicines with required characteristics, protection
of banking systems, etc. The Top 500 list (http://www.top500.org/list/2002/06/) gives the
distribution of supercomputers across countries and organizations; its top ten for June 2002 is
reproduced below.
TOP500 Sublist for June 2002 (Revised June 20th, 14:00 EST):

Rank  Manufacturer     Computer                             Rmax (GFlops)  Installation Site                                Country  Year  Type      Processors
1     NEC              Earth Simulator                      35860.00       Earth Simulator Center                           Japan    2002  Research  5120
2     IBM              ASCI White, SP Power3 375 MHz        7226.00        Lawrence Livermore National Laboratory (Energy)  USA      2000  Research  8192
3     Hewlett-Packard  AlphaServer SC ES45/1 GHz            4463.00        Pittsburgh Supercomputing Center                 USA      2001  Academic  3016
4     Hewlett-Packard  AlphaServer SC ES45/1 GHz            3980.00        Commissariat a l'Energie Atomique (CEA)          France   2001  Research  2560
5     IBM              SP Power3 375 MHz 16-way             3052.00        NERSC/LBNL                                       USA      2001  Research  3328
6     Hewlett-Packard  AlphaServer SC ES45/1 GHz            2916.00        Los Alamos National Laboratory                   USA      2002  Research  2048
7     Intel            ASCI Red                             2379.00        Sandia National Laboratories                     USA      1999  Research  9632
8     IBM              pSeries 690 Turbo 1.3 GHz            2310.00        Oak Ridge National Laboratory                    USA      2002  Research  864
9     IBM              ASCI Blue-Pacific SST, IBM SP 604e   2144.00        Lawrence Livermore National Laboratory (Energy)  USA      1999  Research  5808
10    IBM              pSeries 690 Turbo 1.3 GHz            2002.00        IBM/US Army Research Laboratory (ARL)            USA      2002  Vendor    768
It turns out that a large part of them is used by governmental institutions and banks rather than by
research organizations. Thus, the single supercomputer mentioned for Russia in 2000, which stood at the
156th place of the worldwide rating, was situated in the National Reserve Bank of Russia; in June 2002
the Russian-made MVS-1000 supercomputer held the 64th place, and it was in a scientific organization.
Multiprocessor computing systems and parallel programming come to us not only as supercomputers but
also from the side of widespread PCs on the basis of Intel processors (multiprocessor systems, parallel
processes), which defines the practical worth and urgency of the "Parallel Programming" course for
preparing specialists in "Software Engineering" and adjacent directions.
This textbook considers questions of the organization of parallel programs, i.e. parallel
programming. To simplify the perception of the presented approaches to parallel programming,
the introduction describes the classical von Neumann model of program control and gives a
classification of computing systems based on instruction streams and data streams. Section 1
gives several typical examples of computing systems, to which we shall appeal when interpreting
languages and methods of parallel programming. Known parallel versions of the Fortran language
for systems of different classes are discussed in Section 2. Section 3 gives information on
programming in the Occam-2 and parallel ANSI C languages for multi-transputer systems.
Section 4 presents the ANSI C language for the P-Cube multi-transputer system by the
Parsytec company. Facilities of parallel programming within the framework of the Win32
operating systems (Windows 95, 98, NT) are discussed in Section 5. Section 6 gives information
on approaches and methods of solving applied tasks on parallel computing machines with
common control (the SIMD-type system PS-2000) that can be used in the practical lessons of
the "Parallel Programming" course. Section 7 gives topics of laboratory works. The material
presented here is formed on the basis of the lectures read by the author to students of "Software
Engineering" at Taganrog State University of Radio-Engineering in 1996-2001. I am grateful to
my students for the joint labor in studying and mastering this very interesting field of knowledge,
and for preparing an electronic version of the lecture notes and of the laboratory works on
distributed applications using CORBA, DCOM, and PVM (R. Trotsenko, M. Danilin, A.
Loktionov, A. Solovyev, A. Korolev).
Programs created for execution on traditionally used computing systems realizing John
von Neumann's principle of program control (sequential computers) [3] are at present developed
so that a program written for one sequential computer can run on others practically without
changes, although there may certainly be architectural differences, for instance in the length of
the machine word. Sequential programming is thus sufficiently unified, and the problems of
porting programs from one machine to another are not too complex, since at their base lies one
and the same model of the organization of computations, which can be illustrated by Fig. 1. Here
the program (operation codes) and the processed data are kept in one Random Access Memory
(RAM), a linear address space in which each memory cell (for instance, a byte) has its unique
number, an address.
A program is executed by loading into the Program Counter (PC) the address of the first
byte of the first instruction to be executed, from which the code of the first command of the
program is determined.

Fig. 1. The von Neumann model: the Control Unit (CU) with the Program Counter (PC, increased
by the command length l) and the ALU work on the program and data kept in RAM.
The Control Unit (CU) reads from the PC the address of the next instruction to be executed,
fetches the byte at this address, and interprets it as an operation code, which defines the format
(structure) of the command. Knowing the operation code and the format of the command allows
the CU to determine which command must be executed, on what operands, and where the result
of the operation is to be placed. The format of the command defines how many operands the
operation has, where they are situated (in memory, in registers, immediately in the command),
and how the contents of the respective memory regions must be interpreted, i.e. what their type
is. This information is transferred to the Arithmetic-Logical Unit (ALU), which executes the
command. The format of the command also determines its length, and after fetching the current
command the CU increases the PC by this length, so that upon completion of the current
command the PC already holds the address of the next command (unless it was a control-transfer
command). Thus, sequential computers realize a so-called thread of control, or instruction
stream, processing the data stream (operands) defined by these commands. Parallel computing
systems, whose programming forms the subject of our consideration, have architectures that
differ greatly from one system to another, so there is no unified concept of parallel programming
similar to the sequential one.
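This fetch-decode-execute cycle can be illustrated by a minimal sketch in C (a hypothetical
two-instruction toy machine introduced for illustration, not any real instruction set):

    #include <stdio.h>

    /* Hypothetical toy ISA: opcode 0 = HALT;
       opcode 1 = ADD addr1 addr2 dest (each operand is a 1-byte address). */
    unsigned char ram[256] = {
        1, 16, 17, 18,            /* ADD ram[16], ram[17] -> ram[18] */
        0,                        /* HALT */
        [16] = 2, [17] = 3        /* the data, in the same RAM */
    };

    int main(void) {
        unsigned pc = 0;                     /* Program Counter */
        for (;;) {
            unsigned char op = ram[pc];      /* CU: fetch the operation code */
            if (op == 0) break;              /* HALT */
            if (op == 1) {                   /* this format is 4 bytes long */
                unsigned char a = ram[pc+1], b = ram[pc+2], d = ram[pc+3];
                ram[d] = ram[a] + ram[b];    /* ALU: execute the command */
                pc += 4;                     /* CU: advance PC by the command length */
            }
        }
        printf("%d\n", ram[18]);             /* prints 5 */
        return 0;
    }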
Programming parallel computing systems requires a good knowledge of their architecture in
order to use their possibilities effectively. At present many hundreds of parallel computing
systems of different architectures are known, and to consider each of them separately is
unrealistic and pointless. In spite of their differences, parallel computing systems have some
common features, which are central to their classification. The widely used classification offered
by Flynn [4] is rather rough, but it allows splitting computing systems into four greatly different
classes. A more detailed classification of parallel computing systems is given in [5], but it also
does not enumerate all possible variants of computing systems. It should be noted that real
systems can have features referring them to different classes, i.e. the assignment of a given
system to one or another class is rather fuzzy.
According to Flynn, all computing systems may be split into 4 classes:

SISD   SIMD
MISD   MIMD

where SISD – single instruction stream, single data stream;
SIMD – single instruction stream, multiple data streams;
MISD – multiple instruction streams, single data stream;
MIMD – multiple instruction streams, multiple data streams.
The first of these classes, SISD, corresponds to a sequential computer with one PC, through
which one flow of instructions is realized, defining one data flow to be processed. The second
class, SIMD, corresponds to systems that have several ALUs, each with its own data memory,
while the CU and PC exist in one copy. Just as in SISD systems, the CU reads the next command
from the code region, decodes it, and prepares its execution, but the command goes for execution
not to one ALU but to several simultaneously, and each of them reads operands from cells with
the specified addresses, but each from its local RAM containing its own data. This can be
illustrated by the following situation: the leader of a tourist group gives instructions to the tourists
on filling in, for instance, customs declarations when crossing a border ("Write your surname in
item 1", etc.). Here the leader plays the role of the common CU, each tourist is similar to an ALU,
and each tourist's memory and form play the role of data. SIMD systems are often called vector
or matrix systems, since their elementary processors (ALU + local RAM) are organized as
one-dimensional (vector) or multidimensional (in particular, two-dimensional) arrays.
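The lockstep behavior of a SIMD system can be imitated in ordinary C as follows (a conceptual
sketch only; on a real SIMD machine the command would be issued once by the CU and executed
by all ALUs simultaneously):

    #include <stdio.h>

    #define NPROC 4                         /* number of elementary processors */

    /* Each elementary processor (ALU + local RAM) holds its own data. */
    struct local_ram { int a, b, c; };

    int main(void) {
        struct local_ram ep[NPROC] = {{1,2,0},{3,4,0},{5,6,0},{7,8,0}};

        /* One command, "c = a + b", issued by the single CU; conceptually
           all processors execute it at the same moment on their own data. */
        for (int p = 0; p < NPROC; p++)
            ep[p].c = ep[p].a + ep[p].b;

        for (int p = 0; p < NPROC; p++)
            printf("EP%d: %d\n", p, ep[p].c);
        return 0;
    }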
The third class, MISD, corresponds to parallel systems of the pipeline type. Here there are
several executive processors (EP), each running under its own flow of commands, while the
processed data are transferred consecutively from one processor to another. Assembly of cars in
modern plants is realized similarly: one product advances consecutively through the steps of the
pipeline from first to last, being subjected to a specific transformation at each step, and the final
product appears at the output of the last pipeline step. In pipeline computing systems, likewise,
the result of a computation appears only at the output of the last pipeline step. Processing of each
given product (task) takes as much time as under sequential processing, but thanks to the use of
many processing devices, a processing device (step) that has finished its operation for a given
task can proceed to the same operation for the next task. For efficient functioning, a pipeline
requires a flow of uniform tasks; then, if each step of the pipeline executes its operation in time
τ, the outputs of the pipeline, in the stationary mode, will also appear with a time gap of τ.
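This timing argument can be stated as a small formula-in-code (a sketch; k stages of equal
duration tau are assumed):

    #include <stdio.h>

    /* Time to process m uniform tasks on a k-stage pipeline where every
       stage takes tau: the first result appears after k*tau (the pipeline
       fills), and each following result after a further tau. */
    double pipeline_time(int k, int m, double tau) {
        return (k + m - 1) * tau;
    }

    int main(void) {
        double tau = 1.0;                            /* one stage time */
        int k = 5;                                   /* pipeline stages */
        printf("%g\n", pipeline_time(k, 1, tau));    /* 5: same as sequential */
        printf("%g\n", pipeline_time(k, 100, tau));  /* 104: ~1 result per tau */
        return 0;
    }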
For efficient functioning of SIMD systems, a possibility of massive parallelization of the
algorithm over data is required, i.e. we must have much different data processed uniformly.
Systems of both specified types have found broad use, since many mathematical and physical
tasks satisfy the specified restrictions. However, in many situations the processing algorithms are
more complex and the data are not uniform, and systems of the specified types are inefficient
there. MIMD systems assume that each EP runs under its own flow of commands. Such an
architecture is the most flexible of those considered; however, it is the most complex in
realization and programming. Recently the SPMD (single program, multiple data) approach has
become popular: each EP runs under its own CU, and the RAMs of all EPs hold one and the
same program (code) but different data.
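The SPMD idea can be sketched in C as follows (a single-process imitation; on a real system
each rank would be a separate process running the same program, and the names rank and
NPROC are illustrative):

    #include <stdio.h>

    #define NPROC 4
    #define N     16

    int data[N];                      /* the whole data set */

    /* The same program text runs on every EP; only 'rank' differs,
       so each EP selects its own slice of the data. */
    int partial_sum(int rank) {
        int chunk = N / NPROC, s = 0;
        for (int i = rank * chunk; i < (rank + 1) * chunk; i++)
            s += data[i];
        return s;
    }

    int main(void) {
        for (int i = 0; i < N; i++) data[i] = i;
        int total = 0;
        for (int rank = 0; rank < NPROC; rank++)  /* conceptually parallel */
            total += partial_sum(rank);
        printf("%d\n", total);                    /* 120 */
        return 0;
    }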
Before proceeding to questions of programming for parallel computing systems, we shall
give examples of several computing systems representing the previously specified classes.
Approaches to programming will be based on the models of these computing system
architectures.
1. INSTANCES OF PARALLEL COMPUTING SYSTEMS
1.1. CRAY-X-MP
CRAY-X-MP is a two-processor vector system [6] (Fig. 2).
Fig. 2. CRAY X-MP structure: two processors P share a 2M-word RAM, common registers, a
real-time clock, and a 32M-word intermediate memory.
In the given system each of the two processors P is a CRAY-1 processor (Fig. 3). The RAM (2M
of 64-bit words) is shared, and both processors can address it simultaneously. The machine cycle
time is 9.5 ns. The system was designed in 1982-83. The intermediate memory holds 32M of
64-bit words. Let us consider the structure of the processor. The register block of the processor
keeps 8 common address registers and 8 common scalar registers; there are also 32 single-bit
semaphore registers (registers for synchronization). Besides, the processor has 12 functional
units (FU), organized in 4 groups:
vector functional units - VFU;
functional units for working with floating point numbers - FUFP;
scalar functional units - SFU;
address functional units - AFU.
Fig. 3. CRAY-1 processor structure: RAM exchanges data with the address registers (AR)
through the buffer address registers (BAR), with the scalar registers (SR) through the buffer
scalar registers (BSR), and with the vector registers (VR); the register groups feed the functional
units (VFU, FUFP, SFU, AFU).
All FUs are pipelined and may work in parallel with each other.
Between the common RAM and the FUs there are the following groups of registers: 8
address registers (AR), 64 buffer address registers (BAR), 8 scalar registers (SR), 64 buffer
scalar registers (BSR), and 8 vector registers (VR) of 64 elements each, each element being a
64-bit word.
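The effect of one vector command over such registers can be imitated in C (a sketch only, not
CRAY code; on the CRAY-1 the pipelined vector FU would produce one element per clock once
the pipeline is full):

    #include <stdio.h>

    #define VLEN 64                      /* elements per vector register */

    typedef long long word64;            /* one 64-bit word */

    /* One vector command, e.g. V3 = V1 + V2, over 64-element registers. */
    void vadd(word64 v3[VLEN], const word64 v1[VLEN], const word64 v2[VLEN]) {
        for (int i = 0; i < VLEN; i++)
            v3[i] = v1[i] + v2[i];
    }

    int main(void) {
        word64 v1[VLEN], v2[VLEN], v3[VLEN];
        for (int i = 0; i < VLEN; i++) { v1[i] = i; v2[i] = 2 * i; }
        vadd(v3, v1, v2);
        printf("%lld\n", v3[10]);        /* 30 */
        return 0;
    }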
1.2. DAP
DAP (1972), the distributed array processor, is a SIMD system [7] (Fig. 4).
Fig. 4. DAP structure: a host computer and a CU drive a k×k mesh of elementary processors
(EP) EP11 .. EPkk.
Several variants of the system were implemented in England, one of them with 1024 one-bit
processors (a 32x32 mesh); the performance of the system was 25*10^6 op/s on 32-bit
floating-point numbers.
1.3. MIMD systems
These systems differ significantly in their methods of communication. Let us consider
typical cases.
1.3.1. Alliant FX/8
Alliant FX/8 (1985) uses a common bus for processor communication [6] (Fig. 5).
Motorola 68000 microprocessors are used as computing processors.
1.3.2. Intel iPSC
The Intel iPSC (1985) [6] hypercube-architecture system has several modifications: d5 is a
5-dimensional cube with 32 nodes; d7 is a similar 7-dimensional cube with 128 nodes. Fig. 6 gives
a scheme of the 4-dimensional cube (with 16 nodes). Each processor can work in parallel with the
other processors; the binary numbers of adjacent nodes differ in one digit, which simplifies the
routing of packets between processors. Each element is an i80286 (i80386) processor. With 32
elements, the performance of such a system is estimated as 1 MFlops.
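Since adjacent node numbers differ in exactly one bit, neighbor enumeration and adjacency
testing reduce to XOR arithmetic, as the following sketch shows:

    #include <stdio.h>

    #define DIM 4                         /* 4-dimensional cube, 16 nodes */

    int main(void) {
        unsigned node = 0x5;              /* node 0101 */

        /* Neighbors: flip each of the DIM bits in turn. */
        for (int d = 0; d < DIM; d++)
            printf("neighbor across dimension %d: %u\n", d, node ^ (1u << d));

        /* Two nodes are adjacent iff their numbers differ in exactly one
           bit, i.e. the XOR of the numbers is a power of two. */
        unsigned a = 0x5, b = 0x7, x = a ^ b;
        printf("adjacent: %s\n", (x && !(x & (x - 1))) ? "yes" : "no");
        return 0;
    }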
Fig. 5. Alliant FX/8 structure: computational processors (CP, Motorola 68000 microprocessors)
and interface processors (IP) share four 32-Kbyte caches on the memory bus; the IPs are also
linked by a control bus.
Fig. 6. A 4-dimensional hypercube: 16 nodes numbered 0000 .. 1111, adjacent nodes differing
in exactly one binary digit.
1.3.3. Computing system PASM
The PASM computing system (1985) is a partitionable SIMD-MIMD system [8, 9] (Fig. 7).
Motorola 68010 processors are used as the main computing elements. The system can function
as several independent subsystems; partitioning is supported by the connecting network.
The control memory is intended for keeping the programs performed by the microcontrollers
(MC). The parallel processor consists of ensembles of processors working in parallel; they
execute commands assigned by a microcontroller (in the SIMD mode).
Since there are several microcontrollers, several SIMD systems can work simultaneously,
i.e. a MIMD or a SIMD-MIMD system may be obtained. The parallel processor is shown in Fig. 8.
Fig. 7. PASM structure: the system CU coordinates the control memory, the microcontrollers,
the memory control system, the system memory, and the parallel processor.
To ensure quick exchanges of information between the local RAM of the processors and the
system memory, each microprocessor MP has 2 RAMs (RAMA and RAMB), which can work in
parallel: while the processor works with one of them, the system memory can simultaneously
work with the other. The connecting network is intended for the organization of interprocessor
exchanges; the outputs and inputs of each processor are connected to it; it is multistage and has
a hypercube structure. Fig. 9 gives an example of a connecting network for eight elements (8
inputs and 8 outputs). The multistage network is built from switches, each of them having two
inputs and two outputs.
Fig. 8. PASM parallel processor: each microprocessor MPi (MP0, MP1, .., MPn-1) has two
memory modules (RAMiA, RAMiB) switched between the MP and the system memory; all MPs
are connected to the connecting network and driven by the microcontrollers.
Each switch can realize one of the following exchange functions:
1) straight transfer or transfer with exchange (the two inputs go to the two outputs either
directly or crossed over);
2) spreading (broadcast) of the lower (upper) input to both outputs.
Fig. 9. A multistage connecting network for 8 inputs and 8 outputs (ports 000 .. 111), built of
three cascades of 2x2 switches; the route for S=010, R=100 (tag T=110) is shown by the bold
line.
The number of cascades is given by the formula n = log2 N (in our case n = log2 8 = 3);
cascades are numbered from right to left. On the i-th cascade of the network, the switches realize
the following joining function: if the number of one input (output) of a switch is
P(n-1) P(n-2) .. P(i+1) P(i) P(i-1) .. P(1) P(0), then the number of the other input (output) is the
same number with the bit P(i) inverted. Output P of a switch of the i-th cascade is connected to
input P of a switch of the (i-1)-th cascade; input P of the network is connected to input P of a
switch of the (n-1)-st cascade, and output P of the network to output P of a switch of the zero
cascade.
For routing control, traveling tags located in the headers of messages are used. They ensure
individual control of each switch and, therefore, decentralized control of the connecting
network. The traveling tag T is defined as follows:

T = S ⊕ R,

where S is the binary address of the source, R is the binary address of the receiver, and ⊕ is the
"exclusive OR" operation.
Let T = t(n-1) t(n-2) .. t(0) be the binary representation of the tag; then a switch of the i-th
cascade checks the value of t(i), and if it is 1, it goes to the state "exchange", otherwise to the
state "straight". For instance, if S=010 and R=100, then T=110, and the respective route is
shown in Fig. 9 by the bold line.
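The tag computation and the resulting switch settings can be sketched in C as follows (for the
8x8 network of Fig. 9):

    #include <stdio.h>

    #define NBITS 3                        /* n = log2(8) cascades */

    int main(void) {
        unsigned s = 2 /* 010 */, r = 4 /* 100 */;
        unsigned t = s ^ r;                /* traveling tag T = S xor R: 110 */

        /* Cascade i examines bit t_i: 1 -> "exchange", 0 -> "straight". */
        for (int i = NBITS - 1; i >= 0; i--)
            printf("cascade %d: %s\n", i,
                   (t >> i) & 1 ? "exchange" : "straight");

        /* On cascade i the two ports of one switch differ in bit i only. */
        unsigned port = 2, i = 1;
        printf("partner of port %u on cascade %u: %u\n",
               port, i, port ^ (1u << i));
        return 0;
    }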
Networks of this type can be split into independent subsystems of different sizes; moreover,
each such subsystem possesses the same characteristics as the primary network. In PASM the
partition is realized in the following way: to get a subsystem of size 2^i, those inputs and outputs
of the primary network are chosen whose numbers have the same values of the (n-i) junior bits
in binary representation.
Example: n=3, i=2, i.e. 2^i=4; let us split a network of size 8x8 into two independent
subnets of size 4x4; here n-i = 3-2 = 1 bit (one junior bit) must coincide.
[Figure: the 8x8 network split by the junior bit into subnet A (ports 0, 2, 4, 6) and subnet B
(ports 1, 3, 5, 7).]
The switches of the zero cascade, in order to isolate A and B, are set to the state "straight"
(refer to Fig. 9).
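The selection rule for a subnet can be sketched in C (the ports whose n-i junior bits equal a
given value form one subnet):

    #include <stdio.h>

    /* List the ports of the 2^i-sized subnet whose (n-i) junior bits
       equal 'tag' (0 <= tag < 2^(n-i)). */
    void subnet_ports(int n, int i, unsigned tag) {
        unsigned mask = (1u << (n - i)) - 1;   /* selects the junior n-i bits */
        for (unsigned p = 0; p < (1u << n); p++)
            if ((p & mask) == tag)
                printf("%u ", p);
        printf("\n");
    }

    int main(void) {
        subnet_ports(3, 2, 0);    /* subnet A: 0 2 4 6 */
        subnet_ports(3, 2, 1);    /* subnet B: 1 3 5 7 */
        return 0;
    }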
The system has Q = 2^q microcontrollers, addressed within the range 0 .. Q-1. Each
microcontroller controls N/Q processors, where N is the total number of processors. In the
maximum variant Q = 32; in the prototype Q = 4.
Each subsystem of the parallel processor works in the SIMD mode. The memory module of
each microcontroller has 2 RAMs, like the MPs. The physical addresses of the N/Q processors
connected to a microcontroller must have coinciding values of the q junior bits, so that the
network can be divided into independent subsystems. The value of these q bits equals the
physical address of the microcontroller.
Example: N = 2^n, Q = 2^q, N/Q = 2^(n-q), i.e. i = n-q is the logarithm of the subsystem size;
n-i = n-(n-q) = q, so the addresses must coincide in the q junior bits.
A virtual SIMD or MIMD machine containing R*N/Q processors, where R = 2^r, r <= q, is
obtained if R microcontrollers work together and issue to the processor elements one and the
same commands (in the SIMD mode) or coordinate the functioning of the MPs (in the MIMD
mode). The physical addresses of these microcontrollers must have coinciding q-r junior bits, so
that all processors (MP) of this partition also have coinciding q-r junior bits in their own
physical addresses.
[Example address layouts (n = 5): the q = 3 junior bits fix the microcontroller (e.g. 000 or 001),
while the n-q = 2 senior bits (00, 01, 10, 11) enumerate the processors attached to it: 00 000,
01 000, 10 000, 11 000, or 00 001, 01 001, 10 001, 11 001.]
The microcontrollers are connected by a reconfigurable common bus and are placed on the
bus in the binary-inverse order of their addresses, so that adjacent microcontrollers can be
united into one subsystem (Fig. 10).
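The binary-inverse (bit-reversal) ordering can be computed as follows (a sketch reproducing
the bus order of Fig. 10):

    #include <stdio.h>

    /* Reverse the q low-order bits of x: 001 <-> 100, 011 <-> 110, etc. */
    unsigned bit_reverse(unsigned x, int q) {
        unsigned r = 0;
        for (int i = 0; i < q; i++)
            r = (r << 1) | ((x >> i) & 1);
        return r;
    }

    int main(void) {
        /* Bus positions 0..7 hold microcontrollers 0 4 2 6 1 5 3 7. */
        for (unsigned pos = 0; pos < 8; pos++)
            printf("%u ", bit_reverse(pos, 3));
        printf("\n");
        return 0;
    }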
Processors of each subsystem of size R*N/Q are assigned logical addresses within the range
0 .. R*N/Q - 1. The logical number of a processor is the senior (r+n-q) bits of its physical
address (since R*N/Q = 2^r * 2^n / 2^q = 2^(n+r-q)).
Wherever this subsystem is located, the user writes a program with logical numbers of
processors 0 .. R*N/Q - 1, but when this program is executed, the processor with logical number
0 will not always have physical number 0. If the logarithm of the subsystem size is i = n+r-q,
then the physical addresses of the processors referred to this subsystem must coincide in
n-(n+r-q) = q-r junior bits. Translation from logical addresses to physical ones and back is
provided by the operating system, and the user works with logical numbers of processors.
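This logical-to-physical correspondence can be sketched in C (the placement of the partition
identifier in the q-r junior bits is an illustrative encoding consistent with the rules above, not
necessarily the exact one used by PASM's operating system):

    #include <stdio.h>

    /* Physical address = logical number in the senior bits, partition id
       in the coinciding q-r junior bits. */
    unsigned log_to_phys(unsigned logical, unsigned part, int q, int r) {
        return (logical << (q - r)) | part;
    }
    unsigned phys_to_log(unsigned phys, int q, int r) {
        return phys >> (q - r);
    }

    int main(void) {
        int q = 3, r = 1;                 /* partition keeps q-r = 2 junior bits */
        unsigned part = 2;                /* partition id 10 */
        for (unsigned l = 0; l < 4; l++)  /* logical processors 0..3 */
            printf("logical %u -> physical %u\n", l, log_to_phys(l, part, q, r));
        return 0;
    }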
When a subsystem works in the SIMD mode, masking of processors of two types is used:
address masking and conditional masking. With address masking, those processors are activated
whose addresses correspond to an address mask; with conditional masking, each processor
checks a condition issued by the microcontroller on its local data, and then executes one of the
two alternative actions.
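Both kinds of masking can be imitated in C as follows (a conceptual sketch; the address-mask
encoding with a 'care' mask is illustrative, not PASM's actual mask format):

    #include <stdio.h>

    #define NPROC 8

    int local[NPROC] = {5, -3, 7, 0, -1, 4, -9, 2};

    int main(void) {
        /* Address masking: activate the processors whose address matches
           'value' in the positions marked by 'care' (here: even addresses). */
        unsigned care = 0x1, value = 0x0;
        for (unsigned p = 0; p < NPROC; p++)
            if ((p & care) == value)
                local[p] += 100;          /* only masked-in processors act */

        /* Conditional masking: every processor tests the broadcast condition
           on its own data and takes one of two alternative actions. */
        for (unsigned p = 0; p < NPROC; p++)
            local[p] = (local[p] < 0) ? -local[p] : local[p];

        for (unsigned p = 0; p < NPROC; p++) printf("%d ", local[p]);
        printf("\n");
        return 0;
    }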
The control memory keeps the programs for the microcontrollers. The system memory is
used for keeping files; it consists of N/Q independent memory units. The system has N
processors and Q microcontrollers. Each memory unit is connected with Q memory modules of
MPs: the i-th memory unit is connected with the memory modules of those MPs whose physical
addresses contain the value i in the (n-q) senior bits.
[Address layout of an n-bit processor address: the senior n-q+r bits differ (they form the logical
processor number), while the q-r junior bits coincide in the physical addresses of one partition.]
Under such an organization of communications, all N/Q processors controlled by one
microcontroller can be loaded simultaneously; processors corresponding to different
microcontrollers must be loaded consecutively. Sending information between the processors and
the memory units is controlled by the memory control system. The system CU is responsible for
the overall coordination of all the remaining components.
Fig. 10. Microcontrollers on the reconfigurable common bus, placed in the binary-inverse order
of their addresses: 000 (0), 100 (4), 010 (2), 110 (6), 001 (1), 101 (5), 011 (3), 111 (7).