MIMD COMPUTERS
G. Alaghband, Fundamentals of Parallel Processing
MIMD Computers or Multiprocessors
There are several terms which are often used in a confusing way.
Definition:
Multiprocessors are computers capable of running multiple instruction
streams simultaneously to cooperatively execute a single program.
Definition:
Multiprogramming is the sharing of computing equipment by many
independent jobs. They interact only through their requests for the same
resources. Multiprocessors can be used to multiprogram single stream
programs.
Definition:
A process is a dynamic instance of an instruction stream. It is a combination of code and process state, for example the program counter and status words.
Processes are also called tasks, threads, or virtual processors.
Definition:
Multiprocessing is either:
a) running a program (possibly a sequential one) on a multiprocessor [not of interest to us], or
b) running a program consisting of multiple cooperating processes.
There are two main types of MIMD or multiprocessor architectures:
Shared memory multiprocessor
Distributed memory multiprocessor
Distributed memory multiprocessors are also known as explicit
communication multiprocessors.
Notation: a summary of the notation used in the following figures is given below.
L: Link, a component that transfers information from one place to another.
K: Controller, a component that evokes the operation of other components in the system.
S: Switch, constructs a link between components. It has associated with it a set of possible links; it sets some and breaks other links to establish a connection.
T: Transducer, a component that changes the i-unit (information) used to encode a given meaning. Transducers don't change meaning, only format.
Some Example Configurations
Fully Shared Memory Architecture:
Adding private memories to the previous configuration produces a hybrid
architecture.
Shared Plus Private Memory Architecture:
If local memories are managed by hardware, they are called caches.
NUMA (Non-Uniform Memory Access) machines: performance is strongly affected when some locations in shared memory take longer to access than others.
UMA (Uniform Memory Access) machines: all shared memory locations take the same time to access.
Cluster: the term refers to connecting several few-processor shared memory multiprocessors, called clusters, using a communication network accessed by send and receive instructions. The shared memory of a cluster is private with respect to other clusters.
Characteristics of Shared memory multiprocessors:
Interprocessor communication is done in the memory interface by
read and write instructions.
Memory may be physically distributed, and reads and writes from
different processors may take different amounts of time and may
collide in the interconnection network.
Memory latency (time to complete a read or write) may be long and
variable.
Messages through the interconnecting switch are the size of single
memory words (or perhaps cache lines).
Randomization of requests (as by interleaving words across
memory modules) may be used to reduce the probability of
collision.
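As a small illustration of the interleaving idea, the sketch below (in C, with illustrative names not taken from the text) maps a word address to a module and an offset: the low-order address bits select the module, so consecutive words land in different modules and simultaneous requests are spread out.

    /* Sketch of word interleaving across nmodules memory modules. */
    typedef struct { unsigned module; unsigned offset; } mem_loc;

    mem_loc interleave(unsigned word_addr, unsigned nmodules) {
        mem_loc loc;
        loc.module = word_addr % nmodules;   /* low-order bits pick the module */
        loc.offset = word_addr / nmodules;   /* remaining bits pick the word   */
        return loc;
    }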
Characteristics of Message passing multiprocessors:
Interprocessor communication is done by software using data
transmission instructions (send and receive).
Read and write refer only to memory private to the processor
issuing them.
Data may be aggregated into long messages before being sent into
the interconnecting switch.
Large data transmissions may mask long and variable latency in the
communications network.
Global scheduling of communications can help avoid collisions
between long messages.
Distributed memory multiprocessors are characterized by their network topologies.
Both distributed and shared memory multiprocessors use an interconnection network. The distinctions are often in the details of the low-level switching protocol rather than in the high-level switch topology:
Indirect networks: often used in shared memory architectures; resources such as processors, memories, and I/O devices are attached externally to a switch that may have a complex internal structure of interconnected switching nodes.
Direct networks: more common in message passing architectures; they associate resources with the individual nodes of a switching topology.
Ring Topology
An N processor ring topology can take up to N/2 steps to transmit a
message from one processor to another (assuming bi-directional ring).
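As a small illustration, a C helper computing the number of hops between two nodes on such a bidirectional ring (illustrative code, not from the text): the message goes clockwise or counterclockwise, whichever is shorter, so the result is at most N/2.

    #include <stdlib.h>

    /* Minimum hops between nodes i and j on an n-node bidirectional ring. */
    int ring_distance(int i, int j, int n) {
        int d = abs(i - j) % n;
        return d < n - d ? d : n - d;   /* shorter of the two directions */
    }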
A rectangular mesh topology is also possible. An N-processor square mesh (√N by √N, without wraparound connections) can take up to 2(√N - 1) steps to transmit a message from one processor to another.
The hypercube architecture is another interconnection topology. Each processor connects directly with log2(N) others, whose indices are obtained by changing one bit of the binary index of the reference processor (Gray code). Up to log2(N) steps are needed to transmit a message between processors.
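A small C sketch of these two facts (illustrative names, not from the text): node i's neighbors are i XOR (1<<k) for each bit k, and the number of routing steps between i and j is the Hamming distance of their indices, at most log2(N).

    #include <stdio.h>

    /* Number of bit positions in which i and j differ. */
    int hamming(unsigned i, unsigned j) {
        unsigned x = i ^ j;
        int c = 0;
        while (x) { c += x & 1u; x >>= 1; }
        return c;
    }

    int main(void) {
        int d = 4;                 /* 4-dimensional hypercube, N = 16 */
        unsigned node = 5;         /* binary 0101 */
        for (int k = 0; k < d; k++)
            printf("neighbor of 5: %u\n", node ^ (1u << k));
        printf("steps from 5 to 10: %d\n", hamming(5, 10));  /* 0101 vs 1010 -> 4 */
        return 0;
    }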
Form of a four dimensional hypercube
Classification of real systems
Overview of the Cm* architecture, an early system.
Five clusters with ten PEs each were built. The Cm* system illustrates a mixture of shared and distributed memory ideas.
There are three answers to the question: Is Cm* a shared memory multiprocessor?
1. At the level of microcode in the Kmap, there are explicit send and receive instructions and message passing software → No, it is not shared memory.
2. At the level of the LSI-11 instruction set, the machine has shared memory. There are no send and receive instructions; any memory address can be accessed by any processor in the system → Yes, it is shared memory.
3. Two operating systems were built for the machine, STAROS and MEDUSA. The processes these operating systems supported could not share any memory; they communicated by making operating system calls to pass messages between processors → No, it is not shared memory.
The architecture of the Sequent Balance system (similar to the Symmetry and the Encore Multimax) illustrates another bus-based architecture.
Sequent Balance
System bus: an 80 Mbytes/second system bus linking CPUs, memory, and I/O processors. Data and 32-bit addresses are time-multiplexed on the bus; the sustained transfer rate is 53 Mbytes/second.
Multibus: provides access to standard peripherals.
SCSI: Small Computer System Interface; provides access to low-cost peripherals for entry-level configurations and for software distribution.
Ethernet: connects systems in a local area network.
Sequent Balance
Atomic Lock Memory (ALM):
User-accessible hardware locks are available to allow mutual exclusion on shared data structures.
There are 16K such hardware locks in a set; one or more sets can be installed in a machine, one per Multibus adapter board.
Each lock is a 32-bit double word. The least significant bit determines the state of the lock: locked (1) or unlocked (0).
Reading the lock returns the value of this bit and sets it to 1, thus locking the lock. Writing 0 to a lock unlocks it.
Locks can support a variety of synchronization techniques, including busy waits, counting/queuing semaphores, and barriers.
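A minimal busy-wait sketch in C, assuming a memory-mapped ALM lock word with exactly the read side effect described above; the type and function names are hypothetical, not part of the Sequent software.

    /* A read returns the old low bit and sets it, so a read that
       returns 0 means the lock was free and is now ours. */
    typedef volatile unsigned int alm_lock_t;

    void alm_acquire(alm_lock_t *lock) {
        while ((*lock & 1u) != 0u)   /* each read also sets the lock bit */
            ;                        /* spin until a read returns 0      */
    }

    void alm_release(alm_lock_t *lock) {
        *lock = 0u;                  /* writing 0 unlocks */
    }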
Alliant FX/8: designed to exploit parallelism found in scientific programs automatically.
Up to 8 processors called Computational Elements (CEs) and up to 12 Interactive Processors (IPs) shared a global memory of up to 256 Mbytes.
All accesses by CEs and IPs to the bus go through cache memory. There can be up to 512 Kbytes of cache shared by the CEs and up to 128 Kbytes of cache shared by the IPs; every 3 IPs share 32 Kbytes of cache.
The CEs are connected together directly through a concurrency control bus.
Each IP contains a Motorola 68000 CPU. IPs are used for interactive processes and I/O.
CEs have custom chips to support M68000 instructions, floating point instructions (Weitek processor chip), vector arithmetic instructions, and concurrency instructions.
The vector registers are 32 elements long for both integer and floating point types.
Programming Shared Memory Multiprocessors
Key Features needed to Program Shared memory MIMD Computers:
• Process Management:
– Fork/Join
– Create/Quit
– Parbegin/Parend
• Data Sharing:
– Shared Variables
– Private Variables
• Synchronization:
– Controlled-based:
» Critical Sections
» Barriers
– Data-based:
» Lock/Unlock
» Produce/Consume
In the introduction to the MIMD pseudo code we presented minimal extensions to sequential pseudo code for process management and data sharing. We saw:
Fork/Join for basic process management,
Shared/Private storage classes for data sharing by processes.
We will discuss these in more detail a little later. Another essential mechanism for programming shared memory multiprocessors is synchronization. Synchronization guarantees some relationship between the rates of progress of the parallel processes.
Let's demonstrate why synchronization is absolutely essential.
Example: assume the statement

    Sum = Sum + Psum

is being executed by n processes in a parallel program, where
    Sum:  shared variable, initially 0
    Psum: private variable.
Assume further that
    P1 calculates Psum = 10
    P2 calculates Psum = 3
Therefore, the final value of Sum must be 13.
At the assembly level, Pi's code is:

    load  Ri2, Sum      ; Ri2 ← Sum
    load  Ri1, Psum     ; Ri1 ← Psum
    add   Ri1, Ri2      ; Ri1 ← Ri1 + Ri2
    store Sum, Ri1      ; Sum ← Ri1

where Rix refers to register x of process i.
The following scenario is possible when two processes execute the
statement concurrently:
Synchronization operations can be divided into two basic classes:
• Control oriented: progress past some point in the program is controlled (critical sections).
• Data oriented: access to a data item is controlled by the state of the data item (Lock and Unlock).
An important concept is atomicity. The word atomic is used in the sense of invisible.
Definition: Let S be a set of processes and q be an operation, perhaps composite. q is atomic with respect to S iff for any process P ∈ S which shares variables with q, the state of these variables seen by P is either that before the start of q or that resulting from the completion of q. In other words, states internal to q are invisible to processes of S.
Synchronization Examples
Control Oriented Synchronization:
A critical section is a simple control oriented synchronization:

    Process 1             Process 2
    •••••••               •••••••
    Critical              Critical
       code body1            code body2
    End critical          End critical
Software Solution:
We first implement the critical section using software methods only.
These solutions are all based on the fact that read/write (load/ store) are
the atomic machine level (hardware) instructions available.
We must ensure that only one process at a time is allowed in the
critical section, and once a process is executing in its critical
section, no other process is allowed to enter the critical section.
We first present a solution for two-process execution only.
Shared variables:
    Var want-in: ARRAY[0..1] OF Boolean;
        turn: 0..1;
    Initially want-in[0] = want-in[1] = false, turn = 0.
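The slides do not show the algorithm itself; one standard two-process solution that uses exactly these variables is Peterson's algorithm. A minimal C sketch, using C11 atomics to stand in for the atomic loads and stores the software solution assumes:

    #include <stdatomic.h>
    #include <stdbool.h>

    atomic_bool want_in[2];    /* initially false, false */
    atomic_int  turn;          /* initially 0            */

    void enter_critical(int me) {           /* me is 0 or 1 */
        int other = 1 - me;
        atomic_store(&want_in[me], true);   /* announce interest            */
        atomic_store(&turn, other);         /* give the other one priority  */
        while (atomic_load(&want_in[other]) && atomic_load(&turn) == other)
            ;                               /* busy wait                    */
    }

    void exit_critical(int me) {
        atomic_store(&want_in[me], false);
    }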
Next, we present a software solution for N processes.
Bakery Algorithm (due to Leslie Lamport)
Definitions/Notation:
• Before a process enters its critical section, it receives a number. The process holding the smallest number is allowed to enter the critical section.
• Two processes Pi and Pj may receive the same number. In this case, if i < j then Pi is served first.
• The numbering scheme generates numbers in increasing order of enumeration, for example: 1, 2, 2, 3, 4, 4, 4, 5, 6, 7, 7, ...
• (A, B) < (C, D) if:
  1. A < C, or
  2. A = C and B < D.
Bakery Algorithm:
Shared data:
    VAR piknum: ARRAY[0..N-1] OF BOOLEAN;
        number: ARRAY[0..N-1] OF INTEGER;
    Initially piknum[i] = false and number[i] = 0, for i = 0, 1, ..., N-1.
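A C sketch of the bakery lock built on this data, with C11 atomics standing in for the assumed atomic word reads and writes; names follow the declarations above, and the fixed N is illustrative.

    #include <stdatomic.h>
    #include <stdbool.h>

    #define N 8
    atomic_bool piknum[N];   /* process i is currently picking a number */
    atomic_int  number[N];   /* 0 means not waiting                     */

    /* (A,i) < (B,j) ordering: smaller number first, ties broken by index. */
    static bool before(int a, int i, int b, int j) {
        return a < b || (a == b && i < j);
    }

    void bakery_lock(int i) {
        atomic_store(&piknum[i], true);
        int max = 0;
        for (int k = 0; k < N; k++) {
            int v = atomic_load(&number[k]);
            if (v > max) max = v;
        }
        atomic_store(&number[i], max + 1);     /* take the next ticket */
        atomic_store(&piknum[i], false);
        for (int j = 0; j < N; j++) {
            while (atomic_load(&piknum[j]))
                ;                              /* wait while j is picking   */
            while (atomic_load(&number[j]) != 0 &&
                   before(atomic_load(&number[j]), j,
                          atomic_load(&number[i]), i))
                ;                              /* wait for smaller tickets  */
        }
    }

    void bakery_unlock(int i) {
        atomic_store(&number[i], 0);
    }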
Hardware Solutions:
Most computers provide special instructions to ease the implementation of critical section code. In general, an instruction is needed that can read and modify the contents of a memory location in one cycle. These instructions, referred to as rmw (read-modify-write) instructions, can do more than just a read (load) or a write (store) in one memory cycle.
Test&Set -- is a machine-level instruction (implemented in hardware) that
can test and modify the contents of a word in one memory cycle.
Its operation can be described as follows:
Function Test&Set(Var v: Boolean): Boolean;
Begin
Test&Set := v;
v:= true;
End.
In other words Test&Set returns the old value of v and sets it to true
regardless of its previous value.
Swap is another such instruction. It swaps the contents of two memory locations in one memory cycle and is common in IBM computers. Its operation can be described as follows:
Procedure Swap(Var a, b : Boolean);
Var temp: Boolean;
Begin
temp:= a;
a:= b;
b:= temp;
End
Now we can implement the critical section entry and exit sections using
Test&Set and Swap instructions:
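As a rough stand-in for the figure that followed here, a C sketch of critical section entry and exit with Test&Set, using the GCC/Clang __atomic_test_and_set builtin as the hardware rmw instruction (a sketch, not the text's exact code):

    #include <stdbool.h>

    static volatile bool lock_var = false;      /* shared, initially false */

    void cs_enter(void) {
        /* spin while the old value was true (someone else holds it) */
        while (__atomic_test_and_set(&lock_var, __ATOMIC_ACQUIRE))
            ;
    }

    void cs_exit(void) {
        __atomic_clear(&lock_var, __ATOMIC_RELEASE);
    }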
The implementation with Swap requires the use of two variables,
one shared and one private:
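A corresponding C sketch with Swap, where __atomic_exchange_n stands in for the Swap instruction; lock_var is the shared variable and key is the private one, as the slide describes (illustrative code, not from the text):

    #include <stdbool.h>

    static volatile bool lock_var = false;          /* shared, initially false */

    void cs_enter_swap(void) {
        bool key = true;                            /* private */
        do {
            /* Swap(lock, key): key receives the old lock value */
            key = __atomic_exchange_n(&lock_var, key, __ATOMIC_ACQUIRE);
        } while (key == true);                      /* repeat until we swapped out a false */
    }

    void cs_exit_swap(void) {
        __atomic_store_n(&lock_var, false, __ATOMIC_RELEASE);
    }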
The above implementations suffer from busy waiting. That is, while a process is in its critical section, the other processes attempting to enter their critical sections are waiting in either the While loop (Test&Set case) or the Repeat loop (Swap case).
The amount of busy waiting by processes is proportional to the number of processes executing the critical section and to the length of the critical section.
When fine-grain parallelism is used, then busy-waiting of
processes may be the best performance solution. However, if
most programs are designed with coarse-grain parallelism in
mind, then busy-waiting becomes very costly in terms of
performance and machine resources. The contention
problems resulting from busy-waiting of other processes, will
result in degraded performance of even the process that is
executing in its critical section.
Semaphores are one way to deal with cases involving potentially large amounts of busy waiting; use of semaphore operations can limit the amount of busy waiting.
Definition: A semaphore S is a
• shared integer variable that
• can only be accessed through two indivisible operations, P(S) and V(S):

    P(S):  S := S - 1;
           If S < 0 Then Block(S);

    V(S):  S := S + 1;
           If S ≤ 0 Then Wakeup(S);

• Block(S) results in the suspension of the process invoking it.
• Wakeup(S) results in the resumption of exactly one process that has previously invoked Block(S).
Note: P and V are executed atomically.
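A C sketch of P and V with exactly these blocking semantics (a negative S counts blocked processes), using a POSIX mutex and condition variable to provide the Block and Wakeup behavior; this is an illustrative sketch, not the text's implementation.

    #include <pthread.h>

    typedef struct {
        int             s;        /* semaphore value                 */
        int             wakeups;  /* pending Wakeup(S) notifications */
        pthread_mutex_t m;
        pthread_cond_t  c;
    } sem_ctr;

    void P(sem_ctr *x) {
        pthread_mutex_lock(&x->m);
        x->s -= 1;
        if (x->s < 0) {                        /* Block(S) */
            while (x->wakeups == 0)            /* guards spurious wakeups */
                pthread_cond_wait(&x->c, &x->m);
            x->wakeups -= 1;
        }
        pthread_mutex_unlock(&x->m);
    }

    void V(sem_ctr *x) {
        pthread_mutex_lock(&x->m);
        x->s += 1;
        if (x->s <= 0) {                       /* Wakeup(S): resume one waiter */
            x->wakeups += 1;
            pthread_cond_signal(&x->c);
        }
        pthread_mutex_unlock(&x->m);
    }

    /* Example: a binary semaphore usable as mutex in the next slide. */
    sem_ctr mutex_sem = { 1, 0, PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER };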
Given the above definition, the critical section entry and exit can be
implemented using a semaphore as follows:
Shared Var mutex : Semaphore;
Initially mutex = 1
P(mutex);
•••
Critical section
V(mutex);
Semaphores are implemented using machine-level instructions such as Test&Set or Swap.
Shared Var lock : Boolean
Initially lock = false

• P(S):
    While Test&Set(lock) Do { };
    Begin
       S := S - 1;
       If S < 0 Then Block process;
       lock := false;
    End

• V(S):
    While Test&Set(lock) Do { };
    Begin
       S := S + 1;
       If S ≤ 0 Then Make a suspended process ready;
       lock := false;
    End
Problem:
Implement the semaphore operations
using the Swap instruction.
Shared Var lock
Initially lock = false
Private Var key

• P(S):
    key := true;
    Repeat
       Swap(lock, key);
    Until key = false;
    S := S - 1;
    If S < 0 Then Block process;
    lock := false
Data Oriented Synchronization:
• LOCK L: if lock L is set then wait; if it is clear, set it and proceed.
• UNLOCK L: unconditionally clear lock L.
Using Test&Set, LOCK and UNLOCK correspond to the following:
    LOCK L:    Repeat y = Test&Set(L) Until y = 0
    UNLOCK L:  L = 0
Relationship between locks and critical sections:
Critical sections are more like locks if we consider named critical sections. Execution inside a named critical section excludes simultaneous execution inside any other critical section of the same name; however, processes may be executing concurrently in critical sections of different names.
A simple correspondence between locks and critical sections is:

    Critical max              LOCK max
       critical code             critical code
    End critical              UNLOCK max

Both synchronizations are used to solve the mutual exclusion problem. However, locks are more general than critical sections, since UNLOCK does not have to appear in the same process as LOCK.
Asynchronous Variables:
A second type of data oriented synchronization. These variables have both a value and a state, which is either full or empty.
Asynchronous variables are accessed by two principal atomic operations:
• Produce: wait for the state of the asynchronous variable to be empty, write a value, and set the state to full.
• Consume: wait for the state of the asynchronous variable to be full, read the value, and set the state to empty.
A more complete set of operations on asynchronous variables is provided by the Force parallel programming language:
    Produce  asynch var = expression
    Consume  private var = asynch var
    Copy     private var = asynch var    (wait for full, read the value, don't change the state)
    Void     asynch var                  (initialize the state to empty)
Asynchronous variables can be implemented in terms of critical sections.
Represent an asynchronous variable by the following data structure:
    V:  value
    Vf: state (true corresponds to full, false corresponds to empty)
    Vn: name
Pseudo code to implement the Produce operation is:
    1.  L: Critical Vn
    2.        privf := Vf
    3.        If not(privf) Then
    4.        Begin
    5.           V := value-of-expression;
    6.           Vf := true;
    7.        End;
    8.     End critical
    9.     If privf Then goto L;
Note: the private variable privf is used to obtain a copy of the shared state Vf of the asynchronous variable before the process attempts to perform the Produce operation. This way, if the test in statement 3 reveals that the state is full, the process returns to statement 1 and tries again.
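A C sketch of an asynchronous variable along these lines, using a POSIX mutex and condition variable in place of the named critical section and the retry loop (so waits block rather than spin); the names are illustrative, and a matching Consume is shown alongside the Produce.

    #include <pthread.h>

    typedef struct {
        double          value;   /* V  */
        int             full;    /* Vf: 1 = full, 0 = empty */
        pthread_mutex_t m;
        pthread_cond_t  changed;
    } asynch_var;

    void produce(asynch_var *v, double x) {
        pthread_mutex_lock(&v->m);
        while (v->full)                      /* wait for empty */
            pthread_cond_wait(&v->changed, &v->m);
        v->value = x;
        v->full = 1;                         /* set state to full */
        pthread_cond_broadcast(&v->changed);
        pthread_mutex_unlock(&v->m);
    }

    double consume(asynch_var *v) {
        double x;
        pthread_mutex_lock(&v->m);
        while (!v->full)                     /* wait for full */
            pthread_cond_wait(&v->changed, &v->m);
        x = v->value;
        v->full = 0;                         /* set state to empty */
        pthread_cond_broadcast(&v->changed);
        pthread_mutex_unlock(&v->m);
        return x;
    }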
Problem (4-11):
A multiprocessor supports synchronization with lock/unlock hardware.
The primitives are represented at the compiler language level by two
subroutine calls, lock(q) and unlock(q). The lock(q) operation waits
for the lock to be clear, sets it and returns, while unlock(q) clears the
lock unconditionally. It is desired to implement produce and consume
on full/empty variables where produce(x,v) waits for x empty, writes it
with value v and sets full while consume(x,v) waits for x full, copies its
value to v and sets x to empty.
Using sequential pseudo code extended by the two operators lock(q)
and unlock(q), write code sections which implement produce(x,v)
and consume(x,v) on an asynchronous variable x and normal
variable v. Carefully describe your representation for an asynchronous
variable. No synchronization operations other than lock and unlock
may be used.
First Solution
We will represent the asynchronous variable X by a record of three items: the value of X, a boolean full flag, and a unique lock for the variable X.

    record X { value : real ;
               full  : boolean ;
               l     : lock }

Produce and consume can then be implemented as follows:

    procedure produce(X,V) {
    R : lock(X.l) ;
        if X.full then
           { unlock(X.l) ;
             goto R }
        else
           { X.full := true ;
             X.value := V ;
             unlock(X.l) }
    }
    procedure consume(X,V) {
    R : lock(X.l) ;
        if not X.full then
           { unlock(X.l) ;
             goto R }
        else
           { V := X.value ;
             X.full := false ;
             unlock(X.l) }
    }
(Alternate Solution)
A simpler solution is possible by using only the lock operation to do a wait, but only if it is recognized that one lock is not enough. Two different conditions are being waited for in produce and consume, so two locks are needed.

    record X { value : real ;
               f : lock ;
               e : lock }

    procedure produce(X,V) {
        lock(X.f) ;
        X.value := V ;
        unlock(X.e)
    }

    procedure consume(X,V) {
        lock(X.e) ;
        V := X.value ;
        unlock(X.f)
    }

    State     X.f   X.e
    full       1     0
    empty      0     1
    locked     1     1
    unused     0     0
Control oriented: Barrier Implementation:
Initial state and values:
    unlock (barlock)
    lock (barwit)
    barcnt = 0

    lock (barlock)
    if (barcnt < NP-1) then
       barcnt := barcnt + 1 ;
       unlock (barlock) ;
       lock (barwit) ;
    endif ;
    if (barcnt = NP-1) then
       ...
       code body
    endif ;
    if (barcnt = 0) then
       unlock (barlock) ;
    else
       barcnt := barcnt - 1 ;
       unlock (barwit) ;
    endif

All processes except the last increment the counter and wait at the lock(barwit). The last process executes the code body and unblocks barwit. On the way out, all processes except the last decrement the counter and unlock barwit; the last process unlocks barlock to leave the correct state for the next barrier execution.
Alternatives for Process Management:
Different systems provide alternative ways to fork and join processes. Some common alternatives are outlined below:
• Instead of Fork Label, the Unix fork gives two identical processes returning from the fork call, with the return value being the only distinction between the two processes.
• The join operation may combine process management and synchronization. New processes can simply quit when finished, and some other operation may be used to synchronize with their completion. Therefore, if processes are to wait for other processes to complete before they can quit, we may use a barrier synchronization before the join.
• Parameter passing can be included in a fork using a create statement (as was done in the HEP multiprocessor). The create statement is similar to a subroutine call, except that a new process executes the subroutine in parallel with the main program, which continues immediately.
Multiprocessors provide some basic synchronization and process management tools (machine dependent). On these machines, a sequential language must be extended so that parallel programming becomes possible.
Fortran, for example, can be extended by a coherent set of synchronization and process management primitives to allow for parallel programming.
Let's use:
CREATE subr (A, B, C, ...) starts a subroutine in parallel with the main program; parameters are passed by reference.
RETURN in a created subroutine means quit, while RETURN in a called subroutine means a normal return to the calling program.
Parameter passing to parallel programs has some pitfalls. Consider the following:

      Do 10 I = 1, N-1
10       CREATE sub(I)

The intent is to assign an index value to each of the created processes. Problem?
By the time the subroutine gets around to reading I, it will have changed!
In parallel processing, call-by-reference and call-by-value are not the same, even for read-only parameters. To come up with a solution, remember that neither the subroutine nor the program may change the parameter during parallel execution.
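The same pitfall appears with POSIX threads when every worker is handed the address of the single loop variable; the sketch below (illustrative, not from the slides) demonstrates the bug, and the fix is to give each thread its own copy of the value, which is the idea of the next slide.

    #include <pthread.h>
    #include <stdio.h>

    static void *sub(void *arg) {
        int i = *(int *)arg;     /* may no longer be the intended index */
        printf("worker got index %d\n", i);
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        int i;
        for (i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, sub, &i);  /* WRONG: all workers share &i */
        for (i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        return 0;
    }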
Next we show how to pass an argument by value to a
created subroutine:
The main program:

      Asynch Integer II
      Private Integer I
      Shared Integer N
      Void II
      • • • • •
      Do 10 I = 1, N-1
         Produce II = I
10       CREATE sub(II)

The created subroutine:

      Subroutine sub(II)
      Private Integer IP
      Consume IP = II
      • • • • •
Implementing an N-way fork and join
Assume integers IC, N and logical DONE are in Shared Common; the main program executes the code:

      Void IC
      Void DONE
      Produce IC = N
      Do 10 I = 1, N-1
10       CREATE proc(...)             forks N streams of processes
      CALL proc(...)                  calling process continues
      • • • • •
C     The process doing the forking returns here and does
C     part of the join operation.
      Consume F = DONE                DONE was voided initially
The rest of the join operation is done at the end of the subroutine proc. At the end of proc, processes will execute the following:

      Consume J = IC              IC was initialized to N;
      J = J - 1                   decrement the number of processes
      If (J .ne. 0) Then          in the critical section.
         Produce IC = J
         RETURN                   quit if it was a created process,
      Endif                       return if called.
      Produce DONE = .true.       the last process executes the
      RETURN                      last two statements.
Single Program, Multiple Data, or SPMD
We will decouple process creation from parallel program code.
Processes can be created at the beginning of the program, execute a
common set of code that distributes work in the parallel loop by either
prescheduling or self scheduling, and terminate at the end of the program.
Multiple processes, each with a unique process identifier, 0 ≤ id ≤ P-1, execute a single program simultaneously but not necessarily synchronously.
Private data may cause processes to execute if-then-else statements
differently or to execute a loop a different number of times.
The SPMD style of programming is almost the only choice for managing
many processes in a so-called massively parallel processor (MPP) with
hundreds or thousands of processors.
Process creation for a SPMD program
P: shared, the number of processes.
id: private, a unique process identifier available to all the processes executing the parallel main program (parmain); id is passed by value.
Processes may be required to synchronize before ending parmain, or exit may be able to wait automatically for processes that have not finished (join).

    shared P;
    private id;
    for id := 0 step 1 until P-2
       create parmain(id, P);
    id := P-1;
    call parmain(id, P);
    call exit();

P and id make up a parallel environment in which the MIMD processes execute.
SPMD program for the recurrence solver
procedure parmain(value id, P)
shared P, n, a[n, n], x[n], c[n];
private id, i, j, sum, priv;
forall i := 1 step 1 until n
void x[i];
barrier;
forall i := 1 step 1 until n
begin
sum := c[i];
for j := 1 step 1 until i-1
{copy x[j] into priv;
sum := sum + a[i, j]*priv;}
produce x[i] := sum;
end
barrier;
 code to use x[] 
end procedure
Note: in the forall, no value of i should be assigned to a process before all preceding values have been assigned; this prevents infinite waits at the copy operation. Some use doacross for this purpose and doall for completely independent body instances (the forall here).
Work distribution
Parallel regions are most often tied to a forall construct, which indicates that all instances of a loop body for different index values can be executed in parallel. The potential parallelism is equal to the number of values of the loop index (N) and is usually much larger than the number of processors (P), or processes if there is time multiplexing, used to execute the parallel region.
Prescheduled loop code for an individual process
Block mapping
shared lwr, stp, upr, np;
private i, lb, ub, me;
/* Compute private lower and upper bounds from lwr, upr,
stp, process number me and number np of processes.*/
for i := lb step stp until ub
loop body(i);
Prescheduled loop code for an individual process
Cyclic mapping
shared lwr, stp, upr, np;
private i, me;
for i := lwr + me*stp step np*stp until upr
   loop body(i);
SPMD code for one of np processes, having identifier me, executing its portion of the prescheduled loop
    forall i := lwr step stp until upr
Block mapping requires computing lower and upper bounds for each process.
self-scheduling code for one process executing the same forall
shared lwr, stp, upr, np, isync;
private i;
barrier
void isync;
produce isync := lwr;
end barrier
while (true) do
begin
consume isync into i;
if (i > upr) then
{produce isync := i;
break;}/* End while loop */
else
{produce isync := i + stp;
loop body(i);}
end
Parallelizing a simple imperfect loop nest
Serial imperfect two loop nest
for i := 0 step 1 until n-1
begin
s := f(i);
for j := 0 step 1 until m-1
loop body(i, j, s);
end
Split into parallel perfect nests
forall i := 0 step 1 until n-1
s[i] := f(i);
forall k := 0 step 1 until m*n-1
i := k/m;
j := k mod m;
loop body(i, j, s[i]);
Adaptive Quadrature Integration Method
A simple example to show the dynamic scheduling concept, where the amount of work to be done depends on the outcome of the ongoing computation.
In this method two basic operations are needed:
First, the integral of f is approximated on the interval (a, b); then the error in the approximation on (a, b) is estimated.
(Figure: the integral of f over the interval (a, b).)
Sequential procedure for this integration can be described with the following steps:
1) Apply the approximation to the interval: approx(a, b, f).
2) Apply the error estimate: accurate(a, b, f).
3) a. If the error is small enough, add the contribution to the integral.
   b. If not, split the interval in two and recursively do each half.
4) Return from this recursion level.
To parallelize this procedure, step 3 can be revised as follows:
3)
a. If the error is small enough cooperatively add to
the integral and quit.
b. If not, split interval into two, create a process to
do one half, and do the other half.
Therefore, one process starts the integration, and every time the interval is split a new process is created.
The unlimited recursive creation of processes would produce a breadth-first expansion of an exponentially large problem, and is unnecessary. Even with virtual processes, no parallel system will execute this approach efficiently.
The way to implement the adaptive quadrature integration efficiently is to allow a single process to be responsible for integrating a subinterval. Define two intervals for a process:
Interval (a, b), for which the process is responsible for computing an answer.
Interval (au, bu), which is the currently active subinterval.
A high level description of the algorithm can now be presented.
0. Initialize (au, bu) to (a, b) .
1. Apply approximation to (au, bu) .
2. Estimate the error on (au, bu) .
3.
a. If the result is accurate add it to the total integral,
report process free and quit.
b. If not accurate enough, split (au,bu) and make the
left half the new active interval.
4. Assign a free process, if there is one, to integrate a
remaining inactive interval.
5. Go to step 1.
approx(a, b, f): returns an approximation to the integral of f over the interval (a, b).
accurate(a, b, f): returns true if the approximation is accurate enough and false otherwise.
workready(): returns true if the work list is not empty and false otherwise.
getwork(task): returns true and a task from the work list if the list is not empty, and false otherwise.
putwork(task): puts a new task on the work list, returning false if the list was full and true otherwise.
task is a structure consisting of the interval endpoints, task.a and task.b, and the function, task.f.
Initially, the program starts with one task, (a, b, f), on the shared work list, and idle = P, the number of idle processors (shared).
Other variables:
integral: result of the integration of function f (shared)
more: true if the work list is not empty (private)
t: approximate integral over an interval (private)
ok: true if the approximation is accurate enough (private)
cent: midpoint for interval splitting (private)
task, task1, task2: the current task and the two tasks (1 and 2) resulting from a task split (private)

    shared P, idle, integral;
    private more, t, ok, cent, task, task1, task2;
    while (true) begin
       critical work;
          more := workready() or (idle < P);
          if (more) then idle := idle - 1;
       end critical;
       if (not more) then break;
       while (getwork(task)) begin
          t := approx(task.a, task.b, task.f);
          ok := accurate(task.a, task.b, task.f);
          if (ok) then
             critical int;
                integral := integral + t;
             end critical;
          else begin
             cent := (task.a + task.b)/2.0;
             task1.a := task.a;  task1.b := cent;
             task2.a := cent;    task2.b := task.b;
             task1.f := task2.f := task.f;
             if (not putwork(task1) or not putwork(task2)) then
                Report no room in task set.;
          end;
       end; /* of while loop over available tasks */
       critical work;
          idle := idle + 1;
       end critical;
    end; /* of while loop over available tasks or active processes */
The inner while loop terminates when there is no more work in the list,
but there is a chance more work will be added if any processes are still
executing tasks,
so the outer while does not terminate until all processes have failed to
find more work to do (break statement).
The order of putting work onto, and taking it from, the list is important.
This is similar to traversing a binary tree constructed as intervals are
split in two.
Managing the work list in last-in, first-out order gives depth-first
traversal, while using first-in, first-out order gives breadth-first traversal.
Breadth-first order adds more and more tasks to the work list, generating new work much faster than work is completed until the end of the computation nears; this is not desirable.
Properly managed, the framework of a set of processes cooperatively
accessing a shared work list can be a very effective form of dynamic
scheduling.
OpenMP
• A language extension (API). Extensions exist for C, C++, and Fortran.
• OpenMP constructs are limited to compiler directives (!$OMP, which are comments in the base language) and library subroutine calls, so OpenMP programs also correspond to legal programs in the base language.
OpenMP has
• one process management construct,
the parallel region,
• three parallel variable scopes,
shared, private, and threadprivate,
• four work distribution constructs,
loops, sections, single execution, and master
execution,
• six synchronization methods,
locks, critical sections, barriers, atomic update,
ordered sections, and flush.
This moderate sized set of constructs is sufficient for simple
parallelization of sequential programs, but may present
challenges for more complex parallel programs.
The OpenMP Fortran Applications Program Interface (API)
Parallelism is introduced into an OpenMP program by the
parallel region construct
!$OMP PARALLEL [clause[[,] clause ]
block
!$OMP END PARALLEL
block is a single entry, single exit group of statements
executed by all threads of the team created on entry to the
parallel region in SPMD fashion.
Branching into/out of the block is illegal, except for
subroutine or function calls, but different threads may follow
different paths through the block.
The number of threads, num_threads, is set by calling
SUBROUTINE OMP_SET_NUM_THREADS(integer)
where the integer argument may be an expression.
The number of running threads is returned by
INTEGER FUNCTION OMP_GET_NUM_THREADS()
Each thread gets its own unique integer between 0 and
num_threads - 1 by calling
INTEGER FUNCTION OMP_GET_THREAD_NUM()
The integer 0 is always assigned to the master thread.
The parallel scope of variables inside a parallel region is specified by
clauses attached to the !$OMP PARALLEL directive,
except that threadprivate common blocks are specified by a directive.
A list of variables or labeled common blocks can be specified as
private by the clause
PRIVATE(list)
or shared by the clause
SHARED(list)
All copies of private variables disappear at the end of the parallel
region except for the master thread’s copy.
A more complex parallel scope specification is that for
reduction variables
REDUCTION({operator | intrinsic}: list)
A variable X in list will appear inside the parallel region only
in reduction statements of the form
X = X operator expression
or
X = intrinsic(X, expression)
operator may be +, *, -, .AND., .OR., .EQV., or .NEQV., and intrinsic may be MAX, MIN, IAND, IOR, or IEOR. The variable has shared scope.
A Simple Example

      PROGRAM MAIN
      INTEGER K
      INTEGER OMP_GET_THREAD_NUM
      REAL A(10), X
      CALL INPUT(A)
      X = 0.0
      CALL OMP_SET_NUM_THREADS(10)
!$OMP PARALLEL SHARED(A) PRIVATE(K) REDUCTION(+:X)
      K = OMP_GET_THREAD_NUM()
      X = X + A(K+1)
!$OMP END PARALLEL
      PRINT *, 'Sum of As: ', X
      STOP
      END
Constructs of the OpenMP Fortran API
Thread management constructs in OpenMP Fortran
Thread management constructs in OpenMP Fortran
Thread management constructs in OpenMP Fortran
Run-time library routines
SUBROUTINE OMP_SET_NUM_THREADS(integer)
Set number or max number of threads.
INTEGER FUNCTION OMP_GET_NUM_THREADS()
Return number of threads in use
INTEGER FUNCTION OMP_GET_MAX_THREADS()
Return max number of threads
INTEGER FUNCTION OMP_GET_THREAD_NUM()
Return number of the calling thread
INTEGER FUNCTION OMP_GET_NUM_PROCS()
Return number of processors available.
Thread management constructs in OpenMP Fortran
LOGICAL FUNCTION OMP_IN_PARALLEL()
True if called from dynamic extent of a parallel region.
SUBROUTINE OMP_SET_DYNAMIC(logical)
Allow (true) or disallow (false) dynamic change in number of threads.
LOGICAL FUNCTION OMP_GET_DYNAMIC()
Return true or false setting of dynamic
SUBROUTINE OMP_SET_NESTED(logical)
Allow nested parallelism or not
LOGICAL FUNCTION OMP_GET_NESTED()
Return true if nested parallelism allowed
Parallel data scope specification in OpenMP Fortran
Directive
!$OMP THREADPRIVATE(/cb/[, /cb/]) : specifies that previously declared common blocks cb are private and persist across parallel regions.
Clauses
PRIVATE(list) : variables or common blocks in list are private in the block introduced by the directive.
SHARED(list) : variables or common blocks in list are shared.
DEFAULT(PRIVATE|SHARED|NONE) : default scope for all variables in the block.
FIRSTPRIVATE(list) : private variables in list are initialized to their values on entry to the block.
LASTPRIVATE(list) : the values of variables on list are set to the values written by the last iteration or section.
REDUCTION({operator | intrinsic}: list) : reduce across threads by the specified operation.
COPYIN(list) : initialize threadprivate variables or common blocks on list to the master's value on entry.
OpenMP Fortran work distribution constructs
Allowed clauses: PRIVATE, FIRSTPRIVATE, LASTPRIVATE, REDUCTION,
SCHEDULE, ORDERED
OpenMP Fortran work distribution constructs
setenv OMP_SCHEDULE "STATIC, 10"
Prescheduling with a chunk size of 10 iterations.
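For comparison, the same prescheduled distribution written directly in source form, shown here in OpenMP C for brevity (the loop and names are illustrative, not from the slides): each thread receives chunks of 10 consecutive iterations.

    #include <omp.h>

    void scale(double *a, int n) {
        #pragma omp parallel for schedule(static, 10)
        for (int i = 0; i < n; i++)
            a[i] = 2.0 * a[i];        /* iterations dealt out 10 at a time */
    }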
OpenMP Fortran work distribution constructs
Allowed clauses: PRIVATE, FIRSTPRIVATE, LASTPRIVATE, REDUCTION,
OpenMP Fortran work distribution constructs
Allowed clauses: PRIVATE, FIRSTPRIVATE,
OpenMP Fortran work distribution constructs
OpenMP Fortran synchronization constructs
OpenMP Fortran synchronization constructs
Implicit in: BARRIER, CRITICAL, END CRITICAL, END DO, END SECTIONS,
END PARALLEL, END SINGLE, ORDERED, END ORDERED (unless
NOWAIT)
OpenMP Fortran synchronization constructs
OpenMP Fortran synchronization constructs
Run-time library routines
SUBROUTINE OMP_INIT_LOCK(var)
Create and initialize lock with name var
SUBROUTINE OMP_DESTROY_LOCK(var)
Destroy lock var, where var is type integer
SUBROUTINE OMP_SET_LOCK(var)
Wait until lock var is unset, then set it
SUBROUTINE OMP_UNSET_LOCK(var) Release lock var owned by this thread.
LOGICAL FUNCTION OMP_TEST_LOCK(var)
Attempt to set lock var; return .TRUE.
on success and .FALSE. on failure.
OpenMP Fortran combined process management and work distribution
Allowed clauses: PRIVATE, SHARED, DEFAULT, FIRSTPRIVATE,
LASTPRIVATE, REDUCTION, SCHEDULE, ORDERED, IF, COPYIN
OpenMP Fortran combined process management and work distribution
Allowed clauses: PRIVATE, SHARED, DEFAULT, FIRSTPRIVATE,
LASTPRIVATE, REDUCTION, IF, COPYIN
An OpenMP Example
A particle dynamics program computes position p, velocity v, acceleration a, force f, and the potential and kinetic energies pot and kin.
The n particles are represented by four 3-by-n arrays: the first dimension is the x, y, z space dimension, and the second is the particle index.
The time stepping loop central to the main particle dynamics program is:

      do i = 1, nsteps
         call compute(n, pos, vel, mass, f, pot, kin)
         call update(n, pos, vel, f, a, mass, dt)
      enddo
Body of the compute subroutine:

      pot = 0.0
      kin = 0.0
!$OMP PARALLEL DO &
!$OMP& DEFAULT(SHARED) &
!$OMP& PRIVATE(i, j, k, rij, d) &
!$OMP& REDUCTION(+: pot, kin)
do i = 1, n
do k = 1, 3
f(k, i) = 0.0
enddo
do j = 1, n
if (i .ne. j) then
call dist(pos(1, i), pos(1, j), rij, d)
pot = pot + 0.5*v(d)
do k = 1, 3
f(k, i) = f(k, i) - rij(k)*dv(d)/d
enddo
endif
enddo
kin = kin + dot(vel(1, i), vel(1, i))
enddo
!$OMP END PARALLEL DO
kin = kin*0.5*mass
Body of the update subroutine
!$OMP PARALLEL DO&
!$OMP& DEFAULT(SHARED)&
!$OMP& PRIVATE(i, k)
do i = 1, n
do k = 1, 3
pos(k,i) = pos(k,i) + vel(k,i)*dt + 0.5*dt*dt*a(k,i)
vel(k,i) = vel(k,i) + 0.5*dt*(f(k,i)/mass + a(k,i))
a(k,i) = f(k,i)/mass
enddo
enddo
!$OMP END PARALLEL DO
This is a minimal change to the program, but threads are created and terminated twice for each time step. If process management overhead is large, this can slow the program significantly. A better method is described in the text (see Program 4-13). We will present the idea of global parallelism by introducing the Force parallel language.
Development of Parallel Programming Languages
Issues to consider:
• Explicit vs. Implicit Parallelism
• Efficiency of Parallel Constructs
• Portability of Programs to Multiprocessors of the
Same Type
Main Features of the Force:
• Parallel constructs as extensions to Fortran.
• SPMD: a single program executed by many processes.
• Global parallelism: parallel execution is the norm; sequential execution must be explicitly specified.
• Arbitrary number of processes: execution with one process is fine; it allows program logic bugs to be separated from synchronization bugs.
Main Features of the Force, continued:
• Process management is suppressed: there are no Fork, Join, Create or Kill operations. Only the rate of progress is influenced by synchronizations.
• All processes are identical in their capability.
• Data is either private to one process or uniformly shared by all processes.
Main Features of the Force, continued:
• Generic synchronizations: synchronization operations do not identify specific processes. They use quantifiers such as all, none, or only one, or the state of a variable.
• The Force supports both fine and coarse grain parallelism.
• The Force is designed as a two-level macro processor. The parallel construct macros are machine independent and are built on top of a handful of machine dependent macros. To port the Force to a new platform, only a few macros need to be rewritten.
Force has been ported to:
HEP,
Flex/32,
Encore Multimax,
Sequent Balance and Symmetry,
Alliant Fx8,
Convex,
IBM 3090,
Cray multiprocessors,
KSR-1
c  Establish the number of processes requested by the execute
c  command, and set up the parallel environment.
      Force PROC of NP ident ME
c
      Shared real X(1000), GMAX
      Private real PMAX
      Private integer I
      End Declarations
c
c     ********* Strictly sequential *********
      Barrier
         read(*,*) (X(I), I=1,1000)
         GMAX = 0.0
      End barrier
c     ********* Replicated sequential *********
      PMAX = 0.0
c     ********* Multiple sequential *********
      Presched Do 10 I = 1, 1000
         If (X(I) .gt. PMAX) PMAX = X(I)
10    End presched do
c     ********* Strictly sequential *********
      Critical MAX
         if (PMAX .gt. GMAX) GMAX = PMAX
      End critical
c
      Barrier
         write(*,*) GMAX
      End barrier
c
      Join
      End
Compilation of Force programs in Unix environment
Compilation of Force programs in a Unix environment
1. The SED pattern matcher translates the Force statements into parameterized function macros.
2. The M4 macro processor replaces the function macros with Fortran code and the specific parallel programming extensions of the target Fortran language. The underlying Fortran language must include mechanisms for process creation, memory sharing among processes, and synchronized access to shared variables.
3. The Fortran compiler translates the expanded code and links it with the Force driver and multiprocessing libraries.
4. The Force driver is responsible for creating processes, setting up the Force parallel environment, and shared memory.
Program Structure:
Force <name> of <NP> ident <process id>
<declaration of variables>
[Externf <Force module name>]
End declarations
<Force program>
Join
Forcesub <name> ([parameters]) of <NP> ident <process id>
<declarations>
[Externf <Force module name>]
End declarations
<subroutine body>
Return
Forcecall <name>([parameters])
Variable Declarations (6 Types):
Private <Fortran type> <variable list>
Private Common /<label>/ <Fortran type> <variable list>
Shared <Fortran type> <variable list>
Shared Common /<label>/ <Fortran type> <variable list>
Async <Fortran type> <variable list>
Async Common /<label>/ <Fortran type> <variable list>
Work Distribution Macros (9 constructs, 2 commonly used):

      Presched Do <n> <var> = <i1>, <i2> [,<i3>]
         <loop body>
<n>   End Presched Do

Prescheduled DOALLs require no synchronization. Each process computes its own index values; best performance is achieved when the amount of work for each index value is the same. Indices are allocated in a fixed way at compile time.
Cyclic mapping:
    Shared i1, i2, i3, np;
    Private var, me;
    for var := i1 + me*i3 step np*i3 until i2
       loop body
      VPresched Do <n> <var> = <i1>, <i2> [,<i3>, <b2>]
         <loop body>
<n>   End Presched Do

Block mapping:
    Shared i1, i2, i3, np;
    Private var, lb, ub, me;
    /* Compute private lower & upper bounds (lb, ub) from
       i1, i2, i3, me and np */
    for var := lb step i3 until ub
       <loop body>
      Selfsched Do <n> <var> = <i1>, <i2> [,<i3>]
         <loop body>
<n>   End Selfsched Do

      Pre2do <n> <var1> = <i1>, <i2> [,<i3>];
             <var2> = <j1>, <j2> [,<j3>]
         <doubly indexed loop body>
<n>   End Presched Do

Selfscheduled DOALLs adapt to a varying workload. Synchronized access to a shared index is needed; each process obtains the next index value by incrementing a shared variable at run time.
self-scheduling code for one process executing the same forall
shared lwr, stp, upr, np, isync;
private i;
barrier
void isync;
produce isync := lwr;
end barrier
while (true) do
begin
consume isync into i;
if (i > upr) then
{produce isync := i;
break;}/* End while loop */
else
{produce isync := i + stp;
loop body(i);}
end
Work Distribution Macros, continued

      Self2do <n> <var1> = <i1>, <i2> [,<i3>];
              <var2> = <j1>, <j2> [,<j3>]
         <doubly indexed loop body>
<n>   End Selfsched Do

      Askfor Do <n> Init = <i>         /* # of initial work units */
         <loop body>
         Critical <var>
            More work <j>              /* add j work units to the loop */
            <put work in data structure>
         End critical
         <loop body>
<n>   End Askfor Do
Work Distribution Macros, continued

      Pcase on <var>
         <code block>
      [Usect]
         <code block>
      [Csect (<condition>)]
         .
         .
      End Pcase

      Scase on <var>
         <code block>
      [Usect]
         <code block>
      [Csect (<condition>)]
         .
         .
      End Scase
Resolve into <name>
Component <name> strength <number or var>
.
.
Component <name> strength <number or var>
Unify
Synchronization:
Consume <async var> into <variable>
Produce <async var> = <expression>
Copy <async var> into <variable>
Void <async var>
Isfull (<async var>)
Critical <lock var>
<code block>
End critical
Barrier
<code block>
End barrier
Consume: waits for the state of the variable to be full, reads the value into a private variable, and sets the state to empty.
Produce: waits for the state of the variable to be empty, sets its value to the expression, and sets the state to full.
Copy: waits for the asynchronous variable to become full and copies its value into a private variable, but leaves its state full.
Void: sets the state of its asynchronous variable to empty regardless of its previous state.
Isfull: returns a boolean value indicating whether the state of an asynchronous variable is full or empty.
Synchronization:
•
Data oriented:
procedure produce(x, expr);
shared struct {real var; boolean f} x;
private ok;
ok := false;
repeat
critical
if (not x.f) then
{ok := true;
x.f := true;
x.var := expr;};
end critical;
until ok;
end procedure;
Control oriented: Barrier Implementation:
Initial state and values:
    unlock (barlock)
    lock (barwit)
    barcnt = 0

    lock (barlock)
    if (barcnt < NP-1) then
       barcnt := barcnt + 1 ;
       unlock (barlock) ;
       lock (barwit) ;
    endif ;
    if (barcnt = NP-1) then
       ...
       code body
    endif ;
    if (barcnt = 0) then
       unlock (barlock) ;
    else
       barcnt := barcnt - 1 ;
       unlock (barwit) ;
    endif

All processes except the last increment the counter and wait at the lock(barwit). The last process executes the code body and unblocks barwit. On the way out, all processes except the last decrement the counter and unlock barwit; the last process unlocks barlock to leave the correct state for the next barrier execution.
Forcesub bsolve(n) of nprocs ident me
Shared common /matrix/real a(500,500),b(500)
Async common /sol/ real x(500)
Private integer i,j
Private real psum,temp
End declarations
c     Initialize the asynchronous vector x to empty.
      Presched Do 100 i = 1, n
         Void x(i)
100   End presched do
c     The back solve process
c     Produce the value of x(n) to be used in the first loop iteration
      Barrier
         Produce x(n) = b(n)/a(n,n)
      End Barrier
      Selfsched Do 200 i = n-1, 1, -1
         psum = 0.0
         Do 150 j = n, i+1, -1
c           wait for x(j) to become full and copy its value
            Copy x(j) into temp
            psum = psum + a(i,j) * temp
150      continue
c        produce the value of x(i) and mark it as full
         Produce x(i) = (b(i) - psum)/a(i,i)
200   End Selfsched Do
      Return
      end