Data dependence

Program & network properties
Module 4
Points to be covered
 Conditions of parallelism (2.1, ACA, Kai Hwang)
o Data dependence and resource dependence
o Hardware and software parallelism
o The role of the compiler
 Program partitioning and scheduling (2.2, ACA, Kai Hwang)
o Grain size and latency
o Grain packing and scheduling
 System interconnect architecture (2.4, ACA, Kai Hwang)
o Network properties and routing
o Static connection networks
o Dynamic connection networks

Conditions of Parallelism


The ability to execute several program segments in
parallel requires each segment to be independent of the
other segments. We use a dependence graph to describe
these relations.
The nodes of a dependence graph correspond to the
program statements (instructions), and directed edges with
different labels are used to represent the ordered
relations among the statements.
Data dependence
The ordering relationship between statements is
indicated by the data dependence. Five types of data
dependence are defined below:
1) Flow dependence
2) Anti-dependence
3) Output dependence
4) I/O dependence
5) Unknown dependence

Flow Dependence
A statement S2 is flow dependent on S1 if an execution
path exists from S1 to S2 and if at least one output
(a variable assigned) of S1 feeds in as an input (an operand to be
used) to S2. This is denoted as:
S1 → S2

Example:
S1: Load R1, A
S2: Add R2, R1

Anti-Dependence

Statement S2 is anti-dependent on statement S1 if S2
follows S1 in program order and if the output of S2
overlaps the input of S1 (denoted by a crossed arrow from S1 to S2).
Example:
S1: Add R2, R1
S2: Move R1, R3
Output Dependence
Two statements are output dependent if they produce
(write) the same output variable.
Example:
S1: Load R1, A
S2: Move R1, R3
I/O Dependence

Read and write are I/O statements. I/O dependence
occurs not because the same variable is involved, but
because the same file is referenced by both I/O statements.
Example:
S1: Read(4), A(I)
S3: Write(4), A(I)

Unknown Dependence
The dependence relation between two statements cannot
be determined.
Example: indirect addressing

Example:
S1: Load R1, A
S2: Add R2, R1
S3: Move R1, R3
S4: Store B, R1
Flow dependence:
S1 to S2
S3 to S4
Anti-dependence:
S2 to S3
Output dependence:
S1 to S3
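These edges can also be derived mechanically. The sketch below (not Hwang's algorithm, only an illustration) scans the four statements once, remembering for each register or memory variable its last writer and the readers seen since that write, and prints the flow, anti- and output dependences listed above:

/* Sketch: derive the dependence edges of the four-statement example by a
   single forward scan. Encoding of statements and variables is illustrative. */
#include <stdio.h>

enum { R1, R2, R3, A, B, NVARS };
static const char *vname[NVARS] = { "R1", "R2", "R3", "A", "B" };

int main(void) {
    /* reads[i] / writes[i]: variables statement S(i+1) reads or writes (-1 ends the list) */
    int reads[4][3]  = { {A, -1, -1}, {R1, R2, -1}, {R3, -1, -1}, {R1, -1, -1} };
    int writes[4][3] = { {R1, -1, -1}, {R2, -1, -1}, {R1, -1, -1}, {B, -1, -1} };

    int last_writer[NVARS];            /* last statement that wrote each variable   */
    int readers[NVARS][4], nreaders[NVARS]; /* readers since that write             */
    for (int v = 0; v < NVARS; v++) { last_writer[v] = -1; nreaders[v] = 0; }

    for (int s = 0; s < 4; s++) {
        for (int k = 0; reads[s][k] >= 0; k++) {
            int v = reads[s][k];
            if (last_writer[v] >= 0)   /* value flows from its last writer          */
                printf("flow:   S%d -> S%d (%s)\n", last_writer[v] + 1, s + 1, vname[v]);
            readers[v][nreaders[v]++] = s;
        }
        for (int k = 0; writes[s][k] >= 0; k++) {
            int v = writes[s][k];
            for (int r = 0; r < nreaders[v]; r++)
                if (readers[v][r] != s)   /* write after read (ignore self)         */
                    printf("anti:   S%d -> S%d (%s)\n", readers[v][r] + 1, s + 1, vname[v]);
            if (last_writer[v] >= 0)      /* write after write                      */
                printf("output: S%d -> S%d (%s)\n", last_writer[v] + 1, s + 1, vname[v]);
            last_writer[v] = s;
            nreaders[v] = 0;
        }
    }
    return 0;
}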

Control Dependence



This refers to the situation where the order of
execution of statements cannot be determined before
run time.
For example, in conditional statements the flow of
control depends on the outcome of the condition.
Different paths taken after a conditional branch may
depend on the data; hence this dependence among the
instructions needs to be eliminated before they can be executed in parallel.


This dependence also exists between operations
performed in successive iterations of a loop.
Control dependence often prohibits parallelism from
being exploited.
Control-independent example:
for (i=0;i<n;i++) {
a[i] = c[i];
if (a[i] < 0) a[i] = 1;
}
Control-dependent example:
for (i=1;i<n;i++) {
if (a[i-1] < 0) a[i] = 1;
}



Control dependence thus prevents parallelism from being
exploited.
Compilers are used to eliminate this control dependence
and exploit the parallelism.
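For instance, the conditional assignment inside the first loop above can be removed by if-conversion, turning the branch into a predicated data computation; the transformation below is only a sketch of the idea (note that it does not remove the loop-carried dependence in the second example):

/* Control-dependent form: each iteration contains a branch. */
for (i = 0; i < n; i++) {
    if (a[i] < 0) a[i] = 1;
}

/* After if-conversion the branch becomes a conditional (predicated)
   assignment, so each iteration is straight-line code. */
for (i = 0; i < n; i++) {
    a[i] = (a[i] < 0) ? 1 : a[i];
}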
Resource Dependence


Resource dependence is concerned with conflicts in
using shared resources, such as registers, integer and
floating-point ALUs, etc. ALU conflicts are called ALU
dependence.
Memory (storage) conflicts are called storage
dependence.
Bernstein’s Conditions
Bernstein’s conditions are a set of conditions which must
be satisfied if two processes are to execute in parallel.
Notation
 Ii is the set of all input variables for a process Pi.
Ii is also called the read set or domain of Pi.
 Oi is the set of all output variables for a process Pi. Oi is also
called the write set.


If P1 and P2 can execute in parallel (which is written as P1 ||
P2), then we should have:
I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅
In terms of data dependences, Bernstein’s conditions
imply that two processes can execute in parallel if they
are flow-independent, anti-independent, and
output-independent.
The parallelism relation || is commutative (Pi || Pj implies Pj
|| Pi), but not transitive (Pi || Pj and Pj || Pk does not imply Pi
|| Pk).
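A minimal sketch of checking the three conditions for two hypothetical processes, P1: C = A + B and P2: D = E * F (the variable names and fixed-size sets are made up for the example; a real checker would build the read/write sets from the code):

/* Sketch: test Bernstein's conditions for two processes. Illustrative only. */
#include <stdio.h>
#include <string.h>

/* Returns 1 if the two sets of variable names share no element. */
static int disjoint(const char *a[], int na, const char *b[], int nb) {
    for (int i = 0; i < na; i++)
        for (int j = 0; j < nb; j++)
            if (strcmp(a[i], b[j]) == 0) return 0;
    return 1;
}

int main(void) {
    /* P1: C = A + B   reads {A, B}, writes {C}
       P2: D = E * F   reads {E, F}, writes {D}   (hypothetical processes) */
    const char *I1[] = {"A", "B"}; const char *O1[] = {"C"};
    const char *I2[] = {"E", "F"}; const char *O2[] = {"D"};

    int parallel = disjoint(I1, 2, O2, 1)    /* I1 and O2 disjoint: anti-independent   */
                && disjoint(I2, 2, O1, 1)    /* I2 and O1 disjoint: flow-independent   */
                && disjoint(O1, 1, O2, 1);   /* O1 and O2 disjoint: output-independent */

    printf("P1 || P2 ? %s\n", parallel ? "yes" : "no");
    return 0;
}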
Hardware and software parallelism




Hardware parallelism is defined by the machine architecture.
It can be characterized by the number of instructions that
can be issued per machine cycle. If a processor issues k
instructions per machine cycle, it is called a k-issue
processor.
Conventional processors are one-issue machines.
Examples: the Intel i960CA is a three-issue processor
(arithmetic, memory access, branch); the IBM RS/6000 is a
four-issue processor (arithmetic, floating-point, memory
access, branch).
Software Parallelism

Software parallelism is defined by the control and data
dependences of programs, and is revealed in the program’s
flow graph; i.e., it is defined by the dependencies within the
code and is a function of the algorithm, programming style,
and compiler optimization.
Mismatch between software and
hardware parallelism - 1
[Figure: maximum software parallelism (L = load, X/+/- = arithmetic).
Cycle 1 issues L1, L2, L3, L4; cycle 2 issues X1, X2; cycle 3 issues
+ and -, producing A and B. All eight operations complete in three
cycles.]
Mismatch between software and
hardware parallelism - 2
[Figure: the same problem, but considering the parallelism
available on a two-issue superscalar processor. The eight
operations (L1-L4, X1, X2, +, -) now spread over seven cycles
before A and B are produced.]
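From the two figures: eight operations finish in three cycles under unlimited resources, an average software parallelism of 8/3 ≈ 2.67 instructions per cycle, while the same eight operations need seven cycles on the two-issue processor, a hardware parallelism of only 8/7 ≈ 1.14. The gap between these two numbers is the mismatch.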
Types of Software Parallelism


Control Parallelism – two or more operations can be
performed simultaneously. This can be detected by a
compiler, or a programmer can explicitly indicate control
parallelism by using special language constructs or dividing a
program into multiple processes.
Data parallelism – multiple data elements have the same
operations applied to them at the same time. This offers the
highest potential for concurrency (in SIMD and MIMD
modes). Synchronization in SIMD machines is handled by the
hardware.
The Role of Compilers




Compilers are used to exploit hardware features to improve
performance.
Interaction between compiler and architecture design is a
necessity in modern computer development.
It is not necessarily the case that more software
parallelism will improve performance in conventional
scalar processors.
The hardware and compiler should be designed at the
same time.
Program Partitioning & Scheduling




The size of the parts or pieces of a program that can be
considered for parallel execution can vary.
The sizes are roughly classified using the term “granule size,”
or simply “granularity.”
The simplest measure, for example, is the number of
instructions in a program part.
Grain sizes are usually described as fine, medium or coarse,
depending on the level of parallelism involved.
Latency



Latency is the time required for communication between
different subsystems in a computer.
Memory latency, for example, is the time required by a
processor to access memory.
Synchronization latency is the time required for two
processes to synchronize their execution.
Levels of Parallelism
[Figure: five levels of parallelism, grouped by grain size.
Coarse grain: jobs or programs; subprograms, job steps or
related parts of a program.
Medium grain: procedures, subroutines, tasks, or coroutines.
Fine grain: non-recursive loops or unfolded iterations;
instructions or statements.
Moving toward finer grain gives a higher degree of parallelism,
but also increasing communication demand and scheduling
overhead.]
Types and Levels of Parallelism
1) Instruction-Level Parallelism
2) Loop-Level Parallelism
3) Procedure-Level Parallelism
4) Subprogram-Level Parallelism
5) Job or Program-Level Parallelism
Instruction-Level Parallelism
This fine-grained, or smallest-granularity, level typically
involves less than 20 instructions per grain.
 The number of candidates for parallel execution varies
from 2 to thousands, with about five instructions or
statements on average.
Advantages:
 There are usually many candidates for parallel execution.
 Compilers can usually do a reasonable job of finding this
parallelism.
Loop-level Parallelism



A typical loop has less than 500 instructions. If a loop
operation is independent between iterations, it can be
handled by a pipeline or by a SIMD machine.
It is the most optimized program construct to execute on a
parallel or vector machine. Some loops (e.g. recursive) are
difficult to handle.
Loop-level parallelism is still considered fine grain
computation.
Procedure-level Parallelism




Medium-sized grain; usually less than 2000 instructions.
Detection of parallelism is more difficult than with
smaller grains; interprocedural dependence analysis is
difficult.
Communication requirements are less than at the instruction level.
SPMD (single program, multiple data) execution is a special case.
Multitasking belongs to this level.
Subprogram-level Parallelism



The grain typically contains thousands of instructions.
Multiprogramming is conducted at this level.
No compilers are currently available to exploit medium- or coarse-grain parallelism.
Job Level


Corresponds to execution of essentially independent jobs
or programs on a parallel computer.
This is practical for a machine with a small number of
powerful processors, but impractical for a machine with a
large number of simple processors (since each processor
would take too long to process a single job).
Permutations



Given n objects, there are n! ways in which they can be
reordered (one of which is no reordering).
A permutation can be specified by giving the rule for
reordering a group of objects.
Permutations can be implemented using crossbar switches,
multistage networks, shifting, and broadcast operations.
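As a small illustration, a permutation can be stored as an index map and applied directly; the sketch below routes four hypothetical messages according to the permutation (1, 2, 3, 4) -> (3, 1, 2, 4) used in the crossbar example that follows, reading it as "input i goes to output perm[i]":

/* Sketch: apply the permutation (1, 2, 3, 4) -> (3, 1, 2, 4), i.e. deliver
   input i to output perm[i]. Message names are illustrative. */
#include <stdio.h>

int main(void) {
    int perm[4] = {2, 0, 1, 3};          /* 0-based: inputs 1,2,3,4 -> outputs 3,1,2,4 */
    const char *msg[4] = {"m1", "m2", "m3", "m4"};
    const char *out[4];

    for (int i = 0; i < 4; i++)
        out[perm[i]] = msg[i];           /* route input i to output perm[i] */

    for (int i = 0; i < 4; i++)
        printf("output %d receives %s\n", i + 1, out[i]);
    return 0;
}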
Example (Permutation in a Crossbar Switch)
[Figure: a 4 x 4 crossbar switch realizing the permutation
(1, 2, 3, 4) -> (3, 1, 2, 4) between its four inputs and four outputs.]
Perfect Shuffle and Exchange


Stone suggested the special permutation that reorders entries
according to the mapping of the k-bit binary number a b ...
k to b c ... k a (that is, shifting 1 bit to the left and wrapping
it around to the least significant bit position).
The inverse perfect shuffle reverses the effect of the perfect
shuffle.
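A minimal sketch of the perfect shuffle and its inverse on k-bit addresses (a one-bit left rotation and a one-bit right rotation, respectively); the function names are illustrative:

/* Sketch: perfect shuffle = rotate a k-bit address left by one bit;
   inverse perfect shuffle = rotate right by one bit. Illustrative only. */
#include <stdio.h>

static unsigned shuffle(unsigned x, int k) {
    unsigned msb = (x >> (k - 1)) & 1u;            /* bit a (most significant)  */
    return ((x << 1) | msb) & ((1u << k) - 1);     /* b c ... k a               */
}

static unsigned unshuffle(unsigned x, int k) {
    unsigned lsb = x & 1u;                         /* wrapped-around bit        */
    return (x >> 1) | (lsb << (k - 1));
}

int main(void) {
    int k = 3;                                     /* e.g. 101 -> 011 -> 101    */
    for (unsigned x = 0; x < (1u << k); x++)
        printf("%u -> shuffle %u -> unshuffle %u\n",
               x, shuffle(x, k), unshuffle(shuffle(x, k), k));
    return 0;
}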
Hypercube Routing Functions


If the vertices of an n-dimensional cube are labeled with n-bit
numbers so that only one bit differs between each pair of
adjacent vertices, then n routing functions are defined by the
bits in the node (vertex) address.
For example, with a 3-dimensional cube, we can easily
identify routing functions that exchange data between nodes
with addresses that differ in the least significant, most
significant, or middle bit.
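A sketch of these routing functions: function i simply complements bit i of the node address, pairing each node with its neighbor across dimension i (the names and output format below are illustrative):

/* Sketch: in an n-dimensional hypercube, routing function i pairs each
   node with the node whose address differs only in bit i. Illustrative. */
#include <stdio.h>

static unsigned cube_route(unsigned node, int i) {
    return node ^ (1u << i);                     /* flip bit i of the address */
}

int main(void) {
    int n = 3;                                   /* 3-dimensional cube: 8 nodes */
    for (int i = 0; i < n; i++) {
        printf("routing function C%d:", i);
        for (unsigned node = 0; node < (1u << n); node++)
            if (node < cube_route(node, i))      /* print each pair once */
                printf("  %u<->%u", node, cube_route(node, i));
        printf("\n");
    }
    return 0;
}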
Factors Affecting Performance

 Functionality – how the network supports data routing, interrupt
handling, synchronization, request/message combining, and
coherence
 Network latency – worst-case time for a unit message to be
transferred
 Bandwidth – maximum data rate
 Hardware complexity – implementation costs for wire, logic,
switches, connectors, etc.
 Scalability – how easily the scheme adapts to an increasing
number of processors, memories, etc.
Dynamic Networks – Bus Systems

 A bus system (contention bus, time-sharing bus) has:
o a collection of wires and connectors
o multiple modules (processors, memories, peripherals, etc.) which connect to
the wires
o data transactions between pairs of modules
 The bus supports only one transaction at a time.
 Bus arbitration logic must deal with conflicting requests.
 It has the lowest cost and bandwidth of all dynamic schemes.
 Many bus standards are available.
2 × 2 Switches
[Figure: the legal states of a 2 × 2 switch (straight-through, crossover,
upper broadcast, lower broadcast). From Advanced Computer
Architectures, K. Hwang, 1993.]
Single-Stage Networks
[Figure: a single-stage Shuffle-Exchange interconnection network (left)
and the perfect shuffle mapping function (right). From Ben Macey at
http://www.ee.uwa.edu.au/~maceyb/aca319-2003]
Perfect shuffle operation: cyclic shift 1
place left, e.g. 101 --> 011.
Exchange operation: invert the least
significant bit, e.g. 101 --> 100.
Crossbar Switch Connections
 Has n inputs and m outputs; n and
m are usually the same.
 Data can flow in either direction.
 Each crosspoint can open or close
to realize a connection.
 All possible combinations can be
realized.
 The inputs are usually connected to
processors and the outputs connected to
memory, I/O, or other processors.
 These switches have complexities
of O(n^2); doubling the number of
inputs and outputs quadruples the
size of the switch.
 To solve this problem, multistage
interconnection networks were
developed.
[Figure: an n x m crossbar with inputs 0 to n-1 and outputs 0 to m-1.]
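For example, a 16 x 16 crossbar needs 16^2 = 256 crosspoints, while a 32 x 32 crossbar needs 32^2 = 1,024, i.e. four times as many.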
Multistage Interconnection Networks

 The capabilities of single-stage networks are limited, but if we cascade enough of
them together, they form a completely connected MIN (Multistage Interconnection
Network).
 Switches can perform their own routing or can be controlled by a central router.
 This type of network can be classified into the following categories:
 Nonblocking
o A network is called strictly nonblocking if it can connect any idle input to any
idle output, regardless of what other connections are currently in process.
 Rearrangeable nonblocking
o In this case a network should be able to establish all possible connections
between inputs and outputs by rearranging its existing connections.
 Blocking
o A network is said to be blocking if it can perform many, but not all, possible
connections between terminals.
o Example: the Omega network
Multistage Network

Refer to Fig. 2.23 on page 91 of the ACA book (Kai
Hwang).


Different classes of Multistage Interconnection
Networks (MINs) differ in the switch modules used and in the kind
of interstage connection pattern used.
The patterns often include the perfect
shuffle, butterfly, crossbar, cube connection, etc.
Omega Network

 A 2 × 2 switch can be configured for:
o Straight-through
o Crossover
o Upper broadcast (upper input to both outputs)
o Lower broadcast (lower input to both outputs)
o (No output is a somewhat vacuous possibility as well)
 With four stages of eight 2 × 2 switches, and a static perfect
shuffle for each of the four ISCs, a 16 x 16 Omega network
can be constructed (but not all permutations are possible).
 In general, an n-input Omega network requires log2(n) stages,
each containing n/2 of the 2 × 2 switch modules.
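One standard way to set the switches is destination-tag (self-routing): at each stage the message follows the perfect-shuffle link, and the switch then sends it to its upper output if the next bit of the destination address is 0, or to its lower output if it is 1. The sketch below traces a single message through an 8 x 8 Omega network under this scheme; it ignores conflicts between simultaneous messages, which is exactly where blocking arises.

/* Sketch: destination-tag routing of one message through an n x n Omega
   network (n a power of two, k = log2 n stages). Illustrative only. */
#include <stdio.h>

static unsigned route_omega(unsigned src, unsigned dst, int k) {
    unsigned pos = src;
    for (int s = k - 1; s >= 0; s--) {
        unsigned msb = (pos >> (k - 1)) & 1u;
        pos = ((pos << 1) | msb) & ((1u << k) - 1);   /* perfect-shuffle link  */
        unsigned bit = (dst >> s) & 1u;               /* next destination bit  */
        pos = (pos & ~1u) | bit;                      /* switch: upper/lower   */
        printf("  stage %d: at port %u (switch bit %u)\n", k - s, pos, bit);
    }
    return pos;                                       /* equals dst            */
}

int main(void) {
    printf("route 5 -> 2 in an 8 x 8 Omega network:\n");
    unsigned out = route_omega(5, 2, 3);
    printf("delivered at output %u\n", out);
    return 0;
}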
16 x 16 Omega Network
[Figure: a 16 x 16 Omega network built from four stages of eight 2 × 2 switches.]
8 x 8 Omega Network
 Consists of four 2 x 2
switches per stage.
 The fixed links between
every pair of stages are
identical.
 A perfect shuffle is formed
by the fixed links between
every pair of stages.
 Has a complexity of O(n lg n).
 For the 8 possible inputs, there
are a total of 8! = 40,320 one-to-one
mappings of the inputs onto
the outputs, but only 12
switches, for a total of 2^12 =
4,096 settings. Thus, the network is
blocking.
[Figure: an 8 x 8 Omega network with inputs 0-7, outputs 0-7, and
twelve 2 x 2 switches labeled A through L.]
Baseline Network
 Similar to the Omega network;
essentially the front half of a Benes
network.
 The figure below shows an 8 x 8
Baseline network.
 To generalize into an n x n
Baseline network, first create
one stage of (n / 2) 2 x 2
switches.
 Then one output from each 2 x
2 switch is connected to an input
of each (n / 2) x (n / 2) switch.
 Then the (n / 2) x (n / 2) switches
are replaced by (n / 2) x (n / 2)
Baseline networks constructed in
the same way.
 The Baseline and Omega
networks are isomorphic with
each other.
[Figure: an 8 x 8 Baseline network with inputs 0-7, outputs 0-7, and
twelve 2 x 2 switches labeled A through L.]
Isomorphism Between Baseline
and Omega Networks (cont.)

 Starting with the Baseline network,
if switches B and C, and F and G,
are repositioned while keeping the
fixed links attached as the switches
are moved, the Baseline network
transforms into the Omega network.
 Therefore, the Baseline and Omega
networks are isomorphic.
[Figure: the 8 x 8 Baseline network redrawn with switches B, C and
F, G repositioned, yielding the 8 x 8 Omega network.]
Recursive Construction


The first stage contains one N x N block, and the second stage
contains two (N/2) x (N/2) sub-blocks, labeled C0 and C1.
This construction can be applied recursively to the sub-blocks
until the 2 x 2 switch size is reached.
Crossbar Networks




A crossbar network can be visualized as a single-stage
switch network.
Like a telephone switchboard, the crosspoint switches
provide dynamic connections between (source,
destination) pairs.
Each crosspoint switch can provide a dedicated
connection path between a pair.
The switch can be set on or off dynamically upon
program demand.
Shared Memory Crossbar


To build a shared-memory multiprocessor, one can use a
crossbar network between the processors and memory
modules (Fig. 2.26a).
The C.mmp multiprocessor implemented a 16 x 16
crossbar network which connects 16 PDP-11 processors
to 16 memory modules, each with a capacity of
1 million words of memory cells.
[Figure: a shared-memory crossbar switch connecting the processors to the
memory modules.]
Shared Memory Crossbar Switch


Note that each memory module can satisfy only one
processor request at a time.
When multiple requests arrive at the same memory
module simultaneously, the crossbar must resolve the
conflicts.
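A minimal sketch of this per-module conflict resolution, assuming one outstanding request per processor per cycle and a simple fixed-priority choice (real crossbars use fairer arbitration policies):

/* Sketch: per-memory-module arbitration in a shared-memory crossbar.
   request[p] is the module processor p wants this cycle (-1 = none);
   at most one processor is granted per module. Illustrative only. */
#include <stdio.h>

#define NPROC 4
#define NMEM  4

int main(void) {
    int request[NPROC] = {2, 0, 2, 3};   /* P0 and P2 conflict on module 2 */
    int grant[NMEM];
    for (int m = 0; m < NMEM; m++) grant[m] = -1;

    for (int p = 0; p < NPROC; p++) {
        int m = request[p];
        if (m >= 0 && grant[m] < 0)
            grant[m] = p;                /* lowest-index requester wins the module */
    }

    for (int m = 0; m < NMEM; m++)
        if (grant[m] >= 0)
            printf("module %d granted to processor %d\n", m, grant[m]);
    return 0;
}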
Interprocessor Communication Crossbar
Switch



This large crossbar was actually built into a vector parallel
processor.
The PEs are processors with attached memory.
The CPs are control processors, which are used to
supervise the entire system operation.
[Figure: an interprocessor-communication crossbar connecting the PEs,
with the CPs supervising overall system operation.]
Summary of Crossbar Network
Different types of devices can be connected, yielding different
constraints on which switches can be enabled.
• With m processors and n memories, one processor may be able
to generate requests for multiple memories in sequence; thus
several switches might be set in the same row.
• For m × m interprocessor communication, each PE is connected
to both an input and an output of the crossbar; only one switch in
each row and column can be turned on simultaneously. Additional
control processors are used to manage the crossbar itself.
End of Module 4