Lecture 7: Parallel Processing
Zebo Peng, IDA, LiTH (TDTS 08)

• Introduction and motivation
• Architecture classification
• Performance evaluation
• Interconnection network

Performance Improvement

• Reduction of instruction execution time:
  - Increased clock frequency by fast circuit technology.
  - Simplified instructions (RISC).

• Parallelism within the processor:
  - Pipelining.
  - Parallel execution of instructions (ILP):
    - Superscalar processors.
    - VLIW architectures.

• Parallel processing:
  - Huge degree of parallelism possible.

Why Parallel Processing?

• Traditional computers are not able to meet the high-performance requirements of many applications:
  - Simulation of large complex systems in physics, economics, biology, ...
  - Distributed databases with search functions.
  - Computer-aided design.
  - Visualization and multimedia.
  - Multi-tasking and multi-user systems (e.g., supercomputers).

• Such applications are characterized by a very large amount of numerical computation and/or a high volume of input data.

• In order to deliver sufficient performance for such applications, we can put many processors in a single computer.

Why Parallel Processing (Cont'd)?

• Technology development:
  - Hardware and silicon technology make it possible to build machines with a huge degree of parallelism cost-effectively.
  - It started with mainframes and supercomputers.
  - Now even file servers and regular PCs are often implemented as parallel machines.

• Parallel processing also has the potential of being more reliable:
  - If one processor fails, the system continues to work, with lower performance.

• Parallel processing also provides a platform to build scalable systems with different performance levels and capabilities.

Parallel Computer

• Parallel computers are architectures in which many CPUs run in parallel to implement a given application or a set of applications.

• Such computers can be organized in different ways, depending on several key parameters:
  - number and complexity of the individual CPUs;
  - availability of common (shared) memory;
  - interconnection technology and topology;
  - performance of the interconnection network;
  - I/O devices;
  - etc.

Parallel Program

• In order to fully utilize a parallel computer, one must decompose a problem into sub-problems that can be solved in parallel.
  - The results of the sub-problems may have to be combined to get the final result of the main problem.

• Due to data dependencies among the sub-problems, it is not easy to decompose some problems so as to get a large degree of parallelism.

• Due to data dependencies, the processors may also have to communicate with each other.

• The time taken for communication is usually very high compared with the processing time.

• The communication mechanism must therefore be very well designed in order to get good performance.

Parallel Program Example (1)

Matrix computation (element-wise addition of two N×M matrices):

  C = A + B, where each element C[i,j] = A[i,j] + B[i,j]:

    | A11+B11   A12+B12   A13+B13   ...   A1M+B1M |
    | A21+B21   A22+B22   A23+B23   ...   A2M+B2M |
    | A31+B31   A32+B32   A33+B33   ...   A3M+B3M |
    |   ...       ...       ...            ...    |
    | AN1+BN1   AN2+BN2   AN3+BN3   ...   ANM+BNM |

Vector computation with vectors of m elements:

  for i := 1 to n do
    C[i, 1:m] := A[i, 1:m] + B[i, 1:m];
  end for;
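
As an added illustration (a minimal sketch assuming C with OpenMP, not taken from the original slides), the row-wise loop above can be written so that the rows of C are computed in parallel; the matrix sizes and input data are arbitrary example values.

  /* Row-parallel matrix addition: each row of C is independent of the
   * others, so the rows can be computed in parallel.
   * Compile with, e.g.:  gcc -fopenmp matrix_add.c */
  #include <stdio.h>

  #define N 4
  #define M 5

  int main(void) {
      double A[N][M], B[N][M], C[N][M];

      for (int i = 0; i < N; i++)            /* example input data */
          for (int j = 0; j < M; j++) {
              A[i][j] = i + j;
              B[i][j] = i * j;
          }

      #pragma omp parallel for               /* one row per iteration */
      for (int i = 0; i < N; i++)
          for (int j = 0; j < M; j++)
              C[i][j] = A[i][j] + B[i][j];

      printf("C[1][2] = %g\n", C[1][2]);     /* (1+2) + (1*2) = 5 */
      return 0;
  }

Without -fopenmp the pragma is simply ignored and the loop runs sequentially, so the sketch is easy to test.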

Parallel Program Example (2)

• A vector dot product is common in filtering (a small code sketch follows at the end of this example):

  Y = Σ_{i=1..N} a(i) × x(i)

• Parallel sorting:

[Diagram: an UNSORTED list is split into four parts (Unsorted-1 ... Unsorted-4); the four parts are sorted by four Sorting stages running in parallel (the parallel part), and the resulting lists (Sorted-1 ... Sorted-4) are then merged by a single Merge stage into the final SORTED list (the sequential part).]
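
As an added illustration (a sketch assuming C with OpenMP, not taken from the original slides), the dot product above maps naturally onto a parallel reduction; the vector length and data are arbitrary example values.

  /* Parallel dot product Y = sum_{i=1..N} a(i) * x(i): the partial products
   * are independent; each thread accumulates a private partial sum, and the
   * reduction clause combines the partial sums into y at the end. */
  #include <stdio.h>

  #define N 1000

  int main(void) {
      double a[N], x[N], y = 0.0;

      for (int i = 0; i < N; i++) {          /* example input data */
          a[i] = 1.0;
          x[i] = 0.5;
      }

      #pragma omp parallel for reduction(+:y)
      for (int i = 0; i < N; i++)
          y += a[i] * x[i];

      printf("Y = %g\n", y);                 /* expected: 500 */
      return 0;
  }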

Lecture 7: Parallel Processing

• Introduction and motivation
• Architecture classification
• Performance evaluation
• Interconnection network

Flynn's Classification of Architectures

• Based on the nature of the instruction flow executed by the computer and the data flow on which the instructions operate.

• The multiplicity of instruction streams and data streams gives us four different classes:
  - Single instruction, single data stream - SISD
  - Single instruction, multiple data stream - SIMD
  - Multiple instruction, single data stream - MISD
  - Multiple instruction, multiple data stream - MIMD

Single Instruction, Single Data - SISD

The regular computers we have discussed up till now:

• A single processor;
• A single instruction stream; and
• Data stored in a single memory.

[Diagram: a CPU consisting of a Control Unit and a Processing Unit, connected to the Memory System.]

Single Instruction, Multiple Data - SIMD

• A single machine instruction stream.

• Simultaneous execution on different sets of data.

• A large number of processing elements.

• Lockstep synchronization among the processing elements.

• The processing elements can:
  - have their own private data memories; or
  - share a common memory via an interconnection network.

• Array and vector processors are the most common examples of SIMD machines.

SIMD with Shared Memory

[Diagram: one Control Unit sends a single instruction stream (IS) to Processing Unit_1 ... Processing Unit_n; each processing unit exchanges its data stream (DS1 ... DSn) with the Shared Memory through an Interconnection Network.]

Multiple Instruction, Single Data - MISD

• A single sequence of data.

• Transmitted to a set of processors.

• The processors execute different instruction sequences.

• It has not been commercially implemented up till now!

[Diagram: one data stream fed to processing elements PE1, PE2, ..., PEn.]

Multiple Instruction, Multiple Data - MIMD

• It consists of a set of processors.

• The processors simultaneously execute different instruction sequences.

• Different sets of data are operated on.

• The MIMD class can be further divided:
  - Shared memory (tightly coupled):
    - Symmetric multiprocessor (SMP)
    - Non-uniform memory access (NUMA)
  - Distributed memory (loosely coupled) = clusters

MIMD with Shared Memory

[Diagram: n CPUs (CPU_1 ... CPU_n), each containing its own Control Unit, Processing Unit and local memory (LM1 ... LMn); each Control Unit issues its own instruction stream ISi to its Processing Unit, and the processing units exchange their data streams DS1 ... DSn with the Shared Memory through an Interconnection Network.]

Discussion

• The very fast development of parallel processing and related areas has blurred concept boundaries, causing a lot of terminological confusion:
  - concurrent computing,
  - multiprocessing,
  - distributed computing,
  - etc.

• There is no strict delimitation of who contributes to the area of parallel processing; it includes computer architecture (CA), operating systems (OS), high-level language (HLL) design, compilation, databases, and computer networks.

Lecture 7: Parallel Processing

• Introduction and motivation
• Architecture classification
• Performance evaluation
• Interconnection network

Performance Metrics (1)

• How fast does a parallel computer run at its maximal potential?
  - Peak rate: the maximal computation rate that can be theoretically achieved when all processors are fully utilized.
  - Ex.: the fastest supercomputer in the world has a peak rate of 55 PFlop/s.

• The peak rate is of no practical significance for an individual user.

• It is mostly used by vendor companies for marketing their computers.

Performance Metrics (2)

• How fast an execution can we expect from a parallel computer for a given application or a given set of applications?
  - Note the increase of multi-tasking and multi-threaded computing.

• Speedup measures the gain we get by using a parallel computer, over a sequential one, to run a given application:

  S = Ts / Tp

  Ts: execution time needed with the sequential computer;
  Tp: execution time needed with the parallel computer.
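
  For example: if an application needs Ts = 100 s on the sequential computer and Tp = 12.5 s on the parallel one, the speedup is S = 100 / 12.5 = 8.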

Performance Metrics (3)

• Efficiency relates the speedup to the number of processors used; it therefore provides a measure of how efficiently the processors are used:

  E = S / P

  S: speedup;
  P: number of processors.

• In the ideal situation we would, in theory, have S = P, which means E = 1.

• In practice, the ideal efficiency of 1 cannot be achieved!
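
  For example: a speedup of S = 8 obtained with P = 10 processors gives E = 8 / 10 = 0.8, i.e., on average only 80% of the processors' capacity is turned into useful work.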

Speed Up Limitation

• Let f be the fraction of the computation that, according to the algorithm, has to be executed sequentially (0 ≤ f ≤ 1), and let P be the number of processors.

• The sequential part f × Ts cannot be sped up, while the parallelizable part (1 - f) × Ts is divided among the P processors:

  Tp = f × Ts + (1 - f) × Ts / P

  S = Ts / Tp = 1 / (f + (1 - f) / P)

[Plot: speedup S (from 1 to 10) as a function of f (from 0.2 to 1.0), for a parallel computer with 10 processing elements.]

Speed Up vs. % of Parallel Part (1-f)

[Plot: speedup (logarithmic scale, 1 to 1024) versus the number of cores (1, 2, 4, ..., 1024), with one curve for each parallel fraction 1-f = 0.5, 0.75, 0.95, 0.99 and 1. Only the fully parallel case (1-f = 1) scales linearly with the number of cores; the other curves level off.]
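
  For example, reading the 1-f = 0.95 curve: with f = 0.05, 1024 cores give S = 1 / (0.05 + 0.95/1024) ≈ 19.6, already close to the limit of 1/f = 20.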

Amdahl's Law

• Even a small fraction of sequential computation imposes a limit on the speedup:

  S = 1 / (f + (1 - f) / P)  ≤  1 / f

  - A speedup higher than 1/f cannot be achieved, regardless of the number of processors.
  - If there is 20% sequential computation, the speedup will be at most 5, even if you have 1 million processors.

• To efficiently exploit a high number of processors, f must be small (the algorithm has to be highly parallel), since

  E = S / P = 1 / (f × (P - 1) + 1)
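
The effect of the bound can be checked numerically; the small C program below (a sketch added here, not part of the original slides) evaluates the two formulas above for P = 1000 processors and a few values of f.

  /* Amdahl's law: S = 1 / (f + (1 - f)/P) and E = S/P = 1 / (f*(P-1) + 1). */
  #include <stdio.h>

  static double amdahl_speedup(double f, double p) {
      return 1.0 / (f + (1.0 - f) / p);
  }

  int main(void) {
      const double f_values[] = { 0.0, 0.01, 0.05, 0.2 };
      const double P = 1000.0;               /* number of processors */

      for (int i = 0; i < 4; i++) {
          double f = f_values[i];
          double S = amdahl_speedup(f, P);
          printf("f = %.2f:  S = %8.2f   E = %.4f\n", f, S, S / P);
      }
      return 0;
  }

Even with 1000 processors, f = 0.2 caps the speedup at about 4.98 (with E below 0.005), and f = 0.05 caps it just below 20.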

Other Factors that Limit Speedup

• Besides the intrinsic sequentiality of parts of an algorithm, there are other factors that limit the achievable speedup:
  - communication cost;
  - load balancing of the processors;
  - costs of creating and scheduling processes; and
  - I/O operations (mostly sequential in nature).

• There are many algorithms with a high degree of parallelism.
  - For these, the value of f is very small and can be ignored, and they are suited for massively parallel systems.
  - In such algorithms, the other limiting factors, such as the cost of communication, become critical.

Impact of Communication

• Consider a highly parallel computation, where f is small and can be neglected.

• Let fc be the fractional communication overhead of a processor:
  - Tcalc: the time a processor spends executing computations;
  - Tcomm: the time a processor is idle because of communication;

  fc = Tcomm / Tcalc

  Tp = (Ts / P) × (1 + fc)

  S = Ts / Tp = P / (1 + fc)

  E = 1 / (1 + fc) ≈ 1 - fc   (if fc is small)

• With algorithms having a high degree of parallelism, massively parallel computers, consisting of a large number of processors, can be used efficiently if fc is small:
  - The time a processor spends on communication has to be small compared to its computation time.
  - Communication time is very much affected by the interconnection network.
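
  For example: if a processor computes for 100 ms and then sits idle for 5 ms waiting for communication, fc = 0.05 and E = 1 / 1.05 ≈ 0.95.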

Lecture 7: Parallel Processing

• Introduction and motivation
• Architecture classification
• Performance evaluation
• Interconnection network

Interconnection Network

• The interconnection network (IN) is a key component of a parallel architecture. It has a decisive influence on:
  - the overall performance; and
  - the total cost of the architecture.

• The traffic in an IN consists of:
  - data transfer; and
  - transfer of commands and requests (control information).

• The key parameters of an IN are:
  - total bandwidth: transferred bits/second; and
  - implementation cost.

Single Bus

[Diagram: Node1, Node2, ..., Noden attached to a single shared bus.]

• Single-bus networks are simple, cheap and relatively flexible, and the bus provides a broadcast mechanism.

• Only one communication is allowed at a time; the bandwidth is shared by all nodes.

• Performance is relatively poor.

• In order to have good performance, the number of nodes has to be limited (to around 16 - 20).
  - Multiple buses can be used instead, if needed.

Completely Connected Network

[Diagram: every node linked directly to every other node; N × (N-1)/2 wires for N nodes.]

• Each node is connected to every other one.

• Communications can be performed in parallel between any pair of nodes.

• Both performance and cost are high.

• Cost increases rapidly with the number of nodes.
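
  For example: N = 16 nodes already require 16 × 15 / 2 = 120 links, and N = 64 nodes require 2016.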

Crossbar Network

[Diagram: Node 1, Node 2, ..., Node n connected through a grid of crosspoint switches.]

• A dynamic network: the interconnection topology can be modified by configuring the switches.

• It is completely connected: any node can be directly connected to any other.

• Fewer interconnections are needed than for the static completely connected network; however, a large number of switches is needed.

• A large number of communications can be performed in parallel (even though each node can receive or send only one data item at a time).
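
  As a rough measure, an n × n crossbar contains on the order of n^2 crosspoint switches, which is why the switch count grows quickly with the number of nodes.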

Mesh Network

[Diagram: a 2-D mesh of nodes; with wrap-around connections it becomes a torus.]

• Cheaper than completely connected networks, while giving relatively good performance.

• In order to transmit data between two nodes, routing through intermediate nodes is needed (maximum 2×(n-1) intermediates for an n×n mesh).

• It is possible to provide wrap-around connections: a torus.

• Three-dimensional meshes have also been implemented.
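
  For example: in a 16×16 mesh (256 nodes), a transfer may have to pass through up to 2×(16-1) = 30 intermediate nodes; the wrap-around links of a torus roughly halve this worst case.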

Hypercube Network

[Diagram: hypercubes of dimension 2-D, 3-D, 4-D and 5-D.]

• 2^n nodes are arranged in an n-dimensional cube. Each node is connected to n neighbors.

• In order to transmit data between two nodes, routing through intermediate nodes is needed (maximum n intermediates).
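
As an added sketch (not taken from the slides), routing in a hypercube can be done dimension by dimension: node IDs are n-bit numbers, two nodes are neighbors iff their IDs differ in exactly one bit, and a message is forwarded by fixing one differing bit per hop, so the path length equals the Hamming distance between source and destination (at most n). A minimal C illustration:

  /* Dimension-order routing in an n-dimensional hypercube. */
  #include <stdio.h>

  static void route(unsigned src, unsigned dst, int n) {
      unsigned cur = src;
      printf("%u", cur);
      for (int d = 0; d < n; d++) {          /* visit the dimensions in order */
          unsigned bit = 1u << d;
          if ((cur ^ dst) & bit) {           /* this dimension still differs  */
              cur ^= bit;                    /* hop to the neighbor along d   */
              printf(" -> %u", cur);
          }
      }
      printf("\n");
  }

  int main(void) {
      /* 4-D hypercube (16 nodes): route from node 3 (0011) to node 12 (1100).
       * All four bits differ, so the path has the maximum length of 4 hops:
       * 3 -> 2 -> 0 -> 4 -> 12 */
      route(3, 12, 4);
      return 0;
  }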

Summary

• The growing need for high performance cannot always be satisfied by traditional computers.

• In a parallel computer, multiple CPUs run concurrently in order to solve a given problem.

• Parallel programs have to be developed in order to make efficient use of a parallel computer.

• Computers can be classified based on the nature of the instruction flow and the data flow on which the instructions operate.

• Another key component of a parallel architecture is the interconnection network.

• The performance of a parallel computer depends not only on the number of available processors but also on the characteristics of the executed programs.