Super Linear Speedup for Parallel Givens Algorithm on a Network of
Workstations (NoW)
M. J. Zemerly, K.M. S. Abdulla and M.H. Shariff
Department of Computer Engineering
Etisalat College of Engineering
PO Box: 980, Sharjah
UNITED ARAB EMIRATES
http://www.ece.ac.ae
Abstract: - This paper describes a parallel direct linear solver algorithm based on Givens rotations on a network of workstations (12 Sun Ultra 10 machines). PVM (Parallel Virtual Machine) was used as the communication platform. The parallel algorithm uses data decomposition and pipeline techniques. Results obtained for large dense matrices (sizes 1024 and 2048) showed superlinear speedup, and the superlinear speedup increases as the size of the matrix increases. This is attributed to the cache effect, since the algorithm itself is not well balanced and has a sequential part.
Key-Words: - Cluster computing, PVM, parallel Givens linear solver, superlinear speedup.
1. Introduction
Direct linear solvers for dense matrices have been
used in many applications [1]. These include
computer simulation, electronic circuit analysis,
structural analysis, chemical engineering, economic
modelling, to name but a few. Most applications
involve solving an extremely large number of linear
equations. This can take a considerable amount of
time to compute on the most powerful sequential
computer available to date. One way to reduce the
time required to solve these equations is to use a
parallel processing to do the task, where the
calculations are divided between the processors.
Since each processor now has less work to do, the
task can be finished more quickly. This is similar to
a supermarket having more than one checkout
counter. Each cashier has fewer customers to serve,
so the checkout time is much faster. Recently, in addition to their low cost and availability, the power of clusters of workstations has become evident with advances in chip technology, which have dramatically improved the speed of single processors.
Communication libraries such as PVM (Parallel
Virtual Machine) [2] and MPI (Message Passing
Interface) [4] facilitate the use of clusters of
workstations.
The selection of the parallelisation technique is very important and is the first step in the parallelisation process. As a result of this step, the data space and the algorithm are partitioned. Although the partitioning reduces the execution time, it introduces communication and synchronization overheads [1,8].
Due to this overhead and the possibility of having a
sequential part in the parallel algorithm, it is very
difficult to obtain decent speedups from parallel
algorithms. The computation time can be reduced by
selecting an appropriate parallelisation strategy
which yields good load balancing. An efficient technique used for parallelisation is functional pipelining, which partitions the control flow into stages that share the same data flow. This is most effective if all the stages perform a similar amount of computation, so that all stages of the pipeline are kept busy [5]. In this paper a linear solver based on Givens rotations is selected and parallelised using a row data decomposition and a pipeline technique [5] that was originally used on a transputer-based machine. The results obtained from the parallel algorithm show a superlinear speedup effect for large matrix sizes.
2. Direct Givens linear solver
Linear solvers are essential for many applications in fields such as engineering, design and simulation [3]. They belong to the computationally intensive class of algorithms and can be represented in the following form: Ax = b, where A is a nonsingular square matrix, b is the right-hand-side vector and x is the vector of unknowns.
The Givens transformation is defined by a 2×2 rotation matrix [6]:

    G = [  c  s ]
        [ -s  c ]

where c² + s² = 1. A Givens rotation is used to eliminate an element of a vector or a matrix as follows:

    [  c  s ] [ a ]   [ r ]
    [ -s  c ] [ b ] = [ 0 ]

where c = a / √(a² + b²), s = b / √(a² + b²) and r = √(a² + b²).
Givens does not require pivoting, which is difficult to parallelise, and is known to be numerically stable and inherently more accurate than other methods such as Gaussian elimination [1,5,6]. Givens does, however, require more computation than Gaussian elimination, with complexities of (4/3)N³ + O(N²) versus (1/3)N³ + O(N²) respectively.
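To make the rotation concrete, the following short Python sketch (our own illustration, not code from the paper; the function name givens is ours) computes c and s for a pair (a, b) and checks that the rotation annihilates the second component:

```python
import math

def givens(a, b):
    """Return (c, s, r) such that [[c, s], [-s, c]] applied to (a, b) gives (r, 0)."""
    r = math.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0, 0.0          # nothing to eliminate
    return a / r, b / r, r

a, b = 3.0, 4.0
c, s, r = givens(a, b)
print(c * a + s * b)                  # 5.0, the new leading element r
print(-s * a + c * b)                 # 0.0, the eliminated element
print(c * c + s * s)                  # 1.0, so the transformation is orthogonal
```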
The Givens algorithm for solving a linear system
with N equations can be decomposed into two
computational stages. The first is the triangulation of
the initial matrix; this stage is represented by the
execution of the elimination block N times. The
second stage is the back-substitution block that
solves the triangular matrix. The complexity of
back-substitution is O(N²) compared with the
triangulation phase of O(N³). In this paper we concentrate on parallelising only the triangulation stage, described in Section 3, since the back-substitution stage is not as computationally demanding.
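As an illustration of the two stages (a minimal sketch of the sequential algorithm using NumPy, not the authors' code), the following triangulates a small augmented system [A | b] with Givens rotations and then back-substitutes:

```python
import numpy as np

def rotate(rows, i, k, j):
    """Apply a Givens rotation to rows i and k so that rows[k, j] becomes zero."""
    a, b = rows[i, j], rows[k, j]
    r = np.hypot(a, b)
    if r == 0.0:
        return
    c, s = a / r, b / r
    rows[i], rows[k] = c * rows[i] + s * rows[k], -s * rows[i] + c * rows[k]

rng = np.random.default_rng(1)
N = 6
A = rng.standard_normal((N, N))
b = rng.standard_normal(N)
rows = np.hstack([A, b[:, None]])            # augmented system [A | b]

# Stage 1: triangulation (the elimination block executed N times), O(N^3) work.
for j in range(N - 1):
    for k in range(j + 1, N):
        rotate(rows, j, k, j)

# Stage 2: back-substitution on the triangular system, O(N^2) work.
x = np.zeros(N)
for i in range(N - 1, -1, -1):
    x[i] = (rows[i, N] - rows[i, i + 1:N] @ x[i + 1:]) / rows[i, i]

print(np.allclose(A @ x, b))                 # True: the rotations preserve the solution
```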
3. Parallel Givens using pipeline method
Initially, a block-row data decomposition, which divides the matrix horizontally and assigns adjacent blocks of rows to neighbouring processors, is used. Firstly, all processors eliminate the required columns for their rows (except the first row for each processor) in parallel. The processor holding the row of the column to be eliminated then passes it to its neighbour, which eliminates its column; the row is then passed in a pipeline fashion through the other neighbours until all the columns are eliminated (see Figure 1). The last processor keeps the rows in a full matrix to be used later, on its own, in the sequential back-substitution stage.
Figure 1: Data decomposition and pipelining for
parallel Givens
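The following Python sketch is a serial emulation of this scheme (our own illustration; the message passing of the PVM implementation is modelled by an ordinary loop, and names such as blocks and tri are ours). Each "processor" holds a block of augmented rows, applies the pivot rows arriving from upstream, locally eliminates its own diagonal columns, and forwards its rows down the pipeline; the last block ends up holding the triangular system used for back-substitution.

```python
import numpy as np

def rotate(piv, row, j):
    """Zero row[j] against piv[j]; both rows are updated in place."""
    a, b = piv[j], row[j]
    r = np.hypot(a, b)
    if r == 0.0:
        return
    c, s = a / r, b / r
    piv[:], row[:] = c * piv + s * row, -s * piv + c * row

rng = np.random.default_rng(2)
N, p = 8, 4                                     # matrix size and number of "processors"
A = rng.standard_normal((N, N))
b = rng.standard_normal(N)
rows = np.hstack([A, b[:, None]])               # augmented system [A | b]
m = N // p
blocks = [rows[k * m:(k + 1) * m] for k in range(p)]   # block-row decomposition

tri = []                                        # pivot rows in the order they travel the pipeline
for k, blk in enumerate(blocks):
    # Pivot rows from upstream arrive in column order; in PVM these would be
    # pipelined receives, each followed by a send to the next processor.
    for j, piv in enumerate(tri):
        for row in blk:
            rotate(piv, row, j)
    # Local elimination of this block's own diagonal columns.
    for i in range(len(blk)):
        for i2 in range(i):
            rotate(blk[i2], blk[i], k * m + i2)
    tri.extend(blk)                             # forwarded downstream as new pivot rows

# The last processor holds the full triangular system and back-substitutes.
R = np.vstack(tri)
x = np.zeros(N)
for i in range(N - 1, -1, -1):
    x[i] = (R[i, N] - R[i, i + 1:N] @ x[i + 1:]) / R[i, i]
print(np.allclose(A @ x, b))                    # True
```

In the real implementation each pivot row is forwarded with an asynchronous send as soon as a processor has finished with it, so downstream processors start working before the upstream ones are done; the loop above reproduces only the arithmetic, not that overlap.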
As can be seen, the parallel algorithm is not well load balanced: the last processor works the hardest and the first processor works the least, as it finishes execution and becomes idle as soon as it has eliminated its N/p columns. Note that sending in PVM is asynchronous and receiving is synchronous. Hence the last processor has to wait for the elimination of the first column of the first row to propagate through all the processors; this is the time required to fill the pipeline. After that all the workers operate in a pipeline fashion until they finish one after the other and become idle, leaving the last processor working on its own to eliminate the remaining columns. The last processor also finds the solution vector using back-substitution. Figures 2 and 3 show a screen dump from XPVM [3] of the space-time and utilization graphs of the algorithm for 6 processors and a matrix size of 512. It can be seen from the figures that the load balancing is not good and has a stepping effect. This may be overcome by distributing more matrix rows to the first processors and giving fewer rows to the last processor.
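One possible way to realise such an uneven distribution (our own illustration, not a scheme given in the paper; the function uneven_blocks and the slope parameter are assumptions) is to assign linearly decreasing block sizes along the pipeline:

```python
def uneven_blocks(n_rows, p, slope=0.5):
    """Split n_rows into p contiguous block sizes that decrease along the
    pipeline, so the earlier (less busy) processors receive more rows.
    slope = 0 gives equal blocks; larger values shrink the later blocks faster."""
    weights = [1.0 + slope * (p - 1 - k) for k in range(p)]   # heavier weights first
    total = sum(weights)
    sizes = [round(n_rows * w / total) for w in weights]
    sizes[-1] = n_rows - sum(sizes[:-1])                      # absorb rounding in the last block
    return sizes

print(uneven_blocks(1024, 6))   # [265, 228, 190, 152, 114, 75]
```

The slope would have to be tuned experimentally and, as noted in the conclusion, kept small for two or three processors so that the first processor's working set still fits in its cache.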
Figure 2: XPVM space-time graph of the algorithm
Figure 3: XPVM utilization graph of the algorithm
4. Results
The results presented here are based on three performance measures: execution time, speedup and efficiency [7].
4.1 Execution time:
The results have been obtained for two large matrix
sizes (1024 and 2048). Table 1 shows the execution
times (minimum of 10 runs was taken) for the two
matrix sizes and Figure 4 shows a graphical
representation of the results in Table 1.
4.2 Speedup:
The speedups (Sp = Ts/Tp, where Ts is the sequential
time and Tp is the parallel time for p processors)
obtained for both matrix sizes and for up to 12
processors are shown in Table 2. Figure 5 shows the speedup results graphically.
Processors    1024      2048
 1           110.53    902.98
 2            34.03    262.30
 3            25.62    192.47
 4            21.30    154.95
 5            18.89    131.94
 6            17.21    116.42
 7            16.09    104.83
 8            15.08     96.48
 9            14.41     88.70
10            13.94     83.74
11            14.19     78.88
12            13.83     74.74

Table 1: Execution times (secs) for matrix sizes 1024 and 2048
Figure 4: Execution times (secs) versus number of processors for matrix sizes 1024 and 2048
Processors    1024     2048
 2            3.25     3.44
 3            4.31     4.69
 4            5.19     5.83
 5            5.85     6.84
 6            6.42     7.76
 7            6.87     8.61
 8            7.33     9.36
 9            7.67    10.18
10            7.93    10.78
11            7.79    11.45
12            7.99    12.08

Table 2: Speedup results for matrix sizes 1024 and 2048
Figure 5: Speedup versus number of processors for matrix sizes 1024 and 2048
4.3 Efficiency
The efficiency (Ep = Sp/p) of both matrix sizes is given in Table 3 and a graphical representation is shown in Figure 6. As can be seen from this figure, superlinear speedup occurs at all the data points where the efficiency is greater than 1, that is, for the matrix size of 1024 up to 5 processors and for the 2048 matrix size up to 8 processors. It can also be seen that as the number of processors increases, the overheads from communication and synchronization increase and the performance of the parallel algorithm decreases.

Processors    1024     2048
 2            1.62     1.72
 3            1.44     1.56
 4            1.30     1.46
 5            1.17     1.37
 6            1.07     1.29
 7            0.98     1.23
 8            0.92     1.17
 9            0.85     1.13
10            0.79     1.08
11            0.71     1.04
12            0.67     1.01

Table 3: Efficiency results for matrix sizes 1024 and 2048

Figure 6: Efficiency versus number of processors for matrix sizes 1024 and 2048
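As a check on these definitions, the short Python sketch below (our own illustration) reproduces the speedup and efficiency entries from the measured execution times of Table 1, here for matrix size 2048:

```python
# Measured execution times in seconds for matrix size 2048 (Table 1),
# indexed by number of processors p = 1 .. 12.
t2048 = [902.98, 262.30, 192.47, 154.95, 131.94, 116.42,
         104.83, 96.48, 88.70, 83.74, 78.88, 74.74]

ts = t2048[0]                              # sequential time Ts
for p, tp in enumerate(t2048[1:], start=2):
    sp = ts / tp                           # speedup    Sp = Ts / Tp
    ep = sp / p                            # efficiency Ep = Sp / p
    flag = "superlinear" if ep > 1.0 else ""
    print(f"p={p:2d}  Sp={sp:5.2f}  Ep={ep:4.2f}  {flag}")
```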
5. Discussion
From the results of the parallel Givens algorithm shown previously, two effects are clearly observed. For any given number of processors applied to the problem, the speedup increases with the size of the problem. For smaller problems the processors spend a large proportion of the time communicating with other processors in order to share relevant information. However, as the problem size gets larger, the communication time (although it increases with the problem size) becomes relatively less significant compared to the time required to do the computation. Speedup is then no longer bound by communication but by computation, and begins to approach the ideal linear speedup (a speedup factor equal to the number of processors used).
The second effect demonstrated is that speedup varies with the number of processors for a given problem size. As the number of processors increases, so does the speedup, but beyond a certain number of processors the speedup saturates and will not increase further (as can be seen from Figure 5 for matrix size 1024; this is less clear for size 2048). This remains true irrespective of the number of extra processors applied to the problem.
It can be seen from the speedup and efficiency figures that both matrix sizes exhibit superlinear speedup (a speedup factor greater than the number of processors used). The cache effect is a common cause of superlinear speedup [7]. The cache is used to accelerate access to frequently used data. When the processor needs some data, it first checks whether the data is available in the cache. This can avoid having to retrieve the data from a slower source such as the primary memory (RAM). In a system with two processors, the total amount of cache is doubled because each processor includes its own cache. This allows a larger amount of data to be quickly available, and the speed can increase by more than what is expected.
Note that the superlinear effect occurs despite the fact that, as can be seen from the utilization graph (Figure 3), the algorithm is not well balanced and has a stepping effect. The improvement in the cache hit ratio for the parallel algorithm had a greater impact on performance than the communication overheads and the load imbalance.
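A rough back-of-the-envelope calculation supports this argument (our own sketch; the 2 MB second-level cache size is an assumed figure, not stated in the paper): as processors are added, each processor's share of the matrix shrinks towards the size of its cache, so a much larger fraction of the working set can be served from cache.

```python
# Per-processor share of an N x N matrix of 8-byte doubles, compared with an
# assumed per-workstation second-level cache size (hypothetical figure).
CACHE_MB = 2.0

for n in (1024, 2048):
    total_mb = n * n * 8 / 2**20           # whole matrix in MB
    for p in (1, 6, 12):
        print(f"N={n}  p={p:2d}  per-processor block ≈ {total_mb / p:6.2f} MB"
              f"  (cache ≈ {CACHE_MB} MB)")
```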
6. Conclusion
A pipelined parallel direct linear solver algorithm based on Givens rotations was presented, and results showing superlinear speedup on a network of 12 Sun Ultra-10 workstations (333 MHz, 512 Mbytes RAM) were given. The superlinear speedup is attributed to the cache effect, whereby the combined cache capacity of all the processors is used rather than the single cache of a sequential processor, and this affects the number of cache misses and hits that occur during the processing. The parallel algorithm itself is not well balanced, although the pipeline effect hides the communication and synchronization overheads in the system once the pipeline is full. The load balancing of the algorithm may be improved by increasing the load of the first processor and reducing the load of the last one in the pipeline. However, when the number of processors is low (2 or 3), care must be taken not to overload the first processor and deteriorate the performance of its cache.
References
[1] I. S. Duff, "Direct Methods", Technical Report TR/PA/98/28, 1998. Available from: http://www.cerfacs.fr/algor/reports/Dissertations/TR_PA_98_28.ps.gz, accessed 1/12/2003.
[2] A. Geist et al., "PVM: Parallel Virtual Machine - A Users' Guide and Tutorial for Networked Parallel Computing", MIT Press, 1994.
[3] A. Geist, J. Kohl and P. Papadopoulos, "Visualization, Debugging, and Performance in PVM", in Proc. Visualization and Debugging Workshop, October 1994.
[4] W. Gropp et al., "Using MPI: Portable Parallel Programming with the Message Passing Interface", 2nd Edition, Scientific and Engineering Computing Series, 1999.
[5] J. Papay, M. J. Zemerly and G. R. Nudd, "Pipelining the Givens Linear Solver on Distributed Memory Machine", Supercomputer Journal, Vol. XII, No. 3, December 1996, pp. 37-43.
[6] J. W. Givens, "Computation of plane unitary rotations transforming a general matrix to a triangular form", Journal of Soc. Ind. Appl. Math., Vol. 6, 1958, pp. 26-50.
[7] http://www.eli.sdsu.edu/courses/spring96/cs662/notes/speedup/speedup.html, accessed 1/12/2003.
[8] G. M. Amdahl, "Validity of single-processor approach to achieving large-scale computing capability", Proceedings of AFIPS Conference, Reston, VA, 1967, pp. 483-485.