Parallel Block LU Factorization – An Experience in Stateless Parallel Processing
Weijun He (weijunhe@temple.edu)
Yuan Shi (shi@temple.edu, faculty supervisor)
CIS Department
Temple University
Philadelphia, PA 19122
Abstract. This paper presents an experience in parallelizing a Block LU
factorization algorithm using a Stateless Parallel Processing (SPP) framework.
Compared to message passing systems such as PVM/MPI, SPP allows a
simpler dataflow computation model that enables highly effective implicit
parallel processing. This brings a range of benefits that are not possible in
direct message-passing systems: programming ease, automated fault tolerance
and load balancing with easy granularity control. The SPP architecture also
promises high performance featuring automatic formation of SIMD, MIMD
and pipeline processor clusters at runtime.
This paper uses an LU factorization algorithm to illustrate the SPP
programming, debugging and execution methods. We use timing models to
evaluate and justify parallel performance. Timing models are also used to
project the scalability of the application in various processing environments in
order to establish relevance. The reported experiments were conducted using
Synergy V3.0 on a cluster of 4 Linux machines running RedHat 9.
1 Introduction
Difficulties in parallel programming are rooted in two areas: a) inter-process
communication and synchronization; and b) performance optimization. For
multiprocessor systems without a shared memory (such as clusters of COTS processors),
the communication and synchronization programming difficulties increase
dramatically. Identifying the "best" processing granularity for optimized performance
is another challenge. Great performance losses result if the granularity is either too
large or too small. Unfortunately, for direct message-passing systems, the processing
granularity is interwoven with the task design and cannot be easily adjusted.
In addition to programming difficulties, multiprocessor systems also face a serious
availability challenge: higher processor counts adversely impact application
availability, since any single processor crash can halt the entire parallel application.
Stateless Parallel Processing (SPP) concerns a processor design and parallel
programming discipline that facilitates automated fault tolerance and load balancing. In
particular, SPP eliminates the need for a single uniform memory space while allowing
automatic program/data partitioning with high performance. An SPP system can
automatically form SIMD, MIMD and pipelined processors at runtime, and it is free
from the difficult cache coherence problems. Therefore, SPP is ideally suited for using
multiple COTS processors.
Implementation of an SPP system requires a data access model using a tuple space [4]. The
tuple space data access model is in theory identical to the dataflow model. It allows
implicit parallel processing and thus simplifies programming.
There have been numerous efforts exploiting the tuple space programming paradigm
and its implementation techniques. In this paper, we focus on a server-centric
implementation (Synergy) that offers fault resilience against multiple worker faults but is
known to add overhead to application inter-process communication when
compared to direct message-passing systems. A full SPP implementation will eliminate
this overhead and offer fault resilience against multiple processor and network failures.
In this paper, we use a Block LU factorization algorithm to illustrate the programming
and performance tuning features of SPP. We use timing models to guide our program
design and to predict the achievable timing results. We use the measured timing results
to validate the models. Finally we examine the scalability of this application under
various processing environments.
2 Stateless Parallel Processing (SPP)
For parallel programming, SPP requires a separation between functionality
programming and performance programming (including fault tolerance). An SPP
application consists of an interconnected network of stateless parallel programs. A
stateless parallel program is any program that does not contain or generate persistent
global state information. For example, a program that uses a specific IP address and port
number to communicate with another program is NOT stateless. Similarly, a program
that spawns programs on multiple processors is also NOT stateless, for it creates a
persistent parent-child relationship. Typically, it is performance-programming practice
that produces non-stateless code.
An SPP application contains only two types of programs: master and worker. A master
is any program that distributes working assignments and assembles the results. A
worker program reads a working assignment and returns the completed result. Using the
tuple space data access model, each SPP implementation represents a complete
dataflow computer.
An SPP runtime system exploits the implied data dependencies in the parallel programs.
It optimizes processing efficiency by minimizing communication overhead and by
automatically forming SIMD, MIMD and pipelined processor clusters at runtime.
The user can further optimize performance by tuning the processing granularity,
leveraging the stateless coding style.
3 Synergy Parallel Programming and Processing System
Synergy V3.0 is a research parallel programming and processing system that has been in
use for graduate and undergraduate computer science education at Temple University since
1990. It has also been used successfully in a number of research projects at the IBM T.J. Watson
Research Center, the Wharton School of the University of Pennsylvania, and the Center for
Advanced Computing and Communications at Temple University.
Synergy uses a server-centric tuple space mechanism [5] for communication between the
parallel programs. Unlike Linda's compiler-based implementation [4], Synergy's
tuple space is implemented by servers created at runtime. Synergy programs can be
written in any sequential programming language and compiled with a link to the
Synergy runtime library.
A Synergy parallel application consists of a set of SPP programs and a configuration
specification that links the programs into a logical parallel application.
The Synergy runtime system interprets the configuration specification, activates the
specified components and provides functions for monitoring and control. It provides
debugging support through concurrent activation of multiple gdb debuggers. It also
provides automatic fault tolerance against multiple worker processor faults.
Currently, Synergy V3.0 supports RedHat Linux 9 and SunOS 5.9.
4 Synergy Application Programming Interface (SAPI)
The Synergy runtime library contains the following calls:

- tsd = cnf_open("TupleSpaceName") – opens a named tuple space object and returns a handle.
- cnf_term() – terminates a running program.
- cnf_tsput(tsd, TupleName, Buffer) – inserts a named tuple into an opened space object.
- cnf_tsget(tsd, TupleName, &Buffer) – fetches a matching tuple and removes it from the given space object.
- cnf_tsread(tsd, TupleName, &Buffer) – reads the contents of a matching tuple in the given space object.

There are three environment variable fetching calls:

- cnf_gett() – gets the value defined by the "threshold=" clause in the application specification.
- cnf_getf() – gets the value defined by the "factor=" clause in the application specification.
- cnf_getP() – gets the number of processors.

These values facilitate grain size adjustments.
Every Synergy program is separately compiled using a host language compiler and the
Synergy runtime library, and it must use SAPI in order to achieve parallel processing
benefits.
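As an illustration of how the environment calls can drive granularity decisions, the small sketch below reads the processor count and the threshold and derives a work-chunk size. This is a hypothetical policy, not part of Synergy or of the Block LU application; only cnf_getP() and cnf_gett() are Synergy calls, and their int return types are assumed here.

/* Hypothetical grain-size sketch.  Only cnf_getP() and cnf_gett() are
 * Synergy calls (Section 4); their int return types and the chunking
 * policy below are assumptions for illustration. */
extern int cnf_getP(void);   /* number of processors in the pool      */
extern int cnf_gett(void);   /* "threshold=" clause of the .csl file  */

int pick_chunk_size(int total_tasks)
{
    int p         = cnf_getP();
    int threshold = cnf_gett();                 /* e.g. threshold = 1 in BLU.csl   */
    int chunk     = total_tasks / (p * (threshold > 0 ? threshold : 1));
    return (chunk < 1) ? 1 : chunk;             /* at least one task per assignment */
}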
5 Synergy Parallel Application Configuration Specification (SPACS)
The user specifies a parallel application by linking multiple compiled
parallel programs. The following is the specification for the Block LU factorization
application (BLU.csl):
configuration: BLU;
m: master = BLUMaster (threshold = 1 debug = 0)
-> f: problem (type = TS)
-> m: worker = BLUWorker (type = slave)
-> f: result (type = TS)
-> m: master;
The tuple space names in this specification ("problem" and "result") correspond to the
cnf_open statements in the parallel programs. The "=" signs in the M-clauses equate the
logical components to the physical binary file names in the $HOME/bin directory.
This specification is processed by the Synergy runtime system to generate dynamic
tuple space daemons and a customized application control console, the Distributed
Application Controller (DAC), responsible for monitoring and controlling the entire
application.
6 Synergy Runtime Environment (SRE)
Synergy was designed to exploit resource sharing as opposed to cycle stealing. The
idea is that all users should share the same pool of resources in any way legally
permitted. To this end, it does not use SSH (Secure Shell) to activate user programs.
Every user is required to run a login-based command interpreter daemon (CID) to gain
CPU access to a remote machine. These daemons inherit all legal restrictions of the
login ID, thus avoiding any security-related concerns. A port mapping daemon (PMD)
manages the dynamic ports for all users on a single host. In an NFS (Network File
System) environment, a single shared copy of the binaries (and link library) is sufficient
for all cluster users. On multiple distributed clusters without a shared file system,
multiple copies of the Synergy binaries and consistent path naming are necessary.
Each user maintains a host list as the resource pool. The state of this pool defines
resource availability for the next parallel run. Activating CID will automatically launch
PMD if it is not already running. Multiple activations of PMD on the same node will not
affect the correctness of operation.
The following commands manage the host list states in SRE:

- cds [-a] [-f] – Checks daemon states. The default is to check only active hosts. "-a" lists inactive hosts as well. "-f" saves the output to a file named .snghosts.
- sds – Starts CIDs on all currently active hosts.
- kds – Kills all daemons on currently active hosts.
- addhost – Adds a host to the host list.
- delhost – Deletes a host from the host list.
- chosts – Chooses hosts to activate or deactivate.
Multiple cluster users can share the same processor pool at the same time.
The following commands execute, debug and monitor a Synergy application:

- prun [-debug] SCF_name [-ft] – Activates a parallel application named SCF_name. The "-debug" option prepares a virtual dataflow computer ready for debugging using multiple gdb sessions. The "-ft" option activates worker processor fault tolerance. The default is no fault tolerance, to facilitate debugging.
- pcheck – Monitors all applications started by the same login ID. It monitors at both the application level and the process level, and can also be used to terminate either a process or an entire application.
7 Synergy Programming and Communication
Direct message-passing systems allow the best possible communication
performance deliverable by the low-level networking systems [3]. However, for load
balancing and fault tolerance, indirect communication is necessary, since the target
recipient may not be present or may not be appropriate for work due to dynamic status
changes. Synergy uses indirect communication in which each tuple transmission goes
through three phases: sending, matching and receiving. The inevitable overhead can be
offset by the load balancing, automatic SIMD, MIMD and pipeline formation, and fault
tolerance benefits [5].
Similar to message passing systems, Synergy programs contain initialization,
communication and termination library calls:
tsd1 = cnf_open("problem", 0);              // Initialization: open the "problem" space
tsd2 = cnf_open("result", 0);               // Initialization: open the "result" space

strcpy(tpname, "A0");
cnf_tsput(tsd1, tpname, buffer, buf_size);  // Communication: put a tuple named "A0"

strcpy(tpname, "B*");
len = cnf_tsread(tsd2, tpname, buffer2, 0); // Communication: read a tuple matching "B*"
…
cnf_term();                                 // Termination
In a debugging session, a virtual dataflow computer is created by launching the appropriate
daemons. This allows the user to step through the programs instruction by instruction
using tools like gdb.
8 LU Factorization
Dense linear systems are commonly solved using LU factorization. Block methods in
matrix computations are widely recognized as being able to achieve high performance
on parallel computers [1].
In solving a dense linear system A.X = B, we first factorize A into L.U, where L is a unit
lower triangular matrix and U is an upper triangular matrix. There are four major types of
computations (the permutation matrix P is omitted for simplicity); the 2 x 2 block identity
given after the list shows how they arise:
- LU factorization for diagonal sub-matrices
- Solve lower triangular unit sub-linear systems like LZ = subA
- Solve upper triangular sub-linear systems like WU = subA
- Matrix product for updates like subA = subA - W.Z
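These four task types can be read off the standard 2 x 2 block factorization identity (a brief derivation added here for clarity):

$$
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
=
\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}
\begin{pmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{pmatrix},
$$

so that $L_{11}U_{11} = A_{11}$ (the diagonal LU), $L_{11}U_{12} = A_{12}$ (a lower triangular solve of the form $LZ = \mathrm{subA}$), $L_{21}U_{11} = A_{21}$ (an upper triangular solve of the form $WU = \mathrm{subA}$), and $L_{22}U_{22} = A_{22} - L_{21}U_{12}$ (a matrix-product update of the trailing block, followed by recursion on that block).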
Figure 1. Four tasks for LU Factorization: LU factorization of a diagonal block; LZ = subA; WU = subA; subA = subA - W.Z.
9 Parallel LU Factorization
The parallel LU Factorization application contains one master and one worker. The
6
master maintains a task road map according to the sequential dependencies. The worker
performs given task (there are four types) and returns corresponding result. Worker is
activated on multiple hosts, including the host that runs the master.
Assuming the total number of tasks is N, the parallel framework is as follows (a C-style
sketch of the master side is given after the pseudocode):

Master
    Initialization: put the ready-to-compute tasks into the task pool
    For i = 1 to N
        Receive a completed task from a worker
        Refresh the task roadmap
        While a new task is ready-to-compute
            Put the task into the task pool

Workers
    Get any ready-to-compute task from the pool
    Do the computing
    Return the completed task to the master
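The following is a condensed, hypothetical C sketch of the master side of this framework. Only the cnf_* call pattern follows Sections 4 and 7 (the exact cnf_tsget argument list mirrors the cnf_tsread example; consult the Synergy documentation for the precise signature); the tuple names, the buffer size and the roadmap helper functions are illustrative, not the paper's actual code.

/* Hypothetical master skeleton; assumes the Synergy runtime header is
 * included for the cnf_* declarations. */
#include <string.h>

#define MAX_TASK 65536                      /* assumed task-descriptor size     */

int  seed_ready_tasks(int ps);              /* seed initially ready tasks;
                                               returns N, the total task count  */
void refresh_roadmap(const char *done);     /* record a completed task          */
int  next_ready_task(char *name, char *task, int *len);

void master(void)
{
    int  ps = cnf_open("problem", 0);       /* task pool (problem space)        */
    int  rs = cnf_open("result",  0);       /* completed results (result space) */
    char name[32], buf[MAX_TASK];
    int  len;

    int N = seed_ready_tasks(ps);
    for (int i = 0; i < N; i++) {
        strcpy(name, "*");                  /* match any completed task         */
        cnf_tsget(rs, name, buf, 0);        /* receive and remove it            */
        refresh_roadmap(buf);               /* refresh the task roadmap         */
        while (next_ready_task(name, buf, &len))
            cnf_tsput(ps, name, buf, len);  /* release newly ready tasks        */
    }
    cnf_term();
}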
10 Optimization for Parallel Block LU Algorithm
We denote the sub-blocks as submat[i][j][status], where status = O/U/F:
O: Original,
U: Updated,
F: Finished.
The dependency graph of the sub-tasks for a 3 x 3 block factorization is shown in Figure 2
(note that submat[1][i][O] = submat[1][i][U] and submat[i][1][O] = submat[i][1][U]).
Figure 2. Dependency Graph for Block LU Factorization (3 x 3 blocks). (Nodes are the sub-block states [i][j][O/U/F]; the graph runs from [1][1][U] at the top, through the first-row and first-column solves and the trailing updates, down to [3][3][F].)
Each task involves either a matrix-matrix product or an upper (or lower) triangular solve. To
optimize parallel execution time, the task distribution should ensure minimal
waiting time between tasks. Figure 2 shows that this application does not lend itself
particularly well to parallel processing: many paths are sequential in essence. However, it contains
SIMD (multiple matrix-matrix products, or multiple L or U solves, can run in parallel), MIMD
(matrix-matrix products and L and U solves can run at the same time) and pipelined (the
wave-front dependency) processor forms.
11 Synergy Implementation
Roadmap Status Matrix
We use a K x K matrix to record the update history of the K x K blocks. Initially each
cell is assigned the value 0. Each time the corresponding sub-matrix is updated, the value
increases by 1. For cell (i,j), the block is finished when value(i,j) = min(i,j)+1, as the
small helper sketch below illustrates.
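The rule can be expressed as a small hypothetical helper; the array name PStatus and the 3 x 3 example size follow the flowcharts of Section 12, while the function names are illustrative.

/* Hypothetical roadmap-status helpers: each update of block (i,j)
 * increments PStatus[i][j]; the block is finished when the count
 * reaches min(i,j)+1, as described above. */
enum { NBLOCKS = 3 };                  /* 3 x 3 blocks in this example */
static int PStatus[NBLOCKS][NBLOCKS];  /* all cells start at 0         */

void record_update(int i, int j) { PStatus[i][j]++; }

int block_finished(int i, int j)
{
    int m = (i < j) ? i : j;           /* min(i, j)                    */
    return PStatus[i][j] == m + 1;
}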
For a 3 x 3 grid, the serialized execution path of the dependency graph is as follows
(many blocks can execute in parallel):
LU(0,0) → LZ(0,1), LZ(0,2), WU(1,0), WU(2,0) → Update(1,1), Update(1,2), Update(2,1), Update(2,2) → LU(1,1) → LZ(1,2), WU(2,1) → Update(2,2) → LU(2,2)

Figure 5. Typical Serialization Path for a 3 x 3 Block Factorization. (The original figure also shows the roadmap status matrix after each step.)
struct msg
{
int type;
int x;
int y;
int row;
int col;
float mat1[M][M];
float mat2[M][M];
};
struct res_msg
{
int type;
int x;
int y;
int row;
int col;
float res_mat[M][M];
};
Figure 6. Task and corresponding result structures
where
- x, y: the coordinates at which the result sub-matrix begins.
- row, col: the size of the result sub-matrix.
- type:
  = 0: LU factorization of the diagonal sub-matrix.
  = 1: Solve a lower triangular unit sub-linear system LZ = subA (mat1: input L; mat2: input subA); a minimal kernel sketch for this type follows the list.
  = 2: Solve an upper triangular sub-linear system WU = subA (mat1: input U; mat2: input subA).
  = 3: Matrix-matrix product, result = mat1 * mat2.
  = 4: Termination of all tasks.
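As an example of the type-1 kernel, a minimal forward-substitution routine is sketched below. It is the standard algorithm, not the paper's actual worker code: it solves L.Z = subA column by column for a unit lower-triangular L and overwrites subA with Z. M is the compile-time array dimension from Figure 6.

/* Minimal type-1 kernel sketch: solve L * Z = subA, where L (mat1) is
 * k x k unit lower-triangular and subA (mat2) is k x k; Z overwrites
 * subA.  Standard forward substitution, illustrative only. */
void solve_lz(int k, float L[][M], float subA[][M])
{
    for (int j = 0; j < k; j++)              /* one column of Z at a time    */
        for (int i = 0; i < k; i++) {
            float s = subA[i][j];
            for (int t = 0; t < i; t++)
                s -= L[i][t] * subA[t][j];   /* rows above already hold Z    */
            subA[i][j] = s;                  /* unit diagonal: no division   */
        }
}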
12 Flow Charts
Putting all of this together, the following flowcharts show how the design minimizes
synchronization. The roadmap status matrix is denoted PStatus.
Figure 7. Master Flowchart. (The master solves LU(0,0), updates the status, and puts LZ(0,1:NBLOCKS-1) and WU(1:NBLOCKS-1,0) into the Problem Space (PS). It then repeatedly receives a tuple (i,j) from the result space, updates the corresponding sub-matrix and PStatus, and, depending on the task type (LU for i=j, LZ for i<j, WU for i>j, or MatMul((i,k),(k,j))) and the PStatus counts of the affected blocks, puts every newly ready LZ, WU, MatMul and LU task into PS. The last diagonal sub-matrix is factorized by the master itself, after which the termination tuple is put into PS and the master exits.)
Figure 8. Worker Flowchart. (The worker receives a tuple from PS; if it is the termination tuple, it puts the tuple back and exits. Otherwise it dispatches on the type, solving LU(i,i), LZ(i,j), WU(i,j) or MatMul((i,k),(k,j)), and puts the result into the Result Space (RS).)
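A condensed, hypothetical C sketch of this worker loop follows. Only the cnf_* call pattern follows Sections 4 and 7 (the cnf_tsget argument list mirrors the cnf_tsread example); the tuple naming, the compute() dispatcher and the struct handling are illustrative. struct msg and struct res_msg are the structures of Figure 6.

/* Hypothetical worker skeleton matching the flowchart above; assumes
 * the Synergy runtime header and the Figure 6 structures. */
#include <stdio.h>
#include <string.h>

void compute(const struct msg *task, struct res_msg *res);  /* one of the four kernels */

void worker(void)
{
    int ps = cnf_open("problem", 0);
    int rs = cnf_open("result",  0);
    struct msg     task;
    struct res_msg res;
    char name[32];
    int  len;

    for (;;) {
        strcpy(name, "*");                    /* take any pending task            */
        len = cnf_tsget(ps, name, &task, 0);
        if (task.type == 4) {                 /* termination task (type 4)        */
            cnf_tsput(ps, name, &task, len);  /* put it back for the other workers */
            break;
        }
        compute(&task, &res);                 /* LU, LZ, WU or MatMul             */
        sprintf(name, "res_%d_%d", res.x, res.y);
        cnf_tsput(rs, name, &res, sizeof res);
    }
    cnf_term();
}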
The above parallel system contains two overlapping parts: computation and
communication. Using high performance network switches, it is possible to carry out
multiple communications concurrently.
Figure 9. Overlapping Computing and Communications. (The master host, running the tuple spaces, the data and the roadmap plus Work0, and the hosts running Work1, Work2 and Work3 are connected through a switch.)
13 Performance Analysis
In this section, we use timing models to evaluate the potential of the designed
parallel application. A timing model is a performance analysis methodology that uses
holistic time complexity models and itemized instrumentation to reveal a parallel
application's performance characteristics under different processing environments [2].
Let:
n : size of the matrix
m : number of blocks
k : block size, n = m × k
w : CPU power (algorithmic steps per second)
μ : network speed (bytes per second)
σ : number of bytes in each word
p : number of processors
Our program model eliminates most of the synchronization, and computation and
communication overlap most of the time. The parallel execution time is therefore
determined by the larger of the two components, Tcomp and Tcomm:
$$T_{par} = \max(T_{comp},\, T_{comm})$$

$$T_{comp} = \frac{2n^3}{3wp}$$

and

$$T_{comm} = A_{comm} \cdot \sigma / \mu$$

where

$$A_{comm} = \sum_{q=1}^{m}\{6(m-q) + 3(m-q)^2\}\,k^2 = \sum_{i=1}^{m-1}(6i + 3i^2)\,k^2 = m^3k^2 + o(m^3k^2) = mn^2 + o(mn^2)$$

Therefore,

$$T_{comm} = \frac{mn^2\sigma}{\mu} + o\!\left(\frac{mn^2\sigma}{\mu}\right)$$

$$T_{par} = \max(T_{comp},\, T_{comm}) = \max\!\left(\frac{2n^3}{3wp},\, \frac{mn^2\sigma}{\mu}\right) = \max\!\left(\frac{2n^3}{3wp},\, \frac{n^3\sigma}{k\mu}\right)$$

$$T_{seq} = \frac{2n^3}{3w}$$

$$\mathrm{Speedup} = \frac{T_{seq}}{T_{par}} = \frac{1}{\max\!\left(\frac{1}{p},\, \frac{3w\sigma}{2k\mu}\right)} = \min\!\left(p,\, \frac{2k\mu}{3w\sigma}\right)$$

The formula above shows that the speedup depends on p, k, μ, w and σ:
- The bigger p, k and μ, the better the speedup.
- The smaller w and σ, the better the speedup.

And the best block size (the smallest k that attains the full speedup p) is:

$$k = \frac{3pw\sigma}{2\mu}$$
Suppose that $w = 10^7$, $\mu = 10^6$, $k = 500$, $p = 4$, $m = 5$ and $\sigma = 4$; we can then vary each parameter separately to see its influence on the speedup.

Figure 10. Block LU Factorization Scalability Analysis. (Four panels plot the predicted speedup against network speed $\mu$, node computing power $w$, block size $k$, and the number of processors $p$.)
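As a worked example added here (under the reconstructed model, not from the original text), the base parameters above give the block-size threshold at which the speedup saturates at p:

$$k \;\ge\; \frac{3pw\sigma}{2\mu} \;=\; \frac{3 \cdot 4 \cdot 10^{7} \cdot 4}{2 \cdot 10^{6}} \;=\; 240,$$

so under this model the base case k = 500 already hides the communication cost, and further gains come mainly from adding processors.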
14 Experimental Results
Figure 11 shows actual parallel running results.
matrix size   block size   seq-time(s)   seq Mflops   #CPU   par time(s)   par Mflops   Speedup
500           200          0.66          125.62       3      0.69          120.61       0.96
800           300          3.75          91.14        3      2.62          130.10       1.43
1200          400          15.03         76.65        3      7.63          150.94       1.97
1500          300          32.21         69.85        3      12.24         183.80       2.63
1600          400          33.02         82.71        3      15.07         181.25       2.19
2000          500          72.70         73.36        3      34.95         152.61       2.08
1000          250          6.00          111.20       4      3.24          205.96       1.85
1000          300          7.91          84.30        4      4.77          139.86       1.66
1200          300          13.39         86.03        4      5.89          195.66       2.27
1600          400          33.02         82.71        4      13.73         198.94       2.41

Figure 11. Computation Timing Results
We observe that the best speedup (2.63) is achieved for N = 1500, delivering about 184 MFLOPS
using three processors. The timing models also indicate that if w improves substantially,
the communication term dominates and there is little chance for speedup.
15 Conclusion
This report documents the computational experiments conducted using Synergy V3.0
for a Block LU factorization algorithm. It shows that the application is easier to program
than with direct message-passing systems. It also shows that it is possible to achieve
positive results using slow networks when the processor power is modest. Timing model
analysis reveals that this positive result can go away if we use BLAS-optimized library
codes in the parallel core, because the BLAS library codes would improve the w measure
roughly ten-fold.
References
1) Gene H. Golub and Charles F. Van Loan, Matrix Computations, 3rd Edition, The Johns Hopkins University Press.
2) Yuan Shi, "Parallel Program Scalability Analysis," International Conference on Parallel and Distributed Computing, October 1997, pp. 451-456.
3) The Message Passing Interface (MPI) Standard. http://www-unix.mcs.anl.gov/mpi/
4) S. Ahuja, N. Carriero and D. Gelernter, "Linda and friends," IEEE Computer, pp. 26-32, August 1986.
5) Yuan Shi, Getting Started in Synergy, Online Documentation: http://spartan.cis.temple.edu/shi/synergy.