High-Level Parallel Software Development with Python and BSP

Konrad Hinsen
Centre de Biophysique Moléculaire (UPR 4301 CNRS)
Rue Charles Sadron
45071 Orléans Cedex 2, France
Abstract
One of the main obstacles to a more widespread use of parallel computing in science is the difficulty of implementing, testing, and maintaining parallel programs.
The combination of a simple parallel computation model, BSP, and a high-level
programming language, Python, simplifies these tasks significantly. It allows the
rapid development facilities of Python to be applied to parallel programs, providing interactive development as well as interactive debugging of parallel code.
1 Introduction
The focus of parallel computing research has traditionally been on the development and optimization of parallel algorithms, whereas implementation issues
have been considered of secondary importance. The emphasis on algorithms was
reasonable as long as parallel computers were expensive and thus rare; parallel
computing was used only for particularly important problems, which would also
justify a significant implementation effort. Nowadays, PC clusters have made parallel computing accessible to all researchers in computational science. Moreover,
a large number of carefully analyzed parallel algorithms are available. Nevertheless, parallel software remains rare. PC clusters are used either to run a few
parallelized standard codes, or they are used as arrays of individual workstations
that run independent calculations.
An important reason for the lack of parallel software development activity
among computational scientists is its reputation of being difficult. In addition to
the usual problems in software development, communication and synchronization
have to be taken into account, which can make the behaviour of a single process
non-deterministic for all practical purposes. While some parallel development
tools, in particular debuggers, exist, they have a significant learning curve and
tend to be expensive.
This article presents an attempt to resolve the difficulties of parallel software
development by the use of a high-level parallelization model (BSP) and a high-level programming language (Python). The result of this combination is a system
that permits the interactive use of parallel computers for development and testing,
including an interactive parallel debugger. The parallel development tools are
straightforward extensions of those that a Python programmer is familiar with.
A competent Python programmer who wants to move on to parallel programs
can concentrate on learning about parallel algorithms.
The paper is organized as follows: After a short presentation of BSP and
Python, the design and implementation of a BSP package for Python is described
in detail. A report of practical experience with BSP in Python is complemented
by three illustrative examples.
2 BSP
The Bulk Synchronous Parallel (BSP) [1] model divides the execution flow of a
parallel program into alternating computation and communication phases. The
combination of one computation and one communication step is called a “superstep”. The separation of computation and communication facilitates the analysis
of algorithms. In particular, it permits the estimation of the execution time based
on a three-parameter model. From an implementation point of view, grouping
communication together in a separate program phase permits a global optimization of the data exchange by the communications library.
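For reference, in the standard BSP cost model a superstep in which no processor performs more than w units of local computation and no processor sends or receives more than h data words costs approximately w + h·g + l, where g is the machine's communication throughput parameter (time per word under continuous traffic) and l its synchronization latency. Together with the number of processors p, g and l form the three-parameter machine model mentioned above.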
From a software engineering point of view, the best known advantage of BSP
compared to lower-level parallelization models such as message-passing is the impossibility of deadlocks. However, this is merely the most visible aspect of a
communication model that shifts the responsibility for timing and synchronization issues from the applications to the communications library. As with other
low/high level design decisions, the applications programmer gains simplicity but
gives up some flexibility and performance. However, the performance issue is not
as simple as it seems: while a skilled programmer can in principle always produce more efficient code with a low-level tool (be it message passing or assembly
language), it is not at all evident that a real-life program, produced in a finite
amount of time, can actually realize that theoretical advantage, especially when
the program is to be used on a wide range of machines.
A less discussed software engineering advantage of BSP is that it greatly facilitates debugging. The computations going on during a superstep are completely
independent and can thus be debugged independently. They can also be rearranged at will as seems fit for debugging. For example, the Python/BSP parallel
debugger that will be described later serializes the computations in one superstep
on a single processor. This is possible only because the runtime system knows
precisely which computations are independent. In a message-passing system, the
independent sections tend to be smaller, and identifying them is much harder.
3 Python
High-level languages such as Python [2] have only recently become popular in
scientific computing. Most scientists did not consider them sufficiently fast for
number crunching. What made high-level languages interesting in such applications was the idea of multiple language coding: the usually small parts of the
code in which most of the CPU time is spent are written in a compiled language,
usually Fortran, C, or C++, whereas the bulk of the code can be written in a
high-level language. Python in particular became popular due to its very good
integration facilities for compiled languages, which were later extended by automatic interface generators for C/C++ and Fortran. Numerical Python [3], a
Python module implementing efficient array operations for Python, also added
significantly to Python’s popularity among scientists.
The same division of labour can be used in parallel programs. An additional
aspect is, of course, communication. Programs with a relatively simple communication structure can be implemented with all the communication in Python.
However, nothing prevents C or Fortran modules from doing communication as
well.
Python itself has no provisions for inter-processor communication or synchronization. A BSP implementation for Python must therefore rely on some other
library for low-level communication. In the current implementation, this can be
either MPI [4] (via the Python MPI interface in Scientific Python [5]) or BSPlib
[6] (via the BSPlib interface in Scientific Python). The choice between the two
is made at runtime; application programmers use only the Python/BSP API in
their programs.
4 The Python BSP package
The main design criteria for the Python BSP package were ease of use, good integration into the Python environment, and suitability for treating large amounts of
data. The possibility to design and implement complex codes was also important.
Python/BSP is not a mere design study, nor is it meant to be a purely educational
tool. In particular, it was considered important to facilitate the implementation
of distributed data types in the form of special classes.
The fundamental design decisions were
• Object-oriented approach
• Communication for high-level data types
• Standard classes for simple distributed data
• Base classes for user-defined distributed data types
• Parallel-machine semantics
The last item requires an explanation. A Python/BSP program is to be read
as a program for a single parallel machine with N processors. In contrast, MPI
programs, as well as C programs using BSPlib, are programs for a single processor
that communicates with N − 1 other processors. Parallel machine semantics
were chosen because they are more natural for defining distributed data types.
A Python/BSP program has two levels, local (single processor) and global (all
processors), whereas message-passing programs work only at the local level. In
message-passing programs, communication is specified in terms of local send and
receive operations. In a Python/BSP program, communication is a synchronized
global operation in which all processors participate.
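As a minimal sketch of these semantics (using only classes and operations that are described later in this section), the following program is read as one program for the whole machine: every processor computes a message, and processor 0 prints all of them.

from Scientific.BSP import ParData, ParRootFunction

# Each processor computes a string containing its number.
message = ParData(lambda pid, nprocs: 'processor %d of %d' % (pid, nprocs))

# Exchange the messages; the local value becomes a list of the
# received messages (order unspecified).
all_messages = message.fullExchange()

# Print on processor 0 only; all other processors do nothing.
def output(messages):
    for m in messages:
        print m
ParRootFunction(output)(all_messages)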
4.1 Python/BSP concepts
Processors
A parallel machine is made up of N processors that can communicate with each
other. Each processor receives a number between 0 and N − 1. All processors
are considered equal for most purposes. For operations that give a special role
to one of the processors, this role is by default assigned to processor number 0.
For example, adding up values over all processors leaves the result by default on
processor 0.
Local and global objects
The most important concept for BSP programming in Python is the distinction
between local and global objects. Local objects are standard Python objects;
they exist on a single processor. Global objects exist on the parallel machine as
a whole. They have a local value on each processor, which may or may not be
the same everywhere.
There are several ways to create global objects, corresponding to their typical
uses. In the simplest form (ParConstant), a global object represents a constant
that is available on all processors. Another common situation is that the local
representation is a function of the processor number and the total number of
processors (ParData). For example, a global object “processor id” would have a
local value equal to the processor number. Finally, global objects often represent
data sets of which each processor stores a part (e.g. ParSequence); the local
value is then the part of the data that any one processor is responsible for.
Local and global functions
Functions being objects in Python, the same distinction between the global and
local level applies to functions as well. Standard Python functions are local
functions: their arguments are local objects, and their return values are local
objects as well. Global functions take global objects as arguments and return
global objects. A global function is defined by one or more local functions that
act on the local values of the global objects. In most cases, the local function is
the same on all processors, but it is also common to have a different function on
one processor, usually number 0, e.g. for I/O operations.
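As a small sketch, the local function below works on an ordinary list; wrapped with ParFunction (a standard wrapper described later), it becomes a global function that acts on the local value of a distributed sequence on each processor:

from Scientific.BSP import ParFunction, ParSequence
import operator

# A standard (local) Python function.
def total(items):
    return reduce(operator.add, items, 0)

# The corresponding global function, applied to a distributed
# sequence; each processor sums its own part of the data.
global_total = ParFunction(total)
partial_sums = global_total(ParSequence(range(100)))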
Local and global classes
Standard Python classes are local classes: their instances are local objects, and
their methods behave like local functions. Global classes define global objects,
and their methods behave like global functions. There are some standard classes
(e.g. ParConstant, ParData) that define the common global data types which are
meaningful with arbitrary local values. There is also support for implementing
specialized global classes that represent specific distributed data types, which
may use communication inside their methods.
Communication
According to the BSP model, all communication takes place at the end of a superstep, after the local computations. It is tempting to identify local computations
with local functions and methods. However, while the superstep is a useful concept for analyzing the performance of algorithms, it is not a useful concept for
structuring code in object-oriented design. As for serial object-oriented programs,
the code structure is defined by the data types. Communication operations occur
wherever the algorithms require them, and that is in general within the methods
of global objects. A superstep is then simply everything that happens between
two communication operations, which in general involves code in several functions
and methods. Instead of focussing on supersteps, programmers should focus on
communication operations.
In view of these considerations, it is not appropriate to follow the example of
BSPlib and define separate API routines for sending data and for synchronization,
which implies reception. Such a separation would invite erroneous situations
in which a routine sends data and then calls another function or method that
happens to perform a synchronization. This risk is eliminated in Python/BSP by
making synchronization an integral part of every communication operation. A
single API call sends data and returns the received data after the synchronization.
Python/BSP communication operations are defined as methods on global objects. An immediate consequence is that no communication is possible within
local functions or methods of local classes. However, communication is possible
within the methods of global classes, which define distributed data types.
In one important aspect, Python/BSP is much higher-level than BSPlib for C:
communication operations can transmit almost any kind of data. This is achieved
by using the general serialization protocol of Python, which generates a unique
byte string representation of any data object and also permits the reconstruction
of the object from the byte string. For complex objects, this operation can
be quite slow. An optimization was introduced for one particularly important
data type in scientific programs: arrays. Arrays are sent using their internal
binary representation, thus with minimal overhead. Programmers can therefore
optimize communication performance by sending large amounts of data in the
form of arrays, which are likely to be the preferred internal storage anyway.
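As a sketch, a large Numeric array used as local value travels in its binary form (the fullExchange operation used here is described later among the communication operations):

from Scientific.BSP import ParData
import Numeric

# Each processor creates one large array block as its local value.
block = ParData(lambda pid, nprocs: pid*Numeric.ones(100000))

# The blocks are exchanged in their binary representation,
# with minimal serialization overhead.
everywhere = block.fullExchange()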
4.2 Global data classes
The classes ParConstant, ParData, and ParSequence are the core of the BSP
module. They are global classes that represent global data objects which can be
used in communication and in parallelized computations. The three classes differ
in how their local values are specified.
ParConstant defines a constant, i.e. its local value is the same on all processors. Example:
zero = ParConstant(0)
has a local representation of 0 on all processors.
ParData defines the local value as a function of the processor number and the
total number of processors. Example:
pid = ParData(lambda pid, nprocs: pid)
has an integer (the processor number) as its local value.
ParSequence distributes its argument (which must be a Python sequence
object) over the processors as evenly as possible. Example:
integers = ParSequence(range(10))
divides the ten integers among the processors. With two processors, number 0
receives the local representation [0, 1, 2, 3, 4] and number 1 receives [5,
6, 7, 8, 9].
All these classes define the standard arithmetic and accessing operations of
Python, which are thus automatically parallelized. Moreover, method calls on the
global objects are translated into equivalent method calls on the local values, and
thus also parallelized automatically. Finally, local objects that are passed as arguments to these methods are treated like global constant objects. In combination,
these features make it possible to use large pieces of Python code unchanged in
parallel programs; only the data definitions must be modified to generate parallel
data.
Example: the serial code
text = 'Processor 0'
print text.split()[1]
becomes
text = ParData(lambda p, n: 'Processor %d' % p)
print text.split()[1]
When executed in the interactive parallel interpreter, the output is
-- Processor 0 ---------------------------------------
ParValue[0]('0')
-- Processor 1 ---------------------------------------
ParValue[1]('1')
4.3 Communication operations
A set of communication operations is implemented as methods in all of the global
data classes:
• put(pid_list)
Sends the local value to all processors in pid_list (a global object whose
local value is a list of processor numbers). Returns a global data object
whose local value is a list of all the values received from other processors,
in unspecified order.
• get(pid_list)
Requests the local value from all processors in pid_list. Returns a global
data object whose local value is a list of all the data received from other
processors, in unspecified order.
• broadcast(source_pid=0)
Transmits the local data on processor source_pid (which defaults to 0) to
all processors. Returns a global data object whose local value, identical on
all processors, is the initial value on processor source_pid.
• fullExchange()
Sends the local value of each processor to all other processors. Returns a
global data object whose local value is a list of all the received values, in
unspecified order.
• reduce(operator, zero)
Performs a reduction with operator (a function of two arguments) over
the local values of all processors using zero as initial value. The result is a
global data object whose local value is the reduction result on processor 0
and zero on all other processors.
• accumulate(operator, zero)
Performs an accumulation with operator over the local values of all processors using zero as initial value. The result is a global data object whose
local value on each processor is the reduction of the values from all processors with lower or equal number.
• alltrue()
Returns a local value of 1 (boolean true) if the local values on all processors
are true. The result can be used directly in a condition test.
• anytrue()
Returns a local value of 1 (boolean true) if at least one of the local values
on all processors is true.
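The following sketch combines several of these operations; each call constitutes a complete superstep, including the synchronization:

from Scientific.BSP import ParData
import operator

pid = ParData(lambda pid, nprocs: pid)

# Sum of all processor numbers; the result ends up on processor 0,
# with the zero value elsewhere.
total = pid.reduce(operator.add, 0)

# Make the value from processor 0 available on all processors.
everywhere = total.broadcast()

# Running sums: processor i obtains the sum of the numbers 0..i.
running = pid.accumulate(operator.add, 0)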
4.4 Useful special cases
In the communication operations described until now, it is always the local value
of the global data type that is sent, whether to one or to several receiving processors. In some situations, it is necessary to send different values to different
processors. This can in principle be achieved by a series of put operations, but a
special communication operation is both more efficient and allows more readable
code.
For this purpose, Python/BSP provides a specialized global data type called
ParMessages. Its local values are lists of pairs that combine a data item with a destination processor number. The
method exchange() sends each data item to the corresponding processor and
returns another ParMessages object storing the received data items.
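A sketch of a circular shift written with ParMessages; the construction from a ParData object and the (processor number, data) pair order are assumptions modelled on the exchangeMessages call in Appendix B, not documented API:

from Scientific.BSP import ParData, ParMessages

# Each processor addresses one message, containing its own number,
# to its right neighbour (the pair order is an assumption).
messages = ParData(lambda pid, nprocs: [((pid+1) % nprocs, pid)])
received = ParMessages(messages).exchange()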
Another frequent situation is that many objects are to be sent to the same
processors. This could be handled by a simple sequence of put operations. The
ParTuple object provides a more compact and more efficient way to write this:
a, b, c = ParTuple(a, b, c).put(pid_list)
4.5 Synchronized loops
A frequent situation in distributed data analysis is the iteration over a large
number of data elements that are distributed over the processors. As long as the
data elements are small, the simplest solution is an initial distribution followed
by local computations and a collection of the results at the end. However, the
data elements can be so big that it is not feasible to store them in memory at
the same time. In such a situation, each processor should treat only one data
element at a time, which is for example read from a file. A loop over all the data
items therefore usually involves both computation and communication (for I/O)
in each iteration.
Python/BSP provides support for such loops in the form of parallel iterator
classes. The class ParIterator implements iteration over a distributed sequence.
This is best illustrated by the following example:
data = ParSequence(range(10))
for item in ParIterator(data):
    print item
The class ParIndexIterator differs in that it doesn’t return data elements from
the distributed sequence, but indices into the distributed sequence. The above
example could thus also be written as:
data = ParSequence(range(10))
for index in ParIndexIterator(data):
    print data[index]
What makes these operations non-trivial is that the length of the sequence is in general not divisible by the number of processors. During the
last iteration, some processors might then have no more work to do. Python/BSP
handles this automatically by flagging the local value of the iteration variable
“invalid” when no more data is available. All operations involving such values
are skipped, and communication operations involving them send no data. The
application programmer does not have to take any special precautions in most
cases.
Another global class that is useful in synchronized loops is the accumulator ParAccumulator. It provides a straightforward accumulation of results in a
parallel loop, as shown in the following example:
data = ParSequence(range(10))
result = ParAccumulator(operator.add, 0)
for item in ParIterator(data):
    result.addValue(item)
print result.calculateTotal()
The value of result.calculateTotal() on processor 0 is the sum of all elements
of the distributed sequence data.
4.6 Functions
Functions being normal objects in Python, a global function can be constructed
from a local function using the global class ParConstant. If different local
functions are required on each processor, a suitable global function object can be
created with ParData.
However, many programmers consider functions fundamentally different from
data, and the required syntax for turning dissimilar local functions into a global
function is not particularly readable. Python/BSP therefore provides special
class constructors ParFunction (little more than an alias for ParConstant) and
ParRootFunction, which covers the only practically important case of dissimilar
local functions: one local function is used for processor 0, another one for all
other processors. It is most commonly used for I/O operations.
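For example, the output idiom used throughout the appendices:

from Scientific.BSP import ParRootFunction

# Processor 0 prints the result; all other processors do nothing.
def output(result):
    print result
global_output = ParRootFunction(output)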
4.7 Global class definitions
Object-oriented design for parallel computing requires the possibility to define
global data objects by corresponding classes. The standard global wrapper classes
described above are sufficient for many applications, but they have a major limitation: user-written methods, being defined at the local class level, cannot do any
communication. This limitation is lifted by the definition of specialized global
classes.
A global class definition starts with a local class definition that describes the
local behaviour; the global class is then obtained by applying ParClass. Communication methods can be inherited from ParBase. The local class definition
can define two constructors: the standard __init__ constructor, which is used for
creating a new object from inside a method in the class, and the __parinit__
constructor, which is called when a global object is created by instantiating the
global class. It receives two standard arguments (processor number and total
number of processors) in addition to the arguments passed to the global class for
instantiation.
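A minimal skeleton, with an illustrative class body (the complete mechanism is shown in Appendix B):

from Scientific.BSP import ParClass, ParBase

class _Counter(ParBase):

    def __parinit__(self, pid, nprocs, start):
        # Called when the global class is instantiated; receives the
        # processor number and processor count as extra arguments.
        self.value = start + pid

    def __init__(self, value):
        # Called when a new local object is created inside a method.
        self.value = value

Counter = ParClass(_Counter)
counters = Counter(10)   # local values 10, 11, ... on processors 0, 1, ...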
An illustration of global class definitions is given in the distributed matrix
example (Appendix B). Note that in the end (next to last line) the distributed
matrix objects are used exactly like standard objects; the parallelism is implicit.
A bigger and practically more relevant example of a global class definition is
the distributed netCDF file class that comes with the BSP package. It behaves
much like the standard Python netCDF interface, except that data is automatically distributed among the processors along one of the netCDF data dimensions,
which the user can define when opening the file.
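A hypothetical usage sketch; the class name, module path, and keyword argument below are illustrative assumptions, not the documented interface:

# All names in this snippet are assumptions for illustration only.
from Scientific.BSP.IO import ParNetCDFFile

# Open a netCDF file, distributing its data along one dimension.
f = ParNetCDFFile('data.nc', split_dimension='atom_number')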
5 Implementation
The implementation of Python/BSP is part of the Scientific Python package [5].
It consists of a set of standard Python classes. In spite of the parallel machine
semantics of the application programs, the implementation executes the Python
code on each processor independently. All communication operations end up
calling two primitives: put and sync. Depending on the availability at runtime,
these primitives use BSPlib or MPI for low-level communications. At the current
time, no effort has been made to optimize low-level communications when using
MPI. None of the specialized MPI operations are used, only simple sends and
receives. If neither MPI nor BSPlib is available, the program is executed on a
single processor. The overhead incurred by the use of global objects is often
negligible, so that it is usually not necessary to maintain a separate serial
version of the code.
The interactive parallel interpreter is almost a standard Python/BSP application of only 190 lines (including comments). It accepts commands interactively
on processor 0 and broadcasts them to all processors for execution. To capture
the output, the interpreter redirects output into an in-memory buffer on each
processor, and at the end of execution of a command, the contents of the buffers
are sent to processor 0 for printing. The only non-trivial aspect is error handling. The interpreter must detect if some processors stopped due to an error,
and then stop the other processors as well before the end of the superstep, because otherwise they would enter an endless loop at synchronization, waiting for
the “missing” one. In order to recognize such a situation, the interpreter broadcasts a special token object after the execution of each command. If a processor
terminates prematurely due to an error, this token is received by the still running
processors in the next synchronization call. A modified synchronization routine
(the only feature of the interpreter that bypasses the standard BSP API) catches
this event and ends execution of the still running processes.
5.1 Debugging
A notoriously difficult problem in parallel programming is debugging. With
message-passing systems, there is a fundamental problem in that debugging can
break the synchronization mechanism of the code by slowing down the process(es)
being debugged. With BSP, debugging within a superstep merely slows down the
execution of the program as a whole. There remain the technical problems of implementing debugging over several processors.
Python has very flexible debugging support in the interpreter: a debugger can
install callback functions that are invoked at various points during execution of
a program, e.g. before the execution of each statement. This makes it possible
to implement debuggers in Python and to debug programs without any special
preparation. The standard Python debugger makes use of these features.
Python/BSP comes with a special debugging tool that builds on the standard
Python debugger. To facilitate debugging, it serializes the execution of a BSP
program. Each process becomes a thread, and a global scheduler runs each thread
up to the synchronization operation at the end of the superstep, at which point it
hands over control to the next thread. Communication operations become simple
copy operations in this serial mode. Each thread can be debugged much like a
standard Python program using the standard Python debugger commands. It
is also possible to inspect the thread state after a crash using the post-mortem
analysis option of the Python debugger.
The disadvantage of this approach is that it breaks the parallel machine semantics; the user has to be aware of the execution on individual processors. A
better solution would be to implement debugging at the parallel machine level and
provide operations such as collective single-stepping through the code, with inspection of variable states in all processors in parallel. While there is no technical
obstacle to the implementation of such a debugger, it requires considerably more
effort than the current serializing debugger, which inherits most of its operations
from the standard serial Python debugger.
6 Practical experience
Python/BSP is still a rather new tool; practical experience is therefore limited to
a few applications. One of them, the solution of partial differential equations, has
been documented in detail [7]. Another one, the analysis of molecular dynamics
simulation data, is illustrated by the example in Appendix C. It is particularly
difficult to make statements about performance, as it depends strongly on the
algorithms and their communication patterns.
The most frequent mistake during program development is a confusion of local and global levels, in particular forgetting ParConstant constructors for global
constants, and calling local functions with global data arguments. These mistakes
are immediately caught in testing because they cause a Python exception, but
they slow down development. The current implementation does not give particularly helpful error messages in such situations, but that will improve in future
releases.
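A sketch of the most frequent mistake and its correction (total is an illustrative helper):

from Scientific.BSP import ParSequence, ParFunction
import operator

data = ParSequence(range(10))

def total(items):
    return reduce(operator.add, items, 0)

# Mistake: a local function applied to a global object fails with
# a Python exception.
# print total(data)

# Correct: wrap the local function so that it acts on local values.
print ParFunction(total)(data)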
Ideally, the syntactic distinction between local and global objects should disappear completely. It is clear from the context (local function vs. global parallel
machine level) if a newly created object is a local one or the local value of a global
one. However, such an approach cannot be implemented in Python because it
would require modifications to standard Python data types. It might be worthwhile to pursue a different approach: integrate BSP into the Python interpreter
by adding communication operations and some other details to all data types.
Such a modified Python interpreter could still execute standard Python code,
with negligible overhead, but also parallel code. It could also provide special
syntax for defining parallel functions and classes, and thereby help to simplify
parallel software development.
On the whole, the experience with Python/BSP is positive. Development is
rapid, and the resulting code tends to be stable. The possibilities for carrying
object-oriented design over to parallel applications are particularly useful for
developing libraries of standard code. A somewhat unexpected field of application
is the administration and maintenance of parallel machines. A quickly written
parallel script or even a few lines of interactively entered code can often do a
better job than specialized administration tools, which do not have the power of
the Python standard library behind them.
Acknowledgements
The design of Python/BSP was influenced by BSMLlib for OCaml [8]. Discussions with G. Hains, Q. Miller, and R. Bisseling were helpful for understanding
the BSP model and its applications.
References
[1] L. Valiant, "A bridging model for parallel computation", Communications of
the ACM, 33(8), 103 (1990)
[2] G. van Rossum et al., Python, http://www.python.org/
[3] D. Ascher, P. Dubois, K. Hinsen, J. Hugunin and T. Oliphant, Numerical
Python, Technical Report UCRL-MA-128569, Lawrence Livermore National Laboratory, 2001, http://numpy.sourceforge.net
[4] Message Passing Interface Forum, "MPI: A Message-Passing Interface Standard", Technical Report CS-94-230, University of Tennessee, Knoxville, 1994
[5] K. Hinsen, Scientific Python, http://dirac.cnrs-orleans.fr/ScientificPython
[6] J.M.D. Hill, W.F. McColl, D.C. Stefanescu, M.W. Goudreau, K. Lang,
S.B. Rao, T. Suel, T. Tsantilas and R. Bisseling, "BSPlib, the BSP
programming library", Technical report, BSP Worldwide, May 1997.
http://www.bsp-worldwide.org/
[7] K. Hinsen, H.P. Langtangen, O. Skavhaug, Å. Ødegård, "Using BSP and
Python to Simplify Parallel Programming", accepted by Future Generation
Computer Systems
[8] G. Hains and F. Loulergue, "Functional Bulk Synchronous Parallel Programming using the BSMLlib Library", CMPP'2000 International Workshop on
Constructive Methods for Parallel Programming, S. Gorlatch ed., Ponte de
Lima, Portugal, Fakultät für Mathematik und Informatik, Univ. Passau,
Germany, Technical Report, July 2000
Appendix A: A simple example
The following example shows how a systolic loop is implemented in Python. The
loop applies some computation to all possible pairs of items in a sequence. In
a real application, the computation done on each pair would of course be more
complicated.
from Scientific.BSP import ParData, ParSequence, ParAccumulator, \
                           ParFunction, ParRootFunction, \
                           numberOfProcessors
import operator

# Local and global computation functions.
def makepairs(sequence1, sequence2):
    pairs = []
    for item1 in sequence1:
        for item2 in sequence2:
            pairs.append((item1, item2))
    return pairs
global_makepairs = ParFunction(makepairs)

# Local and global output functions.
def output(result):
    print result
global_output = ParRootFunction(output)

# A list of data items (here letters) distributed over the processors.
my_items = ParSequence('abcdef')

# The number of the neighbour to the right (circular).
neighbour_pid = ParData(lambda pid, nprocs: [(pid+1)%nprocs])

# Loop to construct all pairs.
pairs = ParAccumulator(operator.add, [])
pairs.addValue(global_makepairs(my_items, my_items))
other_items = my_items
for i in range(numberOfProcessors-1):
    other_items = other_items.put(neighbour_pid)[0]
    pairs.addValue(global_makepairs(my_items, other_items))

# Collect results on processor 0.
all_pairs = pairs.calculateTotal()

# Output results from processor 0.
global_output(all_pairs)
Appendix B: Distributed matrix example
This program illustrates the definition of a distributed data class. The data
elements of a distributed matrix are divided among the processors either along
the rows or the columns, as specified by the application. Matrix addition is
permitted only if the distribution scheme is the same for both matrices, which
makes it a local operation. Matrix multiplication requires communication via
methods inherited from ParBase.
from Scientific.BSP import ParClass, ParBase, numberOfProcessors
import Numeric, operator

class _DistributedMatrix(ParBase):

    def __parinit__(self, pid, nprocs, matrix, distribution_mode):
        self.full_shape = matrix.shape
        self.mode = distribution_mode
        if distribution_mode == 'row':
            chunk_size = (matrix.shape[0]+numberOfProcessors-1) \
                         / numberOfProcessors
            self.submatrix = matrix[pid*chunk_size:(pid+1)*chunk_size, :]
        elif distribution_mode == 'column':
            chunk_size = (matrix.shape[1]+numberOfProcessors-1) \
                         / numberOfProcessors
            self.submatrix = matrix[:, pid*chunk_size:(pid+1)*chunk_size]
        else:
            raise ValueError, "undefined mode " + distribution_mode

    def __init__(self, local_submatrix, full_shape, distribution_mode):
        self.submatrix = local_submatrix
        self.full_shape = full_shape
        self.mode = distribution_mode

    def __repr__(self):
        return "\n" + repr(self.submatrix)

    def __add__(self, other):
        if self.full_shape == other.full_shape and self.mode == other.mode:
            return _DistributedMatrix(self.submatrix+other.submatrix,
                                      self.full_shape, self.mode)
        else:
            raise ValueError, "incompatible matrices"

    def __mul__(self, other):
        if self.full_shape[1] != other.full_shape[0]:
            raise ValueError, "incompatible matrix shapes"
        if self.mode == 'row' or other.mode == 'column':
            raise ValueError, "not implemented"
        product = Numeric.dot(self.submatrix, other.submatrix)
        full_shape = product.shape
        chunk_size = (full_shape[0]+numberOfProcessors-1)/numberOfProcessors
        messages = []
        for i in range(numberOfProcessors):
            messages.append((i, product[i*chunk_size:(i+1)*chunk_size, :]))
        data = self.exchangeMessages(messages)
        sum = reduce(operator.add, data, 0)
        return _DistributedMatrix(sum, full_shape, 'row')

DistributedMatrix = ParClass(_DistributedMatrix)

#
# A usage example
#
m = Numeric.resize(Numeric.arange(10), (8, 8))
v = Numeric.ones((8, 1))
matrix = DistributedMatrix(m, 'column')
vector = DistributedMatrix(v, 'row')
product = matrix*vector
print product
Appendix C: Production example
The following script is a simple real-life application example. It calculates the
mean-square displacement of atomic trajectories for an arbitrary molecular system. This quantity involves a sum over all atoms, which is distributed over the
processors. From an algorithmic point of view, parallelization is trivial: at the
beginning, the definition of the atomic system is read from a file and broadcast
(in the class DistributedTrajectory), then each processor calculates independently for a part of the atoms, and in the end the results are added up over
all processors. However, in a low-level parallel programming environment, the
initial distribution of the system definition alone would be a complicated task. In
Python/BSP, it is no more than a broadcast operation on a single object.
# Mean-Square Displacement
#
from parMoldyn.Trajectory import DistributedTrajectory
from parMoldyn.Correlation import acf
from Scientific.BSP import ParConstant, ParSequence, \
                           ParIterator, ParAccumulator, \
                           ParFunction, ParRootFunction
from Scientific.IO.ArrayIO import writeArray
import Numeric, operator

filename = 'lysozyme_1000bar_100K.nc'
trajectory = DistributedTrajectory(filename, local_access='auto')
universe = trajectory.universe
time = trajectory.readVariable('time')

def msd_calc_local(t, weight):
    series = t.array
    dsq = series[:,0]**2+series[:,1]**2+series[:,2]**2
    sum_dsq1 = Numeric.add.accumulate(dsq)
    sum_dsq2 = Numeric.add.accumulate(dsq[::-1])
    sumsq = 2.*sum_dsq1[-1]
    msd = sumsq - Numeric.concatenate(([0.], sum_dsq1[:-1])) \
                - Numeric.concatenate(([0.], sum_dsq2[:-1]))
    Sab = 2.*Numeric.add.reduce(acf(series, 0), 1)
    return (msd-Sab)/(len(series)-Numeric.arange(len(series)))*weight
msd_calc = ParFunction(msd_calc_local)

atoms = ParSequence(universe.atomList())
weight = ParConstant(1.)/universe.numberOfAtoms()
msd = ParAccumulator(operator.add, 0.)
for atom in ParIterator(atoms):
    pt = trajectory.readParticleTrajectory(atom)
    msd.addValue(msd_calc(pt, weight))
msd = msd.calculateTotal()

def write(msd, time, filename):
    writeArray(Numeric.transpose(Numeric.array([time, msd])), filename)
ParRootFunction(write)(msd, time, 'msd_par.plot')