High-Level Parallel Software Development with Python and BSP

Konrad Hinsen
Centre de Biophysique Moléculaire (UPR 4301 CNRS)
Rue Charles Sadron
45071 Orléans Cedex 2, France

Abstract

One of the main obstacles to a more widespread use of parallel computing in science is the difficulty of implementing, testing, and maintaining parallel programs. The combination of a simple parallel computation model, BSP, and a high-level programming language, Python, simplifies these tasks significantly. It allows the rapid development facilities of Python to be applied to parallel programs, providing interactive development as well as interactive debugging of parallel code.

1 Introduction

The focus of parallel computing research has traditionally been on the development and optimization of parallel algorithms, whereas implementation issues have been considered of secondary importance. The emphasis on algorithms was reasonable as long as parallel computers were expensive and thus rare; parallel computing was used only for particularly important problems, which would also justify a significant implementation effort. Nowadays, PC clusters have made parallel computing accessible to all researchers in computational science. Moreover, a large number of carefully analyzed parallel algorithms are available. Nevertheless, parallel software remains rare. PC clusters are used either to run a few parallelized standard codes, or they are used as arrays of individual workstations that run independent calculations. An important reason for the lack of parallel software development activity among computational scientists is its reputation of being difficult. In addition to the usual problems in software development, communication and synchronization have to be taken into account, which can make the behaviour of a single process non-deterministic for all practical purposes.
While some parallel development tools, in particular debuggers, exist, they have a significant learning curve and tend to be expensive. This article presents an attempt to resolve the difficulties of parallel software development by the use of a high-level parallelization model (BSP) and a high-level programming language (Python). The result of this combination is a system that permits the interactive use of parallel computers for development and testing, including an interactive parallel debugger. The parallel development tools are straightforward extensions of those that a Python programmer is familiar with. A competent Python programmer who wants to move on to parallel programs can concentrate on learning about parallel algorithms. The paper is organized as follows: after a short presentation of BSP and Python, the design and implementation of a BSP package for Python is described in detail. A report of practical experience with BSP in Python is complemented by three illustrative examples.

2 BSP

The Bulk Synchronous Parallel (BSP) [1] model divides the execution flow of a parallel program into alternating computation and communication phases. The combination of one computation and one communication step is called a "superstep". The separation of computation and communication facilitates the analysis of algorithms. In particular, it permits the estimation of the execution time based on a three-parameter model. From an implementation point of view, grouping communication together in a separate program phase permits a global optimization of the data exchange by the communications library. From a software engineering point of view, the best known advantage of BSP compared to lower-level parallelization models such as message passing is the impossibility of deadlocks. However, this is merely the most visible aspect of a communication model that shifts the responsibility for timing and synchronization issues from the applications to the communications library.
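The three-parameter cost model mentioned above can be sketched in a few lines. This is an illustrative formula, not code from the Python/BSP package; the parameter names w (local work), h (communication volume, the "h-relation"), g (per-word communication cost), and l (synchronization latency) follow the common BSP literature:

```python
def superstep_cost(local_work, words_communicated, g, l):
    # Cost of one superstep: the slowest computation phase, plus the
    # largest communication volume times the per-word cost g, plus
    # the synchronization latency l.
    w = max(local_work)
    h = max(words_communicated)
    return w + h * g + l

def program_cost(supersteps, g, l):
    # The cost of a BSP program is simply the sum over its supersteps.
    return sum(superstep_cost(w, h, g, l) for w, h in supersteps)
```

For example, with three processors doing 10, 12, and 8 units of work and communicating 4, 2, and 3 words, a superstep with g = 2 and l = 50 costs 12 + 4·2 + 50 = 70 units.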
As with other low/high-level design decisions, the applications programmer gains simplicity but gives up some flexibility and performance. However, the performance issue is not as simple as it seems: while a skilled programmer can in principle always produce more efficient code with a low-level tool (be it message passing or assembly language), it is not at all evident that a real-life program, produced in a finite amount of time, can actually realize that theoretical advantage, especially when the program is to be used on a wide range of machines. A less discussed software engineering advantage of BSP is that it greatly facilitates debugging. The computations going on during a superstep are completely independent and can thus be debugged independently. They can also be rearranged at will as seems fit for debugging. For example, the Python/BSP parallel debugger that will be described later serializes the computations in one superstep on a single processor. This is possible only because the runtime system knows precisely which computations are independent. In a message-passing system, the independent sections tend to be smaller, and identifying them is much harder.

3 Python

High-level languages such as Python [2] have only recently become popular in scientific computing. Most scientists did not consider them sufficiently fast for number crunching. What made high-level languages interesting in such applications was the idea of multiple-language coding: the usually small parts of the code in which most of the CPU time is spent are written in a compiled language, usually Fortran, C, or C++, whereas the bulk of the code can be written in a high-level language. Python in particular became popular due to its very good integration facilities for compiled languages, which were later extended by automatic interface generators for C/C++ and Fortran.
Numerical Python [3], a Python module implementing efficient array operations for Python, also added significantly to Python's popularity among scientists. The same division of labour can be used in parallel programs. An additional aspect is, of course, communication. Programs with a relatively simple communication structure can be implemented with all the communication in Python. However, nothing prevents C or Fortran modules from doing communication as well. Python itself has no provisions for inter-processor communication or synchronization. A BSP implementation for Python must therefore rely on some other library for low-level communication. In the current implementation, this can be either MPI [4] (via the Python MPI interface in Scientific Python [5]) or BSPlib [6] (via the BSPlib interface in Scientific Python). The choice between the two is made at runtime; application programmers use only the Python/BSP API in their programs.

4 The Python BSP package

The main design criteria for the Python BSP package were ease of use, good integration into the Python environment, and suitability for treating large amounts of data. The possibility to design and implement complex codes was also important. Python/BSP is not a mere design study, nor is it meant to be a purely educational tool. In particular, it was considered important to facilitate the implementation of distributed data types in the form of special classes. The fundamental design decisions were:

• Object-oriented approach
• Communication for high-level data types
• Standard classes for simple distributed data
• Base classes for user-defined distributed data types
• Parallel-machine semantics

The last item requires an explanation. A Python/BSP program is to be read as a program for a single parallel machine with N processors. In contrast, MPI programs, as well as C programs using BSPlib, are programs for a single processor that communicates with N − 1 other processors.
Parallel machine semantics were chosen because they are more natural for defining distributed data types. A Python/BSP program has two levels, local (single processor) and global (all processors), whereas message-passing programs work only at the local level. In message-passing programs, communication is specified in terms of local send and receive operations. In a Python/BSP program, communication is a synchronized global operation in which all processors participate.

4.1 Python/BSP concepts

Processors

A parallel machine is made up of N processors that can communicate with each other. Each processor receives a number between 0 and N − 1. All processors are considered equal for most purposes. For operations that give a special role to one of the processors, this role is by default assigned to processor number 0. For example, adding up values over all processors leaves the result by default on processor 0.

Local and global objects

The most important concept for BSP programming in Python is the distinction between local and global objects. Local objects are standard Python objects; they exist on a single processor. Global objects exist on the parallel machine as a whole. They have a local value on each processor, which may or may not be the same everywhere. There are several ways to create global objects, corresponding to their typical uses. In the simplest form (ParConstant), a global object represents a constant that is available on all processors. Another common situation is that the local representation is a function of the processor number and the total number of processors (ParData). For example, a global object "processor id" would have a local value equal to the processor number. Finally, global objects often represent data sets of which each processor stores a part (e.g. ParSequence); the local value is then the part of the data that any one processor is responsible for.
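The relation between a global object and its per-processor local values can be illustrated by a small serial simulation. The names GlobalObject, par_constant, and par_data below are invented for illustration; they merely mimic the behaviour ascribed to ParConstant and ParData in the text:

```python
N = 4  # simulated number of processors

class GlobalObject:
    # A toy stand-in for a Python/BSP global object: it simply keeps
    # the local value of every simulated processor side by side.
    def __init__(self, local_values):
        self.local_values = local_values

def par_constant(value):
    # Mimics ParConstant: the same local value on all processors.
    return GlobalObject([value] * N)

def par_data(f):
    # Mimics ParData: the local value is a function of the processor
    # number and the total number of processors.
    return GlobalObject([f(pid, N) for pid in range(N)])

zero = par_constant(0)                   # local value 0 everywhere
pid = par_data(lambda pid, nprocs: pid)  # local value = processor number
```

Here zero carries the local value 0 on every simulated processor, while pid carries a different local value (the processor number) on each one.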
Local and global functions

Functions being objects in Python, the same distinction between the global and local level applies to functions as well. Standard Python functions are local functions: their arguments are local objects, and their return values are local objects as well. Global functions take global objects as arguments and return global objects. A global function is defined by one or more local functions that act on the local values of the global objects. In most cases, the local function is the same on all processors, but it is also common to have a different function on one processor, usually number 0, e.g. for I/O operations.

Local and global classes

Standard Python classes are local classes: their instances are local objects, and their methods behave like local functions. Global classes define global objects, and their methods behave like global functions. There are some standard classes (e.g. ParConstant, ParData) that define the common global data types which are meaningful with arbitrary local values. There is also support for implementing specialized global classes that represent specific distributed data types, which may use communication inside their methods.

Communication

According to the BSP model, all communication takes place at the end of a superstep, after the local computations. It is tempting to identify local computations with local functions and methods. However, while the superstep is a useful concept for analyzing the performance of algorithms, it is not a useful concept for structuring code in object-oriented design. As for serial object-oriented programs, the code structure is defined by the data types. Communication operations occur wherever the algorithms require them, and that is in general within the methods of global objects. A superstep is then simply everything that happens between two communication operations, which in general involves code in several functions and methods.
Instead of focussing on supersteps, programmers should focus on communication operations. In view of these considerations, it is not appropriate to follow the example of BSPlib and define separate API routines for sending data and for synchronization, which implies reception. Such a separation would invite erroneous situations in which a routine sends data and then calls another function or method that happens to perform a synchronization. This risk is eliminated in Python/BSP by making synchronization an integral part of every communication operation. A single API call sends data and returns the received data after the synchronization. Python/BSP communication operations are defined as methods on global objects. An immediate consequence is that no communication is possible within local functions or methods of local classes. However, communication is possible within the methods of global classes, which define distributed data types. In one important aspect, Python/BSP is much higher-level than BSPlib for C: communication operations can transmit almost any kind of data. This is achieved by using the general serialization protocol of Python, which generates a unique byte string representation of any data object and also permits the reconstruction of the object from the byte string. For complex objects, this operation can be quite slow. An optimization was introduced for one particularly important data type in scientific programs: arrays. Arrays are sent using their internal binary representation, thus with minimal overhead. Programmers can therefore optimize communication performance by sending large amounts of data in the form of arrays, which are likely to be the preferred internal storage anyway.

4.2 Global data classes

The classes ParConstant, ParData, and ParSequence are the core of the BSP module. They are global classes that represent global data objects which can be used in communication and in parallelized computations.
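How ParSequence can spread a sequence over the processors "as evenly as possible" is easy to sketch. The contiguous-chunk layout below is consistent with the ParSequence example given in the text, but the function itself is an illustration, not the package's actual code:

```python
def distribute(sequence, nprocs):
    # Split a sequence into contiguous chunks, one per processor.
    # Ceiling division makes the first processors absorb the surplus
    # when the length is not divisible by nprocs.
    chunk = (len(sequence) + nprocs - 1) // nprocs
    return [sequence[p * chunk:(p + 1) * chunk] for p in range(nprocs)]
```

With ten elements and two processors, this yields [0, 1, 2, 3, 4] for processor 0 and [5, 6, 7, 8, 9] for processor 1, matching the ParSequence behaviour described below.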
The three classes differ in how their local values are specified. ParConstant defines a constant, i.e. its local value is the same on all processors. Example:

    zero = ParConstant(0)

has a local representation of 0 on all processors. ParData defines the local value as a function of the processor number and the total number of processors. Example:

    pid = ParData(lambda pid, nprocs: pid)

has an integer (the processor number) as its local value. ParSequence distributes its argument (which must be a Python sequence object) over the processors as evenly as possible. Example:

    integers = ParSequence(range(10))

divides the ten integers among the processors. With two processors, number 0 receives the local representation [0, 1, 2, 3, 4] and number 1 receives [5, 6, 7, 8, 9].

All these classes define the standard arithmetic and accessing operations of Python, which are thus automatically parallelized. Moreover, method calls on the global objects are translated into equivalent method calls on the local values, and thus also parallelized automatically. Finally, local objects that are passed as arguments to these methods are treated like global constant objects. In combination, these features make it possible to use large pieces of Python code unchanged in parallel programs; only the data definitions must be modified to generate parallel data. Example: the serial code

    text = 'Processor 0'
    print text.split()[1]

becomes

    text = ParData(lambda p, n: 'Processor %d' % p)
    print text.split()[1]

When executed in the interactive parallel interpreter, the output is

    -- Processor 0 ---------------------------------------
    ParValue[0]('0')
    -- Processor 1 ---------------------------------------
    ParValue[1]('1')

4.3 Communication operations

A set of communication operations is implemented as methods in all of the global data classes:

• put(pid_list)
  Sends the local value to all processors in pid_list (a global object whose local value is a list of processor numbers).
  Returns a global data object whose local value is a list of all the values received from other processors, in unspecified order.

• get(pid_list)
  Requests the local value from all processors in pid_list. Returns a global data object whose local value is a list of all the data received from other processors, in unspecified order.

• broadcast(source_pid=0)
  Transmits the local data on processor source_pid (which defaults to 0) to all processors. Returns a global data object whose local value, identical on all processors, is the initial value on processor source_pid.

• fullExchange()
  Sends the local value of each processor to all other processors. Returns a global data object whose local value is a list of all the received values, in unspecified order.

• reduce(operator, zero)
  Performs a reduction with operator (a function of two arguments) over the local values of all processors using zero as initial value. The result is a global data object whose local value is the reduction result on processor 0 and zero on all other processors.

• accumulate(operator, zero)
  Performs an accumulation with operator over the local values of all processors using zero as initial value. The result is a global data object whose local value on each processor is the reduction of the values from all processors with lower or equal number.

• alltrue()
  Returns a local value of 1 (boolean true) if the local values on all processors are true. The result can be used directly in a condition test.

• anytrue()
  Returns a local value of 1 (boolean true) if at least one of the local values on all processors is true.

4.4 Useful special cases

In the communication operations described until now, it is always the local value of the global data type that is sent, whether to one or to several receiving processors. In some situations, it is necessary to send different values to different processors.
This can in principle be achieved by a series of put operations, but a special communication operation is both more efficient and allows more readable code. For this purpose, Python/BSP provides a specialized global data type called ParMessages. Its local values are lists of (data, processor number) pairs. The method exchange() sends each data item to the corresponding processor and returns another ParMessages object storing the received data items.

Another frequent situation is that many objects are to be sent to the same processors. This could be handled by a simple sequence of put operations. The ParTuple object provides a more compact and more efficient way to write this:

    a, b, c = ParTuple(a, b, c).put(pid_list)

4.5 Synchronized loops

A frequent situation in distributed data analysis is the iteration over a large number of data elements that are distributed over the processors. As long as the data elements are small, the simplest solution is an initial distribution followed by local computations and a collection of the results at the end. However, the data elements can be so big that it is not feasible to store them in memory at the same time. In such a situation, each processor should treat only one data element at a time, which is for example read from a file. A loop over all the data items therefore usually involves both computation and communication (for I/O) in each iteration. Python/BSP provides support for such loops in the form of parallel iterator classes. The class ParIterator implements iteration over a distributed sequence. This is best illustrated by the following example:

    data = ParSequence(range(10))
    for item in ParIterator(data):
        print item

The class ParIndexIterator differs in that it doesn't return data elements from the distributed sequence, but indices into the distributed sequence.
The above example could thus also be written as:

    data = ParSequence(range(10))
    for index in ParIndexIterator(data):
        print data[index]

What makes these operations non-trivial is that the length of the sequence is in general not divisible by the number of processors. During the last iteration, some processors might then have no more work to do. Python/BSP handles this automatically by flagging the local value of the iteration variable "invalid" when no more data is available. All operations involving such values are skipped, and communication operations involving them send no data. The application programmer does not have to take any special precautions in most cases.

Another global class that is useful in synchronized loops is the accumulator ParAccumulator. It provides a straightforward accumulation of results in a parallel loop, as shown in the following example:

    data = ParSequence(range(10))
    result = ParAccumulator(operator.add, 0)
    for item in ParIterator(data):
        result.addValue(item)
    print result.calculateTotal()

The value of result.calculateTotal() on processor 0 is the sum of all elements of the distributed sequence data.

4.6 Functions

Functions being normal objects in Python, a global function can be constructed from a local function using the global class ParConstant. If different local functions are required on each processor, a suitable global function object can be created with ParData. However, many programmers consider functions fundamentally different from data, and the required syntax for turning dissimilar local functions into global functions is not particularly readable. Python/BSP therefore provides special class constructors ParFunction (little more than an alias for ParConstant) and ParRootFunction, which covers the only practically important case of dissimilar local functions: one local function is used for processor 0, another one for all other processors. It is most commonly used for I/O operations.
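The "invalid value" mechanism that lets the ParIterator of section 4.5 cope with sequences whose length is not divisible by the number of processors can be simulated serially. INVALID and synchronized_iterations are illustrative names, not part of the package:

```python
INVALID = object()  # marker for "no data left on this processor"

def synchronized_iterations(local_chunks):
    # Yield one tuple per loop iteration (i.e. per superstep) with the
    # current local value of each simulated processor. Processors whose
    # chunk is exhausted get INVALID, mimicking the flag that makes
    # Python/BSP skip operations on such values.
    steps = max(len(chunk) for chunk in local_chunks)
    for i in range(steps):
        yield tuple(chunk[i] if i < len(chunk) else INVALID
                    for chunk in local_chunks)
```

With the chunks [0, 1, 2, 3], [4, 5, 6, 7], and [8, 9] (ten elements over three processors), the last two iterations hand INVALID to the third processor while the other two keep working.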
4.7 Global class definitions

Object-oriented design for parallel computing requires the possibility to define global data objects by corresponding classes. The standard global wrapper classes described above are sufficient for many applications, but they have a major limitation: user-written methods, being defined at the local class level, cannot do any communication. This limitation is lifted by the definition of specialized global classes. A global class definition starts with a local class definition that describes the local behaviour; the global class is then obtained by applying ParClass. Communication methods can be inherited from ParBase. The local class definition can define two constructors: the standard __init__ constructor, which is used for creating a new object from inside a method in the class, and the __parinit__ constructor, which is called when a global object is created by instantiating the global class. It receives two standard arguments (processor number and total number of processors) in addition to the arguments passed to the global class for instantiation. An illustration of global class definitions is given in the distributed matrix example (Appendix B). Note that in the end (next to last line) the distributed matrix objects are used exactly like standard objects; the parallelism is implicit. A bigger and practically more relevant example of a global class definition is the distributed netCDF file class that comes with the BSP package. It behaves much like the standard Python netCDF interface, except that data is automatically distributed among the processors along one of the netCDF data dimensions, which the user can define when opening the file.

5 Implementation

The implementation of Python/BSP is part of the Scientific Python package [5]. It consists of a set of standard Python classes. In spite of the parallel machine semantics of the application programs, the implementation executes the Python code on each processor independently.
All communication operations end up calling two primitives: put and sync. Depending on the availability at runtime, these primitives use BSPlib or MPI for low-level communications. At the current time, no effort has been made to optimize low-level communications when using MPI. None of the specialized MPI operations are used, only simple sends and receives. If neither MPI nor BSPlib is available, the program is executed on a single processor. The overhead incurred by the use of global objects is often negligible, such that it is usually not necessary to maintain a separate serial version of the code. The interactive parallel interpreter is almost a standard Python/BSP application of only 190 lines (including comments). It accepts commands interactively on processor 0 and broadcasts them to all processors for execution. To capture the output, the interpreter redirects output into an in-memory buffer on each processor, and at the end of execution of a command, the contents of the buffers are sent to processor 0 for printing. The only non-trivial aspect is error handling. The interpreter must detect if some processors stopped due to an error, and then stop the other processors as well before the end of the superstep, because otherwise they would enter an endless loop at synchronization, waiting for the "missing" one. In order to recognize such a situation, the interpreter broadcasts a special token object after the execution of each command. If a processor terminates prematurely due to an error, this token is received by the still running processors in the next synchronization call. A modified synchronization routine (the only feature of the interpreter that bypasses the standard BSP API) catches this event and ends execution of the still running processes.

5.1 Debugging

A notoriously difficult problem in parallel programming is debugging.
With message-passing systems, there is a fundamental problem in that debugging can break the synchronization mechanism of the code by slowing down the process(es) being debugged. With BSP, debugging within a superstep merely slows down the execution of the program as a whole. There remain the technical problems of implementing debugging over several processors. Python has very flexible debugging support in the interpreter: a debugger can install callback functions that are invoked at various points during execution of a program, e.g. before the execution of each statement. This makes it possible to implement debuggers in Python and to debug programs without any special preparation. The standard Python debugger makes use of these features. Python/BSP comes with a special debugging tool that builds on the standard Python debugger. To facilitate debugging, it serializes the execution of a BSP program. Each process becomes a thread, and a global scheduler runs each thread up to the synchronization operation at the end of the superstep, at which point it hands over control to the next thread. Communication operations become simple copy operations in this serial mode. Each thread can be debugged much like a standard Python program using the standard Python debugger commands. It is also possible to inspect the thread state after a crash using the post-mortem analysis option of the Python debugger. The disadvantage of this approach is that it breaks the parallel machine semantics: the user has to be aware of the execution on individual processors. A better solution would be to implement debugging at the parallel machine level and provide operations such as collective single-stepping through the code, with inspection of variable states in all processors in parallel.
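The serialization strategy just described can be illustrated with generators instead of threads: run every process up to its next synchronization point before any process enters the following superstep. This is a sketch of the scheduling idea only; the real debugger uses threads and the standard Python debugger machinery:

```python
def run_serialized(processes):
    # processes: list of (pid, generator) pairs; each generator yields
    # once per superstep, at its synchronization point.
    trace = []
    superstep = 0
    while processes:
        still_running = []
        for pid, proc in processes:
            try:
                value = next(proc)            # run up to the next sync
                trace.append((superstep, pid, value))
                still_running.append((pid, proc))
            except StopIteration:
                pass                          # this process is finished
        processes = still_running
        superstep += 1
    return trace

def worker(pid):
    # A toy two-superstep process.
    yield "step A on %d" % pid
    yield "step B on %d" % pid
```

Running run_serialized([(0, worker(0)), (1, worker(1))]) interleaves the processes by superstep: both "step A" entries appear in the trace before either "step B" entry, exactly the ordering the serializing debugger guarantees.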
While there is no technical obstacle to the implementation of such a debugger, it requires considerably more effort than the current serializing debugger, which inherits most of its operations from the standard serial Python debugger.

6 Practical experience

Python/BSP is still a rather new tool; practical experience is therefore limited to a few applications. One of them, the solution of partial differential equations, has been documented in detail [7]. Another one, the analysis of molecular dynamics simulation data, is illustrated by the example in Appendix C. It is particularly difficult to make statements about performance, as it depends strongly on the algorithms and their communication patterns. The most frequent mistake during program development is a confusion of local and global levels, in particular forgetting ParConstant constructors for global constants, and calling local functions with global data arguments. These mistakes are immediately caught in testing because they cause a Python exception, but they slow down development. The current implementation does not give particularly helpful error messages in such situations, but that will improve in future releases. Ideally, the syntactic distinction between local and global objects should disappear completely. It is clear from the context (local function vs. global parallel machine level) whether a newly created object is a local one or the local value of a global one. However, such an approach cannot be implemented in Python because it would require modifications to standard Python data types. It might be worthwhile to pursue a different approach: integrate BSP into the Python interpreter by adding communication operations and some other details to all data types. Such a modified Python interpreter could still execute standard Python code, with negligible overhead, but also parallel code.
It could also provide special syntax for defining parallel functions and classes, and thereby help to simplify parallel software development. On the whole, the experience with Python/BSP is positive. Development is rapid, and the resulting code tends to be stable. The possibilities for carrying object-oriented design over to parallel applications are particularly useful for developing libraries of standard code. A somewhat unexpected field of application is the administration and maintenance of parallel machines. A quickly written parallel script or even a few lines of interactively entered code can often do a better job than specialized administration tools, which do not have the power of the Python standard library behind them.

Acknowledgements

The design of Python/BSP was influenced by BSMLlib for OCaml [8]. Discussions with G. Hains, Q. Miller, and R. Bisseling were helpful for understanding the BSP model and its applications.

References

[1] L. Valiant, "A bridging model for parallel computation", Communications of the ACM, 33(8), 103 (1990)

[2] G. van Rossum et al., Python, http://www.python.org/

[3] D. Ascher, P. Dubois, K. Hinsen, J. Hugunin and T. Oliphant, Numerical Python, Technical Report UCRL-MA-128569, Lawrence Livermore National Laboratory, 2001, http://numpy.sourceforge.net

[4] Message Passing Interface Forum, "MPI: A Message-Passing Interface Standard", Technical Report CS-94-230, University of Tennessee, Knoxville, 1994

[5] K. Hinsen, Scientific Python, http://dirac.cnrs-orleans.fr/ScientificPython

[6] J.M.D. Hill, W.F. McColl, D.C. Stefanescu, M.W. Goudreau, K. Lang, S.B. Rao, T. Suel, T. Tsantilas and R. Bisseling, "BSPlib, the BSP programming library", Technical Report, BSP Worldwide, May 1997, http://www.bsp-worldwide.org/

[7] K. Hinsen, H.P. Langtangen, O. Skavhaug, Å. Ødegård, "Using BSP and Python to Simplify Parallel Programming", accepted by Future Generation Computer Systems

[8] G. Hains and F.
Loulergue, "Functional Bulk Synchronous Parallel Programming using the BSMLlib Library", CMPP'2000 International Workshop on Constructive Methods for Parallel Programming, S. Gorlatch ed., Ponte de Lima, Portugal, Fakultät für Mathematik und Informatik, Univ. Passau, Germany, Technical Report, July 2000

A A simple example

The following example shows how a systolic loop is implemented in Python. The loop applies some computation to all possible pairs of items in a sequence. In a real application, the computation done on each pair would of course be more complicated.

from Scientific.BSP import ParData, ParSequence, ParAccumulator, \
                           ParFunction, ParRootFunction, \
                           numberOfProcessors
import operator

# Local and global computation functions.
def makepairs(sequence1, sequence2):
    pairs = []
    for item1 in sequence1:
        for item2 in sequence2:
            pairs.append((item1, item2))
    return pairs
global_makepairs = ParFunction(makepairs)

# Local and global output functions.
def output(result):
    print result
global_output = ParRootFunction(output)

# A list of data items (here letters) distributed over the processors.
my_items = ParSequence('abcdef')

# The number of the neighbour to the right (circular).
neighbour_pid = ParData(lambda pid, nprocs: [(pid+1)%nprocs])

# Loop to construct all pairs.
pairs = ParAccumulator(operator.add, [])
pairs.addValue(global_makepairs(my_items, my_items))
other_items = my_items
for i in range(numberOfProcessors-1):
    other_items = other_items.put(neighbour_pid)[0]
    pairs.addValue(global_makepairs(my_items, other_items))

# Collect results on processor 0.
all_pairs = pairs.calculateTotal()

# Output results from processor 0.
global_output(all_pairs)

B Distributed matrix example

This program illustrates the definition of a distributed data class. The data elements of a distributed matrix are divided among the processors either along the rows or the columns, as specified by the application.
Matrix addition is permitted only if the distribution scheme is the same for both matrices, which makes it a local operation. Matrix multiplication requires communication via methods inherited from ParBase.

    from Scientific.BSP import ParClass, ParBase, numberOfProcessors
    import Numeric, operator

    class _DistributedMatrix(ParBase):

        def __parinit__(self, pid, nprocs, matrix, distribution_mode):
            self.full_shape = matrix.shape
            self.mode = distribution_mode
            if distribution_mode == 'row':
                chunk_size = (matrix.shape[0]+numberOfProcessors-1) \
                             / numberOfProcessors
                self.submatrix = matrix[pid*chunk_size:(pid+1)*chunk_size, :]
            elif distribution_mode == 'column':
                chunk_size = (matrix.shape[1]+numberOfProcessors-1) \
                             / numberOfProcessors
                self.submatrix = matrix[:, pid*chunk_size:(pid+1)*chunk_size]
            else:
                raise ValueError, "undefined mode " + distribution_mode

        def __init__(self, local_submatrix, full_shape, distribution_mode):
            self.submatrix = local_submatrix
            self.full_shape = full_shape
            self.mode = distribution_mode

        def __repr__(self):
            return "\n" + repr(self.submatrix)

        def __add__(self, other):
            if self.full_shape == other.full_shape and self.mode == other.mode:
                return _DistributedMatrix(self.submatrix+other.submatrix,
                                          self.full_shape, self.mode)
            else:
                raise ValueError, "incompatible matrices"

        def __mul__(self, other):
            if self.full_shape[1] != other.full_shape[0]:
                raise ValueError, "incompatible matrix shapes"
            if self.mode == 'row' or other.mode == 'column':
                raise ValueError, "not implemented"
            product = Numeric.dot(self.submatrix, other.submatrix)
            full_shape = product.shape
            chunk_size = (full_shape[0]+numberOfProcessors-1)/numberOfProcessors
            messages = []
            for i in range(numberOfProcessors):
                messages.append((i, product[i*chunk_size:(i+1)*chunk_size, :]))
            data = self.exchangeMessages(messages)
            sum = reduce(operator.add, data, 0)
            return _DistributedMatrix(sum, full_shape, 'row')

    DistributedMatrix = ParClass(_DistributedMatrix)

    #
    # A usage example
    #
    m = Numeric.resize(Numeric.arange(10), (8, 8))
    v = Numeric.ones((8, 1))
    matrix = DistributedMatrix(m, 'column')
    vector = DistributedMatrix(v, 'row')
    product = matrix*vector
    print product

C  Production example

The following script is a simple real-life application example. It calculates the mean-square displacement of atomic trajectories for an arbitrary molecular system. This quantity involves a sum over all atoms, which is distributed over the processors. From an algorithmic point of view, parallelization is trivial: at the beginning, the definition of the atomic system is read from a file and broadcast (in the class DistributedTrajectory), then each processor calculates independently for a part of the atoms, and in the end the results are added up over all processors. However, in a low-level parallel programming environment, the initial distribution of the system definition alone would be a complicated task. In Python/BSP, it is no more than a broadcast operation on a single object.

    # Mean-Square Displacement
    #
    from parMoldyn.Trajectory import DistributedTrajectory
    from parMoldyn.Correlation import acf
    from Scientific.BSP import ParConstant, ParSequence, \
                               ParIterator, ParAccumulator, \
                               ParFunction, ParRootFunction
    from Scientific.IO.ArrayIO import writeArray
    import Numeric, operator

    filename = 'lysozyme_1000bar_100K.nc'
    trajectory = DistributedTrajectory(filename, local_access='auto')
    universe = trajectory.universe
    time = trajectory.readVariable('time')

    def msd_calc_local(t, weight):
        series = t.array
        dsq = series[:,0]**2+series[:,1]**2+series[:,2]**2
        sum_dsq1 = Numeric.add.accumulate(dsq)
        sum_dsq2 = Numeric.add.accumulate(dsq[::-1])
        sumsq = 2.*sum_dsq1[-1]
        msd = sumsq - Numeric.concatenate(([0.], sum_dsq1[:-1])) \
                    - Numeric.concatenate(([0.], sum_dsq2[:-1]))
        Sab = 2.*Numeric.add.reduce(acf(series, 0), 1)
        return (msd-Sab)/(len(series)-Numeric.arange(len(series)))*weight
    msd_calc = ParFunction(msd_calc_local)

    atoms = ParSequence(universe.atomList())
    weight = ParConstant(1.)/universe.numberOfAtoms()
    msd = ParAccumulator(operator.add, 0.)
    for atom in ParIterator(atoms):
        pt = trajectory.readParticleTrajectory(atom)
        msd.addValue(msd_calc(pt, weight))
    msd = msd.calculateTotal()

    def write(msd, time, filename):
        writeArray(Numeric.transpose(Numeric.array([time, msd])), filename)
    ParRootFunction(write)(msd, time, 'msd_par.plot')
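As a serial cross-check of the systolic loop in Appendix A: the loop produces, after numberOfProcessors-1 rotations, every ordered pair of items from the full sequence, including pairs of an item with itself. A minimal sketch of this invariant in plain Python (no Scientific.BSP dependency; the function name make_pairs_serial is ours, not part of the library):

    # Serial reference computation: all ordered pairs of items in a
    # sequence, which is what Appendix A assembles across processors.
    def make_pairs_serial(sequence):
        pairs = []
        for item1 in sequence:
            for item2 in sequence:
                pairs.append((item1, item2))
        return pairs

    all_pairs = make_pairs_serial('abcdef')
    print(len(all_pairs))   # 6 items yield 6*6 = 36 ordered pairs

On a single processor, the parallel version reduces to exactly this double loop; comparing its collected result on processor 0 against this list is a simple correctness test when moving to several processors.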