Distributed Data Assimilation - A case study Aad J. van der Steen High Performance Computing Group Utrecht University 1. The application 2. Parallel implementation 3. Model and experiments 4. Perspectives for distributed implementation The application - 1 " " " " The application, Ensflow, assimilates ocean flow data into a stochasic ocean flow model. Many realisations of the model with randomly distributed parameters forming an ensemble are run. Perodically these runs are integrated with satellite data and an optimal ensemble average is computed. The sequence of ensemble averages over time describes the development of the ocean's currents best fitting the observations. The application - 2 The region of interest in the southern tip of Africa: Data from the TOPEX/Nimbus satellite are used for the assimilation. Purpose is to understand the evolution of streams and eddies in this region. The application -3 Because of the stochastic nature of the model many realisations of the model with slightly different parameter values are to be evolved. The observations of the top layer values are interpolated to a 251x151 grid. The ensemble members are allowed to develop independently for some time and combined to find the ensemble mean R T B F With F the best estimate for the model evolution without observations. R = matrix of field measurement covariances. B = matrix of representer coefficients. The application - 4 The model performs two computational intensive tasks: 1. Generation of the ensemble members. 2. The computational flow part that describes the evolution of the stream function . Every 240 hourly timesteps an analysis of the ensemble is for the past period. done to obtain the optimal estimate Parallel implementation -1 1. Ensemble members are distributed evenly over the processors. 2. Data of ensemble members are independent and are local to the processors. 3. Only in the analysis phase to determine the globally optimal data have to be exchanged (using MPI). 4. The optimal global field is distributed and a new cycle is started. Parallel implementation -2 The program contains 2 irreducible scalar parts: 1. Initialisation, linearly dependent on the number of ensemble ne and depends on nd , the number of gridpoints members, 2 O n log nd . Init time = t i . by d 2. The analysis part for which holds that the analysis time t a n3e . On the DAS-2 systems ne 60 (for ). t i 59.5 s and t a 104 s Parallel implementation -3 The time per ensemble member per 24h time step t s30s. This amounts to 20x60x30 = 36,000s = 10h single processor time for the complete 20 day cycle considered. After the init phase a distribute operation and per analysis step a collect and a distribute operation are required. n n n 1.5 MB The total amount of data moved is x y l . The bandwidth at with this occurs is 120-140 MB/s (using Myrinet on one cluster). So, the total communication time is about 0.12s per transfer. t c 727 s Total communication time within one run . Model and experiments -1 The timing model has the following form: T p t i t a 1200t s p 36,000 36,000 t c 59.5 104 15173.5 s p p Model and experiments -2 Remarks: 1. There is a mistery with respect to the computation phases: for p = 1 t c 17 s , for p > 1 t c 30 s consistently. 2. For p < 6, using 1 CPU/node is somewhat faster, from p = 6 on, 2 CPUs/node is marginally faster due to decreasing competition for memory and faster intra-node communication. Model and experiments: Simulation results Shown is a simulation of 180 dayly periods, note the blue eddies that form counterclocwise in the Atlantic. Perspectives for distributed implementation -1 The timing model has the following form: T p t i t a 1200t s In the single-cluster implementation p and virtually independent of . p tc tc is quite small (ca. 15 s) For the distributed version this might not be the case: 1. Presently Globus cannot be used yet in conjunction with Myrinet's MPI, communication must be done via IP. 2. The geographical distance between the DAS clusters introduces non-negligable latencies. Perspectives for distributed implementation -2 As can be seen from the figure the communication time is still insignificant when distributing the model over two locations (UU and VU): Perspectives for distributed implementation -3 t c is quite erratic, more determined by synchronisation than communication time: Perspectives for distributed implementation -4 The results show that this application is excellently suited for distributed processes. Still, both communication and the analysis phase may be made more efficient: 1. When is known which process id.s are located where, first intra-cluster communication can be done, then the assembled messages can be exchanged. 3on the local ensemble 2. The analysis could be tdone n members (remember a e ) and synchronised less frequently. Perspectives for distributed implementation -5 Using more sites has a notable effect on the communication. Again, synchronisation effects are more important than the communication time proper : Sites 1 2 3 4 Exec.time (s) Comm.time(s) 3310 45.1 3274 62.9 3339 208.5 3299 151.9 12-proc. run: UU, UU+VU, UU+VU+Leiden, UU+VU+Leiden+Delft Perspectives for distributed implementation -6 This case study was a particular well suited candidate for distributed processing. Apart from improving this implementation we will proceed with three other projects that are promising: 1. Running two coupled oceanographic model within the Cactus framework. 2. Inexact sequence matching of genetic material. 3. Pattern recognition on proteomic micro arrays. Acknowledgements: Fons van Hees for the single-system parallelisation