Cochlea Simulation using Titanium Generic Immersed Boundary Software (TiGIBS)

Jason Duell (jcduell@lbl.gov),

Wren Montgomery (wren@ocf.berkeley.edu),

Simon Yau (smyau@cs.berkeley.edu)

CS267 Class Project, Dec 2001.

Katherine Yelick (Professor), David Bindel (TA)

Introduction

The Immersed Boundary Method is a numerical method developed by Charles Peskin and David McQueen of the Courant Institute of Mathematical Sciences (CIMS) to simulate a class of biological systems [1]. The method is an efficient way to simulate systems that can be represented by elastic fibres immersed in an incompressible fluid. Using this method, McQueen and Peskin have written a simulation of the mammalian heartbeat [2], which has been used in medical research.

Edward Givelberg and Julian Bunn have used this method to simulate the behaviour of the human cochlea. Their code runs on the CACR Superdome and is scalable to 64 processors.

We have modified a version of the NYU code written by Nathaniel Cowen into a generic library in Titanium. Using this generic library, a scientist can write an immersed boundary method simulation by supplying only the portion of the simulation code that is specific to their simulation.

This project attempts to use the Titanium Generic Immersed Boundary Software (TiGIBS) to run the simulation of the cochlea, and also to address some performance issues in the TiGIBS.

Titanium Generic Immersed Boundary Software

The Titanium Generic Immersed Boundary Software (TiGIBS) package provides a mechanism for simulation writers to develop their code more efficiently. It provides an API for simulation writers to access a tuned and optimised library that performs the part of the simulation that is common to all immersed boundary code, while simulation writers provide the rest of the code that is specific only to their simulation.

The package takes a minimalist approach to the API design. That is, it tries to include as little code as possible inside the package. This approach allows the simulation writers more flexibility, at the cost of more work, when they write their simulations.

Immersed Boundary Method

In the immersed boundary method, the boundaries are represented by lists of fibre points, and the fluid surrounding the boundaries is represented by a rectangular lattice. At each time step, each fibre point updates its force value to reflect the activity of the boundary, for example the contraction of the heart muscle. Then, the fibre points exert these forces onto the fluid lattice; this spreading is performed via a Dirac delta function. The velocity of each cell of the fluid lattice is then calculated from these local forces by solving the Navier-Stokes equations for a viscous incompressible fluid. Finally, each fibre point's velocity is interpolated from the velocity of the surrounding fluid, again using the Dirac delta function. After that, the fibre points are moved to new positions based on their velocities. At the next time step, the forces on the fibres are recalculated from the new fibre positions, and the whole operation is repeated [2].
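For reference, the fibre-fluid coupling can be written out explicitly. These are the standard continuous equations of the method, following [1]; in the discrete method the Dirac delta is replaced by a smoothed kernel with support of 4 grid cells per dimension, which is why each fibre point interacts with 4x4x4 = 64 fluid cells (consistent with the figure quoted in the sparse copy discussion below).

  % Navier-Stokes equations for the viscous incompressible fluid,
  % driven by the fibre force density f:
  \rho \left( \frac{\partial \mathbf{u}}{\partial t} + \mathbf{u}\cdot\nabla\mathbf{u} \right)
    = -\nabla p + \mu \Delta \mathbf{u} + \mathbf{f}, \qquad \nabla\cdot\mathbf{u} = 0

  % Force spreading: fibre forces F(s,t) become a fluid force density:
  \mathbf{f}(\mathbf{x},t) = \int \mathbf{F}(s,t)\, \delta(\mathbf{x} - \mathbf{X}(s,t))\, ds

  % Velocity interpolation: fibre points move at the local fluid velocity:
  \frac{\partial \mathbf{X}}{\partial t}(s,t) = \int \mathbf{u}(\mathbf{x},t)\, \delta(\mathbf{x} - \mathbf{X}(s,t))\, d\mathbf{x}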

TiGIBS Design

Immersed boundary simulations differ from each other only in the fibre data structure: how the fibres are arranged and activated. A generic library needs to accommodate that variation. The Titanium Generic Immersed Boundary Software library uses a class hierarchy to allow simulation writers to specify the fibre points.

All fibre points must extend a base class IbPoint. This class contains the information that the library needs to couple a point to the fluid lattice: the fibre point's coordinates, the force that it spreads onto the fluid lattice, and its velocity vector. Simulation writers write their own fibre point class that extends IbPoint and provides the code dictating how each fibre activates.
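As a concrete illustration, a minimal user-defined fibre point might look like the sketch below. The IbPoint fields and constructor are taken from the API in the appendix; the spring constant, rest position, and computeForce method are hypothetical, standing in for whatever activation law a particular simulation needs.

  // Sketch of a user-defined fibre point: a point tethered to a fixed
  // rest position by a linear spring. IbPoint and its fields come from
  // the TiGIBS API (see appendix); everything else is hypothetical.
  public class SpringPoint extends IbPoint {
    double stiffness;              // spring constant (user-chosen)
    double rest_x, rest_y, rest_z; // rest position of this point

    public SpringPoint (int pid, double x, double y, double z, double k) {
      super(pid, x, y, z);
      stiffness = k;
      rest_x = x; rest_y = y; rest_z = z;
    }

    // Called by the simulation writer's own code before each timestep:
    // fills in the force fields that TiGIBS will spread onto the fluid.
    public void computeForce () {
      force_pt1 = stiffness * (rest_x - x_coord);
      force_pt2 = stiffness * (rest_y - y_coord);
      force_pt3 = stiffness * (rest_z - z_coord);
    }
  }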

Since this is a package written for distributed architectures, the fluid lattice needs to be distributed over all processors. The decomposition scheme is determined by the library; in the current implementation, FFTW limits the decomposition to slabs. However, future implementations may switch to algorithms that benefit from a different layout, so we use the distributed array library to represent the fluid lattice.
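Under a slab decomposition each processor owns a contiguous range of grid planes along one axis, so finding the processor whose slab surrounds a fibre point is a one-line computation. The sketch below assumes equal-sized slabs split along z; the actual TiGIBS layout is internal to the library, so this is only illustrative.

  // Sketch: which processor's slab contains a given fibre point,
  // assuming nz grid planes split evenly into numProcs slabs along z.
  static int slabOwner (double z_coord, double cellSize, int nz, int numProcs) {
    int plane = (int) (z_coord / cellSize); // grid plane of the point
    if (plane < 0) plane = 0;
    if (plane >= nz) plane = nz - 1;
    int planesPerProc = nz / numProcs;      // assumes numProcs divides nz
    return plane / planesPerProc;
  }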

A typical use of the generic library is as follows. During initialisation, the program reads an input file that describes the dimensions of the fluid space, the viscosity of the fluid, and other parameters such as the number of timesteps to simulate and the duration of each timestep. The program creates a TiGIBS object, which takes care of the immersed boundary code. The program then either reads from a file or calls some subroutine to generate fibre points, and registers them with the TiGIBS object. Then the program loops through a number of simulation timesteps. At each timestep, the program calculates the force carried by each fibre point and updates the fields of each fibre point to reflect that change. The program then simply calls the TiGIBS object to advance one timestep, after which the TiGIBS object has spread these forces onto the fluid, solved the Navier-Stokes equations for the fluid velocity at the next timestep, and interpolated the velocity back to the fibre points. The coordinates and force values of the fibre points can thus be seen as the "in-parameters" of the TiGIBS simulation call, and the velocity vectors of those points as its "out-parameters". Typically the program then updates the coordinates of each fibre point using its velocity. Every so many timesteps, the program also reads the fluid grid and dumps the velocity and pressure fields, as well as the velocity and coordinates of each fibre point, to output files for analysis and visualisation. See the appendix for a detailed description of the interface.
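Put together, a driver built on the API in the appendix might look like the following sketch. The Tigibs methods (registerPoint, advanceOneIteration, move) are from the appendix; the problem parameters, the SpringPoint class from the earlier sketch, and the makePoints and dumpState routines are assumptions for illustration.

  // Sketch of a typical TiGIBS driver loop. Tigibs methods are from the
  // appendix; makePoints, computeForce and dumpState are hypothetical.
  public static void main (String[] args) {
    // hypothetical problem parameters, normally read from an input file
    int nx = 64, ny = 64, nz = 64;
    int numSteps = 1000, dumpEvery = 100;

    Tigibs sim = new Tigibs(nx, ny, nz);   // create the simulation object

    SpringPoint[] pts = makePoints();      // generate the fibre points...
    for (int i = 0; i < pts.length; i++)
      sim.registerPoint(pts[i]);           // ...and register them

    for (int step = 0; step < numSteps; step++) {
      for (int i = 0; i < pts.length; i++)
        pts[i].computeForce();             // "in-parameters": coords + forces
      sim.advanceOneIteration();           // spread, NS solve, interpolate
      sim.move();                          // "out-parameters": velocities
                                           // move the registered points
      if (step % dumpEvery == 0)
        dumpState(sim, pts, step);         // fluid fields + fibre output
    }
  }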

TiGIBS Performance

The performance of TiGIBS is highly dependent on how well the fibre points are partitioned. The Navier-Stokes solver operates on a regular lattice, so load balancing it is simple. However, since we are using an FFT-based solver, the all-to-all communication step in the transpose of the 3D FFT stresses the communication system. We use FFTW [4] to handle all the FFT calls, and it has been shown to be fairly scalable.

The performance of the interaction phases is highly dependent on how the fibre points are partitioned. In the first implementation of the contractile torus simulation, each fibre point was assigned to the processor that owns the fluid cells it interacts with. As a result, there was severe load imbalance in the interaction phase.

In the second implementation, we partitioned the fibres independently of the fluid cells; as a result, some fibre points need to interact with fluid cells that live on a different processor. This incurs communication overhead, as described in the next section.

[Figures: the two partitioning schemes. Left: the fluid-cell dependent scheme ("egg-slicer"). Right: the fluid-cell independent scheme ("pizza-cutter").]

These two pictures show how the two partitioning schemes partition the contractile torus onto 8 processors; each part of the torus assigned to a different processor is shown in a different colour. Note that under the fluid-cell dependent scheme there are only 6 colours: the two outermost processors own no fluid cells that interact with fibre points, and thus get no fibre points.

Performance results

The performance of the Navier-Stokes solver and of the interaction phases are shown separately, since the Navier-Stokes solver's performance is independent of the fibre point partitioning scheme.

NS Solver Performance

The NS solver shows almost linear speedup, indicating that FFTW deserved the Gordon Bell prize a few years back. The interaction phases did not scale, due to load imbalance (in the egg-slicer's case) or communication cost (in the pizza-cutter's case).

[Chart: NS Solver speed-up vs. number of processors (0 to 20).]

Figure 1: Performance graph of the NS Solver

Interaction steps performance

The performance of the interaction steps is shown in Figure 2. The experiment was run on a 64x64x64 fluid lattice, with a torus consisting of 67812 fibre points, on the Millennium cluster with a GM-based MPI backend.

The egg-slicer partitioning has severe load-balancing problems, resulting in a speedup of less than 5 times on 16 processors. The pizza-cutter partitioning has better load balance, but communication overhead still limits its speedup to less than 10 times on 16 processors. Also, there seems to be a saturation point: the spread force step does not go any faster beyond 8 processors. We do not know what causes this behaviour.

[Chart: four curves vs. number of processors (0 to 20): Spread Force (pizza-cutter), Interpolate Velocity (pizza-cutter), Spread Force (egg-slicer), Interpolate Velocity (egg-slicer); y-axis 0 to 450.]

<Figure 2: Performance of Torus simulation using the two partition schemes>

To quantify the communication bottleneck, we next investigate the communication pattern of the interaction steps.

Communication overhead

Figure 3 shows the communication overhead during the interaction phase for the two partitioning schemes.

[Chart: communication overhead of Egg-slicer vs. Pizza-cutter, plotted against number of processors (0 to 20); y-axis 0 to 25.]

<Figure 3: Communication overhead of the two partition schemes>

As seen from this figure, the pizza-cutter scheme has a higher communication overhead, as expected. On 16 processors, as much as 22% of the time is spent on communication. Therefore we looked into the possibility of minimising the communication cost during the interaction phase.

Sparse Array Copy

The work of one of our co-conspirators focused on trying to improve the network performance of the TiGIBS library by using a new 'sparse' array copy call in Titanium.

As previously mentioned, the FFTW library mandates that TiGIBS use a slab decomposition of the fluid cells across the processors. However, since the fibre points are not usually uniformly distributed across the slabs, a different partitioning for the fibres is typically desirable to balance computation evenly across processors. (The partitioning choice is problem-specific and is not handled by the TiGIBS library; for the torus code, it proved best to cut the torus as though one had laid it on a table and cut it like a pizza.) As a consequence, the library needs to efficiently handle a situation where some of the fluid cells a processor's fibre points interact with are owned by other processors, including fluid cells that are not in a block contiguous with the processor's own set of cells. This situation is illustrated by the figure below:

[Diagrams: "Original bounding box" (left) and "Mega bounding box" (right).]

The diagram illustrates the set of fibre points (the dots) owned by a single processor, and how they map to the fluid cell space as it has been partitioned across 4 processors. (For simplicity, the diagram has been rendered in two dimensions; in actuality the curve of dots would have the elbow-macaroni-like shape of a quarter of a torus.)

The two versions of the chart show the different strategies we used for transferring fluid cell data across processors. In our original code, we first took the list of a processor's fibre points and split it into sub-lists by the cell block in which they fall (since the fibre points move, this must be recalculated dynamically each time). In the process of constructing these lists, a simple set of min/max calculations yields a 'bounding box' that contains a superset of all the cells the fibres will need to interact with. These blocks are then copied in their entirety from the remote processors that own them. This strategy has drawbacks: it makes the code fairly complicated, and data overlapped by multiple blocks are copied repeatedly. But it seemed preferable in that it kept the total size of the copied blocks low.
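A sketch of the per-block bounding box calculation appears below. The 'Box' type and the names are hypothetical, and the real code operates on Titanium domains, but the min/max logic is the same.

  // Sketch of the bounding-box calculation: for the fibre points that
  // fall in one remote cell block, take the component-wise min/max of
  // the grid cells their stencils touch ('support' is the stencil
  // radius in cells; type and names are hypothetical).
  static class Box { int xlo, ylo, zlo, xhi, yhi, zhi; }

  static Box boundingBox (IbPoint[] pts, double cellSize, int support) {
    Box b = new Box();
    b.xlo = b.ylo = b.zlo = Integer.MAX_VALUE;
    b.xhi = b.yhi = b.zhi = Integer.MIN_VALUE;
    for (int i = 0; i < pts.length; i++) {
      // grid cell containing this fibre point
      int cx = (int) (pts[i].x_coord / cellSize);
      int cy = (int) (pts[i].y_coord / cellSize);
      int cz = (int) (pts[i].z_coord / cellSize);
      b.xlo = Math.min(b.xlo, cx - support); b.xhi = Math.max(b.xhi, cx + support);
      b.ylo = Math.min(b.ylo, cy - support); b.yhi = Math.max(b.yhi, cy + support);
      b.zlo = Math.min(b.zlo, cz - support); b.zhi = Math.max(b.zhi, cz + support);
    }
    return b;
  }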

During the development of the TiGIBS library, the Titanium language acquired a 'sparse array copy' function, which allows only a specified set of points within a specified domain to be copied. This promised to make our communication more efficient, so we set out to change our code to use it. The first change we made was to shift the bounding box calculation to create only a single 'mega' bounding box, as shown in the second diagram. Though this box is larger than the union of the smaller bounding boxes, that should not matter, since only the points we specify would be copied. Second, we changed the code so that it calculated a unique list of the points we would need from each processor. This was accomplished by adding descriptions of the cells that interact with each fibre point to a data structure that can quickly determine whether a point has already been added, and then return the list at the end. We used two such data structures: one a combination of a hashtable and a linked list, and the other a grid of booleans that is marked 'true' when a cell is needed and then scanned (for all non-empty processor spaces) to produce a list. The hashtable consumed less memory than the boolean grid, and promised to construct the list more quickly (a traversal of a linked list vs. a scan of the whole grid), but it had the disadvantage that each unique insertion required a memory allocation of a wrapper object to store the cell's point (the Titanium Point class is an immutable type, and immutable types are not allowed to hold references to other members of the same type, which makes a linked list of Points themselves impossible), plus a copy of the point's data into the wrapper.
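The boolean-grid variant is the easier of the two to sketch. The code below is a simplified, hypothetical rendering of the idea, not the actual TiGIBS code: mark every cell a fibre point's stencil touches, then scan the grid once to emit a duplicate-free, sorted list of needed cells.

  // Sketch of the 'boolean grid' duplicate eliminator over one remote
  // processor's cell space (dimensions nx x ny x nz; names hypothetical).
  static int[][] uniqueCells (int[][] touchedCells, int nx, int ny, int nz) {
    boolean[][][] needed = new boolean[nx][ny][nz];
    for (int i = 0; i < touchedCells.length; i++) {
      int[] c = touchedCells[i];                 // {x, y, z} of a touched cell
      needed[c[0]][c[1]][c[2]] = true;           // duplicates collapse here
    }
    // Scan in lexicographic order, so the output list comes out sorted.
    java.util.ArrayList<int[]> out = new java.util.ArrayList<int[]>();
    for (int x = 0; x < nx; x++)
      for (int y = 0; y < ny; y++)
        for (int z = 0; z < nz; z++)
          if (needed[x][y][z]) out.add(new int[] {x, y, z});
    return out.toArray(new int[out.size()][]);
  }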

We thus ran four sets of code: the original 'bounding blocks' algorithm; a 'mega bounding block' version (which still did a full block copy like the original algorithm; we considered it an intermediate version on the way to a sparse copy version); and the 'hashtable' and 'boolean grid' versions, which used the sparse copy call. The performance of these codes is shown below, separated into total time, setup cost (calculating the bounding boxes, and the point list where used), and network time:

[Chart: total, setup, and network times for the four versions.]

As can be seen, the results ran counter to most of our expectations: not only were the sparse copy versions comically slower than the original code, due to their very high setup costs, but the network cost of the sparse copies was higher even after discarding the setup costs. Furthermore, the full-copy mega-block version of the code, which we would have expected to be slower than the original code, proved to be the fastest algorithm.

A number of reasons appear to account for these results. The high setup costs of the sparse copy algorithms appear to be due to the much larger number of elements that must be processed compared with the bounding box calculation (each fibre point in our model interacts with 64 fluid cells, so 64 times more data had to be handled), plus the list construction cost. The 'payoff' per element processed also turned out to be very low: since the fibre points are so tightly packed together, there was an overwhelming number of duplicate fluid cells to process, and our code wound up processing an average of around 500 thousand fluid cell points in order to construct a unique list of 5 thousand points. Finally, the higher network cost of the sparse copy appears to result from the fact that our 'mega bounding box' was much less sparse than we had anticipated: for a box with a domain of 8 thousand points, we were passing a list of 5 thousand points, for approximately 60% fill. The Titanium developers had informed us that, for a simple contiguous array, approximately 10% fill is the cut-off below which the sparse copy becomes worthwhile; and while our block copies involved more complex indexing than a simple array would, and the Titanium compiler underwent some sparse copy improvements after the 10% figure was quoted to us, the effective cut-off was obviously still much lower than our 60%.
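The arithmetic behind these two observations is worth making explicit (the 8-thousand-point domain and 5-thousand-point list are the figures quoted above; the exact box shape is not stated):

  % Duplicate-processing overhead: elements touched per unique cell
  \frac{500{,}000 \text{ cells processed}}{5{,}000 \text{ unique cells}} = 100 \text{ touches per unique cell}

  % Fill ratio of the mega bounding box vs. the quoted cut-off
  \text{fill} = \frac{5{,}000}{8{,}000} \approx 60\% \gg 10\%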

One unexplained result is the slower network performance of the boolean grid algorithm versus the hashtable. This should not be, as the two should pass essentially the same list of points to the sparse copy function (the only difference being that the boolean grid algorithm passes a sorted list of points, which ought, if anything, to make the memory access pattern during gathering on the remote node more efficient). We have not carefully examined the ordering of the list of points returned by the hashtable code, however. It is also quite possible that the result simply reflects noise on the notorious Millennium cluster (we reported only minimums to try to avoid this), particularly since we ran our final timings on the night before the poster session...

While the ubiquitously dense layout of fibres (or other boundary thingies) in most immersed boundary codes probably means that the sparse array copy approach is doomed to suffer unacceptably high setup costs (from duplicate cell inserts, if nothing else), our research does raise a number of interesting questions related to optimising network costs in the TiGIBS library. It would be very interesting to know whether our speedup from using the 'mega-block' approach generalises across different problems, or whether it just happened to fit our data and choice of fibre point partitioning. It would also probably aid the chances of the TiGIBS library's adoption if it were determined how users could best be allowed to optimise their network performance without violating the library interface. One can imagine three basic approaches to this. First, the code could be kept 'as is' and users could simply be informed of how the blocks are calculated and copied, so that they could attempt to partition their data in a way that works well with the algorithm. Second, the API could allow users to provide a callback function that handles how the needed data is determined and copied. Or third, some clever and ambitious young computer scientist (note that this co-author fails all these qualifications) could attempt to automatically generate partitioning and network copying functions tailored to particular data sets.
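The second option could be as simple as a single-method interface that the library invokes whenever it needs remote fluid data. The sketch below is entirely hypothetical; nothing like it exists in the current TiGIBS API (only the DistArrayDouble3d and IbPoint types are real).

  // Hypothetical callback interface for user-controlled data transfer
  // (option 2 above). TiGIBS would call this once per interaction phase
  // instead of running its built-in bounding-box copy.
  public interface FluidFetchStrategy {
    // Given the local fibre points, fetch whatever remote fluid cells
    // they need into locally accessible storage, and return that storage.
    // How to batch, cache, or sparsify the copies is up to the user.
    DistArrayDouble3d fetchCells (IbPoint[] localPoints);
  }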

Cochlea Code

To demonstrate the portability of the TiGIBS library, we tried to use it to write a real-life non-heart-related immersed boundary simulation.

We decided to use TiGIBS to re-write a distributed version of the cochlea simulation by Ed Givelberg and Julian Bunn [3].

The Cochlea

Human hearing depends on the cochlea, which converts sound waves into excited nerves. The first step in the process is the passage of the wave from the stapes (a bone of the middle ear) to the oval window of the cochlea. This flexible membrane pulses and sends the wave down the cochlea.

Although Givelberg and Bunn have modelled the entire cochlea, we concentrated on this oval window, since it is the gateway to the rest of the inner ear.

The Model

We made use of the generalised immersed boundary method solver and the science of Givelberg & Bunn. Because of computational limitations, our window is square rather than oval. We placed this plate in a 32x32x32 fluid lattice for the calculation.

[Image: the window plate at timestep 0.]

When values of 0.499 for Poisson's ratio and 1.56e6 poise for Young's modulus are applied to a plate of thickness 0.004 cm, we get a reasonable simulation for the first few time steps:

[Images: the plate at timesteps 4, 8, and 12.]

This is especially true once we consider that the squareness of the window influences the symmetry of the nodes: the oval window is mirror-symmetric with 2 or 3 nodes, while the square window tends toward 4-fold-symmetric vibration patterns. However, after the first few time steps, it is clear that something in our simulation is not correct: instead of dying down, the vibrations increase in strength, rather than behaving linearly and decaying as they ought.

[Images: the plate at timesteps 2, 3, and 4.]

This is probably due to the fact that the boundaries of our square are not fixed, whereas Givelberg and Bunn's oval window does have spatially fixed edges.

Conclusions

We have demonstrated with this model that it is possible to use generalised immersed boundary method code, based on heart-modelling code, on a problem completely unrelated to the heart, one that differs in geometry, scale, and biological importance.

Conclusion

We have introduced the Titanium Generic Immersed Boundary Software library, which enables scientists to write immersed boundary simulations on a distributed platform with minimal effort. We also investigated the performance problems posed by writing this code on a distributed platform, and looked into one way of addressing them. Finally, we demonstrated that TiGIBS is a usable library by adapting a non-heart-related simulation to use it.

Appendix: TiGIBS API

package tigibs;

/* Class to represent a fibre point. All fibre points in TiGIBS must
   extend this class. */
public abstract class IbPoint {
  // id number
  int id;

  // the Eulerian coordinates
  public double x_coord;
  public double y_coord;
  public double z_coord;

  // velocity of this point
  public double x_vel;
  public double y_vel;
  public double z_vel;

  // the force this point is spreading onto the fluid
  public double force_pt1;
  public double force_pt2;
  public double force_pt3;

  // constructor
  public IbPoint (int pid, double x, double y, double z);

  // field extractors
  public inline int getId ();
  public inline double getXCoord ();
  public inline double getYCoord ();
  public inline double getZCoord ();

  // zeros out the velocity
  public void zeroOutVel ();
}

/* TiGIBS simulation object. ('single' is a Titanium qualifier marking
   values and methods that are coherent across all processes.) */
public class Tigibs {
  // register a fibre point in the object
  public void registerPoint (IbPoint p);

  // register a marker in the object (markers are points, but don't
  // spread force)
  public void registerMarker (IbPoint p);

  // performs the spread force, NS solver, and interpolate velocity steps
  public single void advanceOneIteration ();

  // removes all points from the object
  public single void unregisterAllPoints ();

  // constructor
  public single Tigibs (int single numOfCellsX, int single numOfCellsY,
                        int single numOfCellsZ);

  // field extractors for the fluid grids
  public single DistArrayDouble3d getU ();  // dx/dt
  public single DistArrayDouble3d getV ();  // dy/dt
  public single DistArrayDouble3d getW ();  // dz/dt
  public single DistArrayDouble3d getF1 (); // force vector
  public single DistArrayDouble3d getF2 ();
  public single DistArrayDouble3d getF3 ();
  public single DistArrayDouble3d getP ();  // pressure

  // updates the coordinates of the registered points according to
  // their velocities
  public void move ();
}

References

[1] C. S. Peskin and D. M. McQueen. A general method for the computer simulation of biological systems interacting with fluids. In Biological Fluid Dynamics, 1995. http://www.math.nyu.edu/~mcqueen/Public/papers/seb/SEB_19971216/SEB_19971216.html

[2] D. M. McQueen and C. S. Peskin. Shared-memory parallel vector implementation of the immersed boundary method for the computation of blood flow in the beating mammalian heart. Journal of Supercomputing 11(3): 213-236, 1997. http://www.math.nyu.edu/~mcqueen/Public/papers/psc/PSC_19971216/PSC.19971216.html

[3] E. Givelberg and J. Bunn. Detailed Simulation of the Cochlea: Recent Progress Using Large Shared Memory Parallel Computers. CACR Technical Report CACR-190, July 2001. http://www.cacr.caltech.edu/Publications/techpubs/cacr.190.pdf

[4] FFTW website: http://www.fftw.org
