Parallel Volume Rendering Using Binary-Swap Image Composition
Presented by Jin Ding
Spring 2002
Visualization and Advanced Computer Graphics

Introduction

Observations
• Existing volume rendering methods are very computationally intensive and therefore fail to achieve interactive rendering rates for large data sets.
• Volume data sets can be too large for a single machine to hold in memory at once.
• Trading memory for time results in another order of magnitude increase in memory use.
• An animation sequence of volume visualization normally takes hours to days to generate.
• Massively parallel computers and hundreds of high performance workstations are available!

Goals & Ways
• Distribute the increasing amount of data, as well as the time-consuming rendering process, to the tremendous distributed computing resources available to us.
• Parallel ray-tracing
• Parallel compositing

Implementations
• Hardware – CM-5 and networked workstations
• The parallel volume renderer evenly distributes data to the computing resources available.
• Each subvolume is ray-traced locally and generates a partial image.
• No need to communicate with other processing units during rendering.
• The parallel compositing process then merges all resulting partial images in depth order to produce the complete image.
• The compositing algorithm always makes use of all processing units by repeatedly subdividing the partial images and distributing them to the appropriate processing units.

Background
• The major algorithmic strategy for parallelizing volume rendering is the divide-and-conquer paradigm.
• Data-space subdivision assigns the computation associated with particular subvolumes to processors – usually applied in a distributed-memory parallel computing environment.
• Image-space subdivision distributes the computation associated with particular portions of the image space – often used in shared-memory multiprocessing environments.
• The method introduced here is a hybrid because it subdivides both data space ( during rendering ) and image space ( during compositing ).

A Divide and Conquer Algorithm

Starting Point
• The volume ray-tracing technique presented by Levoy
• The data volume is evenly sampled along each ray, usually at a rate of one or two samples per pixel.
• The compositing is front-to-back, based on Porter and Duff's over operator, which is associative: a over ( b over c ) = ( a over b ) over c.
• The associativity allows breaking a ray up into segments, processing the sampling and compositing of each segment independently, and combining the results from each segment via a final compositing step.

Data Subdivision/Load Balancing
• The divide-and-conquer algorithm requires that we partition the input data into subvolumes.
• Ideally each subvolume requires about the same amount of computation.
• Minimize the amount of data which must be communicated between processors during compositing.

Data Subdivision/Load Balancing
• The simplest method is probably to partition the volume along planes parallel to the coordinate planes of the data.
• If the viewpoint is fixed and known, the data can be subdivided into "slices" orthogonal to the coordinate plane most nearly orthogonal to the view direction.
• This produces subimages with little overlap, and therefore little communication during compositing, when orthographic projection is used.
• When the viewpoint is unknown a priori, or perspective projection is used, it is better to partition the volume equally along all coordinate planes.
• This is known as block data distribution and can be accomplished by gridding the data equally along each dimension.

Data Subdivision/Load Balancing
• This method instead uses a k-D tree structure for data subdivision.
• The tree alternates binary subdivision of the coordinate planes at each level.
• When trilinear interpolation is used, the data lying on the boundary between two subvolumes must be replicated and stored with both subvolumes.
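The k-D subdivision is only described above at slide level; the following is a minimal Python sketch ( not the authors' code ) of what alternating-axis binary subdivision with boundary replication could look like. The function name kd_subdivide, the inclusive voxel-index extents, and the fixed x → y → z split order are illustrative assumptions.

```python
# Sketch: recursively split a volume's index range into 2**depth subvolumes
# with a k-D tree that alternates the split axis (x, y, z, x, ...) at each
# level.  Extents are inclusive on both ends so the split plane is replicated
# in both children, as required when trilinear interpolation is used.

def kd_subdivide(extent, depth, axis=0):
    """extent = ((x0, x1), (y0, y1), (z0, z1)) in voxel indices, inclusive."""
    if depth == 0:
        return [extent]
    lo, hi = extent[axis]
    mid = (lo + hi) // 2                 # split plane for this level
    left = list(extent)
    right = list(extent)
    left[axis] = (lo, mid)               # this child keeps the plane 'mid'
    right[axis] = (mid, hi)              # ...and so does its sibling (replication)
    next_axis = (axis + 1) % 3           # alternate x -> y -> z -> x ...
    return (kd_subdivide(tuple(left), depth - 1, next_axis) +
            kd_subdivide(tuple(right), depth - 1, next_axis))

# Example: a 256^3 volume split across 8 processors (depth = log2(8) = 3).
subvolumes = kd_subdivide(((0, 255), (0, 255), (0, 255)), depth=3)
for rank, ext in enumerate(subvolumes):
    print(rank, ext)
```

The leaves of this tree are the per-processor subvolumes; a view-dependent traversal of the same tree later supplies the front-to-back compositing order.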
Parallel Rendering
• Local rendering is performed on each processor independently.
• Data communication is not required.
• Only rays within the image region covering the corresponding subvolume are cast and sampled.

Parallel Rendering
• Some care needs to be taken to avoid visible artifacts where subvolumes meet.
• Consistent sampling locations must be ensured across subvolume boundaries: S2( 1 ) − S1( n ) = S1( n ) − S1( n − 1 ), i.e. the spacing between the last sample of one ray segment and the first sample of the next must equal the regular sample spacing.
• Otherwise the subvolume boundary can become visible as an artifact in the final image.

Image Composition
• The final step is to merge ray segments, and thus all partial images, into the final total image.
• Both color and opacity need to be stored for each partial-image pixel.
• The rule for merging subimages is based on the over compositing operator.
• Subimages are composited in a front-to-back order.
• This order can be determined hierarchically by a recursive traversal of the k-D tree structure.
• A natural way to achieve parallelism in compositing: sibling nodes in the tree may be processed concurrently.

Image Composition
• A naïve approach is binary compositing.
• Each disjoint pair of processors produces a new subimage.
• N/2 subimages are left after the first stage.
• Half of the original processors are paired up for the next level of compositing, so the other half sit idle.
• More parallelism must be exploited to achieve subsecond compositing times.
• The binary-swap compositing method makes sure that every processor participates in all stages of the process.
• The key idea – at each compositing stage, the two processors involved in a composite operation split the image plane into two pieces.

Image Composition
• In early phases, each processor is responsible for a large portion of the image area.
• In later phases, the portion each processor is responsible for gets smaller and smaller.
• At the top of the tree, all processors have complete information for a small rectangle of the image.
• The final image can be constructed by tiling these subimages onto the display.
• Two forces affect the size of the bounding rectangle as we move up the compositing tree:
- the bounding rectangle grows due to the contributions from other processors,
- but shrinks due to the further subdivision of the image plane.

Image Composition
• The binary-swap compositing algorithm for four processors is illustrated in the accompanying figure.
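To make the exchange pattern concrete, here is a minimal single-process Python simulation ( a sketch, not the paper's implementation ). It treats each partial image as a 1-D list of premultiplied RGBA pixels, pairs ranks that differ in one bit per stage, and assumes the lower-ranked processor of each pair is in front; in the real algorithm the front-to-back order comes from the view-dependent k-D tree traversal, and the two inner loops become a pairwise send/receive. The names over and binary_swap are illustrative.

```python
# Simulation of binary-swap compositing.  Each "processor" starts with a
# full-image-sized RGBA buffer holding its locally rendered partial image.
# At stage k, ranks that differ in bit k pair up, split their current image
# region in half, exchange halves, and composite with the over operator.

def over(front, back):
    """Porter-Duff 'over' for premultiplied RGBA tuples."""
    fr, fg, fb, fa = front
    br, bg, bb, ba = back
    return (fr + (1 - fa) * br,
            fg + (1 - fa) * bg,
            fb + (1 - fa) * bb,
            fa + (1 - fa) * ba)

def binary_swap(partials):
    """partials[rank] = list of RGBA pixels; all lists have the same
    power-of-two length, and len(partials) is a power of two."""
    n = len(partials)
    images = [list(p) for p in partials]
    regions = [(0, len(partials[0])) for _ in range(n)]   # (start, end) per rank
    stage = 0
    while (1 << stage) < n:
        for rank in range(n):
            partner = rank ^ (1 << stage)
            if partner < rank:
                continue                                  # handle each pair once
            lo, hi = regions[rank]                        # both partners share this region
            mid = (lo + hi) // 2
            # rank keeps the lower half, partner keeps the upper half; each
            # composites the piece it "receives" from the other.  Depth order
            # is assumed here to be rank-in-front-of-partner.
            for i in range(lo, mid):
                images[rank][i] = over(images[rank][i], images[partner][i])
            for i in range(mid, hi):
                images[partner][i] = over(images[rank][i], images[partner][i])
            regions[rank], regions[partner] = (lo, mid), (mid, hi)
        stage += 1
    # Each rank now owns fully composited pixels in regions[rank].
    return images, regions
```

Tiling the regions owned by each rank after the final stage reproduces the complete image; the shrinking per-processor region is what drives the pixel counts in the Communication Costs analysis below.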
Communication Costs
• At the end of local rendering each processor holds a subimage of approximately p · n^(-2/3) pixels, where p is the number of pixels in the final image and n is the number of processors.
• The total number of pixels over all n processors is therefore p · n^(1/3).
• Half of these pixels are communicated in the first phase, and some reduction in the total number of pixels occurs due to the depth overlap resolved in this compositing stage.
• The k-D tree partitioning splits each of the coordinate directions in half over three levels of the tree.
• Orthogonal projection onto any coordinate plane will have an average depth overlap of 2.

Communication Costs
• This process repeats through log( n ) phases. If we number the phases from i = 1 to log( n ), each phase begins with 2^(-(i-1)/3) · n^(1/3) · p pixels and ends with 2^(-i/3) · n^(1/3) · p pixels.
• The last phase therefore ends with 2^(-log(n)/3) · n^(1/3) · p = n^(-1/3) · n^(1/3) · p = p pixels, as expected.
• At each phase, half of the pixels are communicated. Summing the pixels communicated over all phases:

pixels transmitted = sum over i = 1 .. log( n ) of ( 1/2 ) · 2^(-(i-1)/3) · n^(1/3) · p

• The 2^(-(i-1)/3) term accounts for depth overlap resolution.
• The n^(1/3) · p term accounts for the initial locally rendered image size, summed over all processors.
• The factor of 1/2 accounts for the fact that only half of the active pixels are communicated in each phase.

Communication Costs
• This sum can be bounded by pulling out the terms which don't depend on i and noticing that the remaining sum is a geometric series which converges:

pixels transmitted = sum over i = 1 .. log( n ) of ( 1/2 ) · 2^(-(i-1)/3) · n^(1/3) · p
= ( n^(1/3) · p / 2 ) · sum over i = 0 .. log( n ) − 1 of 2^(-i/3)
< ( n^(1/3) · p / 2 ) · 1 / ( 1 − 2^(-1/3) )
≈ 2.43 · n^(1/3) · p

Comparisons with Other Algorithms
• Direct send, by Hsu and Neumann, subdivides the image plane and assigns each processor a subset of the total image pixels.
• Each rendered pixel is sent directly to the processor assigned that portion of the image plane.
• Processors accumulate these subimage pixels in an array and composite them in the proper order after all rendering is done.
• The total number of pixels transmitted is n^(1/3) · p · ( 1 − 1/n ).
• Requires each rendering processor to send its rendering results to, potentially, every other processor.
• Interleaving the image regions assigned to different processors is recommended to ensure good load balance and network utilization.

Comparisons with Other Algorithms
• Direct send
- may require n( n − 1 ) messages to be transmitted
- sends each partial ray segment result only once
- in an asynchronous message-passing environment, O( 1 ) latency cost
- in a synchronous environment, O( n )
• Binary-swap
- requires n · log( n ) messages to be transmitted
- requires log( n ) communication phases
- latency cost O( log( n ) ) whether asynchronous or synchronous communications are used
- exploits faster nearest-neighbor communication paths when they exist

Comparisons with Other Algorithms
• Projection, by Camahort and Chakravarty, uses a 3D grid decomposition of the volume data.
• Parallel compositing is accomplished by propagating ray segment results front-to-back along the path of the ray through the volume to the processors holding the neighboring parts of the volume.
• Each processor composites the incoming data with its own local subimage data before passing the results on to its neighbors in the grid.
• The final image is projected onto a subset of the processor nodes: those assigned the outer back faces in the 3D grid decomposition.

Comparisons with Other Algorithms
• Projection
- requires O( n^(1/3) · p ) messages to be transmitted
- each processor sends its results to at most 3 neighboring processors in the 3D grid
- only 3 message sends per processor
- each final image pixel must be routed through n^(1/3) processor nodes, on average, on its way to a face of the volume, thus latency costs grow by O( n^(1/3) )
• Binary-swap
- requires n · log( n ) messages to be transmitted
- requires log( n ) communication phases
- exploits faster nearest-neighbor communication paths when they exist
- latency cost O( log( n ) ) whether asynchronous or synchronous communications are used
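As a quick sanity check of the formulas above ( not from the paper ), the short Python script below evaluates the exact per-phase sum for binary-swap, its geometric-series bound of roughly 2.43 · n^(1/3) · p, and direct send's n^(1/3) · p · ( 1 − 1/n ). The values n = 512 and a 512 x 512 image are arbitrary example inputs.

```python
# Numeric check of the communication-cost formulas quoted above.
import math

def binary_swap_pixels(n, p):
    """Sum over phases i = 1..log2(n) of (1/2) * 2**(-(i-1)/3) * n**(1/3) * p."""
    return sum(0.5 * 2 ** (-(i - 1) / 3) * n ** (1 / 3) * p
               for i in range(1, int(math.log2(n)) + 1))

n, p = 512, 512 * 512                                    # 512 nodes, 512x512 image
exact = binary_swap_pixels(n, p)
bound = 0.5 / (1 - 2 ** (-1 / 3)) * n ** (1 / 3) * p     # ~2.43 * n^(1/3) * p
direct = n ** (1 / 3) * p * (1 - 1 / n)                  # direct send total
print(f"binary-swap exact: {exact:.3e}  bound: {bound:.3e}  direct send: {direct:.3e}")
```

The output illustrates the trade-off summarized above: binary-swap transmits somewhat more pixels than direct send, but needs only n · log( n ) messages instead of up to n( n − 1 ).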
Implementation of the Renderer
• Two versions – one on the CM-5 and another on groups of networked workstations
• A data distributor – the part of the host program which reads data from disk and distributes it to the processors
• A renderer – not highly tuned for best performance; load balancing based on the complexity of the subvolumes is not performed
• An image compositor
• The host program – data distribution and image gathering
• The node program – subvolume rendering and subimage compositing

CM-5 and CMMD 3.0
• The CM-5 is a massively parallel supercomputer with 1024 nodes, each of which is a Sparc microprocessor with 32 MB of local RAM and four 64-bit wide vector units.
• 128 operations can be performed by a single instruction.
• A theoretical peak speed of 128 GFlops.
• The nodes can be divided into partitions whose size must be a power of two. Each user program operates within a single partition.

Networked Workstations and PVM 3.1
• Using a set of high performance workstations connected by Ethernet – the goal is to set up a volume rendering facility for handling large data sets and batch animation jobs.
• A network environment provides nondedicated and scattered computing cycles.
• A linear decrease of the rendering time, and the ability to render data sets that are too large to render on a single machine, are expected.
• Real-time rendering is generally not achievable in such an environment.

Tests
• Three different data sets:
- The vorticity data set: a 256 x 256 x 256 voxel CFD data set, computed on a CM-200, showing the onset of turbulence
- The head data set: 128 x 128 x 128 voxels, the now classic UNC Chapel Hill MR head
- The vessel data set: a 256 x 256 x 128 voxel Magnetic Resonance Angiography ( MRA ) data set showing the vascular structure within the brain of a patient

Tests
• Each row shows the images from one processor, while from left to right the columns show the intermediate images before each compositing phase. The rightmost column shows the final result, still divided among the eight processors.

Tests on CM-5

Tests on Networked Workstations

Conclusions
• The binary-swap compositing method has merits which make it particularly suitable for massively parallel processing.
• As the parallel compositing proceeds, the decreasing image size for sending and compositing makes the overall compositing process very efficient.
• It always keeps all processors busy doing useful work.
• It is simple to implement with the use of the k-D tree structure.

Conclusions
• The algorithm has been implemented on both the CM-5 and a network of scientific workstations.
• The CM-5 implementation showed good speedup characteristics out to the largest available partition size of 512 nodes. Only a small fraction of the total rendering time was spent in communication.
• In a heterogeneous environment with shared workstations, linear speedup is difficult.
• Data distribution heuristics which account for varying workstation computation power and workload are being investigated.

Future Research Directions
• The host data distribution, image gathering, and display times are bottlenecks in the current CM-5 implementation. These bottlenecks can be alleviated by exploiting the parallel I/O capabilities of the CM-5.
• Rendering and compositing times on the CM-5 can also be reduced significantly by taking advantage of the vector units available at each processing node.
• Real-time rendering rates are expected to be achievable at medium to high resolutions with these improvements.
• Performance of the distributed workstation implementation could be further improved by better load balancing.