Efficient Strategies for Acceleration Structure Updates in Interactive Ray Tracing Applications on the Cell Processor

Martin Weier, Thorsten Roth, André Hinkenjann
Computer Graphics Lab, Bonn-Rhein-Sieg University of Applied Sciences, Sankt Augustin, Germany
{martin.weier|thorsten.roth}@smail.inf.h-brs.de, andre.hinkenjann@h-brs.de

Abstract. We present fast complete rebuild strategies as well as adapted intelligent local update strategies for acceleration data structures in interactive ray tracing environments. Both approaches can be combined. Although the proposed strategies could be used with other data structures and architectures as well, they are currently tailored to the Bounding Interval Hierarchy on the Cell chip.

1 Introduction

Recent hardware and software developments in the field of fast ray tracing allow for the use of these renderers in interactive environments. While the focus of research in the last two decades was mainly on efficient ray tracing of static scenes, current research focuses on dynamic, interactive scenes. Recently, many approaches have emerged that deal with ray tracing of dynamic, deformable scenes. Current approaches use kd-trees [1,2], grids [3] or Bounding Volume Hierarchies (BVHs) on commodity CPUs or GPUs [4,5,6]. These either do a complete rebuild of the scene's acceleration structure or provide methods to perform an intelligent dynamic update. However, there is always a trade-off between the tree's quality and the time that is needed to perform the rebuild or update operations.

Recent publications [7,8] on using kd-trees for deformable scenes propose performing a complete rebuild from scratch. However, these approaches do not seem to scale well in parallel [7]. Another approach is [9], a GPU-based construction of kd-trees in breadth-first manner. In order to use the fine-grained parallelism of GPUs, a novel strategy for processing large nodes and schemes for fast evaluation of the nodes' split costs are introduced. A disadvantage of this approach is its very high memory overhead. Grids, on the other hand, can be constructed very quickly and in parallel [3]. Although grids are usually not as efficient as adaptive structures when it comes to complex scenes, coherence can be exploited here as well [10]. In recent years, BVHs seem to have become the first choice for ray tracing of deformable scenes. BVHs are well suited for dynamic updates. However, such updates often lead to increasing render times since the trees degenerate over time. To decide whether it makes sense to perform a complete rebuild, Lauterbach et al. [4] developed a metric to determine the tree's quality. Another approach was proposed by Ize et al. [11]. They use an asynchronous construction that runs in parallel with vertex update and rendering. A novel approach is [12], performing a BVH construction on GPUs. There, the upper tree levels are parallelized using Morton curves. This idea is based on [13] and [14], where surface reconstruction on GPUs using space filling curves is performed.

We have chosen the Bounding Interval Hierarchy (BIH) [15] for interactive ray tracing of dynamic scenes on the Cell Broadband Engine Architecture [16]. The Bounding Interval Hierarchy has some advantages compared to kd-trees and BVHs. A BIH node describes one axis-aligned split plane like a kd-tree node. However, the split planes in the BIH represent axis-aligned bounding boxes (AABBs): starting from the scene's bounding box, the intervals are always fitted to the right partition's leftmost and the left partition's rightmost values. Thus, the BIH can be seen as a hybrid of kd-trees and BVHs. One advantage of BIHs compared to kd-trees and BVHs is their easy and fast construction. In addition, the BIH's representation of the nodes is very compact since no complete AABBs need to be stored.
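The following sketch illustrates how compact such a node can be. It is an assumption for illustration only (field names, bit packing and the 12-byte size are not taken from the paper's implementation), but it reflects the description above: one split axis, an index, and the two fitted clip values instead of a full AABB.

```cpp
// Hypothetical sketch of a compact BIH node, for illustration only.
// An inner node stores one split axis and two clip planes: the rightmost
// extent of the left partition and the leftmost extent of the right
// partition. Unlike a BVH node, no full AABB (six floats) has to be stored.
#include <cstdint>

struct BIHNode {
    std::uint32_t axisAndIndex; // lower 2 bits: split axis (0-2) or 3 = leaf;
                                // upper bits: first-child index (children are
                                // allocated in pairs) or first-primitive index
    float clip[2];              // inner node: clip[0] = rightmost value of the
                                // left partition, clip[1] = leftmost value of
                                // the right partition; a leaf can reuse these
                                // fields, e.g. for the primitive count
};

static_assert(sizeof(BIHNode) == 12, "three 32-bit words per node");
```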
Due to the growing parallelism in modern architectures, effectively utilizing many cores and vector units is crucial. In fact, the BIH construction has many similarities with quicksort, which is a well-studied algorithm in the field of parallel computing. Modern architectures like the Cell processor or GPUs have a memory access model (memory coherence model) where data from main memory must be explicitly loaded by or distributed to the threads. For most of these memory models bandwidth is high, but so is latency. This makes it especially important to place memory transfers wisely.

In the following section, we describe fast methods for completely rebuilding the acceleration data structure each frame. In section 3 we present intelligent local updates of the data structure. Note that both methods can be combined: the intelligent update method finds the subtree that has to be rebuilt after scene changes, and this update can then be done using the methods from the following section. After that we show the results of some benchmarks and evaluate both approaches.

2 Fast Complete Rebuild of Acceleration Data Structures

The most important step in BIH construction is the sorting routine which sorts primitives to the right or left of the split plane. The Cell's SPEs only have a limited amount of local storage. To sort all primitives, which usually exceed the 256 kB local store of an SPE, data has to be explicitly requested from and written back to main memory. Peak performance for these DMA calls is reached for blocks of 16 kB, which is the maximum that can be transferred by one DMA call. Since all of these DMA operations are executed non-blocking (asynchronously), it is important to do as much work as possible between consecutive calls.

Common approaches for doing an in-place quicksort like in [17] usually distribute the array in blocks of fixed size to the threads. The threads then do a local in-place sort and write their split values to a commonly shared, newly allocated array. Then prefix sums over these values are calculated and the results are distributed back to the threads. By doing so, each thread knows the position where it can write its values back to. Using this algorithm on the Cell, each SPE would have to load one 16 kB block, sort it and write the split values and the sorted array back to main memory. After that, the prefix sum operation on the split values of the 16 kB blocks would need to be performed. Since this is a relatively cheap operation, parallelizing it using the SPEs introduces new overhead. Additionally, each SPE needs to read the 16 kB blocks again to write them back to main memory in the correct order. For this reason this method should be avoided.

In our algorithm we use only one SPE to perform the sorting of one region. Each region refers to one interval of arbitrary size. To avoid any kind of recombination and reordering after one block is sorted, the values need to be written back to their correct location right away. If the size of the interval to be sorted is smaller than 16 kB, sorting is easy: the SPE can load it, sort it in-place and write it back to where it was read from.
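As an illustration of this simple case, the following hedged sketch partitions one such block around the split plane and also gathers the two clip values the BIH node needs. The DMA transfers are modelled by plain copies (dma_get/dma_put are stand-ins, not Cell SDK calls), and the classification by primitive centroid is an assumption, not necessarily the paper's exact criterion.

```cpp
// Hedged sketch of the per-block sorting step for an interval that fits into
// a single DMA transfer. On the Cell, dma_get/dma_put would be asynchronous
// MFC transfers; here they are plain copies so the sketch is self-contained.
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <utility>

struct Tri { float lo[3], hi[3]; /* per-axis bounds, plus vertex data */ };

// Stand-ins for DMA transfers between main memory and the SPE local store.
static void dma_get(Tri* localBuf, const Tri* mainMem, std::size_t count) {
    std::memcpy(localBuf, mainMem, count * sizeof(Tri));
}
static void dma_put(const Tri* localBuf, Tri* mainMem, std::size_t count) {
    std::memcpy(mainMem, localBuf, count * sizeof(Tri));
}

struct PartitionResult {
    std::size_t splitIndex; // first element of the right partition
    float leftMax;          // rightmost value of the left partition
    float rightMin;         // leftmost value of the right partition
};

// Sort one small block in-place around 'split' on 'axis' and return the data
// needed to create the BIH node (split index and the two fitted clip planes).
PartitionResult sortSmallBlock(Tri* mainMem, std::size_t count, int axis, float split)
{
    Tri local[512];                      // stand-in for a 16 kB local-store
    dma_get(local, mainMem, count);      // buffer (count assumed to fit)

    PartitionResult r{0, -1e30f, 1e30f};
    std::size_t left = 0, right = count;
    while (left < right) {
        float c = 0.5f * (local[left].lo[axis] + local[left].hi[axis]); // centroid
        if (c < split) {
            r.leftMax = std::max(r.leftMax, local[left].hi[axis]);
            ++left;                                 // stays in the left partition
        } else {
            r.rightMin = std::min(r.rightMin, local[left].lo[axis]);
            std::swap(local[left], local[--right]); // moves to the right partition
        }
    }
    r.splitIndex = left;

    dma_put(local, mainMem, count);      // write the partitioned block back
    return r;
}
```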
Sorting intervals larger than 16 kB is a bit more complicated. Values of a block smaller than the split plane value are written to the beginning of the interval to be sorted, the larger ones to the end. To make sure that no values are overwritten that were not already in the SPE's local store, the values from the end of the interval also need to be loaded by the SPE. In order to maximize computation during the MFC calls, five buffers are needed.

Fig. 1: SPE in-place sorting. Elements larger than the pivot are shown in red (dark grey), elements smaller in green (light grey).

Figure 1 shows the use of the five buffers. In the beginning, four buffers are filled. Buffers A and C load values from the end of the triangle array, B and D load values from the beginning (Step 1). After that, the values in D are sorted (Step 2). Then the pointers of buffer D and the OUT buffer are swapped and the OUT buffer is written back to main memory (Step 3): the elements in the OUT buffer smaller than the split plane are written to the beginning of the interval, the larger ones to the end of the interval. After this the buffers need to be swapped again. This can be accomplished by swapping the pointers, i.e. buffer D becomes buffer C, buffer C becomes buffer B and buffer B becomes buffer A, respectively (Step 4). Finally it is determined whether more values that have not yet been processed were read from the beginning or from the end. This information is then used to decide whether the next block needs to be loaded from the beginning or from the end, which ensures that no values are overwritten that were not already loaded into the SPE's local store. The newly loaded block is then again stored in buffer A, which was the former and thus already processed OUT buffer.

One possible extension that is used for blocks smaller than 16 kB is job agglomeration. Since peak performance is achieved for 16 kB blocks, jobs can be agglomerated so that the interval of triangles to be sorted fits into one 16 kB DMA call (see breadth first construction).

Depth first construction

One naive approach to BIH construction is depth first. In this approach only the sorting is done on the SPE; the recursion and the creation of nodes are still entirely done on the PPE. The algorithm starts on the PPE by choosing a global pivot, i.e. the split plane to divide the current bounding box. The split plane and the intervals are then submitted to the SPE. By signaling, the SPE now begins with the in-place sorting as described above. When the SPE has finished, it writes the split index and the right partition's leftmost and the left partition's rightmost values back to main memory and signals this to the PPE. With these values the PPE is able to create the new node and to make a recursive call on the new intervals.
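A hedged sketch of the PPE-side recursion of this depth-first variant is given below. The PPE-SPE round trip (submit split plane and interval, wait for the signal, read back split index and clip values) is collapsed into a sortOnSPE callback, and the midpoint split, the node layout and the 12-triangle leaf threshold (cf. section 4.1) are illustrative assumptions rather than the paper's exact implementation.

```cpp
// Sketch of the PPE-side depth-first construction loop. The SPE only sorts;
// node creation and recursion stay on the PPE, as described above.
#include <cstddef>
#include <functional>

struct AABB { float lo[3], hi[3]; };

struct BuildNode {
    int axis = -1;                       // -1 marks a leaf
    float leftMax = 0.0f, rightMin = 0.0f;
    std::size_t begin = 0, end = 0;      // leaf: interval in the triangle array
    BuildNode* child[2] = {nullptr, nullptr};
};

struct SortResult { std::size_t splitIndex; float leftMax, rightMin; };

// Hypothetical stand-in for the PPE->SPE job submission and completion signal.
using SortOnSPE =
    std::function<SortResult(std::size_t begin, std::size_t end, int axis, float split)>;

BuildNode* buildDepthFirst(std::size_t begin, std::size_t end, const AABB& box,
                           const SortOnSPE& sortOnSPE)
{
    BuildNode* node = new BuildNode;
    if (end - begin <= 12) {             // max. primitives per node (see Sect. 4.1)
        node->begin = begin; node->end = end;
        return node;
    }
    // Global pivot: split the current bounding box in the middle of its longest axis.
    int axis = 0;
    for (int a = 1; a < 3; ++a)
        if (box.hi[a] - box.lo[a] > box.hi[axis] - box.lo[axis]) axis = a;
    float split = 0.5f * (box.lo[axis] + box.hi[axis]);

    SortResult r = sortOnSPE(begin, end, axis, split);   // SPE does the in-place sort

    if (r.splitIndex == begin || r.splitIndex == end) {  // degenerate split: fall back
        node->begin = begin; node->end = end;            // to a leaf here; [15] instead
        return node;                                     // re-splits the larger half
    }
    node->axis = axis; node->leftMax = r.leftMax; node->rightMin = r.rightMin;
    AABB leftBox = box, rightBox = box;
    leftBox.hi[axis] = r.leftMax;        // children are fitted to the clip values
    rightBox.lo[axis] = r.rightMin;
    node->child[0] = buildDepthFirst(begin, r.splitIndex, leftBox, sortOnSPE);
    node->child[1] = buildDepthFirst(r.splitIndex, end, rightBox, sortOnSPE);
    return node;
}
```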
Breadth first construction

The breadth first construction of data structures on architectures like the Cell B. E. is advantageous for two main reasons:

1. The primitive array is traversed per tree level entirely from the beginning to the end ⇒ memory access is very efficient.
2. Many small jobs with primitive data smaller than 16 kB can be agglomerated ⇒ reduction of DMA calls.

We propose a method to do a breadth first construction on the Cell B. E. where the SPE can run the construction of the BIH almost entirely independently of the PPE. The PPE is only responsible for allocating new memory, since these operations cannot be performed from the SPEs.

Efficient memory management is a particularly important aspect. The naive construction of the BIH allocates memory for each created node. These memory allocation calls are costly. Even though various publications like [18] and [19] state that custom memory management implementations usually do not increase application performance, memory management on today's architectures still remains a bottleneck. One way of avoiding a large number of memory allocation calls is using an ObjectArena, as proposed in PBRT [20]. There, memory is allocated as multiple arrays of fixed size, where each pointer to the beginning of an array is stored in a list. Freeing the allocated memory can easily be done by iterating over the list and freeing the referenced arrays. One advantage is that different threads can be assigned to different lists. Thus, they can compute the address to write back to without any synchronization between the different SPEs. This is particularly important for an implementation that does the BIH construction independently of the PPE. Being able to compute a new node's address, the build-up can run more independently of the PPE since the frequency of new allocations is reduced. Unfortunately, such an implementation also has disadvantages, as memory can be wasted because the last arrays may only be partially filled. In addition, intelligent updates and the need to delete single nodes lead to fragmentation of the arrays.

To implement a construction in BFS manner, jobs, i.e. the intervals to be sorted on the next tree level, need to be managed by an additional job queue. At the beginning, the PPE allocates two additional buffers whose sizes equal the primitive count. These buffers are denoted in the following as WorkA_PPE and WorkB_PPE. Their size can be seen as a definite upper bound on the number of jobs that can be created while processing one tree level. The actual construction is divided into two phases. In each phase, WorkA_PPE and WorkB_PPE switch roles, i.e. in the first phase jobs are read from WorkA_PPE and newly created ones are stored in WorkB_PPE, while this is done vice versa in the second phase. Figure 2 clarifies the two-phase execution. To do so, two additional buffers are allocated in the SPE's local store as well. We denote these buffers as WorkA and WorkB. WorkA stores the jobs of the current tree level; in the beginning this is the job for the root node. From each job in WorkA at most two new jobs are created, which are then stored in WorkB. If there are no more jobs in WorkA, i.e. the current job is the last one, and no further jobs remain in main memory, the current tree level is entirely processed. If there is no more space left in WorkB, it is written back to the PPE's buffer in main memory. To keep track of how many blocks were written back in the last phase, an additional counter is used. This counter can also be used to determine whether there are jobs left in main memory that have not yet been processed. In addition, this approach can be further optimized. While processing the first tree levels, the jobs can be stored entirely on the SPE, so there is no need to write them back to main memory after the tree level was processed. Therefore, to avoid these unnecessary DMA calls, the buffers on the SPE can be swapped as well if there is no need to provide further storage for the newly created jobs.

Fig. 2: Two-phase management of the job queue.
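To make the two-phase role swap more concrete, here is a simplified, hedged sketch of the level-by-level job processing. It leaves out the SPE-local WorkA/WorkB buffers, the DMA transfers and the write-back counter, and processJob is a hypothetical stand-in for the in-place sorting and node creation; only the swapping of the two main-memory job buffers per tree level is shown.

```cpp
// Simplified sketch of the two-phase job queue: two job arrays swap their
// read/write roles on every tree level, and each job spawns at most two jobs
// for the next level.
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

struct Job { std::size_t begin, end; /* plus node index, bounding box, ... */ };

void buildBreadthFirst(
    std::size_t primitiveCount,
    // Sorts one interval, creates the node and returns 0, 1 or 2 child jobs.
    const std::function<std::vector<Job>(const Job&)>& processJob)
{
    // Both buffers are sized with the primitive count, a definite upper bound
    // on the number of jobs that one tree level can produce.
    std::vector<Job> workA, workB;
    workA.reserve(primitiveCount);
    workB.reserve(primitiveCount);

    workA.push_back(Job{0, primitiveCount});     // job for the root node

    while (!workA.empty()) {                     // one iteration = one tree level
        workB.clear();
        for (const Job& job : workA) {
            for (const Job& child : processJob(job))
                workB.push_back(child);          // at most two new jobs per job
        }
        std::swap(workA, workB);                 // phase switch: buffers swap roles
    }
}
```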
Our method also has some disadvantages. While requesting new jobs from main memory can be handled asynchronously, this is not the case when WorkB needs to be written back to main memory. Therefore a double buffering scheme for WorkA and WorkB could be used.

3 Intelligent Local Update

To accelerate object transformations, an algorithm for performing dynamic updates of the data structure is needed. The basic idea is to search the BIH region affected by given transformations and to avoid the complete rebuild if possible. This is adapted from [6], where a similar approach for Bounding Volume Hierarchies based on [21] was presented. Results of our algorithm as well as information on potential future optimizations are stated in sections 4.2 and 5. The algorithm uses the PPE (a traditional CPU core on the Cell chip), so it could easily be ported to other architectures. The traversal method used during the update procedure is shown in algorithm 1.

Finding the affected interval of the underlying triangle array is important. To achieve this, the states of the geometry before and after the transformation have to be considered, as transformations may obviously result in an object being moved to another region of the scene. Geometry is always represented as complete objects in the current implementation. To be able to search for affected regions, a bounding box is constructed which encloses the old and new geometry states. We call this the Enclosing Bounding Box (EBB). To avoid quality degradation or degeneration of the data structure, a complete rebuild is performed if the scene bounding box changes due to the transformation. Using the EBB, the algorithm is able to find the subtree which contains all modified geometry. This subtree also encloses the corresponding array interval: traversing down to the leftmost and rightmost leaves of the subtree yields the starting and ending indices of that interval. As shown in the algorithm, traversal is stopped as soon as both intervals overlap the EBB. This is necessary because there is no information whether the geometry belongs to the lower or upper interval. An alternative approach is explained in section 5.

Finding the primitives to be overwritten in the triangle array is now done by iterating over the array and replacing all triangles carrying the corresponding geometry ID, which is stored in each triangle for material and texturing purposes. Subsequently, all nodes below the enclosing node are deleted recursively and the usual construction algorithm is performed for the subset of triangles given by the array interval. Note that it is necessary to keep track of the bounding box of the tree level reached, which is at this point used as if it was the actual scene bounding box. This is the case because we make use of the global heuristic in [15]. The resulting BIH can then simply be added to the existing tree. The new subtree may need an adjustment of the pointer to its root, as the split axis may have changed. Results of this approach as well as some advantages and shortcomings are presented in section 4.2.
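A minimal sketch of how the EBB and the rebuild trigger described above could be computed is shown below; the AABB type and all names are illustrative assumptions, not the paper's implementation.

```cpp
// Hedged sketch: the Enclosing Bounding Box (EBB) is the union of the
// transformed object's bounds before and after the transformation, and a
// complete rebuild is triggered if the EBB leaves the scene bounding box.
#include <algorithm>

struct AABB {
    float lo[3] = { 1e30f,  1e30f,  1e30f};
    float hi[3] = {-1e30f, -1e30f, -1e30f};

    void expand(const AABB& b) {
        for (int a = 0; a < 3; ++a) {
            lo[a] = std::min(lo[a], b.lo[a]);
            hi[a] = std::max(hi[a], b.hi[a]);
        }
    }
    bool contains(const AABB& b) const {
        for (int a = 0; a < 3; ++a)
            if (b.lo[a] < lo[a] || b.hi[a] > hi[a]) return false;
        return true;
    }
};

// The EBB encloses the object's bounds before and after the transformation.
AABB enclosingBoundingBox(const AABB& oldBounds, const AABB& newBounds)
{
    AABB ebb = oldBounds;
    ebb.expand(newBounds);
    return ebb;
}

// If the transformed object leaves the scene bounds, the scene bounding box
// would change, so a complete rebuild is performed instead of a local update.
bool requiresCompleteRebuild(const AABB& sceneBounds, const AABB& ebb)
{
    return !sceneBounds.contains(ebb);
}
```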
Algorithm 1: BIH traversal algorithm used to find the smallest enclosing interval for the EBB

    while traverse do
        if leaf reached ∨ both intervals overlap EBB then
            traverse ← false;
        else if only one interval overlaps EBB then
            if interval does not contain EBB completely then
                if EBB overlaps splitting plane then
                    Abort traversal;
                else
                    Set clipping plane for interval to corresponding value of EBB;
                end
            end
            Set traversal data to successive node;
            if child is not a leaf then
                Set split axis;
            end
        else
            traverse ← false;
        end
    end

4 Results and Evaluation

All measurements in this section were obtained by running the reported algorithms on a Sony Playstation 3.

4.1 Complete Rebuild

Table 1 gives an overview of the different construction times and the resulting speedups. We tested four different versions. Two were PPE based, i.e. the naive and the approximate sorting approach proposed in [15]. The other two were SPE based, one in DFS and the other one in BFS manner. All speedups relate to the naive approach on the PPE.

Table 1: Construction times for four different models in ms, averaged over 10 runs.

                              PPE                                    SPE
Model         #Triangles  Naive     Approx. sorting  Speedup   Depth first  Speedup  Breadth first  Speedup
ISS           17,633      106 ms    63 ms            1.68      47 ms        2.26     24 ms          4.42
Bunny         69,451      340 ms    201 ms           1.69      133 ms       2.56     68 ms          5.00
Fairy Forest  174,117     1,478 ms  781 ms           1.89      385 ms       3.84     195 ms         7.58
Dragon        871,414     5,956 ms  Out of Memory    -         1,828 ms     3.26     897 ms         6.64

The naive approach was the fastest available algorithm on the PPE; it already includes vectorization. However, construction times even for small models like the ISS or the Stanford Bunny are far from interactive. Even though the speedups of the approximate sorting are stated in [15] to be about 3-4, our implementation only achieves speedups of about 1.8. This is due to the maximum-primitives-per-node constraint in our ray tracing system: nodes with more than 12 triangles need to be further subdivided, and for this the naive approach is used. Looking at the two SPE based approaches, it is apparent that the construction times of the breadth first approach are always better than those of the depth first approach. This is because the breadth first approach uses fewer synchronization calls between the SPE and the PPE. Table 1 shows that speedups of about 5 to 7 can be expected. A steady increase of the speedup with larger models cannot be concluded from the table, because the speedup depends not only on the model's size but also on the layout of the overall scene. Since all benchmarks were made using only one SPE, utilizing more SPEs should lead to another significant performance gain. This is the topic of future work.

4.2 Partial Rebuild

When multiple, arbitrarily distributed objects are transformed, the result might be a very large EBB. This will in turn result in rebuilding large parts of the scene, gaining almost no performance boost (or none at all). A similar effect is caused by large triangles, as they induce an overlapping of intervals. As shown before, this will lead to an early termination of the traversal algorithm. With large empty parts in a scene and otherwise small geometry, a huge performance boost can be expected. However, performance may suffer for counterintuitive reasons, e.g. when moving an object through a huge empty region of the scene but overlapping the root split axis.
This would virtually result in a complete rebuild of the data structure. Nevertheless, good performance gains can be obtained, as is shown in the following.

Figure 3 illustrates results for the scene rtPerf, which consists of spheres pseudorandomly distributed across the scene. Tests were run for triangle counts from 10k to 100k, moving a number of spheres in a predefined scheme. This was performed several times and the results were averaged.

Fig. 3: Measurements for the scene "rtPerf" with triangle counts from 10k to 100k. (a) Number of rebuilt triangles (transformed, used for reconstruction, delta). (b) Rebuild and update times in ms (minimum, average and maximum time).

Figure 3a shows the correlation between the total number of triangles and the triangles used for rebuilds. While the number of transformed triangles stays constant, there is in most cases an increase in the number of triangles used for rebuilds. This does not simply increase by the number of triangles added to the scene, but only by a fraction of that (which may vary depending on the geometry distribution). It can also be seen that only a fraction of the data structure updates results in a complete rebuild, thus yielding a huge difference between average and maximum update time, as shown in figure 3b. Here, the correspondence between the times needed for complete rebuilds and the times achieved with the implemented update strategies is shown. While both increase almost linearly, the update strategy performs vastly better than just using a brute force rebuild.

Table 2 shows results for other testing scenes.

Table 2: Results for the scenes carVis and cupVis; from left to right: number of triangles; number of updates; number of complete rebuilds; ratio updates/rebuilds; minimum, maximum and average update/rebuild times; average number of transformed triangles; average number of triangles used for rebuild; average difference of these values; average traversal depth.

scene    tris     updates  rebuilds  ratio  t_min   t_max    t_avg   trans     rec       delta     trav
carVis   22,106   383      112       3.42   0.48    154.09   69.71   468.47    7,549.25  7,080.78  2.22
cupVis   7,040    6        148       0.04   26.89   35.43    29.1    7,040     7,040     2,181.66  0

Note that these scenes have fixed triangle counts. carVis is a car model which can be decomposed into its parts, while cupVis is just a simple model of a glass without any surrounding geometry. As a result, in cupVis a complete rebuild is performed each time the geometry is moved. The small fraction of updates (as opposed to rebuilds) in the table results from the object being marked for transformation without a transformation actually being performed. In that case the scene bounding box as well as all triangles remain the same, and an update (in which nothing happens at all) can be performed.

5 Conclusion and Future Work

In the current implementation of the ray tracer there was only the possibility to utilize one SPE for tree construction, since the others were used for rendering. However, since rendering only takes place in between the data structure updates, in principle all SPEs could be utilized. Even though an implementation for more SPEs has not been done yet, we want to introduce some ideas for a reasonable parallel construction.
One way of utilizing more threads would be to parallelize the depth first based construction: the recursive calls could be delegated to different SPEs. However, this is not very efficient because on the upper tree levels many SPEs would be idle. The same problem arises if such a parallelization were done using the breadth first construction and allowing all threads to access one common job queue. Additionally, this would make further synchronization between the SPEs and the PPE necessary, since the access to the job queue must be synchronized. In order to get the best performance out of the SPEs, synchronization must be reduced and the execution should be entirely independent of the PPE. One method that could be used in the context of BIH parallelization was proposed in [12] for BVH construction on GPUs. There, a pre-processing step using Morton curves is performed to find regions of primitives that can then be built up completely independently. The construction of the Morton curve can be efficiently parallelized, since the Morton code of a primitive can be computed independently of the others. After the Morton code construction a radix sort needs to be performed. Radix sort is advantageous because it does not involve comparisons. Fortunately, there are methods for radix sort on the Cell B. E., like [22] or [23]. By doing such a pre-computation step, better load balancing while using the SPEs in parallel can be achieved.

Other possible improvements regard the job queue. One improvement could be a double buffering scheme for the two buffers on the SPE to avoid synchronous memory access. The necessary DMA calls might benefit from the usage of DMA lists. Another possible way of dealing with the management of the job queue is the use of the Cell's software managed cache, for which some improvements have been made recently [24].

To achieve better performance of the update strategy, an approach that traverses further down the data structure could be beneficial. This way, far fewer triangles would be involved in the update step, by keeping subtrees which are not affected by transformations. However, it has to be analyzed how deep the tree should be traversed, as keeping subtrees will often lead to index problems with the underlying triangle array. A metric for estimating the time needed for reorganization steps is required to cope with that, as further traversal potentially leads to more reorganization overhead due to memcpy operations. Also, more leaves' indices might have to be adjusted accordingly.
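As an illustration of such a pre-processing step, the following hedged sketch computes a 30-bit Morton code from a primitive centroid that has been normalized to the scene bounding box, in the spirit of [12]; it is not part of the presented system.

```cpp
// Hedged sketch of a per-primitive Morton code: the centroid is quantized to
// a 10-bit grid per axis and the bits are interleaved into a 30-bit code.
#include <cstdint>

// Spread the lower 10 bits of v so that two zero bits separate consecutive bits.
static std::uint32_t expandBits(std::uint32_t v)
{
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// x, y and z are the primitive's centroid coordinates normalized to [0, 1]
// with respect to the scene bounding box.
std::uint32_t mortonCode(float x, float y, float z)
{
    auto quantize = [](float f) -> std::uint32_t {
        if (f < 0.0f) f = 0.0f;
        if (f > 1.0f) f = 1.0f;
        return static_cast<std::uint32_t>(f * 1023.0f);
    };
    std::uint32_t xx = expandBits(quantize(x));
    std::uint32_t yy = expandBits(quantize(y));
    std::uint32_t zz = expandBits(quantize(z));
    return (xx << 2) | (yy << 1) | zz;
}
```

Because each code depends only on its own primitive, the loop over all primitives can be split freely across SPEs before the subsequent radix sort.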
6 Acknowledgements

This work was sponsored by the German Federal Ministry of Education and Research (BMBF) under grant no. 1762X07.

References

1. Popov, S., Günther, J., Seidel, H.P., Slusallek, P.: Stackless kd-Tree Traversal for High Performance GPU Ray Tracing. Computer Graphics Forum (Proc. Eurographics) 26(3) (2007) 415-424
2. Wald, I., Havran, V.: On Building Fast kd-Trees for Ray Tracing, and on Doing that in O(N log N). In: Proceedings of the IEEE Symposium on Interactive Ray Tracing (2006) 61-69
3. Ize, T., Wald, I., Robertson, C., Parker, S.: An Evaluation of Parallel Grid Construction for Ray Tracing Dynamic Scenes. In: Proceedings of the 2006 IEEE Symposium on Interactive Ray Tracing, Salt Lake City, Utah (2006) 47-55
4. Lauterbach, C., Yoon, S., Tuft, D., Manocha, D.: RT-DEFORM: Interactive Ray Tracing of Dynamic Scenes using BVHs. In: IEEE Symposium on Interactive Ray Tracing, Salt Lake City, Utah (2006)
5. Wald, I., Boulos, S., Shirley, P.: Ray Tracing Deformable Scenes using Dynamic Bounding Volume Hierarchies. ACM Transactions on Graphics 26(1) (2007)
6. Eisemann, M., Grosch, T., Magnor, M., Müller, S.: Automatic Creation of Object Hierarchies for Ray Tracing Dynamic Scenes. In Skala, V., ed.: WSCG Short Papers Post-Conference Proceedings, WSCG (2007)
7. Popov, S., Günther, J., Seidel, H.P., Slusallek, P.: Experiences with Streaming Construction of SAH kd-Trees. In: Proceedings of the 2006 IEEE Symposium on Interactive Ray Tracing, Utah (2006) 89-94
8. Hunt, W., Mark, W.R., Stoll, G.: Fast kd-tree Construction with an Adaptive Error-Bounded Heuristic. In: 2006 IEEE Symposium on Interactive Ray Tracing (2006)
9. Zhou, K., Hou, Q., Wang, R., Guo, B.: Real-time kd-tree Construction on Graphics Hardware. ACM Transactions on Graphics 27 (2008) 1-11
10. Wald, I., Ize, T., Kensler, A., Knoll, A., Parker, S.G.: Ray Tracing Animated Scenes using Coherent Grid Traversal. ACM Transactions on Graphics 25(3) (Proceedings of ACM SIGGRAPH 2006, Boston, MA) (2006) 485-493
11. Ize, T., Wald, I., Parker, S.G.: Asynchronous BVH Construction for Ray Tracing Dynamic Scenes on Parallel Multi-Core Architectures. In Favre, J.M., dos Santos, L.P., Reiners, D., eds.: Eurographics Symposium on Parallel Graphics and Visualization (2007)
12. Lauterbach, C., Garland, M., Sengupta, S., Luebke, D., Manocha, D.: Fast BVH Construction on GPUs. In: Proc. Eurographics 2009, Volume 28, München, Germany (2009)
13. Zhou, K., Gong, M., Huang, X., Guo, B.: Highly Parallel Surface Reconstruction. Technical Report MSR-TR-2008-53, Microsoft Research (2008)
14. Ajmera, P., Goradia, R., Chandran, S., Aluru, S.: Fast, Parallel, GPU-based Construction of Space Filling Curves and Octrees. In: SI3D '08: Proceedings of the 2008 Symposium on Interactive 3D Graphics and Games, Redwood City, CA, ACM (2008)
15. Wächter, C., Keller, A.: Instant Ray Tracing: The Bounding Interval Hierarchy. In: Rendering Techniques 2006: Proceedings of the 17th Eurographics Symposium on Rendering, Nicosia, Cyprus (2006) 139-149
16. Kahle, J.A., Day, M.N., Hofstee, H.P., Johns, C.R., Maeurer, T.R., Shippy, D.: Introduction to the Cell Multiprocessor. IBM Journal of Research and Development 49 (2005) 589-604
17. Grama, A., Gupta, A., Karypis, G., Kumar, V.: Introduction to Parallel Computing. 2nd edn. Addison-Wesley (2003)
18. Johnstone, M.S., Wilson, P.R.: The Memory Fragmentation Problem: Solved? In: ISMM '98: Proceedings of the 1st International Symposium on Memory Management, Vancouver, Canada, ACM (1998) 26-36
19. Berger, E.D., Zorn, B.G., McKinley, K.S.: Composing High-Performance Memory Allocators. In: PLDI '01: Proceedings of the ACM SIGPLAN 2001 Conference on Programming Language Design and Implementation, Utah, ACM (2001) 114-124
20. Pharr, M., Humphreys, G.: Physically Based Rendering: From Theory to Implementation (The Interactive 3D Technology Series). Morgan Kaufmann, San Francisco (2004)
21. Goldsmith, J., Salmon, J.: Automatic Creation of Object Hierarchies for Ray Tracing. IEEE Computer Graphics and Applications 7 (1987) 14-20
22. No author given: Parallel radix sort on Cell (2007) http://sourceforge.net/projects/sorting-on-cell/, last viewed: 28.02.09
23. Ramprasad, N., Baruah, P.K.: Radix Sort on the Cell Broadband Engine. Department of Mathematics and Computer Science, Sri Sathya Sai University (2007) www.hipc.org/hipc2007/posters/radix-sort.pdf, last viewed: 10.02.09
24. Guofu, F., Xiaoshe, D., Xuhao, W., Ying, C., Xingjun, Z.: An Efficient Software-Managed Cache Based on Cell Broadband Engine Architecture. International Journal of Distributed Sensor Networks 5 (2009) 16-16