Efficient Strategies for Acceleration Structure Updates in
Interactive Ray Tracing Applications on the Cell Processor
Martin Weier, Thorsten Roth, André Hinkenjann
Computer Graphics Lab
Bonn-Rhein-Sieg University of Applied Sciences
Sankt Augustin, Germany
[email protected]
Abstract. We present fast complete rebuild strategies, as well as adapted intelligent local
update strategies for acceleration data structures for interactive ray tracing environments.
Both approaches can be combined. Although the proposed strategies could be used with
other data structures and architectures as well, they are currently tailored to the Bounding
Interval Hierarchy on the Cell chip.
Recent hardware and software developments in the field of fast ray tracing allow for the use of
these renderers in interactive environments. While the focus of research of the last two decades was
mainly on efficient ray tracing of static scenes, current research focuses on dynamic, interactive
scenes. Recently, many approaches came up that deal with ray tracing of dynamic deformable
scenes. Current approaches use kd-trees [1,2], grids [3] or Bounding Volume Hierarchies (BVHs) on
commodity CPUs or GPUs [4,5,6]. These either do a complete rebuild of the scene’s acceleration
structure or provide methods to perform an intelligent dynamic update. However, there is always
a trade-off between the tree’s quality and the time that is needed to perform the rebuild or update.
Recent publications [7,8] on using kd-trees for deformable scenes propose performing a complete rebuild from scratch. However, these approaches do not seem to scale well in
parallel [7]. Another approach is [9], a GPU-based construction of kd-trees in a breadth-first manner.
In order to use the fine-grained parallelism of GPUs, a novel strategy for processing large nodes and
schemes for fast evaluation of the nodes’ split costs are introduced. A disadvantage of this approach
is its very high memory overhead. Grids, on the other hand, can be constructed very fast
and in parallel [3]. Although grids are usually not as efficient as adaptive structures when it comes
to complex scenes, coherence can be exploited here as well [10].
In recent years, BVHs seem to have become the first choice for ray tracing of deformable scenes.
BVHs are well suited for dynamic updates. However, this often leads to increasing render times since
the trees degenerate over time. To decide whether it makes sense to perform a complete rebuild,
Lauterbach et al. [4] developed a metric to determine the tree’s quality. Another approach was
proposed by Ize et al. [11]. They use an asynchronous construction that runs in parallel during
vertex update and rendering. A novel approach is [12], performing a BVH construction on GPUs.
There, the upper tree levels are parallelized using Morton curves. This idea is based on [13] and
[14], where surface reconstruction on GPUs using space filling curves is performed. We have chosen
the Bounding Interval Hierarchy (BIH) [15] for interactive ray tracing of dynamic scenes on the
Cell Broadband Engine Architecture [16]. The Bounding Interval Hierarchy has some advantages
compared to kd-trees and BVHs. A BIH node describes one axis aligned split plane like a kd-tree
node. However, the split planes in the BIH represent axis aligned bounding boxes (AABBs). Starting
from the scene’s BB the intervals are always fitted to the right’s leftmost and the left’s rightmost
values of the resulting partitions. Thus, the BIH can be seen as a hybrid of kd-trees and BVHs. One
advantage of BIHs compared to kd-trees and BVHs is their easy and fast construction. In addition
the BIH’s representation of the nodes is very compact since no complete AABBs need to be stored.
Due to the growing parallelism in modern architectures, effectively utilizing many cores and vector
units is crucial. In fact the BIH construction has many similarities with quicksort which is a well
studied algorithm in the field of parallel computing. Modern architectures like the Cell processor or
GPUs all have a memory access model (memory coherence model) where data from main memory
must be explicitly loaded from or distributed to the threads. For most of these memory models
both bandwidth and latency are high. This makes it especially important to place memory calls
wisely. In the following section, we describe fast methods for completely rebuilding the acceleration
data structure each frame. In section 3 we present intelligent local updates of the data structure.
Note that both methods can be combined: The intelligent update method finds the subtree that
has to be rebuilt after scene changes. This update can then be done by using the methods from the
following section. After that we show the results of some benchmarks and evaluate both approaches.
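The fitting of the two intervals described above can be illustrated with a minimal Python sketch. It assumes that primitives are given by their (min, max) extents along the split axis and that they are classified by their centroid; the name `fit_clip_planes` and the centroid rule are assumptions made for illustration and are not taken from the implementation discussed in this paper.

```python
def fit_clip_planes(extents, split):
    """Classify 1-D extents (lo, hi) against a split plane and fit both intervals.

    Returns (left, right, left_max, right_min). left_max is the rightmost value
    of the left partition, right_min the leftmost value of the right partition.
    An empty side yields None for its clip value.
    """
    left, right = [], []
    for lo, hi in extents:
        centroid = 0.5 * (lo + hi)  # assumption: classification by centroid
        (left if centroid < split else right).append((lo, hi))
    left_max = max((hi for _, hi in left), default=None)
    right_min = min((lo for lo, _ in right), default=None)
    return left, right, left_max, right_min


# Four primitives along one axis, split plane at 2.0.
extents = [(0.0, 1.0), (0.5, 2.5), (3.0, 4.0), (2.0, 5.0)]
left, right, left_max, right_min = fit_clip_planes(extents, split=2.0)
print(left_max, right_min)
```

In this example the left interval ends at 2.5 while the right interval starts at 2.0, so the two children overlap. Only the two clip values have to be stored per node, which is why no complete AABB is needed.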
Fast Complete Rebuild of Acceleration Data Structures
The most important step in BIH construction is the sorting routine which sorts primitives to the
right or left of the split plane. The Cell’s SPEs only have a limited amount of local storage. To
do sorting on all primitives, which usually exceed the 256kB local store on the SPE, data has to
be explicitly requested and written back to main memory. Peak performance for these DMA calls
is reached for blocks of size 16kB, which is the maximum that can be obtained by one DMA call.
Since all of these DMA operations are executed non-blocking (asynchronously), it is important to
do as much work as possible in between consecutive calls. Common approaches for doing an in-place
quicksort like in [17] usually distribute the array in blocks of fixed size to the threads. The threads
then do a local in-place sort and write their split values to a common, newly allocated array. Then
prefix sums over these values are calculated and the results are distributed back to the threads. By
doing so, each thread knows the position where it can write its values back to. Using this algorithm
on the Cell, each SPE would have to load one 16kB block, sort it and write the split values and the
sorted array back to main memory. After that, the prefix sum operation on the split values of the
16kB blocks would need to be performed. Since this is a relatively cheap operation, parallelizing
it using the SPEs leads to new overhead. Additionally, each SPE needs to read the 16kB blocks
again to write them back to main memory in correct order. For this reason this method should be
avoided. In our algorithm we only use one SPE to perform the sorting of one region. Each region
refers to one interval of arbitrary size. To avoid any kind of recombination and reordering after one
block is sorted, the values need to be written back to their correct location right away.
If the size of the interval to be sorted is smaller than 16kB, sorting is easy. The SPE can load
it, sort it in-place and write it back to where it was read from. Sorting intervals larger than 16kB
is a bit more complicated. Values of a block smaller than the split plane value are written to the
beginning of the interval to be sorted and the ones larger to the end. To make sure that no values are
overwritten that were not already in the SPE’s local store, the values from the end of the interval
Lecture Notes in Computer Science
Fig. 1: SPE in-place sorting. Elements larger than the pivot in red (dark grey) and elements smaller in green
(light grey)
also need to be loaded by the SPE. In order to maximize the computation during the MFC calls
five buffers are needed. Figure 1 shows the use of the five buffers. In the beginning, four buffers are
filled. Buffers A and C load values from the end of the triangle array, B and D load values from
the beginning (Step 1). After that, values in D are sorted (Step 2). Then the pointer from buffer
D and the OUT buffer are swapped and the OUT buffer is written back to main memory (Step 3).
The elements in the OUT buffer smaller than the split plane are written to the beginning of the
interval. The elements larger are written back to the end of the interval. After this the buffers need
to be swapped again. This can be accomplished by swapping the pointers, i.e. buffer D becomes
buffer C, buffer C becomes buffer B and buffer B becomes buffer A respectively (Step 4). Finally it
is determined if more values that have not already been processed were read from the beginning or
from the end. This information is then used to decide whether the next block needs to be loaded
from the beginning or the end. This ensures that no values are overwritten that were not already
loaded in the SPE’s local store. The newly loaded block is then again stored in buffer A which was
the former and thus already processed buffer OUT. One possible extension that is used for blocks
smaller than 16kB is job agglomeration. Since the peak performance is achieved for 16kB blocks,
jobs can be agglomerated so that the interval of triangles to be sorted fits into one 16kB DMA call
(see breadth first construction).
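The idea of sorting a large interval in place while only ever touching block-sized pieces can be sketched in Python as follows. This is a simplified model and not the five-buffer scheme of Figure 1: the list `data` stands in for main memory, `block` for the DMA block size, and the queue `pending` for the buffers in the local store; all names are hypothetical. Blocks are loaded from whichever end currently lacks free space, so that a write never touches values that have not been loaded yet.

```python
from collections import deque


def streaming_partition(data, pivot, block=4):
    """Partition `data` in place around `pivot` using block-sized transfers.

    Values < pivot end up in data[:split], all others in data[split:].
    Returns the split index.
    """
    lo_read, hi_read = 0, len(data)    # unread region is data[lo_read:hi_read]
    lo_write, hi_write = 0, len(data)  # next free slots at the front and back
    pending = deque()                  # blocks loaded but not yet written back

    while lo_read < hi_read or pending:
        # Load from the front until a full block of small values would fit there.
        while lo_read < hi_read and lo_read - lo_write < block:
            n = min(block, hi_read - lo_read)
            pending.append(data[lo_read:lo_read + n])
            lo_read += n
        # Same for the back, so that a full block of large values would fit.
        while lo_read < hi_read and hi_write - hi_read < block:
            n = min(block, hi_read - lo_read)
            pending.append(data[hi_read - n:hi_read])
            hi_read -= n

        blk = pending.popleft()
        small = [v for v in blk if v < pivot]
        large = [v for v in blk if v >= pivot]
        # Safety check: never overwrite values that have not been loaded yet.
        assert lo_read >= hi_read or (
            lo_write + len(small) <= lo_read
            and hi_write - len(large) >= hi_read)
        data[lo_write:lo_write + len(small)] = small
        lo_write += len(small)
        data[hi_write - len(large):hi_write] = large
        hi_write -= len(large)

    return lo_write


values = [(i * 37) % 101 for i in range(103)]
work = list(values)
split = streaming_partition(work, pivot=50, block=8)
```

Because the free space at both ends always equals the number of buffered values, every block can be written back to its final location right away, and no recombination step is required afterwards.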
Depth first construction. One naive approach to perform BIH construction is depth first. In
this approach only sorting is done on the SPE, the recursion and creation of nodes is still entirely
done on the PPE. The algorithm starts on the PPE by choosing a global pivot, i.e. the split plane
to divide the current BB. The split plane and the intervals are then submitted to the SPE. By
signaling, the SPE now begins with the in-place sorting as described above. When the SPE has
finished, it writes the split index and the right’s leftmost and left’s rightmost values back to main
memory and signals this to the PPE. With these values the PPE is now able to create the new
node and to make a recursive call on the new intervals.
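A depth first build of this kind can be sketched in Python as follows. The sketch is single-threaded: the partition loop stands in for the sorting job that would be handed to the SPE, while the recursion corresponds to the part that stays on the PPE. The split plane is chosen by halving the current box along its longest axis, which is our reading of the global heuristic in [15]; the dictionary-based node layout, `max_leaf` and `max_depth` are assumptions made for illustration.

```python
def build_depth_first(prims, ids, lo, hi, box, max_leaf=2, depth=0, max_depth=32):
    """Recursively build a BIH over ids[lo:hi].

    prims: list of ((minx, miny, minz), (maxx, maxy, maxz)) tuples.
    box:   (min, max) tuples of the current region (not the fitted bounds).
    """
    if hi - lo <= max_leaf or depth >= max_depth:
        return {"leaf": True, "first": lo, "last": hi}
    bmin, bmax = box
    axis = max(range(3), key=lambda a: bmax[a] - bmin[a])  # longest axis
    split = 0.5 * (bmin[axis] + bmax[axis])                # halve the region

    # "Sorting" step: partition ids[lo:hi] in place by primitive centroid.
    i, j = lo, hi - 1
    left_max, right_min = float("-inf"), float("inf")
    while i <= j:
        pmin, pmax = prims[ids[i]]
        if 0.5 * (pmin[axis] + pmax[axis]) < split:
            left_max = max(left_max, pmax[axis])   # left's rightmost value
            i += 1
        else:
            right_min = min(right_min, pmin[axis])  # right's leftmost value
            ids[i], ids[j] = ids[j], ids[i]
            j -= 1

    left_box = (bmin, bmax[:axis] + (split,) + bmax[axis + 1:])
    right_box = (bmin[:axis] + (split,) + bmin[axis + 1:], bmax)
    rest = (max_leaf, depth + 1, max_depth)
    if i == lo:   # left side empty: retry with the right half of the region
        return build_depth_first(prims, ids, lo, hi, right_box, *rest)
    if i == hi:   # right side empty: retry with the left half of the region
        return build_depth_first(prims, ids, lo, hi, left_box, *rest)
    return {"leaf": False, "axis": axis, "clip": (left_max, right_min),
            "left": build_depth_first(prims, ids, lo, i, left_box, *rest),
            "right": build_depth_first(prims, ids, i, hi, right_box, *rest)}


prims = [((0, 0, 0), (1, 1, 1)), ((1, 0, 0), (2, 1, 1)),
         ((6, 0, 0), (7, 1, 1)), ((7, 0, 0), (8, 1, 1))]
ids = [0, 1, 2, 3]
root = build_depth_first(prims, ids, 0, len(ids), ((0, 0, 0), (8, 1, 1)))
```

The values returned by the partition step (the split index `i` and the two clip values) are exactly what the SPE has to report back before the next recursive call can be issued, which explains the synchronization cost of this variant.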
Breadth first construction. The breadth first construction of data structures on architectures
like the Cell B. E. is advantageous for two main reasons:
1. The primitive array is traversed per tree level entirely from the beginning to the end ⇒ memory
access is very efficient
2. Many small jobs with primitive counts smaller than 16kB can be agglomerated ⇒ reduction of DMA calls
We propose a method to do a breadth first construction on the Cell B. E. where the SPE can run
construction of the BIH almost entirely independent from the PPE. The PPE is only responsible
for allocating new memory since these operations cannot be performed from the SPEs. Efficient
memory management is a particularly important aspect. The naive construction of the BIH allocates
memory for each created node. These memory allocation calls are costly. Even though there are
various publications like [18] and [19] stating that own implementations of memory management
usually do not increase the application performance, memory management on today’s architectures
still remains a bottleneck. One way of avoiding a large number of memory allocation calls is using
an ObjectArena, as proposed in PBRT [20]. There, memory is allocated as multiple arrays of fixed
size where each pointer to the beginning of an array is stored in a list. Freeing the allocated memory
can easily be done by iterating over the list and freeing the referenced arrays. One advantage is
that many threads can be assigned to different lists. Thus, it is possible for them to compute the
address to write back to without any synchronization between the different SPEs.
This is particularly important for an implementation that performs the BIH construction independently of
the PPE. Because the new node’s address can be computed, build-up can run more independently of the
PPE since the frequency of new allocations is reduced. Unfortunately, such an implementation also
has disadvantages, as memory can be wasted because the last arrays may be only partially filled.
In addition, intelligent updates and the need to delete single nodes lead to fragmentation of the
arrays. To implement a construction in BFS manner, jobs, i.e. the intervals to be sorted on the
next tree level, need to be managed by an additional job queue. At the beginning the PPE allocates
two additional buffers whose size equals the primitive count. These buffers are denoted in the following as
WorkA_PPE and WorkB_PPE. The primitive count can be seen as a definite upper bound on the number of jobs that could be created
while processing one tree level. The actual construction is divided into two phases. In each phase,
WorkA_PPE and WorkB_PPE switch roles, i.e. in the first phase jobs are read from WorkA_PPE and
newly created ones are stored in WorkB_PPE, while this is done vice versa in the second phase. Figure
2 clarifies the two-phase execution. To do so, two additional buffers are allocated in the SPE’s local
store as well. We denote these buffers as WorkA and WorkB. In WorkA the jobs from the current tree
level, in the beginning that is the job for the root node, are stored. From each job in WorkA at most
two new jobs are created which are then stored in WorkB. If there are no more jobs in WorkA and
the current job is not the last one and if there are no more jobs in main memory, the current tree
level is entirely processed.
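The arena-style allocation discussed above can be sketched in Python as follows. The class `NodeArena` and its methods are hypothetical and only loosely modelled on the idea from PBRT [20]: nodes are addressed by a running index, storage grows in chunks of fixed size, and the position of a node follows from its index by simple arithmetic, so no allocation call is needed per node.

```python
class NodeArena:
    """Hands out node slots from fixed-size chunks instead of one allocation per node."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = []   # list of fixed-size arrays (the "list of pointers")
        self.used = 0      # number of slots handed out so far

    def alloc(self):
        """Return the index of a fresh slot, growing by one chunk if needed."""
        if self.used == len(self.chunks) * self.chunk_size:
            self.chunks.append([None] * self.chunk_size)  # one large allocation
        index = self.used
        self.used += 1
        return index

    def slot(self, index):
        """Compute (chunk, offset) of a slot from its index alone."""
        return divmod(index, self.chunk_size)

    def store(self, index, node):
        chunk, offset = self.slot(index)
        self.chunks[chunk][offset] = node

    def load(self, index):
        chunk, offset = self.slot(index)
        return self.chunks[chunk][offset]

    def wasted(self):
        """Slots that were allocated as part of a chunk but never handed out."""
        return len(self.chunks) * self.chunk_size - self.used

    def free_all(self):
        """Release everything by dropping the chunk list."""
        self.chunks.clear()
        self.used = 0


arena = NodeArena(chunk_size=4)
indices = [arena.alloc() for _ in range(10)]
arena.store(indices[9], {"leaf": True})
```

With ten nodes and chunks of four slots, three chunks are allocated and two slots of the last chunk stay unused, which illustrates the waste mentioned above. Deleting single nodes is not supported by this sketch; doing so would leave holes in the chunks.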
Should there be no more space left in WorkB, it is written back to the PPE. To keep track of how
many blocks were written back in the last phase, an additional counter is used. This counter can also
be used to determine if there are jobs left in main memory that have not already been processed.
In addition, this approach can be further optimized. While processing the first tree levels, the jobs
can be entirely stored on the SPE, so there is no need to write them back to main memory after the
tree level was processed. Therefore, to avoid these unnecessary DMA calls, the buffers on the SPE
can be swapped as well if there is no need to provide further storage for the newly created jobs.
Our method also has some disadvantages. While requesting new jobs from main memory can be
Fig. 2: Two-phase management of the job queue
handled asynchronously, this is not the case when WorkB needs to be written back to main memory.
Therefore a double buffering scheme for WorkA and WorkB could be used.
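The role switch of the two job buffers can be sketched in Python as follows. For brevity the sketch works on one-dimensional values, keeps both buffers in ordinary lists and does not model the DMA transfers between local store and main memory; the function name, the job tuple layout and the `max_levels` guard are assumptions made for illustration.

```python
def build_breadth_first(values, lo_bound, hi_bound, max_leaf=2, max_levels=32):
    """Level-by-level construction over 1-D values using two job buffers.

    A job is (first, last, box_lo, box_hi). Returns the sorted leaf intervals
    and the number of tree levels that were processed.
    """
    work_a = [(0, len(values), lo_bound, hi_bound)]  # jobs of the current level
    work_b = []                                      # jobs for the next level
    leaves, levels = [], 0
    while work_a and levels < max_levels:
        for first, last, box_lo, box_hi in work_a:
            if last - first <= max_leaf:
                leaves.append((first, last))
                continue
            split = 0.5 * (box_lo + box_hi)
            part = values[first:last]
            small = [v for v in part if v < split]
            large = [v for v in part if v >= split]
            values[first:last] = small + large       # sorting step of this job
            mid = first + len(small)
            # Each job creates at most two new jobs; empty sides create none.
            if small:
                work_b.append((first, mid, box_lo, split))
            if large:
                work_b.append((mid, last, split, box_hi))
        # Jobs cover disjoint, non-empty intervals, so their number is bounded
        # by the primitive count.
        assert len(work_b) <= len(values)
        work_a, work_b = work_b, []                  # the buffers switch roles
        levels += 1
    leaves.extend((f, l) for f, l, _, _ in work_a)   # jobs cut off by max_levels
    return sorted(leaves), levels


values = [7.0, 1.0, 6.0, 2.0, 5.0, 3.0]
leaves, levels = build_breadth_first(values, 0.0, 8.0)
```

The assertion inside the loop reflects why buffers sized to the primitive count suffice as an upper bound for one tree level. Because the jobs of one level are processed in array order, the primitive array is traversed once from beginning to end per level.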
Intelligent Local Update
To accelerate object transformations, an algorithm for performing dynamic updates of the data
structure is needed. The basic idea is to search the BIH region affected by given transformations
and avoid the complete rebuild if possible. This is adapted from [6], where a similar approach for
Bounding Volume Hierarchies based on [21] was presented. Results of our algorithm as well as
information on potential future optimizations are also stated in sections 4.2 and 5. The algorithm
uses the PPE (a traditional CPU core on the Cell chip), so it could be easily ported to other
architectures. The traversal method used during the update procedure is shown in algorithm 1.
Finding the affected interval of the underlying triangle array is of importance. To achieve this,
states of geometry before and after transformation have to be considered, as transformations may
obviously result in an object being moved to another region of the scene. Geometry is always
represented as complete objects in the current implementation. To be able to search for affected
regions, a bounding box is constructed which encloses the old and new geometry states. We call
this the Enclosing Bounding Box (EBB). To avoid quality degradation or degeneration of the
data structure, a complete rebuild is performed if the scene bounding box changes due to the
transformation. Using the EBB, the algorithm is able to find the subtree which contains all modified
geometry. Thus, it also encloses the corresponding array interval, which is obtained by traversing down to the leftmost
and rightmost leaves of this subtree, as this yields the starting and ending indices of the
interval. As shown in the algorithm, traversal is stopped as soon as both intervals overlap the EBB.
This is necessary because there is no information whether the geometry belongs to the lower or
upper interval. An alternative approach is explained in section 5. Finding the primitives to be
overwritten in the triangle array is now done by iterating over the array and replacing all triangles
with the corresponding geometry ID. This is stored in each triangle for material and texturing
purposes. Subsequently, all nodes below the enclosing node are deleted recursively and the usual
construction algorithm is performed for the subset of triangles given by the array interval. Note
that it is necessary to keep track of the bounding box for the tree level reached, which is at this
point used as if it was the actual scene bounding box. This is the case because we make use of the
global heuristic in [15]. The resulting BIH can then simply be added to the existing tree. The new
subtree may need adjustment of the pointer to its root, as the split axis may have changed. Results
of this approach as well as some advantages and shortcomings are presented in section 4.2.
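The construction of the EBB and the decision between update and complete rebuild can be sketched in Python as follows. Boxes are (min, max) tuples; the function names are hypothetical. Note the simplifying assumption: the sketch only detects a scene bounding box that would grow. Detecting a shrinking scene box would additionally require the bounds of all other objects.

```python
def union(a, b):
    """Smallest box enclosing boxes a and b."""
    (amin, amax), (bmin, bmax) = a, b
    return (tuple(map(min, amin, bmin)), tuple(map(max, amax, bmax)))


def contains(outer, inner):
    """True if box `inner` lies completely inside box `outer`."""
    (omin, omax), (imin, imax) = outer, inner
    return (all(o <= i for o, i in zip(omin, imin))
            and all(i <= o for i, o in zip(imax, omax)))


def plan_update(scene_box, old_box, new_box):
    """Return ("rebuild", None) or ("update", ebb).

    Assumption: only growth of the scene box is checked here.
    """
    if not contains(scene_box, new_box):
        return "rebuild", None
    return "update", union(old_box, new_box)  # EBB of old and new state


scene = ((0, 0, 0), (10, 10, 10))
old = ((1, 1, 1), (2, 2, 2))
moved_inside = ((4, 1, 1), (5, 2, 2))
moved_outside = ((9, 1, 1), (11, 2, 2))
print(plan_update(scene, old, moved_inside))
print(plan_update(scene, old, moved_outside))
```

The first call yields an EBB that spans both object states along the x axis, the second call requests a complete rebuild because the moved object leaves the scene box.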
Algorithm 1: BIH traversal algorithm used to find the smallest enclosing interval for EBB
while traverse do
    if leaf reached ∨ both intervals overlap EBB then
        traverse ← false;
    else if Only one interval overlaps EBB then
        if Interval does not contain EBB completely then
            if EBB overlaps splitting plane then
                Abort traversal;
            Set clipping plane for interval to corresponding value of EBB;
        Set traversal data to successive node;
        if Child is not a leaf then
            Set split axis;
    else
        traverse ← false;
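A strongly simplified Python reading of this traversal is sketched below. It only implements the descent rule (continue while exactly one child interval overlaps the EBB) and deliberately omits the clip-plane adjustment and the abort condition of Algorithm 1. The dictionary-based node layout with `axis`, `clip`, `first` and `last` is an assumption made for illustration.

```python
def find_enclosing_subtree(root, ebb):
    """Descend while exactly one child interval overlaps the EBB (simplified)."""
    ebb_min, ebb_max = ebb
    node = root
    while not node["leaf"]:
        axis = node["axis"]
        left_max, right_min = node["clip"]
        in_left = ebb_min[axis] <= left_max     # EBB reaches into left interval
        in_right = ebb_max[axis] >= right_min   # EBB reaches into right interval
        if in_left == in_right:                 # both or neither overlap: stop
            break
        node = node["left"] if in_left else node["right"]
    return node


def primitive_range(node):
    """Array interval covered by a subtree: leftmost first to rightmost last."""
    first = node
    while not first["leaf"]:
        first = first["left"]
    last = node
    while not last["leaf"]:
        last = last["right"]
    return first["first"], last["last"]


def leaf(first, last):
    return {"leaf": True, "first": first, "last": last}


tree = {"leaf": False, "axis": 0, "clip": (2.0, 6.0),
        "left": {"leaf": False, "axis": 1, "clip": (1.0, 3.0),
                 "left": leaf(0, 2), "right": leaf(2, 4)},
        "right": leaf(4, 8)}

small_ebb = ((0.0, 0.0, 0.0), (1.0, 0.5, 1.0))
large_ebb = ((1.0, 0.0, 0.0), (7.0, 1.0, 1.0))
print(primitive_range(find_enclosing_subtree(tree, small_ebb)))
print(primitive_range(find_enclosing_subtree(tree, large_ebb)))
```

For the small EBB the traversal ends in a single leaf, so only its primitives have to be rebuilt. The large EBB overlaps both intervals of the root, so traversal stops immediately and the whole array interval is affected.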
Results and Evaluation
All measurements of this section were obtained by running the reported algorithms on a Sony
Playstation 3.
Complete Rebuild
Model         #Triangles  Naive     Approximate sorting  Depth first  Breadth first
-             -           106 ms    63 ms                47 ms        24 ms
-             -           340 ms    201 ms               133 ms       68 ms
Fairy Forest  -           1,478 ms  781 ms               385 ms       195 ms
-             871,414     5,956 ms  Out of Memory        1,828 ms     897 ms
Table 1: Construction times for four different models in ms averaged from 10 runs (a dash marks an entry that is not available; the speedup columns are likewise not available)

Table 1 gives an overview of the different construction times and the resulting speedups. We tested
four different versions. Two were PPE based, i.e. the naive and the approximate sorting approach
proposed in [15]. The other two were SPE based, one in DFS and the other one in BFS manner. All
speedups relate to the naive approach on the PPE. The naive approach was the fastest available
algorithm on the PPE. This approach already includes vectorization. However, construction times
even for the small models like the ISS or the Stanford Bunny are far from interactive. Even though
the speedups from the approximate sorting are stated by [15] to be about 3-4, in our implementation
speedups of about 1.8 can be achieved. This is due to the maximum primitives per node constraint
in our ray tracing system. Nodes with more than 12 triangles need to be further subdivided. For
this the naive approach is used.
By looking at the two SPE based approaches it is apparent that the construction times for the
breadth first approach are always better than the depth first approach. This is because the breadth
first approach uses fewer synchronization calls between the SPE and the PPE. Table 1 shows that
speedups of about 5 to 7 can be expected. That speedups grow with model size cannot
be concluded from the table, because they depend not only on the model’s size but also on
the layout of the overall scene. Since all benchmarks were run using only one SPE, utilizing more
SPEs should lead to another significant performance gain. This is the topic of future work.
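The relation between the construction times and the quoted speedups can be checked with a few lines of Python. The sketch assumes that the four time columns of Table 1 belong to the naive, approximate sorting, depth first and breadth first variants in that order, and that a speedup is the naive time divided by the time of the respective method; the computed ratios are therefore derived values, not numbers reported in the table.

```python
# Construction times in ms, one entry per model (None: out of memory).
naive = [106, 340, 1478, 5956]
methods = {
    "approximate sorting": [63, 201, 781, None],
    "depth first": [47, 133, 385, 1828],
    "breadth first": [24, 68, 195, 897],
}


def speedups(base, times):
    """Speedup relative to the base times, rounded to two decimals."""
    return [None if t is None else round(b / t, 2) for b, t in zip(base, times)]


for name, times in methods.items():
    print(name, speedups(naive, times))
```

Under these assumptions the breadth first ratios lie between roughly 4.4 and 7.6 and the approximate sorting ratios between roughly 1.7 and 1.9, which is in line with the figures stated in the text.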
Partial Rebuild
When multiple, arbitrarily distributed objects are transformed, the result might be a very large
EBB. This will in turn result in rebuilding large parts of the scene, gaining almost no performance
boost (or none at all). A similar effect is caused by large triangles, as they induce an overlapping
of intervals. As shown before, this will lead to an early termination of the traversal algorithm.
For scenes with large empty parts and otherwise small geometry, a huge performance boost
can be expected. However, performance may suffer for counterintuitive reasons, e.g. when moving an
object through a huge empty region of the scene while overlapping the root split plane. This would
virtually result in a complete rebuild of the data structure. Nevertheless, good performance gains
can be obtained, as is shown in the following. Figure 3 illustrates results for the scene rtPerf,
which consists of spheres pseudorandomly distributed across the scene. Tests were run for triangle
counts from 10k to 100k and moving a number of spheres in a predefined scheme. This has been
performed several times and results were averaged. Figure 3a shows the correlation between the
Fig. 3: Measurements for the scene “rtPerf” with triangle counts from 10k to 100k. (a) Number of rebuilt triangles (used for reconstruction) over the number of triangles in thousands. (b) Rebuild and update times in ms (minimum, average and maximum time) over the number of triangles in thousands.
scene   tris    updates  rebuilds  ratio  tmin   tmax    tavg   trans   rec       diff      depth
carVis  22,106  383      -         3.42   0.48   154.09  69.71  468.47  7,549.25  7,080.78  2.22
cupVis  7,040   6        -         0.04   26.89  35.43   29.1   7,040   7,040     2,181.66  0
Table 2: Results for the scenes carVis and cupVis; from left to right: Number of triangles; number of updates;
number of complete rebuilds; ratio updates/rebuilds; min, max and averaged update/rebuild times; average
number of transformed triangles; average number of triangles used for rebuild; average difference of these
values; average traversal depth (a dash marks an entry that is not available)
total number of triangles and triangles used for rebuilds. While the number of transformed triangles
stays constant, there is in most cases an increase in triangle numbers used for rebuilds. This does
not simply increase by the number of triangles added to the scene, but just by a fraction of that
(which may vary depending on geometry distribution). It can also be seen that only a fraction
of data structure updates results in a complete rebuild, thus yielding a huge difference between
average and maximum update time, as shown in figure 3b. Here, the correspondence between times
needed for complete rebuilds and times achieved with the implemented update strategies is shown.
While having an almost linear increase for both, the performance is vastly better than with just
using brute force rebuild. Table 2 shows results for other testing scenes. Note that these scenes have
fixed triangle numbers. carVis is a car model which can be decomposed into its parts, while cupVis
is just a simple model of a glass without any surrounding geometry. As a result, in
cupVis a complete rebuild is performed each time the geometry is moved. The small number of
updates in the table results from the object being marked for transformation without a
transformation actually being performed. Thus the scene bounding box as well as all the triangles remain the same
and an update (without anything happening at all) can be performed.
Conclusion and Future Work
In the current implementation of the ray tracer there was only the possibility to utilize one SPE
for tree construction since the others were used for rendering. However, since rendering only takes
place in between the data structure updates, in principle all SPEs could be utilized. Even though
an implementation for more SPEs has not been done yet, we want to introduce some ideas for a
reasonable parallel construction.
One way of utilizing more threads could be realized in the depth first based construction. Here,
the recursive call could be delegated to different SPEs. However, this is not very efficient because on
the upper tree levels many SPEs would be idle. The same problem arises when such a parallelization
would be done using the breadth first construction and allowing all threads to have access to one
common job queue. Additionally, this would make further synchronization between the SPEs and
the PPE necessary since the access to the job queue must be synchronized. In order to get the
best performance out of the SPEs, synchronization must be reduced and the execution should be
entirely independent from the PPE.
One method that could be used in the context of BIH parallelization was proposed in [12] for
BVH construction on GPUs. There, a pre-processing step using Morton curves is performed to
find regions of primitives that can then be built completely independently. The construction of the
Morton curve can be efficiently parallelized since the Morton code of a primitive can be
computed completely independently of the others. After the Morton code construction a radix sort needs to
be performed. Radix sort is advantageous because it does not involve comparisons. Fortunately there
are methods for radix sort on the Cell B. E., like [22] or [23]. By doing such a pre-computation step,
better load balancing while using the SPEs in parallel can be achieved. Other possible improvements
concern the job queue. One improvement could be made using a double buffering scheme for the two
buffers on the SPE to avoid synchronous memory access. The necessary DMA calls might benefit
from the usage of DMA lists. Another possible way of dealing with the management of the job
queue is the use of the Cell’s software-managed cache, for which some improvements have been made
recently [24].
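The Morton code pre-processing step mentioned above can be sketched in Python as follows. The sketch quantizes each coordinate of a primitive centroid to 10 bits and interleaves the bits into a 30-bit code using the common shift-and-mask technique. It assumes centroids normalized to [0, 1) with respect to the scene bounding box, and Python's built-in sort stands in for the radix sort; none of this is taken from [12] or from an existing Cell implementation.

```python
def expand_bits(v):
    """Spread the lower 10 bits of v so that two zero bits follow each bit."""
    v &= 0x3FF
    v = (v | (v << 16)) & 0x030000FF
    v = (v | (v << 8)) & 0x0300F00F
    v = (v | (v << 4)) & 0x030C30C3
    v = (v | (v << 2)) & 0x09249249
    return v


def quantize(c):
    """Map a coordinate in [0, 1) to an integer in [0, 1023]."""
    return min(max(int(c * 1024), 0), 1023)


def morton3(x, y, z):
    """30-bit Morton code of a point with coordinates in [0, 1)."""
    return ((expand_bits(quantize(x)) << 2)
            | (expand_bits(quantize(y)) << 1)
            | expand_bits(quantize(z)))


def order_by_morton(centroids):
    """Primitive indices ordered along the Morton curve.

    Each code depends on one primitive only, so the codes could be computed
    in parallel; sorted() is a stand-in for the radix sort.
    """
    codes = [morton3(*c) for c in centroids]
    return sorted(range(len(centroids)), key=codes.__getitem__)


centroids = [(0.9, 0.9, 0.9), (0.1, 0.1, 0.1), (0.6, 0.1, 0.1), (0.1, 0.2, 0.1)]
order = order_by_morton(centroids)
```

Primitives that are close on the curve tend to be close in space, so contiguous runs of the sorted array are natural candidates for regions that different SPEs could build independently.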
To achieve better performance of the update strategy, an approach for further
traversal of the data structure could be beneficial. This way, far fewer triangles could be involved in
the update step by keeping subtrees which are not affected by transformations. However, it has to be
analyzed how deep the tree should be traversed, as keeping subtrees will often lead to index problems
with the underlying triangle array. A metric for estimating the time needed for reorganization steps
is needed to cope with that, as further traversal potentially leads to more reorganization overhead
due to memcopy operations. Also, more leaves’ indices might have to be adjusted accordingly.
This work was sponsored by the German Federal Ministry of Education and Research (BMBF)
under grant no 1762X07.
1. Popov, S., Günther, J., Seidel, H.P., Slusallek, P.: Stackless kd-Tree Traversal for High Performance
GPU Ray Tracing. In: Computer Graphics Forum (Proc. Eurographics) 26. Volume 3. (2007) 415–424
2. Wald, I., Havran, V.: On Building Fast kd-trees for Ray Tracing, and on doing that in O(N log N). In:
Proc. of IEEE Symp. on Interactive Ray Tracing. (2006) 61–69
3. Ize, T., Wald, I., Robertson, C., Parker, S.: An Evaluation of Parallel Grid Construction for Ray
Tracing Dynamic Scenes. In: Proceedings of the 2006 IEEE Symposium on Interactive Ray Tracing,
Salt Lake City, Utah. (2006) 47–55
4. Lauterbach, C., Yoon, S., Tuft, D., Manocha, D.: RT-DEFORM: Interactive Ray Tracing of Dynamic
Scenes using BVHs. In: IEEE Symposium on Interactive Ray Tracing, Salt Lake City, Utah (2006)
5. Wald, I., Boulos, S., Shirley, P.: Ray Tracing Deformable Scenes using Dynamic Bounding Volume
Hierarchies. In: ACM Transactions on Graphics 26. Volume 1. (2007)
6. Eisemann, M., Grosch, T., Magnor, M., Müller, S.: Automatic Creation of Object Hierarchies for Ray
Tracing Dynamic Scenes. In Skala, V., ed.: WSCG Short Papers Post-Conference Proceedings, WSCG
7. Popov, S., Günther, J., Seidel, H.P., Slusallek, P.: Experiences with Streaming Construction of SAH
kd-Trees. In: Proceedings of the 2006 IEEE Symposium on Interactive Ray Tracing, Utah. (2006) 89–94
8. Hunt, W., Mark, W.R., Stoll, G.: Fast kd-tree Construction with an Adaptive Error-Bounded Heuristic.
In: 2006 IEEE Symposium on Interactive Ray Tracing. (2006)
9. Zhou, K., Hou, Q., Wang, R., Guo, B.: Real-time kd-tree Construction on Graphics Hardware. In:
ACM Trans. Graph. Volume 27., New York, NY, USA, ACM (2008) 1–11
10. Wald, I., Ize, T., Kensler, A., Knoll, A., Parker, S.G.: Ray Tracing Animated Scenes using Coherent
Grid Traversal. In: ACM Transactions on Graphics 25: Proceedings of ACM SIGGRAPH 2006, Boston,
MA. Volume 3. (2006) 485–493
11. Ize, T., Wald, I., Parker, S.G.: Asynchronous BVH Construction for Ray Tracing Dynamic Scenes on
Parallel Multi-Core Architectures. In Favre, J.M., dos Santos, L.P., Reiners, D., eds.: Eurographics
Symposium on Parallel Graphics and Visualization. (2007)
12. Lauterbach, C., Garland, M., Sengupta, S., Luebke, D., Manocha, D.: Fast BVH Construction on
GPUs. In: Proc. Eurographics 2009. Volume 28., München, Germany (2009)
13. Zhou, K., Gong, M., Huang, X., Guo, B.: Highly Parallel Surface Reconstruction. Technical Report
MSR-TR-2008-53, Microsoft Technical Report (2008)
14. Ajmera, P., Goradia, R., Chandran, S., Aluru, S.: Fast, Parallel, GPU-based Construction of Space
Filling Curves and Octrees. In: SI3D ’08: Proceedings of the 2008 Symposium on Interactive 3D
Graphics and Games, Electronic Arts Campus, Redwood City, CA, New York, NY, USA, ACM (2008)
15. Wächter, C., Keller, A.: Instant Ray Tracing: The Bounding Interval Hierarchy. In: Rendering
Techniques 2006: Proceedings of the 17th Eurographics Symposium on Rendering, Nicosia, Cyprus (2006)
16. Kahle, J.A., Day, M.N., Hofstee, H.P., Johns, C.R., Maeurer, T.R., Shippy, D.: Introduction to the Cell
Multiprocessor. IBM J. Res. Dev. 49 (2005) 589–604
17. Grama, A., Gupta, A., Karypis, G., Kumar, V.: Introduction to Parallel Computing. Volume 2. Addison
Wesley Pub Co Inc. (2003)
18. Johnstone, M.S., Wilson, P.R.: The Memory Fragmentation Problem: Solved? In: ISMM ’98: Proceedings of the 1st International Symposium on Memory Management, Vancouver, Canada, New York, NY,
USA, ACM (1998) 26–36
19. Berger, E.D., Zorn, B.G., McKinley, K.S.: Composing High-Performance Memory Allocators. In:
PLDI ’01: Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and
implementation, Utah, New York, NY, USA, ACM (2001) 114–124
20. Pharr, M., Humphreys, G.: Physically Based Rendering: From Theory to Implementation (The
Interactive 3d Technology Series). Morgan Kaufmann, San Francisco (2004)
21. Goldsmith, J., Salmon, J.: Automatic Creation of Object Hierarchies for Ray Tracing. IEEE Comput.
Graph. Appl. 7 (1987) 14–20
22. No author given:
Parallel radix sort on cell (2007) http://sourceforge.net/projects/
sorting-on-cell/, last viewed: 28.02.09.
23. Ramprasad, N., Baruah, P.K.: Radix sort on the Cell Broadband Engine. Department of Mathematics
and Computer Science, Sri Sathya Sai University, www.hipc.org/hipc2007/posters/radix-sort.pdf,
last viewed: 10.02.09 (2007)
24. Guofu, F., Xiaoshe, D., Xuhao, W., Ying, C., Xingjun, Z.: An Efficient Software-Managed Cache Based
on Cell Broadband Engine Architecture. Int. J. Distrib. Sen. Netw. 5 (2009) 16–16