Novel Algorithms for Index & Vertex Data Compression and Decompression

Alex Berliner, Brian Estes, Samuel Lerner
School of Electrical Engineering and Computer Science, University of Central Florida, Orlando, Florida, 32816-2450

Abstract — Modern graphics cards are performing tremendous amounts of work to maintain the expected quality of current-generation graphical applications. A bottleneck exists where the speed of the bus connecting the graphics card and a computer's processor is not sufficient to quickly transmit draw data, known as index and vertex data. The goal of this project was to research, implement, and identify lossless algorithms that would be able to compress this data efficiently. Over the course of the project, the group evaluated many algorithms including run-length encoding, LZO compression, floating point masks, and Golomb encoding.

Index Terms — Index, vertex, graphics, compression, decompression, lossless.

I. INTRODUCTION

Modern graphics cards perform tremendous amounts of work to maintain the visual fidelity expected of current-generation graphical applications. Although it is important to improve the hardware that supports these cards, it is equally important to optimize software to make as much use of the existing hardware as possible. The compression of data before transfer through a hardware bus is a shining example of this method of optimization. Efficient compression algorithms can be used to deliver an overall faster system than what can be accomplished with hardware optimizations alone.

The goal of this project was to implement efficient lossless compression and decompression algorithms for use in the modern graphics pipeline. The algorithms compress the data that go into both the vertex and index buffers. This reduction in the size of the information being transferred through the buffer allows a higher effective throughput of previously uncompressed objects. After data is fetched from a buffer, it is quickly decompressed on the graphics processing unit (GPU) and then used normally.

The implementation of these algorithms will increase the speed and efficiency at which a graphics card can operate by allowing the card to spend less time waiting for new information to transfer into the buffer from the computer's main memory. The transfer rate is increased not by increasing the size of the transfer buffer, but by decreasing the size of the data being sent through the buffer. If the algorithms that the group creates achieve a compression ratio of C, the overall throughput will change from “one data per transfer” to “C data per transfer”.

II. DATA TYPES

The first of the data types to be compressed is vertex data. This data can potentially contain both decimal and integer values and, depending on the situation, can be arranged in numerous different structures. Each entry contains information that describes a single point of the object; by connecting these points and using the contained data, the 3D object is drawn.

The second type of data to be compressed is index data, which consists of only integer values. Each integer is a single data point that is used similarly to how indices are used in an array of variables in programming. The number points to the section of vertex data that describes a single vertex of the object. By using index information, the object only has to define each vertex once and can then reuse that point when drawing the object.

Fig. 1 displays how two triangles are drawn using an index buffer and a vertex buffer. The arrows between the buffers show how each index points to a section in the vertex buffer. The vertex buffer is formatted as Position(x,y,z) and Color(r,g,b). It can be seen that within the index buffer, vertices 1 and 3 are reused to draw both triangles.

Fig. 1. Interaction between index and vertex information.
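As a concrete illustration of these two data types, the hypothetical C declarations below mirror the layout shown in Fig. 1: each vertex holds a position and a color, and the index buffer reuses vertices 1 and 3 to describe two triangles. The struct, field names, and coordinate values are illustrative assumptions only and are not the project's actual data format.

    #include <stdint.h>

    /* One vertex as laid out in Fig. 1: Position(x,y,z) followed by Color(r,g,b). */
    typedef struct {
        float x, y, z;   /* position */
        float r, g, b;   /* color    */
    } Vertex;

    /* Four unique vertices are enough to describe two triangles. */
    static const Vertex vertex_buffer[4] = {
        { 0.0f, 0.0f, 0.0f,  1.0f, 0.0f, 0.0f },  /* vertex 0 */
        { 1.0f, 0.0f, 0.0f,  0.0f, 1.0f, 0.0f },  /* vertex 1 */
        { 0.0f, 1.0f, 0.0f,  0.0f, 0.0f, 1.0f },  /* vertex 2 */
        { 1.0f, 1.0f, 0.0f,  1.0f, 1.0f, 0.0f },  /* vertex 3 */
    };

    /* The index buffer references vertices 1 and 3 in both triangles rather
     * than storing duplicate vertex records, the reuse shown in Fig. 1. */
    static const uint32_t index_buffer[6] = { 0, 1, 3,   1, 2, 3 };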
III. REQUIREMENTS

The group recognized that the most influential requirement for the project was that the algorithm must be lossless. Many algorithms modify or truncate the data being compressed in an attempt to save space with little loss of meaningful information. This approach works in some situations, such as audio and video compression, but when the data is vertex information, small changes may lead to large discrepancies in rendered appearance. This means that the information that is compressed cannot be changed in any form for transmission.

The second requirement for this project was that the algorithm should be able to perform decompression quickly while still maintaining an acceptable level of compression. Since the decompression sequence is intended to be performed in real time, while the compression sequence can be performed in advance, low decompression time was favored over low compression time and a high compression ratio. Although an ideal compression algorithm would be highly efficient in terms of compression ratio, compression time, and decompression time, a real solution can only be so good in one area before acting to the detriment of one or more of the others.

IV. SPECIFICATIONS

A. Compression

A compression algorithm that can also execute quickly is, however, not without value. If the compression algorithm that is used happens to be fast, it can be put to use by having the CPU compress assets before they are sent through the graphics pipeline. A situation like this might occur in a game that was not built with these optimizations in mind: if the assets in the project were not compressed when they were built, they can still benefit from the compression/decompression system through on-the-fly compression.

B. Decompression

The compression algorithms that are created must be made in such a way that they support some amount of random access capability. The contents of a buffer being sent to a GPU contain many objects, and the GPU may not want to access these objects in the order that they are presented. If the compression algorithm is written in such a way that the block it creates must all be decompressed at the same time or in sequence, then significant overhead will be incurred when trying to access a chunk that is in the middle.
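One way this random-access requirement could be met is sketched below: a small directory records where each fixed-size chunk begins in the compressed stream and how large it is once expanded, so a consumer can locate and decompress only the chunk it needs. This is a minimal sketch under assumed names (ChunkEntry, ChunkDirectory, chunk_locate) and is not the group's actual layout.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical directory entry: where a chunk starts in the compressed
     * stream and how many bytes it expands to. */
    typedef struct {
        uint32_t compressed_offset;   /* byte offset into the compressed buffer */
        uint32_t uncompressed_size;   /* size of the chunk once decompressed    */
    } ChunkEntry;

    typedef struct {
        uint32_t    chunk_size;       /* fixed uncompressed chunk size, e.g. 4 KB */
        uint32_t    chunk_count;
        ChunkEntry *entries;          /* one entry per chunk                      */
    } ChunkDirectory;

    /* Map an uncompressed byte offset to the chunk that must be decompressed.
     * Only that single chunk is fetched and expanded, instead of the whole block. */
    static const ChunkEntry *chunk_locate(const ChunkDirectory *dir, size_t byte_offset)
    {
        uint32_t index = (uint32_t)(byte_offset / dir->chunk_size);
        return (index < dir->chunk_count) ? &dir->entries[index] : NULL;
    }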
V. TESTING ENVIRONMENT

The group recognized that it was going to be difficult to maintain a standardized method of code generation, testing, and data output without taking coordination precautions. To prevent these code fragmentation issues, the group first worked to create a testing environment for their algorithms to run in that provided standardized tools and output procedures, such as standardized file reading, buffer management, data formatting, checksum validation, and time and space metric keeping. The testing environment was written in C and was used when comparing different implementations and optimizations of the group's algorithms with previous attempts. By developing the environment in C, the group hoped to avoid the complications of using a shader program, which would have introduced more complexity than needed to test the algorithms.

VI. INDEX COMPRESSION

A. Delta

Delta encoding is an encoding and decoding method that, when run on a list of integers, generates a list of deltas: the differences between each value in the list and the previous value. The first value in the list is kept in its original form and is named the anchor point. This list encodes the original list of integers as potentially smaller numbers by storing only the differences, which, when saved, results in less space used. When decoding the list, these deltas are used to recalculate the original numbers by adding each delta to the previous item in the list. The decoder runs through the list one value at a time, and the resulting list is identical to the original. This process can be seen in Fig. 2, which shows an example of delta encoding. The encoded data contains the anchor point 5, and the delta between that anchor point and the next value, 4, is stored in the encoded list as -1. The decoded list is generated similarly by taking the anchor point and adding the next value to it, in this case 5 + (-1), which results in the original value of 4. Due to the nature of how delta encoding works, integer data that does not vary much from one entry to the next offers the highest potential compression. Because of this, index data is a prime candidate for encoding, as the indices of an object will tend to vary little from their neighbors in the buffer. (A minimal C sketch of this encode and decode process appears at the end of this section.)

Fig. 2. Example of Delta Encoding.

B. Run Length

Run-length encoding is a simple compression algorithm that turns consecutive appearances of a single character, a “run”, into a pairing of the number of times that the character appears followed by the character being compressed. As can be seen in Fig. 3, a run of 5 a's in a row would take up 5 individual characters when uncompressed. The compression algorithm turns this into “5a”, which takes up a mere 2 characters. The algorithm must also recognize when not to use this technique in situations where doing so would increase the file size. As with the last character being encoded, “z”, compressing it into “1z” would double its size, and so it is left alone. Decompressing a run-length encoded file is simply the process of re-inflating the elements from their counted forms by inserting the specified element the denoted number of times.

Fig. 3. Example of Run Length Encoding.

C. Golomb-Rice

Golomb-Rice coding is an algorithm designed by Solomon Golomb and iterated upon by Robert Rice. It takes in an integer and translates it into a binary sequence. It is based on integer division, with a divisor that is decided upon before runtime. It works by dividing the integer being compressed by the chosen divisor and writing the quotient and remainder as a single sequence. The quotient from the result of this division is written in unary notation. Unary is essentially a base-1 number system: each integer in unary is written as a series of one digit repeated to match the quantity the integer represents. For example, the integer three is written as 111 followed by a space. We cannot accurately express the space in a binary sequence, so it is instead represented by a 0 in our program. The remainder from the result of the division operation is simply written in binary. A unary sequence requires many more digits to represent an integer than a binary sequence does. Because of this, choosing a large divisor when using Golomb-Rice compression is encouraged.
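As referenced in Section VI-A, the following is a minimal sketch of delta encoding and decoding in C, the language of the testing environment. The function names and the use of 32-bit signed indices are assumptions made for illustration, not the project's actual code.

    #include <stddef.h>
    #include <stdint.h>

    /* Delta-encode: keep the first value (the anchor point) as-is, then store
     * the difference between each value and the one before it. */
    void delta_encode(const int32_t *in, int32_t *out, size_t n)
    {
        if (n == 0) return;
        out[0] = in[0];                       /* anchor point */
        for (size_t i = 1; i < n; i++)
            out[i] = in[i] - in[i - 1];       /* small deltas for slowly varying data */
    }

    /* Delta-decode: add each delta to the previously reconstructed value,
     * exactly reversing the encoder and reproducing the original list. */
    void delta_decode(const int32_t *in, int32_t *out, size_t n)
    {
        if (n == 0) return;
        out[0] = in[0];                       /* anchor point */
        for (size_t i = 1; i < n; i++)
            out[i] = out[i - 1] + in[i];
    }

Encoding a list beginning {5, 4, ...} produces {5, -1, ...}, matching the anchor-point example of Fig. 2; in the Delta-RLE combination reported in Section VIII, the resulting deltas would then be run-length encoded.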
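A sketch of the Golomb-Rice coding step described in Section VI-C is shown next. It assumes the Rice restriction of a power-of-two divisor 2^k so that the remainder fits in exactly k bits, and it follows the paper's convention of terminating the unary quotient with a 0; the bit-emitting callback and function name are stand-ins for a real packed bit writer.

    #include <stdint.h>

    /* Emit a single bit (0 or 1); in a real system this would append to a
     * packed output buffer.  The callback form is purely illustrative. */
    typedef void (*emit_bit_fn)(int bit, void *ctx);

    /* Golomb-Rice encode one non-negative value with divisor 2^k:
     * the quotient is written in unary (q ones terminated by a 0, as described
     * in Section VI-C) and the remainder is written in k binary digits. */
    void rice_encode(uint32_t value, unsigned k, emit_bit_fn emit, void *ctx)
    {
        uint32_t q = value >> k;              /* quotient  = value / 2^k */
        uint32_t r = value & ((1u << k) - 1); /* remainder = value % 2^k */

        while (q--)                           /* unary part: q ones ...  */
            emit(1, ctx);
        emit(0, ctx);                         /* ... terminated by a 0   */

        for (int i = (int)k - 1; i >= 0; i--) /* binary part, MSB first  */
            emit((r >> i) & 1u, ctx);
    }

With k = 2 (divisor 4), the value 11 encodes as 11011: the quotient 2 in unary (110) followed by the remainder 3 in two binary digits (11).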
VII. VERTEX COMPRESSION

There are numerous research papers that describe attempts to create effective vertex compression algorithms. Some of these algorithms work at the time of vertex data creation, when the actual 3D object is built, instead of at the time of data transfer. There are also some algorithms proposed for vertex compression that are lossy; they are used with the assumption that the programs drawing the 3D objects do not need the precision that 32-bit vertex float data offers. These, however, were unacceptable for this project, as any changes in the data could result in graphical errors or the loss of important data in scientific or engineering simulations [1].

A. Burtscher-Ratanaworabhan

Burtscher-Ratanaworabhan, also labeled BR encoding, is a predictive, hash-based vertex compression algorithm. The hope with this algorithm is to save space by using previously recorded data entries over the course of compressing a file to predict what the next value is going to be, and then storing only the difference between the predicted value and the actual value. It works by sequentially predicting each value in a data file, performing an XOR operation on the actual value and the predicted value to increase data uniformity, and then finally performing leading-zero compression on the result of the XOR operation. The hash tables that are used for value prediction are called the FCM and DFCM tables, which stand for Finite Context Method and Differential Finite Context Method, respectively. An FCM uses a two-level prediction table to predict the next value that will appear in a sequence. The first level stores the history of recently viewed values, known as a context, and keeps an individual history for each location of the program counter of the program it is running in. The second level stores the value that is most likely to follow the current one, using each context as a hash index. After a value is predicted from the table, the table is updated to reflect the real result for that context. DFCM prediction works in a similar fashion; instead of storing each actual value encountered as in a normal FCM, only the difference between consecutive values is stored [2]. (A simplified sketch of this predict-XOR-compress step appears at the end of this section.)

B. Lempel-Ziv-Oberhumer

Lempel-Ziv-Oberhumer, or LZO, describes a family of compression algorithms based on the LZ77 compressor, which also underlies other popular compression schemes such as the DEFLATE algorithm used in PNG files and the related LZW algorithm used in GIF files. LZO algorithms focus on decompression speed while still achieving acceptable levels of compression. LZO compresses a block of data into “matches” using a sliding window. This is done using a small memory allocation to store a “window” ranging in size from 4 to 64 kilobytes. This window holds a section of data, and the algorithm slides it across the input to see whether it matches the current block. When a match is found, it is replaced by a reference to the location of the original block. Blocks that do not match the current “window” of data are stored as-is, which creates runs of non-matching literals in between those that matched. For this project, LZO1-1 was implemented, as this variant focuses more on decompression speed than on compression ratio.

Fig. 5 displays a comparison between BR and LZO compression rates from our tests. Unlike the results for the index compression algorithms, this comparison shows a clearly better algorithm: LZO has a consistently higher compression ratio and almost double the compression rate when compared to BR.
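As referenced in Section VII-A, the sketch below illustrates the predict-XOR-count idea behind BR encoding with a stripped-down FCM-style predictor for 32-bit words. It is a simplified illustration of the general technique, not the group's implementation or the full FPC algorithm of [2]; the table size, the hash, and all names are assumptions.

    #include <stdint.h>

    #define FCM_TABLE_SIZE 4096              /* illustrative power-of-two table size */

    /* A minimal FCM-style predictor: the recent history (context) is hashed to
     * index a table holding the value that last followed that context.
     * Zero-initialize the struct before use. */
    typedef struct {
        uint32_t history;                    /* compressed record of recent values */
        uint32_t table[FCM_TABLE_SIZE];      /* context -> predicted next value    */
    } FcmPredictor;

    static uint32_t fcm_predict(const FcmPredictor *p)
    {
        return p->table[p->history % FCM_TABLE_SIZE];
    }

    static void fcm_update(FcmPredictor *p, uint32_t actual)
    {
        p->table[p->history % FCM_TABLE_SIZE] = actual;   /* learn the outcome */
        p->history = (p->history << 5) ^ actual;          /* fold into context */
    }

    /* One compression step: XOR the prediction with the actual value so that a
     * good prediction produces many leading zero bits, which can then be stored
     * compactly (leading-zero compression). */
    static uint32_t br_residual(FcmPredictor *p, uint32_t actual, unsigned *leading_zeros)
    {
        uint32_t residual = actual ^ fcm_predict(p);
        unsigned lz = 0;
        for (uint32_t mask = 0x80000000u; mask && !(residual & mask); mask >>= 1)
            lz++;
        *leading_zeros = lz;                 /* only (32 - lz) bits need storing */
        fcm_update(p, actual);
        return residual;
    }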
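For LZO, the project relied on an existing implementation rather than a custom coder. The call sequence below sketches how the miniLZO distribution's LZO1X-1 entry points are commonly used on a block of vertex data; whether the group used this exact variant and interface is an assumption, and error handling is kept to a minimum.

    #include <stdio.h>
    #include <stdlib.h>
    #include "minilzo.h"                     /* miniLZO header from the LZO distribution */

    int compress_vertex_block(unsigned char *in, lzo_uint in_len)
    {
        /* Work memory required by the LZO1X-1 compressor, aligned as LZO expects. */
        static lzo_align_t wrkmem[(LZO1X_1_MEM_COMPRESS + sizeof(lzo_align_t) - 1)
                                  / sizeof(lzo_align_t)];

        /* LZO may expand incompressible input slightly, so size the output
         * buffer with the customary margin. */
        lzo_uint out_cap = in_len + in_len / 16 + 64 + 3;
        unsigned char *out = malloc(out_cap);
        if (!out || lzo_init() != LZO_E_OK) {
            free(out);
            return -1;
        }

        lzo_uint out_len = 0;
        if (lzo1x_1_compress(in, in_len, out, &out_len, wrkmem) != LZO_E_OK) {
            free(out);
            return -1;
        }

        printf("compressed %lu bytes into %lu bytes\n",
               (unsigned long)in_len, (unsigned long)out_len);
        free(out);
        return 0;
    }

The resulting out_len could then be recorded alongside the block so that lzo1x_decompress can restore it on the consumer side.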
VIII. RESULTS

All tests were done using our testing environment. Each test was run on a computer containing an Intel Core i7-4785T @ 2.20 GHz with 8 GB of RAM, running Windows 8.1 Pro. Each data file was run 10 times and the results were averaged.

The Delta-RLE integer compression algorithm compresses the index buffer by an average of 46.25% when run on sequential data. It has an average compression time of 0.83 milliseconds and an average decompression time of 0.76 milliseconds. It is able to compress data at an average of 400 MB/second and decompress data at 250 MB/second.

The Golomb-Rice integer compression algorithm compresses the index buffer by an average of 42.01%. It has an average compression time of 14 milliseconds and an average decompression time of 14 milliseconds. It is able to compress data at 28 MB/second and decompress data at 16 MB/second on average.

The LZO1-1 float compression algorithm compresses the vertex buffer by an average of 32.58%. It has an average compression time of 5.1 milliseconds and an average decompression time of 2.9 milliseconds. It is able to compress data at 500 MB/second and decompress data at 600 MB/second.

The BR float compression algorithm compresses the vertex buffer by an average of 14.03%. It has an average compression time of 9 milliseconds and an average decompression time of 7.6 milliseconds. It is able to compress data at a rate of 200 MB/second and decompress data at 200 MB/second.

Fig. 4 displays a comparison between Delta-RLE and Golomb-Rice compression rates from our tests. It is important to note that the compression rates of both algorithms remain relatively comparable throughout the tests. However, due to Golomb-Rice's slow decompression speeds, it was deemed the less fit algorithm for our project.

Fig. 4. Comparison between Delta-RLE and Golomb-Rice compression rates.

Fig. 5. Comparison between LZO and BR compression rates.

IX. CONCLUSION

The four algorithms described in this paper were implemented and tested, and the resulting data shows that for index buffer compression, Golomb-Rice is significantly less efficient than Delta-RLE. Delta-RLE can compress and decompress roughly 16 times faster than Golomb-Rice. Using Golomb-Rice is only advantageous when run on random data, as it maintains a consistent compression ratio; conversely, the Delta-RLE algorithm has far lower compression results when run on random data. For vertex buffer compression, LZO was the better-fitting algorithm for this project: it provides 18.55% more compression than BR encoding, compresses at around twice the normalized rate (MB/s) of BR, and decompresses at three times the normalized rate of BR. This shows that LZO and other LZ77-based algorithms are recommended moving forward when attempting to compress index and vertex data for graphics cards.

ACKNOWLEDGEMENT

The authors wish to acknowledge the assistance and support of Todd Martin and Mangesh Nijasure from Advanced Micro Systems, as well as Dr. Sumanta Pattanaik and Dr. Mark Heinrich from the University of Central Florida.

REFERENCES

[1] P. H. Chou and T. H. Meng, "Vertex data compression through vector quantization," IEEE Transactions on Visualization and Computer Graphics, vol. 8, no. 4, pp. 373-382, 2002.

[2] M. Burtscher and P. Ratanaworabhan, "FPC: A high-speed compressor for double-precision floating-point data," IEEE Transactions on Computers, vol. 58, no. 1, pp. 18-31, Jan. 2009.