Novel Algorithms for Index & Vertex Data Compression and Decompression

Authors: Alex Berliner, Brian Estes, Samuel Lerner
Contributors: Todd Martin, Mangesh Nijasure, Dr. Sumanta Pattanaik
Sponsors:

Table of Contents

Executive Summary
Project Overview
  2.1 Identification of Project
  2.2 Motivation for Project
    2.2.1 Alex
    2.2.2 Brian
    2.2.3 Sam
  2.3 Goals and Objectives
    2.3.2 Testing Environment Objectives
    2.3.3 Algorithm Development Objectives
  2.4 Specifications
    2.4.1 Index Compression Specifications
    2.4.2 Compression Specifications
    2.4.3 Decompression Specifications
  2.5 Space Efficiency
  2.6 Requirements
    2.6.1 Overall requirements
    2.6.2 Compression Requirements
    2.6.3 Decompression Requirements
Research
  3.1 Data types
  3.2 Graphics
    3.2.1 General Graphics pipeline
    3.2.2 Index buffer
    3.2.3 Vertex buffer
  3.3 Index Compression Research
    3.3.1 Delta Encoding
    3.3.2 Run Length Encoding
    3.3.3 Huffman Coding
    3.3.4 Golomb-Rice
  3.4 Vertex Compression Research
    3.4.1 Statistical Float Masking
    3.4.2 BR Compression
    3.4.3 LZO Compression
  3.5 Additional Research
    3.5.1 Testing Environment Language: C vs. C++
    3.5.2 AMP code
Design Details
  4.1 Initial Design
    4.1.1 Offline Compression
    4.1.2 Online Compression
  4.2 Testing Environment
    4.2.1 Initial Environment Design
    4.2.2 Data Recording
    4.2.3 Scoring Method
    4.2.4 Dataset Concerns
  4.3 Index Compression
    4.3.1 Delta Encoding
    4.3.2 Other Considered Algorithms
    4.3.3 Delta Optimization
    4.3.4 Golomb-Rice
  4.4 Vertex Compression
Build, Testing and Evaluation Plan
  5.1 Version Control
    5.1.1 What is Version Control
    5.1.2 Choosing a Version Control System
  5.2 Test Runs
  5.3 Index algorithm development
    5.3.1 Delta Encoding
    5.3.2 Golomb-Rice Encoding
    5.3.3 Index Compression comparison
  5.4 Vertex algorithm development
    5.4.1 Test Data
    5.4.2 Vertex Algorithm Implementation
    5.4.3 Vertex Compression comparison
  5.5 Test Environment
    5.5.1 File Reader
    5.5.2 Compression and Decompression algorithms
    5.5.3 Testing code
    5.5.4 Data Printer
Administrative Content
  6.1 Consultants
    6.1.1 AMD
    6.1.2 Dr. Sumanta N. Pattanaik
    6.1.3 Dr. Mark Heinrich
    6.1.4 Dr. Shaojie Zhang
  6.2 Budget
    6.2.1 A Graphics Processing Unit
    6.2.2 Version control
    6.2.3 Algorithm Licenses
    6.2.4 Document Expenses
    6.2.5 Estimated Expenditures
    6.2.6 Actual Expenditures
  6.3 Project Milestones
    6.3.1 First Semester
    6.3.2 Second Semester
Summary/Conclusion
  7.1 Design Summary
  7.2 Successes
    7.2.1 Initial Research
    7.2.2 Testing Environment
    7.2.3 Index Compression
    7.2.4 Vertex Compression
  7.3 Difficulties
    7.3.1 Vertex Compression Difficulties
    7.3.2 Future Improvements
Appendices
  8.1 Copyright
  8.2 Datasheets
  8.3 Software/Other
Bibliography

List of Figures

Figure 1.1: PCI-E Speeds: A table detailing the speeds of the various versions of the PCI-E bus.
Figure 2.1: Providing an index number 3 to an array to retrieve the corresponding value, 'd'
Figure 2.2: Vertices Form Triangle: An illustration of three vertices coming together to form a triangle.
Figure 2.3: Vertex Data, Before and After Indexing: A demonstration of how much space can be saved with indexing.
Figure 2.4: Graphical Object: An example of a graphical object, specifically a square, formed by two triangles. Reprinted with permission.
Figure 2.5: Vertex Buffer: A sample vertex buffer shown with the corresponding vertices it is describing. Reprinted with permission.
Figure 2.6: Index Buffer: A sample index buffer generated using a vertex buffer. Reprinted with permission.
Figure 2.7: How performance is expected to be optimized
Figure 2.8: Compressed Objects: Three compressed objects in the space of one uncompressed object.
Figure 2.9: Delta Compression on Floats: This demonstrates why float values cannot be compressed using delta compression.
Figure 2.10: The process of hook code being injected into a program being performed.
Figure 2.11: Graphical Errors: Severe graphical errors caused by incorrectly drawn vertices.
Figure 2.12: Offline Compression
Figure 2.13: Online Compression
Figure 3.1: Floating Point Format: The number 0.15625 is represented in the 32-bit floating point format.
Figure 3.2: Example of Vertex Buffer being used and reloaded 3 times.
Figure 3.3: The Graphics Pipeline: Illustration of where vertex data fits into the graphics pipeline.
Figure 3.4: Indices A and B: Index A is shown pointing to Vertex A, and Index B is shown pointing to Vertex B.
Figure 3.5: Index and Vertex Interaction: Diagram detailing the interaction between index and vertex buffers.
Figure 3.6: Delta Encoding: Demonstration of the compression and decompression process associated with Delta encoding.
Figure 3.7: Run Length Encoding: Quick transformation of a sequence into a compressed form using run-length encoding.
Figure 3.8: Making Change: A greedy algorithm, this algorithm tries to use the fewest number of coins possible when making change.
Figure 3.9: Huffman Coding: An example of the kind of tree used in Huffman encoding, accompanied by sample data being compressed.
Figure 3.10: XOR Operator: A quick equation to demonstrate the XOR Operator
Figure 3.11: XOR Operator: A quick equation to demonstrate the XOR Operator
Figure 3.12: Leading Zero Compression: The zeroes at the beginning of a binary number are replaced with a single binary number counting the zeroes.
Figure 3.13: FCM generation and prediction
Figure 3.14: DFCM generation and prediction
Figure 4.1: Header Data: An example of how header would be applied for dynamically applying the compression algorithms.
Figure 4.2: Graphics Pipeline with Compression: Two possible configurations of the graphics pipeline after our compression and decompression algorithms have been added.
Figure 4.3: Checksum Functions: A checksum function will return a vastly different value even with similar input data.
Figure 4.4: Checksum Usefulness: Demonstration of how a checksum alerts the program that data has been changed.
Figure 4.5: Score Equations for testing environment.
Figure 4.6: Example of different data not working at same efficiency on same algorithm.
Figure 4.7: Run Length + Delta: Example of running Run Length encoding on top of Delta encoding.
Figure 4.8: Example Showing Benefit of Dynamic Anchor Points with Escape Codes
Figure 4.9: Example Showing Benefit of Dynamic Anchor Points with No Escape Codes
Figure 5.1: Version Control: A file being changed and merged in a generic form of version control.
Figure 5.2: Version control pros / cons: The different pros and cons of each kind of version control.
Figure 5.3: Index Buffer Delta Compression: Example of compressing index buffer data using Delta Encoding.
Figure 5.4: Delta RLE file size change
Figure 5.5: Delta RLE Compression and Decompression Time
Figure 5.6: Delta RLE Normalized Compression Speeds
Figure 5.7: Delta RLE Compression rates of different test files
Figure 5.8: Delta RLE Test Run Histogram
Figure 5.9: Golomb-Rice file size change
Figure 5.10: Golomb-Rice Compression and Decompression Time
Figure 5.11: Golomb-Rice Normalized Compression Speeds
Figure 5.12: Golomb-Rice Compression rates of different test files
Figure 5.13: Golomb-Rice Test Run Histogram
Figure 5.14: Comparison between Delta-RLE and Golomb-Rice Compression Rates
Figure 5.15: LZO File size changes
Figure 5.16: LZO Compression and Decompression times
Figure 5.17: LZO normalized compression speeds
Figure 5.18: LZO Compression rates of different test files
Figure 5.19: LZO test run histogram
Figure 5.20: BR size changes
Figure 5.21: BR Compression and Decompression times
Figure 5.22: BR normalized compression rate, measured in MB/S
Figure 5.23: BR Compression rates of different test files
Figure 5.24: BR test run histogram
Figure 5.25: Comparison between Delta-RLE and Golomb-Rice Compression Rates
Figure 5.26: Example Testing Environment Output: Example output produced by our testing environment, including the performance measures.
Figure 5.27: Additional Testing Environment Output: Full performance metrics used for determining algorithm statistics.
Figure 6.1: AMD R9 Graphics Cards: A side-by-side price and performance comparison. More information on this series of graphics cards is provided in the appendices. Reprinted with permission.
Figure 6.2: GitHub Personal Plans: The potential cost of a subscription to a GitHub personal account.
Figure 6.3: GitHub Organization Plans: The potential cost of a subscription to a GitHub organization account.
Figure 6.4: The Spot Pricing: Quote detailing the cost to print a document.
Figure 6.5: Estimated Expenditures Pie Chart
Figure 6.6: Actual Expenditures Pie Chart
Figure 6.7: First Semester Milestones: Milestone Timeline of the First Semester of the Project.
Figure 6.8: Second Semester Milestones: Milestone Timeline of the Second Semester of the Project.
Figure 8.1: Specifications for the R9 series of Graphics Cards [2]. Reprinted with permission.
Figure 8.2: Sample Index Data
Figure 8.3: Sample Vertex Data

Executive Summary

Modern graphics cards are constantly performing a tremendous amount of work to maintain the frame rate and visual fidelity expected of current-generation games and other graphical applications. Graphics cards have become powerhouses of computational ability, with modern cards boasting thousands of cores and an amount of onboard random access memory (RAM) comparable to that of the host system itself. It would not be unreasonable to posit that modern computers are really two computational systems in one, with the main processor and graphics processor rapidly communicating with each other to provide the visual experience that users have come to expect.

Some obstacles, however, can negatively impact communication with the GPU. Since the design of modern computers ultimately prefers modularity and a degree of user freedom over brute efficiency, the graphics card has been relegated to an optional peripheral that sits on an external bus relatively far away from other critical system resources. This configuration complicates the process of transferring data between the computer and the graphics card, necessitating a transfer bus that is extremely fast and efficient, with enormous throughput. The bus used today for this purpose is the Peripheral Component Interconnect Express (PCI-E), and it provides the data throughput that a graphics card needs to function. The version of this bus that current graphics cards run on, PCI-E v3.0, is capable of transferring almost 16 GB of data every second, with version 4.0 supporting twice that amount.

PCI Express Version    Bandwidth (16-lane)    Bit Rate (16-lane)
1.0                    4 GB/s                 40 GT/s
2.0                    8 GB/s                 80 GT/s
3.0                    ~16 GB/s               128 GT/s
4.0                    ~32 GB/s               256 GT/s

Figure 1.1: PCI-E Speeds: A table detailing the speeds of the various versions of the PCI-E bus.

Among the things transferred over this bus are the data that the graphics card requires for all objects that are to be drawn on the screen, known as the index and vertex data. Even with the extreme speed of the bus these graphics cards use, a bottleneck exists: the speed available for transferring this amount of data is not sufficient.
The impetus of this project was the desire to determine whether any kind of advantage could be gained from compressing the contents of the index and vertex data on the CPU side before sending it through the buffers to the GPU, where it would then be decompressed using GPU resources. The compression algorithms must achieve a high compression ratio and be designed so that their output can be decompressed quickly. The decompression algorithms that accompany them are required to rapidly decode the two buffers so that they may be passed on to the rest of the graphics pipeline with minimal delay. It was also hoped that these algorithms could be implemented on current graphics cards to increase the amount of data that these cards are able to receive in a given period. Although the aim of this project was not to physically increase the speed of the bus that the GPU runs on, it was hoped that the effective increase in transfer speed of the compressed data relative to the uncompressed data would outweigh the performance cost of constantly decompressing resources.

The overall goal of this project was to implement lossless, efficient algorithms designed to compress the data in the index and vertex buffers of the graphics pipeline. Our first objective was to conduct research to establish and solidify the background knowledge the group needed to complete the project. The group started by researching the graphics pipeline to gain a better understanding of the data the group would be working with. Next, the group moved to researching existing lossless compression algorithms, to identify a first round of algorithms that work well.

Once the group had finished the research for the project, the group moved on to coding the testing environment. The group began by setting up a way for the program to receive input; in this case the group used a file reader, because the sample testing data was provided in a text file. Then the group needed to design functions to collect performance metrics, to demonstrate the effectiveness and efficiency of our algorithms. Finally, the group had to implement a checksum to ensure that the data the group decompresses is the same as the data the group originally compressed.

Once the testing environment was completed, the group began work on testing and writing compression algorithms. The group began with algorithms to compress the index data, because it is consistent in what it describes and has uniform formatting, making it easier to manage. The group then moved on to the algorithm which compresses the vertex data; because its format varies and it describes a set of attributes rather than just one, the group decided to attempt it later in the course of the project.

Over the course of the project, the group developed and tested many algorithms for compressing both the index and vertex buffers. The algorithms tested to compress the index buffers were a pass of delta encoding followed by run-length encoding, as well as Golomb coding. Huffman coding was researched, but the group decided not to implement it. Early progress in the project focused on index compression; as a result, less research had been done on compression for the vertex data than for the index data. Many methods of compressing the vertex data were researched, such as the Burrows-Wheeler Transform, that were deemed to not have enough potential to implement and test.
Additionally, other methods of optimizing the vertex data for storage were researched, such as methods for converting vertex information like color data into tables that represent it more efficiently.

Project Overview

2.1 Identification of Project

When a graphics card displays a 3D image on a computer screen, a large amount of data is transferred from the system's memory into the graphics card's memory. This information includes data describing every vertex in the object, texture information to display, and index information.

An index for vertex data works the same way as an index for an array in computer programming. Instead of storing numbers as individual named values in a computer, a chunk of memory is reserved whose size is equal to a multiple of the size of the data that is being stored. To access an element that is stored in an array, the index number is used to go that many elements down the list of elements in the array and pull out that element. For example, in Figure 2.1, the index number 3 is being requested from Array. Elements 0-2 are skipped in the list and the element located at 3 is returned to the user. An index represents stored data as a single number, and that number corresponds to an address somewhere in memory. This makes referencing the data easier and takes up less space in the long run.

Figure 2.1: Providing an index number 3 to an array to retrieve the corresponding value, 'd'

A vertex in computer graphics is very similar to the commonplace geometrical term. It is a single point in a graphical environment that, when combined with other points, makes up a shape. Typically three vertices are connected to form a triangle, because triangles can be combined to form any complex geometrical shape, as shown in Figure 2.2. A graphics card will read in three vertices at a time so that it can form other shapes using these kinds of triangles. Once the card has formed the triangle, it chops it up into tiny pieces in order to transfer it through the graphics pipeline. It then reforms the pieces after textures and shaders have been applied and fits them into a larger graphical object.

Figure 2.2: Vertices Form Triangle: An illustration of three vertices coming together to form a triangle.

Figure 2.3: Vertex Data, Before and After Indexing: A demonstration of how much space can be saved with indexing.

The data describing the objects being drawn is only getting bigger and more complex. As computer graphics continue to attempt to mirror reality more closely, an increasing amount of data has to be sent through the graphics pipeline for processing. Objects have to be created using an exponentially growing number of polygons in order to increase their fidelity. Textures for the objects have to be larger, so that when they are wrapped onto an object and inspected at a high resolution they do not show any tearing or unrealistic patterns. Because of how visually complex the world around us can be, graphics developers are constantly attempting to go to new and astounding lengths in order to display even the tiniest details correctly. The faster the GPU can get through information, the faster it can display it to the screen and the better it will run. Therefore it was decided that a compression algorithm was required for the two portions of the graphics pipeline which help to describe graphical objects; a short illustration of the indexing idea described above follows.
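The snippet below is a minimal sketch of this idea, not code from the project: an index is simply a position in an array, and an index buffer applies the same principle to vertex data by storing small integers that refer back to full vertex records instead of repeating them.

```c
#include <stdio.h>

int main(void)
{
    /* Requesting index 3, as in Figure 2.1, skips elements 0-2 and
     * returns the value 'd'. */
    char array[] = { 'a', 'b', 'c', 'd', 'e' };
    printf("array[3] = %c\n", array[3]);

    /* The same principle applied to vertex data: three indices describe
     * one triangle by referring to entries in a separate vertex list,
     * so shared vertices never have to be stored twice. */
    float vertices[][2] = { {0.0f, 0.0f}, {1.0f, 0.0f}, {0.0f, 1.0f} };
    int   triangle[]    = { 0, 1, 2 };

    for (int i = 0; i < 3; i++)
        printf("vertex %d: (%.1f, %.1f)\n", triangle[i],
               vertices[triangle[i]][0], vertices[triangle[i]][1]);

    return 0;
}
```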
Graphical objects are typically formed using triangles of varying forms and sizes. An example of an object that is comprised of triangles in this way is the square shown in Figure 2.4. These triangles are each made up of three different vertices.

Figure 2.4: Graphical Object: An example of a graphical object, specifically a square, formed by two triangles. Reprinted with permission.

The first item which requires compression is the vertex buffer. The vertex buffer contains many different types of information which all work together to describe a single vertex of a graphical object. Figure 2.5 demonstrates the vertex buffer storing the position data of each vertex in the form of a set of Cartesian (x, y) coordinates.

Figure 2.5: Vertex Buffer: A sample vertex buffer shown with the corresponding vertices it is describing. Reprinted with permission.

The second item which requires compression is the index buffer. The index buffer is itself a form of compression which maps several values in the vertex buffer to an index. Rather than fill the vertex buffer with repeated information about the same vertex, the graphics pipeline simply reads which vertex it has to render next from the index buffer and looks up the corresponding information stored at an address within the vertex buffer. Figure 2.6 shows an example of both the index and vertex buffers, side by side.

Figure 2.6: Index Buffer: A sample index buffer generated using a vertex buffer. Reprinted with permission.

2.2 Motivation for Project

This project's main motivation is the ever-growing need for efficiency and speed in the world of graphics and simulation. With 3D graphics becoming more and more advanced, the objects being drawn to the screen are only getting more complex. This makes the data that describes these objects larger, and as a result more data must go through the GPU at one time to draw an object to the scene. With PCI-E transfer speeds not increasing at the same rate as graphics, there is a need to find a way to transfer more data at one time over the same data lanes that are currently being used. This is the main drive for our project: to compress the data that is transferred into the buffers and, as a result, allow more data to be transferred into the buffers at one time. This will allow more complex objects to be created and more objects to be loaded into the buffer at one time, and as a result will increase the performance of the graphics card.

This project offered the group a huge opportunity to influence a very interesting and active field. We all play video games, and AMD is a huge name in the video game world, providing many processors and video cards for computers and even most video cards for consoles. In addition to the ability to work with AMD, this project also gives us a great opportunity to learn about the graphics pipeline and how index and vertex data for 3D objects are formatted and used to draw what we see on the screen. Another motivation for the group is our interest in the compression of data and how it works. Many people use programs like 7-Zip and WinZip to compress files, but this project lets us gain a basic understanding of how compression works and how it lowers the file size while still keeping the data that was originally there.

2.2.1 Alex

The evolution of graphics cards and graphics drivers has interested me for a long time. As far as general-use software goes, no applications are more complicated than those that utilize both the graphics card and the processor of a computer.
A platform's support for graphics cards is a major factor in it becoming widely accepted on the desktop, which is something I am greatly interested in changing. I believe that the development of better cross-platform tools for graphics cards, such as OpenCL and AMD's own Mantle, will lead to a wider rate of adoption of the Linux platform for everyday computing. As a programmer and computing enthusiast in general, I think that having a viable open-source alternative to Microsoft Windows and Apple's Mac for the desktop is a very important source of competition. I would like to begin learning about how graphics cards function so that I can, among other things, contribute to this vision.

In addition to wanting to expand the horizons of Linux on the desktop, I've also always wanted to understand the inner workings of a graphics card. In school, I've learned the rudimentary ideas associated with how the CPU functions, but I've always wanted the chance to learn how the GPU functions as well. To an outsider, the way that a computer can even draw 3D objects on a screen with such ease seems like magic, and finding an opportunity to work with people from AMD who can share their insight on how these systems work is invaluable to me.

Finally, as someone who often plays video games on PC, the opportunity to contribute to the video games industry is a novel opportunity for me. The concepts of video game graphics can also be used in many different fields. A field that I take some interest in is the emerging virtual reality craze. Virtual reality requires very powerful GPUs, and VR can be used for many things in addition to just playing video games for leisure; it can also be used as a tool for therapy or for training, such as for those undergoing physical therapy, those with disorders such as agoraphobia, and those in the military practicing dangerous or complex tasks.

2.2.2 Brian

I decided to take Senior Design in order to prepare myself for the professional world of Computer Science. I wanted some intellectual background in the field I would be starting my career in. I also wanted an experience I could point to when future employers asked what prepared me to work at their company. So, when I was offered a chance to work with a high-profile graphics hardware company, I happily accepted. The things I could learn while working with AMD go far beyond just learning about compression algorithms and GPUs. I could learn industry standards, the software development life cycle, and what it's like to work in an office with professionals in my field. In many ways I would be getting a full tour through the future of my career.

That is not to say, however, that my interest in computer graphics is nonexistent. I have been curious for a long time about what is involved in the way a graphics card functions. Rendering three-dimensional objects takes a lot of processing power in just a static environment, but rendering them in real time must be expensive, in the sense of both memory and finances, considering the cost of some graphics cards. Before college, I would simply shrug it off as part of the costs of owning a cutting-edge personal computer. Now, however, as I near the end of my degree I find myself questioning how hardware, and really anything related to computers, works beneath the price tags and specifications. So, I have made it my mission to broaden my horizons before graduation, and researching the graphics pipeline will serve as one more milestone.

2.2.3 Sam

I have always been very interested in computer graphics,
with 3D simulations and video games being the main reasons for my interest. More realistic 3D simulations and representations of data have always fascinated me, and playing video games is my main pastime; computer graphics are central to both. The main motivation for me to do this project was to gain more knowledge of how computer graphics are generated and how 3D object data is used to create the things we see every day.

When I first entered college I was an electrical engineering major intending to go into the field of graphics processing hardware research and development and eventually work at a company involved in the field, ideally either AMD or NVIDIA, the two big names in graphics card R&D. Early on in my classes I realized I enjoyed programming more than circuit design and switched to Computer Science. However, I still wanted to get involved in the fields of graphics, video games, or simulation in some way. This project greatly piqued my interest as it would involve working directly with graphics cards, the graphics pipeline, and how they operate at a software level, all of which, as mentioned before, are very interesting to me. I have also taken some classes and done some projects involving 3D graphics and programming which I am eager to apply towards something outside of just a hobby project or an assignment required by a class. This project lets me apply my existing knowledge and gain a much deeper understanding of the index and vertex data that is used to build 3D objects.

2.3 Goals and Objectives

When data is being sent through the graphics pipeline, the PCI-E bus acts as a major bottleneck between the CPU and GPU. With the immense amount of data that is being sent through this bus every second, it is of great importance that the data sent through is optimized in any way possible. The aim of this project, in part, is to alleviate the problems associated with this bus without directly designing a more efficient version of PCI-E. Although continuing to improve the hardware that computers run on is always of great importance, optimizations must be made to make systems faster during the interim. Concentrating only on developing new versions of PCI-E with a higher throughput ignores the performance that can be gained by carefully considering what is being sent through that bus.

The compression of data before transfer is a shining example of this method of optimization. Efficient compression algorithms will always be able to work with the newest and fastest versions of PCI-E to deliver an overall faster system than what can be accomplished with hardware optimizations alone. The algorithms that are written today will be just as useful in the future as they are now, if only to pave the way for further improvement and even higher optimization.

The main goal of this project is to implement efficient lossless compression and decompression algorithms in the graphics pipeline. The algorithms will compress the data that goes into both the vertex and index buffers. This reduces the size of the information being transferred into the buffers and thus allows more information to be transferred at one time. When the data is fetched from the buffers it is then quickly decompressed and used normally.
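As a rough illustration of the intended effect (the group's own formulation appears as Figure 2.7 below), the sketch here simply assumes that the effective transfer rate is the raw bus rate multiplied by the achieved compression ratio; the 1.25:1 figure is the minimum target given later in Section 2.4.1, not a measured result.

```latex
R_{\text{effective}} = R_{\text{bus}} \times C
\qquad\text{e.g.}\qquad
16~\text{GB/s} \times 1.25 \approx 20~\text{GB/s of uncompressed-equivalent data}
```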
The implementation of these algorithms will increase the speed and efficiency at which a graphics card can operate by allowing the card to not have to wait as long for new information to transfer into the buffer from the computer's main memory. This transfer rate of compressed data can be quantified using the formula located in Figure 2.7. Using this formula, the impact of the developed algorithms can be seen in the increase of the compressed transfer rate value.

Figure 2.7: How performance is expected to be optimized

In terms of throughput, if we consider the current amount of data that is able to be sent through the index and vertex buffers at once as one object, and the time it takes to send it as one object transfer unit, then uncompressed objects are sent at a rate of "one object per transfer". Although an increase in the number of physical bytes that are sent through the pipeline in a given period is not possible, that does not mean that it is impossible to increase the "objects per transfer" ratio. The transfer rate is increased not by increasing the size of the transfer buffer, but by decreasing the size of the data being sent through the buffer. If the algorithms generate a compression ratio of C, the overall throughput ratio will change from "one object per transfer" to "C objects per transfer", as is demonstrated in Figure 2.8.

Figure 2.8: Compressed Objects: Three compressed objects in the space of one uncompressed object.

The objectives for the project are first to research the basics of the graphics pipeline and how it is used to draw objects to the screen, and then to research existing lossless compression algorithms that can be used as a base for, or an improvement to, other algorithms. These objectives were ongoing from the beginning of the project to the end, when the final algorithms were implemented.

2.3.2 Testing Environment Objectives

The first coding objective was developing a testing environment to be used for quick prototyping of our algorithms and to allow the generation of useful test data. It was important to get this set up first to allow quick implementation of test algorithms and to see whether each new test algorithm was an improvement over the previous iteration. Within this objective there are many sub-objectives that can be separated into parallel tasks among group members. These include the development of the different modules of the testing environment, which was done in parallel. These modules can be summed up as the reader of data, the writer of data, the algorithms themselves, and the tests to be run. Other sub-objectives include the development of the aforementioned tests to be run on the algorithms to gather consistent, valuable data when testing them.

2.3.3 Algorithm Development Objectives

Next comes the development of the lossless compression algorithms that work on vertex and index information. Because the two types of data are very different in format and size, it was necessary to develop two separate compression algorithms, one for each type of information. Once they were developed, the main objective was to improve upon these base algorithms to make the final product as efficient as possible. This can be thought of as two separate objectives: one for the development of the index algorithm, the other for the development of the vertex algorithm.
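To make the contrast between the two kinds of data concrete, the sketch below shows one possible layout of each. The particular vertex fields are an assumption chosen for illustration; real vertex formats vary from object to object, which is exactly what makes vertex compression the harder of the two problems.

```c
#include <stdint.h>

/* Index data: every entry is a single unsigned integer of the same size. */
typedef uint32_t index_t;

/* One of many possible vertex layouts: a mix of float and integer fields
 * describing position, lighting, and texturing attributes. */
struct vertex {
    float   position[3];   /* x, y, z coordinates              */
    float   normal[3];     /* surface normal vector            */
    uint8_t color[4];      /* RGBA color, one byte per channel */
    float   texcoord[2];   /* texture coordinates (u, v)       */
};
```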
2.3.3.1 Index Compression Objectives

Within the objective of completing the index compression and decompression algorithm there are several sub-objectives that, once achieved, result in a fully developed and implemented algorithm. These include the aforementioned research into compression algorithms and the design of a prototype algorithm that creates a baseline to start from. Next is the implementation of optimizations that compress the integer-based index data even further. With this type of data the optimization of the algorithm has huge potential, and the group's objective is to achieve a much higher compression with this data when compared to vertex data.

2.3.3.2 Vertex Compression Objectives

The development of the vertex compression and decompression algorithm has many sub-objectives as well. This type of data comes in many more formats that have to be accounted for, so one sub-objective is a reader and parser that can process vertex data and convert it to a consistent, usable form. The next sub-objective is the development of a prototype algorithm that can handle all possible types of data that can be seen in vertex data. This includes handling float information, which is much more complex than integer compression.

2.4 Specifications

2.4.1 Index Compression Specifications

The algorithms developed must compress the vertex and index information a notable amount and do so without costing a large amount of resources to decompress. A compression ratio of at least 1.25:1 is acceptable, as the information that can be in the vertex buffer can vary greatly. For the index information a much higher compression ratio is achievable because it has a fixed size and contains only integer values. Compression can be achieved using either the CPU's or the GPU's resources. If done by the CPU, the compression will be done in advance, most likely at the time the data is actually created and written to storage. Decompression has to be done directly on the GPU when data is fetched from the buffer, either by a software implementation in the shader programs or through specific hardware on the graphics chip. Due to the potential requirement of designing specific hardware to run the decompression, running the decompression code on a physical graphics card was out of scope for the project.

2.4.2 Compression Specifications

The compression algorithm system that was developed over the course of the project needed to have the ability to compress the data as efficiently as possible. Two different approaches are taken depending on whether data is compressed online or offline.

Offline data compression is performed in the following manner. First, a program scans through the data with the intent of determining the most efficient method of compressing it. After the compression sequence has been performed on the relevant assets, they are stored to disk for later use. In this situation, since the data is compressed in advance, the graphics pipeline does not process the data yet; it will do so when the graphical application that uses the assets is loaded on the computer system.

Advantages of this system come from the extra time that is available for the compression process. Because the program is allowed to determine in advance which compression algorithm is being used, it can avoid situations where an ineffective algorithm is used on data. A sketch of this selection idea follows.
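The fragment below is a minimal sketch of such an offline selection pass, under the assumption that each block is prefixed with a one-byte header identifying the winning algorithm (in the spirit of the header scheme shown later in Figure 4.1). The function and type names are hypothetical, not the project's actual code.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef size_t (*compress_fn)(const void *in, size_t n, uint8_t *out);

struct candidate {
    uint8_t     id;   /* identifier written into the block header */
    compress_fn fn;   /* the compression routine itself           */
};

/* Try every candidate, keep the smallest output, and record which
 * algorithm produced it so the decompressor can pick the matching
 * decoder later. Candidates are assumed to write at most n bytes.
 * Returns the total size written to out (1-byte header + payload). */
size_t compress_offline(const struct candidate *cands, size_t count,
                        const void *in, size_t n, uint8_t *out)
{
    size_t   best_size = n;
    uint8_t  best_id   = 0;             /* 0 = stored uncompressed  */
    uint8_t *scratch   = malloc(n);

    memcpy(out + 1, in, n);             /* fallback: raw copy       */

    for (size_t i = 0; scratch != NULL && i < count; i++) {
        size_t sz = cands[i].fn(in, n, scratch);
        if (sz < best_size) {           /* keep the smallest so far */
            best_size = sz;
            best_id   = cands[i].id;
            memcpy(out + 1, scratch, sz);
        }
    }

    out[0] = best_id;                   /* header records the winner */
    free(scratch);
    return best_size + 1;
}
```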
Disadvantages of this system are related to the time that compression takes relative to how fast the computer system is. If the workstation that the graphical application is being run on is not as powerful as required, it may be inconvenient to compile the compressed assets for use in the pipeline. The time penalties from doing offline compression can become more apparent when compressing all of the assets used by an application. Methods may exist to circumvent this, such as only creating compressed assets when needed. However, developers will not want to use the system if doing so causes them to be idle for significantly longer than before.

If the data is compressed online, then a different compression system and method of pipeline integration will be used. Instead of using the most powerful and efficient compression algorithms available overall, the online method must prefer speed and runtime efficiency. This is because the online method does all compression as the assets are being loaded from the disk, which means that it must be done in real time. When graphics operations are done in real time, there is a possibility of stalling the graphics pipeline if they take too long.

The process of online compression starts even before it is known that assets from the disk will be compressed. A graphical application that supports these optimizations will presumably be running code that can modify the way that a game receives the assets that it uses; a hook, as described in Section 2.6.1, is one possible way to achieve this. Once the application starts, it will begin to request assets to be loaded from the disk. When this happens, code will be inserted between the time of the loading and the time that the assets are sent to the GPU. This code will run a quick version of a compression algorithm that is expected to yield a compression ratio less than that of an offline compression algorithm, before sending the data to the GPU. Overall, it is hoped that this approach will be faster than transferring the uncompressed data despite the time that it takes for the data to be decompressed.

2.4.3 Decompression Specifications

The decompression algorithms that were developed over the course of this project all have to be very fast and efficient to work effectively. This can be achieved by employing a range of different optimization techniques on many different algorithms. The decompression must be done when the data is fetched from the respective buffer, and will have to run on the graphics card; thus it needs to be very fast so that it does not hold up the pipeline. It is important to note that because the various types of vertex data are much more dynamic than index data, vertex data is unlikely to be compressed as much as index data, whose entries are always single integer values.
Vertex information however can contain both integer and float (decimal) numbers and each object’s vertices can contain different information to describe them. This makes compression more difficult and complicates the process of creating compression algorithms that work with index information to also work with all vertex information. Delta encoding specifically will not work well on vertex buffer data for two reasons. The first is that vertex buffers are primarily comprised of float data. This means that running delta compression on the buffers will not reduce the number of bits in each value. For example suppose you have two numbers in your index buffer: 99 and 100; the difference between these numbers is 1. Knowing this, the group can keep one of these values as an anchor point, and replace the other value with this difference. If the group needs to recover the second value, one simply adds the 17 difference to the anchor point. Now our buffer has 99 and 1, and while the 99 hasn’t changed form, the 100 has now become essentially the same bit-length as a char. With float data, the difference will still need all of its bits in order to properly represent the decimal number that results from the subtraction. Therefore delta encoding has no effect. Figure 2.9: Delta Compression on Floats: This demonstrates why float values cannot be compressed using delta compression. The second reason is that vertex buffers contain several different types of data. Subtracting color data from position data results in very odd values. Grouping the vertex data by type would solve the problems for position data, since most of the triangles are positioned close together to form a graphical object. This would not have the same effect, however, on color data, since the color of one graphical object can differ vastly from the rest. 18 2.5 Space Efficiency Space efficiency was also a concern when identifying algorithms. Space efficiency is the concept of using as little space in memory as possible to perform the actions required for an operation or function. If an algorithm requires a large amount of memory just to run the decompression would potentially counter any benefits of running the algorithm. 2.6 Requirements 2.6.1 Overall requirements The main goal of this project was to create a sort of “hook” into the graphics card pipeline so that graphical data can be compressed in advance on the CPU or at compile time, before being sent to the GPU for use. In computer terminology, a “hook” is code that is code that is used to allow further functionality in a module to run from external sources before the main program continues to run. As demonstrated in Figure 2.10, they work by intercepting the original call of some function and then inserting their own code into the pipeline before the original code can continue working. Although some true hooks can be malicious in nature, the term is only being used to describe the process where the group can insert code compression / decompression into the graphics pipeline to try to get performance benefits even though it was not originally intended to do so. 19 Figure 2.10: The process of hook code being injected into a program being performed. Graphical data must not be altered in the compression / decompression process. 
Graphical data must not be altered in the compression / decompression process. When compression algorithms that alter the contents of the data they are compressing are used, the visual quality of the object being rendered may be greatly reduced or altered; this usually manifests in graphics as objects that appear to be "glitched", as can be seen in Figure 2.11. For this reason, all of the algorithms that the group developed had to be designed so that values are not altered even slightly during compression / decompression. Such algorithms are known as lossless algorithms.

Figure 2.11: Graphical Errors: Severe graphical errors caused by incorrectly drawn vertices.

In addition to being lossless, both the compression and decompression algorithms must be compatible with the constraints of the existing graphics pipeline. For example, it was outside of the scope of the group's responsibility to design a hardware module for the decompression algorithms to run on. Instead, the group was testing to see if a software implementation would yield performance benefits.

Although a fast algorithm that is useful for both compression and decompression is ideal, it is not necessarily a strict requirement. It is also within the scope of this project to find algorithms that are quick only during decompression and that are suitable for parts of the graphics pipeline that are not on-the-fly. For instance, such "offline" compression algorithms can be used to create pre-compressed objects at compile time that at run time are only meant to be decompressed by the GPU.

Figure 2.12: Offline Compression

Figure 2.13: Online Compression

2.6.2 Compression Requirements

The group decided to focus on writing an algorithm that can achieve a high level of compression and whose data can be decompressed quickly, while worrying less about the time that it takes to compress the data. Although an ideal compression algorithm would be highly efficient in terms of compression ratio, compression time, and decompression time, a real solution can only be so good in one area before acting to the detriment of one or more of the others. For example, if an algorithm is able to provide an extreme level of compression but does so in such a way that decompression is very difficult, the algorithm would not be desirable.

Having a compression algorithm that can execute quickly is not altogether useless. If the compression algorithm that is used happens to be fast, it can be put to use by having the CPU compress the assets before they are sent through the graphics pipeline. A situation like this might occur in a game that was not built with these optimizations in mind. If the assets in the project were not compressed when they were built, they would still be able to gain benefit from the compression / decompression system with on-the-fly compression. This was considered to be a tertiary goal for the project.

The compression system had to be written in such a way that it was simple for developers to use in their own projects. Within the scope of the project this means that the group had to design their code in a modular fashion. Doing so would make it easy for AMD to implement in their own systems where they see fit. The last stipulation for the compression system was that the overall compression / decompression system had to be written in a way such that developers could choose not to use it if they did not want to. In situations where the algorithms were causing problems, the developer might want to turn off the compression / decompression system until the problems are resolved; one possible shape for such an optional, modular interface is sketched below.
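This fragment is only an illustration of the modular, switchable design described above, written with assumed names; it is not AMD's interface or the group's final code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* A small descriptor the surrounding code consults before touching a buffer. */
struct buffer_codec {
    bool enabled;   /* developers can switch the whole system off */
    size_t (*compress)(const void *in, size_t n, void *out);
    size_t (*decompress)(const void *in, size_t n, void *out);
};

/* If the codec is absent or disabled, the data passes through untouched,
 * preserving the behaviour of programs written without compression in mind. */
static size_t maybe_compress(const struct buffer_codec *codec,
                             const void *in, size_t n, void *out)
{
    if (codec == NULL || !codec->enabled) {
        memcpy(out, in, n);     /* pass-through path */
        return n;
    }
    return codec->compress(in, n, out);
}
```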
The graphics cards AMD manufactures in the future may be fitted with a module or section of the card dedicated to compressing and decompressing the index and vertex buffers with our algorithm. Programs written now, however, will not be written with this new compression module in mind. In order to preserve backwards compatibility, it is essential that the project includes the option to turn off the compression module. Backwards compatibility is designing new hardware with the ability to run code written for an older generation of hardware.

The compression algorithms must be made in such a way that they support some amount of random access capability. The contents of a buffer being sent to a GPU contain many objects, and the GPU may not want to access these objects in the order that they are presented. If the compression algorithm is written in such a way that the block it creates must all be decompressed at the same time or in sequence, then significant overhead will be incurred when trying to access a chunk that is in the middle.

2.6.3 Decompression Requirements

The decompression algorithms also had to adhere to their own set of guidelines and requirements. The first and most important stipulation was that the algorithms must be very fast. Unlike the compression algorithms, the decompression algorithms will always be run on the GPU online and in real time. If the introduction of the group's optimization system causes the GPU to run slower than it had previously, it may cause a hiccup in the graphics pipeline which can lead to a lower frame rate, among other undesirable effects.

The decompression algorithms must also be space efficient. As with all high-performance software, the size of the memory footprint is of critical importance. Any optimization that the group can make to cause the decompression sequence to take up less memory means that the memory can be used elsewhere in the GPU. Aside from memory requirements, the decompression code must be made to run on a graphics card. This is in contrast to most of the code that programmers write, which is made to run on a CPU. For testing purposes, the code for this project was developed in C.

Finally, the decompression algorithms that the group wrote had to take advantage of the block structure that the compression algorithms provide. The final compression algorithms were made so that the data could be decompressed in chunks that are not dependent on the surrounding blocks. This allows the decompressor to potentially save some computation time by only decompressing the segments that it needs during a given operation, instead of decompressing the entire buffer at once.

Research

3.1 Data types

When the group began work on this project, they needed to make sure that their foundation of how computers store numbers was completely sound. The concept of how data types function in computers was especially important. A data type is a specific number of bits that are stored consecutively, along with an accompanying algorithm that is used to parse the bits. The containers that are used to store numbers in computers are not all the same size, nor are they all parsed the same way, so different algorithms must be created to parse different data types. Programs typically keep track of the type of the data being worked with by using a data structure such as a symbol table. In computers, no matter what the type of the data, all containers can be reduced to binary. This property can be exploited to implement the type of compression techniques that are presented in this paper.
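To make the idea of "the same bits, different parsing algorithm" concrete, the following minimal C sketch reinterprets one 32-bit pattern first as an unsigned integer and then as an IEEE 754 float. The specific bit pattern, which happens to decode to 0.15625 (the value used in Figure 3.1), is chosen purely for illustration.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    /* The same 32 bits, parsed with two different "algorithms":
       once as an unsigned integer, once as an IEEE 754 float. */
    uint32_t bits = 0x3E200000u;   /* example pattern */
    float as_float;

    memcpy(&as_float, &bits, sizeof bits);   /* reinterpret the bits, no conversion */

    printf("as unsigned int: %u\n", (unsigned)bits);  /* 1042284544 */
    printf("as float:        %f\n", as_float);        /* 0.156250   */
    return 0;
}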
The types of data that the group members are trying to compress in this project are all integer and float data at the core. Because computers store integer and floating point data differently, the process of creating efficient compression algorithms that handle both at the same time is further complicated.

Integer data is the fundamental unit of storage. In 32-bit C, a single integer is also 32 bits, which means that it can store values from 0 to 4294967295, or 2^32 - 1. Since binary is a positional number system, storing integer data causes the bits in an integer to fill the lower order bytes before the higher ones. Compression techniques can take advantage of this to reduce the number of bytes needed to store data.

Floating point data is not stored as a traditional number would be stored; converting a float to binary is not a simple base conversion as with an integer. Instead, a standard exists called the IEEE Standard for Floating-Point Arithmetic (IEEE 754). It is essentially a computerized representation of scientific notation. It is a specialized system that is used to store decimal data in a computer-storable floating point number, comprised of the sign (positive or negative) of the number, followed by the exponent of the number, and finally the fraction of the number that is being stored. An example of a number being represented in IEEE 754 floating point can be seen in Figure 3.1.

Figure 3.1: Floating Point Format: The number 0.15625 is represented in the 32-bit floating point format.

3.2 Graphics

Before working on this project, only one member of the group had previous experience with graphics programming. Because of this, the first step of the project was to get everyone up to speed with the basics of 3D graphics, how 3D objects are designed and created, and to achieve an in-depth understanding of the data the group is tasked with compressing. As a result, a large portion of the initial research consisted of learning how graphics programs work and how vertex and index information are used to draw objects to the screen. The rest of the research focused on existing lossless compression algorithms that could potentially be used on the data the group will be compressing.

3.2.1 General Graphics pipeline

The first step towards understanding the graphics pipeline is understanding how objects are generated. Objects consist of many vertices, which contain a position and can contain other values such as color or normal vector information. This information is all stored in a vertex buffer. Another type of information the group needed to research was index information. This information is stored in an index buffer and points towards a specific vertex stored in the vertex buffer. The use of index data is a widely used way to reduce the amount of vertex information needed to build an object, as it allows vertices to be reused without being redefined in the vertex buffer. These are the main areas where the research focused, as these two data types are what the algorithms will be compressing and decompressing. The process of populating the buffer and then using the buffer to supply the graphics pipeline with data is shown in Figure 3.2.
The figure displays three iterations of an example vertex buffer being loaded with data from the system's memory, the data being fetched and sent into the graphics pipeline, and then the buffer being reloaded to repeat the process. In the first run-through, labeled Object 1, the vertex buffer is populated with data read in from the system memory; once the buffers are loaded with information, the graphics pipeline will "fetch" or retrieve the data one chunk at a time. When the pipeline has exhausted the current data inside the buffer it will then clear this data and load in new data.

Figure 3.2: Example of Vertex Buffer being used and reloaded 3 times.

Inside the graphics pipeline, once data is read into the index and vertex buffers, the GPU reads the indices and vertices one at a time into the assembler. A diagram showing the operation path inside the graphics pipeline that takes vertex and index information and turns it into the final image is displayed in Figure 3.3. In the figure the assembler is comprised of the vertex shader and the triangle assembly. The triangle assembly builds the shapes described by the vertices in the form of many triangles next to each other (hence the name). These triangles are all built one after another and placed in the correct 3D position in order to build the full 3D object. This object is then transformed and altered along with other objects that have been constructed and placed in the 3D scene being drawn. Data that will not be displayed is then "clipped" out during the viewport clipping stage. This viewport is designated by a virtual camera that indicates what will be seen in the scene drawn to the screen. Once the scene is drawn and clipped it is sent through the rasterizer, which cuts the image seen by the camera into many small fragments. These fragments are sent to a fragment shader where things like textures are applied and the fragment data is processed into what is known as pixel data. Once this pixel data is processed it is sent to the frame buffer, where it will reside until displayed as the final image. It is important to note that only the vertex shader and fragment shader are directly alterable by the programmer; the rest of the graphics pipeline is all done "behind the scenes".

Figure 3.3: The Graphics Pipeline: Illustration of where vertex data fits into the graphics pipeline.

The interaction between index and vertex data to build triangles can be seen in Figure 3.4, which shows a small index and vertex buffer and how the values of the index buffer "point" to a chunk of data in the vertex buffer. By connecting the position values of the vertices, the two triangles shown will be drawn and displayed to the screen, assuming no other transforms or modifications to the scene take place in the rest of the graphics pipeline. In the figure, vertex 1 and vertex 3 are used in both triangles while each is only defined once in the vertex buffer.

3.2.2 Index buffer

The index buffer holds the index data which is used to point to specific vertices in the vertex buffer. Index data consists of non-negative integer values. An example of an index buffer can be seen in Figure 2.1. Each value is a single unit of data that points to a specific vertex, shown by the arrows going between the example buffers.

Figure 3.4: Indices A and B: Index A is shown pointing to Vertex A, and Index B is shown pointing to Vertex B.
Due to the nature of 3D objects, most of the time when drawing a line from vertex a to vertex b, vertex a will be positioned relatively close to vertex b in the vertex buffer; as a result, the index information does not tend to vary much from one value to the next in the buffer. The reason for index data's use is to allow the reuse of vertices without having to redefine and store the whole vertex every time it is used. Instead, the vertex is defined once and its location within the list of vertices is stored as an integer value known as the index.

3.2.3 Vertex buffer

Vertex buffers contain all of the vertex information for graphical objects, and multiple values within the buffer map to a single vertex. This is what an index in the index buffer points to. Vertex information is much more dynamic and varied than index information. It can contain numerous different fields of information that describe each individual vertex.

One attribute that can be described is the position of the vertex in the graphical environment. This position is mapped out using a three-dimensional Cartesian coordinate system (x, y, z). The position is described by three values, corresponding to the x, y, and z positions on the respective axes. These values can be float or integer values depending on the precision needed or how the object was designed and scaled when created.

Another attribute is the color data of the vertex. Color data is described using three or four float values. The color data is mapped out as R, G, B, and sometimes A values, where A stands for the alpha channel. The RGB color scale measures the intensity of the three colors found in a color display, red, green, and blue, with values from 0 to 255. Using a unique mixture of these levels, any color on the color spectrum can be displayed. If all of the vertices of a graphical object share the same color values, the object formed by them will appear to the viewer as that solid color, assuming no texturing is later placed on top. If two vertices do not share the exact same color data, a gradient will form, filling the spectrum between the two colors.

One more example of the attributes stored in the vertex buffer is the normal vector of the vertex. This data describes a vector that is perpendicular to the surface at the vertex. Normal vectors are used in many calculations in graphics, including lighting calculations, allowing each vertex to reflect light in the proper direction.

Figure 3.5 displays an example in which each vertex consists of multiple fields and the index buffer is used to generate two triangles from 4 vertices. The fields that are available are ultimately up to the designer of the 3D object or the program that is being used to create it, and some can be left out in order to save space in the final "mesh" of the object. Because not all of the available information may be needed for the specific program being developed, the fields that are used are dictated by the programmers of the vertex shader program, which is one of the programs used to communicate with the graphics card. In the example shown, a single vertex contains fields for position (x, y, z integer values) and color (R, G, B floats).
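As a concrete illustration, the following C sketch defines a hypothetical vertex layout matching the fields just described (integer position, float color) and an index buffer that builds two triangles from four shared vertices. The positions, colors, and index values are made up for the example and do not correspond to any figure in this report.

#include <stdio.h>
#include <stdint.h>

/* Hypothetical vertex layout: integer position and float color. */
struct Vertex {
    int32_t x, y, z;     /* position on the three axes */
    float   r, g, b;     /* color channels             */
};

int main(void)
{
    /* Four vertices shared by two triangles. */
    struct Vertex vertex_buffer[4] = {
        { 0, 0, 0, 1.0f, 0.0f, 0.0f },
        { 1, 0, 0, 0.0f, 1.0f, 0.0f },
        { 1, 1, 0, 0.0f, 0.0f, 1.0f },
        { 0, 1, 0, 1.0f, 1.0f, 1.0f },
    };

    /* Two triangles built from the four vertices; two of the vertices
       are reused without being redefined in the vertex buffer. */
    uint32_t index_buffer[6] = { 0, 1, 2,  0, 2, 3 };

    (void)vertex_buffer;
    for (int i = 0; i < 6; i += 3) {
        printf("triangle %d uses vertices %u, %u, %u\n", i / 3,
               (unsigned)index_buffer[i],
               (unsigned)index_buffer[i + 1],
               (unsigned)index_buffer[i + 2]);
    }
    return 0;
}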
Another common field that a 3D object can have is a normal vector, which is used in many calculations, including how lighting falls on the object. A normal vector may have been included in the file that contains the information to build the triangles shown; however, it was not needed for this example and thus was not read into the buffers even if it was available in the 3D object's file.

Figure 3.5: Index and Vertex Interaction: Diagram detailing the interaction between index and vertex buffers.

3.3 Index Compression Research

3.3.1 Delta Encoding

Delta encoding is an encoding and decoding method that, when run on a list of integers, generates a list of the deltas, or the differences between each value in the list and the previous value. This list encodes the original integers as potentially smaller numbers that, when saved, take up less space. These deltas are then used to decode the list starting from the first value, which is named the anchor point. One by one the list is decoded, and if done correctly the resulting list is identical to the original. Due to the nature of how delta encoding works, integer data that does not vary much from one unit to the next offers the highest potential compression. A complete example of delta encoding is displayed in Figure 3.6. The process of compressing the data follows the simple formula shown below:

compressed[n] = buff[n] - buff[n - 1]

where buff is a list of integers, or in our case a buffer of index data, and n starts at 1 (0 being the first element). With a buffer of size m, compression (encoding) will take O(m) to complete. For delta decompression to work, however, buff[0] is stored as is and is called the anchor point of the compressed list. In the figure the compressed data is shown as the middle list; all values except for the first (the 5) are changed to the deltas that result from this formula. The reason this works well as a compression method is that if you have 9999 followed by 10000, the compressed list would only contain a 1 instead of 10000. This allows the use of less space to store values that, when decoded, equate to much larger numbers. Decompression (decoding) follows the formula:

buff[n] = buff[n - 1] + compressed[n]

where n again starts at 1 and increases by one with each iteration until it reaches the size of the compressed list, indicating the whole list has been decoded. For the basic implementation of delta compression, accessing a value further down in the list (or, in our case, the buffer) requires the buffer to be decompressed from the beginning, which gives the decompression of that value a runtime efficiency of O(n), where n is the position of the value being retrieved within the buffer.

Figure 3.6: Delta Encoding: Demonstration of the compression and decompression process associated with Delta encoding.
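The two formulas above translate directly into code. The following minimal C sketch encodes and decodes a buffer in place; the function names are illustrative rather than taken from the group's testing environment.

#include <stdint.h>
#include <stddef.h>

/* buff[0] is kept as the anchor point; every later entry is replaced
   by its difference from the previous original value. */
void delta_encode(int32_t *buff, size_t m)
{
    int32_t previous = buff[0];          /* anchor point */
    for (size_t n = 1; n < m; n++) {
        int32_t current = buff[n];
        buff[n] = current - previous;    /* compressed[n] = buff[n] - buff[n-1] */
        previous = current;
    }
}

/* Decoding reverses the process: buff[n] = buff[n-1] + compressed[n]. */
void delta_decode(int32_t *buff, size_t m)
{
    for (size_t n = 1; n < m; n++)
        buff[n] += buff[n - 1];
}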
3.3.2 Run Length Encoding

Run length encoding is a simple compression algorithm that turns consecutive appearances of a single character, a "run", into a pairing of the number of times that the character appears followed by the character being compressed. As can be seen in Figure 3.7, a run of 5 a's in a row takes up 5 individual characters when uncompressed. The compression algorithm turns this into "5a", which takes up a mere 2 characters. The algorithm must also recognize when not to use this technique in situations where doing so would increase the file size. As with the last character being encoded, "z", compressing it into "1z" would double its size, and so it is left alone.

Decompressing a run length encoded file is also simple. Much like compressing the file, the decompression sequence works by reading through the contents of the file, looking for a number followed by a letter. Each number-letter pair is then returned to its original form: a run of the letter repeated the number's value of times. An advantage of using run length encoding is that every run is compressed independently of any other run; the data does not depend on the surrounding data to be compressed. In practice, when a program requires only a segment of a file, it does not have to start at the beginning. And in a situation where the data is being streamed and decompressed in real time, the program can start decompressing as the individual pairs are received.

The efficacy of this type of algorithm can vary heavily based on the type of contents being encoded. If the data is prone to repeatedly storing the same element of some alphabet (having long runs of the same character), the resultant file size will be much smaller than the original. A good example of data that benefits from run-length encoding is simple images with large swaths of uniform coloring. However, if the data being stored is not uniform, such as random binary, many short runs may be generated. These kinds of short runs do not greatly improve the compression ratio.

Figure 3.7: Run Length Encoding: Quick transformation of a sequence into a compressed form using run-length encoding.

3.3.3 Huffman Coding

Huffman coding is a frequency-based compression algorithm. This means that the way the data is encoded depends on the number of times each character appears within the file. It works best when there is a large gap between the character with the lowest frequency and the character with the highest frequency, and it also helps to have a large amount of variance between the two extremes. Huffman coding is a greedy algorithm designed to look for the character with the lowest frequency first. A greedy algorithm is defined as one that always chooses the option with the most benefit at the current decision juncture. An example of a greedy algorithm can be seen in Figure 3.8. The hope is that taking the most immediately efficient choice at each step will result in the most efficient overall path possible. A good example of how greedy algorithms can be effective is the following problem: "How can you make change using the fewest coins possible?" The answer is to always take the current remaining change value and issue the largest denomination of coin, subtracting its value from the total as you go.

Figure 3.8: Making Change: A greedy algorithm, this algorithm tries to use the fewest number of coins possible when making change.

Huffman coding adds the characters to a binary tree, as demonstrated in Figure 3.9, with each left branch representing a 0 and each right branch representing a 1. Each left branch will contain a lower value than the right branch. The algorithm then converts each character in the file into a binary sequence that matches the tree. The logic behind it is that the characters with the lowest frequency will be at the bottom of the tree, with the longest sequences when encoded. The characters with the highest frequency will have short sequences such as "01" or "110". The decompression works by reading in the encoded sequence and tracing the tree until the desired character is reached. It is guaranteed that no code is a prefix of another code, thanks to the way the tree is laid out.
A potential advantage of using this algorithm on index data is that certain indices could appear repeatedly in the buffer. This could be due to certain graphical objects being of more importance than others, so their vertices would appear in the buffer most often. This would allow Huffman coding to compress the data with an efficient compression ratio. A problem with using this algorithm is that not all environments will have at least one object of superior importance to the rest of the environment. If an environment were to have graphical objects with an approximately equal distribution of importance assigned to them, each vertex would occur roughly the same number of times inside the index buffer. The compression would then assign half of the indices small binary sequences and half of the values large binary sequences, resulting in the long sequences cancelling out the short sequences in terms of saving space.

Figure 3.9: Huffman Coding: An example of the kind of tree used in Huffman encoding, accompanied by sample data being compressed.

3.3.4 Golomb-Rice

Golomb-Rice coding is an algorithm similar to Huffman coding. It takes in an integer and translates it into a binary sequence. It is based on integer division, with a divisor that is decided upon before runtime. It works by dividing the integer being compressed by the chosen divisor and writing the quotient and remainder as a single sequence. The quotient from the result of this division is written in unary notation. Unary is essentially a base 1 number system. Each integer in unary is written as a series of one number repeated to match the quantity the integer represents. For example, the integer three is written as 111 followed by a space. The space cannot be accurately expressed in a binary sequence, so it is instead represented by a 0 in our program. The remainder from the result of the division operation is simply written in binary. A unary sequence requires many more digits to represent an integer than a binary sequence. Because of this, choosing a large divisor when using Golomb-Rice compression is encouraged.

Huffman coding and Golomb-Rice encoding were so similar in nature that we decided to only implement one. Many factors were considered by the group when we decided to implement Golomb-Rice over Huffman coding. The first of these factors was space efficiency. Huffman requires a binary tree which stores each number we are encoding as a node in the tree. This tree would have to be transmitted through the buffer along with the encoded sequences in order to be decompressed by the GPU. Overall this would limit the maximum amount of compression we could hope to achieve. Golomb-Rice, on the other hand, needs only to transfer a single integer (the divisor) along with the compressed data.

The second factor was decompression time. In order to decompress a binary sequence generated by Huffman coding, each individual bit would have to be checked and then used to trace a path down the binary tree. With Golomb-Rice encoding, the quotient portion of the sequence is simply a run of 1's. This format is easily analyzed with a simple while loop and does not require additional operations to be performed. On average, half of the binary sequence generated by Golomb consists of the quotient portion. In essence, a Golomb sequence could be decoded in half the time it would take to decode a Huffman sequence of the same length.

The final factor which contributed to the implementation of Golomb coding over Huffman coding was that Huffman is a frequency-based algorithm. This implies that certain indices would have to show up far more frequently than other indices in order for compression to be effective. Since Golomb does not have this restriction hindering its effectiveness, it was considered to be a safer alternative.
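To make the encoding concrete, the following minimal C sketch prints the Golomb-Rice code of a single value as characters, using the Rice restriction that the divisor is a power of two so the remainder has a fixed width. The function name, and the choice to emit text instead of packed bits, are purely illustrative; a real implementation would pack the bits into a buffer.

#include <stdio.h>
#include <stdint.h>

/* Print the Golomb-Rice code of one value: the quotient in unary
   (a run of 1s closed by a 0) followed by the remainder in binary.
   divisor == 1 << remainder_bits, so the remainder has fixed width. */
void golomb_rice_print(uint32_t value, unsigned remainder_bits)
{
    uint32_t divisor   = 1u << remainder_bits;
    uint32_t quotient  = value / divisor;
    uint32_t remainder = value % divisor;

    for (uint32_t i = 0; i < quotient; i++)   /* unary quotient */
        putchar('1');
    putchar('0');                             /* terminator standing in for the "space" */

    for (int b = (int)remainder_bits - 1; b >= 0; b--)   /* binary remainder */
        putchar(((remainder >> b) & 1u) ? '1' : '0');
    putchar('\n');
}

int main(void)
{
    /* divisor 8: 19 = 2 * 8 + 3, so quotient "110" then remainder "011" */
    golomb_rice_print(19, 3);
    return 0;
}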
3.4 Vertex Compression Research

There are numerous research papers that describe attempts to create effective vertex compression algorithms. Some of these algorithms work at the time of vertex data creation, when the actual 3D object is being created, instead of at the time of data transfer. There are also some algorithms proposed for vertex compression that are lossy, used with the assumption that the programs drawing the 3D objects do not need the precision that the 32-bit vertex float data would offer [1].

In addition to research papers proposing compression algorithms that are run directly on the data, there also exist methods of compression that act upon the data type the information is loaded into. With the assumption that the data being saved carries more precision than is needed, space-saving optimizations may be made. These are often left up to the programmer and require some assumptions about the data being saved. For example, there is a structure called VertexPositionNormalTexture included in the XNA video game development library that contains a 3D position, a normal vector, and a texture coordinate. This structure is 32 bytes in size, storing the position as a Vector3 (12 bytes), the normal vector as a Vector3 (12 bytes), and the texture coordinate as a Vector2 (8 bytes). In addition to this struct there are special data types such as NormalizedShort2 which, when used in place of the full vector data type, can save 8 bytes without losing too much precision when storing normal vector data [3]. This is more of an optimization than a compression step, and it is up to the programmers of the shader program to decide when a smaller data type will suffice for their application and the data they will be placing into it, instead of losslessly compressing the existing data, which is the goal of this project.

3.4.1 Statistical Float Masking

Statistical float masking is an algorithm meant to prime data for compression by other algorithms. It is not a compression algorithm that can be applied by itself; it is simply an optimization that can be applied beforehand to increase the compression ratio of another algorithm. The reason the group wanted to implement this type of algorithm was to find a way to prime data in advance for buffer transfer. The algorithm works by creating mask values derived from the most common bit values occurring in each bit-column of a block of data. For each block of data, the algorithm counts whether more zeroes or more ones occur in each bit-column, repeating this process for every column. Recording all of these results creates a mask that, when XORed with the dataset, increases the uniformity of the dataset in a deterministic way. An example of this process is outlined in Figure 3.10.
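A minimal sketch of this masking pass is shown below, assuming the float data has already been reinterpreted as 32-bit words; the function names are hypothetical. Because XOR is its own inverse, applying the same mask a second time restores the original values exactly, which preserves the lossless requirement.

#include <stdint.h>
#include <stddef.h>

/* Build a mask whose bits hold the majority bit of each bit-column
   of the block; XORing the block with this mask tends to produce
   more zeroes and a more uniform dataset for a later compression pass. */
uint32_t build_majority_mask(const uint32_t *block, size_t count)
{
    uint32_t mask = 0;
    for (unsigned bit = 0; bit < 32; bit++) {
        size_t ones = 0;
        for (size_t i = 0; i < count; i++)
            ones += (block[i] >> bit) & 1u;
        if (ones * 2 > count)            /* more ones than zeroes in this column */
            mask |= 1u << bit;
    }
    return mask;
}

/* Reversible: applying the same mask twice restores the original data. */
void apply_mask(uint32_t *block, size_t count, uint32_t mask)
{
    for (size_t i = 0; i < count; i++)
        block[i] ^= mask;
}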
Although the group initially thought that this was a general-purpose optimization that could be applied before any algorithm, they found later on that the application process had problems that made it less desirable than previously thought. The first problem the group found was that this method would only have yielded significant gains for unoptimized algorithms. Since efficient modern algorithms already attempt to optimize as much as possible, this predictive method applied to already well-compressed output did not yield any additional compression benefit. In some cases, attempting to use small enough blocks even increased the original file size because of the additional overhead.

Figure 3.10: XOR Operator: A quick equation to demonstrate the XOR Operator

3.4.2 BR Compression

One particular kind of compression algorithm that looked promising for compressing the float values commonly found in the vertex buffer is an algorithm found in a research paper entitled "FPC: A High-Speed Compressor for Double-Precision Floating-Point Data". It is named by its authors simply FPC, for "floating point compression", although the group has decided to call it "BR compression" after its authors, Martin Burtscher and Paruj Ratanaworabhan, because that is easier to identify. It works by sequentially predicting each value, performing an XOR operation on the actual value and the predicted value, and then finally performing leading zero compression on the result of the XOR operation. The algorithm uses two separate prediction methods, called FCM and DFCM. The prediction functions involve the use of specialized two-level hash tables to make predictions on what float mask will be most effective. It compares each prediction to the original value to see which is more accurate. The logic behind the compression is that the more accurate prediction will produce more zeroes after an XOR operation, which leads to space savings through leading zero compression (LZC).

The XOR operation returns a 0 in the place of each identical bit in its two operands. A quick equation using XOR is shown in Figure 3.11. Therefore, it can be assumed that the closer the actual and the predicted values are, the more leading zeros will be present in the result, and the better the compression ratio will be.

Figure 3.11: XOR Operator: A quick equation to demonstrate the XOR Operator

Leading zero compression is demonstrated in Figure 3.12. First it counts the number of leading zeros in the value. It stores that count within a three-bit integer. Then it removes all of the leading zeros and replaces them with that three-bit count. A primitive example is shown in the figure.

Figure 3.12: Leading Zero Compression: The zeroes at the beginning of a binary number are replaced with a single binary number counting the zeroes.

The disadvantage of this type of algorithm is that it uses the same methods to decompress its values that it does to compress them. This means that the time it takes for the algorithm to compress the vertex data will be the same amount of time that it takes to decompress the data. This nullifies the purpose of offline compression, as the decompression algorithm will stall the pipeline as much as the compression algorithm would.

The FCM and DFCM prediction algorithms use the previously mentioned masking techniques for their compression, but use different methods to generate the XOR values. An FCM uses a two-level prediction table to predict the next value that will appear in a sequence. The first level stores the history of recently viewed values, known as a context, and has an individual history for each location of the program counter of the program it is running in.
The second level stores the value that is most likely to follow the current one, using each context as a hash index. After a value is predicted from the table, the table is updated to reflect the real value that followed that context. DFCM prediction works in a similar fashion; instead of storing each actual value encountered as in a normal FCM, only the differences between values are stored. This version uses the program counter to determine the last value output from that instruction, in conjunction with the entire history at that point. Additionally, instead of the hash table storing the absolute values of all the numbers returned in the history, only the differences between the values are stored, much like delta encoding. A DFCM will return the stride pattern if it determines that the value is indeed part of the stride; otherwise it will return the last outputted value. In the group's use of this technique, the FCM and DFCM are both used as complementary functions, where the prediction whose XOR value has the higher number of leading zeros is used as the result. Figures 3.13 and 3.14 show how value prediction and table updating work for FCM and DFCM.

Figure 3.13: FCM generation and prediction

Figure 3.14: DFCM generation and prediction

Decompressing values generated with FCM and DFCM XORs is simple at this point. All that must be done to reverse the process is to inflate the numbers from their compressed leading-zero form and then XOR the resulting value with the correct predictor hash, as noted by a bit set in every compressed value.

3.4.3 LZO Compression

Lempel-Ziv-Oberhumer (LZO) had the best reported balance between compression rate and decompression speed of the algorithms researched. The LZO algorithms are a family of compression algorithms based on the LZ77 compressor and distributed under the GNU General Public License. These algorithms focus on decompression time. This made LZO ideal for this project, as it still achieves a high level of compression while having a low decompression time. LZ77 is also behind other popular algorithms, such as the DEFLATE algorithm used to compress PNG files. LZO is also used in real-world applications such as video games published by Electronic Arts.

LZ77 compresses a block of data into "matches" using a sliding window. Compression is done using a small memory allocation to store a "window" ranging in size from 4 to 64 kilobytes. This window holds a section of data which is slid across the input to see if it matches the current block. When a match is found it is replaced by a reference to the original block's location. Blocks that do not match the current "window" of data are stored as is, creating runs of non-matching literals in between the matches. LZO runs optimizations on top of this to greatly increase decompression speeds.

3.5 Additional Research

3.5.1 Testing Environment Language: C vs. C++

For this project, a testing environment was developed in order to aid in quick prototyping of the algorithms. Both C and C++ were proposed as the language that the environment would be developed in, because both are used widely in graphics programming and both are very similar to shader languages. In the end the majority of the environment was programmed in C. This was due to the group's prior knowledge of the language and the ability for C code to be used from C++ with little to no modification in case the need for C++ became apparent later in the project.
3.5.2 AMP code

C++ AMP (C++ Accelerated Massive Parallelism) is a programming model for C++ that allows the coder to easily develop a program that runs on parallel processors such as GPUs. Initial research showed that this model had the potential to be a good way to test the group's final algorithms on GPUs without implementing them in hardware or shader code. Moving code previously run on a CPU to a GPU may affect an algorithm's performance, and implementing the algorithms using the AMP libraries would allow them to be simulated on the GPU to see how performance compares with previous tests. The AMP libraries might also help in parallelizing the algorithms. This, however, was not implemented due to time constraints and the importance of other requirements.

Design Details

4.1 Initial Design

The group enumerated two possible implementations for their compression techniques: at the time the 3D object is compiled into vertex or index data ("offline"), and at the time when the data is read from the hard drive during runtime ("online"). Figure 4.2 shows a simplified version of the graphics pipeline, with the offline compression version of the algorithms running at compile time shown on the left and the online method, performed at runtime, displayed on the right. The places in the graphics pipeline where the algorithms would be implemented are displayed as the blue portions in the figure.

4.1.1 Offline Compression

The main differences between offline and online compression are the constraints put on the algorithm in terms of what resources are available and how much time the algorithm has to compress the data. As can be seen on the left side of Figure 4.2, the offline compression implementation is performed after the vertex and index data are created and then saved to the system's main memory. By using the offline implementation, the group gains the freedom to work without worrying about resource or time constraints; if the program takes a large amount of time to compress the data, it is being done without the graphical application running and as a result avoids potentially stalling the graphics pipeline. Additionally, resources that will not be available when the program is running could be usable by the algorithm in the offline method.

Another potential benefit of running the compression offline would be the possibility of making a "smart" algorithm designed to choose which compression method works best on the data being compressed. This smart algorithm would give a score to multiple compression methods based on their performance on the particular set of data. The algorithm with the highest score would then be the one run on the data to ensure the best compression is achieved for that specific dataset. At decompression time there would then need to be a way to tell which compression method was run on that specific dataset, so that the corresponding decompression algorithm can be run to correctly decompress the data. This can be conveyed either through a header section included in the data or through a separate lookup table that is created when the buffer is populated with the data. The reason this would most likely be run offline is that the scoring method would potentially have to run every candidate compression algorithm to see which is best; this would likely take too much time, and too many resources, to be run online during runtime even if the candidates were tested in parallel.
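As a rough illustration only, the following C sketch shows one way such a selection pass could look: each candidate compressor is run on the buffer, the smallest output wins, and the winner's identifier is returned so it can be recorded for the decompressor. The names, the enum values, and the use of compressed size as the score are all assumptions made for the example, not the group's actual implementation.

#include <stddef.h>

/* Hypothetical identifiers for the candidate algorithms. */
enum algorithm_id { ALG_DELTA = 0, ALG_HUFFMAN = 1, ALG_RLE = 2, ALG_VERTEX = 3 };

struct candidate {
    enum algorithm_id id;
    /* Each compressor returns its compressed size in bytes. */
    size_t (*compress)(const void *src, size_t src_size, void *dst);
};

/* Run every candidate on the buffer and return the id of the one that
   produced the smallest output; this id would be stored in the header
   or lookup table described above. */
enum algorithm_id choose_best(const struct candidate *list, size_t n,
                              const void *src, size_t src_size, void *scratch)
{
    enum algorithm_id best = list[0].id;
    size_t best_size = (size_t)-1;

    for (size_t i = 0; i < n; i++) {
        size_t size = list[i].compress(src, src_size, scratch);
        if (size < best_size) {          /* simplest possible score: output size */
            best_size = size;
            best = list[i].id;
        }
    }
    return best;
}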
If the program were to dynamically apply the compression algorithm at compile time, a header would be needed to define which algorithm was used. The program would map each algorithm to a string of bits, giving each algorithm a unique number written in binary. So, for example, if there are four algorithms, such as delta, Huffman, run-length, and one for vertex data, the header would only need two bits to represent each of them. It is likely the header would use four bits instead, in case the group wanted to incorporate more algorithms or alternative versions of the ones they already have, as displayed in Figure 4.1. Alternative versions of delta compression, for example, could be a slow and a fast version. The slow version could use more predictive techniques and more anchor points, while the fast version uses fewer anchor points but is better for compression during runtime.

Figure 4.1: Header Data: An example of how a header would be applied for dynamically applying the compression algorithms.

Index buffers and vertex buffers benefit from offline compression differently. For example, the compression algorithm for vertex data is likely to be more complex and therefore will take a longer time to complete. This means that the vertex compression algorithm will benefit more from offline compression because it is not rushed to finish in order to avoid stalling the pipeline.

4.1.2 Online Compression

In the online compression implementation, the data is compressed as it is read in from the system's memory (the hard drive in this case), directly before being loaded into the buffers on the GPU. The right side of Figure 4.2 displays the online implementation of this system. This implementation requires a high-speed compression algorithm in order to avoid halting the graphics pipeline while it waits for the buffer to be populated. In addition to the speed requirement, there is also a chance that fewer resources will be available to compress the data while the program is running, because the program is also utilizing the GPU and CPU. This could potentially slow down the compression and introduce variability in the algorithm's runtime.

The algorithms developed in this project are designed to follow the offline compression method, as this allows the group to focus on higher compression ratios while avoiding the constraints on resources and speed that the online method would introduce. If the final compression algorithms are fast enough to be run at runtime without potentially stalling the rest of the graphics pipeline, then the algorithms will be converted; however, this is not imperative for fulfilling the requirements of the project.

Figure 4.2: Graphics Pipeline with Compression: Two possible configurations of the graphics pipeline after our compression and decompression algorithms have been added.

4.2 Testing Environment

When the group began work, they anticipated using a multitude of different algorithms in their testing, and all of these tests would generate important data that the group would need to gather and organize into a common format for comparison. The group decided that there must be a standardized testing environment that would be able to track all of the algorithms that the group would work on over the course of the project.
The features that the group hoped to implement in their environment were the ability to keep track of the data being generated by the many different algorithms, the ability to test each algorithm on multiple kinds of data sets in a short period, the ability to provide a standard and modularized system in which many different algorithms could be tested and compared quickly, and the ability to verify the correctness of their implementations.

The testing environment is a framework designed to run the algorithms and measure their performance. It was written in the C programming language and was worked on by all of the group's members. It is modular, meaning each of its functions can be added, modified, or removed without impacting the environment's ability to run properly. The testing environment takes in index and vertex data from a text file and stores each in two separate arrays. When prompted, it runs the current version of the compression and decompression algorithms. It can measure each compression algorithm's run time and compression ratio. It has a checksum function to ensure that the data being decompressed matches the data that was originally compressed.

Two main areas of optimization must be taken into consideration when comparing the efficiency of compression and decompression algorithms: their time and space complexity. Both are important measures of how well the compression and decompression algorithms are able to work through the data being sent through the vertex and index buffers, and so the group put a high priority on collecting statistics on this and other data over the course of the project.

4.2.1 Initial Environment Design

The testing environment was designed to facilitate the generation of useful information when testing the group's changes to their algorithms. This information is used when comparing different implementations and optimizations of the group's algorithms with previous attempts. By developing the environment in C, the group hoped to avoid the complications of using a shader program, which would have introduced more complexity than was needed to test the algorithms. Because C code is easily ported to C++, starting out in C meant the group could transfer existing code to C++ if C++ features were needed.

4.2.2 Data Recording

The group had concerns regarding their ability to keep track of all the data that their project was going to generate; the data for an algorithm must be labeled appropriately along with the statistics it generates. The group viewed each revision of each algorithm as an entirely different entity, because changes in the code could yield unique performance characteristics that could be lost once an algorithm has been forked into two different variations of the same type. When attempting to determine which optimizations of an algorithm work best, the group wanted to be able to generate meaningful comparisons while maintaining a distinction between algorithms that started similarly but then took on different optimizations, even at the expense of generating a large amount of data. The group designed a system for keeping track of the data they collect. When an algorithm is run, the statistics for that algorithm are inserted into a database. If the algorithm yielded better performance than the previous iteration, then the group would keep the code and continue working on improving it.
In addition to the concerns related to their ability to keep track of all their data, the group realized that simply executing the algorithm code and recording how it performed was insufficient. The group also viewed the validation of the integrity of their algorithms as very important. The group was afraid that their iterations and optimizations could cause an algorithm to produce incorrect compression or decompression sequences and lose the required lossless quality without them noticing at the time. These errors, if they went unnoticed, would compound with the other changes and possible errors added to the project after the initial error. The group decided the most efficient way to test whether the algorithms were working as intended at all times was to compare a checksum of the original data that was intended to be compressed with a checksum of the data that is output after having been decompressed. This check did not verify that the algorithm was being performed in any particular way; it simply verified that the compression and decompression sequence yielded the same data as what was originally entered into it. Another basic sanity check was implemented to make sure that the size of the compressed data was smaller than that of the original data.

Figure 4.3: Checksum Functions: A checksum function will return a vastly different value even with similar input data.

Figure 4.4: Checksum Usefulness: Demonstration of how a checksum alerts the program that data has been changed.

4.2.3 Scoring Method

In order to compare the different algorithms the group developed and tested in the environment, a scoring method was employed to give a quick way to compare algorithms against each other. The three scores, shown in Figure 4.5, measure the compression ratio, the efficiency of the compression section, and the efficiency of the decompression section. To do this the scoring method takes into account the time the algorithm takes to both compress and decompress the data and the compression ratio achieved by the compression section of the algorithm. When creating the decompression score, the program multiplies the resources required to run the decompression section by the time taken to decompress the data. The resulting score is used to see if the decompression section of the algorithm is more efficient than previous attempts in terms of the two important aspects of decompression: resource requirements and speed. By creating three different scores the group was able to choose the most efficient algorithm possible by first seeing whether the whole algorithm was more efficient than previous attempts. Second, the two other scores are used to see if an algorithm's sections can be combined with the opposite section from other attempts to get a better result. By making different combinations of compression and decompression sections, the group hopes to further increase compression efficiency without having to add a completely new and untested optimization.

Score | Equation
Compression Ratio | Compressed Data Size / Original Data Size
Compression Score | Compression Ratio * Compression Time
Decompression Score | GPU Resources Used * Decompression Time

Figure 4.5: Score Equations for testing environment.
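The following C sketch shows, under stated assumptions, what the integrity check and the Figure 4.5 scores might look like in code. The report does not name a specific checksum algorithm, so a simple FNV-1a hash is used here purely as a stand-in, and the GPU-resource figure is assumed to be supplied by the caller.

#include <stdint.h>
#include <stddef.h>

/* Stand-in checksum (FNV-1a); any function that reacts strongly to
   small changes in the data would serve the same purpose. */
uint32_t checksum(const uint8_t *data, size_t len)
{
    uint32_t hash = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        hash ^= data[i];
        hash *= 16777619u;
    }
    return hash;
}

/* Compression Ratio = Compressed Data Size / Original Data Size */
double compression_ratio(size_t compressed, size_t original)
{
    return (double)compressed / (double)original;
}

/* Compression Score = Compression Ratio * Compression Time */
double compression_score(double ratio, double compress_seconds)
{
    return ratio * compress_seconds;
}

/* Decompression Score = GPU Resources Used * Decompression Time */
double decompression_score(double gpu_resources, double decompress_seconds)
{
    return gpu_resources * decompress_seconds;
}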
4.2.4 Dataset Concerns

The group was also concerned with how they were going to give themselves the ability to quickly test all their algorithms over multiple types of data with theoretically different compression ratios. A potential problem could arise if the index and vertex data that the group test their algorithms on are not diverse enough. Although an algorithm may excel at compressing a particular type of data, it may falter in other areas that impact its overall performance. An example of this is shown in Figure 4.6. This figure shows two different datasets being run through a delta encoding algorithm. The left dataset achieves a much higher level of compression compared to the one on the right. As another example, if the group only tested their algorithms on a file with long binary runs, run length encoding would appear much more viable than if testing were performed on a more inconsistently distributed file. As such, they worked to design their testing environment in a way that facilitated the testing of algorithms on multiple datasets at the same time.

Figure 4.6: Example of different data not working at same efficiency on same algorithm.

4.3 Index Compression

Due to the uniformity of index data (no index data will ever have a decimal value), the index compression algorithm is much easier to develop and is capable of achieving a much higher compression ratio than that of vertex data. Because of this, the group started with this algorithm and implemented a solid prototype before moving on to the more complex vertex algorithm.

4.3.1 Delta Encoding

Of all the algorithms researched, delta encoding seemed the best to start with as a baseline for index data compression. It was chosen because delta encoding has been proven to work very well with integer data that does not vary much from one unit to the next. This is generally the case with index information, because when drawing an object it is uncommon to point towards a vertex at one spot and then point to one very far from it, as this would draw a very oddly shaped object. These two vertices will often be close to each other in the buffer thanks to the way 3D objects are transformed into vertex information when created. Initial test results were very promising, with the implementation of delta compression showing a large amount of compression with very little time penalty at the time of both data compression and, more importantly, decompression. Due to these initial results delta compression was shown to be a good baseline to build upon.

4.3.2 Other Considered Algorithms

Other algorithms that were considered were Huffman encoding and run length encoding. Initial research deemed them less effective on average index data, and as a result the group did not consider testing them alone on uncompressed data a large priority. However, when run length encoding was implemented on top of delta encoding, the efficiency of the algorithm being developed increased even further, as the delta-compressed data was more compatible with it and yielded better compression without a large increase in decompression time. An example of this is shown in Figure 4.7. In the figure a sample of an index buffer is shown. The data in this buffer is first run through a delta encoder as the first step of compression. In the next step the delta encoded data is then run through a run length encoder. Because of how run length encoding works, numbers have to be encoded into a letter representation; in the example seen in the figure the run length algorithm encodes 500 to the letter a, 1 to the letter b, and -1 to the letter c. It can be observed that by running the delta encoded data through this second algorithm the sample data is converted from three values to two. This is how running one encoding on top of another can produce higher compression ratios, while decompression is not greatly slowed by the compounding of these encodings.

Figure 4.7: Run Length + Delta: Example of running Run Length encoding on top of Delta encoding
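A minimal sketch of this second pass is shown below, assuming the deltas have already been computed. Instead of the figure's letter substitution, each run of an identical delta is simply stored as a (count, value) pair; the function name and output layout are illustrative only.

#include <stdint.h>
#include <stddef.h>

/* Run-length pass over delta-encoded indices: each run of an identical
   delta is stored as one (count, value) pair. Returns the number of
   pairs written; the caller provides output arrays of sufficient size. */
size_t rle_encode_deltas(const int32_t *deltas, size_t count,
                         int32_t *out_values, uint32_t *out_counts)
{
    size_t pairs = 0;
    for (size_t i = 0; i < count; ) {
        size_t run = 1;
        while (i + run < count && deltas[i + run] == deltas[i])
            run++;
        out_values[pairs] = deltas[i];
        out_counts[pairs] = (uint32_t)run;
        pairs++;
        i += run;
    }
    return pairs;
}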
4.3.3 Delta Optimization

In order to increase the speed and efficiency of delta decompression on the group's data, the group developed dynamic anchor points. These anchor points split the data into separate sections or blocks, allowing the delta decompression to start at the nearest anchor point instead of at the beginning of the data. This helps the algorithm by allowing the GPU to access indices at random locations in the buffer without having to decode from the beginning of the buffer, decoding instead from the closest anchor point to the value being fetched.

There are a few methods of implementing these anchor points. The first anchor point implementation is shown in Figure 4.8. This implementation involves the use of escape codes that exist in some index data. These escape codes are pieces of data that do not represent actual indices but instead act as a flag to indicate the end of a triangle strip, a special type of optimization that allows the creation of a strip of triangles, each connected to the previous triangle by 2 vertices. This optimization allows the reuse of two indices of the previous triangle to draw the next triangle. Using these codes, the algorithm places a new anchor point directly after each escape code; thus the deltas between the following indices should be very small, as all the triangles in a triangle strip are connected. In the figure the escape codes are represented as the value -1 in the original index buffer. These are turned into two consecutive -1's in the encoded buffers, as shown by the arrows going from the original data to the left dynamic anchor point buffer. This is done to prevent deltas that equal -1 from triggering an escape code; in turn, these deltas are represented in the encoded buffers as a -1 followed by a 0. In the diagram a command to fetch the 7th value in the index buffer is run on both encoded buffers. The anchor point used is represented by the first blue box. The dots following it down the line represent decoding steps that had to be run before the desired index was reached, indicated by the ending blue square with the desired value in it. It can be seen that the normal anchor point method required eight decoding steps, assuming the loading of the anchor point was the first step, whereas the dynamic anchor point implementation allowed the decoder to reach the desired value in only three decoding steps.

Figure 4.8: Example Showing Benefit of Dynamic Anchor Points with Escape Codes

Another method for dynamic anchor points could be taking the size of the buffer of data and splitting it up into equal parts. This method requires a smart algorithm to split the buffer an optimal number of times, avoiding too many anchor points while still having enough to allow the GPU to quickly get the information it needs without decoding too many values. As shown in Figure 4.9, splitting the buffer up at multiples of three indices would work well, since three indices define a triangle, which is the base shape for drawing an object.

Figure 4.9: Example Showing Benefit of Dynamic Anchor Points with No Escape Codes
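A minimal sketch of the equal-split variant is shown below. It assumes an encoding in which every position that is a multiple of block_size holds an uncompressed anchor value and every other position holds a delta; the function name and layout are assumptions for illustration, not the group's exact implementation.

#include <stdint.h>
#include <stddef.h>

/* Fetch one index from an anchor-point encoded buffer: find the nearest
   preceding anchor, then apply deltas forward only up to the requested
   position, instead of decoding from the start of the buffer. */
int32_t fetch_index(const int32_t *encoded, size_t position, size_t block_size)
{
    size_t anchor = (position / block_size) * block_size;
    int32_t value = encoded[anchor];          /* anchor stored uncompressed */

    for (size_t n = anchor + 1; n <= position; n++)
        value += encoded[n];                  /* apply deltas up to the target */

    return value;
}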
4.3.4 Golomb-Rice

The divisor used in encoding a sequence with the Golomb-Rice algorithm determines the effectiveness of the algorithm. The method used to choose the divisor is therefore of the utmost importance. When the size of the divisor is decreased, the size of the quotient increases while the size of the remainder decreases. If the divisor is set too low, there is a chance that the quotient could become too large to fit within the largest datatype available in C (64 bits). When the divisor is increased, the size of the quotient decreases while the size of the remainder increases. If the divisor is set too high, the numbers will simply be converted to their binary representations, not yielding any compression.

The implementation of Golomb-Rice used for this project calculates the divisor based on the maximum value found within the input file. It ensures that the encoded sequence required for that maximum value is 32 bits at most. This way, all numbers smaller than this maximum value will be less than 32 bits when encoded, ensuring the algorithm always yields some amount of compression. This method works best on input files with a wide range between the highest and the lowest value. If the values are skewed such that a majority come close to the highest value, this method will yield next to no compression.

Blocking Implementation

The binary sequences of varying lengths produced by Golomb-Rice are not easily stored or decompressed by the C language's native libraries. As a result, a blocking structure was added when implementing the algorithm. A block is simply a small portion of the input data compressed in a similar fashion. The minimum number of bytes required to store the largest sequence in the block is the number of bytes used to store each value within that block. If a particular block ends up with sequences larger than a native integer, the hope is that the other, smaller blocks in the compressed data will compensate. The compressed buffer as a whole is stored in a char array, with the first char in each block used to store the number of bytes required for each value in the block. In the current implementation each block shares the same divisor. It is possible, however, to give each block its own divisor. This could possibly yield better compression, as each block could compress its own values to the smallest sequence possible. A caveat of this technique is that it leads to more overhead when compressing the data. It also leads to more header bytes being required to store the divisor for each block. These header bytes could become so large that they themselves would add a complete integer to the compressed buffer.

Golomb's Strengths

Golomb's main strength is its ability to work on any kind of data formatting. Run-length encoding works optimally on data which is sequential, where the delta values are repeated frequently and consecutively. Golomb will compress data effectively regardless of the range between values or where they are in relation to each other. Golomb does not require additional data to be stored alongside the compressed buffer, aside from the divisor that was used to compress the data. For this reason, Golomb also lends itself well to parallel implementations. Any thread can decompress a value in parallel as long as it is given the divisor.

Priming with Delta

An optimization added to the Golomb-Rice algorithm was first compressing the data with delta compression.
Priming with Delta

An optimization added to the Golomb-Rice algorithm was to first compress the data with delta compression. Delta compression makes large values smaller by storing only the differences between adjacent values. With the data in this form, the chosen divisor proves more effective at compressing the data overall. This method comes with several tradeoffs: it adds overhead to both the compression and decompression processes, and it eliminates Golomb's ability to be parallelized, since the buffer cannot be decompressed without an anchor point.

4.4 Vertex Compression

Vertex compression is much more complex and as a result required much more research and work to implement prototypes. The group researched many different algorithms, including both prediction-based and non-prediction-based approaches. Of these, the group chose to focus on implementing BR compression and LZO compression. BR compression was chosen in order to use an open algorithm that the group could understand completely. BR was considered a good candidate because it uses a one-pass predictive algorithm, meaning it would hopefully provide sufficient compression and decompression speeds. BR was the first vertex compression algorithm the group attempted to implement. LZO was chosen over other LZ77-based algorithms such as DEFLATE due to its focus on decompression time. This focus made LZO ideal for this project, as it still achieves a high level of compression while having a low decompression time. There are numerous LZO algorithms which generate different levels of compression; however, since all LZO1 algorithms use the same decompressor, their decompression speeds are all comparable in terms of MB/s. LZO1-1 was the version of LZO implemented in our testing environment.

Build, Testing and Evaluation Plan

Our original build plan is outlined in the objectives section. The first step was the creation of the testing environment, followed by the implementation of basic compression and decompression algorithms. Using the testing environment, the group would then test potential improvements and accept or discard each modification depending on whether it improved the algorithm. To measure the performance of an algorithm, the group used the testing environment to run it on a set of sample data. Each run produced valuable test data such as how long compression took, how long decompression took, and the compression ratio between the original and compressed data. The group then saved this information to a file for later analysis to determine whether the algorithm was an improvement.

To evaluate whether an algorithm is an improvement, the group compared several fields of test data against other tests. The first and most important check is that the algorithm remains lossless; if any data is altered or lost, the algorithm has failed and must be thrown out. Next in importance is the compression ratio, closely followed by decompression time. These two fields are scrutinized the most, weighing the power of the compression against the speed of decompression. They matter because the group is looking for an algorithm that achieves the greatest amount of compression for our data types while maintaining a decompression time fast enough to run at runtime, when the data is fetched from the buffer.

5.1 Version Control

When the group started this project, they knew that they would be working on the same files at the same time.
Regardless of whether the group members are working on different files in the project or on different parts of the same file at the same time, it is necessary to keep the project in a single form that all members contribute to. If the members attempt to work on their parts of the project with complete independence, intending to merge everything at a later date, they may run into large compatibility problems when merging. A more manageable solution is for the members to keep a central repository where the code is stored. That is why the group decided to implement a version control solution for use during the project.

5.1.1 What is Version Control

Version control is a system that one or many people may use to manage the changes made to a project. Although implementations of this concept differ, the idea is that many people may work on the same project simultaneously by creating a copy of the master project, having each person make their own changes to the parts of the project they are assigned to, and then finally updating the master version with their changes, known as a "commit". Conflicts in version control may arise when multiple people edit the same file and then both try to commit their changes back to the master version. A robust version control system will alert a user who is committing over other people's changes that their commit may result in the loss of other people's work. A comparison highlighting the differences between the two copies of the same file, or a "diff", may be provided, which the user can use to update their version of the file with the other user's contents. An example of a file resolution can be seen in Figure 5.1. This conflict resolution system is imperfect but much preferable to the alternative: users not being able to work on the same file at the same time and having to manually make sure that it is safe to commit the files they are working with.

As can be seen, using a version control solution was imperative for managing the group's project. It would allow the group to work on different parts of their compression and decompression algorithms or testing environment at the same time, and merge their code when they were done. Since the group members did not only work on the project when they were meeting together, it was important to have a centralized location where they could store their source code that was not dependent on transferring it over some physical medium; storing the code online alleviated that problem. Additionally, version control would allow the group to revert commits that introduced bugs. Some bugs are introduced through errors in code that are difficult to pin down, and if the commit also modified a large area of code, fixing it in place may not be time efficient. Instead, the changes can be redone using a different solution, or while the programmer is being more conscious of the errors that might occur when writing the code.

Figure 5.1: Version Control: A file being changed and merged in a generic form of version control.

5.1.2 Choosing a Version Control System

In the beginning stages of the project the group had not yet established whether a public account was safe for storing data that could potentially compromise their NDA if open to the public. The owners of the company that runs GitHub, for example, are strong proponents of the concept of free information on the internet.
Therefore, if one signs up for a public account, the code is freely available to be viewed by the public and is considered open source. Knowing how important the role of version control would be in the project, the group wanted to make sure that the solution they chose was the right one for the task.

The group considered many different factors when deciding which version control solution to use. They were aware that at some point they could be working on different parts of code that existed in the same file, so the version control solution had to be good at alerting users that files needed to be merged before committing. Not doing so would lead to situations where the code they were writing became fractured in ways that might not be simple to fix. Since this project did not require internet access for any of its functionality, the group also wanted a version control solution that did not require internet. This allows the extra flexibility to work on the project for long stretches of time where internet isn't available, such as when traveling. Another aspect the group was interested in was ease of use. Using an overly complicated version control solution is just as undesirable as using one that isn't robust enough for all of the group's needs; a system that takes as much time to learn and use as it saves is not very useful in the end. Finally, the group wanted a system that offered a secure means of code storage. Because the project was being done for AMD, who had placed a non-disclosure agreement on parts of the project, it was important for the group to be able to control who could access the code.

                     Git   Svn   Dropbox   Google Drive
    multiple users    ✔     ✔       X           ✔
    offline           ✔     X       X           X
    simple            X     ✔       X           ✔
    secure            ✔     ✔       X           ✔
    cross platform    ✔     ✔       ✔           ✔

Figure 5.2: Version control pros / cons: The different pros and cons of each kind of version control.

The first version control solution the group looked at was Subversion. Subversion has many strengths; it is a very robust solution with powerful tools. Its automated tools are useful for keeping track of entire projects at once, and it allows users to commit only the files they are working on back to the master version, simplifying the merging process. Subversion is also a secure solution, as it uses a login system to keep track of who can check out and merge changes into the repository. Subversion is cross platform; it has a command line utility for Linux and many powerful clients for Windows such as TortoiseSVN. Many companies use Subversion in the workplace for their products, another indication of its usefulness. Subversion is also relatively simple to use; since all users commit to the same online repository, committing files is straightforward.

Git is another tool widely used for enterprise-level code versioning. Like Subversion, it has many powerful tools that allow many developers to work on the same code simultaneously. Git operates by having each user create an entire clone of the repository they are working on. This lets developers work more easily with the project's revision history and gives them full control over the project while not connected to the internet. Additionally, since Git repositories are distributed, loss of the master server will not hinder the project members the way it would for a group using Subversion.
Like Subversion, it has robust member management features, allowing restricted access to projects hosted online. Git is typically considered harder to use than Subversion; because each user works with a full clone of the repository when making changes, more complex commands are involved than in Subversion.

Dropbox is another candidate for software versioning. Dropbox is a service which syncs folders to a cloud-based storage system, and accounts come with a small amount of space without any kind of subscription or monetary commitment. Since it was designed as a general-purpose file syncing program and not as a code versioning program, it lacks certain functionality that programmers would find useful. For instance, if two developers are working on a file simultaneously, Dropbox will not alert them that the file's code is diverging. Instead it will simply add the second file to the directory alongside the original, a solution that is far from ideal. Dropbox is typically used for smaller projects whose developers do not think a large-scale versioning program is necessary. Most do not recommend storing software projects on Dropbox.

Google Drive is also used for versioning, but in a different capacity. In addition to Google Drive's file syncing and backup functionality, it is able to host the group-editing of documents. Multiple people are able to edit the same document simultaneously, greatly simplifying the document creation process.

The group originally used Dropbox for file versioning. At the time, it was sufficient for their needs: it ensured that the code was backed up in some location and also simplified file sharing. With Dropbox's sharing system, the group was also not worried about problems regarding the security of the code. Since the group could not yet verify either site's security, they initially posted their completed code to a shared folder on Dropbox. Only those with user accounts who had been sent an invitation to the folder were allowed to view or edit its contents. Later on, the group transferred their files to the private GitHub repository they had obtained through a student subscription. The group saw Dropbox as a temporary solution until they could decide on which dedicated file versioning program to use.

The group eventually decided to use Git as their version control system. They chose it because of its robust feature set. Specifically, Git provided the ability for the group to work offline with a full and distributed backup of their project, which was essential for time-sensitive work. The group used Google Drive to store their documentation. They also used the drive to store a schedule of the work left to do on the project, formatted in a way that defined exactly who would work on which portion. They also kept a document containing the minutes from each of their meetings, both for record keeping and to preserve the events for future reference, providing an overall picture of the pace the group was keeping in the progression of the project. The group saw the ability for everyone to contribute to the same document simultaneously as a unique and very useful capability. Due to the complex nature of word-processor document storage, a traditional versioning system like Git or Subversion would not have sufficed for this purpose.
5.2 Test Runs

All tests were done using our testing environment. Each test was run on a computer containing an Intel Core i7-4785T @ 2.20 GHz with 8 GB of RAM, running Windows 8.1 Pro. Every test was run 10 times for each file and the results were averaged.

5.3 Index algorithm development

5.3.1 Delta Encoding

The group decided to use delta encoding as the baseline compression method for index buffer data. It was chosen over the other possible encodings because it is easily implemented and very effective when run on index data. Index buffers consist largely of sequential integer values. This makes logical sense because a graphical object is more likely to be defined by a series of vertices which are close together, rather than by vertices on opposite ends of the graphical environment. When run on the sample data, with Run Length Encoding then applied on top of the encoded data, the group found that delta compression achieved around a 2:1 compression ratio, as seen in Figure 5.3. The original data is stored in the text file indexBuffer.txt shown in the table, which contains a large number of example index values. Delta encoding is run on the original data using our test environment, and as the table shows, the compressed data saved to indexBufferCOM.txt is almost half the size of the original.

This test data contained escape codes that had to be kept in a separate format from the rest of the data. In this test run the escape code was the unsigned integer value equating to the signed integer -1. This escape code is used when drawing triangle strips to indicate the end of one strip and the beginning of a new one. The delta encoder must therefore contain a handler that checks whether the value is an escape code or whether the delta value is -1. To handle this, the group encoded their own escape codes into the compressed list: when the encoder hits an escape code, two numbers are added to the compressed list, in this case a -1 followed by a 1. If the delta between two values equals -1 (the value of the escape code), the encoder instead pushes -1 and 0 to the compressed list, indicating an actual delta of -1 rather than the escape code. These added escape codes cause some increase in size, but even with this method the resulting sizes are still a huge improvement. Figure 5.3 also shows that the file size remains the same when comparing the original data file and the decompressed file, and the lossless quality is verified by running the original and the resulting decompressed file through a checksum that compares the sum of all the values in one file with the sum of the other.

    File                  Data Status         File Size
    indexBuffer.txt       Original Data       48 KB
    indexBufferCOM.txt    Compressed Data     26 KB
    indexBufferDEC.txt    Decompressed Data   48 KB

Figure 5.3: Index Buffer Delta Compression: Example of compressing index buffer data using Delta Encoding.

The delta compression algorithm implemented so far can be considered a "dumb" implementation, as it begins at the start of the array, with index zero set as the anchor point and the only value unchanged from the original array. This algorithm spends a large amount of time decompressing portions of the data which are not required at that moment. The group plans to redesign the algorithm to be "smart" and include dynamic anchor points. This will allow the algorithm to run much faster when accessing different parts of the buffer and allow for much faster decompression times.
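A minimal C sketch of the delta encoder with the escape-code handling described above is shown below. The function name and buffer layout are illustrative, not the project's, and whether the value following a reset starts a new anchor or continues from the previous index is a design choice; this sketch continues from the previous index.

    #include <stddef.h>
    #include <stdint.h>

    /* Delta-encode `n` index values into `out`.  The reset value -1 is
       emitted as the pair (-1, 1), and a genuine delta of -1 is emitted
       as the pair (-1, 0), so the decoder can tell them apart.  `out`
       must have room for up to 2*n values; the count written is returned. */
    size_t delta_encode(const int32_t *in, size_t n, int32_t *out)
    {
        size_t  w = 0;
        int32_t prev = 0;            /* position 0 stores the raw value */
        for (size_t i = 0; i < n; i++) {
            if (in[i] == -1) {       /* triangle-strip reset escape code */
                out[w++] = -1;
                out[w++] = 1;
                continue;            /* prev is left unchanged           */
            }
            int32_t d = in[i] - prev;
            if (d == -1) {           /* delta collides with the escape   */
                out[w++] = -1;
                out[w++] = 0;
            } else {
                out[w++] = d;
            }
            prev = in[i];
        }
        return w;
    }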
By using delta encoding first, there was potential to use other encodings and algorithms to further compress the data. In uncompressed index data there is rarely a pattern of the same index repeated over and over in a consecutive run, because a single vertex cannot connect with itself to form a graphical object. This makes run-length encoding an inadequate algorithm for uncompressed index data. However, after delta encoding is run on the index data there is a good chance that many of the deltas between indices will share the same value (if the vertices all come one after another in the buffer). This allows run-length encoding to be run on the delta-encoded buffer, potentially compressing the already compressed data even further. The results of delta encoding compounded with run-length encoding are shown below. As seen in Figure 5.7, the average compression rate for Delta combined with run-length encoding is 46.25%. The speed of the algorithm's test runs can be seen in Figure 5.5; the average compression and decompression times are 0.83 and 0.76 milliseconds respectively.

Figure 5.4: Delta RLE file size change
Figure 5.5: Delta RLE Compression and Decompression Time
Figure 5.6: Delta RLE Normalized Compression Speeds
Figure 5.7: Delta RLE Compression rates of different test files
Figure 5.8: Delta RLE Test Run Histogram

5.3.2 Golomb-Rice Encoding

The Golomb-Rice integer compression algorithm is able to compress the index buffer by 42.01% on average, as seen in Figure 5.12, when run on the same data as our Delta-RLE algorithm. As seen in Figure 5.10, Golomb-Rice has an average compression time of 14 milliseconds and an average decompression time of 14 milliseconds. Figure 5.11 displays the throughput, which averaged 4 MB/second for compression and 6 MB/second for decompression.

Figure 5.9: Golomb-Rice file size change
Figure 5.10: Golomb-Rice Compression and Decompression Time
Figure 5.11: Golomb-Rice Normalized Compression Speeds
Figure 5.12: Golomb-Rice Compression rates of different test files
Figure 5.13: Golomb-Rice Test Run Histogram

5.3.3 Index Compression comparison

Figure 5.14 displays a comparison between Delta-RLE and Golomb-Rice compression rates from our tests. It is important to note that the compression rates of the two algorithms remain relatively comparable throughout the tests. However, due to Golomb's slow decompression speeds it was deemed the less fit algorithm for our project. It is still a valuable algorithm, as it performs similarly when run on random data, which is not at all true of Delta-RLE.

Figure 5.14: Comparison between Delta-RLE and Golomb-Rice Compression Rates

5.4 Vertex algorithm development

5.4.1 Test Data

Our test data was provided by AMD and was produced by a PerfStudio program designed to dump the contents of actual index and vertex buffers of graphical objects the company performs tests with. These values were then written into a text file, with each vertex receiving its own line. When the group was given the vertex buffer data, some of the values had been printed to the file in exponential form. This meant that the value had been printed out as a decimal number similar to the actual data, but raised to a negative power in order to keep the numbers within an arbitrary range. An example of this is using 0.113e-3 to describe the float value 0.000113.

Before the group could begin work on the compression algorithm, they had to ensure that all of the data was uniformly described by numbers alone. The characters used in proper exponential format forced us to read all of the data from the text files as strings and then convert it to the proper numeric data type. This required a parser to translate the string data into float data and interpret the exponential-formatted values as they were encountered. Further information regarding this process is covered in the section concerning our testing environment's File Reader (Section 5.5.1).
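As a brief illustration of that conversion step (covered more fully in Section 5.5.1), the sketch below reads whitespace-separated values from a text file in two passes. The file layout and names are assumptions; note that C's strtof already accepts exponential notation such as 0.113e-3, although the project's reader detects and handles that case explicitly as described later.

    #include <stdio.h>
    #include <stdlib.h>

    /* Read every whitespace-separated value in a text file into a float
       array.  First pass counts the tokens, second pass converts them. */
    float *read_vertex_file(const char *path, size_t *count)
    {
        FILE *f = fopen(path, "r");
        if (!f) return NULL;

        char   tok[64];
        size_t n = 0;
        while (fscanf(f, "%63s", tok) == 1)   /* first pass: count       */
            n++;

        float *vals = malloc(n * sizeof *vals);
        if (!vals) { fclose(f); return NULL; }

        rewind(f);
        for (size_t i = 0; i < n; i++) {      /* second pass: convert    */
            if (fscanf(f, "%63s", tok) != 1) { n = i; break; }
            vals[i] = strtof(tok, NULL);      /* handles 0.25 and 0.113e-3 */
        }
        fclose(f);
        *count = n;
        return vals;
    }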
5.4.2 Vertex Algorithm Implementation

Because vertex data can contain both integer and float values, one suggested path for compressing the data is to split up these two data types. This would allow us to run integer-based compression algorithms on one section of data while running the more complex float compression algorithms on the other section, which would then contain only float data. Another potential way to compress this data is to represent the float values as strings. The algorithm can then use a method such as the Burrows-Wheeler Transform to reorganize the data and compress it using an encoding such as run length.

The group implemented two algorithms for vertex compression: LZO and BR compression. Our tests have shown LZO to be the better candidate for vertex compression, as it yields better compression results and has faster decompression speeds. LZO is also an LZ77-based algorithm, which gives some insight into the value this family of compressors holds and suggests that other algorithms built on the same compressor may give better results in future research.

5.4.2.1 LZO

In our tests LZO achieved a compression rate averaging 32.58%, as seen in Figure 5.18, with compression and decompression times averaging 5.1 and 2.9 milliseconds respectively. It is important to note that, as seen in Figure 5.16, the decompression times were always well below the compression times.

Figure 5.15: LZO File size changes
Figure 5.16: LZO Compression and Decompression times
Figure 5.17: LZO normalized compression speeds
Figure 5.18: LZO Compression rates of different test files
Figure 5.19: LZO test run histogram

5.4.2.2 BR

In our tests BR achieved a compression rate averaging 14%, as seen in Figure 5.20, with compression and decompression times averaging 9 and 7.6 milliseconds respectively, as seen in Figure 5.21.

Figure 5.20: BR size changes
Figure 5.21: BR Compression and Decompression times
Figure 5.22: BR normalized compression rate, measured in MB/S
Figure 5.23: BR Compression rates of different test files
Figure 5.24: BR test run histogram

5.4.3 Vertex Compression comparison

Figure 5.25 displays a comparison between BR and LZO compression rates from our tests. Unlike the results of the index compression algorithms, there is a clearly better algorithm here. LZO has a consistently higher compression ratio and almost double the compression rate of BR. BR was still valuable to the project, since LZO's algorithm was relatively unknown to us beyond being based on the LZ77 compressor, whereas BR was described in full.

Figure 5.25: Comparison between BR and LZO Compression Rates
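For reference, the sketch below shows the typical calling pattern for the miniLZO distribution of LZO (the LZO1X-1 variant; the group used LZO1-1, but the interface is similar). It assumes the minilzo.h header is available and is illustrative rather than the project's actual integration code.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include "minilzo.h"   /* assumed available alongside the sources */

    /* LZO1X-1 compression needs LZO1X_1_MEM_COMPRESS bytes of work memory,
       aligned for lzo_align_t; decompression needs none. */
    static lzo_align_t wrkmem[(LZO1X_1_MEM_COMPRESS + sizeof(lzo_align_t) - 1)
                              / sizeof(lzo_align_t)];

    /* Compress a buffer, decompress it again, and verify the round trip. */
    int lzo_roundtrip(unsigned char *in, lzo_uint in_len)
    {
        lzo_uint out_cap = in_len + in_len / 16 + 64 + 3;  /* worst case */
        unsigned char *out  = malloc(out_cap);
        unsigned char *back = malloc(in_len);
        lzo_uint out_len = 0, back_len = in_len;
        int ok = -1;

        if (out && back && lzo_init() == LZO_E_OK
            && lzo1x_1_compress(in, in_len, out, &out_len, wrkmem) == LZO_E_OK
            && lzo1x_decompress(out, out_len, back, &back_len, NULL) == LZO_E_OK
            && back_len == in_len && memcmp(in, back, in_len) == 0) {
            printf("compressed %lu -> %lu bytes\n",
                   (unsigned long)in_len, (unsigned long)out_len);
            ok = 0;                     /* lossless round trip confirmed */
        }
        free(out);
        free(back);
        return ok;
    }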
5.5 Test Environment

Our testing environment can be separated into four basic sections: the file reader, the tests that are run, the section that outputs the results into our testing database, and the actual compression and decompression algorithms being implemented and tested. These four sections were developed as separate modules. This allows one section to be modified while the others are kept constant, without the risk of damaging them, which is especially important when implementing test algorithms.

5.5.1 File Reader

AMD provided a large amount of sample data from index and vertex buffers they had worked with. It was acquired from PerfStudio through a function that dumped the contents of the buffers into folders and then compressed them into a .zip file. The data itself was stored in the form of text files and therefore had to be read into our test environment through a file reader.

The file reader is separated into two functions, one designed to read in index data and one designed to read in vertex data. Both functions take a single parameter, the address of an integer. This address is used to pass the size of the array back by reference; that information is needed later because the functions which perform compression and decompression on the data need to know the size of the array storing it. The first function, which reads in data from the index buffer files, returns an array of integers after scanning the text files. The second function returns an array of strings for the vertex buffer because of formatting complications explained below. The simple fscanf function from the C standard input/output (stdio) library is used to read the data from the text files in both functions. Each function starts by scanning every value in the file without saving it to an array, keeping a tally of how many values the file contains. Once it has completed an entire pass over the document and knows how many values are inside the file, the function dynamically allocates space for an array that can hold the correct number of values. The data from the file is then read in using fscanf once more. This time the data from the index buffer files is stored in an array of integers, while the data from the vertex buffer files is stored in an array of strings. The group had to read the vertex buffer data in as strings because some of the values from the vertex buffers were formatted in exponential form, whereas other values were expressed as plain floats. We designed a parser which reads through each string in the array, detects when it is in exponential form, and converts that data into the more familiar decimal numbers commonly found in a vertex buffer. Any values that were not in exponential form are simply transformed into floats using the C library's strtof function. These values are then stored in a float array.

5.5.2 Compression and Decompression algorithms

The most important property of our test environment is the ability for prototype compression and decompression algorithms to be implemented easily and quickly. By writing each algorithm as a function that takes in the buffer object, an algorithm can be plugged into the rest of the environment without modifying anything else. The way the environment was written, the group is able to design these algorithms in separate C files and then call them from the main function of the testing environment, which ties every part of it together. By writing in this modular fashion the group is able to focus on the actual algorithm instead of worrying about introducing errors into the rest of the environment. This modular approach also allows each group member to develop separate optimizations in parallel and test them all in the same environment by plugging their code into the algorithms section.
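A minimal sketch of this plug-in shape is shown below. The codec_t structure, function signatures, and the placeholder identity codec are illustrative assumptions, not the project's actual interface; the point is only that every algorithm exposes the same compress/decompress pair so the driver code never changes.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* One entry per candidate algorithm: a name plus a compress and a
       decompress function sharing the same signature.  Each function
       returns the number of bytes written to `out`. */
    typedef struct {
        const char *name;
        size_t (*compress)(const uint8_t *in, size_t in_len, uint8_t *out);
        size_t (*decompress)(const uint8_t *in, size_t in_len, uint8_t *out);
    } codec_t;

    /* Placeholder "identity" codec standing in for a real algorithm such
       as Delta+RLE or Golomb-Rice; a real plug-in only needs to match
       the two signatures above. */
    static size_t copy_pass(const uint8_t *in, size_t in_len, uint8_t *out)
    {
        memcpy(out, in, in_len);
        return in_len;
    }

    static const codec_t codecs[] = {
        { "identity (placeholder)", copy_pass, copy_pass },
        /* real entries would be added here, one per algorithm file */
    };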
5.5.3 Testing code

Our testing code consists of three different tests: a time test, a compression ratio test, and a lossless integrity test. The time test works by first recording a start time after the test data has been read into a buffer array but before the compression algorithm is run. It then runs the compression algorithm being tested and records an end time, giving the length of time compression took. The environment then records another time to mark when decompression started, and once decompression completes it records a final time, allowing the decompression time to be calculated. Subtracting each start time from its corresponding end time gives the total time each section of the algorithm has taken. Currently the group records these times with the C library's timing functions and reports the results in milliseconds.

The second test is the compression ratio test. Currently the group manually checks the output file sizes to measure the difference between the original data and the compressed data. We would like more precise data, however, and will be implementing a method to calculate this in code, potentially using the C++ vector data structure and some arithmetic to compute the size of the resulting data. By dividing the size of the compressed data by the size of the uncompressed data, the group gets a compression ratio that can be used to compare just how effective each algorithm is.

Finally, the test for whether the algorithm is indeed lossless is run every time using a checksum. This checksum indicates whether the original data is exactly the same as the decompressed data. The chance of two different lists of data having the same checksum is extremely low, so the group is confident that this is a sufficient test of whether the two lists are identical. If the decompressed data is not the same, a warning message is displayed to the console; an example of this is shown in Figure 5.26. Once the code finishes running, the timing data collected from the test run is displayed to the console, which can be seen at the bottom of the figure. Note that the test data used in the example in the figure is very small, and as a result compression and decompression took less than a measurable amount of time to complete. The collected data will also be written to a file or database for further comparison and analysis.

Figure 5.26: Example Testing Environment Output: Example output produced by our testing environment, including the performance measures.
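A minimal sketch of one such test run is shown below, combining the timing and checksum checks described above. It assumes the illustrative codec_t plug-in type from the previous sketch and uses clock() for sub-second timing; all names are hypothetical.

    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    /* Same illustrative plug-in type as in the Section 5.5.2 sketch. */
    typedef struct {
        const char *name;
        size_t (*compress)(const uint8_t *in, size_t in_len, uint8_t *out);
        size_t (*decompress)(const uint8_t *in, size_t in_len, uint8_t *out);
    } codec_t;

    /* Simple additive checksum used to confirm the round trip is lossless. */
    static uint64_t checksum(const uint8_t *buf, size_t n)
    {
        uint64_t sum = 0;
        for (size_t i = 0; i < n; i++) sum += buf[i];
        return sum;
    }

    /* Time one compression/decompression round trip, report the ratio, and
       warn if the decompressed data does not match the original. */
    void run_test(const codec_t *c, const uint8_t *data, size_t n,
                  uint8_t *scratch, uint8_t *roundtrip)
    {
        clock_t t0 = clock();
        size_t comp_len = c->compress(data, n, scratch);
        clock_t t1 = clock();
        size_t dec_len = c->decompress(scratch, comp_len, roundtrip);
        clock_t t2 = clock();

        double comp_ms = 1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC;
        double dec_ms  = 1000.0 * (double)(t2 - t1) / CLOCKS_PER_SEC;

        printf("%s: compressed to %.2f%% of original, "
               "compress %.3f ms, decompress %.3f ms\n",
               c->name, 100.0 * (double)comp_len / (double)n, comp_ms, dec_ms);

        if (dec_len != n || checksum(data, n) != checksum(roundtrip, n))
            printf("WARNING: %s is not lossless on this input\n", c->name);
    }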
5.5.4 Data Printer

The final part of the testing environment being developed is the data writer. This code is designed to format the test results of the program. Our test results are output both to the screen during debug mode and to a database file during testing mode. This gives the data a consistent format, allows all of the different tests to be compared with each other, and lets the results be organized into graphs and figures that display the findings more effectively. An example of the output data can be seen in Figure 5.27.

Figure 5.27: Additional Testing Environment Output: Full performance metrics used for determining algorithm statistics.

Administrative Content

6.1 Consultants

6.1.1 AMD

The project was originally proposed and is sponsored by Advanced Micro Devices (AMD). They are one of the two main graphics card research and development companies for mainstream computing and gaming. Their graphics cards are also featured in most gaming consoles, and they have a wide variety of personal-use and workstation GPUs that would benefit greatly from our project. AMD has been the group's main consultant on how graphics cards work and how the data the group will be compressing is used within the graphics pipeline. The group's main contacts at AMD are Todd Martin and Mangesh Nijasure. They are also the main consultants on what the project needs to do in terms of requirements and specifications. Additionally, they provided all of our initial test data as well as programs to generate additional test data if needed.

6.1.2 Dr. Sumanta N. Pattanaik

Dr. Sumanta Pattanaik is an associate professor at UCF who teaches Computer Graphics. He provided the group with a crash course covering the basic background of how computer graphics are computed and programmed. He has also been helpful in understanding both vertex and index information and how it can potentially be compressed with our algorithms, and he gave us good ideas on where to start researching lossless compression algorithms, as well as which algorithms did not seem likely to help in developing successful compression algorithms.

6.1.3 Dr. Mark Heinrich

Dr. Heinrich is an associate professor at UCF who conducts research focused on computer architecture. He is also in charge of the Computer Science senior design class. He has been very helpful with keeping our project on track and making sure the group does not fall too far behind and run out of time. He has also been very helpful in contacting other professors to ask for assistance with our project.

6.1.4 Dr. Shaojie Zhang

Dr. Shaojie Zhang is an Associate Professor of Computer Science at UCF. He conducts research in DNA simulation and analysis. He provided some direction in terms of compression algorithms to use with both index and vertex data.

6.2 Budget

Our client for this project, AMD, is also our major sponsor. They contributed a fund of $2000 to ensure that the project could be completed without the costs of required equipment and software getting in the way. The group has several possible expenditures which these funds will go toward covering.

6.2.1 A Graphics Processing Unit

A graphics card manufactured by AMD containing the most recent iteration of their Graphics Processing Unit, in order to test our algorithms with the most fidelity.
Unit Average Price Graphics Memory R9 295X2 $999.99 Up to 8GB GDDR5 R9 290x $349.99 Up to 8GB GDDR5 R9 290 $259.99 Up to 4GB GDDR5 84 R9 285 $244.99 Up to 4GB GDDR5 R9 280X $229.99 Up to 3GB GDDR5 R9 280 $174.99 Up to 3GB GDDR5 R9 270X $159.99 Up to 4GB GDDR5 R9 270 $139.99 Up to 2GB GDDR5 Figure 6.1: AMD R9 Graphics Cards: A side-by-side price and performance comparison. More information on this series of graphics cards is provided in the appendices. Reprinted with permission. 6.2.2 Version control Version control in the form of GitHub or some other sites used as private repositories could be required. The group may come into contact with sensitive material that AMD is working on as the group progress in our project. In order to abide by our Non-disclosure Agreement and preserve this sensitive data, the group would need a private repository from one of these sites. Despite the fact that public repositories are free to use, private repositories require a subscription with a monthly fee. The entire project will last from August 18, 2014 until May 2, 2014. This will require a minimum of 10 months of subscription time. Plan Name Private Repositories Subscription Fee Overall Cost (10 months) Free 0 $0 / month $0 Micro 5 $7 / month $70 Small 10 $12 / month $120 Medium 20 $22 / month $220 Large 50 $50 / month $500 Figure 6.2: GitHub Personal Plans: The potential cost of a subscription to a GitHub personal account. Plan Name Private Repositories Subscription Fee Overall Cost Free 0 $0 / month $0 85 Bronze 5 $25 / month $250 Silver 10 $50 / month $500 Gold 20 $100 / month $1000 Platinum 50 $200 / month $2000 Figure 6.3: GitHub Organization Plans: The potential cost of a subscription to a GitHub organization account. 6.2.3 Algorithm Licenses In order to use certain patented algorithms, purchasable licenses could be necessary since AMD is planning to use our work in their commercial product. As of yet the group has not been required to purchase a patent in order to makes use of the algorithms in our project. Huffman Encoding, Run-length Encoding, and Delta Encoding are all not patented algorithms and are therefore free to use in this context. However the group may yet find an algorithm that is patented which they will be required to pay to use. This will likely be found in our research concerning compression of the vertex buffer, if it is indeed found at all. 6.2.4 Document Expenses For this class this paper was required to be printed and bound professionally, this will have to be done by an outside party. Luckily, on the UCF campus is a professional graphical design firm entitled “the Spot.” They are most well-known by the student body as a place to print papers and, relevantly, get papers bound. We contacted the Spot for a quote on how much it would cost to get our paper printed bound. We used the final design document created by the previous year’s senior design students as an example of what the group would be printing. Their paper had fifty-four pages without color, and forty-three pages with color. The spot quoted us the costs of printing each page. The rate they charge to print a 86 document without color is ten cents per page. The rate they charge to print a document with color is forty-nine cents a page. The total is calculated in the table below: Item Rate Cost Black and White Impression $0.10 / page $5.40 Color Impression $0.49 / page $21.07 1 small spiral bind $4.50 flat fee $4.50 Total Cost $30.97 Figure 6.4: The Spot Pricing: Quote detailing the cost to print a document. 
6.2.5 Estimated Expenditures Figure 6.5 shows a pie chart detailing what percentage of the budget the group expected to spend on each necessary item: Figure 6.5: Estimated Expenditures Pie Chart. As shown the group expected to go with the Medium Personal level subscription with GitHub for version control. We also planned to buy the R9 950x Graphics Card from AMD. 87 6.2.6 Actual Expenditures Figure 6.6: Actual Expenditures Pie Chart Git Hub offers free private group-repositories for university students. This guarantees that we can protect our NDA while still being cost-effective with our project expenses. As the project progressed it was decided that a graphics card was not required to gather the sample data we needed to test our algorithms. This was thanks to the test data that was provided to us by our sponsors, AMD and existing data that was found on the web. The poster used for Senior Design Day was professionally done and of a proper size to easily convey our results. This caused the poster to eat the largest amount of our budget, costing $140.00. Even with this poster and the cost of getting the final document professionally printed this project still came in well under-budget. 6.3 Project Milestones 6.3.1 First Semester This project has been split up into two semesters of work. The first semester is comprised of primarily research and design of the initial algorithms and test environment. This semester’s milestones are displayed in Figure 6.6 which is a timeline running from the beginning of the semester to the end which is marked 88 with the completion of the initial design documentation making it the final milestone for this semester. 6.3.1.1 Research The first milestones involved the completion of basic research into graphics and compression algorithms. This process took around 3 weeks to get a good enough understanding to quantify it as a milestone even though it technically continues throughout the whole project’s development. Research first focused on gaining knowledge of what vertex and index data is comprised of, as well as how these two data types are used when drawing graphics to screen. Researching vertex and index data also required the group to research and learn the basics of the rest of the graphics pipeline which was accomplished through both online research and a crash course given by Dr. Sumanta Pattanaik. Once the group gained a good foundation on graphics and the data the group was tasked with compressing research turned towards learning about lossless compression algorithms. This involved first learning that basics of how encoding data and compression works to reduce the size while not damaging the data. Then the group focused on different algorithms, first focusing on ones that will work on integer based data as this would be the easiest to prototype with index data. Then turning focus on float compression which turned out to be much more difficult. In the end however enough research was done and enough knowledge gained to mark it as a completed milestone in the project. 6.3.1.2 Testing Environment With initial research completed the group then focused on the development of the testing environment which took just shy of 3 weeks to get set up and running. The design of the environment mainly comprised of the basic ideas the group wanted to implement into the environment. The actual development and coding of the environment was modularized and split among group members to increase the speed at which it was completed. 
Once all group members completed and debugged their sections they were integrated with the others and debugged again as a whole. 6.3.1.3 Index Algorithm Prototype 89 With the environment set up the group’s next milestone was accomplished in just around a week and a half and is marked with the completion of the initial prototype for index data compression and decompression algorithm. As mentioned before this was accomplished using the modified delta encoding algorithm. The implementation of the prototype of this algorithm in code went smoothly and only took around a week with the coding and testing to iron out any errors finishing on November 13th. 6.3.1.4 Vertex Algorithm Prototype Attempt With the prototype for the index algorithm implemented and tested the group then turned towards the design of the vertex algorithm. The group spent almost two weeks attempting to design an algorithm that would work well with vertex data however the group ran into some problems and had to go back to researching different compression methods to create an algorithm that will run efficiently on vertex data. This whole process took around a month of the first semester’s time and will not be completed until the first few weeks of the second semester. The group decided in order to get the final design documentation finished before the end of the semester focus would have to shift towards the completion of the document instead of further work on the vertex algorithm. Figure 6.7: First Semester Milestones: Milestone Timeline of the First Semester of the Project. 6.3.2 Second Semester This semester was comprised of the actual development and optimization of the project’s algorithms and concludes with the presentation of the finalized project at the end of the semester. The milestones for the second semester are displayed 90 in Figure 6.7 which like Figure 6.6 displays a timeline from the start to the end of the semester. 6.3.2.1 Vertex Algorithm Prototype Due to previously mentioned difficulties with the design of the vertex compression algorithm during the first semester the beginning of the second semester focused on getting a prototype vertex compression and decompression algorithm designed and implemented in code. As seen in the timeline there were 3 weeks of semester time allocated towards finishing research and designing an algorithm to be used on vertex information. Some research on compressing vertex data also occurred before the semester started and is not displayed on the timeline. The week after design was focused on implementing the algorithms into code. 6.3.2.2 Optimization of algorithms With both algorithms’ prototypes implemented in code the focus of the group turned to optimizing the algorithms to run faster and more efficiently. This was planned to take the largest amount of time and around 4 to 5 weeks were allocated towards this task. Because of the algorithms implemented this took longer than expected and the time allocated into implementing them onto the GPU got re-allocated into further optimization and research. 6.3.2.3 Implementation on GPU Due to converting the decompression algorithms to use the GPU’s resources being a stretch goal it’s time was re-allocated into further research and development of optimizations for the implemented algorithms. As mentioned before this would have most likely been done using the C++ AMP model to test the algorithm without implementing in shader code. 
Again, because this step was not vital to the completion of the project's goals, the two weeks that had been allocated toward finishing and testing the speed and efficiency on the GPU were redirected into further optimization of the algorithms.

6.3.2.4 Completion of project and project documentation

The remaining time will be spent finalizing the algorithms as well as finishing the final design documentation in order to prepare to present the finished project to AMD and the chosen UCF faculty who will judge the project's outcome. This presentation marks the closing of the project.

Figure 6.8: Second Semester Milestones: Milestone Timeline of the Second Semester of the Project.

Summary/Conclusion

7.1 Design Summary

The goal of this project was to identify lossless compression algorithms that compress vertex and index buffer information. The way the algorithms are being developed, data is first compressed offline and saved to the system's main memory. The compressed data is then loaded into the respective buffers as if it were normal data. When the information is fetched from the buffer it is decompressed using our decompression algorithm and used normally by the rest of the graphics pipeline. As mentioned before, the compression algorithms are designed to run offline, at compile time of the 3D object or the program using the 3D object, in order to avoid time and resource constraints and achieve a better compression ratio. The decompression is designed to be done at runtime on the graphics card when data is fetched from either the index or vertex buffer. The group is using a modified delta encoding algorithm and an implementation of Golomb-Rice to compress and decompress the index data, and implementations of BR and LZO1-1 on the vertex data. The outcome of the project is an increase in the efficiency and speed of graphics cards without heavily modifying existing standards.

7.2 Successes

7.2.1 Initial Research

Throughout the project, the work went very smoothly. Initial research on graphics was greatly sped up with the aid of Professor Sumanta Pattanaik, Todd Martin, and Mangesh Nijasure. Because only one group member had previous experience with graphics, Professor Pattanaik gave a very helpful crash course in the basics of computer graphics programming as well as vertex and index data formatting and their use when drawing graphics to the screen. Through his aid, and with the help of AMD's Todd Martin and Mangesh Nijasure, the group gained a solid understanding of the basics of graphics.

Professor Pattanaik was a large help when the group was trying to understand the basics of computer graphics. He explained the basics of how the graphics pipeline functions, how index and vertex data is used within the pipeline, and how this data is usually structured and formatted. This information was vital to our understanding of how our project fits into the computer graphics environment. He also gave the group some ideas on which algorithms to start looking at. In addition to the research on graphics, the group also had to do a lot of research on data compression, specifically lossless compression. Again with the help of Todd Martin and Mangesh Nijasure, the group was able to research specific algorithms to implement on these types of data and gain a better understanding of what to look for when searching for compression algorithms to use on vertex and index data.
Professor Pattanaik also aided in giving us good ideas about which algorithms to focus more on and which wouldn't be as helpful in the project. 7.2.2 Testing Environment Another milestone that the group completed was the development of the testing environment. The group came together and quickly got it up and running within a few days of its design. Testing with said environment also proved to work very well and made the creation of uniform test data much easier. In addition to making test data easier to gather the way it was designed allows the group to plug in new tests and test data very easily. The group was able to quickly complete most of the milestones that they were aiming for during the first semester. They were able to quickly create the tools that they needed to obtain data for use in testing their algorithms such as the integrity checking tool, the file scanner, and basic performance data analyzer. 7.2.3 Index Compression One of the main success for this project was the development of the index compression and decompression algorithms. By using a modified delta compression algorithm and an implementation of Golomb-Rice the group was able to create algorithms that run very fast and compress data to around half its size. 94 7.2.4 Vertex Compression Even though some of LZO’s algorithm is unknown to us at this time, its performance was very good and gave the best results of the implemented algorithms. BR was implemented to attempt to have a completely open algorithm available to create optimizations on top of to get better results out of. Although overall BR compression is a generally decent compression method, it has some flaws that lead us to believe that LZO compression may be a more suitable algorithm in essentially every category. The first and most major concern is that in all statistical categories, LZO simply outperforms BR encoding. In terms of both compression rate and speed and also decompression rate and speed, LZO provides more satisfactory results. For example, LZO compressed the vertex data an average of 32%, while BR compression only yielded an average of 14% compression. Additionally, in terms of our most valued metric, decompression time, LZO far outperforms BR compression. LZO yields a staggering 606 MB/S on our test dataset, compared to BR’s 210 MB/s decompression rate. Aside from performance metrics, LZO is able to perform decompression without much overhead. In comparison, BR compression requires a hash table to be included in the header for each compression block that is sent through the pipeline. 7.3 Difficulties The first issue the group ran into was the incompatibility of code with some group member’s computers and the testing environment. This was due to some group members using IDEs and different operating systems. One group member was coding the test environment in Code Blocks, while another was coding in Visual Studio, both of these IDEs have compilers which allow for certain syntax rules of the C language to be ignored in favor of easier usability. As a result, when our final group member attempted to compile the environment in the Linux gcc compiler it would not work. The program would fail to compile, despite showing no errors in the IDEs. This issue has since been resolved as all of the syntax violations have been corrected. One of the group’s major difficulties was that the group all had very little experience in working with the graphics pipeline prior to the project’s inception. 
One group member had taken a course concerning computer graphics, but one 95 single-semester course does not provide a full working knowledge of its subject matter. This caused several misconceptions to arise over the course of working on the project. Index buffers were fairly straightforward, however the sample data the group was given contained something which confused us. The index data was formatted in such a way that between certain sets of values was an unsigned value that equated to negative one. This acted as a reset value to tell the graphics card that this was where one graphical object stopped and another began. This caused problems for our initial delta compression algorithm, as negative one is a value that is commonplace in most outputs. We had to format the index data in such a way that when it is being compressed it converted these reset values to an escape character rather than just a regular integer. Vertex buffers proved difficult to understand from the beginning. The group was unsure of whether the different data-types would be consistently present throughout the whole input. The position data is guaranteed to be there, but the group was informed that the color and normal vector data was not always present. The group was not sure if that meant that some inputs would lack these values, or that certain vertices in a single input would have it and some of them would not. The group discovered it was the latter which makes our input data very inconsistent and therefore more difficult to compress. The input itself contained a whole other issue to tackle. Some of the values were formatted into exponential form. This required us to read in the values as strings, and then create a parser to change the values from strings into floats. The group also had difficulties with our version control system on GitHub. None of us had used Git before this semester began. This meant the group had to appropriately acquaint ourselves with the user interface and the commit system. The main issues came from transitioning to a new system, since the group was previously using Dropbox as our version control. Dropbox synced automatically and was fairly easy to navigate. When the group first committed changes it seemed like the interface would display all of the shared folder’s contents, similar to how Dropbox presents the shared data as a folder. In reality, the system would simply display what items had been changed after each commit. The actual contents of the folder were located in the Git workspace located on each person’s hard drive. 96 Most of the other difficulties were small misconceptions the group had based on certain aspects of the projects. At one point, a group member had the idea that compressing the index and vertex buffers at run time was a non-negotiable requirement of the project. The group decided this was a stretch goal but not vital, since the group wanted the highest compression ratio from the algorithm rather than the fastest compression time. Another misconception was that the index and vertex Buffers were filled simply by what the user displayed on their screen. In reality, what goes in the buffers is handled by shader code, which dictates the behaviors of a camera-like entity pointed at the graphical object. The group was also unsure whether our algorithm would be designed around parallelism. Parallel algorithms have to be designed from the ground up in a certain way, so knowing the answer to this question early was vital to progressing with the project. 
In the end our sponsor decided it was better just to deliver the base algorithm, so that they could possibly expand upon it later. Most, if not all, of these types of problems were handled by simply contacting the project’s sponsor, AMD. Even when they did not have an immediate solution to our problem, the group was easily able to work things out through discussion and compromise. 7.3.1 Vertex Compression Difficulties The largest difficulty the group had during this semester was the research and design of a vertex data compression and decompression algorithm. Many existing algorithms are not lossless and are not easily implemented which caused the group to have to rethink the design many times. The group chose to commit to two different algorithms, BR being a predictive algorithm and LZO being based on the popular LZ77 compressor. The main issue with vertex compression is the non-uniformity of the data. Every time a potential algorithm was developed a test case would be found that would render it useless and as a result unfit for the project’s goals. 97 7.3.2 Future Improvements These are ideas, concepts, and tweaks for the project that the group was not able to implement in the time that they had. Although time constrained them from implementing these things, these possibilities may be beneficial to explore. Other methods of optimizing the vertex data for storage were researched, such as methods for converting vertex information like color data into tables representing them more efficiently. The group originally considered using a C++ parallelization library named C++ Accelerated Massive Parallelism (C++ AMP) which allows us to quickly write code that will run on the GPU without actually writing shader code or implementing hardware on the graphics card itself. It is not easily apparent when evaluating a compression algorithm whether it is able to be parallelized easily; much analysis of most algorithms must be done. Although the group did not have time to evaluate their algorithms for parallelizability, it is an important aspect for an algorithm to have. Another system that the group believes will have a performance benefit is the ability to switch between the use of different algorithms based on how the data was encoded. The idea is that any number of compression algorithms can be used in tandem to best encode the data when running offline; not just one has to be used. However the compression program will not initially be aware of the qualities of the data is trying to analyze, so the program must familiarize itself with the patterns that exist within the data first. It will attempt to scan through the data and produce a score of what it thinks the best algorithm may be for storage. Many different methods can be devised to perform this functionality, although the group has not implemented this into their project yet. One possible and simple method is to take a small section of the data it is looking at and attempting to compress it using all of the different algorithms it has available. It can then use the algorithm that compressed the sample the most efficiently to compress the entire file. Alternatively, a non-heuristic method can be used where the program will simply compress the file using all of the available compression algorithms, and use the one that yields the best results. The group is hesitant to use this implementation however as there many significant performance penalties from using it. 
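The group did not implement this selection step, but a minimal sketch of the sampling heuristic described above might look like the following, reusing the illustrative codec_t plug-in type from the Section 5.5.2 sketch; the sample size and all names are assumptions.

    #include <stddef.h>
    #include <stdint.h>

    /* Same illustrative plug-in type as in the Section 5.5.2 sketch. */
    typedef struct {
        const char *name;
        size_t (*compress)(const uint8_t *in, size_t in_len, uint8_t *out);
        size_t (*decompress)(const uint8_t *in, size_t in_len, uint8_t *out);
    } codec_t;

    /* Compress only a small prefix of the data with every available codec
       and return whichever produced the smallest output.  The 4096-byte
       sample size is arbitrary; `scratch` must be large enough to hold
       the compressed sample for any codec. */
    const codec_t *pick_codec(const codec_t *codecs, size_t num_codecs,
                              const uint8_t *data, size_t n, uint8_t *scratch)
    {
        size_t sample = n < 4096 ? n : 4096;
        const codec_t *best = NULL;
        size_t best_len = (size_t)-1;

        for (size_t i = 0; i < num_codecs; i++) {
            size_t len = codecs[i].compress(data, sample, scratch);
            if (len < best_len) {
                best_len = len;
                best = &codecs[i];
            }
        }
        return best;
    }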
With multiple algorithms compressing data that is all sent through the same channel, a problem arises at the decompression step: it is no longer known in advance which algorithm was used to compress the data, so a way to differentiate between the different types of compressed content must be included in the file. The way the group chose to implement this was to include a header before every graphical object that is passed through the buffer. A possible further optimization is to rearrange the contents of the index buffer during transport: all objects compressed with the same algorithm could be grouped in the same location, with a single header describing the contents of that group as a whole.

In addition to dynamic anchor points, the group plans to test another optimization technique in which a variable holds the current value that has been decompressed. This allows the decompression algorithm to "remember" where it was in the list; instead of decoding the entire list again from the closest anchor point, it can simply continue where it left off whenever the requested data is further down the list.
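A rough sketch of the "continue where it left off" idea is shown below, assuming a plain signed-delta index stream. The decode_cursor structure and function names are hypothetical, and the reset-value escape handling from the earlier sketch is omitted for brevity.

    #include <stdint.h>
    #include <stddef.h>

    /* Decompression cursor: remembers the last position decoded and the running
     * value there, so a request for a later element can continue from this point
     * instead of re-decoding from the nearest anchor point. */
    struct decode_cursor {
        size_t   next;     /* next encoded element to consume         */
        uint32_t previous; /* reconstructed value at position next - 1 */
    };

    /* Decode forward until `target` (inclusive) and return the reconstructed
     * index value there.  Only valid when `target` is at or beyond the cursor;
     * a request behind the cursor would fall back to the nearest anchor point. */
    static uint32_t decode_up_to(struct decode_cursor *c,
                                 const int32_t *deltas, size_t target)
    {
        while (c->next <= target) {
            c->previous += (uint32_t)deltas[c->next]; /* accumulate the next delta */
            c->next++;
        }
        return c->previous;
    }

    /* Usage: initialize a cursor as {0, 0} at an anchor point, then repeated
     * calls with increasing targets reuse the work already done. */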
Appendices

8.1 Copyright

From: Nijasure, Mangesh <Mangesh.Nijasure@amd.com>
Date: Mon, Dec 1, 2014 at 3:04 PM
Subject: RE: Diagram Copyright Permission
To: Brian Estes <bestes258@gmail.com>, "Martin, Todd" <Todd.Martin@amd.com>
Cc: Alex Berliner <alexberliner@gmail.com>, Samuel Lerner <simolias@gmail.com>

You can use any of the diagrams I presented from the slides shown in class, just include the citations (always good practice) I had citations to MSFT in the slides you can just use those. Any information from the AMD website can also be used along with the appropriate citation as well.

Mangesh Nijasure

From: Brian Estes [mailto:bestes258@gmail.com]
Sent: Sunday, November 30, 2014 6:58 PM
To: Martin, Todd; Nijasure, Mangesh
Cc: Alex Berliner; Samuel Lerner
Subject: Diagram Copyright Permission

8.2 Datasheets

Figure 8.1: Specifications for the R9 series of Graphics Cards [2]. Reprinted with permission.
(Columns, left to right: R9 295X2, R9 290X, R9 290, R9 285, R9 280X, R9 280, R9 270X, R9 270)

GPU Architecture: 28nm for all eight models
API Support: DirectX 12, Mantle, OpenGL 4.3, OpenCL for all eight models
PCI Express Version: 3 for all eight models
GPU Clock Speed: Up to 1018 / 1000 / 947 / 918 / 1000 / 933 / 1050 / 925 MHz
Memory Bandwidth: Up to 640 / 352 / 320 / 176 / 288 / 240 / 179.2 / 179.2 GB/s
Memory Amount: Up to 8 / 8 / 4 / 4 / 3 / 3 / 4 / 2 GB GDDR5
Stream Processing Units: Up to 5632 / 2816 / 2560 / 1792 / 2048 / 1792 / 1280 / 1280
Required Power Supply Connectors: 2 x 8-pin / 1 x 6-pin + 1 x 8-pin / 1 x 6-pin + 1 x 8-pin / 2 x 6-pin / 1 x 6-pin + 1 x 8-pin / 1 x 6-pin + 1 x 8-pin / 2 x 6-pin / 1 x 6-pin

Figure 8.2: Sample Index Data

Figure 8.3: Sample Vertex Data

8.3 Software/Other

In developing our testing environment we tried many different IDEs, including Microsoft Visual Studio and Code::Blocks. Because we wanted the testing environment to work across all of our computers (some of us use Linux-based systems and others use Windows-based systems), we ended up using plain text editors such as Sublime Text 2 and 3 and Notepad++, and compiling the project on the command line with GCC to ensure code compatibility. To keep the project up to date across our computers, we are using Git for version control of our code. Our documents are kept in Google Drive, which allows us to write the required documents at the same time while keeping a unified minutes log and TODO list.

In order to obtain test data we plan on using a program called AMD Perf-Studio. This program works exclusively on AMD GPUs, so we may need to procure one in the future, as none of us currently uses one in our systems. With this program we can pause a video game or 3D program and get a printout of the buffers on the GPU at that time.

Bibliography

[1] P. H. Chou and T. H. Meng, "Vertex Data Compression through Vector Quantization," IEEE Transactions on Visualization and Computer Graphics, 8(4):373–382, 2002.
[2] "AMD Radeon™ R9 Series Graphics," AMD. Web. 22 Nov. 2014. http://www.amd.com/en-us/products/graphics/desktop/r9
[3] "Compressed Vertex Data." http://blogs.msdn.com/b/shawnhar/archive/2010/11/19/compressed-vertexdata.aspx
[4] "Vertex and Fragment Shaders." http://www.adobe.com/devnet/flashplayer/articles/vertex-fragmentshaders.html
[5] "The Basics of C Programming." http://computer.howstuffworks.com/c10.htm
[6] "What is PCI Express? A Layman's Guide to High Speed PCI-E Technology." http://www.directron.com/expressguide.html
[7] "Vertex Attributes." https://www.opengl.org/sdk/docs/tutorials/ClockworkCoders/attributes.php
[8] "Rendering from Vertex and Index Buffers." https://msdn.microsoft.com/en-us/library/windows/desktop/bb147325%28v=vs.85%29.aspx
[9] "Symbol Tables." http://introcs.cs.princeton.edu/java/44st/
[10] "IEEE Standard 754." http://steve.hollasch.net/cgindex/coding/ieeefloat.html
[11] "Compression in the Graphics Pipeline." http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.296.6055&rank=3
[12] "Build Software Better, Together," GitHub. Web. 22 Nov. 2014. https://github.com/pricing
[13] "Float Masks." http://www.mcs.anl.gov/papers/P5009-0813_1.pdf
[14] S. W. Golomb, "Run-Length Encodings," IEEE Transactions on Information Theory, 12(3):399, 1966.
[15] "Run-Length Encoding." http://rosettacode.org/wiki/Run-length_encoding
[16] "Delta Encoding." http://www.dspguide.com/ch27/4.htm
[17] "Huffman Coding." http://rosettacode.org/wiki/Huffman_coding
[18] "C++ AMP." https://msdn.microsoft.com/en-us/library/hh265137.aspx
[19] "Improving Floating Point Compression through Binary Masks." http://www.mcs.anl.gov/papers/P5009-0813_1.pdf
[20] "LZO." http://www.oberhumer.com/opensource/