Design Document - Department of Electrical Engineering and Computer Science

Novel Algorithms for Index & Vertex
Data Compression and
Decompression
Authors:
Alex Berliner
Brian Estes
Samuel Lerner
Contributors:
Todd Martin
Mangesh Nijasure
Dr. Sumanta Pattanaik
Sponsors:
Table of Contents
Executive Summary ............................................................................. 0
Project Overview.................................................................................. 4
2.1 Identification of Project ................................................................................ 4
2.2 Motivation for Project................................................................................... 8
2.2.1 Alex ...................................................................................................... 9
2.2.2 Brian ..................................................................................................... 9
2.2.3 Sam .................................................................................................... 10
2.3 Goals and Objectives ................................................................................ 11
2.3.2 Testing Environment Objectives ......................................................... 13
2.3.3 Algorithm Development Objectives .................................................... 14
2.4 Specifications ............................................................................................ 15
2.4.1 Index Compression Specifications ..................................................... 15
2.4.2 Compression Specifications ............................................................... 15
2.4.3 Decompression Specifications ........................................................... 17
2.5 Space Efficiency ........................................................................................ 19
2.6 Requirements ............................................................................................ 19
2.6.1 Overall requirements .......................................................................... 19
2.6.2 Compression Requirements ............................................................... 22
2.6.3 Decompression Requirements ........................................................... 23
Research ........................................................................................... 25
3.1 Data types ................................................................................................. 25
3.2 Graphics .................................................................................................... 26
3.2.1 General Graphics pipeline .................................................................. 26
3.2.2 Index buffer ........................................................................................ 29
3.2.3 Vertex buffer ....................................................................................... 29
3.3 Index Compression Research ................................................................... 31
3.3.1 Delta Encoding ................................................................................... 31
3.3.2 Run Length Encoding ......................................................................... 33
3.3.3 Huffman Coding ................................................................................. 34
3.3.4 Golomb-Rice ...................................................................................... 36
3.4 Vertex Compression Research.................................................................. 37
3.4.1 Statistical Float Masking ..................................................................... 38
3.4.2 BR Compression ................................................................................ 39
3.4.3 LZO Compression .............................................................................. 42
3.5 Additional Research .................................................................................. 43
3.5.1 Testing Environment Language: C vs. C++ ........................................ 43
3.5.2 AMP code ........................................................................................... 43
Design Details ................................................................................... 44
4.1 Initial Design .............................................................................................. 44
4.1.1 Offline Compression ........................................................................... 44
4.1.2 Online Compression ........................................................................... 46
4.2 Testing Environment ................................................................................. 47
4.2.1 Initial Environment Design .................................................................. 48
4.2.2 Data Recording .................................................................................. 48
4.2.3 Scoring Method .................................................................................. 49
4.2.4 Dataset Concerns ............................................................................... 50
4.3 Index Compression ................................................................................... 51
4.3.1 Delta Encoding ................................................................................... 51
4.3.2 Other Considered Algorithms ............................................................. 52
4.3.3 Delta Optimization .............................................................................. 53
4.3.4 Golomb-Rice ...................................................................................... 55
4.4 Vertex Compression .................................................................................. 56
Build, Testing and Evaluation Plan .................................................... 58
5.1 Version Control.......................................................................................... 58
5.1.1 What is Version Control ...................................................................... 59
5.1.2 Choosing a Version Control System ................................................... 60
5.2 Test Runs .................................................................................................. 63
5.3 Index algorithm development .................................................................... 64
5.3.1 Delta Encoding ................................................................................... 64
5.3.2 Golomb-Rice Encoding....................................................................... 68
5.3.3 Index Compression comparison ......................................................... 71
5.4 Vertex algorithm development ................................................................... 72
5.4.1 Test Data ............................................................................................ 72
5.4.2 Vertex Algorithm Implementation ....................................................... 73
5.4.3 Vertex Compression comparison ....................................................... 78
5.5 Test Environment ...................................................................................... 79
5.5.1 File Reader ......................................................................................... 79
5.5.2 Compression and Decompression algorithms .................................... 80
5.5.3 Testing code ....................................................................................... 81
5.5.4 Data Printer ........................................................................................ 82
Administrative Content....................................................................... 83
6.1 Consultants ............................................................................................... 83
6.1.1 AMD ................................................................................................... 83
6.1.2 Dr. Sumanta N. Pattanaik ................................................................... 83
6.1.3 Dr. Mark Heinrich ............................................................................... 84
6.1.4 Dr. Shaojie Zhang .............................................................................. 84
6.2 Budget ....................................................................................................... 84
6.2.1 A Graphics Processing Unit. ............................................................... 84
6.2.2 Version control ................................................................................... 85
6.2.3 Algorithm Licenses ............................................................................. 86
6.2.4 Document Expenses .......................................................................... 86
6.2.5 Estimated Expenditures...................................................................... 87
6.2.6 Actual Expenditures............................................................................ 88
6.3 Project Milestones ..................................................................................... 88
6.3.1 First Semester .................................................................................... 88
6.3.2 Second Semester ............................................................................... 90
Summary/Conclusion......................................................................... 93
7.1 Design Summary ....................................................................................... 93
7.2 Successes ................................................................................................. 93
7.2.1 Initial Research ................................................................................... 93
7.2.2 Testing Environment........................................................................... 94
7.2.3 Index Compression............................................................................. 94
7.2.4 Vertex Compression ........................................................................... 95
7.3 Difficulties .................................................................................................. 95
7.3.1 Vertex Compression Difficulties .......................................................... 97
7.3.2 Future Improvements ......................................................................... 98
Appendices ...................................................................................... 100
8.1 Copyright ................................................................................................. 100
8.2 Datasheets .............................................................................................. 100
8.3 Software/Other ........................................................................................ 102
Bibliography ..................................................................................... 103
List of Figures
Figure 1.1: PCI-E Speeds: A table detailing the speeds of the various
versions of the PCI-E bus. .................................................................... 1
Figure 2.1: Providing an index number 3 to an array to retrieve the corresponding
value, ‘d’ ............................................................................................................... 4
Figure 2.2: Vertices Form Triangle: An illustration of three vertices coming
together to form a triangle. ................................................................................... 5
Figure 2.3: Vertex Data, before and After Indexing: A demonstration of how much
space can be saved with indexing. ....................................................................... 6
Figure 2.4: Graphical Object: An example of a graphical object, specifically a
square, formed by two triangles. Reprinted with permission. ............................... 7
Figure 2.5: Vertex Buffer: A sample vertex buffer shown with the corresponding
vertices it is describing. Reprinted with permission. ............................................. 7
Figure 2.6: Index Buffer: A sample index buffer generated using a vertex buffer.
Reprinted with permission. ................................................................................... 8
Figure 2.7: How performance is expected to be optimized ............................... 12
Figure 2.8: Compressed Objects: Three compressed objects in the space of one
uncompressed object. ........................................................................................ 13
Figure 2.9: Delta Compression on Floats: This demonstrates why float values
cannot be compressed using delta compression. ............................................... 18
Figure 2.10: The process of hook code being injected into a program being
performed. .......................................................................................................... 20
Figure 2.11: Graphical Errors: Severe graphical errors caused by incorrectly
drawn vertices. ................................................................................................... 20
Figure 2.12: Offline Compression ....................................................................... 21
Figure 2.13: Online Compression ....................................................................... 22
Figure 3.1: Floating Point Format: The number 0.15625 is represented in the 32-bit floating point format. ...................................................................... 26
Figure 3.2: Example of Vertex Buffer being used and reloaded 3 times. ............ 27
Figure 3.3: The Graphics Pipeline: Illustration of where vertex data fits into the
graphics pipeline................................................................................................. 28
Figure 3.4: Indices A and B: Index A is shown pointing to Vertex A, and Index B
is shown pointing to Vertex B. ............................................................................ 29
Figure 3.5: Index and Vertex Interaction: Diagram detailing the interaction
between index and vertex buffers. ...................................................................... 31
Figure 3.6: Delta Encoding: Demonstration of the compression and
decompression process associated with Delta encoding. .................................. 32
Figure 3.7: Run Length Encoding: Quick transformation of a sequence into a
compressed form using run-length encoding. ..................................................... 34
Figure 3.8: Making Change: A greedy algorithm, this algorithm tries to use the
fewest number of coins possible when making change. ..................................... 35
Figure 3.9: Huffman Coding: An example of the kind of tree used in Huffman
encoding, accompanied by sample data being compressed. ............................. 36
Figure 3.10: XOR Operator: A quick equation to demonstrate the XOR Operator
........................................................................................................................... 39
Figure 3.11: XOR Operator: A quick equation to demonstrate the XOR Operator
........................................................................................................................... 40
Figure 3.12: Leading Zero Compression: The zeroes at the beginning of a binary
number are replaced with a single binary number counting the zeroes. ............. 40
Figure 3.13: FCM generation and prediction ...................................................... 41
Figure 3.14: DFCM generation and prediction .................................................... 42
Figure 4.1: Header Data: An example of how header would be applied for
dynamically applying the compression algorithms. ............................................. 45
Figure 4.2: Graphics Pipeline with Compression: Two possible configurations of
the graphics pipeline after our compression and decompression algorithms have
been added......................................................................................................... 47
Figure 4.3: Checksum Functions: A checksum function will return a vastly
different value even with similar input data. ........................................................ 49
Figure 4.4: Checksum Usefulness: Demonstration of how a checksum alerts the
program that data has been changed. ................................................................ 49
Figure 4.5: Score Equations for testing environment. ......................................... 50
Figure 4.6: Example of different data not working at same efficiency on same
algorithm. ............................................................................................................ 51
Figure 4.7: Run Length + Delta: Example of running Run Length encoding on top
of Delta encoding................................................................................................ 53
Figure 4.8: Example Showing Benefit of Dynamic Anchor Points with Escape
Codes ................................................................................................................. 54
Figure 4.9: Example Showing Benefit of Dynamic Anchor Points with No Escape
Codes ................................................................................................................. 55
Figure 5.1: Version Control: A file being changed and merged in a generic form
of version control. ............................................................................................... 60
Figure 5.2: Version control pros / cons: The different pros and cons of each kind
of version control. ............................................................................................... 61
Figure 5.3: Index Buffer Delta Compression: Example of compressing index
buffer data using Delta Encoding........................................................................ 65
Figure 5.4: Delta RLE file size change ............................................................... 66
Figure 5.5: Delta RLE Compression and Decompression Time ......................... 66
Figure 5.6: Delta RLE Normalized Compression Speeds ................................... 67
Figure 5.7: Delta RLE Compression rates of different test files .......................... 67
Figure 5.8: Delta RLE Test Run Histogram ........................................................ 68
Figure 5.9: Golomb-Rice file size change ........................................................... 69
Figure 5.10: Golomb-Rice Compression and Decompression Time ................... 69
Figure 5.11: Golomb-Rice Normalized Compression Speeds ............................ 70
Figure 5.12: Golomb-Rice Compression rates of different test files.................... 70
Figure 5.13: Golomb-Rice Test Run Histogram .................................................. 71
Figure 5.14: Comparison between Delta-RLE and Golomb-Rice Compression
Rates .................................................................................................................. 72
Figure 5.15: LZO File size changes .................................................................... 74
Figure 5.16: LZO Compression and Decompression times ................................ 74
Figure 5.17: LZO normalized compression speeds ............................................ 74
Figure 5.18: LZO Compression rates of different test files ................................. 75
Figure 5.19: LZO test run histogram ................................................................... 75
Figure 5.20: BR size changes............................................................................. 76
Figure 5.21: BR Compression and Decompression times .................................. 76
Figure 5.22: BR normalized compression rate, measured in MB/S .................... 77
Figure 5.23: BR Compression rates of different test files ................................... 77
Figure 5.24: BR test run histogram ..................................................................... 78
Figure 5.25: Comparison between Delta-RLE and Golomb-Rice Compression
Rates .................................................................................................................. 79
Figure 5.26: Example Testing Environment Output: Example output produced by
our testing environment, including the performance measures. ......................... 82
Figure 5.27: Additional Testing Environment Output: Full performance metrics
used for determining algorithm statistics. ........................................................... 82
Figure 6.1: AMD R9 Graphics Cards: A side-by-side price and performance
comparison. More information on this series of graphics cards is provided in the
appendices. Reprinted with permission. ............................................................. 85
Figure 6.2: GitHub Personal Plans: The potential cost of a subscription to a
GitHub personal account. ................................................................................... 85
Figure 6.3: GitHub Organization Plans: The potential cost of a subscription to a
GitHub organization account. ............................................................................. 86
Figure 6.4: The Spot Pricing: Quote detailing the cost to print a document. ....... 87
Figure 6.5: Estimated Expenditures Pie Chart. ................................................... 87
Figure 6.6: Actual Expenditures Pie Chart .......................................................... 88
Figure 6.7: First Semester Milestones: Milestone Timeline of the First Semester
of the Project. ..................................................................................................... 90
Figure 6.8: Second Semester Milestones: Milestone Timeline of the second
Semester of the Project. ..................................................................................... 92
Figure 8.1: Specifications for the R9 series of Graphics Cards [2] Reprinted with
permission. ....................................................................................................... 100
Figure 8.2: Sample Index Data ......................................................................... 102
Figure 8.3: Sample Vertex Data ....................................................................... 102
Executive Summary
Modern graphics cards are constantly performing a tremendous amount of work
to maintain the frame rate and visual fidelity expected of current-generation
games and other graphical applications. Graphics cards have become
powerhouses of computational ability, with modern cards boasting thousands of
cores and an amount of onboard random access memory (RAM) comparable to
the host system itself. It would not be unreasonable to posit that modern
computers are really two computational systems in one, with the main processor
and graphics processor rapidly communicating with each other to provide the
visual experience that users have come to expect.
Some obstacles, however, can negatively impact communication with the GPU.
Since the design of modern computers is one that ultimately prefers modularity
and a degree of user freedom over brute efficiency, the role of the graphics card
has been relegated to an optional peripheral that exists on an external bus
relatively far away from other critical system resources. This configuration
complicates the process of transferring data between the computer and graphics
card, necessitating a transfer bus that is extremely fast and efficient, with an
enormous throughput. The bus used today for this purpose is known as the
Peripheral Component Interconnect Express (PCI-E) and it provides the amount
of data throughput that a graphics card needs to function. The version of this bus
that current graphics cards run on, PCI-E v3.0, is capable of transferring almost
16 GB of data every second, with version 4.0 supporting twice that amount.
PCI Express Version    Bandwidth (16-lane)    Bit Rate (16-lane)
1.0                    4 GB/s                 40 GT/s
2.0                    8 GB/s                 80 GT/s
3.0                    ~16 GB/s               128 GT/s
4.0                    ~32 GB/s               256 GT/s
Figure 1.1: PCI-E Speeds: A table detailing the speeds of the various
versions of the PCI-E bus.
Being transferred over the bus, among other things, are the data that the graphics card requires for all objects that are to be drawn on the screen, known as the index and vertex data. Even with the extreme speed of the bus these graphics cards use, a bottleneck exists where the speed provided to transfer this amount of data is not sufficient. The impetus of this project was the desire to determine whether any kind of advantage could be gained from compressing the contents of the index and vertex data on the CPU side before sending it through the buffers and on to the GPU, where it would then be decompressed
using GPU resources. The compression algorithm must achieve a high ratio of
compression and be made in such a way that it is able to be decompressed
quickly. The decompression algorithms that accompany these compression
algorithms are required to be able to rapidly decode the two buffers so that they
may be passed on to the rest of the graphics pipeline with minimal delay. It was also hoped that these algorithms would be implemented on current graphics cards to increase the amount of data that these cards are able to receive in a given period. Although the aim of this project was not to physically increase the speed of the bus that the GPU runs on, it is hoped that the effective increase in the transfer speed of the compressed data relative to the uncompressed data outweighs the performance hit that the constant decompression of resources will incur.
The overall goal of this project was to implement lossless, efficient algorithms
designed to compress the data in the index and vertex buffers of the graphics
pipeline. Our first objective was to conduct research to establish and solidify any background knowledge the group needed to complete the project. The group started by researching the graphics pipeline to gain a better understanding of the data the group would be working with. Next, the group moved to researching existing lossless compression algorithms, to identify a first round of algorithms that would work well.
Once the group had finished the research for the project, the group moved on to
coding the testing environment. The group began by setting up a way for the
program to receive the input. In this case the group used a file reader, because the group would be given sample testing data in a text file. Then the group needed to design functions that would collect performance metrics to demonstrate the effectiveness and efficiency of our algorithms. Finally, the group had to implement a checksum to ensure that the data the group decompressed was the same as the data the group originally compressed.
Once the testing environment was completed, the group began work on testing and writing compression algorithms. The group began with algorithms to compress the index data, because it is consistent in what it describes and has uniform formatting, making it easier to manage. The group then moved on to the algorithm that compresses the vertex data. Because its format varies and it describes a set of attributes rather than just one, the group decided to attempt it later in the course of the project.
Over the course of the project, the group developed and tested many algorithms for compressing both the index and vertex buffers. The algorithms tested for compressing the index buffers were a pass of delta encoding followed by run-length encoding, as well as Golomb-Rice coding. Huffman coding was researched, but the group decided not to implement it.
Early progress in the project focused on index compression. As a result, less research was done on compression for the vertex data than for the index data. Several methods of compressing the vertex data, such as the Burrows-Wheeler Transform, were researched but deemed not to have enough potential to be worth implementing and testing. Additionally, other methods of optimizing the vertex data for storage were researched, such as methods for converting vertex information like color data into tables that represent it more efficiently.
Project Overview
2.1 Identification of Project
When a graphics card displays a 3D image to a computer screen, a large amount
of data is being transferred from the system’s memory into the graphics card’s
memory. This information includes data describing every vertex in the object,
texture information to display, and index information.
An index for vertex data works the same as an index for an array in computer
programming. Instead of storing numbers as individual named values in a
computer, a chunk of memory is reserved whose size is equal to a multiple of the
size of the data that is being stored. To access an element that is stored in an
array, the index number is used to go that many elements down the list of elements in the array and pull out that number. For example, in Figure 2.1, the index number 3 is being requested from the array. Elements 0-2 are skipped in the list and the element located at index 3 is returned to the user. An index represents stored data as a single number, and that number corresponds to an address somewhere in memory. This makes referencing the data easier and takes up less space in the
long run.
Figure 2.1: Providing an index number 3 to an array to retrieve the
corresponding value, ‘d’
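As a minimal illustration in code (C++ is assumed here, and in the other illustrative examples in this document, purely for the sake of the examples), the lookup in Figure 2.1 looks like the following:

#include <iostream>

int main() {
    // The array from Figure 2.1: each element sits at a fixed offset in memory.
    char array[] = {'a', 'b', 'c', 'd', 'e'};

    // Requesting index 3 skips elements 0-2 and returns the element stored there.
    int index = 3;
    std::cout << array[index] << std::endl;  // prints 'd'
    return 0;
}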
A vertex in computer graphics is very similar to the commonplace geometrical
term. It is a single point in a graphical environment that, when combined with
other points, makes up a shape. Typically one connects three vertices to form a
triangle, because triangles can be combined to form any complex geometrical
shape, as shown in Figure 2.2. A graphics card will read in three vertices at a
time so that it can form other shapes using these kinds of triangles. Once the
card has formed the triangle, it chops it up into tiny pieces in order to transfer it
through the graphics pipeline. It then reforms the pieces after textures and
shaders have been applied and fits it into a larger graphical object.
Figure 2.2: Vertices Form Triangle: An illustration of three vertices coming
together to form a triangle.
Figure 2.3: Vertex Data, before and After Indexing: A demonstration of how
much space can be saved with indexing.
The data describing the objects being drawn is only getting bigger and more complex. As computer graphics continue to attempt to mirror
reality more closely, an increasing amount of data has to be sent through the
graphics pipeline for processing. Objects have to be created using an
exponentially growing number of polygons in order to increase their fidelity.
Textures for the objects have to be larger, so that when they are wrapped on an
object and inspected at a high resolution they don’t show any tearing or
unrealistic patterns. Because of how visually complex the world around us can
be, graphics developers are constantly attempting to go to new and astounding
lengths in order to display even the tiniest details correctly.
The faster that the GPU can get through information, the faster it can display it to
the screen and the better it will run. Therefore it was decided that compression algorithms were required for two portions of the graphics pipeline that help to describe graphical objects. Graphical objects are typically formed using triangles of various forms and sizes. An example of an object that is composed of triangles in this way is the square shown in Figure 2.4. These triangles are made up of three different vertices.
Figure 2.4: Graphical Object: An example of a graphical object, specifically
a square, formed by two triangles. Reprinted with permission.
The first item which requires compression is the vertex buffer. The vertex buffer
contains many different types of information which all work together to describe a
single vertex of a graphical object. Figure 2.5 demonstrates the vertex buffer storing the position data of the vertex in the form of a set of Cartesian (x,y) coordinates.
Figure 2.5: Vertex Buffer: A sample vertex buffer shown with the
corresponding vertices it is describing. Reprinted with permission.
The second item which requires compression is the index buffer. The index
buffer is itself a form of compression which maps several values in the vertex
buffer to an index. Rather than fill the vertex buffer with repeated information
about the same vertex, the graphics pipeline simply reads which vertex it has to
render next in the index buffer and searches for the corresponding information
stored at an address within the vertex buffer. Figure 2.6 shows an example of both the index and vertex buffers, side by side.
Figure 2.6: Index Buffer: A sample index buffer generated using a vertex
buffer. Reprinted with permission.
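To make the relationship concrete, the following is a minimal sketch of the two buffers that could describe the square from Figure 2.4. The vertex here is simplified to a 2D position only; real vertex formats carry more attributes such as color, normals, and texture coordinates.

#include <cstdint>

// A simplified vertex holding only a 2D position, as in Figure 2.5.
struct Vertex {
    float x;
    float y;
};

// Four unique vertices describe the corners of a square.
Vertex vertexBuffer[] = {
    {0.0f, 0.0f},  // 0: bottom-left
    {1.0f, 0.0f},  // 1: bottom-right
    {1.0f, 1.0f},  // 2: top-right
    {0.0f, 1.0f},  // 3: top-left
};

// The index buffer describes two triangles by referring back to the
// vertex buffer, so the shared corners are stored only once.
uint32_t indexBuffer[] = {
    0, 1, 2,   // first triangle
    0, 2, 3,   // second triangle
};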
2.2 Motivation for Project
This project’s main motivation is the ever-growing need for efficiency and speed in the world of graphics and simulation. With 3D graphics becoming more and more advanced, the objects being drawn to the screen are only getting more complex. This makes the data that describes these objects larger, and as a result more data must go through the GPU at one time to draw the object to the scene. With PCI-E transfer speeds not increasing at the same rate as graphics complexity, there is a need to find a way to transfer more data at one time over the same data lanes that are currently being used. This is the main drive for our project: to compress the data that is transferred into the buffers and as a result
allow more data to transfer into the buffers at one time. This will allow for more
complex objects to be created and more objects at one time to be loaded into the
buffer and as a result will increase the performance of the graphics card.
This project offered the group a huge opportunity to influence a very interesting
and active field. We all play videogames and AMD is a huge name in the video
game world, providing many processors and video cards for computers and even
most video cards for consoles. In addition to the ability to work with AMD this
project also gives us a great opportunity to learn about the graphics pipeline and
how index and vertex data for 3D objects are formatted and used to draw what we see on the screen. Another motivation for the group is our interest in data compression and how it works. Many people use programs like 7-Zip and WinZip to compress files, but this project lets us gain a basic understanding of how compression works and how it lowers file size while still keeping the data that was originally there.
2.2.1 Alex
The evolution of graphics cards and graphics drivers has interested me for a long
time. As far as general use software goes, no applications are more complicated
than those that utilize both the graphics card and the processor of a computer. A
platform’s support for graphics cards is a major factor in it becoming widely
accepted on the desktop, which I am greatly interested in changing. I believe that
the development of better cross-platform tools for graphics cards, such as
OpenCL and AMD’s own Mantle, will lead to a wider rate of adoption of the Linux
platform for every-day computing. As a programmer and computing enthusiast in
general, I think that having a viable open-source alternative to Microsoft Windows
and Apple’s Mac for desktop is a very important source of competition. I would
like to begin learning about how graphics cards function so that I can, among
other things, contribute to this vision.
In addition to wanting to expand the horizons of Linux on the desktop, I’ve also
always wanted to understand the inner workings of a graphics card. In school,
I’ve learned the rudimentary ideas associated with how the CPU functions, but
I’ve always wanted the chance to learn how the GPU functions as well. To an outsider, the way that a computer can even draw 3D objects on a screen with such ease seems like magic, and finding an opportunity to work with people from AMD who can share their insight on how these systems work is invaluable to me.
Finally, as someone who often plays video games on PC, the opportunity to
contribute to the video games industry is a novel opportunity for me. The
concepts of video game graphics can also be used in many different fields. A
field that I take some interest in is the emerging virtual reality craze. Virtual reality
requires very powerful GPU’s, and VR can be used for many things in addition to
just playing video games for leisure; it can also be used as a tool for therapy or training, for example for those undergoing physical therapy, those with disorders such as agoraphobia, and those in the military practicing dangerous or complex tasks.
2.2.2 Brian
I decided to take Senior Design in order to prepare myself for the professional
world of Computer Science. I wanted some intellectual background in the field I
would be starting my career in. I also wanted an experience I could point to when
future employers asked what prepared me to work at their company.
So, when I was offered a chance to work with a high profile graphics hardware
company, I happily accepted. The things I could learn while working with AMD go
far beyond just learning about compression algorithms and GPU’s. I could learn
industry standards, the software development life cycle, and what it’s like to work
in an office with professionals in my field. In many ways I would be getting a full
tour through the future of my career.
That is not to say, however, that my interest in computer graphics is nonexistent. I
have been curious for a long time about what was involved in the way a graphics
card functions. Rendering three-dimensional objects takes a lot of processing
power in just a static environment, but rendering them in real-time must be
expensive, in the sense of both memory and finances, considering the cost of
some graphics cards.
Before college, I would simply shrug it off as part of the costs of owning a cutting
edge personal computer. Now, however, as I near the end of my degree I find
myself questioning how hardware, and really anything related to computers,
works beneath the price tags and specifications. So, I have made it my mission
to broaden my horizons before graduation, and researching the graphics pipeline
will serve as one more milestone.
2.2.3 Sam
I have always been very interested in computer graphics, with 3D simulations and video games being the main reasons for my interest. Increasingly realistic 3D simulations and representations of data have always fascinated me, and playing video games is my main pastime; computer graphics are central to both. The main motivation for me to do this project was to gain more knowledge of how computer graphics are generated and how 3D object data is used to create the things we see every day. When I first entered college I was an electrical engineering major intending to go into the field of graphics processing hardware research and development and eventually work at a company involved in the field, ideally either AMD or NVIDIA, the two big names in graphics card R&D.
Early on in my classes I realized I enjoyed programming more than circuit design and switched to Computer Science. However, I still wanted to get involved in the fields of graphics, video games, or simulation in some way. This project greatly piqued my interest, as it would involve working directly with graphics cards, the graphics pipeline, and how they operate at a software level, all of which, as mentioned before, are very interesting to me. I have also taken
some classes and done some projects involving 3D graphics and programming
which I am eager to apply towards something outside of just a hobby project or
an assignment required by a class. This project lets me apply my existing
knowledge and gain much more of an understanding of index and vertex data
that is used to build 3D objects.
2.3 Goals and Objectives
When data is being sent through the graphics pipeline, the PCI-E bus acts as a
major bottleneck between the CPU and GPU. With the immense amount of data
that is being sent through this bus every second, it is of great importance that the
data sent through is optimized in any way possible. The aim of this project in part
is to alleviate the problems associated with dealing with this bus without directly
designing a more efficient version of PCI-E.
Although continuing to improve the hardware that computers run on is always of
great importance, optimizations must be made to make systems faster during the
interim. Concentrating only on developing new versions of PCI-E with a higher
throughput ignores the performance that can be gained by carefully considering
what is being sent through that bus. The compression of data before transfer is a
shining example of this method of optimization. Efficient compression algorithms
will always be able to work with the newest and fastest versions of PCI-E to
deliver overall a faster system than what can be accomplished with hardware
optimizations alone. The algorithms that are written today will be just as useful in the future as they are now, if only to pave the way for further improvement and even higher optimization.
The main goal of this project is to implement efficient lossless compression and
decompression algorithms into the graphics pipeline. The algorithms will
compress the data that goes into both the vertex and index buffers. This reduces the size of the information being transferred into the buffers and thus allows more information to be transferred at one time. When the data is fetched from
the buffers it is then quickly decompressed and used normally. The
implementation of these algorithms will increase the speed and efficiency that a
graphics card can operate by allowing the card to not have to wait as long for
new information to transfer into the buffer from the computer’s main memory.
This transfer rate of compressed data can be quantified using the formula located in Figure 2.7. Using this formula, the impact of the developed algorithms can be seen in the increase of the compressed transfer rate value.
Figure 2.7: How performance is expected to be optimized
In terms of throughput, if we consider the current amount of data that can be sent through the index and vertex buffers at once as one object, and the time it takes to send that object as one transfer unit, an uncompressed object will be sent at a rate of “one object per transfer”. Although an increase in the number of physical bytes sent through the pipeline in a given period is not possible, that does not mean it is impossible to increase the “objects per transfer” ratio. The transfer rate is increased not by increasing the size of the transfer buffer, but by decreasing the size of the data being sent through the buffer. If the algorithms generate a compression ratio of C, the overall throughput changes from “one object per transfer” to “C objects per transfer”, as is demonstrated in Figure 2.8.
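The exact formula shown in Figure 2.7 is not reproduced in this text, but a form consistent with the description above would be

\[ R_{\text{effective}} = C \cdot R_{\text{bus}}, \qquad C = \frac{\text{uncompressed size}}{\text{compressed size}}, \]

where R_bus is the raw PCI-E transfer rate and C is the achieved compression ratio. A ratio of C = 3, for example, would let roughly three compressed objects occupy the space of one uncompressed object, as illustrated in Figure 2.8.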
Figure 2.8: Compressed Objects: Three compressed objects in the space of
one uncompressed object.
The first objective of the project was to research the basics of the graphics pipeline and how it is used to draw objects to the screen, and then to research existing lossless compression algorithms that could be used as a base for, or an improvement to, other algorithms. These objectives were ongoing from the beginning of the project until the final algorithms were implemented at the end.
2.3.2 Testing Environment Objectives
The first coding objective was developing a testing environment to be used for quick prototyping of our algorithms and to allow generation of useful test data. This was important to set up first because it allowed quick implementation of test algorithms and made it easy to see whether a new test algorithm was an improvement over the previous iteration.
Within this objective there are many sub-objectives that can be separated into parallel tasks among group members. These include the development of the different modules of the testing environment, which was done in parallel. These modules can be summed up as the reader of data, the writer of data, the algorithms themselves, and the tests to be run. Another sub-objective was the development of the aforementioned tests to be run on the algorithms to gather consistent, valuable data.
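As a rough sketch of how these modules fit together (all names below, and the trivial pass-through algorithm, are placeholders rather than the group's actual code):

#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

// Placeholder algorithm module: a real module would implement delta encoding,
// RLE, Golomb-Rice, LZO, and so on behind the same interface.
std::vector<uint8_t> compress(const std::vector<uint8_t>& input) { return input; }
std::vector<uint8_t> decompress(const std::vector<uint8_t>& input) { return input; }

int main() {
    // Reader module stand-in: real code would load index/vertex data from a file.
    std::vector<uint8_t> data(1 << 20, 7);

    // Test module: time both directions and verify losslessness.
    auto t0 = std::chrono::steady_clock::now();
    std::vector<uint8_t> packed = compress(data);
    auto t1 = std::chrono::steady_clock::now();
    std::vector<uint8_t> restored = decompress(packed);
    auto t2 = std::chrono::steady_clock::now();

    // Writer module stand-in: record the metrics used to compare iterations.
    std::cout << "ratio: " << double(data.size()) / packed.size() << "\n"
              << "compress ms: "
              << std::chrono::duration<double, std::milli>(t1 - t0).count() << "\n"
              << "decompress ms: "
              << std::chrono::duration<double, std::milli>(t2 - t1).count() << "\n"
              << "lossless: " << (restored == data ? "yes" : "no") << std::endl;
    return 0;
}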
2.3.3 Algorithm Development Objectives
Next comes the development of the lossless compression algorithms that work on vertex and index information. Because the two types of data are very different in format and size, it was necessary to develop two separate compression algorithms, one for each type of information. Once they were developed, the main objective was to improve upon these base algorithms to make the final product as efficient as possible. This can be thought of as two separate objectives: one for the development of the index algorithm and the other for the vertex algorithm.
2.3.3.1 Index Compression Objectives
Within the objective of completing the index compression and decompression algorithm there are several sub-objectives that, once achieved, result in a fully developed and implemented algorithm. These include the aforementioned research into compression algorithms and the design of a prototype algorithm that creates a baseline to start from. Next is the implementation of optimizations that compress the integer-based index data even further. With this type of data the optimization of the algorithm has huge potential, and the group’s objective is to achieve a much higher compression ratio with this data than with vertex data.
2.3.3.2 Vertex Compression Objectives
The development of the vertex compression and decompression algorithm has many sub-objectives as well. This type of data comes in many more formats that have to be accounted for; as a result, one sub-objective is a reader and parser that can process vertex data and convert it to a consistent, usable form. The next sub-objective is the development of a prototype algorithm that can handle all possible types of data that can appear in vertex data. This includes handling float information, which is much more complex to compress than integer data.
2.4 Specifications
2.4.1 Index Compression Specifications
The algorithms developed must compress the vertex and index information a notable amount and do so without costing a large amount of resources to decompress. A compression ratio of at least 1.25:1 is acceptable, as the information that can be in the vertex buffer can vary greatly. For the index information a much higher compression ratio is achievable because the data has a fixed element size and contains only integer values.
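Reading that specification concretely, with the compression ratio defined as uncompressed size divided by compressed size, a ratio of 1.25:1 corresponds to

\[ \text{compressed size} = \frac{\text{uncompressed size}}{1.25} = 0.8 \times \text{uncompressed size}, \]

that is, at least a 20% reduction in the amount of data sent over the bus.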
Compression can be achieved using either the CPU’s or the GPU’s resources. If done by the CPU, the compression will be done in advance, most likely at the
time the data is actually created and written to storage. Decompression has to be
done directly on the GPU when data is fetched from the buffer either by software
implementation with the shader programs or through specific hardware on the
graphics chip. Due to the potential requirement of designing specific hardware to
run the decompression, running the decompression code on a physical graphics
card was out of scope for the project.
2.4.2 Compression Specifications
The compression algorithm system that was developed over the course of the
project needed to have the ability to compress the data as efficiently as possible.
Two different approaches are taken if data is compressed online or offline.
Offline data compression is performed in the following manner. First, a program
will be used to scan through the data with the intent of trying to determine the
most efficient method of compressing the data. After the compression sequence
has been performed on the relevant assets, they will be stored to the disk for use
later. In this situation, since the data is compressed well in advance of being used, the graphics pipeline does not process the data yet; it does so later, when the graphical application that uses the assets is loaded on the computer system.
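A minimal sketch of such an offline step is shown below. The candidate algorithm functions, the one-byte header, and the file layout are illustrative assumptions rather than the group's actual implementation:

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Placeholder candidate compressors; the real tool would plug in delta+RLE,
// Golomb-Rice, LZO, and so on.
using Compressor = std::vector<uint8_t> (*)(const std::vector<uint8_t>&);
std::vector<uint8_t> deltaRle(const std::vector<uint8_t>& in) { return in; }
std::vector<uint8_t> golombRice(const std::vector<uint8_t>& in) { return in; }

// Try every candidate on the asset and keep whichever output is smallest.
// A one-byte header records which algorithm was chosen so the GPU-side
// decompressor knows how to undo it.
void compressAssetOffline(const std::vector<uint8_t>& asset, const std::string& outPath) {
    const Compressor candidates[] = {deltaRle, golombRice};
    std::vector<uint8_t> best;
    uint8_t bestId = 0;
    for (uint8_t id = 0; id < 2; ++id) {
        std::vector<uint8_t> packed = candidates[id](asset);
        if (best.empty() || packed.size() < best.size()) {
            best = packed;
            bestId = id;
        }
    }
    std::ofstream out(outPath, std::ios::binary);
    out.put(static_cast<char>(bestId));
    out.write(reinterpret_cast<const char*>(best.data()),
              static_cast<std::streamsize>(best.size()));
}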
Advantages of this system come from the extra time that is available for the
compression process. Because the program is allowed to determine in advance
which compression algorithm is being used, it can avoid situations where an
ineffective algorithm is used on data.
Disadvantages of this system are related to the time that compression takes
relative to how fast the computer system is. If the workstation that the graphical
application is being run on is not as powerful as required, it may be inconvenient
to compile the compressed assets for use in the pipeline. The time penalties from
doing offline compression can become more apparent when compressing all the
assets to use for an application. Methods may exist to circumvent this such as
only creating assets when needed. However, developers will not want to use the system if doing so leaves them idle for significantly longer than before.
If the data is compressed online, then a different compression system and
method of pipeline integration will be used. Instead of using the most powerful
and efficient compression algorithms that are available overall, the online method
must prefer speed and runtime efficiency. This is because the online method
does all compression as the assets are being loaded from the disk, which means
that it must be done in real time. When graphics operations are done in real time,
there is a possibility that it can stall the graphics pipeline if it takes too long.
The process of online compression starts even before it is known that assets from the disk will be compressed. A graphical application that supports these optimizations will presumably be running code that can modify the way the game receives the assets it uses; a hook is one possible way to achieve this.
Once the application starts, it will begin to request assets to be loaded from the
disk. When this happens, code will be inserted between the time of the loading
and the time that the assets are sent to the GPU. This code will run a quick
version of a compression algorithm that is expected to yield a compression ratio
that is less than that of an offline compression algorithm, before sending it to the
GPU. Overall it is hoped that this approach will be faster than transferring the
uncompressed data despite the time that it takes for the data to be
decompressed.
2.4.3 Decompression Specifications
The decompression algorithms that were developed over the course of this
project all have to be very fast and efficient to work effectively. This can be
achieved by employing a range of different optimization techniques on many
different algorithms.
The decompression must be done when the data is fetched from the respective
buffer, and will have to run on the graphics card. Thus it needs to be very fast to
not hold up the pipeline. It is important to note that due to the different types of
vertex data being much more dynamic than index data it will less likely be as
compressed as much as index data as these are always a single integer value.
As opposed to some of the methods used for compression, all methods used for
decompression must be done on the graphics pipeline. Decompression is being
performed so that the graphics card may be sent data in a more efficient manner
than before, so any data that is to be decompressed must be done after the
transfer; there is no point to decompressing the data before that point.
Because index information only contains integer values, it allows the
implementation of many existing compression methods such as delta encoding
and Huffman coding. This data can be compressed to a much higher degree than
vertex data. Vertex information however can contain both integer and float
(decimal) numbers and each object’s vertices can contain different information to
describe them. This makes compression more difficult and complicates the process of making compression algorithms that work on index information also work on all vertex information.
Delta encoding specifically will not work well on vertex buffer data for two
reasons. The first is that vertex buffers are primarily comprised of float data. This
means that running delta compression on the buffers will not reduce the number
of bits in each value.
For example, suppose you have two numbers in your index buffer: 99 and 100; the difference between these numbers is 1. Knowing this, the group can keep one of these values as an anchor point and replace the other value with this difference. To recover the second value, one simply adds the difference to the anchor point. Now the buffer holds 99 and 1, and while the 99 has not changed form, the 100 has now become essentially the same bit-length as a char. With float data, the difference will still need all of its bits in order to properly represent the decimal number that results from the subtraction. Therefore delta encoding has no effect.
Figure 2.9: Delta Compression on Floats: This demonstrates why float
values cannot be compressed using delta compression.
The second reason is that vertex buffers contain several different types of data.
Subtracting color data from position data results in very odd values. Grouping the
vertex data by type would solve the problems for position data, since most of the
triangles are positioned close together to form a graphical object. This would not
have the same effect, however, on color data, since the color of one graphical
object can differ vastly from the rest.
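For the integer index data, where delta encoding does apply, a minimal sketch of the encode and decode passes is given below. It is illustrative only; the group's index algorithms additionally apply run-length encoding on top of the deltas or use Golomb-Rice coding instead, both of which are omitted here:

#include <cstddef>
#include <cstdint>
#include <vector>

// Delta-encode an index buffer: keep the first value as the anchor and
// store each later value as the difference from its predecessor.
std::vector<int32_t> deltaEncode(const std::vector<uint32_t>& indices) {
    std::vector<int32_t> out;
    if (indices.empty()) return out;
    out.push_back(static_cast<int32_t>(indices[0]));              // anchor
    for (size_t i = 1; i < indices.size(); ++i)
        out.push_back(static_cast<int32_t>(indices[i]) -
                      static_cast<int32_t>(indices[i - 1]));      // small delta
    return out;
}

// Decoding is a running sum starting from the anchor, recovering the
// original values exactly, so the scheme is lossless.
std::vector<uint32_t> deltaDecode(const std::vector<int32_t>& deltas) {
    std::vector<uint32_t> out;
    int32_t value = 0;
    for (size_t i = 0; i < deltas.size(); ++i) {
        value = (i == 0) ? deltas[0] : value + deltas[i];
        out.push_back(static_cast<uint32_t>(value));
    }
    return out;
}

// Example: {99, 100, 101, 103} encodes to {99, 1, 1, 2}; the deltas fit in far
// fewer bits than the originals. The same subtraction on 32-bit floats would
// still produce full-width floats, which is why this approach is reserved for
// the integer index buffer.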
2.5 Space Efficiency
Space efficiency was also a concern when identifying algorithms. Space
efficiency is the concept of using as little space in memory as possible to perform
the actions required for an operation or function. If an algorithm requires a large amount of memory just to run, the decompression could potentially cancel out any benefits of running the algorithm.
2.6 Requirements
2.6.1 Overall requirements
The main goal of this project was to create a sort of “hook” into the graphics card
pipeline so that graphical data can be compressed in advance on the CPU or at
compile time, before being sent to the GPU for use.
In computer terminology, a “hook” is code that is used to allow further functionality from external sources to run in a module before the main program continues. As demonstrated in Figure 2.10, hooks work by intercepting the original call of some function and then inserting their own code into the pipeline before the original code can continue working. Although some hooks can be malicious in nature, the term is used here only to describe the process by which the group can insert compression / decompression code into the graphics pipeline to try to gain performance benefits, even though the pipeline was not originally intended to support this.
Figure 2.10: The process of hook code being injected into a program being
performed.
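As a rough illustration of the idea, the sketch below shows a hook wrapping a routine that uploads an index buffer; the function names are hypothetical and do not correspond to any real driver API:

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical original upload routine provided by the application or driver.
void UploadIndexBuffer(const void* data, size_t sizeInBytes);

// Placeholder for whichever compression algorithm the system selects.
std::vector<uint8_t> compressIndices(const uint8_t* data, size_t sizeInBytes);

// The hook: callers are redirected here instead of UploadIndexBuffer.
// It compresses the buffer, then forwards the smaller payload to the
// original routine; the GPU side is expected to decompress on fetch.
void HookedUploadIndexBuffer(const void* data, size_t sizeInBytes) {
    const uint8_t* bytes = static_cast<const uint8_t*>(data);
    std::vector<uint8_t> packed = compressIndices(bytes, sizeInBytes);
    UploadIndexBuffer(packed.data(), packed.size());
}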
Graphical data must not be altered in the compression / decompression process.
When compression algorithms that alter the contents of the data they compress are used, the visual quality of the object being rendered may be greatly reduced or altered; this usually manifests in graphics as objects that appear to be “glitched”, as can be seen in Figure 2.11. For this reason, all the algorithms that the group developed had to be written such that values are not altered even slightly during compression / decompression. Such algorithms are known as lossless algorithms.
Figure 2.11: Graphical Errors: Severe graphical errors caused by
incorrectly drawn vertices.
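One simple way to check the lossless requirement during testing is to compare a checksum of the original data with a checksum of the data after a compress/decompress round trip. The sketch below uses a basic FNV-1a hash purely for illustration; the checksum actually used by the group's testing environment is described later in this document:

#include <cstdint>
#include <vector>

// FNV-1a: a small, fast hash that changes drastically if even one byte differs.
uint64_t fnv1a(const std::vector<uint8_t>& data) {
    uint64_t hash = 14695981039346656037ULL;     // FNV offset basis
    for (uint8_t byte : data) {
        hash ^= byte;
        hash *= 1099511628211ULL;                // FNV prime
    }
    return hash;
}

// Returns true only if the round trip reproduced the original bytes exactly
// (up to the tiny chance of a hash collision), i.e. the algorithm was lossless.
bool roundTripIsLossless(const std::vector<uint8_t>& original,
                         const std::vector<uint8_t>& decompressed) {
    return fnv1a(original) == fnv1a(decompressed);
}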
In addition to being lossless, both the compression and decompression
algorithms must be compatible with the constraints of the existing graphics
pipeline. For example, it was outside of the scope of the group’s responsibility to
design a hardware module for the decompression algorithms to be run on.
Instead, the group was testing to see if a software implementation would yield performance benefits.
Although a fast algorithm that is useful for both compression and decompression
is ideal, it is not necessarily a strict requirement. It is also within the scope of this project to find algorithms that are quick only during decompression and that are suitable for parts of the graphics pipeline that are not on-the-fly. For instance, such “offline” compression algorithms can be used to create pre-compressed objects at compile time that at run time only need to be decompressed by the GPU.
Figure 2.12: Offline Compression
Figure 2.13: Online Compression
2.6.2 Compression Requirements
The group decided to focus on writing an algorithm that can achieve a high level
of compression and whose data can be decompressed quickly, while worrying
less about the time that it takes to compress the data. Although an ideal
compression algorithm would be highly efficient in terms of compression ratio,
compression time, and decompression time, a real solution can only be so good
in one area before acting to the detriment of one or more of the others. For
example, if an algorithm provides an extreme level of compression but does so in
such a way that decompression is very difficult, the algorithm is not desirable.
A compression algorithm that executes quickly is still valuable, however. If the
compression algorithm that is used happens to be fast, the CPU can compress
the assets before they are sent through the graphics pipeline. A situation like this
might occur in a game that was not built with these optimizations in mind. If the
assets in the project were not compressed when they were built, they could still
benefit from the compression / decompression system through on-the-fly
compression. This was considered a tertiary goal for the project.
The compression system had to be written in such a way that it was simple for
developers to use for their own projects. Within the scope of the project this
means that the group had to design their code in a modular fashion. Doing so
would make it easy for AMD to implement in their own systems where they see
fit.
The last stipulation for the compression system was that the overall compression
/ decompression system had to be written in a way such that developers could
choose not to use them if they did not want to. In situations where the algorithms
were causing problems, the developer might want to turn off the compression /
decompression system until the problems are resolved.
Graphics cards that AMD manufactures in the future may be fitted with a module
dedicated to compressing and decompressing the index and vertex buffers with
our algorithm. Programs written today, however, will not anticipate this new
compression module. In order to preserve backwards compatibility, it is essential
that the project include the option to turn the compression module off. Backwards
compatibility is designing new hardware with the ability to run code written for an
older generation of hardware.
The compression algorithms must be made in such a way that they support some
amount of random access. A buffer sent to the GPU contains many objects, and
the GPU may not want to access these objects in the order in which they are
presented. If the compression algorithm is written in such a way that the block it
creates must all be decompressed at the same time or in sequence, then
significant overhead will be incurred when trying to access a chunk in the middle.
2.6.3 Decompression Requirements
The decompression algorithms also had to adhere to their own set of guidelines
and requirements. The first and most important stipulation was that the
algorithms must be very fast. Unlike the compression algorithms, the
decompression algorithms will always be run on the GPU online and in real time.
If the introduction of the group’s optimization system causes the GPU to run
slower than it did previously, it may cause a hiccup in the graphics pipeline,
which can lead to a lower frame rate, among other undesirable effects.
The decompression algorithms must also be space efficient. As with all high-performance software, the size of the memory footprint is of critical importance.
Any optimizations that the group can make to cause the decompression
sequence to take up less memory means that the memory can be used
elsewhere in the GPU.
Aside from these size requirements, the decompression code must be made to
run on a graphics card. This is in contrast to most of the code that programmers
write, which is made to run on a CPU. For testing purposes, the code for this
project was developed in C.
Finally, the decompression algorithms that the group wrote must take advantage
of the block structure that the compression algorithms provide. The final
compression algorithms were made so that the data could be decompressed in
chunks that are not dependent on the surrounding blocks. This allows the
decompressor to potentially save computation time by only decompressing the
segments that it needs during a given operation, instead of decompressing the
entire buffer at once.
Research
3.1 Data types
When the group began work on this project, they needed to make sure that their
foundation of how computers store numbers was completely sound. The concept
of how data types function in computers was especially important. A data type is
a specific number of bits that are stored consecutively with an accompanying
algorithm that is used to parse the bits. The containers that are used to store
numbers in computers are not all the same size, nor are they supposed to all be
parsed the same way, so different algorithms must be created to parse different
data types.
Data is typically stored in computers by using a data structure such as a symbol
table to keep track of the type of the data being worked with. No matter what the
type of the data, all containers can be reduced to binary. This property can be
exploited to implement the type of compression techniques presented in this
paper. The data that the group is trying to compress in this project is all integer
and float data at its core. Because computers store integer and floating point
data differently, creating efficient compression algorithms that handle both at the
same time is further complicated.
Integer data is the fundamental unit of storage. In 32-bit C, a single integer is
also 32 bits, which means that an unsigned integer can store values from 0 to
4294967295, or 2^32 - 1. Since binary is a positional number system, storing
integer data causes the bits of small values to fill the lower-order bytes before
the higher ones. Compression techniques can take advantage of this to reduce
the number of bytes needed to store data.
Floating point data is not stored as a traditional number would be stored;
converting a float to binary is not a simple base conversion as with an integer.
Instead, a standard exists called the IEEE Standard for Floating-Point Arithmetic
(IEEE 754). It is essentially a computerized representation of scientific notation: a
specialized system used to store decimal data in a number a computer can store,
comprised of the sign (positive or negative) of the number, followed by the
exponent of the number, and finally the fraction of the number being stored. An
example of a number represented in IEEE 754 floating point can be seen in
Figure 3.1.
Figure 3.1: Floating Point Format: The number 0.15625 is represented in the
32-bit floating point format.
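
To make the layout concrete, the short C fragment below (a minimal sketch
written for this document, not part of the project code) copies the bits of
0.15625 into an integer and splits out the three fields described above.

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int main(void)
    {
        float f = 0.15625f;               /* the value shown in Figure 3.1 */
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);   /* reinterpret the 32 float bits */

        uint32_t sign     = bits >> 31;            /* 1 bit               */
        uint32_t exponent = (bits >> 23) & 0xFFu;  /* 8 bits, bias of 127 */
        uint32_t fraction = bits & 0x7FFFFFu;      /* 23 bits             */

        /* Prints: sign=0 exponent=124 fraction=0x200000,
           i.e. 0.15625 = +1.25 * 2^(124 - 127). */
        printf("sign=%u exponent=%u fraction=0x%06X\n", sign, exponent, fraction);
        return 0;
    }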
3.2 Graphics
Before working on this project, only one member of the group had previous
experience with graphics programming. Because of this, the first step of the
project was to get everyone up to speed on the basics of 3D graphics and how
3D objects are designed and created, and to achieve an in-depth understanding
of the data the group is tasked with compressing. As a result, a large portion of
the initial research consisted of learning how graphics programs work and how
vertex and index information are used to draw objects to the screen. The rest of
the research focused on existing lossless compression algorithms that could
potentially be applied to the data the group will be compressing.
3.2.1 General Graphics pipeline
The first step towards understanding the graphics pipeline is understanding how
objects are generated. Objects consist of many vertices which contain a position,
and can contain other values such as color or normal vector information. This
information is all stored in a vertex buffer. Next another type of information the
group needed to research was index information. This information is stored in an
index buffer, and each index points to a specific vertex stored in the vertex
buffer. Index data is a widely used way to lower the amount of vertex information
needed to build an object, as it allows vertices to be reused without being
redefined in the vertex buffer. These are the main areas where the research
focused, as these two data types are what our algorithms will be compressing
and decompressing.
This process of populating the buffer and then using the buffer to supply the
graphics pipeline with data is shown in Figure 3.2. The figure displays three
iterations of an example vertex buffer: the buffer loaded with data from the
system’s memory, the data being fetched and sent into the graphics pipeline, and
then the buffer reload which repeats the process. In the first run through, labeled
Object 1, the vertex buffer is populated with data read in from the system
memory. Once the buffers are loaded with information, the graphics pipeline will
“fetch”, or retrieve, the data one chunk at a time. When the pipeline has
exhausted the current data inside the buffer, it will clear this data and load in
new data.
Figure 3.2: Example of Vertex Buffer being used and reloaded 3 times.
Inside the graphics pipeline, once data is read into the index and vertex buffer,
the GPU then reads in the indices and vertices one at a time into the assembler.
A diagram of the operation path inside the graphics pipeline that takes vertex
and index information and turns it into the final image is displayed in Figure 3.3.
In the figure, the assembler is comprised of the vertex shader and the triangle
assembly. The triangle assembly builds the shapes described by the vertices in
the form of many triangles next to each other (hence the name). These triangles
are all built one after another and placed in the correct 3D position in order to
build the full 3D object.
transformed and altered along with other objects that have been constructed and
placed in the 3D scene being drawn. Data that will not be displayed is then
“clipped” out during the viewport clipping stage. The viewport is designated by a
virtual camera that indicates what part of the scene will be drawn to the screen.
Once the scene is drawn and clipped it is then sent through the Rasterizer which
simply cuts the image seen by the camera into many small fragments. These
fragments are sent to a Fragment Shader where things like textures are applied
and the fragment data is processed into what is known as pixel data. Once this
pixel data is processed it is sent to the frame buffer where it will reside until
displayed as the final image. It is important to note that only the vertex shader
and fragment shader are directly alterable by the programmer; the rest of the
graphics pipeline is done “behind the scenes”.
Figure 3.3: The Graphics Pipeline: Illustration of where vertex data fits into
the graphics pipeline.
The interaction between both index and vertex data to build triangles can be
seen in Figure 3.4, which shows a small index and vertex buffer and how the
values of the index buffer “point” to a chunk of data in the vertex buffer. By
connecting the position values of the vertices the two triangles shown will be
drawn and displayed to the screen assuming no other transforms or modification
to the scene takes place in the rest of the graphics pipeline. In the figure vertex 1
and vertex 3 are used in both triangles while both are only defined once in the
vertex buffer.
3.2.2 Index buffer
The index buffer holds the index data which is used to point to specific vertices in
the vertex buffer. Index data consists of non-negative integer values. An example
of an index buffer can be seen in Figure 2.1. Each value is a single unit of data
that points to a specific vertex shown by the arrows going between the example
buffers.
Figure 3.4: Indices A and B: Index A is shown pointing to Vertex A, and
Index B is shown pointing to Vertex B.
Due to the nature of 3D objects, when drawing a line from vertex a to vertex b,
vertex a will most of the time be positioned relatively close to vertex b in the
vertex buffer. As a result, the index information does not tend to vary much from
one value to the next in the buffer. Index data is used to allow the reuse of
vertices without having to redefine the whole vertex and store a new copy every
time it is used. Instead, the vertex is defined once and its location within the list
of vertices is stored as an integer value known as the index.
3.2.3 Vertex buffer
Vertex buffers contain all of the vertex information for graphical objects, and
multiple values within them are mapped to a single vertex. This group of values
is what an index in the index buffer points to. Vertex information is much more
dynamic and varied than index information. It can contain numerous different
fields of information that describe each individual vertex. One attribute that can
be described is the position of the vertex in the graphical environment. This
position is mapped out using a three-dimensional Cartesian coordinate system
(x, y, z). The position is described by three values, which correspond to the x, y
and z positions on the respective axes. These values can be float or integer
values depending on the precision needed or on how the object was designed
and scaled when created.
Another attribute is the color data of the vertex. Color data is described using
three or four float values, mapped out as R, G, B, and sometimes A, where A
stands for the alpha channel. The RGB color scale is a measure of the intensity
of the three colors found in a color display: red, green, and blue, each with a
value from 0 to 255. Using a unique mixture of intensity levels, any color on the
color spectrum can be displayed. If all of the vertices of a graphical object share
the same color values, the object formed by them will appear to the viewer as
that solid color, assuming no texturing is later placed on top. If two vertices do
not share the exact same color data, a gradient will form filling the spectrum
between the two colors.
One more example of the attributes stored in the vertex buffer is the normal
vector of the vertex. This data describes a vector that is perpendicular to the
surface at the vertex. Normal vectors are used in many graphics calculations,
including lighting calculations, allowing each vertex to reflect light in the proper
direction.
Figure 3.5 displays an example in which each vertex consists of multiple fields
and the index buffer is used to generate two triangles from four vertices. The
fields that are available are ultimately up to the designer of the 3D object or to
the program used to create it, and some can be left out in order to save space in
the final “mesh” of the object. Because not all of the available information may
be needed by the specific program being developed, the fields that are actually
used are dictated by the programmers of the vertex shader, which is one of the
programs used to communicate with and drive the graphics card. In the example
shown, each vertex contains fields for position (x, y, z integer values) and color
(R, G, B floats). Another common field a 3D object can have is a normal vector,
which is used in many calculations, including how lighting appears on the object.
It may have been included in the file that contains the information to build the
triangles shown; however, it was not needed for this example, and thus was not
read into the buffers even if it was available in the 3D object’s file. A sketch of
how such a vertex could be laid out in C follows Figure 3.5.
Figure 3.5: Index and Vertex Interaction: Diagram detailing the interaction
between index and vertex buffers.
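
For illustration only, a vertex like the one in Figure 3.5 could be laid out in C as
in the sketch below. The struct and field names are assumptions made for this
example and do not correspond to any particular API's required vertex format.

    #include <stdint.h>

    /* One possible layout for the example vertex in Figure 3.5: a position
       given as three integers and a color given as three floats.  The names
       are assumptions made for this sketch, not a required format. */
    typedef struct {
        int32_t x, y, z;   /* position on the x, y and z axes  */
        float   r, g, b;   /* red, green and blue components   */
    } Vertex;

    /* The index buffer is then simply an array of integers, each one an
       offset into an array of Vertex values; every three indices describe
       one triangle. */
    typedef struct {
        Vertex   *vertices;    /* vertex buffer */
        uint32_t *indices;     /* index buffer  */
        uint32_t  vertexCount;
        uint32_t  indexCount;
    } Mesh;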
3.3 Index Compression Research
3.3.1 Delta Encoding
Delta encoding is an encoding and decoding method that, when run on a list of
integers, generates a list of the deltas, or differences between each value in the
list and the previous value. This list encodes the original integers as potentially
smaller numbers that take up less space when saved. The deltas are then used
to decode the list starting from the first value, which is named the anchor point;
the list is decoded one value at a time and, if done correctly, the resulting list is
identical to the original. Due to the nature of how delta encoding works, integer
data that does not vary much from one unit to the next offers the highest
potential compression.
A complete example of delta encoding is displayed in Figure 3.6. The process of
compressing the data follows the simple formula shown below:
delta[n] = buff[n] - buff[n - 1]
Where buff is a list of integers, in our case a buffer of index data, delta is the
resulting list of encoded values, and n starts at 1 (0 being the first element).
With a buffer of size m, compression (encoding) takes O(m) to complete. For
delta decompression to work, however, buff[0] is stored as-is and is called the
anchor point of the compressed list. In the figure the compressed data is shown
as the middle list; all values except the first (the 5) are changed to the deltas
that result from this formula. The reason this works well as a compression
method is that if 9999 is followed by 10000, the compressed list only contains a
1 instead of 10000. This allows less space to be used to store values that decode
to much larger numbers.
Decompression (decoding) follows the formula:
buff[n] = buff[n - 1] + delta[n]
Where n again starts at 1 and increases by one with each iteration until it
reaches the size of the compressed list, indicating the whole list has been
decoded. In the basic implementation of delta compression, accessing a value
further down the list, or in our case the buffer, requires the buffer to be
decompressed from the beginning, which gives the retrieval of a single value a
worst-case runtime of O(n), where n is the size of the buffer.
Figure 3.6: Delta Encoding: Demonstration of the compression and
decompression process associated with Delta encoding.
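
A minimal in-place sketch of the two formulas is shown below, assuming 32-bit
signed indices; the delta list overwrites the original buffer and buff[0] is left
untouched as the anchor point. This is an illustration of the technique, not the
group's implementation.

    #include <stddef.h>
    #include <stdint.h>

    /* Delta-encode a buffer of indices in place.  buff[0] is left untouched
       and acts as the anchor point; every other slot ends up holding the
       delta from its predecessor. */
    void delta_encode(int32_t *buff, size_t count)
    {
        if (count < 2)
            return;
        for (size_t n = count - 1; n >= 1; n--)
            buff[n] = buff[n] - buff[n - 1];
    }

    /* Reverse the process: each value is the previous decoded value plus
       the stored delta. */
    void delta_decode(int32_t *buff, size_t count)
    {
        for (size_t n = 1; n < count; n++)
            buff[n] = buff[n - 1] + buff[n];
    }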
3.3.2 Run Length Encoding
Run length encoding is a simple compression algorithm that turns consecutive
appearances of a single character, a “run”, into a pairing of the number of times
that the character appears followed by the character being compressed.
As can be seen in Figure 3.7, a run of 5 a’s in a row would take up 5 individual
characters when uncompressed. The compression algorithm will turn this into
“5a”, which takes up a mere 2 characters. The algorithm must also recognize
when not to use this technique in situations where doing so will increase the file
size. As with the last character being encoded, “z”, compressing it into “1z” would
double its size, and so it is left alone.
Decompressing a run length encoded file is also simple. Much like compressing
the file, the decompression sequence works by reading through the contents of
the file, looking for a number followed by a letter. Each number-letter pair is then
returned to its original form of a run of the number’s value of the letter in
question.
An advantage of using run length encoding is that every run is compressed
independently of any other run; the data does not depend on the surrounding
data to be compressed. In practice, when a program requires only a segment of
a file, it will not have to start at the beginning. And in a situation where the data is
being streamed and decompressed in real time, the program can start
decompressing as the single pairs are received.
The efficacy of this type of algorithm can vary heavily based on the type of
contents being encoded. If the data is prone to repeatedly storing the same
element of some alphabet (having long runs of the same character), the resultant
file size will be much smaller than the original. A good example of data that
benefits from run-length encoding is a picture with large swaths of uniform
coloring. However, if the data being stored is not uniform, such as random binary
data, many short runs may be generated. These kinds of short runs do not
greatly improve the compression ratio.
Figure 3.7: Run Length Encoding: Quick transformation of a sequence into
a compressed form using run-length encoding.
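
As a small illustration of the idea (using a character string rather than index
data, and writing every run as a count followed by the character, so the single
"z" case described above is not special-cased), a naive run-length encoder in C
could look like the following sketch.

    #include <stdio.h>
    #include <string.h>

    /* Naive run-length encoder: every run is written as "<count><char>".
       It assumes 'out' is large enough and, unlike the scheme described
       above, does not special-case runs of length one. */
    void rle_encode(const char *in, char *out)
    {
        size_t len = strlen(in), o = 0;
        for (size_t i = 0; i < len; ) {
            size_t run = 1;
            while (i + run < len && in[i + run] == in[i])
                run++;
            o += (size_t)sprintf(out + o, "%zu%c", run, in[i]);
            i += run;
        }
        out[o] = '\0';
    }

    int main(void)
    {
        char out[64];
        rle_encode("aaaaabbbz", out);
        printf("%s\n", out);   /* prints 5a3b1z */
        return 0;
    }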
3.3.3 Huffman Coding
Huffman coding is a frequency based compression algorithm. This means that
the way that the data is encoded depends on the amount of times the character
appears within the file. It also assumes that there is a large gap between the
character with the lowest frequency and the character with the highest frequency.
It also helps to have a large amount of variance in between the two extremes.
It is also a greedy algorithm, designed to look for the character with the lowest
frequency first. A greedy algorithm is one that always chooses the option with
the most benefit at the current decision juncture. An example of a greedy
algorithm can be seen in Figure 3.8. The hope is that repeatedly taking the most
immediately efficient choice will result in the most efficient overall path. A good
example of how greedy algorithms can be effective is the problem “How can you
make change using the fewest coins possible?” With typical coin denominations,
the answer is to always take the remaining change value and issue the largest
denomination of coin that fits, subtracting its value from the total as you go.
Figure 3.8: Making Change: A greedy algorithm, this algorithm tries to use
the fewest number of coins possible when making change.
Huffman coding adds the characters to a binary tree, as demonstrated in Figure
3.9, with each left branch representing a 0 and each right branch representing a
1. Each left branch will contain a lower value than the right branch. The algorithm
then converts each character in the file into a binary sequence that matches the
tree. The logic behind it is that the characters with the lowest frequency will be at
the bottom of the tree, with the longest sequence when encoded, while the
characters with the highest frequency will have a short sequence such as “01” or
“110”.
The decompression works by reading in the encoded sequence and tracing the
tree until the desired character is reached. It is guaranteed that none of the
codes are the prefix for another code, thanks to the way the tree is laid out.
A potential advantage of using this algorithm on index data is that there could be
certain indices that appear repeatedly in the buffer. This could be due to certain
graphical objects being of more importance than other graphical objects, and
therefore their vertices would appear in the buffer most often. This would allow
the Huffman coding to compress it with the most efficient compression ratio.
A problem with using this algorithm is that not all environments will have at least
one object of superior importance to the rest of the environment. If an
environment were to have graphical objects which had an approximately equal
distribution of importance assigned to them, this would cause each vertex to
occur roughly the same number of times inside of the index buffer. This would
mean that the compression would assign half of the indices small binary
sequences and half of the values large binary sequences. This would result in
the long sequences cancelling out the small sequences in terms of saving space.
Figure 3.9: Huffman Coding: An example of the kind of tree used in
Huffman encoding, accompanied by sample data being
compressed.
3.3.4 Golomb-Rice
Golomb-Rice coding is an algorithm similar to Huffman coding. It takes in an
integer and translates it into a binary sequence. It is based on integer division,
with a divisor that is decided upon before runtime. It works by dividing the integer
being compressed by the chosen divisor and writing the quotient and remainder
as a single sequence.
The quotient from the result of this division is written in unary notation. Unary is
essentially a base-1 number system: each integer is written as a single symbol
repeated to match the quantity the integer represents. For example, the integer
three is written as 111 followed by a space. Since a space cannot be expressed
in a binary sequence, it is instead represented by a 0 in our program.
The remainder from the result of the division operation is simply written in binary.
A unary sequence requires a lot more digits to represent an integer than a binary
sequence. Because of this, choosing a large divisor when using Golomb-Rice
Compression is encouraged.
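
The sketch below illustrates the encoding of a single value when the divisor is a
power of two, M = 2^k, which is the Rice special case; the quotient is printed as
a unary run of 1's ended by a 0, and the remainder is printed as k binary bits.
The bits are emitted as printable characters purely for illustration; this is not the
project's implementation.

    #include <stdio.h>
    #include <stdint.h>

    /* Rice-encode 'value' with divisor 2^k and print the resulting bits.
       Quotient: a run of 1s closed by a 0.  Remainder: k binary bits. */
    void rice_encode_print(uint32_t value, unsigned k)
    {
        uint32_t q = value >> k;                 /* quotient  = value / 2^k   */
        uint32_t r = value & ((1u << k) - 1u);   /* remainder = value mod 2^k */

        for (uint32_t i = 0; i < q; i++)
            putchar('1');
        putchar('0');                            /* end of the unary run      */
        for (int bit = (int)k - 1; bit >= 0; bit--)
            putchar(((r >> bit) & 1u) ? '1' : '0');
        putchar('\n');
    }

    int main(void)
    {
        rice_encode_print(19, 3);   /* 19 = 2*8 + 3  ->  "110" + "011" */
        return 0;
    }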
Huffman coding and Golomb-Rice encoding were so similar in nature that we
decided to implement only one. The group considered many factors when
deciding to implement Golomb-Rice over Huffman coding. The first of these
factors was space efficiency. Huffman requires a binary tree which stores each
number we are encoding as a node in the tree. This tree would have to be
transmitted through the buffer along with the encoded sequences in order to be
decompressed by the GPU. Overall this would limit the maximum amount of
compression we could hope to achieve. Golomb-Rice, on the other hand, needed
only to transfer a single integer (the divisor) along with the compressed data. The
second factor was compression time. In order to decompress a binary sequence
generated by Huffman coding, each individual bit would have to be checked, and
then used to trace a path down the binary tree. With Golomb-Rice encoding, the
quotient portion of the sequence is simply a run of 1’s. This format is easily
analyzed with a simple while loop, and does not require additional operations to
be performed. On average, half of the binary sequence generated by Golomb
consists of the quotient portion. In essence, a Golomb sequence could be
decoded in half the time it would take to decode a Huffman sequence of the
same length. The final factor which contributed to the implementation of Golomb
coding over Huffman coding was that Huffman was a frequency-based algorithm.
This implies that certain indices would have to show up far more frequently than
other indices in order for the compression to be effective. Since Golomb did not
have this restriction hindering its effectiveness, it was considered to be the safer
alternative.
3.4 Vertex Compression Research
There are numerous research papers that describe attempts to create effective
vertex compression algorithms. Some of these algorithms work at the time of
vertex data creation, when the actual 3D object is created, instead of at the time
of data transfer. Some algorithms proposed for vertex compression are even
lossy, used with the assumption that the programs drawing the 3D objects do not
need the precision that 32-bit vertex float data offers [1].
In addition to research papers proposing compression algorithms that are run
directly on the data, there also exist methods of compression that act upon the
data type the information is loaded into. Under the assumption that the data
being saved has more precision than is needed, space-saving optimizations may
be made. These are often left up to the programmer and require some
assumptions about the data being saved. For
example, there is a structure called VertexPositionNormalTexture included in the
XNA video game development library that contains a position, a normal vector,
and a texture coordinate. This structure is 32 bytes in size, storing the position
as a Vector3 (12 bytes), the normal vector as a Vector3 (12 bytes), and the
texture coordinate as a Vector2 (8 bytes). In addition to this struct there are
special data types such as NormalizedShort2; using one of these instead of a
full vector type can save 8 bytes without losing too much precision when storing
normal vector data [3]. This is more of an optimization than a compression step,
and it is up to the programmers of the shader program to decide when a smaller
data type will suffice for their application and which data they will place into
smaller data types, as opposed to losslessly compressing the existing data,
which is the goal of this project.
3.4.1 Statistical Float Masking
Statistical float masking is an algorithm meant to prime data for compression by
other algorithms. It is not a compression algorithm that can be applied by itself; it
is simply an optimization that can be applied before an algorithm is applied to
increase the compression ratio. The reason the group wanted to implement this
type of algorithm was to find a way for data that will be read into the buffer to be
primed in advance for buffer transfer.
The algorithm works by creating mask values derived from the most common bit
values occurring in each bit-column of a block of data. For each block, the
algorithm counts whether more zeroes or more ones occur in each bit-column,
and repeats this for every column. Recording all of these results creates a mask
that, when XORed with the dataset, increases the uniformity of the dataset in a
deterministic way. An example of this process is outlined in Figure 3.10, and a
small code sketch follows below.
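
The sketch below is our own minimal illustration of the idea, assuming 32-bit
values grouped into a block: the majority bit of each bit-column is recorded in a
mask, and XORing the mask into every value makes the block more uniform.
XORing with the same stored mask a second time restores the original data.

    #include <stddef.h>
    #include <stdint.h>

    /* Build a mask whose bit b is set when the majority of the values in the
       block have bit b set, then XOR the mask into every value.  Applying the
       same function again with the stored mask restores the original block. */
    uint32_t mask_block(uint32_t *block, size_t count)
    {
        uint32_t mask = 0;
        for (unsigned bit = 0; bit < 32; bit++) {
            size_t ones = 0;
            for (size_t i = 0; i < count; i++)
                ones += (block[i] >> bit) & 1u;
            if (ones * 2 > count)            /* this bit-column is mostly 1s */
                mask |= 1u << bit;
        }
        for (size_t i = 0; i < count; i++)
            block[i] ^= mask;                /* majority bits become zeroes  */
        return mask;                         /* must be stored with the block */
    }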
Although when the group started work on this algorithm they thought that it was a
general-purpose optimization that could be applied to all algorithms, they found
later on that some problems existed with the application process that made it
more undesirable than previously thought.
The first problem the group found with this method was that it would only have
yielded significant gains for unoptimized algorithms. Since efficient modern
algorithms already attempt to optimize as much as possible, this predictive
method did not yield any compression benefit when applied to already
compressed files. In some cases, storing small blocks even increased the
original file size because of the additional overhead.
Figure 3.10: XOR Operator: A quick equation to demonstrate the XOR
Operator
3.4.2 BR Compression
One particular kind of compression algorithm that looked promising for
compressing the float values commonly found in the vertex buffer is described in
a research paper entitled “FPC: A High-Speed Compressor for Double-Precision
Floating-Point Data”. Its authors name it simply FPC, for “floating point
compression,” although the group has decided to call it “BR compression” after
its authors, Martin Burtscher and Paruj Ratanaworabhan, because this is easier
to identify. It works by sequentially predicting each value, performing an XOR
operation on the actual value and the predicted value, and then performing
leading zero compression on the result of the XOR operation.
The algorithm uses two separate prediction methods, called FCM and DFCM.
The prediction functions involve the use of specialized two-level hash tables to
make predictions on what float mask will be most effective. It compares each
prediction to the original value to see which is more accurate. The logic behind
the compression performed is that the one that is more accurate will produce
more zeroes after a XOR operation, which leads to space savings through
leading zero compression (LZC).
The XOR operation returns a 0 in the place of each identical bit in its two
operands. A quick equation using XOR is shown in Figure 3.11. Therefore, it can
be assumed that the closer the actual and the predicted values are, the more
leading zeros will be present in the result, and the better the compression ratio
will be.
Figure 3.11: XOR Operator: A quick equation to demonstrate the XOR
Operator
Leading zero compression is demonstrated in Figure 3.12. First, the number of
leading zeros in the value is counted and the count is stored in a three-bit
integer. Then all of the leading zeros are removed and replaced with that
three-bit count. A primitive example is shown in the figure.
Figure 3.12: Leading Zero Compression: The zeroes at the beginning of a
binary number are replaced with a single binary number counting
the zeroes.
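
A small sketch of the counting step is shown below. It assumes, for this
illustration, that leading zeros are counted in whole bytes of a 64-bit residual,
which keeps the count in the 0-7 range so that it fits in a three-bit field; the
exact granularity used by FPC is not reproduced here.

    #include <stdint.h>

    /* Count the leading zero bytes of a 64-bit XOR residual, clamped to 7 so
       that the count fits in a three-bit field.  Counting whole bytes is an
       assumption made for this sketch; it keeps the remaining data aligned. */
    unsigned leading_zero_bytes(uint64_t residual)
    {
        unsigned count = 0;
        while (count < 7 && residual != 0 && (residual >> 56) == 0) {
            residual <<= 8;
            count++;
        }
        if (residual == 0)
            count = 7;        /* all-zero residual: store the maximum count */
        return count;         /* the remaining bytes follow this 3-bit count */
    }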
The disadvantage of this type of algorithm is that it uses the same methods to
decompress its values that it does to compress them. This means that the time it
takes to compress the vertex data will be the same as the time it takes to
decompress the data. This nullifies the purpose of offline compression, as the
decompression algorithm will stall the pipeline as much as the compression
algorithm would.
The FCM and DFCM prediction algorithms use the previously mentioned
masking techniques for their compression, but use different methods to generate
the XOR values. An FCM uses a two-level prediction table to predict the next
value that will appear in a sequence. The first level stores the history of recently
viewed values, known as a context, and has an individual history for each
location of the program counter of the program it is running in. The second level
stores the value that is most likely to follow the current one, using each context
as a hash index. After a value is predicted from the table, the table is updated to
reflect the value that actually followed that context.
DFCM prediction works in a similar fashion; instead of storing each actual value
encountered, as a normal FCM does, only the differences between consecutive
values are stored. This version uses the program counter to determine the last
value output from that instruction, in conjunction with the entire history at that
point. In other words, instead of the hash table storing the absolute values of all
the numbers in the history, only the differences between the values are stored,
much like delta encoding. A DFCM will return the stride pattern if it determines
that the value is indeed part of the stride; otherwise it will return the last outputted
value. In the group’s use of this technique, the FCM and DFCM are both used as
complementary functions, where the prediction for the XOR value with the higher
number of leading zeros is used as the result. Figures 3.13 & 3.14 show how
value prediction and table updating for FCM and DFCM work.
Figure 3.13: FCM generation and prediction
Figure 3.14: DFCM generation and prediction
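
The fragment below is a heavily simplified, single-level sketch of the FCM idea
(our own construction, not the code from the FPC paper): a hash of recent values
forms the context, the table stores the value that last followed that context, and
the prediction is XORed with the actual value so that an accurate prediction
leaves many leading zeros for the next stage to remove.

    #include <stdint.h>
    #include <string.h>

    #define FCM_SIZE 4096   /* power of two, so hashing is a simple mask */

    typedef struct {
        uint64_t table[FCM_SIZE];   /* value last seen after each context */
        uint64_t hash;              /* running context hash               */
    } Fcm;

    void fcm_init(Fcm *f) { memset(f, 0, sizeof *f); }

    /* Predict the next value from the current context, fold the actual value
       into the context, and return actual XOR predicted.  The more accurate
       the prediction, the more leading zeros the residual has for the
       leading-zero stage to strip. */
    uint64_t fcm_residual(Fcm *f, uint64_t actual)
    {
        uint64_t predicted = f->table[f->hash];
        f->table[f->hash] = actual;                    /* learn this context */
        f->hash = ((f->hash << 6) ^ (actual >> 48)) & (FCM_SIZE - 1);
        return actual ^ predicted;
    }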
Decompressing values generated with FCM and DFCM XORs is simple at this
point. All that must be done to reverse the process is to inflate the numbers from
their compressed leading zero form, and then to XOR the resulting value with the
correct predictor hash, as is noted by a bit set in every compressed value.
3.4.3 LZO Compression
Lempel-Ziv-Oberhumer (LZO) had the best reported balance between
compression ratio and decompression speed of the algorithms researched. The
LZO algorithms are a family of compression algorithms based on the LZ77
compressor and are distributed under the GNU General Public License. These
algorithms focus on decompression time, which made LZO ideal for this project:
it still achieves a high level of compression while having a low decompression
time.
LZ77 is also behind other popular compression schemes, such as the DEFLATE
algorithm used in PNG files, while the related LZW algorithm is used in GIF files.
LZO is also used in real-world applications such as video games published by
Electronic Arts.
LZ77 compresses a block of data into “matches” using a sliding window.
Compression is done using a small memory allocation to store a “window”
ranging in size from 4 to 64 kilobytes. This window holds a section of the
previously seen data, and the compressor slides it across the data to see if it
matches the current block. When a match is found, it is replaced by a reference
to the original block’s location. Blocks that do not match the current “window” of
data are stored as-is, creating runs of non-matching literals in between the
matches. LZO runs an optimization on top of this to greatly increase
decompression speeds.
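
To illustrate the sliding-window idea in isolation, the naive sketch below (not the
LZO source, which uses much faster hash-based matching) scans a fixed-size
window behind the current position for the longest earlier occurrence of the
upcoming bytes; an LZ77-family coder would then emit that (distance, length)
pair instead of the literal bytes.

    #include <stddef.h>

    /* Find the longest match for data[pos..] inside the preceding window.
       Returns the match length; *distance is only meaningful when the
       returned length is non-zero.  A compressor would emit the
       (distance, length) pair when the match is long enough to pay off,
       and a literal byte otherwise. */
    size_t lz77_find_match(const unsigned char *data, size_t len, size_t pos,
                           size_t window, size_t *distance)
    {
        size_t best = 0;
        size_t start = pos > window ? pos - window : 0;

        for (size_t cand = start; cand < pos; cand++) {
            size_t match = 0;
            while (pos + match < len && data[cand + match] == data[pos + match])
                match++;
            if (match > best) {
                best = match;
                *distance = pos - cand;
            }
        }
        return best;
    }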
3.5 Additional Research
3.5.1 Testing Environment Language: C vs. C++
For this project, a testing environment was developed in order to aid in quick
prototyping of the algorithm. Both C and C++ were proposed to be the language
that the environment was developed in. This is because both C and C++ are
used widely in graphics programming and both are very similar to shader coding
languages. In the end, the majority of the environment was programmed in C.
This is due to the group’s prior knowledge of the language and the ability to
incorporate C into C++ code with little to no modification, should the need for
C++ become apparent later in the project.
3.5.2 AMP code
C++ AMP (C++ Accelerated Massive Parallelism) is a programming model
developed in C++ that allows the coder to easily develop a program that runs on
parallel processors such as GPUs. Initial research showed that this model had
potential to be a good way to test the group’s final algorithms on GPUs without
implementing them in hardware or shader code.
Moving code previously run on a CPU over to a GPU may affect an algorithm’s
performance, and implementing the algorithms using the AMP libraries would
allow them to be simulated on the GPU to see how, if at all, performance
changes compared to previous tests. The AMP libraries may also help in
parallelizing the algorithms. This, however, was not implemented due to time
constraints and the importance of other requirements.
Design Details
4.1 Initial Design
The group had enumerated two possible implementations for their compression
techniques: at the time the 3D object is compiled into the vertex or index data
(“offline”), and at the time when the data is read from the hard drive during
runtime (“online”). Figure 4.2 shows a simplified version of the graphics pipeline,
with the offline version of our algorithms running at compile time shown on the
left and the online method running at runtime shown on the right. The places in
the graphics pipeline where the algorithms would be implemented are displayed
as the blue portions of the figure.
4.1.1 Offline Compression
The main differences between offline and online compression are the constraints
put on the algorithm in terms of what resources are available and how much time
the algorithm has to compress the data. As can be seen on the left side of Figure
4.2, the offline compression implementation is performed after the vertex and
index data are created and then saved to the system’s main memory. By using
the offline implementation, the group gains the freedom to work without worrying
about resource or time constraints; if the program takes a large amount of time
to compress the data, it does so without the graphical application running, and
as a result avoids potentially stalling the graphics pipeline. Additionally,
resources that will not be available while the program is running could be usable
by the algorithm in the offline method.
Another potential benefit to running the compression offline would be the
possibility of making a “smart” algorithm designed to choose which compression
method works best on the data being compressed. This smart algorithm will give
a score to multiple different compression methods based on their performance on
the particular set of data. The algorithm with the highest score will then be the
one run on the data to ensure the best compression is achieved for that specific
dataset. Then, at decompression time, there must be a way to tell which
compression method was run on that specific dataset so that the corresponding
decompression algorithm can be run to correctly decompress the data. This can
be conveyed either through a header section included in the data or through a
separate lookup table that is created when the buffer is populated with the data.
The reason this would most likely be run offline is that the scoring method would
potentially have to run all possible compression algorithms to see which method
is best; this would likely take too much time, and potentially too many resources,
if the test were run online during runtime.
If the program were to dynamically apply the compression algorithm at compile
time, a header would be needed to define which algorithm the program was
going to use. The program would map each algorithm to a string of bits so that
each algorithm is given a unique number written in binary. For example, if there
are four algorithms, such as delta, Huffman, run-length, and one for vertex data,
the header would only need two bits to represent each of them. It is likely the
header would use four bits instead, in case the group wanted to incorporate
more algorithms or alternative versions of the ones the group already has, as
displayed in Figure 4.1. Alternative versions of delta compression, for example,
could be a slow version and a fast version: the slow version could use more
predictive techniques and more anchor points, while the fast version uses fewer
anchor points but is better suited for compression at run time.
Figure 4.1: Header Data: An example of how a header could be applied when
dynamically selecting a compression algorithm.
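
One possible layout for such a header, with hypothetical algorithm numbers
chosen only for this sketch, is a single byte whose low four bits identify the
algorithm, leaving room for up to sixteen algorithms or variants.

    #include <stdint.h>

    /* Hypothetical 4-bit algorithm identifiers packed into a header byte;
       four bits leave room for up to sixteen algorithms or variants. */
    enum {
        ALGO_DELTA      = 0x0,
        ALGO_HUFFMAN    = 0x1,
        ALGO_RUN_LENGTH = 0x2,
        ALGO_VERTEX     = 0x3
    };

    uint8_t make_header(uint8_t algo_id)
    {
        return algo_id & 0x0F;     /* low nibble selects the decompressor */
    }

    uint8_t read_algo_id(uint8_t header)
    {
        return header & 0x0F;
    }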
Index buffers and Vertex buffers benefit from offline compression differently. For
example, the compression algorithm for vertex data is likely to be more complex
and therefore will take a longer time to complete. This means that the vertex
compression algorithm will benefit more from offline compression because it is
not rushed to finish in order to avoid stalling the pipeline.
4.1.2 Online Compression
In the online compression implementation, compression is run on the data as it
is read in from the system’s memory (the hard drive in this case), directly before
it is loaded into the buffers on the GPU. The right
side of Figure 4.2 displays the online implementation of this system. This
implementation will require a high speed compression algorithm in order to avoid
halting the graphics pipeline as it waits for the buffer to be populated. In addition
to the speed requirement there is also a chance that when running the program,
fewer resources will be available to compress the data due to the program also
utilizing the GPU and CPU. This could potentially slow down the compression
and introduce variability in our algorithm’s runtime.
The algorithms developed in this project are designed to follow the offline
compression method as this would allow the group to focus on higher
compression ratios while avoiding the constraints on resources and speed that
the online method would introduce. If the final compression algorithms are fast
enough to be run at runtime without potentially stalling the rest of the graphics
pipeline then the algorithms will be converted, however this is not imperative for
fulfilling the requirements of the project.
Figure 4.2: Graphics Pipeline with Compression: Two possible
configurations of the graphics pipeline after our compression and
decompression algorithms have been added.
4.2 Testing Environment
When the group began work, they anticipated using a multitude of different
algorithms in their testing, and all of these tests would generate important data
that the group would need to gather and organize into a common format to be
compared. The group decided that there must be a standardized testing
environment able to track all of the algorithms that the group would work on over
the course of the project. The features the group hoped to implement in the
environment were the ability to keep track of the data generated by the many
different algorithms, the ability to test each algorithm on multiple kinds of data
sets in a short period, the ability to provide a standard, modular system in which
many different algorithms could be tested and compared quickly, and the ability
to verify the correctness of their implementations.
The testing environment is a framework designed to test our algorithms and
measure their performance. It was written in the C programming language, and
was worked on by all of the group’s members. It is modular, meaning each of its
functions can be added, modified or removed without impacting the
environment’s ability to run properly.
The testing environment takes in index and vertex data from a text file and stores
them in two separate arrays. When prompted, it runs the current version of the
compression and decompression algorithms. It can measure a compression
algorithm’s run time and compression ratio, and it has a checksum function to
ensure that the decompressed data matches the data that was originally
compressed. A simplified sketch of such a measurement is shown below.
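
A stripped-down example of how such a measurement can be taken in C is given
below. The compress_indices function here is a stand-in that simply copies the
data; the real environment would call the actual algorithm under test in its place.

    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    /* Stand-in for the algorithm under test: it only copies the data, so the
       reported ratio is 1.0.  The real environment calls the actual
       compression routine here instead. */
    static size_t compress_indices(const int *in, size_t count, unsigned char *out)
    {
        memcpy(out, in, count * sizeof(int));
        return count * sizeof(int);
    }

    int main(void)
    {
        int indices[1024] = {0};
        unsigned char out[sizeof indices];

        clock_t start = clock();
        size_t compressed = compress_indices(indices, 1024, out);
        clock_t end = clock();

        double seconds = (double)(end - start) / CLOCKS_PER_SEC;
        double ratio   = (double)compressed / sizeof indices;

        /* ratio is compressed size over original size; multiplying it by the
           time taken gives a combined figure of merit for comparisons. */
        printf("time: %.6f s  ratio: %.3f  score: %.6f\n",
               seconds, ratio, ratio * seconds);
        return 0;
    }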
Two main areas of optimization must be taken into consideration when
comparing the efficiency of compression and decompression algorithms: their
time and space complexity. Both are important measures of how well the
compression and decompression algorithms are able to work through the data
being sent through the vertex and index buffers, and so the group put a high
priority on the collection of statistics on this and other data over the course of the
project.
4.2.1 Initial Environment Design
The testing environment was designed to facilitate the generation of useful
information when testing the group’s changes to their algorithms. This
information is used when comparing different implementations and optimizations
of the group’s algorithms with previous attempts. By developing the environment
in C, the group hoped to avoid the complications of using a shader program,
which would have introduced more complexity than was needed to test the
algorithms. Because C code is easily ported to C++, starting out in C means the
group can easily transfer existing code to C++ if C++ features are ever needed.
4.2.2 Data Recording
The group had concerns regarding their ability to keep track of all the data that
their project was going to generate; the data for an algorithm must be labeled
appropriately along with the statistics it generates. The group viewed each
revision of each algorithm as an entirely different entity, because changes in the
code could yield unique performance characteristics that would be lost if two
variations forked from the same algorithm were treated as one. When attempting
to determine which optimizations of an algorithm work best, the group wanted to
generate meaningful comparisons while maintaining a distinction between
algorithms that started out the same but then took on different optimizations,
even at the expense of generating a large amount of data.
The group has designed a system for keeping track of the data they collect.
When an algorithm is run, the statistics for that algorithm will be inserted into a
database. If the algorithm yielded better performance than the previous iteration,
then they would keep the code and continue working on improving it.
In addition to the concerns related to their ability to keep track of all their data,
the group realized that simply executing the algorithm code and recording how it
performed was insufficient. The group also viewed the validation of the integrity
of their algorithms to be very important. The group was afraid that their iterations
and optimizations to the algorithms could cause the algorithm to produce
incorrect compression or decompression sequences and lose the
required lossless quality without them noticing at the time. These errors, if left
unnoticed, would compound with the other changes and possible errors added to
the project after the initial error. The group decided the
most efficient way to test if the algorithms were working as intended at all times
was to compare a checksum of the original data that was intended to be
compressed, and the data that is outputted after having been decompressed.
This check did not verify that the algorithm was performed in any particular way;
it simply verified that the compression and decompression sequence yielded the
same data that was originally entered into it. Another basic sanity check was
also implemented to make sure that the size of the compressed data was
smaller than that of the original data.
Figure 4.3: Checksum Functions: A checksum function will return a vastly
different value even with similar input data.
Figure 4.4: Checksum Usefulness: Demonstration of how a checksum
alerts the program that data has been changed.
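
The check itself can be as simple as the sketch below (our illustration; the
environment's actual checksum routine may differ). The same function is run
over the original buffer and over the buffer that has been compressed and then
decompressed, and any difference between the two values flags a lossy or
broken round trip.

    #include <stddef.h>
    #include <stdint.h>

    /* Simple FNV-1a style checksum over a byte buffer.  Equal checksums on
       the original input and on the decompressed output give good, though
       not absolute, confidence that the round trip was lossless. */
    uint32_t checksum(const unsigned char *data, size_t len)
    {
        uint32_t hash = 2166136261u;
        for (size_t i = 0; i < len; i++) {
            hash ^= data[i];
            hash *= 16777619u;
        }
        return hash;
    }

    /* Usage:
       int lossless = checksum(original, n) == checksum(decompressed, n); */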
4.2.3 Scoring Method
In order to compare the different algorithms the group developed and tested in
the environment, a scoring method was employed to give a quick way to compare
algorithms against each other. The three scores, shown in Figure 4.5, capture the
compression ratio, the efficiency of the compression section, and the efficiency
of the decompression section. To do this, the scoring method takes into account
the time the algorithm takes to both compress and decompress the data and the
compression ratio achieved by the compression section of the algorithm.
When creating the decompression score the program uses the resources
required to run the decompression section multiplied by the time taken to
decompress the data. The resulting score is used to see if the decompression
section of the algorithm is more efficient than previous attempts in terms of the
two important sections of decompression: resource requirements and speed of
decompression.
By creating three different scores, the group was able to choose the most
efficient algorithm possible by first seeing whether the whole algorithm was more
efficient than previous attempts. The two other scores are then used to see if one
attempt’s compression or decompression section can be combined with the
opposite section of another attempt to get a better result. By making different
combinations of compression and decompression sections, the group hoped to
further increase compression efficiency without having to add a completely new
and untested optimization.
Score                  Equation
Compression Ratio      Compressed Data Size / Original Data Size
Compression Score      Compression Ratio * Compression Time
Decompression Score    GPU Resources Used * Decompression Time
Figure 4.5: Score Equations for testing environment.
4.2.4 Dataset Concerns
The group was also concerned about how they were going to give themselves
the ability to quickly test all of their algorithms over multiple types of data with
theoretically different compression ratios. A potential problem could arise if the
index and vertex data that the group tested their algorithms on were not diverse
enough. Although an algorithm may excel at compressing a particular type of
data, it may falter in other areas that impact its overall performance. An example
of this is shown in Figure 4.6. This figure shows two
different datasets being run through a delta encoding algorithm. The left dataset
has a much higher level of compression compared to the one on the right.
As another example, if the group only tested their algorithms on a file with long
binary runs, run length encoding would appear much more viable than if testing
were performed on a more inconsistently distributed file. As such, the group
designed the testing environment in a way that facilitated testing algorithms on
multiple datasets at the same time.
Figure 4.6: Example of the same algorithm compressing different data with
different efficiency.
4.3 Index Compression
Due to the uniformity of index data (an index will never have a decimal value),
the index compression algorithm is much easier to develop and is capable of
achieving a much higher compression ratio than that of vertex data. Because of
this, the group started with this algorithm and implemented a solid prototype
before moving on to the more complex vertex algorithm.
4.3.1 Delta Encoding
Of all the algorithms researched, delta encoding seemed the best to start with as
a baseline for index data compression. It was chosen because delta encoding
has been proven to work very well with integer data that does not vary much
from one unit to the next. This is generally the case with index information
because, when drawing an object, it is uncommon to point to a vertex at one spot
and then point to one very far from it, as this would draw a very strange-looking
shape. Often these two vertices will be close to each other in the buffer thanks to
the way 3D objects are transformed into vertex information when created.
Initial test results were very promising with the implementation of delta
compression showing a large amount of compression with very little time penalty
at both the time of data compression and more importantly decompression. Due
to these initial results delta compression was proven to be a good baseline to
build upon.
4.3.2 Other Considered Algorithms
Other algorithms that were tested are Huffman encoding and run length
encoding. Initial research deemed them less effective on average index data,
and as a result the group did not consider testing them alone on uncompressed
data a large priority. When run length encoding was implemented on top of delta
encoding, however, the efficiency of the algorithm being developed increased
further, as the delta-encoded data was more compatible with it and yielded better
compression without a large increase in decompression time. An example of this
is shown in Figure 4.7. In the figure a sample of an index buffer is shown. The
data in this buffer is first run through a delta encoder as the first step of
compression. In the next step the delta encoded data is then run through a run
length encoder. Because of how run length encoding works, numbers have to be
encoded into a letter representation; in the example seen in the figure, the run
length algorithm encodes 500 as the letter a, 1 as the letter b, and -1 as the
letter c. It can be observed that by running the delta-encoded data through this
second algorithm, the sample data is converted from three values to two. This is
how running one encoding on top of another can produce higher compression
ratios, while decompression is not greatly slowed by the compounding of these
encodings.
Figure 4.7: Run Length + Delta: Example of running Run Length encoding
on top of Delta encoding
4.3.3 Delta Optimization
In order to increase the speed and efficiency of delta decompression on our data,
the group developed dynamic anchor points. These anchor points split the data
into separate sections, or blocks, allowing delta decompression to start at the
nearest anchor point instead of at the beginning of the data. This helps the
algorithm by allowing the GPU to access indices at random locations in the
buffer without having to decode from the beginning of the buffer, decoding
instead from the closest anchor point before the value being fetched. There are
a few methods of implementing these anchor points.
The first anchor point implementation is shown in Figure 4.8. This
implementation involves the use of escape codes that exist in some index data.
These escape codes are pieces of data that do not represent actual indices but
instead are a flag of sorts to indicate the end of a triangle strip, a special type of
optimization that allows the creation of a strip of triangles, each connected by 2
vertices to the previous triangle. This optimization allows the reuse of two indices
of the previous triangle to draw the next triangle.
Using these codes, the algorithm places a new anchor point directly after each
escape code; the deltas between the following indices should then be very small,
as all the triangles in a triangle strip are connected.
escape codes are represented as the value -1 in the original index buffer. These
are turned into two consecutive -1’s in the encoded buffers as shown by the
arrows going from the original data to the left dynamic anchor point buffer. This is
done to prevent the deltas that equal -1 from triggering an escape code, and in
turn these deltas are represented in the encoded buffers as a -1 followed by a 0.
In the diagram, a command to fetch the 7th value in the index buffer is run on
both encoded buffers. The anchor point used is represented by the first blue box.
The dots following it down the line represent decoding steps that had to be run
before the desired index was reached, indicated by the ending blue square with
the desired value in it. It can be seen that the normal anchor point method
required eight decoding steps, assuming the loading of the anchor point was the
first step, whereas the dynamic anchor point implementation allowed the decoder
to reach the desired value in only three decoding steps.
Figure 4.8: Example Showing Benefit of Dynamic Anchor Points with
Escape Codes
Another method for dynamic anchor points is to take the size of the data buffer
and split it into equal parts. This method requires a smart algorithm to choose
the split size: too many anchor points add overhead, while too few force the
decoder to decode too many values before reaching the information it needs. As
shown in Figure 4.9, splitting the buffer at multiples of three indices would
work well, since three indices define a triangle, the base shape for drawing an
object; a sketch of this lookup appears after the figure.
Figure 4.9: Example Showing Benefit of Dynamic Anchor Points with No
Escape Codes
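A minimal sketch of the fixed-interval variant described above, assuming (as an
illustration, not the group's actual layout) that a fully decoded anchor value
is stored every BLOCK indices and that delta slots at anchor positions hold 0:
fetching index i only requires decoding forward from the nearest preceding
anchor.

    #include <stdio.h>

    #define BLOCK 3  /* one anchor per 3 indices, matching the triangle-sized split */

    /* anchors[] holds fully decoded values taken every BLOCK positions when the
     * buffer was encoded; deltas[] holds the per-index differences elsewhere. */
    static int fetch_index(const int *anchors, const int *deltas, int i) {
        int a = i / BLOCK;              /* nearest preceding anchor */
        int value = anchors[a];
        for (int j = a * BLOCK + 1; j <= i; j++)
            value += deltas[j];         /* decode only from the anchor forward */
        return value;
    }

    int main(void) {
        /* Original indices: 10 11 12 13 14 15 16 17 18 */
        int anchors[] = { 10, 13, 16 };
        int deltas[]  = { 0, 1, 1, 0, 1, 1, 0, 1, 1 };  /* 0 at anchor slots */
        printf("%d\n", fetch_index(anchors, deltas, 7)); /* prints 17 after 1 step */
        return 0;
    }

The trade-off discussed above shows up directly in BLOCK: a smaller block means
fewer decode steps per fetch but more anchor values stored alongside the
compressed data.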
4.3.4 Golomb-Rice
The divisor used in encoding a sequence with the Golomb-Rice algorithm will
determine the effectiveness of the algorithm. The method used to choose the
divisor is therefore of the utmost importance. When the size of the divisor is
decreased, the size of the quotient will increase, while the size of the remainder
decreases. If the divisor is set too low, there is the chance that the quotient could
become too large to fit within the largest datatype available in C (64 bits). When
the divisor is increased, the size of the quotient decreases, while the size of the
remainder increases. If the divisor is set too high, the numbers will simply be
converted to their binary representations, not yielding any compression.
The implementation of Golomb-Rice used for this project calculates the divisor
based on the maximum value found within the input file. It ensures that the
encoded sequence required for that maximum value is 32 bits at most. This way,
all numbers smaller than this max value will be less than 32 bits when encoded,
ensuring the algorithm always yields some amount of compression. This method
works best on input files with a wide range between the highest and the lowest
value. If the values are skewed such that a majority come close to the highest
value, this method will yield next to no compression.
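A sketch of how such a divisor could be chosen, assuming Rice coding (a
power-of-two divisor) with a unary quotient plus k remainder bits; the 32-bit
budget matches the description above, but the helper names are illustrative
rather than the project's exact code.

    #include <stdio.h>
    #include <stdint.h>

    /* Bits Rice coding uses for value v with divisor 2^k:
     * quotient in unary (q ones plus a terminating zero) plus k remainder bits. */
    static unsigned rice_bits(uint32_t v, unsigned k) {
        return (v >> k) + 1 + k;
    }

    /* Pick the smallest k such that the largest value in the buffer still
     * fits in at most 32 encoded bits. */
    static unsigned choose_k(const uint32_t *buf, int n) {
        uint32_t max = 0;
        for (int i = 0; i < n; i++)
            if (buf[i] > max) max = buf[i];
        unsigned k = 0;
        while (k < 32 && rice_bits(max, k) > 32)
            k++;
        return k;
    }

    int main(void) {
        uint32_t sample[] = { 3, 15, 8, 1023, 2, 7 };
        unsigned k = choose_k(sample, 6);
        printf("divisor = 2^%u, max value encodes to %u bits\n",
               k, rice_bits(1023, k));
        return 0;
    }

Raising k shrinks the unary quotient but lengthens the fixed remainder, which
is the balance the paragraph above describes.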
Blocking Implementation
The binary sequences of varying lengths produced by Golomb Rice are not
easily stored or decompressed by the C language’s native libraries. As a result, a
blocking structure was added when implementing the algorithm. A block is simply
a small portion of the input data, compressed in the same fashion as the rest.
The minimum number of bytes required to store the largest sequence in the block
is used to store every value within that block. If a particular block ends up
with sequences larger than a native integer, the hope is that the other, smaller
blocks in the compressed data will compensate. The compressed buffer as a
whole is stored in a char array, with the first char in each block used to store
the number of bytes required for each value in the block.
In the current implementation each block shares the same divisor. It is possible,
however, to give each block its own divisor. This could possibly yield better
compression as each block could compress its own values to the smallest
sequence possible. A caveat of this technique is that it leads to more overhead
when compressing the data. It also requires more header bytes to store the
divisor for each block. These header bytes could grow large enough to add the
equivalent of a complete integer to the compressed buffer.
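The block layout described above could look like the following sketch, which
assumes (purely as an illustration, not the project's exact format) four values
per block and little-endian byte order; the first char of each block records how
many bytes each value in that block needs.

    #include <stdio.h>
    #include <stdint.h>

    #define VALS_PER_BLOCK 4

    /* Pack values into blocks: the first char of each block stores the number
     * of bytes per value in that block, sized to fit the block's largest value. */
    static size_t pack_blocks(const uint32_t *vals, int n, unsigned char *out) {
        size_t pos = 0;
        for (int b = 0; b < n; b += VALS_PER_BLOCK) {
            int count = (n - b < VALS_PER_BLOCK) ? n - b : VALS_PER_BLOCK;
            uint32_t max = 0;
            for (int i = 0; i < count; i++)
                if (vals[b + i] > max) max = vals[b + i];
            unsigned bytes = 1;
            while (bytes < 4 && (max >> (8 * bytes)))  /* bytes for largest value */
                bytes++;
            out[pos++] = (unsigned char)bytes;         /* block header */
            for (int i = 0; i < count; i++)
                for (unsigned j = 0; j < bytes; j++)   /* truncated little-endian copy */
                    out[pos++] = (unsigned char)(vals[b + i] >> (8 * j));
        }
        return pos;
    }

    int main(void) {
        uint32_t vals[] = { 5, 9, 200, 3, 70000, 2, 1, 8 };
        unsigned char packed[64];
        size_t used = pack_blocks(vals, 8, packed);
        printf("packed %zu bytes (vs %zu raw)\n", used, sizeof vals);
        return 0;
    }

A single large value only inflates its own block, which is why the smaller
blocks can compensate for an occasional oversized sequence.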
Golomb’s Strengths
Golomb’s main strength is its ability to work on any kind of data formatting.
Run-Length Encoding works optimally on sequential data, where the delta
values are repeated frequently and consecutively. Golomb will compress data
effectively regardless of the range between values or where they are in relation
to each other. Golomb does not require additional data to be stored alongside the
compressed buffer, aside from the divisor that was used to compress the data.
For this reason, Golomb also lends itself well to parallel implementations. Any
thread can decompress a value in parallel as long as it is given the divisor.
Priming with Delta
An optimization added to the Golomb-Rice algorithm was first compressing the
data with Delta compression. Delta compression makes large values smaller by
only storing the differences between two adjacent values. If the data is formatted
in this way, the divisor chosen will prove to be more effective at compressing the
data overall. This method comes with several tradeoffs. It adds to the overhead
processes required to run the compression and decompression algorithms. It
also eliminates Golomb’s ability to be parallelized, since the buffer cannot be
decompressed without an anchor point.
4.4 Vertex Compression
Vertex compression is much more complex and as a result required much more
research and work to implement prototypes. The group researched many
different algorithms, including both prediction-based and non-prediction-based
algorithms. Of these, the group chose to focus on implementing BR compression
and LZO compression.
BR compression was chosen in an attempt to use an open algorithm that the group
could understand completely. BR was considered a good candidate because it uses
a one-pass predictive algorithm, meaning it would hopefully provide sufficient
compression and decompression speeds. BR was the first vertex compression
algorithm that the group attempted to implement.
LZO was chosen over other LZ77 algorithms such as DEFLATE due to its focus
on decompression time. This focus made LZO ideal for this project as it still
achieves a high level of compression while having a low decompression time.
There are numerous different LZO algorithms which all generate different levels
of compression; however, since all LZO1 algorithms use the same decompressor,
their decompression speeds are comparable to each other in terms of MB/s.
LZO1-1 was the version of LZO implemented in our testing environment.
Build, Testing and Evaluation Plan
Our original build plan is outlined in our objectives section. The first step was
to create the testing environment, followed by the implementation of basic
compression and decompression algorithms. Then, using the testing environment,
the group would test potential improvements to the algorithms and accept or
reject each modification depending on whether it improved the algorithm.
In order to test the performance of the algorithms the group used the testing
environment to run each algorithm on a set of sample data. Once run, our
environment gave us valuable test data such as how long it took to compress the
data, how long it took to decompress, and the compression ratio from original
data to compressed data. The group then plans to have this information saved to
a file for future analysis to determine whether the algorithm was an improvement.
In order to evaluate whether an algorithm is an improvement, the group compared a
few different fields of test data against other tests. The first and most
important check is that the algorithm remains lossless; if any data is altered or
lost, the algorithm has failed and must be thrown out. Next in importance is the
compression ratio, closely followed by decompression time. These two fields are
scrutinized the most, weighing the power of the compression against the speed of
decompression. They matter most because the group is looking for an algorithm
that achieves the greatest amount of compression for our data types while
maintaining a decompression time fast enough to run at runtime, when the data is
fetched from the buffer.
5.1 Version Control
When the group started this project, they knew that they would be working on the
same files at the same time. Regardless of whether the group is working on
different files in the project or they are even working on different parts of the
same file at the same time, it is necessary to keep the project in a single form
that all members contribute to. If the members attempt to work on their parts of
the project with complete independence with the intention of merging it all at a
later date, they may run into large compatibility problems when merging. A more
manageable solution is for the members to keep a central repository where the
code is stored. That is why the group decided to adopt a version control
solution for use during the project.
5.1.1 What is Version Control
Version control is a system that one or many people may use to manage the
changes that are made to a project. Although implementations for this concept
differ, the idea is that many people may work on the same project simultaneously
by creating a copy of the master project that all members contribute to, having
each person make their own changes to the parts of the project that they are
assigned to, and then finally updating the master version with their changes,
known as a “commit”.
Conflicts in version control may arise when multiple people attempt to edit the
same file and then both try to commit their changes back to the master version. A
robust version control system will alert a user who is committing over other
people’s changes that the commit may result in the loss of that work. A
comparison highlighting the differences between the two copies of the same
file, or a “diff”, may be provided, which the user can use to update their
version of the file with the other user’s contents. An example of a conflict
resolution can be seen in Figure 5.1. This conflict resolution system is
imperfect but much preferable to the alternative: users not being able to work
on the same file at the same time and having to manually make sure that it is
safe to commit the files they are working with.
As can be seen, using a version control solution was imperative for managing
the group’s project. It would allow the group to work on different parts of their
compression and decompression algorithms or testing environment at the same
time, and merge their code when they were done. Since the group members did
not only work on the project when they were meeting together, it was important to
have a centralized location that they would be able to store their source code that
was not dependent on transferring over some physical medium; storing their
code online alleviated that problem. Additionally, version control would allow the
group to revert commits that introduced bugs. Some bugs stem from errors in code
that are difficult to pin down, and if the commit also modified a large area of
code, fixing it directly may not be time efficient. Instead, the changes could
be redone using a different approach, or rewritten while the programmer is more
conscious of the errors that might occur.
Figure 5.1: Version Control: A file being changed and merged in a generic
form of version control.
5.1.2 Choosing a Version Control System
In the beginning stages of the project the group had not yet established whether
a public repository was a safe place to store data that could potentially
compromise their NDA if open to the public. The owners of the company that runs
GitHub, for example, are strong proponents of the concept of free information on
the internet. Therefore, if one signs up for a public account, their code is
freely available to be viewed by the public and is considered open source.
Knowing how important the role of version control would be in their project, the
group wanted to make sure that the solution they were choosing was the right
one for their task. The group considered many different factors when considering
which version control solution to use. They were aware that at some point they
could be working on different parts of code that existed in the same file, so they
knew that their version control solution must be good at alerting the users that
files needed to be merged before committing. Not doing so would lead to
situations where the code they were writing became fractured in ways that may
not be simple to fix. Since this project did not require internet access for any of its
functionality, the group wanted to be able to have a version control solution that
also did not require internet. This allows them the extra flexibility to work on the
project for long stretches of time where internet isn’t available, such as when
traveling.
Another aspect of version control that the group was interested in was the ease
of its use. Using an overly complicated version control solution is just as
undesirable as using one that isn’t robust enough for all of the group’s needs. A
system that takes as much time to learn and use as it saves is not a very useful
system in the end.
Finally, the group wanted to use a system that provided a secure means of code storage.
Because the project was being done for AMD, who had placed a nondisclosure
agreement on parts of the project, it was important for the group to be able to
control who was able to access the code.
                 Git    Svn    Dropbox    Google Drive
multiple users    ✔      ✔       X            ✔
offline           ✔      X       X            X
simple            X      ✔       X            ✔
secure            ✔      ✔       X            ✔
cross platform    ✔      ✔       ✔            ✔
Figure 5.2: Version control pros / cons: The different pros and cons of each
kind of version control.
The first version control solution the group looked at using was Subversion.
Subversion has many strengths; it is a very robust solution with powerful tools. Its
automated tools are useful for keeping track of entire projects at the same time. It
allows users to commit only the files that they are working on back to the master
version, simplifying the merging process. Subversion is also a secure solution,
as it uses a login system to control who can check out and merge changes into
the repository. Subversion is cross platform; it has a command line utility for
Linux and also has many powerful clients for Windows such as TortoiseSVN. Many
companies use Subversion in the workplace for their products, another indication
of its degree of usefulness. Subversion is also relatively simple to use; since
all users are committing to the same online repository, the process of
committing files is straightforward.
Git is another tool widely used for enterprise level code versioning. Like
Subversion, it has many powerful tools that allow many developers to work on
the same code simultaneously. Git operates by having each user create an entire
clone of the repository that they are working on to edit. This allows developers to
more easily work with the project’s revision history and will allow them to have full
control over the project while not connected to the internet. Additionally, since Git
repositories are distributed, loss of the master server will not hinder the project
members as the loss would for a group using Subversion. Like Subversion, it has
robust member management features, allowing restricted access to projects
being hosted online. Git is typically considered harder to use than Subversion.
Since the entire code repository is being committed when users make changes,
more complex commands must be used than are used in Subversion.
Dropbox is another candidate for use in software versioning. Dropbox is a service
which syncs folders to a cloud-based storage system. Accounts come with a
small amount of space in their databanks without any kind of subscription or
monetary commitment. Since it was designed to be a general use file versioning
program and not designed to be a code versioning program, it implements
certain functionality that programmers would find useful. For instance, if two
developers are working on a file simultaneously, Dropbox will not alert them that
the file’s code is diverging. Instead it will simply add the other file to the directory
alongside the original; a solution that is far from ideal. Dropbox is typically used
for smaller projects where the developers do not think that using a large scale
versioning program is necessary. Most do not recommend attempting to store
their projects on Dropbox.
Google Drive is also used for versioning, but in a different capacity. In
addition to Google Drive’s file syncing and backup functionality, the service is
able to host the group-editing of documents. Multiple people are able to
edit the same document simultaneously, greatly simplifying the document
creation process.
The group originally used Dropbox for file versioning. At the time, it was sufficient
for their needs to make sure that their code was backed up in some location, and
also simplified file sharing. With Dropbox’s sharing system, the group was also
not worried about problems regarding the security of the code. Since the group
could not yet verify the security of the hosting sites, they initially posted
their completed code to a shared folder in Dropbox. Only those with user
accounts who had been sent an invitation to the folder were allowed to view or
edit its contents. Later on the
group transferred their files onto the private GitHub they had obtained through a
student subscription. The group saw using Dropbox as a temporary solution until
they were able to decide on which dedicated file versioning program they would
use.
The group eventually decided that they would use Git as their version control
system. They chose it because of its robust feature set. Specifically, Git
provided the ability for the group to work offline with a full and distributed backup
system for their project which was essential for time-sensitive work. The group
used Google Drive to store their documentation. They also used the drive to
store a schedule of the things they had left to do on the project. This schedule
was formatted in a way that defined exactly who would work on what portion of
the project. They also maintained a document containing the minutes from each of
their meetings, both to preserve the events for future use and to provide an
overall picture of the pace the group was keeping in the progression of the
project, as well as for simple record keeping purposes.
The group saw the ability to all contribute to the same document simultaneously
as a unique and very useful ability. Due to the complex nature of word document
storage, a traditional versioning system like Git or Subversion would not suffice.
5.2 Test Runs
All tests were done using our testing environment. Each test run was performed
on a computer containing an Intel Core i7-4785T @ 2.20 GHz with 8 GB of RAM,
running Windows 8.1 Pro. Each test was run 10 times for each file and the
results were averaged.
5.3 Index algorithm development
5.3.1 Delta Encoding
The group decided to use delta encoding as the baseline compression rate for
our algorithm to compress index buffer data. It was chosen over the other
possible encodings as it is an easily implemented and very effective algorithm
when run on index data. Index buffers consist largely of sequential integer
values. This makes logical sense because a graphical object is more likely to be
defined by a series of vertices which are close together, rather than by vertices
on opposite ends of the graphical environment.
When delta encoding was run on the sample data and Run Length Encoding was then
run on top of the encoded data, the group found around a 2:1 compression ratio,
as seen in Figure 5.3. The original data is stored in a text file,
indexBuffer.txt, as seen in the table. This file contains a large number of
example index values. Delta encoding is run on the original data using our test
environment and, as seen in the table, the compressed data, saved to
indexBufferCOM.txt, is almost half the size of the original data.
Inside this test there were escape codes that had to stay in a separate format
from the rest of the data. In this test run the escape code was the unsigned
integer value equating to the signed integer -1. This escape code is used when
drawing triangle strips to indicate the end of one strip and the beginning of a
new one. The delta encoder must therefore contain a handler that checks whether
a value is an escape code or whether a delta value is a -1. To do this the group
encoded their own escape codes into the compressed list: when the encoder hits
the escape code, two numbers are added to the compressed list, in this case a -1
followed by a 1. If the delta between values equates to -1 (the value of the
escape code), the code instead pushes -1 and 0 to the compressed list,
indicating that it was an ordinary -1 rather than the escape code. These added
escape codes cause some increase in size, but even with this method the sizes
are still a huge improvement. By examining Figure 5.3 it can be observed that
the file size remains the same when comparing the original data file and the
decompressed file, and the lossless quality is ensured by running the original
and the resulting decompressed file through a checksum that compares the sum of
all the values of one file with the sum of the other.
File                 Data Status          File Size
indexBuffer.txt      Original Data        48 KB
indexBufferCOM.txt   Compressed Data      26 KB
indexBufferDEC.txt   Decompressed Data    48 KB
Figure 5.3: Index Buffer Delta Compression: Example of compressing index
buffer data using Delta Encoding.
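A sketch of the escape handling described above, using the pair conventions from
the text (-1 followed by 1 for a real reset, -1 followed by 0 for a delta that
happens to equal -1). The function name is illustrative, and the choice to store
the first index after a reset as a raw value is an assumption not spelled out in
the description.

    #include <stdio.h>

    #define RESET (-1)  /* the unsigned all-ones reset value, read as signed -1 */

    /* Delta-encode an index buffer, writing escape pairs so that real resets and
     * deltas that happen to equal -1 can be told apart when decoding.
     * Returns the number of values written to out. */
    static int delta_encode_escaped(const int *in, int n, int *out) {
        int m = 0, prev = 0, have_prev = 0;
        for (int i = 0; i < n; i++) {
            if (in[i] == RESET) {          /* real reset: emit -1 then 1 */
                out[m++] = -1;
                out[m++] = 1;
                have_prev = 0;             /* next index starts a new strip */
                continue;
            }
            int delta = have_prev ? in[i] - prev : in[i];
            if (delta == -1) {             /* ordinary delta of -1: emit -1 then 0 */
                out[m++] = -1;
                out[m++] = 0;
            } else {
                out[m++] = delta;
            }
            prev = in[i];
            have_prev = 1;
        }
        return m;
    }

    int main(void) {
        int indices[] = { 10, 11, 12, RESET, 20, 19, 21 };
        int encoded[16];
        int m = delta_encode_escaped(indices, 7, encoded);
        for (int i = 0; i < m; i++)
            printf("%d ", encoded[i]);     /* 10 1 1 -1 1 20 -1 0 2 */
        printf("\n");
        return 0;
    }

The decoder can reverse the process unambiguously: any -1 it reads is followed by
a flag byte telling it whether to treat the pair as a strip reset or as a delta
of -1.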
The delta compression algorithm implemented can be considered a “dumb”
implementation as it begins at the start of the array, with index zero set as the
anchor point and the only value unchanged from the original array. This algorithm
would use a large amount of time decompressing portions of the data which are
not required at that time. The group plans to redesign the algorithm to be “smart"
and include dynamic anchor points. This will allow the algorithm to run much
faster when accessing different parts of the buffer and allow for much faster
decompression times.
By using delta encoding first there was potential to use other encodings and
algorithms to further compress the data. In uncompressed index data there is
rarely a pattern of the same index repeated in a consecutive run, because a
single vertex cannot connect with itself to form a graphical object. This makes
run-length encoding an inadequate algorithm for the uncompressed index data.
However, after delta encoding is run on the index data there is a good chance
that many of the deltas between indices share the same value (if the vertices
all come after each other in the buffer). This allows run-length encoding to be
run on the delta encoded buffer, potentially compressing the already compressed
data even further.
The results of Delta encoding compounded with run length encoding are shown
below. As seen in Figure 5.7 the average compression rate for Delta combined
with run length encoding is 46.25%. The speed of the algorithm’s test runs can
be seen in Figure 5.5, and the average compression and decompression times are
0.83 and 0.76 milliseconds, respectively.
Figure 5.4: Delta RLE file size change
Figure 5.5: Delta RLE Compression and Decompression Time
Figure 5.6: Delta RLE Normalized Compression Speeds
Figure 5.7: Delta RLE Compression rates of different test files
Figure 5.8: Delta RLE Test Run Histogram
5.3.2 Golomb-Rice Encoding
The Golomb-Rice integer compression algorithm compresses the index buffers by
42.01% on average, as seen in Figure 5.12, when run on the same data as our
Delta-RLE algorithm. As seen in Figure 5.10, Golomb-Rice has an average
compression time of 14 milliseconds and an average decompression time of 14
milliseconds. Figure 5.11 displays the throughput, which averaged 4 MB/second
for compression and 6 MB/second for decompression.
Figure 5.9: Golomb-Rice file size change
Figure 5.10: Golomb-Rice Compression and Decompression Time
Figure 5.11: Golomb-Rice Normalized Compression Speeds
Figure 5.12: Golomb-Rice Compression rates of different test files
Figure 5.13: Golomb-Rice Test Run Histogram
5.3.3 Index Compression comparison
Figure 5.14 displays a comparison between Delta-RLE and Golomb-Rice
compression rates from our tests. It is important to note that the compression
rates of both algorithms remain relatively comparable throughout the tests.
However, due to Golomb’s slower decompression speeds it was deemed the less fit
algorithm for our project. It is still a valuable algorithm, as it has similar
performance when run on random data, which is not at all true of Delta-RLE.
Figure 5.14: Comparison between Delta-RLE and Golomb-Rice
Compression Rates
5.4 Vertex algorithm development
5.4.1 Test Data
Our test data was provided by AMD, and was produced by the output of a PERF
studio program designed to dump the contents of actual index and vertex buffers
of graphical objects the company performs tests with. These values were then
written into a text file, with each vertex receiving its own line. When the
group was given the vertex buffer data, some of the values had been printed to
the file in exponential form. This meant that the value had been printed out as
a decimal number similar to the actual data, but scaled by a power of ten in
order to keep the numbers within an arbitrary range. An example of this is using
0.113e^-3 to describe the float value 0.000113.
Before the group could begin work on the compression algorithm, they had to
ensure that all of the data was uniformly described exclusively by numbers. The
characters used for the exponential format forced the group to read in all of
the data from the text files as strings and then convert it to the proper
numeric data type. This required a parser to translate the string data into
float data and to interpret the exponential-formatted values as they were
encountered. Further information regarding this process is covered in the
section describing our testing environment’s File Reader (Section 5.5.1).
5.4.2 Vertex Algorithm Implementation
Because vertex data can contain both integer and float values a suggested path
to take for compressing the data is splitting up these two data types. This would
allow us to potentially run integer based compression algorithms on one section
of data, while running the more complex float compression algorithms on the
other section, which would then only contain float data. Another potential way
to compress this data is by representing the float values as strings. The
algorithm can then use a method such as the Burrows-Wheeler Transform to
reorganize the data and compress it using an encoding such as run-length.
The group implemented two algorithms for vertex compression. These two are
LZO and BR compression. Our tests have shown LZO to be a better candidate
for vertex compression as it yields better compression results and has faster
decompression speeds. LZO is also an LZ77-based algorithm, which gives insight
into the value this family of compressors holds and into the possibility that
other algorithms built on this compressor could give even better results in
future research.
5.4.2.1 LZO
In our tests LZO achieved a compression rate averaging 32.58%, as seen in
Figure 5.18, and compression/decompression times averaging 5.1 and 2.9
milliseconds respectively. It is important to note that, as seen in Figure 5.16,
the decompression times were always well below those of compression.
Figure 5.15: LZO File size changes
Figure 5.16: LZO Compression and Decompression times
Figure 5.17: LZO normalized compression speeds
Figure 5.18: LZO Compression rates of different test files
Figure 5.19: LZO test run histogram
5.4.2.2 BR
In our tests BR achieved a compression rate averaging 14%, as seen in Figure
5.20, and compression/decompression times averaging 9 and 7.6 milliseconds
respectively, as seen in Figure 5.21.
Figure 5.20: BR size changes
Figure 5.21: BR Compression and Decompression times
Figure 5.22: BR normalized compression rate, measured in MB/S
Figure 5.23: BR Compression rates of different test files
Figure 5.24: BR test run histogram
5.4.3 Vertex Compression comparison
Figure 5.25 displays a comparison between BR and LZO compression rates from our
tests. Unlike the results of the index compression algorithms, this comparison
has a clear winner. LZO has a consistently higher compression ratio and almost
double the compression rate of BR. BR was still valuable to the project, as
LZO’s algorithm was relatively unknown beyond being based upon the LZ77
compressor, whereas BR was described in full.
Figure 5.25: Comparison between BR and LZO Compression Rates
5.5 Test Environment
Our testing environment can be separated into 4 basic sections. These sections
are: the file reader, the tests that are run, the section that outputs the data into
our testing database and the actual compression and decompression algorithms
that will be implemented and tested. These four sections were developed to be
separate modules. This allows one section to be modified while the others are
held constant, without the risk of damaging the other sections. This is
especially important when implementing test algorithms.
5.5.1 File Reader
AMD provided a large amount of sample data from index and vertex buffers they
had worked with. It was acquired from PERF studio through a function that
dumped the contents of the buffers into folders and then compressed them into a
.zip file. The data itself was stored in the form of text files, and would therefore
have to be read into our test environment through a file reader.
The File Reader is separated into two separate functions, one designed to read
in index data and one designed to read in vertex data. Both functions take only
one parameter: the address of an integer. This address is used to return the
size of the resulting array by reference. This
information will be needed later because the functions which perform
compression and decompression on the data need to know the size of the array
storing it. The first function which reads in data from the index buffer files is
designed to return an array of integers after scanning the text files. The second
function is designed to return an array of strings for the vertex buffer because of
formatting complications which will be explained later.
The simple “fscanf” function from the C standard input/output (stdio) library was
used to read in the data from the text files in both functions. The function starts
by scanning in each value in the file without saving it to an array. While the
function passes over each value it keeps a tally of how many values are inside
the file. Once it has completed an entire scan of the document and knows how
many values are inside the file, the function dynamically allocates space for an
array that can hold the correct number of values. The data from the file is then
read in using the fscanf function once more. This time the data from the index
buffer files is stored in an array of integers, while the data from the vertex
buffer files is stored in an array of strings.
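A sketch of the two-pass index reader described above; the file format (plain
integers separated by whitespace) and the function name are assumptions based on
the description rather than the environment's actual code.

    #include <stdio.h>
    #include <stdlib.h>

    /* Read an index buffer from a text file of integers.
     * First pass counts the values, second pass stores them; the count is
     * returned by reference through out_size, matching the described interface. */
    static int *read_index_buffer(const char *path, int *out_size) {
        FILE *fp = fopen(path, "r");
        if (!fp) return NULL;

        int value, count = 0;
        while (fscanf(fp, "%d", &value) == 1)   /* pass 1: tally values */
            count++;

        int *buf = malloc((size_t)count * sizeof *buf);
        if (buf) {
            rewind(fp);                         /* pass 2: store values */
            for (int i = 0; i < count; i++)
                fscanf(fp, "%d", &buf[i]);
        }
        fclose(fp);
        *out_size = buf ? count : 0;
        return buf;
    }

    int main(void) {
        int n = 0;
        int *indices = read_index_buffer("indexBuffer.txt", &n);
        if (indices) {
            printf("read %d indices\n", n);
            free(indices);
        }
        return 0;
    }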
The group had to read in the vertex buffer data as strings because some of the
values from the vertex buffers were formatted in exponential form, whereas other
values were expressed as plain floats. A parser was designed to read through
each string in the array, detect when it was in exponential form, and then
convert the data into the more
recognizable decimal numbers commonly found in a vertex buffer. Any values
that were not in the exponential form were simply transformed into floats using
the C library’s “strtof” function. These values were then stored into a float array.
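One way such a parser could work is sketched below. The handling of the caret in
values like "0.113e^-3" is an assumption based on the example formatting shown
earlier; once the caret is removed, the remaining text is standard C float
syntax that strtof already understands, so plain floats need no special case.

    #include <stdio.h>
    #include <stdlib.h>

    /* Convert one vertex value from its string form to a float.
     * Stripping any '^' leaves standard e-notation, e.g. "0.113e-3" -> 0.000113f. */
    static float parse_vertex_value(const char *s) {
        char tmp[64];
        size_t j = 0;
        for (size_t i = 0; s[i] != '\0' && j + 1 < sizeof tmp; i++)
            if (s[i] != '^')
                tmp[j++] = s[i];       /* drop the caret, keep everything else */
        tmp[j] = '\0';
        return strtof(tmp, NULL);      /* handles plain floats and e-notation */
    }

    int main(void) {
        printf("%.6f\n", parse_vertex_value("0.113e^-3"));  /* 0.000113 */
        printf("%.6f\n", parse_vertex_value("1.5"));        /* 1.500000 */
        return 0;
    }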
5.5.2 Compression and Decompression algorithms
The most important part of our test environment is the ability for prototype
compression and decompression algorithms to be implemented easily and
quickly into the environment. By writing the algorithm as a function that takes in
the buffer object an algorithm can be plugged into the rest of the environment
without modifying the rest of the environment. The way the environment was
written, the group is able to design these algorithms in separate C files, and then
80
call these files in the main function of the testing environment which ties every
part of it together.
By writing in this modular fashion the group is able to focus on the actual
algorithm instead of worrying about introducing errors into the rest of the
environment. This modular approach also allows each group member to develop
separate optimizations in parallel and test them all in the same environment by
easily plugging their code into the algorithms section of the environment.
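One way this plug-in interface could look is sketched below, assuming (purely as
an illustration; the environment's real types may differ) a buffer struct and a
pair of function pointers per algorithm. Swapping in a new algorithm then only
means filling in a new Algorithm record.

    #include <stdio.h>
    #include <stddef.h>

    /* A buffer of raw values plus its size, as handed to each algorithm. */
    typedef struct {
        int    *data;
        size_t  count;
    } Buffer;

    /* Each candidate algorithm only supplies these two entry points; the rest
     * of the environment never changes when a new one is plugged in. */
    typedef struct {
        const char *name;
        size_t (*compress)(const Buffer *in, unsigned char *out);
        size_t (*decompress)(const unsigned char *in, size_t n, Buffer *out);
    } Algorithm;

    /* Trivial "copy" algorithm used as a stand-in for a real compressor. */
    static size_t copy_compress(const Buffer *in, unsigned char *out) {
        size_t bytes = in->count * sizeof(int);
        for (size_t i = 0; i < bytes; i++)
            out[i] = ((const unsigned char *)in->data)[i];
        return bytes;
    }
    static size_t copy_decompress(const unsigned char *in, size_t n, Buffer *out) {
        for (size_t i = 0; i < n; i++)
            ((unsigned char *)out->data)[i] = in[i];
        out->count = n / sizeof(int);
        return out->count;
    }

    int main(void) {
        Algorithm alg = { "copy", copy_compress, copy_decompress };
        int raw[4] = { 1, 2, 3, 4 }, round[4];
        Buffer in = { raw, 4 }, back = { round, 0 };
        unsigned char scratch[64];

        size_t n = alg.compress(&in, scratch);
        alg.decompress(scratch, n, &back);
        printf("%s: %zu bytes, round-trip first value %d\n", alg.name, n, round[0]);
        return 0;
    }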
5.5.3 Testing code
Our actual testing code consists of 3 different tests: a time test, a compression
ratio test, and a lossless integrity test. The first of these, the time test,
works by recording a start time after the test data is read into a buffer array
but before the compression or decompression algorithm is run. It then runs the
compression algorithm being tested. Once finished, it records an end time
indicating how long compression took. The environment then records yet another
time to indicate when decompression started. Once decompression completes it
records this time as well, allowing the calculation of how long it took to
decompress the data. Finally, it subtracts each start time from its
corresponding end time to find the total time the respective stage of the
algorithm has taken. Currently, the group is recording the times using the C
standard library’s time functions, with durations reported in milliseconds.
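A sketch of the timing harness is shown below. It uses clock() from <time.h> for
sub-second resolution; the document's environment uses the C library's time
functions, so the exact call is an assumption, and the compression routine here
is only a placeholder.

    #include <stdio.h>
    #include <time.h>

    /* Placeholder for the algorithm under test. */
    static void compress_stub(void) {
        volatile long x = 0;
        for (long i = 0; i < 1000000; i++) x += i;  /* simulated work */
    }

    int main(void) {
        clock_t start = clock();       /* recorded after the data is read in */
        compress_stub();               /* run the compression being tested */
        clock_t end = clock();

        double ms = 1000.0 * (double)(end - start) / CLOCKS_PER_SEC;
        printf("compression took %.3f ms\n", ms);
        return 0;
    }

The same start/end pattern is repeated around the decompression call, giving the
two durations that the test run reports.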
The second test is the compression ratio test. Currently the group is manually
checking the output file sizes to measure the difference in size between the
original data and the compressed data. We wish to have more precise data,
however, and will be implementing a method to calculate this in code,
potentially using the C++ vector data structure and some arithmetic to
calculate the size of the resulting data. By dividing the size of the compressed
data by the size of the uncompressed data, the group gets a compression ratio
used to compare just how effective the algorithm is.
Finally, the test for whether the algorithm is indeed lossless is run every time
using a checksum. This checksum tells whether the original data is exactly like
the decompressed data. The chances of two different lists of data having the
same checksum are extremely low, and because of this the group is confident that
it is a sufficient test of whether the two lists are identical. If the
decompressed data is not the same, a warning message is displayed to the
console; an example of this is shown in Figure 5.26. Once the code runs and
finishes, the timing data collected from the test run is displayed to the
console, as seen at the bottom of the figure. Note that the test data used in
the example displayed in the figure is very small, and as a result compression
and decompression took less than a measurable amount of time to complete. The
collected data will also be written to a file or database for further comparison
and analysis.
Figure 5.26: Example Testing Environment Output: Example output
produced by our testing environment, including the performance
measures.
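The check itself can be as simple as the sketch below, which mirrors the
sum-of-values comparison the text describes (stronger hashes exist, but a plain
sum follows the description); the arrays and names are illustrative.

    #include <stdio.h>
    #include <stdint.h>

    /* Simple sum-of-values checksum, as described above. */
    static uint64_t checksum(const int *vals, int n) {
        uint64_t sum = 0;
        for (int i = 0; i < n; i++)
            sum += (uint64_t)(uint32_t)vals[i];
        return sum;
    }

    int main(void) {
        int original[]     = { 10, 11, 12, -1, 20, 19, 21 };
        int decompressed[] = { 10, 11, 12, -1, 20, 19, 21 };

        if (checksum(original, 7) != checksum(decompressed, 7))
            printf("WARNING: decompressed data does not match the original!\n");
        else
            printf("lossless check passed\n");
        return 0;
    }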
5.5.4 Data Printer
The final part of the testing environment being developed is the data writer.
This code is designed to format the test results of the program. The test
results are output both to the screen during debug mode and to a database file
during testing mode. This allows us to keep a set format in which the data is
organized, to compare all the different tests with each other, and to organize
the data into graphs and figures that display the findings more effectively. An
example of the data being output can be seen in Figure 5.27.
Figure 5.27: Additional Testing Environment Output: Full performance
metrics used for determining algorithm statistics.
Administrative Content
6.1 Consultants
6.1.1 AMD
The project was originally proposed and is sponsored by Advanced Micro
Devices (AMD). They are one of the two main Graphics Card research and
development companies for mainstream computing and gaming. Their graphics
cards are also featured in most gaming consoles and they have a wide variety of
personal use GPUs and workstation GPUs that would benefit greatly from our
project.
AMD has been the group’s main consultant when it comes to how graphics cards
work and how the data the group will be compressing is used within the graphics
pipeline. Specifically the group’s main contacts at AMD are Todd Martin and
Mangesh Nijasure. They are also the main consultants when it comes to what our
project needs to do in terms of requirements and specifications. Additionally,
they have provided us with all of our initial test data and with programs to
generate additional test data if needed.
6.1.2 Dr. Sumanta N. Pattanaik
Dr. Sumanta Pattanaik is an associate professor at UCF who teaches Computer
Graphics. He provided the group with a crash course covering the basic
background knowledge of how computer graphics are computed and programmed. He
has also been helpful in explaining both vertex and index information and how we
can potentially compress it with our algorithms, and he gave us good ideas on
where to start researching lossless compression algorithms and which algorithms
did not seem helpful towards the development of successful compression
algorithms.
6.1.3 Dr. Mark Heinrich
Dr. Heinrich is an associate Professor at UCF who conducts research focused on
computer architecture. He is also in charge of the Computer Science senior
design class. He has been very helpful in keeping our project on track and
making sure the group does not fall too far behind and run out of time. He has
also been very helpful in contacting other professors to ask for assistance with
our project.
6.1.4 Dr. Shaojie Zhang
Dr. Shaojie Zhang is an Associate Professor of Computer Science at UCF. He
conducts research with DNA simulation and analysis. He provided some direction
in terms of compression algorithms to use with both index and vertex data.
6.2 Budget
Our client for this project, AMD, is also our major sponsor. They have
contributed a fund of $2000 in order to ensure that the project can be completed
without the costs of required equipment and software getting in the way. The
group has several possible expenditures which these funds will go towards
covering.
6.2.1 A Graphics Processing Unit.
A graphics card manufactured by AMD containing the most recent iteration of
their Graphics Processing Unit in order to test our algorithms with the most
fidelity.
Unit       Average Price   Graphics Memory
R9 295X2   $999.99         Up to 8GB GDDR5
R9 290X    $349.99         Up to 8GB GDDR5
R9 290     $259.99         Up to 4GB GDDR5
R9 285     $244.99         Up to 4GB GDDR5
R9 280X    $229.99         Up to 3GB GDDR5
R9 280     $174.99         Up to 3GB GDDR5
R9 270X    $159.99         Up to 4GB GDDR5
R9 270     $139.99         Up to 2GB GDDR5
Figure 6.1: AMD R9 Graphics Cards: A side-by-side price and performance
comparison. More information on this series of graphics cards is
provided in the appendices. Reprinted with permission.
6.2.2 Version control
Version control in the form of GitHub or another site offering private
repositories could be required. The group may come into contact with sensitive
material that AMD is working on as the project progresses. In order to abide by
our Non-disclosure Agreement and preserve this sensitive data, the group would
need a private repository from one of these sites. Although public repositories
are free to use, private repositories require a subscription with a monthly fee.
The entire project will last from August 18, 2014 until May 2, 2015, requiring a
minimum of 10 months of subscription time.
Plan Name   Private Repositories   Subscription Fee   Overall Cost (10 months)
Free        0                      $0 / month         $0
Micro       5                      $7 / month         $70
Small       10                     $12 / month        $120
Medium      20                     $22 / month        $220
Large       50                     $50 / month        $500
Figure 6.2: GitHub Personal Plans: The potential cost of a subscription to a
GitHub personal account.
Plan Name   Private Repositories   Subscription Fee   Overall Cost
Free        0                      $0 / month         $0
Bronze      5                      $25 / month        $250
Silver      10                     $50 / month        $500
Gold        20                     $100 / month       $1000
Platinum    50                     $200 / month       $2000
Figure 6.3: GitHub Organization Plans: The potential cost of a subscription
to a GitHub organization account.
6.2.3 Algorithm Licenses
In order to use certain patented algorithms, purchasable licenses could be
necessary since AMD is planning to use our work in their commercial product.
As of yet the group has not been required to purchase a license in order to make
use of the algorithms in our project. Huffman Encoding, Run-length Encoding, and
Delta Encoding are not patented algorithms and are therefore free to use in this
context.
However the group may yet find an algorithm that is patented which they will be
required to pay to use. This will likely be found in our research concerning
compression of the vertex buffer, if it is indeed found at all.
6.2.4 Document Expenses
For this class this paper was required to be printed and bound professionally,
which had to be done by an outside party. Luckily, the UCF campus hosts a
professional graphic design firm called “the Spot.” They are best known by the
student body as a place to print papers and, relevantly, to get papers bound. We
contacted the Spot for a quote on how much it would cost to get our paper
printed and bound. We used the final design document created by the previous
year’s senior design students as an example of what the group would be printing.
Their paper had fifty-four pages without color and forty-three pages with color.
The Spot quoted us the cost of printing each page. The rate they charge to print
a document without color is ten cents per page. The rate they charge to print a
document with color is forty-nine cents a page. The total is calculated in the
table below:
Item                          Rate             Cost
Black and White Impression    $0.10 / page     $5.40
Color Impression              $0.49 / page     $21.07
1 small spiral bind           $4.50 flat fee   $4.50
Total Cost                                     $30.97
Figure 6.4: The Spot Pricing: Quote detailing the cost to print a document.
6.2.5 Estimated Expenditures
Figure 6.5 shows a pie chart detailing what percentage of the budget the group
expected to spend on each necessary item:
Figure 6.5: Estimated Expenditures Pie Chart.
As shown the group expected to go with the Medium Personal level subscription
with GitHub for version control. We also planned to buy the R9 950x Graphics
Card from AMD.
6.2.6 Actual Expenditures
Figure 6.6: Actual Expenditures Pie Chart
GitHub offers free private repositories for university students. This guarantees
that the group can protect the NDA while still being cost-effective with project
expenses.
As the project progressed it was decided that a graphics card was not required
to gather the sample data needed to test our algorithms, thanks to the test data
provided by our sponsor, AMD, and existing data found on the web.
The poster used for Senior Design Day was professionally produced and of a
proper size to easily convey our results. This made the poster the largest item
in our budget, costing $140.00. Even with the poster and the cost of getting the
final document professionally printed, this project still came in well under
budget.
6.3 Project Milestones
6.3.1 First Semester
This project has been split up into two semesters of work. The first semester
comprised primarily research and the design of the initial algorithms and test
environment. This semester’s milestones are displayed in Figure 6.7, a timeline
running from the beginning of the semester to the end, which is marked by the
completion of the initial design documentation, the final milestone for this
semester.
6.3.1.1 Research
The first milestones involved the completion of basic research into graphics and
compression algorithms. This process took around 3 weeks to get a good
enough understanding to quantify it as a milestone even though it technically
continues throughout the whole project’s development. Research first focused on
gaining knowledge of what vertex and index data is comprised of, as well as how
these two data types are used when drawing graphics to screen. Researching
vertex and index data also required the group to research and learn the basics of
the rest of the graphics pipeline which was accomplished through both online
research and a crash course given by Dr. Sumanta Pattanaik.
Once the group gained a good foundation in graphics and in the data they were
tasked with compressing, research turned towards lossless compression
algorithms. This involved first learning the basics of how encoding and
compression reduce the size of data without damaging it. The group then focused
on different algorithms, starting with ones that work on integer-based data, as
these would be the easiest to prototype with index data, before turning to float
compression, which turned out to be much more difficult. In the end, however,
enough research was done and enough knowledge gained to mark this as a completed
milestone in the project.
6.3.1.2 Testing Environment
With initial research completed the group then focused on the development of the
testing environment which took just shy of 3 weeks to get set up and running.
The design of the environment mainly consisted of the basic ideas the group
wanted to implement in it. The actual development and coding
of the environment was modularized and split among group members to increase
the speed at which it was completed. Once all group members completed and
debugged their sections they were integrated with the others and debugged
again as a whole.
6.3.1.3 Index Algorithm Prototype
With the environment set up, the group’s next milestone, the completion of the
initial prototype of the index data compression and decompression algorithm, was
accomplished in about a week and a half. As mentioned before, this was done
using the modified delta encoding algorithm. The implementation of this
prototype went smoothly; the coding and testing needed to iron out any errors
took around a week, finishing on November 13th.
6.3.1.4 Vertex Algorithm Prototype Attempt
With the prototype for the index algorithm implemented and tested the group then
turned towards the design of the vertex algorithm. The group spent almost two
weeks attempting to design an algorithm that would work well with vertex data;
however, they ran into problems and had to return to researching different
compression methods in order to create an algorithm that would run efficiently
on vertex data. This process took around a month of the first semester’s time
and would not be completed until the first few weeks of the second semester. In
order to get the final design documentation finished before the end of the
semester, the group decided that focus had to shift towards completing the
document instead of further work on the vertex algorithm.
Figure 6.7: First Semester Milestones: Milestone Timeline of the First
Semester of the Project.
6.3.2 Second Semester
This semester comprised the actual development and optimization of the project’s
algorithms and concludes with the presentation of the finalized project at the
end of the semester. The milestones for the second semester are displayed in
Figure 6.8, which, like Figure 6.7, displays a timeline from the start to the
end of the semester.
6.3.2.1 Vertex Algorithm Prototype
Due to previously mentioned difficulties with the design of the vertex
compression algorithm during the first semester the beginning of the second
semester focused on getting a prototype vertex compression and decompression
algorithm designed and implemented in code. As seen in the timeline there were
3 weeks of semester time allocated towards finishing research and designing an
algorithm to be used on vertex information. Some research on compressing
vertex data also occurred before the semester started and is not displayed on the
timeline. The week after design was focused on implementing the algorithms into
code.
6.3.2.2 Optimization of algorithms
With both algorithms’ prototypes implemented in code the focus of the group
turned to optimizing the algorithms to run faster and more efficiently. This was
planned to take the largest amount of time and around 4 to 5 weeks were
allocated towards this task. Because of the algorithms chosen, this took longer
than expected, and the time allocated to implementing them on the GPU was
re-allocated to further optimization and research.
6.3.2.3 Implementation on GPU
Because converting the decompression algorithms to use the GPU’s resources was a
stretch goal, its time was re-allocated to further research and the development
of optimizations for the implemented algorithms. As mentioned before, this would
most likely have been done using the C++ AMP model to test the algorithms
without implementing shader code. Because this step was not vital to the
completion of the project’s goals, the two weeks that had been allocated to
finishing and testing speed and efficiency on the GPU were instead used for
further optimization of the algorithms.
6.3.2.4 Completion of project and project documentation
The remaining time will be spent on finalizing the algorithms as well as finishing
up the final design documentation in order to prepare to present the finished
project to AMD and the chosen UCF faculty that will judge the project’s outcome.
This presentation marks the closing of the project.
Figure 6.8: Second Semester Milestones: Milestone Timeline of the second
Semester of the Project.
Summary/Conclusion
7.1 Design Summary
The goal of this project was to identify lossless compression algorithms that
compress vertex and index buffer information. The way the algorithms are
designed, data is first compressed offline and saved to the system’s main
memory. The compressed data is then loaded into the respective buffers as if it
were normal data. When the information is fetched from the buffer it is
decompressed using our decompression algorithm and used normally by the rest of
the graphics pipeline.
As mentioned before, the compression algorithms are designed to run offline, at
compile time of the 3D object or of the program using the 3D object, in order to
avoid time and resource constraints and achieve a better compression ratio. The
decompression is designed to be done at runtime on the graphics card when data
is fetched from either the index or vertex buffer. The group is using a modified
delta encoding algorithm and an implementation of Golomb-Rice to compress and
decompress the index data, and an implementation of BR and LZO1-1 on vertex
data. The outcome of the project is an increase in the efficiency and speed of
graphics cards without heavily modifying existing standards.
7.2 Successes
7.2.1 Initial Research
Throughout the project work went very smoothly. Initial research on graphics was
greatly sped up with the aid of Professor Sumanta Pattanaik, Todd Martin, and
Mangesh Nijasure. Because only one group member had previous experience
with graphics, professor Pattanaik gave a very helpful crash course in the basics
of computer graphics programming as well as vertex and index data formatting
and use when drawing graphics to the screen. Through his aid and with the help
of AMD’s Todd Martin and Mangesh Nijasure the group gained a solid
understanding of the basics of graphics.
Professor Sumanta Pattanaik was a large help when the group was trying to
understand the basics of computer graphics. He explained the basics of how the
graphics pipeline functioned, how index and vertex data is used within the
pipeline, and how this data is usually structured and formatted. This information
was vital to our understanding of how our project fit into the computer graphics
environment. He also gave the group some ideas on what algorithms to start
looking at.
In addition to research on graphics, the group also had to do a lot of research
on data compression, specifically lossless compression. Again, with the help of
Todd Martin and Mangesh Nijasure, the group was able to research specific
algorithms to implement on these types of data and gain a better understanding
of what to look for when searching for compression algorithms to use on vertex
and index data. Professor Pattanaik also aided in giving us good ideas about
which algorithms to focus on and which wouldn't be as helpful to the project.
7.2.2 Testing Environment
Another milestone that the group completed was the development of the testing
environment. The group came together and quickly got it up and running within a
few days of its design. Testing with the environment also proved to work very
well and made the creation of uniform test data much easier. In addition to
making test data easier to gather, the way the environment was designed allows
the group to plug in new tests and test data very easily.
The group was able to quickly complete most of the milestones that they were
aiming for during the first semester. They were able to quickly create the tools
that they needed to obtain data for use in testing their algorithms such as the
integrity checking tool, the file scanner, and basic performance data analyzer.
7.2.3 Index Compression
One of the main successes of this project was the development of the index
compression and decompression algorithms. By using a modified delta compression
algorithm and an implementation of Golomb-Rice, the group was able to create
algorithms that run very fast and compress data to around half its size.
7.2.4 Vertex Compression
Even though some of LZO’s algorithm is unknown to us at this time, its
performance was very good and gave the best results of the implemented
algorithms. BR was implemented in an attempt to have a completely open algorithm
available to build optimizations on top of in order to get better results. Although
overall BR compression is a generally decent compression method, it has some
flaws that lead us to believe that LZO compression may be a more suitable
algorithm in essentially every category.
The first and most major concern is that in all statistical categories, LZO simply
outperforms BR encoding. In terms of both compression rate and speed and also
decompression rate and speed, LZO provides more satisfactory results. For
example, LZO compressed the vertex data by an average of 32%, while BR
compression only yielded an average of 14%. Additionally, in terms
of our most valued metric, decompression time, LZO far outperforms BR
compression. LZO yields a staggering 606 MB/S on our test dataset, compared
to BR’s 210 MB/s decompression rate.
Aside from performance metrics, LZO is able to perform decompression without
much overhead. In comparison, BR compression requires a hash table to be
included in the header for each compression block that is sent through the
pipeline.
7.3 Difficulties
The first issue the group ran into was the incompatibility of code between some
group members’ computers and the testing environment. This was due to group
members using different IDEs and operating systems. One group member was
coding the test environment in Code Blocks, while another was coding in Visual
Studio, both of these IDEs have compilers which allow for certain syntax rules of
the C language to be ignored in favor of easier usability. As a result, when our
final group member attempted to compile the environment in the Linux gcc
compiler it would not work. The program would fail to compile, despite showing
no errors in the IDEs. This issue has since been resolved as all of the syntax
violations have been corrected.
One of the group’s major difficulties was that the group all had very little
experience in working with the graphics pipeline prior to the project’s inception.
One group member had taken a course concerning computer graphics, but one
single-semester course does not provide a full working knowledge of its subject
matter. This caused several misconceptions to arise over the course of working
on the project.
Index buffers were fairly straightforward; however, the sample data the group was
given contained something which confused us. The index data was formatted in
such a way that between certain sets of values was an unsigned value that
equated to negative one. This acted as a reset value to tell the graphics card that
this was where one graphical object stopped and another began. This caused
problems for our initial delta compression algorithm, as negative one is a value
that is commonplace in most outputs. We had to format the index data in such a
way that when it is being compressed it converted these reset values to an
escape character rather than just a regular integer.
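One way to handle such a reset value during delta encoding is sketched below; the escape constant, the assumption of 32-bit indices, and the function name are hypothetical rather than the group's exact scheme.

    /* Sketch: substitute an escape symbol for primitive-restart markers so
     * they are never confused with an ordinary delta of -1. Hypothetical
     * constants; assumes 32-bit indices. */
    #include <stddef.h>
    #include <stdint.h>

    #define RESET_INDEX  0xFFFFFFFFu  /* the "negative one" restart value      */
    #define ESCAPE_DELTA INT32_MIN    /* reserved symbol, assumed never to be
                                         produced by real index data           */

    /* Writes one signed symbol per input index into out[]. */
    void delta_encode_with_reset(const uint32_t *idx, size_t n, int32_t *out) {
        uint32_t prev = 0;
        for (size_t i = 0; i < n; i++) {
            if (idx[i] == RESET_INDEX) {
                out[i] = ESCAPE_DELTA;          /* keep the marker distinct    */
                /* prev is left unchanged: the next real index is still
                   encoded relative to the last real value seen                */
            } else {
                out[i] = (int32_t)(idx[i] - prev);
                prev = idx[i];
            }
        }
    }

The decoder simply emits the restart marker whenever it reads the escape symbol and otherwise accumulates deltas as usual.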
Vertex buffers proved difficult to understand from the beginning. The group was
unsure whether the different data types would be consistently present
throughout the whole input. The position data is guaranteed to be there, but the
group was informed that the color and normal vector data would not always be
present. The group was not sure whether that meant some inputs would lack these
values entirely, or that within a single input certain vertices would have them
and others would not. The group discovered it was the latter, which makes our
input data very inconsistent and therefore more difficult to compress. The input
itself contained another issue to tackle: some of the values were formatted in
exponential form. This required us to read the values in as strings and then
create a parser to convert them from strings into floats.
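A parser along these lines can lean on the C standard library, as in the sketch below; the tokenizing approach and function name are assumptions, not the group's exact parser.

    /* Sketch: convert whitespace-separated vertex fields, possibly written in
     * exponential form such as "1.25e-03", from text into floats. */
    #include <stddef.h>
    #include <stdlib.h>

    /* Fills out[] with up to max_vals floats parsed from line; returns count. */
    size_t parse_vertex_line(const char *line, float *out, size_t max_vals) {
        size_t count = 0;
        const char *p = line;
        char *end;
        while (count < max_vals) {
            float v = strtof(p, &end);   /* accepts "1.0", "-2.5e+01", etc. */
            if (end == p)                /* no further numeric fields       */
                break;
            out[count++] = v;
            p = end;
        }
        return count;
    }

For example, parse_vertex_line("0.5 -1.2e-01 3.0", pos, 3) fills pos with the three position components.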
The group also had difficulties with our version control system on GitHub. None
of us had used Git before this semester began, which meant the group had to
acquaint itself with the user interface and the commit system. The main issues
came from transitioning to a new system, since the group had previously been
using Dropbox for version control. Dropbox synced automatically and was fairly
easy to navigate. When the group first committed changes, it seemed as though
the interface would display all of the shared folder’s contents, similar to how
Dropbox presents the shared data as a folder. In reality, the system simply
displays which items changed in each commit; the actual contents of the folder
live in the Git workspace on each person’s hard drive.
Most of the other difficulties were small misconceptions the group had about
certain aspects of the project. At one point, a group member believed that
compressing the index and vertex buffers at run time was a non-negotiable
requirement of the project. The group decided this was a stretch goal but not
vital, since the group wanted the highest compression ratio from the algorithm
rather than the fastest compression time. Another misconception was that the
index and vertex buffers were filled simply by what the user displayed on their
screen. In reality, what goes into the buffers is handled by shader code, which
dictates the behavior of a camera-like entity pointed at the graphical object.
The group was also unsure whether our algorithm would need to be designed
around parallelism. Parallel algorithms have to be designed in a particular way
from the ground up, so knowing the answer to this question early was vital to
progressing with the project. In the end, our sponsor decided it was better to
deliver the base algorithm first, so that they could expand upon it later.
Most, if not all, of these types of problems were handled by simply contacting the
project’s sponsor, AMD. Even when they did not have an immediate solution to
our problem, the group was easily able to work things out through discussion and
compromise.
7.3.1 Vertex Compression Difficulties
The largest difficulty the group had during this semester was the research and
design of a vertex data compression and decompression algorithm. Many
existing algorithms are either not lossless or not easily implemented, which
caused the group to rethink the design many times. The group chose to commit
to two different algorithms: BR, a predictive algorithm, and LZO, which is based
on the popular LZ77 compressor.
The main issue with vertex compression is the non-uniformity of the data. Every
time a potential algorithm was developed, a test case would be found that
rendered it useless and, as a result, unfit for the project’s goals.
7.3.2 Future Improvements
These are ideas, concepts, and tweaks for the project that the group was not
able to implement in the time they had. Although time constraints prevented
implementing them, these possibilities may be beneficial to explore.
Other methods of optimizing the vertex data for storage were researched, such
as converting vertex information like color data into tables that represent it
more efficiently, as sketched below.
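As a hypothetical illustration of that idea (not a design the group implemented), repeated color values could be replaced by small indices into a shared color table:

    /* Sketch: build a table of distinct RGBA colors and store one small index
     * per vertex instead of four floats. Exact bitwise equality is intended,
     * since the transformation must stay lossless. */
    #include <stddef.h>
    #include <stdint.h>

    typedef struct { float r, g, b, a; } Color;

    /* Returns the number of distinct colors written to table[]. */
    size_t build_color_table(const Color *colors, size_t n_verts,
                             Color *table, uint16_t *indices) {
        size_t n_table = 0;
        for (size_t v = 0; v < n_verts; v++) {
            size_t j = 0;
            while (j < n_table &&
                   !(table[j].r == colors[v].r && table[j].g == colors[v].g &&
                     table[j].b == colors[v].b && table[j].a == colors[v].a))
                j++;
            if (j == n_table)
                table[n_table++] = colors[v];  /* new distinct color      */
            indices[v] = (uint16_t)j;          /* 2 bytes instead of 16   */
        }
        return n_table;
    }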
The group originally considered using a C++ parallelization library named C++
Accelerated Massive Parallelism (C++ AMP), which allows code that runs on the
GPU to be written quickly without writing shader code or implementing hardware
on the graphics card itself. It is not easily apparent when evaluating a
compression algorithm whether it can be parallelized easily; considerable
analysis of each algorithm must be done. Although the group did not have time
to evaluate their algorithms for parallelizability, it is an important property
for an algorithm to have.
Another system that the group believes would provide a performance benefit is
the ability to switch between different algorithms based on how the data was
encoded. The idea is that any number of compression algorithms can be used in
tandem to best encode the data when running offline; not just one has to be
used. However, the compression program will not initially be aware of the
qualities of the data it is trying to analyze, so the program must first
familiarize itself with the patterns that exist within the data. It will scan
through the data and produce a score estimating which algorithm is best suited
for storage.
Many different methods can be devised to perform this functionality, although
the group has not yet implemented it in their project. One possible and simple
method is to take a small section of the data and attempt to compress it using
each of the available algorithms, then use the algorithm that compressed the
sample most efficiently to compress the entire file; a sketch of this approach
is given below. Alternatively, a non-heuristic method can be used in which the
program simply compresses the file using every available compression algorithm
and keeps the result that is best. The group is hesitant to use this
implementation, however, as it carries significant performance penalties.
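A minimal sketch of the sampling heuristic is shown below, assuming a generic compressor interface; the function-pointer type, names, and sample size are hypothetical.

    /* Sketch: choose a compressor by trial-compressing a small leading sample
     * of the data. Interface and names are hypothetical. */
    #include <stddef.h>

    typedef size_t (*CompressFn)(const unsigned char *in, size_t in_len,
                                 unsigned char *out, size_t out_cap);

    typedef struct {
        const char *name;
        CompressFn  compress;   /* returns compressed size in bytes */
    } Compressor;

    /* Returns the index of the compressor that shrank the sample the most. */
    size_t pick_best_compressor(const unsigned char *data, size_t len,
                                const Compressor *algos, size_t n_algos,
                                unsigned char *scratch, size_t scratch_cap) {
        size_t sample_len = len < 4096 ? len : 4096;   /* small leading sample */
        size_t best = 0, best_size = (size_t)-1;
        for (size_t i = 0; i < n_algos; i++) {
            size_t out_size = algos[i].compress(data, sample_len,
                                                scratch, scratch_cap);
            if (out_size < best_size) {
                best_size = out_size;
                best = i;
            }
        }
        return best;   /* the full file is then compressed with algos[best] */
    }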
With multiple algorithms compressing data that is all sent through the same
channel, a problem arises when the compressed data reaches the decompression
step. Since the algorithm that was used to compress the data is no longer known
in advance, a way to differentiate between the different types of compressed
contents must be included in the file contents. The way the group chose to
implement this was to include a header before every graphical object that is
passed through the buffer; a sketch of such a header is shown below. A possible
further optimization to this system is to rearrange the contents of the index
buffer during transport: all of the objects that have been compressed with the
same algorithm could be grouped in the same location, with a single header
describing the contents of that group as a whole.
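A per-object header of this kind might look like the sketch below; the field names, widths, and algorithm identifiers are assumptions rather than the group's actual format.

    /* Sketch of a per-object header identifying which compressor produced the
     * payload that follows. Serialization and padding details are omitted. */
    #include <stdint.h>

    enum CompressorId {
        COMP_NONE = 0,    /* stored uncompressed            */
        COMP_DELTA_RICE,  /* delta + Golomb-Rice (indices)  */
        COMP_BR,          /* predictive BR (vertices)       */
        COMP_LZO          /* LZO (vertices)                 */
    };

    typedef struct {
        uint8_t  algorithm;        /* one of CompressorId           */
        uint32_t compressed_size;  /* bytes of payload that follow  */
        uint32_t original_size;    /* bytes after decompression     */
    } ObjectHeader;

The decompressor reads the header, dispatches to the matching decoder, and then skips compressed_size bytes to reach the next object.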
In addition to dynamic anchor points, the group plans to test another
optimization technique in which a variable is introduced that holds the most
recently decompressed value. This allows the decompression algorithm to
“remember” where it was in the list, so that when the requested data is further
down the list it can simply continue where it left off instead of decoding the
entire list from the closest anchor point.
Appendices
8.1 Copyright
From: Nijasure, Mangesh <Mangesh.Nijasure@amd.com>
Date: Mon, Dec 1, 2014 at 3:04 PM
Subject: RE: Diagram Copyright Permission
To: Brian Estes <bestes258@gmail.com>, "Martin, Todd"
<Todd.Martin@amd.com>
Cc: Alex Berliner <alexberliner@gmail.com>, Samuel Lerner
<simolias@gmail.com>
You can use any of the diagrams I presented from the slides shown in class, just
include the citations (always good practice) I had citations to MSFT in the slides
you can just use those.
Any information from the AMD website can also be used along with the
appropriate citation as well.
Mangesh Nijasure
From: Brian Estes [mailto:bestes258@gmail.com]
Sent: Sunday, November 30, 2014 6:58 PM
To: Martin, Todd; Nijasure, Mangesh
Cc: Alex Berliner; Samuel Lerner
Subject: Diagram Copyright Permission
8.2 Datasheets
Figure 8.1: Specifications for the R9 series of Graphics Cards [2] Reprinted
with permission.
All models share a 28nm GPU architecture, API support for DirectX® 12, Mantle, OpenGL 4.3, and OpenCL, and PCI Express® version 3.

Model      GPU Clock Speed   Memory Bandwidth   Memory Amount     Stream Processing Units   Required Power Supply Connectors
R9 295X2   Up to 1018 MHz    Up to 640 GB/s     Up to 8GB GDDR5   Up to 5632                2 x 8-pin
R9 290X    Up to 1000 MHz    Up to 352 GB/s     Up to 8GB GDDR5   Up to 2816                1 x 6-pin + 1 x 8-pin
R9 290     Up to 947 MHz     Up to 320 GB/s     Up to 4GB GDDR5   Up to 2560                1 x 6-pin + 1 x 8-pin
R9 285     Up to 918 MHz     Up to 176 GB/s     Up to 4GB GDDR5   Up to 1792                2 x 6-pin
R9 280X    Up to 1000 MHz    Up to 288 GB/s     Up to 3GB GDDR5   Up to 2048                1 x 6-pin + 1 x 8-pin
R9 280     Up to 933 MHz     Up to 240 GB/s     Up to 3GB GDDR5   Up to 1792                1 x 6-pin + 1 x 8-pin
R9 270X    Up to 1050 MHz    Up to 179.2 GB/s   Up to 4GB GDDR5   Up to 1280                2 x 6-pin
R9 270     Up to 925 MHz     Up to 179.2 GB/s   Up to 2GB GDDR5   Up to 1280                1 x 6-pin
Figure 8.2: Sample Index Data
Figure 8.3: Sample Vertex Data
8.3 Software/Other
In the development of our testing environment we tried many different IDEs,
including Microsoft Visual Studio and Code::Blocks. Because we wanted the
testing environment to work across all of our computers (some of us use
Linux-based systems and others use Windows-based systems), we ended up using
plain text editors such as Sublime Text 2 and 3 and Notepad++, and compiling
the project on the command line with GCC to ensure code compatibility.
To keep our project up to date between our computers, we are using Git for
version control of our code. Our documents are kept in Google Drive, which
allows us to write the required documents at the same time while maintaining a
unified minutes log and TODO list.
To obtain test data we plan on using a program called AMD Perf-Studio. This
program works exclusively on AMD GPUs, so we may need to procure one in the
future, as none of us currently use them in our systems. Using the program, we
can pause a video game or 3D program and get a printout of the buffers on the
GPU at that time.
Bibliography
[1] P. H. Chou and T. H. Meng. Vertex data compression through vector quantization. IEEE Transactions on Visualization and Computer Graphics, 8(4):373–382, 2002.
[2] http://www.amd.com/en-us/products/graphics/desktop/r9 - "AMD Radeon™ R9 Series Graphics." AMD Radeon™ R9 Series Graphics. N.p., n.d. Web. 22 Nov. 2014.
[3] http://blogs.msdn.com/b/shawnhar/archive/2010/11/19/compressed-vertexdata.aspx - Compressed Vertex Data
[4] http://www.adobe.com/devnet/flashplayer/articles/vertex-fragmentshaders.html - Vertex and Fragment Shaders
[5] http://computer.howstuffworks.com/c10.htm - The Basics of C Programming
[6] http://www.directron.com/expressguide.html - What is PCI Express? A Layman's Guide to High Speed PCI-E Technology
[7] https://www.opengl.org/sdk/docs/tutorials/ClockworkCoders/attributes.php - Vertex Attributes
[8] https://msdn.microsoft.com/enus/library/windows/desktop/bb147325%28v=vs.85%29.aspx - Rendering from Vertex and Index Buffers
[9] http://introcs.cs.princeton.edu/java/44st/ - Symbol Tables
[10] http://steve.hollasch.net/cgindex/coding/ieeefloat.html - IEEE Standard 754
[11] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.296.6055&rank=3 - Compression in the Graphics Pipeline
[12] "Build Software Better, Together." GitHub. N.p., n.d. Web. 22 Nov. 2014. https://github.com/pricing
[13] http://www.mcs.anl.gov/papers/P5009-0813_1.pdf - Float Masks
[14] S. W. Golomb. Run-length encodings. IEEE Transactions on Information Theory, 12(3):399, 1966.
[15] http://rosettacode.org/wiki/Run-length_encoding - Run-length Encoding
[16] http://www.dspguide.com/ch27/4.htm - Delta Encoding
[17] http://rosettacode.org/wiki/Huffman_coding - Huffman Coding
[18] https://msdn.microsoft.com/en-us/library/hh265137.aspx - C++ AMP
[19] http://www.mcs.anl.gov/papers/P5009-0813_1.pdf - Improving Floating Point Compression through Binary Masks
[20] http://www.oberhumer.com/opensource/ - LZO