The Method of LZW Compression

Table of Contents
 History of LZW
- LZ77
- LZSS
- LZ78
- LZW
 The Method of LZW Compression
 Example of the LZW Method
 Methods of Optimization
- Dictionary Freezing
- Dictionary Pruning
- Dictionary Deconstruction
 Empirical Results of Research
 Conclusions
 Appendix
- Compression of war.txt
- Compression of randwar.txt
 Bibliography
HISTORY
LZ77
The basis for what is now called LZW compression was laid in 1977 by Jacob Ziv
and Abraham Lempel. They proposed a "string learning" algorithm that would essentially "get
smarter" by recognizing more and more patterns in a file. This initial algorithm was
called LZ77.
The LZ77 algorithm is the model on which the later variants are built. It is a universal
algorithm, in that it assumes no prior knowledge of the file before compression begins. The
method uses a "sliding window": only the data within the range of the window is active in the
algorithm at any moment.
This window consists of two buffers of predetermined length, b and d respectively.
The buffer b contains the elements still to be encoded, and the buffer d contains the elements
that have already been encoded. Each element in b is coded in terms of those in d, and a data
token is created to represent the encoding.
Each token holds three pieces of information: the position of the match in the buffer d,
the length of that match, and the next element in b. This method did compress files, and it was
innovative in that it could be applied universally to different types of files. However, it has
several problems.
The first problem is the number of comparisons needed to match an element from the
buffer b against the dictionary d. The algorithm requires on the order of d*b comparisons,
giving an algorithm of O(N²). An algorithm of this order consumes an excessive and
unreasonable amount of time on large files. At the same time, the larger the dictionary, the
larger the gains in compression; but increasing d also increases the number of comparisons,
further complicating the running-time dilemma.
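To make the token scheme concrete, the following Python sketch generates LZ77-style tokens by brute force. It is only an illustration of the scheme described above, not an efficient or historical implementation; the function name lz77_tokens and the window and lookahead sizes are our own assumptions.

    # Brute-force sketch of LZ77 token generation (illustrative only).
    # Each token is (distance back into d, length of match, next character in b).
    def lz77_tokens(data: bytes, window: int = 4096, lookahead: int = 16):
        tokens, i = [], 0
        while i < len(data):
            start = max(0, i - window)              # the already-encoded buffer d
            best_dist, best_len = 0, 0
            for pos in range(start, i):             # compare against every position in d
                length = 0
                while (length < lookahead - 1 and i + length < len(data) - 1
                       and data[pos + length] == data[i + length]):
                    length += 1
                if length > best_len:
                    best_dist, best_len = i - pos, length
            next_char = data[i + best_len]          # the literal that follows the match
            tokens.append((best_dist, best_len, next_char))
            i += best_len + 1
        return tokens

The nested loops make the d*b cost visible: every position of b is compared against every position of d, which is exactly the O(N²) behavior discussed above.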
LZSS
The next progression of this code was the LZSS algorithm, implemented by Storer and
Szymanski. It addresses the inefficient use of space in the token, as well as the way
comparisons are done.
The new token introduces a flag bit. The leading bit of the binary code of each token
tells the decompressor whether the next string of data is plain text or, instead, a symbolic
reference. If the flag is set to "1", the plain-text mode, the data is read as is. If the flag
is "0", the next string of data is the dictionary location, followed by the length of the string
from the dictionary. This saves space by omitting token fields that are not needed.
Another simple addition to the algorithm was the implementation of a binary tree over
the buffer window. This allows O(log N) searches and comparisons, rather than the O(N²)
behavior of the previous form of the algorithm. Coupled with the new token scheme, this makes
the algorithm faster and more space efficient.
LZ78
The LZ78 method drops the features that made the previous methods slow and
cumbersome. Mainly, instead of maintaining file buffers, which require several file pointers
into the file, LZ78 simply breaks the input into phrases to be processed. In this way it moves
sequentially through the file.
The dictionary is initialized with a single null string and builds continually as the file
is processed. This turns out to be a very efficient method of matching strings in the file.
However, if left unchecked, the dictionary would grow indefinitely until system resources were
consumed. The best method for dealing with this problem is still debated, and we will return to
it later.
This method works very effectively for large files. For smaller files, however, it
essentially just creates and stores the dictionary, with few repetitions of strings within the
file, which results in little to no gain in size reduction.
LZW
Terry Welch, in 1984, proposed a simple addition to the LZ78 algorithm that helped deal
with both efficiency and the problem with smaller files. The addition is an initialization of the
dictionary with the standard 256 ASCII characters, represented by the codes 0-255. With this
pre-established starting point, even smaller files can be compressed, as more of the input can
be represented by dictionary entries. Larger files also gain slightly in the efficiency of their
encoding.
The Method of LZW Compression
While fairly detailed by nature, LZW is easily understood and easily explained. The
following gives an overview of the method used for LZW encoding and provides an example
thereof.
The LZW method, essentially, reads the input character by character, comparing
against the dictionary, outputting a key related to the string read so far, and then updating
the dictionary. In more detail, the method works like this. Initially, the dictionary,
consisting of the 256 ASCII characters, is loaded into a data structure, perhaps a map, but
for optimal efficiency a hash table. The file to be encoded is then opened and the first
character is read. This character is compared with the entries in the dictionary. If the
character is found, the next character is concatenated to its end, forming a string of the
form Ix, where I is the previous string and x is the character appended to the end. Again
the string is checked against the dictionary. If Ix is found, the process repeats with the
next input character. If it is not found, the code for I is output, I is updated to be
equivalent to x, and the string Ix is added to the dictionary at the next available address,
in this case 256. This process continues until the entire file is encoded.
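The following Python sketch implements the encoding loop just described. It is a minimal illustration, not the program used in our tests; the function name lzw_encode and the use of a plain dictionary in place of a hash table are our own choices.

    # Minimal sketch of the LZW encoding loop described above.
    def lzw_encode(data: bytes) -> list[int]:
        # Initialize the dictionary with the 256 single-byte strings, codes 0-255.
        dictionary = {bytes([i]): i for i in range(256)}
        next_code = 256

        codes = []        # output list of dictionary codes
        current = b""     # the string I built so far

        for byte in data:
            candidate = current + bytes([byte])    # the string Ix
            if candidate in dictionary:
                current = candidate                # found: keep extending I
            else:
                codes.append(dictionary[current])  # output the code for I
                dictionary[candidate] = next_code  # add Ix at the next address
                next_code += 1
                current = bytes([byte])            # I = x
        if current:
            codes.append(dictionary[current])      # flush the final string
        return codes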
Example of the LZW method:
Input String: “abracadabra”
I = a     =>  Result: Found; Read Next Character
I = ab    =>  Result: Not Found; Output: 97; I = b; Dictionary[256] = "ab"
I = b     =>  Result: Found; Read Next Character
I = br    =>  Result: Not Found; Output: 98; I = r; Dictionary[257] = "br"
I = r     =>  Result: Found; Read Next Character
I = ra    =>  Result: Not Found; Output: 114; I = a; Dictionary[258] = "ra"
I = a     =>  Result: Found; Read Next Character
I = ac    =>  Result: Not Found; Output: 97; I = c; Dictionary[259] = "ac"
I = c     =>  Result: Found; Read Next Character
I = ca    =>  Result: Not Found; Output: 99; I = a; Dictionary[260] = "ca"
I = a     =>  Result: Found; Read Next Character
I = ad    =>  Result: Not Found; Output: 97; I = d; Dictionary[261] = "ad"
I = d     =>  Result: Found; Read Next Character
I = da    =>  Result: Not Found; Output: 100; I = a; Dictionary[262] = "da"
I = a     =>  Result: Found; Read Next Character
I = ab    =>  Result: Found; Read Next Character
I = abr   =>  Result: Not Found; Output: 256; I = r; Dictionary[263] = "abr"
I = r     =>  Result: Found; Read Next Character
I = ra    =>  Result: Found; End of File; Output: 258
Output string: 97 98 114 97 99 97 100 256 258
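For comparison, feeding the same string to the lzw_encode sketch from the previous section reproduces this output (the function name is our illustrative one, not the program used in our tests):

    # Verifying the worked example with the earlier lzw_encode sketch.
    print(lzw_encode(b"abracadabra"))
    # Expected: [97, 98, 114, 97, 99, 97, 100, 256, 258]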
METHODS OF OPTIMIZATION
Dictionary Freezing
This was probably the first method implemented in order to deal with the ever expanding
dictionary. Simply put, dictionary freezing just picks a size, and does not permit the dictionary to
grow beyond that size. Instead, it encodes the rest of the file according to the frozen dictionary.
This method, in effect, defeats the purpose of the dynamic nature of LZW compression. It is,
however, a simple, and easy, fix which will still result in file compression.
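In code, freezing amounts to a single guard around the dictionary update. A sketch, assuming a cap of 4096 entries and reusing the structure of the earlier lzw_encode (the function name and default cap are illustrative):

    # Sketch of LZW with dictionary freezing: no entries are added once
    # max_codes is reached; everything else matches the earlier lzw_encode.
    def lzw_encode_frozen(data: bytes, max_codes: int = 4096) -> list[int]:
        dictionary = {bytes([i]): i for i in range(256)}
        next_code = 256
        codes, current = [], b""

        for byte in data:
            candidate = current + bytes([byte])
            if candidate in dictionary:
                current = candidate
            else:
                codes.append(dictionary[current])
                if next_code < max_codes:          # freeze: stop growing when full
                    dictionary[candidate] = next_code
                    next_code += 1
                current = bytes([byte])
        if current:
            codes.append(dictionary[current])
        return codes

A matching decoder must apply the same cap so that both sides stop adding entries at the same point.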
Dictionary Pruning
Dictionary pruning is a modification of the dictionary freezing method. It, also, utilizes
restricted space requirements by letting the dictionary grow to a predetermined size. The storage
of data as it is being processed is a key part of this method. By using trees whose roots only have
descendants if their pattern has been matched, the finding of infrequently used patterns is
facilitated. Thus, once the dictionary becomes full, and an additional entry is needed, a search
for the first node with no leaves will be done. This node will be replaced with the new dictionary
entry, and the decoding will continue.
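The sketch below shows only the recycling step, under our own simplified bookkeeping: each multi-character entry records its parent code, and an entry with no children is a leaf whose code can be reused. The data-structure layout and function name are assumptions for illustration, not the implementation from the literature.

    # Sketch of the pruning step: free a leaf entry so its code can be reused.
    # strings:  code -> string, for every code currently in the table
    # codes:    string -> code (the lookup dictionary used while encoding)
    # children: code -> number of longer entries built on top of it
    # parent:   code -> code of the entry it extends (defined for codes >= 256)
    def recycle_leaf(strings, codes, children, parent) -> int:
        for code in range(256, len(strings)):       # never prune the base characters
            if children.get(code, 0) == 0:          # a leaf: nothing extends it
                old = strings[code]
                del codes[old]                      # forget the old string
                children[parent[code]] -= 1         # its parent loses a child
                return code                         # caller stores the new entry here
        raise RuntimeError("no leaf entry available to prune")

The caller would then store the new string under the returned code and increment the child count of that string's parent.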
Deconstruction of the Dictionary
A third method is simply to discard the dictionary and restart once it has been filled.
Although at first this seems a drastic measure that would result in the loss of data, it is
actually quite an effective way of compressing data: continually refreshing the dictionary
flushes out unused strings. This is also effective when dealing with nonstandard text files,
since by restarting, the dictionary keeps itself adapted to the current contents of the file.
The problem this method faces is deciding at what size the dictionary should be capped, or, in
other words, at what interval the dictionary should be reset. This is the problem we chose to
examine as part of our research, and our results follow.
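A sketch of this scheme, written to mirror our test setup of resetting after every k input characters (the function name is illustrative, and flushing the in-progress string at the reset point is one reasonable design choice, not necessarily the one our program made):

    # Sketch of dictionary deconstruction: the table is discarded and
    # re-initialized after every k input characters.
    def lzw_encode_reset(data: bytes, k: int) -> list[int]:
        def fresh():
            return {bytes([i]): i for i in range(256)}, 256

        dictionary, next_code = fresh()
        codes, current = [], b""

        for count, byte in enumerate(data, start=1):
            candidate = current + bytes([byte])
            if candidate in dictionary:
                current = candidate
            else:
                codes.append(dictionary[current])
                dictionary[candidate] = next_code
                next_code += 1
                current = bytes([byte])
            if count % k == 0:                     # every k characters: start over
                if current:
                    codes.append(dictionary[current])
                    current = b""
                dictionary, next_code = fresh()
        if current:
            codes.append(dictionary[current])
        return codes

A matching decoder must rebuild its own table at the same k-character boundaries so the two stay synchronized.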
Empirical Results of Research
Our test implementation resets the dictionary after every k characters read. The user
specifies a range of k over which to compress the input file, and the program writes both k and
the size of the compressed file to the standard output. We tested two files: war.txt, which was
given to us in a previous project, and a file of random characters. Both files are about 3
megabytes in length. We tested k over the range of 10 kilobytes to 1000 kilobytes, piped the
output into data files, and plotted them using gnuplot. We hoped these graphs would show a
low point at which compression would be optimal.
We did not find an optimal value of k. Instead, the file size asymptotically approaches a
lower bound as k increases. This lower bound varies between files. The plots for both files are
included in the Appendix. However, several interesting observations can be made.
First, the plot for the file of random characters is much smoother than that of the normal
text. In the plot for war.txt, the peaks and valleys in the file size have a difference of as much as
10 kilobytes. We believe this is because the compression algorithm becomes “lucky.” After
each reset, the dictionary has to rebuild. By chance, sometimes a better dictionary will be
created. For some values of k, better dictionaries are consistently made, and therefore the
compression is better. This, however, seems to be a random event and cannot be predicted.
The plot for the file of random characters supports this conclusion. The LZW algorithm
is meant to be used where there will be many repeated strings, and a file of random characters
does not have this property: whether a particular string of characters appears together is
governed by chance alone, unlike words such as "the," which appear very often in text. When
the dictionary is reset, the new dictionary is unlikely to be any luckier than the previous one.
This accounts for the much smaller deviation in the plot for the file of random characters.
A second observation is that the compression on the text file was much better than that
on the file of random characters. The file war.txt is actually larger than the file of random
characters, but its lower bound of compression was around 1.54 megabytes. However, the lower
bound for the file of random characters was around 2.08 megabytes. As discussed briefly before,
this is because the dictionary will not adapt well to the non-repetitive nature of the file of random
characters. Text, on the other hand, will have many common words that will quickly make their
way into the dictionary and will significantly affect the compression of the file.
Conclusions
Even though we did not find a value of k at which the compression is optimized, we did
learn more about the behavior of the algorithm. We were surprised at how quickly the lower
bound is approached. The graphs reach their lower bound by about k = 300 kilobytes, and no
significant compression gains were achieved using larger values of k. This is most likely due to the finite
size of the hash table that stores the dictionary.
Another possible way of improving the compression would be to make the hash table
bigger. Intuitively, a larger dictionary means more compression. However, this comes at a
cost. At present the output of the compressed file consists of 12-bit unsigned integers. These
12 bits can represent the codes for a dictionary of size 4096, which is exactly the size of our
hash table. Therefore, if we were to make the hash table larger, we would have to output more
bits per code, and the greater number of bits would offset the gains in compression from the
larger hash table.
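The space accounting behind the 12-bit figure is simple: 2^12 = 4096, so a 12-bit code can address every slot of the hash table. The sketch below packs codes two-per-three-bytes; it is our own illustration of the arithmetic, not the output routine of our project.

    # Packing 12-bit codes: each pair of codes (24 bits) occupies three bytes.
    # Codes are assumed to fit in 12 bits (0-4095).
    def pack_12bit(codes: list[int]) -> bytes:
        out = bytearray()
        for i in range(0, len(codes) - 1, 2):
            a, b = codes[i], codes[i + 1]
            out += bytes([a >> 4,                         # high 8 bits of a
                          ((a & 0x0F) << 4) | (b >> 8),   # low 4 of a, high 4 of b
                          b & 0xFF])                      # low 8 bits of b
        if len(codes) % 2:                                # an odd code left over
            a = codes[-1]
            out += bytes([a >> 4, (a & 0x0F) << 4])       # pad the final nibble with 0
        return bytes(out)

Nine codes, such as the output for "abracadabra", therefore occupy 14 bytes rather than the 11 bytes of the original text, which is why very small inputs may not shrink at all.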
Not surprisingly, we did not determine anything that computer scientists have not known
for years; however, we did achieve a better understanding of the problems that the industry faces.
The balance of power between space and speed is an elegant battle that will be fought until
technology eliminates one or both of the combatants. As we approach that future, we are proud
to say that we were part of the journey.
Appendix 1 Compression of war.txt
Appendix 2 Compression of randwar.txt
Bibliography
Held, Gilbert. Data and Image Compression. John Wiley and Sons, Ltd.,
West Sussex, England, 1996, pp. 280-283.
Lelewer, D. and Hirschberg, D. "Data Compression." ACM Computing Surveys,
Vol. 19, No. 3, 1987, pp. 261-296.
Sahni, Sartaj. Data Structures, Algorithms, and Applications in C++.
New York, 1998, pp. 357-370.
Salomon, David. Data Compression. Springer-Verlag, New York, 1998,
pp. 123-131.