Table of Contents

History of LZW
  - LZ77
  - LZSS
  - LZ78
  - LZW
The Method of LZW Compression
Example of the LZW Method
Methods of Optimization
  - Dictionary Freezing
  - Dictionary Pruning
  - Dictionary Deconstruction
Empirical Results of Research
Conclusions
Appendix
  - Compression of war.txt
  - Compression of randwar.txt
Bibliography

HISTORY

LZ77

The basis for what is now called LZW compression was laid in 1977 by Jacob Ziv and Abraham Lempel. They proposed a "string learning" algorithm that would essentially "get smarter" by recognizing more and more patterns in a file. This initial algorithm was called LZ77.

The LZ77 algorithm is the model on which the later variants are built. It is a universal algorithm, in that it assumes no information about the file before it begins compression. The method uses a "sliding window" while compressing: the data within the range of the window is what is active in the algorithm. This range consists of two buffers of predetermined length, b and d. The buffer b contains the elements still to be encoded, and the buffer d contains the elements that have already been encoded. Each element in b is coded in terms of those in d, and a data token is created to represent the encoding. Each token holds three pieces of information: first, the position of the match in the buffer d; second, the length of the match in d; and finally, the next element in b.

This method did compress files, and it was innovative in that it could be used universally on different types of files. However, it has several problems. The first is the number of comparisons needed to match an element from the buffer b against the dictionary d. The algorithm requires d*b comparisons, which results in an algorithm of O(N^2). An algorithm of this order causes excessive and unreasonable time consumption for large files. Likewise, the larger the dictionary, the larger the gains in compression; but clearly, increasing d increases the number of comparisons, further complicating the run-time dilemma.

LZSS

The next progression of this code was LZSS, implemented by Storer and Szymanski. It addresses the inefficient use of space in the token, as well as how comparisons are done. The new token introduces a flag bit. The leading bit of the binary code of each token tells the decompressor whether the next string of data is plain text or, instead, a symbolic reference. If the flag is set to "1", the plain-text mode, then the data is read as is. If the flag is "0", then the next string of data is the dictionary location, followed by the length of the string from the dictionary. This saves space by omitting token fields that a literal character does not need. Another simple addition to the algorithm was the use of a binary tree over the buffer window. This allows searches and comparisons in O(log N) time, rather than the O(N^2) behavior of the previous form of the algorithm. Coupled with the new token scheme, this helps make the algorithm faster and more space efficient.
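As an illustration of the token layout described above, here is a minimal sketch of how a decoder might interpret one LZSS-style token. The struct name LzssToken, the field widths, and the convention that the offset is measured backwards from the end of the already-decoded text are assumptions made for clarity, not details of the original LZSS implementation.

    #include <cstdint>
    #include <string>

    // Hypothetical token layout for illustration only; real implementations
    // pack these fields into a bit stream rather than a struct.
    struct LzssToken {
        bool literalFlag;   // 1 = plain text, 0 = (offset, length) reference
        char literal;       // used only when literalFlag is true
        uint16_t offset;    // how far back in the decoded text the match begins
        uint16_t length;    // length of the matched string
    };

    // Append the text represented by one token to the decoded output.
    void decodeToken(const LzssToken& t, std::string& output) {
        if (t.literalFlag) {
            output += t.literal;               // plain character, copied as is
        } else {
            std::size_t start = output.size() - t.offset;
            for (uint16_t i = 0; i < t.length; ++i)
                output += output[start + i];   // copy the referenced match
        }
    }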
LZ78

The LZ78 method drops the features that made the previous methods slow and cumbersome. Mainly, instead of using file buffers, which require several file pointers into the file, LZ78 simply breaks the input into phrases to be processed; in this way it moves sequentially through the file. The dictionary is initialized with a null string and continually builds as the file is processed. This turns out to be a very efficient way of matching strings in the file. However, if left unchecked, the dictionary would grow indefinitely until the system resources were consumed. The best method for dealing with this problem is still debated, and we will touch on it later. This method works very effectively for large files. For smaller files, however, it essentially just creates and stores the dictionary, with few repetitions of strings within the file, which results in little to no reduction in size.

LZW

In 1984, Terry Welch proposed a simple addition to the LZ78 algorithm that helped with both efficiency and the problem of smaller files. The addition is to initialize the dictionary with the standard 256 ASCII characters, represented by the codes 0-255. With this pre-established starting point, even smaller files can be compressed, since more of the data can be represented by dictionary entries. Larger files also gain slightly in the efficiency of their encoding.

The Method of LZW Compression

While fairly detailed by nature, LZW is easily understood and easily explained. The following gives an overview of the method used for LZW encoding and provides an example thereof. The LZW method essentially reads the input character by character, compares the accumulated string with the dictionary, outputs a key related to the string, and then updates the dictionary.

In more detail, the method works like this. Initially, the dictionary, consisting of the 256 ASCII characters, is loaded into a data structure, perhaps a map, but for optimal efficiency a hash table. The file to be encoded is then opened and the first character is read. This character is compared with the entries in the dictionary. If the character is found, the next character is read and concatenated to the end of the current string, forming a string of the form Ix, where I is the previous string and x is the newly read character. Again the string is checked against the dictionary. If Ix is also found, another character is appended and the lookup repeats. When a string Ix is not found, the code for I is output, the string Ix is added to the dictionary at the next available address (256 for the first new entry), and I is set to x. This process continues until the entire file is encoded.
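To make the procedure above concrete, the following is a minimal sketch of the encoding loop. The function name lzwEncode and the use of std::unordered_map and plain ints for the codes are choices made for the sketch; it illustrates the general method rather than reproducing the program used in our tests.

    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Minimal LZW encoder sketch: the dictionary is seeded with the 256
    // single-character strings, and new strings are added at codes 256 and up.
    std::vector<int> lzwEncode(const std::string& input) {
        std::unordered_map<std::string, int> dictionary;
        for (int c = 0; c < 256; ++c)
            dictionary[std::string(1, static_cast<char>(c))] = c;

        std::vector<int> output;
        int nextCode = 256;
        std::string I;                               // current string being matched

        for (char x : input) {
            std::string Ix = I + x;
            if (dictionary.count(Ix)) {
                I = Ix;                              // found: keep extending the match
            } else {
                output.push_back(dictionary[I]);     // not found: output the code for I
                dictionary[Ix] = nextCode++;         // add Ix as a new dictionary entry
                I = std::string(1, x);               // restart the match from x
            }
        }
        if (!I.empty())
            output.push_back(dictionary[I]);         // flush the final match
        return output;
    }

    int main() {
        for (int code : lzwEncode("abracadabra"))
            std::cout << code << ' ';                // prints: 97 98 114 97 99 97 100 256 258
        std::cout << '\n';
    }

Running the sketch on "abracadabra" reproduces the output sequence traced in the example below.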
Example of the LZW Method

Input string: "abracadabra"

I = a     =>  Result: Found; read next character
I = ab    =>  Result: Not found; output 97;  I = b; Dictionary[256] = "ab"
I = b     =>  Result: Found; read next character
I = br    =>  Result: Not found; output 98;  I = r; Dictionary[257] = "br"
I = r     =>  Result: Found; read next character
I = ra    =>  Result: Not found; output 114; I = a; Dictionary[258] = "ra"
I = a     =>  Result: Found; read next character
I = ac    =>  Result: Not found; output 97;  I = c; Dictionary[259] = "ac"
I = c     =>  Result: Found; read next character
I = ca    =>  Result: Not found; output 99;  I = a; Dictionary[260] = "ca"
I = a     =>  Result: Found; read next character
I = ad    =>  Result: Not found; output 97;  I = d; Dictionary[261] = "ad"
I = d     =>  Result: Found; read next character
I = da    =>  Result: Not found; output 100; I = a; Dictionary[262] = "da"
I = a     =>  Result: Found; read next character
I = ab    =>  Result: Found; read next character
I = abr   =>  Result: Not found; output 256; I = r; Dictionary[263] = "abr"
I = r     =>  Result: Found; read next character
I = ra    =>  Result: Found; end of file; output 258

Output string: 97 98 114 97 99 97 100 256 258

METHODS of OPTIMIZATION

Dictionary Freezing

This was probably the first method implemented to deal with the ever-expanding dictionary. Simply put, dictionary freezing picks a size and does not permit the dictionary to grow beyond it. Instead, the rest of the file is encoded according to the frozen dictionary. This method, in effect, defeats the purpose of the dynamic nature of LZW compression. It is, however, a simple and easy fix that still results in file compression.

Dictionary Pruning

Dictionary pruning is a modification of the dictionary freezing method. It also restricts the space requirement by letting the dictionary grow only to a predetermined size. The way data is stored as it is processed is a key part of this method. By using trees whose nodes only have descendants if their pattern has been matched, infrequently used patterns are easy to find. Thus, once the dictionary becomes full and an additional entry is needed, a search is done for the first node with no leaves. That node is replaced with the new dictionary entry, and the coding continues.

Deconstruction of the Dictionary

A third method is simply to discard the dictionary and restart it once it has been filled. Although at first this seems a drastic measure that would result in the loss of data, it is actually quite an effective way of compressing data. Continually refreshing the dictionary helps flush out unused strings. This is also effective when dealing with nonstandard text files: by restarting, the dictionary keeps itself adapted to the content currently being encoded. The problem this method faces is deciding at what size the dictionary should be capped, or, in other words, at what interval the dictionary should be reset.
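To illustrate this reset-based approach, here is a minimal sketch that discards and re-seeds the dictionary after every k input characters. The function name lzwEncodeWithReset, the flush of the current match at each reset, and the omission of a 4096-entry cap are illustrative assumptions; this is not the program used for the measurements that follow.

    #include <string>
    #include <unordered_map>
    #include <vector>

    // LZW encoding with dictionary deconstruction: the dictionary is thrown
    // away and re-seeded with the 256 single characters after every k input
    // characters.  (For simplicity the sketch omits the 12-bit cap of 4096
    // entries that a real implementation would enforce.)
    std::vector<int> lzwEncodeWithReset(const std::string& input, std::size_t k) {
        std::unordered_map<std::string, int> dictionary;
        int nextCode = 0;

        auto resetDictionary = [&]() {
            dictionary.clear();
            for (int c = 0; c < 256; ++c)
                dictionary[std::string(1, static_cast<char>(c))] = c;
            nextCode = 256;
        };
        resetDictionary();

        std::vector<int> output;
        std::string I;                      // current string being matched
        std::size_t charsSinceReset = 0;

        for (char x : input) {
            std::string Ix = I + x;
            if (dictionary.count(Ix)) {
                I = Ix;
            } else {
                output.push_back(dictionary[I]);
                dictionary[Ix] = nextCode++;
                I = std::string(1, x);
            }
            // After k characters, flush the current match and start fresh.
            if (++charsSinceReset >= k) {
                if (!I.empty()) {
                    output.push_back(dictionary[I]);
                    I.clear();
                }
                resetDictionary();
                charsSinceReset = 0;
            }
        }
        if (!I.empty())
            output.push_back(dictionary[I]);
        return output;
    }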
This is the problem we chose to demonstrate as part of our research, and our results are as follows.

Empirical Results of Research

Our test implementation resets the dictionary after k characters are read. The user specifies a range of values of k for which to compress the input file, and the program writes both k and the size of the compressed file to the standard output. We tested two files: war.txt, which was given to us in a previous project, and a file of random characters. Both files are about 3 megabytes in length. We tested k over the range of 10 kilobytes to 1000 kilobytes, piped the output into data files, and plotted them using gnuplot. We hoped the graphs would reveal a low point at which compression would be optimal.

We did not find an optimal value of k. Instead, the compressed file size asymptotically approaches a lower bound as k increases, and this lower bound varies between files. The plots for both files are included in the Appendix. Several interesting observations can be made, however.

First, the plot for the file of random characters is much smoother than that for the normal text. In the plot for war.txt, the peaks and valleys in the file size differ by as much as 10 kilobytes. We believe this is because the compression algorithm sometimes gets "lucky." After each reset, the dictionary has to rebuild, and by chance a better dictionary will sometimes be created. For some values of k, better dictionaries are consistently made and therefore the compression is better. This, however, appears to be a random effect and cannot be predicted. The plot for the file of random characters supports this conclusion. The LZW algorithm is meant to be used where there are many repeated strings, and a file of random characters does not have this property: whether a particular string of characters appears together is determined by probability, unlike words such as "the", which appear very often in text. When the dictionary is reset, it is unlikely that the new dictionary will be any luckier than the previous one. This accounts for the much smaller deviation in the plot for the file of random characters.

A second observation is that the compression of the text file was much better than that of the file of random characters. The file war.txt is actually larger than the file of random characters, but its lower bound of compression was around 1.54 megabytes, while the lower bound for the file of random characters was around 2.08 megabytes. As discussed briefly above, this is because the dictionary does not adapt well to the non-repetitive nature of the file of random characters. Text, on the other hand, has many common words that quickly make their way into the dictionary and significantly improve the compression of the file.

Conclusions

Even though we did not find a value of k at which the compression is optimized, we did learn more about the behavior of the algorithm. We were surprised at how quickly the lower bound is approached. The graphs reach their lower bound by k = 300 kilobytes, and no significant compression gains were achieved using larger values of k. This is most likely due to the finite size of the hash table that stores the dictionary.

Another possible way to improve the compression is to make the hash table larger. Intuitively, a larger dictionary means more compression. However, this comes at a cost. Currently the output of the compressed file consists of 12-bit unsigned integers. These 12 bits can represent the codes for a dictionary of size 4096, which is exactly the size of our hash table. Therefore, to make the hash table larger, we would have to output more bits per code, and the greater number of bits would offset the gains in compression from having the larger hash table.
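As an aside on the 12-bit codes mentioned above, the following sketch shows one common way two 12-bit codes can be packed into three bytes; the function packPair is an illustration of the general idea, not the output routine of our program.

    #include <cstdint>
    #include <vector>

    // Pack two 12-bit codes (values 0-4095) into three bytes.  With a
    // 4096-entry dictionary every code fits in 12 bits; a larger dictionary
    // would force wider codes and therefore more output per code.
    void packPair(uint16_t a, uint16_t b, std::vector<uint8_t>& out) {
        out.push_back(static_cast<uint8_t>(a >> 4));                        // high 8 bits of a
        out.push_back(static_cast<uint8_t>(((a & 0x0F) << 4) | (b >> 8)));  // low 4 bits of a, high 4 bits of b
        out.push_back(static_cast<uint8_t>(b & 0xFF));                      // low 8 bits of b
    }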
Not surprisingly, we did not discover anything that computer scientists have not known for years; however, we did achieve a better understanding of the problems that the industry faces. The balance of power between space and speed is an elegant battle that will be fought until technology eliminates one or both of the combatants. As we approach that future, we are proud to say that we were part of the journey.

Appendix 1: Compression of war.txt (plot of compressed file size versus k)

Appendix 2: Compression of randwar.txt (plot of compressed file size versus k)

Bibliography

Held, Gilbert. Data and Image Compression. John Wiley and Sons, Ltd., West Sussex, England, 1996, pp. 280-283.

Lelewer, D. and Hirschberg, D. "Data Compression." ACM Computing Surveys, Vol. 19, No. 3, 1987, pp. 261-296.

Sahni, Sartaj. Data Structures, Algorithms, and Applications in C++. New York, 1998, pp. 357-370.

Salomon, David. Data Compression. Springer-Verlag, New York, 1998, pp. 123-131.