A classification of compression methods and their usefulness for a large data processing center

by DORON GOTTLIEB, STEVEN A. HAGERTH, PHILIPPE G. H. LEHOT and HENRY S. RABINOWITZ
Fireman's Fund American Insurance Companies
San Francisco, California

INTRODUCTION

The compression techniques surveyed in this paper all work to reduce storage space for data files at the price of increased CPU activity needed for compression and decompression. As CPU time becomes cheaper relative to the cost of external storage devices, compression appears as an increasingly attractive option for dealing with large files. In a small shop, which is typically I/O bound, compression uses available CPU time to decrease the amount of disc or tape storage. More generally, compression of storage space is achieved only at the expense of CPU time. The most clear-cut use of compression is for archive files, where the main consideration is minimizing physical storage space.

This paper surveys available techniques for automatic reversible compression of files; i.e., techniques that require no special knowledge of the contents of a file. The theoretical advantages of the two main categories of compression, differencing and statistical encoding, are compared, and the practical results of these techniques on large insurance files are shown, both in terms of compression efficiency and CPU efficiency. Suggestions are offered for improving the compression achieved through Huffman coding by adding a schema to code strings of a repeated character. An algorithm is given to find the threshold for the minimal length of those strings whose coding will result in improved compression.

DEFINITIONS AND USES OF COMPACTION AND COMPRESSION

Definitions

Since there are no standardized definitions of compaction and compression, we propose the following usage, to be followed throughout this paper:

Compaction of data means any technique which reduces the size of the physical representation of the data while preserving a subset of the information deemed "relevant information."

Compression of data is a compaction technique which is completely reversible.

Compression ratio is the size of the compressed file expressed as a percentage of the original file.

A compaction technique that is not a compression technique involves elimination of information deemed superfluous in order to decrease overall storage requirements. Such a technique is, by definition, dependent on the semantics of the data. The file-oriented techniques studied in this paper are primarily compression techniques, since these are the easiest to implement in a generalized fashion. While a familiarity with the semantics of a file is necessary for maximal compaction, compression techniques have the advantage of "automatic" applicability to a wide variety of files.

Compaction techniques that are irreversible are most applicable to the directories of a file. Indeed, at a directory level there is often an advantage in disregarding less important information, which may be carried in lower directories or in the file itself, so as to speed up directory scanning and the overall efficiency of the "general directory access method."

COMPACTION OF A SEQUENCE OF SORTED RANDOM KEYS

Introduction

A good example of compaction is the following front-compression/rear-compaction scheme on a sequence of sorted keys. The scheme achieves a very compact first-level directory in which only those portions of a key K are kept that are

-not identical to the previous key
-necessary to make K unique; i.e., distinct from the previous key and the following key.
In particular, the "front string" (the initial string of characters of K identical to the same-positioned characters in the key before K) will be skipped. The "rear string" (the string of trailing characters which are not needed to distinguish K from the previous key and the following key) is knocked out. Rear compaction involves a loss of information; hence, the keys must be carried with their full information at the level of the record, or at some intermediary level.

Front compression

The leading bits of a key which are identical to the previous key's leading bits constitute the FRS (front redundant string) and need not be repeated. The FRS is expanded to include one extra bit, since it follows automatically that if the first n bits of a key are the initial repeated string, then the (n+1)st bit must be different.

Instead of the FRS itself, a number can be written specifying the length of the FRS. This number requires only a field of bits equal to the logarithm (base 2) of the length of the key. For example, if m is the length of the key, say m = 32 bits, then the length of the FRS cannot exceed m; hence, the number of bits needed to express the FRS length is [log2 m] = 5 bits.

Rear compaction

Unlike front compression, which suppresses some redundancy but does not really do away with any information per se (provided one knows the previous key), rear compaction does away with information which is judged unnecessary. Rear compaction deletes an RRS (rear redundant string). An RRS is composed of those rightmost bits of a key which are not necessary to uniquely distinguish this key with respect to the set of all keys in the particular sequence to be rear-compacted. We can immediately state the following theorem:

Theorem: Given a set of keys, some of which may be identical, in order to find the RRS of a key K it is enough to look at the previous key P and the following key A in the sorted sequence; i.e., the RRS for K relative to the whole set of keys is identical to the RRS for K relative to the set of three keys P, K, and A.

The useful string (US) is what is left of the key after the FRS and the RRS have been removed.

Example (17-bit keys):

  P = Key #(i-1):  1 0 1 0 1 0 1 0   1 0 1 0 1   0 1 0 1
  K = Key #i:      1 0 1 0 1 0 1 1   0 1 0 1 0   1 0 1 0
                   |----- FRS ----|  |--- US --|  |-RRS-|
  A = Key #(i+1):  1 0 1 0 1 0 1 1   0 1 0 1 1   1 0 1 1

Key #i will be coded as (8) 0 1 0 1 0, where 8 is the size of the front redundant string (FRS) and "01010" is the useful string left over after front and rear compaction. Note that the last bit, and only the last bit, of the FRS differs from the corresponding bit of the previous record. Note also that the FRS of key #i is sufficient to distinguish it from key #(i-1), but that the 5 bits of the useful string (US) are needed to distinguish key #i from key #(i+1).

Note: Another way of looking at rear compaction is to define the useful string of key #i as follows: if the FRS of key #(i+1) contains the FRS of key #i, then the US of key #i is the string obtained by deleting the FRS of key #i from the FRS of key #(i+1); otherwise it is null. If the keys are viewed as binary numbers of n bits each, then the compacted key #i will occupy [log2(K(i+1) - K(i))] + [log2(K(i) - K(i-1))] + 2[log2 n] bits in the first case and 2[log2 n] bits in the second.
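To make the scheme concrete, the following sketch computes the FRS length and the useful string of a key from its two sorted neighbours. It is written in Python purely as an illustration of the procedure described above (the original work predates the language); the function and variable names are ours.

def common_prefix_len(x: str, y: str) -> int:
    """Length of the common leading bit string of two bit strings."""
    n = 0
    while n < len(x) and n < len(y) and x[n] == y[n]:
        n += 1
    return n

def compact_key(prev_key: str, key: str, next_key: str):
    """Front-compress/rear-compact `key` against its sorted neighbours,
    returning (FRS length, useful string). A sketch of the scheme above,
    not the authors' code; keys are equal-length bit strings."""
    # FRS: the common prefix with the previous key, extended by one bit,
    # since the bit just after the common prefix is automatically different.
    frs_len = min(common_prefix_len(prev_key, key) + 1, len(key))
    # Keep every bit up to and including the first bit that differs from the
    # following key; whatever lies beyond is the rear redundant string (RRS).
    keep = min(common_prefix_len(key, next_key) + 1, len(key))
    useful = key[frs_len:keep] if keep > frs_len else ""   # null US when keep <= FRS
    return frs_len, useful

# The worked example from the text (17-bit keys):
P = "10101010101010101"
K = "10101011010101010"
A = "10101011010111011"
print(compact_key(P, K, A))   # -> (8, '01010'), i.e., key #i is coded as (8) 01010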
COMPRESSION BY DIFFERENCING

The term differencing describes techniques which compare a current record to a pattern record and retain only the differences between them; i.e.,

  information in compressed record = information in current record - information already in pattern record.

This technique is particularly successful with large-record files of alphanumeric characters where most corresponding fields in different records are the same (or even blanks or zeros); also, compression is often improved by sorting the file on the largest field. In some sense, the differencing scheme is a generalization of the front compression seen above. The process of compressing the front string is repeated for each maximal substring in the current record which matches a substring (in the same position) in a pattern record. The start and end signals for such a matched substring are the overhead of the scheme.

The information unit on which differencing is performed can be the bit, the byte, the field, or logical information.

a. Bit: Both current and pattern records are considered as equal-length bit strings. (They could also be left- or right-justified variable-length bit strings.)
b. Byte or character: Both current and pattern records are viewed as character strings. (Byte access being cheaper, this is the most common case; a sketch of this case is given at the end of this section.)
c. Field: The record is viewed as a string of fields (each with its own characteristics). Quite often, the start and end signals for the unmatched "strings" will be implemented by bit maps, where each bit of the map is on or off to signal whether a given field of the current record is identical to the corresponding field of the pattern record. This is a rougher scheme, but it may present the advantage of less overhead whenever matching fields are frequent.
d. Logical information (instead of physical data such as bit/byte/field), as in the example:

  date 1, date 2, date 3  becomes  date 1, interval 2, interval 3
  where interval 2 = date 2 - date 1 and interval 3 = date 3 - date 2.

In conclusion, we see that differencing schemes (like front compression, which is a special case) seek to diminish the overall amount of information by not repeating (and actually subtracting) that part of the information in a record which is already present in another (previous/pattern) record.

Most often, differencing is applied to sequential files where the pattern record is taken to be the previous record in the file, which itself may have been sorted. If used with a direct-access file, the first record of the block which is directly accessed should be left intact (non-compressed). This may be expensive when the ratio (size of non-compressed record)/(size of block) is not small enough. In this case, a change in the blocking format might be warranted.

Zero and blank compression techniques can be viewed as a special case of differencing in which a zero or blank record is used as the pattern record for the entire file. The use of the same pattern record for the whole file may not yield as good a compression as a schema where the pattern used to compress each record is the record preceding it. But the latter choice is more expensive in encoding and decoding time. Indeed, whenever a record is to be read, every record preceding it must be decoded; i.e., half a block of decompression on the average. Deletions and insertions are, clearly, even costlier.
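As an illustration of case (b) above, here is a minimal byte-level differencing sketch in Python. The fragment format, a list of (offset, differing bytes) pairs, is our own assumption for illustration; the paper does not prescribe a particular physical layout for the start and end signals.

def diff_encode(pattern: bytes, record: bytes):
    """Byte-level differencing: keep only the maximal substrings of `record`
    that differ from the same positions of `pattern`, plus their offsets.
    Assumes equal-length records; the fragment format is illustrative only."""
    fragments = []
    i = 0
    while i < len(record):
        if record[i] == pattern[i]:
            i += 1                      # matched byte: already present in the pattern
            continue
        start = i
        while i < len(record) and record[i] != pattern[i]:
            i += 1                      # collect a maximal unmatched substring
        fragments.append((start, record[start:i]))
    return fragments

def diff_decode(pattern: bytes, fragments):
    """Rebuild the record by overlaying the stored fragments on the pattern."""
    out = bytearray(pattern)
    for start, chunk in fragments:
        out[start:start + len(chunk)] = chunk
    return bytes(out)

pattern = b"SMITH     JOHN      CA 00000"
record  = b"SMITH     JANE      CA 00120"
enc = diff_encode(pattern, record)      # [(11, b'ANE'), (25, b'12')]
assert diff_decode(pattern, enc) == record

In sequential differencing the pattern is simply the previously decoded record; in zero or blank compression it is a constant record of zeros or blanks.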
The Ling-Palermo algorithm for compression of blocks of data,1 through a clever use of the concept of linear dependence, is an extension of differencing, and so is the QUATREE method by Hardgrave.2

STATISTICAL ENCODING

A statistical encoding is a transformation of the user's alphabet, converting each member of the alphabet into a code bit string whose length is inversely related to the frequency of the member in a text. A text is normally written using a fixed alphabet where each character is represented by a fixed-length bit string (e.g., a byte). A statistical encoding schema attempts to take advantage of the fact that different characters will usually occur with different frequencies. Coding each character as a bit string of length inversely related to its frequency (i.e., coding non-frequent characters with long bit strings and frequent characters with short ones) will usually compress the text.

If a text is written in an alphabet I = {a1, ..., an} where each character occupies k bits, then an efficient statistical encoding will assign a code βi to each character ai such that

  Σ(i=1..n) fi * |βi|  ≤  k * N

where fi is the frequency of ai in the text, |βi| is the length of the code βi, and N is the number of characters in the text.

An essential property of any statistical encoding schema is complete reversibility, that is, the ability to retrieve the original text from the encoded one in a finite (preferably linear) number of steps. Another desired quality is the prefix property, where no code βi is the prefix of another code βj. This property assures both complete and unique reversibility, and also that the decoder never has to back up and rescan any portion of the text. It is sometimes desired to have a coding schema that will preserve the alphabetic ordering of the user's alphabet (the alphabetic property); that is, if ai precedes aj in the original alphabet, one should be able to deduce this fact from the codes βi and βj without having to decode them.

Information-theoretic considerations assure us that when an alphabet of n characters is coded so that its complete reversibility is assured, N*H is the shortest possible binary representation for a text of N characters, where H is the entropy of the distribution of characters in the text. This means, roughly, that the more "skewed" the distribution, the better the compression.

The Huffman coding scheme3 is a very elegant and simple statistical coding algorithm with the prefix property. It is optimal in the sense that its performance reaches the information-theoretic lower bound stated above. The Hu-Tucker algorithm4 is a statistical coding scheme with both the prefix and the alphabetic property. In the next section, we discuss in more detail the application of these two algorithms.

EVALUATION OF HUFFMAN CODING FOR LARGE BUSINESS FILES

Huffman coding, based on the statistical characteristics of a file, provides an easy and effective method of file compression without necessitating any inquiry into the semantics of the file records. Thus, one package can be used on a wide variety of files to achieve compression without investment of large amounts of programmers' time to investigate particular files for their storage-wasteful properties. In testing a Huffman encoding package on a variety of large insurance files, the worst results encountered (on an already compact binary file) were 50 percent compression. Furthermore, contrary to Kreutzer,5 Huffman coding is suitable for files that are frequently updated.
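For concreteness, the following sketch builds a prefix-property Huffman code from a character frequency table and reports the expected code length and the entropy of the table (quantities used in the discussion that follows). This is the generic textbook construction in Python, not the PL/1 package evaluated later; the frequency table is made up.

import heapq, itertools
from math import log2

def huffman_code(freq: dict) -> dict:
    """Build a prefix-property code from {character: frequency}."""
    tie = itertools.count()              # unique tie-breaker so dicts are never compared
    heap = [(f, next(tie), {ch: ""}) for ch, f in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:                   # degenerate one-character alphabet
        return {ch: "0" for ch in freq}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # merge the two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + c for ch, c in left.items()}
        merged.update({ch: "1" + c for ch, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

freq = {" ": 60, "0": 25, "A": 10, "Z": 5}       # illustrative frequency table
code = huffman_code(freq)                        # {'Z': '000', 'A': '001', '0': '01', ' ': '1'}
total = sum(freq.values())
expected = sum(freq[ch] * len(code[ch]) for ch in freq) / total
entropy = sum(f / total * log2(total / f) for f in freq.values())
print(expected, entropy)                         # about 1.55 and 1.49: H <= expected < H + 1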
Huffman coding tolerates frequent updating because the compression ratio it achieves can be discerned immediately from a table of the frequency of occurrence of each character in the file. If the frequency of occurrence of letter i is fi, then the expected code length generated by the Huffman code is closely approximated by the entropy of the frequency table,

  H = Σ(i∈I) [ (fi / Σ(j∈I) fj) * log2( Σ(j∈I) fj / fi ) ]

where I is the alphabet being coded. In fact, H ≤ expected code length ≤ H + 1. Thus, if a file is frequently updated, it is easy to compile new frequency statistics at the same time the updates are performed. Then, after a certain number of updates, the expected code length using the new statistics and a new coding can be compared with the expected code length using the new statistics and the old code. If a significant improvement can be made by recoding (which will probably occur only rarely), then a new code can be generated and the file recoded.

Because the compression ratio can be calculated using a simple statistical pass over the file, the Huffman technique gives more immediate information than a differencing technique, which cannot give advance notice of its effectiveness before an actual encoding pass occurs. Huffman coding has the further advantage, over differencing techniques, that records can be decoded individually without need for reference to a pattern record. In fact, differencing techniques derive most of their power from the fact that blanks or zeroes are commonly repeated, a fact that is handled well by Huffman coding, and even better by a modification of Huffman coding that is discussed below.

It is hard to improve on Huffman coding while still preserving its "automatic" effectiveness; i.e., without reference to data semantics. However, some progress can be made with special coding techniques for repeating characters and unrecognized characters. Huffman code is optimal given the assumption that the probability of appearance of any letter is independent of the probability of appearance of any other letter. Of course, this is never actually the case, but this assumption is necessitated by the difficulty of discerning patterns in an automatic fashion. The simplest "pattern" is a repeating string of the same character (a clump). Usually the most frequent character in a file (say, blank or zero) is not randomly distributed throughout the file but occurs in clumps. Since this commonly occurring condition violates the assumptions under which Huffman is optimal, it is possible to devise strategies to improve on Huffman, the simplest of which is to invent a "repeat" flag. A repeat flag can be included in the frequency table of a file with frequency equal to the number of occurrences of repeat strings whose length is greater than some threshold T. Then, for example, a string of five consecutive blanks, instead of being encoded as (code of blank) five times, will be encoded as (code of repeat flag) (5) (code of blank). (Note that the introduction of the repeat flag modifies the frequencies of the characters which are repeated beyond the threshold.) Despite the vicious-cycle nature of the problem, there is an algorithm that enables one to estimate the lower threshold T for the length of repeated strings above which use of the repeat flag is more efficient than the simple Huffman code. This algorithm depends only on the frequency of occurrence of characters and of repetitions.
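The following sketch shows how an encoder applies such a repeat flag, using the (code of flag)(count)(code of character) format of the example above. The code table, the flag symbol "<R>", and the fixed-width binary count field are illustrative assumptions; the paper specifies only that the count field has [log2 m] bits, where m is the longest clump in the file.

def encode_with_repeat_flag(text: str, code: dict, flag: str,
                            target: str, threshold: int, count_bits: int) -> str:
    """Huffman-encode `text`, replacing each clump of `target` of length >= threshold
    by (code of flag)(count)(code of target). `count_bits` is assumed wide enough
    to hold the longest clump, as in the text."""
    out, i = [], 0
    while i < len(text):
        ch = text[i]
        if ch == target:
            j = i
            while j < len(text) and text[j] == target:
                j += 1                                       # measure the clump
            run = j - i
            if run >= threshold:
                out.append(code[flag])
                out.append(format(run, f"0{count_bits}b"))   # fixed-size count field
                out.append(code[target])
                i = j
                continue
        out.append(code[ch])
        i += 1
    return "".join(out)

# Hypothetical prefix-free code table that includes the repeat flag "<R>":
code = {" ": "0", "A": "10", "<R>": "110", "1": "111"}
print(encode_with_repeat_flag("A     1", code, "<R>", " ", threshold=3, count_bits=5))
# pieces: 10 | 110 | 00101 | 0 | 111

Whether the flag actually saves bits depends on the code lengths and on the width of the count field; the algorithm described next determines the break-even clump length T.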
In practice, it turns out that this technique provides significant improvement over Huffman only when applied to the most frequent character in the file. We will assume here that repeating strings of character ai are to be encoded using the format (code of repeat flag) (clump size).

The following tables are obtained by scanning the file once:

  TABLE A - Frequency of Characters

    Character   Frequency
    a1          f1
    a2          f2
    ...         ...
    an          fn
                N = Σ(i=1..n) fi

  TABLE B - Frequency of Clumps of Character ai

    Clump length   Number of clumps
    m              φm
    m-1            φ(m-1)
    ...            ...
    2              φ2

where fj is the total frequency of character aj in the text and φk is the number of clumps of ai of length exactly k (i.e., k successive occurrences of ai bounded on both sides by different characters).

The threshold T is then, roughly, the smallest clump length K satisfying

  length of flag code + length of count field ≤ K * (length of code for ai);

the algorithm below determines it exactly, taking into account that introducing the flag itself changes the character frequencies. Denoting by a(n+1) the flag character, we will use log2(N/fj) as an estimate of the length of the Huffman code for character aj, and [log2 m] as the number of bits in the fixed-size count field (m is the size of the longest clump of ai in the text). The following is the algorithm to find T:

(1) Set r = 0.
(2) Set r = r + 1.
(3) Set fi = fi - (m-r+1)*φ(m-r+1) (adjusting the frequency of ai by subtracting the occurrences of ai in clumps of size m-r+1).
(4) Set f(n+1) = f(n+1) + φ(m-r+1) (adjust the frequency of the flag).
(5) Set N = Σ(j=1..n+1) fj (adjust the total).
(6) Evaluate: log2(N/f(n+1)) + [log2 m] < (m-r+1) * log2(N/fi). If the inequality holds, go to Step 2. Otherwise, readjust fi = fi + (m-r+1)*φ(m-r+1), f(n+1) = f(n+1) - φ(m-r+1), and N = Σ(j=1..n+1) fj (i.e., use the frequencies from the previous step). Set T = m-r+2 and proceed to produce the Huffman code using the resulting Table A. Encode the file applying the repeat-flag format to clumps of ai of size ≥ T.

The above algorithm could be modified so that clumps of other frequent characters could be evaluated. This would require a table (like Table B) for each of the characters under consideration and possibly the format (flag) (count) (code of repeating character). Note that to obtain the expected gain in compression one could, at Step 6, compute the entropy of Table A and use it to compute the compression ratio. (A sketch of the threshold computation is given below.)
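Here is a sketch of that procedure in Python, working from Table A and Table B and using the same log2(N/f) estimate of code lengths. The loop variable k plays the role of m-r+1 in Steps (2)-(6); the flag pseudo-character name, the example tables, and the behaviour when the flag never pays (T = m+1) are our own illustrative choices.

from math import log2, floor

def repeat_flag_threshold(freq: dict, clumps: dict, ai: str):
    """Estimate the threshold T above which clumps of `ai` should be coded with
    the repeat flag. `freq` is Table A ({character: frequency}); `clumps` is
    Table B ({clump length: number of clumps of ai of exactly that length})."""
    f = dict(freq)
    f["<flag>"] = 0                          # a(n+1): the repeat-flag pseudo-character
    m = max(clumps)                          # longest clump of ai in the file
    count_bits = floor(log2(m))              # [log2 m] bits for the fixed-size count field
    T = m + 1                                # default: the flag never pays
    for k in range(m, 1, -1):                # k = m-r+1 for r = 1, 2, ...
        phi = clumps.get(k, 0)
        f[ai] -= k * phi                     # Step 3: remove occurrences absorbed into clumps
        f["<flag>"] += phi                   # Step 4: each clump contributes one flag
        N = sum(f.values())                  # Step 5: adjusted total
        flag_cost = log2(N / f["<flag>"]) + count_bits
        plain_cost = k * log2(N / f[ai])     # Step 6: flag coding vs. k plain codes
        if flag_cost < plain_cost:
            T = k                            # the flag still pays at this clump length
            continue
        f[ai] += k * phi                     # inequality failed: restore the previous step
        f["<flag>"] -= phi
        break
    return T, f                              # the adjusted Table A is used to build the code

freq = {" ": 5000, "0": 1000, "A": 3000, "B": 1000}    # made-up Table A
clumps = {6: 20, 5: 30, 4: 50, 3: 80, 2: 120}          # made-up Table B for "0"
T, table_a = repeat_flag_threshold(freq, clumps, "0")
print(T)   # 2 here: with these tables the flag pays at every clump length down to 2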
Another simple addition to Huffman coding is the unrecognized-character flag. Suppose a file is to be encoded byte by byte, as is natural with IBM implementations. In most files, many of the 256 possible patterns of 8 bits do not occur. If these patterns are included in the coding, even with a weight of zero, the space needed to store the code table will increase. Thus, it is more efficient in terms of storage (and CPU time for code making) to code only those characters that actually appear in the file, along with a special flag to mark the presence of an unrecognized character. Then, if a character not in the code table becomes included in the file due to an update, it will be coded by the code for the unrecognized-character flag followed by the character itself written as 8 bits.

Unfortunately, this technique is not suitable for Hu-Tucker coding. Hu-Tucker coding is nearly as short as Huffman coding and it preserves alphabetical order. Thus, it would seem to be useful in situations where alphabetical sorting of records or keys is necessary. However, if the file is to be updated at all, the user is faced with two equally unpleasant alternatives. One is to code every possible character that could ever occur. Even if the absent characters were coded with frequency zero, this would greatly increase the expected code length. (Unlike in the Huffman code tree, the unused characters cannot be stuck off in one remote subtree, but must be interspersed with the other characters in natural alphabetical order, thus increasing the code length for all.) The other alternative is to use an unrecognized-character flag. But this technique destroys the alphabetic property which distinguishes Hu-Tucker. Thus, Hu-Tucker coding is of practical interest for files that are rarely, if ever, updated or whose character set is fixed.

In conclusion, Huffman coding is the optimal prefix-property bit-string coding given a particular choice of alphabet. However, due to patterns and dependencies among the data, the choice of alphabet itself can make a difference in the efficiency of the coding.

CONCLUSIONS

A variety of compression techniques were applied to large insurance files, some of which were already in compact form (that is, after a semantic analysis had been used to eliminate "redundant" information like long strings of blanks, etc.). The programs were all written in PL/1 and executed on an IBM 370/168 system. The CPU measurements given below therefore have only a relative meaning. For production purposes, assembler-code routines should perform roughly 10 times faster.

Differencing

Differencing techniques are limited to files of fixed-format records. Differencing was the most economical method as far as CPU time is concerned. Encoding required about 5 milliseconds (ms) per 1000 characters, and decoding about 3 ms. Sequential differencing, where each record is used as the pattern for the record succeeding it, yielded good compression ratios varying between 28 and 44 percent. However, the disadvantages of this technique are apparent. Any update requires a complete decoding of the entire file, since the code for every record depends on all records preceding it. This also implies that physical damage to a record will propagate and might hinder complete decoding of succeeding records. Trying to get around this by using a fixed pattern for the entire file (or a fixed pattern for each block) alleviated the problem at the expense of worse compression ratios, which ranged around 45 percent. Differencing is also characterized by the fact that there is no need for a scanning pass over the data before actual encoding. This, however, implies that one cannot automatically predict the compression ratio without actually encoding the file.

Huffman coding

Huffman coding, unlike differencing, can be applied to variable-length as well as fixed-length records. Huffman coding achieved good compression ratios (between 35 and 49 percent) and was surpassed on only one file by sequential differencing. The CPU time for scanning the file and obtaining the frequency table was negligible. The production of the code table from the frequency table required less than 50 ms. We observed that it was enough to sample only 3-5 percent of a file in order to obtain frequency tables and produce a code table identical to the one obtained by scanning the entire file. The cost in CPU time of encoding and decoding was about 100 ms per 1000 characters, or roughly 20 times more than differencing (in the cases where differencing is applicable). This difference is attributable to bit-level versus byte-level manipulation.
Applying Huffman coding together with the repeat-flag schema improved the compression ratio to between 28 and 43 percent, without any detectable change in CPU cost for encoding or decoding.

Huffman coding requires an initial statistical pass through the file, or through part of it. The frequency table obtained from the initial pass gives an excellent indication of the compression ratio that could be achieved if the file were compressed using Huffman coding; i.e., the user can decide whether it is worthwhile compressing a file without the need for an actual compression run. The frequency table that is attached to the file can be updated continuously with every deletion and insertion to the file (at negligible CPU cost), so that the actual compression ratio of the file and the maximum achievable compression are always available to the user or to an automatic monitoring routine for deciding whether a new code table should be produced.

Hu-Tucker

The Hu-Tucker code, as was mentioned in the previous section, is a statistical code which preserves alphabetic ordering. The CPU costs of using Hu-Tucker coding are the same as those for Huffman coding. The compression ratios achieved were only very slightly worse than Huffman's; the decline in compression (as compared to Huffman) never exceeded 7 percent. Hu-Tucker coding is especially useful for directories and files where frequent sorting is necessary. The alphabetic property enables the user to sort a compressed file without the need to decompress it.

In short, we found statistical compression methods to be more generally applicable than differencing to a variety of file structures, alas at the cost of higher CPU time.

REFERENCES

1. Ling, H., and F. P. Palermo, A Block Oriented Information Compression, IBM San Jose Research Center, Report RJ 1172, No. 19024.
2. Hardgrave, W. T., "The Prospects for Large Capacity Set Support Systems Imbedded within Generalized Data Management Systems," International Computing Symposium, Davos, Switzerland, Sept. 4-7, 1973.
3. Huffman, D. A., "A Method for the Construction of Minimum-Redundancy Codes," Proceedings of the I.R.E., 40, pp. 1098-1101, Sept. 1952.
4. Hu, T. C., and A. C. Tucker, "Optimal Computer Search Trees and Variable-Length Alphabetical Codes," SIAM Journal on Applied Mathematics, 21, p. 514, 1971.
5. Kreutzer, P. J., Data Compression for Business Applications, Navy Fleet Material Support Office.
6. Tunstall, Brian, "Synthesis of Digital Compression Codes," Hawaii International Conference on System Sciences, Jan. 1968, pp. 266-268.
7. Tunstall, Brian, Synthesis of Noiseless Compression Codes, Research Report #67-7, Georgia Institute of Technology.
8. DeMaine, P. A. D., Principles of the NAPAK Alphanumeric Compressor in the SOLID System, National Bureau of Standards, Technical Note 413, Part III, August 15, 1967.
9. Gilbert, E. N., and E. F. Moore, "Variable-Length Binary Encodings," The Bell System Technical Journal, July 1959.
10. Ott, Eugene, "Compact Encoding of Stationary Markov Sources," IEEE Transactions on Information Theory, Vol. IT-13, No. 1, Jan. 1967.
11. Rottwitt, Theodore, Jr., and P. A. D. DeMaine, "Storage Optimization of Tree Structured Files Representing Descriptor Sets."
12. Rice, R. F., The Code Word Wiggle: TV Data Compression, Technical Memorandum 33-428, National Aeronautics and Space Administration, Jet Propulsion Laboratory, California Institute of Technology, Pasadena, Calif., June 1969.
13. Knuth, D. E., The Art of Computer Programming, Vol. 3, Addison-Wesley.