A Classification of Compression Methods and Their Usefulness for a Large Data Processing Center

by DORON GOTTLIEB, STEVEN A. HAGERTH, PHILIPPE G. H. LEHOT and HENRY S. RABINOWITZ
Fireman's Fund American Insurance Companies
San Francisco, California
INTRODUCTION
The compression techniques surveyed in this paper all work
to reduce storage space for data files at the price of increased
CPU activity needed for compression and decompression.
As CPU time becomes cheaper relative to the cost of external
storage devices, compression appears as an increasingly
attractive option for dealing with large files.
In a small shop, which is typically I/O bound, compression
uses available CPU time to decrease the amount of disc or
tape storage. More generally, compression of storage space
is achieved only at the expense of CPU time. The most
clear-cut use of compression is for archive files where the
main consideration is minimizing physical storage space.
This paper surveys available techniques for automatic
reversible compression of files; i.e., techniques that require
no special knowledge of the contents of a file. The theoretical
advantages of the two main categories of compression (differencing and statistical encoding) are compared, and the
practical results of these techniques on large insurance files
are shown, both in terms of compression efficiency and CPU
efficiency.
Suggestions are offered for improving the compression
achieved through Huffman coding by adding a schema to
code strings of a repeated character. An algorithm is given
to find the threshold for the minimal length of those strings
whose coding will result in improved compression.
DEFINITIONS AND USES OF COMPACTION AND COMPRESSION

Definitions

Since there are no standardized definitions of compaction and compression, we propose the following usage, to be followed throughout this paper:

Compaction of data means any technique which reduces the size of the physical representation of the data while preserving a subset of the information deemed "relevant information."

Compression of data is a compaction technique which is completely reversible.

Compression ratio is the size of the compressed file expressed as a percentage of the original file.

A compaction technique that is not a compression technique involves elimination of information deemed superfluous in order to decrease overall storage requirements. Such a technique is, by definition, dependent on the semantics of the data.

The file-oriented techniques studied in this paper are primarily compression techniques since these are the easiest to implement in a generalized fashion. While a familiarity with the semantics of a file is necessary for maximal compaction, compression techniques have the advantage of "automatic" applicability to a wide variety of files.

Compaction techniques that are irreversible are most applicable to directories of a file. Indeed, at a directory level there is often an advantage in disregarding less important information, which may be carried in lower directories or in the file itself, so as to speed up the directory scanning and the overall efficiency of the "general directory access method."

COMPACTION OF A SEQUENCE OF SORTED RANDOM KEYS

Introduction

A good example of compaction is the following front-compression/rear-compaction scheme on a sequence of sorted keys. The scheme achieves a very compact first-level directory in which only those portions of a key K are kept that are

- not identical to the previous key
- necessary to make K unique; i.e., distinct from the previous key and the following key.
In particular, the "front string" (the initial string of
characters of K identical to the same-positioned characters
in the key before K) will be skipped. The "rear string"
(the string of trailing characters which are not needed to
distinguish K from the previous key and the following key)
is knocked out. Rear compaction involves a loss of information. Hence, the keys must be carried with their full information at the level of the record, or at some intermediary level.
Front compression
The leading bits of a key which are identical to the previous key's leading bits constitute the FRS (front redundant string) and need not be repeated. The FRS is expanded to include one extra bit, since it follows automatically that if the first n bits of a key are the initial repeated string, then the (n+1)st bit must be different. Instead of the FRS itself, a number can be written specifying the length of the FRS. This number will only require a field of bits equal to the logarithm (base 2) of the length of the key. For example, if m is the length of the key, say m = 32 bits, then the length of the FRS cannot exceed m; hence, the number of bits needed to express the FRS length is ⌈log₂ m⌉ = 5 bits.
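To make the rule concrete, the following sketch (in modern Python, purely illustrative; the paper's own programs were written in PL/1) front-compresses a sorted list of distinct, equal-length bit-string keys, storing for each key the FRS length and the remaining bits. The function names and sample keys are ours.

    # Illustrative front compression of sorted bit-string keys (assumed distinct
    # and of equal length). Each key is replaced by (FRS length, remaining bits).

    def frs_length(prev, key):
        """Length of the common prefix with the previous key, plus the one forced bit."""
        n = 0
        while n < len(key) and key[n] == prev[n]:
            n += 1
        return n + 1                      # the (n+1)st bit is forced, so it joins the FRS

    def front_compress(keys):
        out, prev = [], None
        for key in keys:
            if prev is None:
                out.append((0, key))      # the first key is kept whole
            else:
                L = frs_length(prev, key)
                out.append((L, key[L:]))  # FRS length + useful remainder
            prev = key
        return out

    def front_decompress(pairs):
        keys, prev = [], None
        for L, rest in pairs:
            if L == 0:
                key = rest
            else:
                forced = "1" if prev[L - 1] == "0" else "0"   # the bit that must differ
                key = prev[:L - 1] + forced + rest
            keys.append(key)
            prev = key
        return keys

    keys = ["10101010101010101", "10101011010101010", "10101011010111011"]
    assert front_decompress(front_compress(keys)) == keys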
Rear compaction
Unlike front compression, which suppresses some redundancy but does not really do away with any information per se (provided one knows the previous key), rear compaction will do away with information which is judged unnecessary.
Rear compaction will delete an "RRS" (Rear Redundant
String). An RRS is composed of those rightmost bits of a key
which are not necessary to uniquely distinguish this key with
respect to the set of all keys in the particular sequence to be
rear-compacted. We can immediately state the following
theorem:
Theorem:
Given a set of keys, some of which may be identical, in
order to find the RRS of a key K, it is enough to look at the
previous key P and the following key A in the sorted sequence; i.e., the RRS for K relative to the whole set of keys
is identical to the RRS for K relative to the set of 3 keys
P, K, and A.
The useful string (US) is what is left of the key after the
FRS and the RRS have been removed.
Example (the FRS, US, and RRS shown are those of key #i):

P  Key #(i-1)  1 0 1 0 1 0 1 0  1 0 1 0 1  0 1 0 1
K  Key #i      1 0 1 0 1 0 1 1  0 1 0 1 0  1 0 1 0
               (      FRS    )  (  US   )  ( RRS )
A  Key #(i+1)  1 0 1 0 1 0 1 1  0 1 0 1 1  1 0 1 1

Key #i will be coded as (8) 0 1 0 1 0, where 8 is the size of the front redundant string (FRS) and "01010" is the useful string left over after front and rear compaction. Note that the last bit, and only the last bit, of the FRS differs from the corresponding bit of the previous record. Note also that the FRS of key #i is sufficient to distinguish it from key #(i-1), but that the 5 bits of the useful string (US) are needed to distinguish key #i from key #(i+1).

Note: Another way of looking at rear compaction is to define the useful string of key #i as follows: if the FRS of key #(i+1) contains the FRS of key #i, then the US of key #i is the string obtained by deleting the FRS of key #i from the FRS of key #(i+1); otherwise it is null.

If the keys are viewed as binary numbers of n bits each, then the compacted key #i will occupy ⌈log₂(Kᵢ₊₁ - Kᵢ)⌉ + ⌈log₂(Kᵢ - Kᵢ₋₁)⌉ + 2⌈log₂ n⌉ bits in the first case and 2⌈log₂ n⌉ bits in the second.
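As a worked check of the theorem, the following hedged sketch (our own Python, with our own function names) recovers the FRS, US, and RRS of key #i of the example from its two sorted neighbours alone.

    # Front/rear compaction of one key K given only its sorted neighbours
    # P (previous) and A (following); keys are '0'/'1' strings, assumed distinct.

    def common_prefix_len(x, y):
        n = 0
        while n < len(x) and n < len(y) and x[n] == y[n]:
            n += 1
        return n

    def compact_key(P, K, A):
        frs_len = common_prefix_len(P, K) + 1   # front redundant string, incl. the forced bit
        keep = common_prefix_len(K, A) + 1      # bits needed to stay distinct from A
        us = K[frs_len:keep]                    # useful string (empty in the "null" case)
        rrs = K[keep:]                          # rear redundant string, dropped
        return frs_len, us, rrs

    P = "10101010101010101"
    K = "10101011010101010"
    A = "10101011010111011"
    assert compact_key(P, K, A) == (8, "01010", "1010")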
COMPRESSION BY DIFFERENCING
The term differencing describes techniques which compare
a current record to a pattern record and retain only the differences between them; i.e., information in compressed
record = information in current record-information already
in pattern record.
This technique is particularly successful with large record
files of alphanumeric characters where most corresponding
fields in different records are the same (or even blanks or
zeros); also, compression is often improved by sorting the
file on the largest field.
In some sense, the differencing scheme is a generalization
of front compression seen above. The process of compressing
the front string is repeated for each maximal substring in the
current record which matches a substring (in the same
position) in a pattern record. The start and end signals for
such a matched substring are the overhead for the scheme.
The information unit on which differencing is performed can
be the bit, the byte, the field, or logical information.
a. Bit: Both current and pattern records are considered as equal-length bit strings. (They could also be left- or right-justified variable-length bit strings.)
b. Byte or character: Both current and pattern records are viewed as character strings. (Byte access being cheaper, this is the most common case; see the sketch following this list.)
c. Field: The record is viewed as a string of fields (each
with its own characteristics). Quite often, the start and
end signals for the unmatched "strings" will be implemented by bit maps, where each bit of the map is on
or off to signal whether a given field of the current
record is identical to the corresponding field of the
pattern record. It is a rougher scheme, but it may
present the advantage of less overhead whenever
matching fields are frequent.
d. Logical information (instead of physical data such as
bit/byte/field) as in the example:
date 1, date 2, date 3 (=) date 1, interval 2, interval 3
where interval 2 = date 2 - date 1
interval 3 = date 3 - date 2
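A minimal byte-level differencing sketch (case b above) follows; the record layout, the run encoding as (offset, differing bytes) pairs, and the sample records are illustrative assumptions rather than the authors' format.

    # Byte-level differencing of a fixed-length current record against a pattern
    # record: only the runs of differing bytes are kept, each with its offset.

    def diff_encode(current: bytes, pattern: bytes):
        assert len(current) == len(pattern)
        runs, i = [], 0
        while i < len(current):
            if current[i] == pattern[i]:
                i += 1
                continue
            start = i
            while i < len(current) and current[i] != pattern[i]:
                i += 1
            runs.append((start, current[start:i]))   # offset + differing bytes
        return runs

    def diff_decode(runs, pattern: bytes) -> bytes:
        record = bytearray(pattern)
        for start, chunk in runs:
            record[start:start + len(chunk)] = chunk
        return bytes(record)

    pattern = b"JOHN  SMITH   0000SF"
    current = b"JANE  SMITH   0420SF"
    assert diff_decode(diff_encode(current, pattern), pattern) == current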
In conclusion, we see that differencing schemes (like front compression, which is a special case) seek to diminish the overall amount of information by not repeating (and actually subtracting) that part of the information in a record
which is already present in another (previous/pattern)
record.
Most often, differencing is applied to sequential files where
the pattern record is taken to be the previous record in the
file, which itself may have been sorted.
If used with a direct access file, the first record of the block which is directly accessed should be left intact (non-compressed). This may be expensive when the ratio (size of non-compressed record) / (size of block) is not small enough. In this case, a change in the blocking format might be warranted.
Zero and blank compression techniques can be viewed as a
special case of differencing in which a zero or blank record is
used as the pattern record for the entire file.
The use of the same pattern record for the whole file may
not yield as good a compression as a schema where the pattern
used to compress each record is the record preceding it. But
the latter choice is more expensive in encoding and decoding
time. Indeed, whenever a record is to be read, every record preceding it has to be decoded; i.e., half a block of decompression on the average. Deletions and insertions are, clearly,
even costlier.
The Ling-Palermo algorithm for compression of blocks of data,1 through a clever use of the linear-dependence concept, is an extension of differencing, and so is the QUATREE method by Hardgrave.2
STATISTICAL ENCODING
A statistical encoding is a transformation of the user's
alphabet, converting each member of the alphabet into a
code bit string whose length is inversely related to the
frequency of the member in a text.
A text is normally written using a fixed alphabet where each character is represented by a fixed-length bit string (e.g., a byte). A statistical encoding schema attempts to take advantage of the fact that different characters will usually occur with different frequencies. Coding each character as a bit string of length inversely related to its frequency (i.e., coding frequent characters with short strings and non-frequent characters with long ones) will usually compress the text.
If a text is written in an alphabet I = {a₁, ..., aₙ} where each character occupies k bits, then an efficient statistical encoding will assign a code βᵢ to each character aᵢ such that

    Σ_{i=1}^{n} fᵢ·|βᵢ| ≤ k·N,

where fᵢ is the frequency of aᵢ in the text, |βᵢ| is the length of the code βᵢ, and N is the number of characters in the text.
An essential property of any statistical encoding schema is
complete reversibility. That is, the ability to retrieve the
original text from the encoded one in a finite (preferably
linear) number of steps. Another desired quality is the prefix
property where no code βᵢ is the prefix of another code βⱼ.
This property assures both complete and unique reversibility
and also that the decoder never has to back up and rescan any
portion of the text. It is sometimes desired to have a coding schema that will preserve the alphabetic ordering of the user's alphabet (the alphabetic property); that is, if aᵢ precedes aⱼ in the original alphabet, one should be able to deduce this fact from the codes βᵢ and βⱼ without having to decode them.
Information-theoretic considerations assure us that when an alphabet of n characters is coded so that complete reversibility is assured, N·H is the shortest possible binary representation for a text of N characters, where H is the
entropy of the distribution of characters in the text. This
means, roughly, that the more "skewed" the distribution,
the better the compression.
The Huffman coding scheme3 is a very elegant and simple statistical coding algorithm with the prefix property. It is optimal in the sense that its performance reaches the information-theoretic lower bound stated above. The Hu-Tucker algorithm4 is a statistical coding scheme with both the prefix and alphabetic properties. In the next section, we
discuss in more detail the application of these two algorithms.
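For concreteness, a standard textbook construction of a Huffman code from a character-frequency table is sketched below (in Python, as an illustration of the algorithm of Reference 3 only; it is not the package evaluated in the next section, and the sample frequencies are invented).

    # Build a prefix code from a frequency table: repeatedly merge the two
    # least frequent subtrees, prepending '0' to one side and '1' to the other.
    import heapq
    from itertools import count

    def huffman_code(freq):
        """freq: dict character -> frequency; returns dict character -> bit string."""
        if len(freq) == 1:
            return {c: "0" for c in freq}           # degenerate one-symbol alphabet
        tiebreak = count()                          # keeps heap entries comparable
        heap = [(f, next(tiebreak), {c: ""}) for c, f in freq.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)
            f2, _, right = heapq.heappop(heap)
            merged = {c: "0" + b for c, b in left.items()}
            merged.update({c: "1" + b for c, b in right.items()})
            heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
        return heap[0][2]

    code = huffman_code({" ": 50, "0": 30, "A": 12, "Z": 8})
    # More frequent characters receive shorter codes, and no code is a prefix of another.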
EVALUATION OF HUFFMAN CODING FOR LARGE
BUSINESS FILES
Huffman coding, based on statistical characteristics of a file, provides an easy and effective method of file compression without necessitating any inquiry into the semantics of file records. Thus, one package can be used on a wide variety of files to achieve compression without investment of large amounts of programmers' time to investigate particular files for their storage-wasteful properties. In testing a Huffman encoding package on a variety of large insurance files, the worst results encountered (on an already compact binary file) were 50 percent compression.
Furthermore, contrary to Kreutzer,5 Huffman coding is suitable for files that are frequently updated. This is because the compression ratio achieved by Huffman coding can be discerned immediately from a table of the frequency of occurrence of each character in the file. If the frequency of occurrence of letter i is fᵢ, then the expected code length generated by the Huffman code is closely approximated by the entropy of the frequency table,

    H = Σ_{i∈I} [(fᵢ / Σ_{j∈I} fⱼ) · log₂(Σ_{j∈I} fⱼ / fᵢ)],

where I is the alphabet being coded. In fact, H ≤ expected code length ≤ H + 1. Thus, if a file is frequently updated, it is easy to compile new frequency
statistics at the same time updates are performed. Then,
after a certain number of updates, the expected code length
using the new statistics and a new coding can be compared
with the expected code length using the new statistics and the
old code. If a significant improvement can be made by recoding (which will probably occur only rarely), then a new
code can be generated and the file recoded.
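A small sketch of this recoding test follows (the function names and the 0.05-bit improvement threshold are our assumptions): the entropy of the new frequency table, which approximates the expected length of a fresh Huffman code, is compared with the expected length obtained by keeping the old code.

    from math import log2

    def entropy(freq):
        """H of the frequency table; H <= expected Huffman code length <= H + 1."""
        total = sum(freq.values())
        return sum((f / total) * log2(total / f) for f in freq.values() if f > 0)

    def expected_length(freq, code):
        """Average bits per character if the existing code is kept.
        Assumes every character in the table already has a code."""
        total = sum(freq.values())
        return sum((f / total) * len(code[c]) for c, f in freq.items())

    def should_recode(new_freq, old_code, gain=0.05):
        # Recode only if a fresh code would save at least `gain` bits per character.
        return expected_length(new_freq, old_code) - entropy(new_freq) > gain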
Because the compression ratio can be calculated using a
simple statistical pass of the file, the Huffman technique
gives more immediate information than a differencing
technique which cannot give an advance notice of its effectiveness before an actual encoding pass occurs. Huffman
coding has the further advantage, over differencing techniques, that records can be decoded individually without
need for some reference to a pattern record. In fact, differencing techniques derive most of their power from the fact
that blanks or zeroes are commonly repeated, a fact that is
handled well by Huffman coding, and even better by a
modification of Huffman coding that is discussed below.
It is hard to improve on Huffman coding while still preserving its "automatic" effectiveness; i.e., without reference
to data semantics. However, some progress can be made in
special coding techniques for repeating characters and unrecognized characters.
Huffman code is optimal given the assumption that the
probability of appearance of any letter is independent of the
probability of appearance of any other letter. Of course, this
is never actually the case, but this assumption is necessitated
by the difficulty of discerning patterns in an automatic
fashion. The simplest "pattern" is a repeating string of the same character (a clump). Usually the most frequent character in a file (say, blank or zero) is not randomly distributed
throughout the file but occurs in clumps. Since this commonly
occurring condition violates the assumptions under which
Huffman is optimal, it is possible to devise strategies to
improve on Huffman, the simplest of which is to invent a
"repeat" flag.
A repeat flag can be included in the frequency table of a
file with frequency equal to the number of occurrences of
repeat strings whose length is greater than some threshold T.
Then, for example, a string of five consecutive occurrences of some character c, instead of being encoded as (code of c)(code of c)(code of c)(code of c)(code of c), will be encoded as (code of repeat flag)(5)(code of c). (Note that the introduction of the repeat flag modifies the frequencies of the characters which are repeated beyond the threshold.)
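As an illustration only (REPEAT_FLAG, the chosen character ch, and the threshold T are assumptions of the sketch), clumps of the chosen character can be turned into flag-and-length tokens before ordinary Huffman coding of the remaining characters:

    import re

    REPEAT_FLAG = object()          # stands for the extra "repeat" symbol in the code table

    def tokenize(text, ch, T):
        """Replace clumps of `ch` of length >= T by (REPEAT_FLAG, clump size) tokens."""
        tokens = []
        for run in re.finditer(r"(.)\1*", text, flags=re.S):
            s = run.group(0)
            if s[0] == ch and len(s) >= T:
                tokens.append((REPEAT_FLAG, len(s)))   # flag + clump size
            else:
                tokens.extend(s)                       # ordinary characters, coded one by one
        return tokens

    # tokenize("AB   000   C", " ", 3)
    #   -> ['A', 'B', (REPEAT_FLAG, 3), '0', '0', '0', (REPEAT_FLAG, 3), 'C']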
Despite the vicious cycle nature of the problem, there is an
algorithm that enables one to estimate the lower threshold
T for length of repeated strings above which use of the
repeat flag is more efficient than simple Huffman code. This
algorithm depends only on the frequency of occurrence of
characters and of repetitions. In practice, it turns out that
this technique provides significant improvement over Huffman only when applied to the most frequent character in the
file.
We will assume here that repeating strings of character
ai are to be encoded using the format:
(code of repeat flag) (clump size)
The following tables are obtained by scanning the file once:
TABLE A
Frequency of Characters

    Character:  a₁   a₂   ...   aₙ
    Frequency:  f₁   f₂   ...   fₙ        (N = Σ_{i=1}^{n} fᵢ)

TABLE B
Frequency of Clumps of Character aᵢ

    Clump length:         m     m-1    ...   2
    Frequency of clumps:  φₘ    φₘ₋₁   ...   φ₂

where fⱼ is the total frequency of character aⱼ in the text and φₖ is the number of clumps (of character aᵢ) of length exactly k (i.e., k successive occurrences of aᵢ bounded on both sides by different characters). The threshold T is then the minimal K satisfying:

    length of (flag code) + (count field) ≤ K · (length of the code for aᵢ)
Denoting by aₙ₊₁ the flag character, we will use log₂(N/fⱼ) as an estimate for the length of the Huffman code for character aⱼ, and ⌈log₂ m⌉ as the number of bits in the fixed-size count field (m is the size of the longest clump of aᵢ in the text).
The following is the algorithm to find T:

(1) Set r = 0.
(2) Set r = r + 1.
(3) Set fᵢ = fᵢ - (m - r + 1)·φ_{m-r+1} (adjusting the frequency of aᵢ by subtracting occurrences of aᵢ in clumps of size m - r + 1).
(4) Set fₙ₊₁ = fₙ₊₁ + φ_{m-r+1} (adjust the frequency of the flag).
(5) Set N = Σ_{j=1}^{n+1} fⱼ (adjust the total).
(6) Evaluate:

    log₂(N/fₙ₊₁) + ⌈log₂ m⌉ < (m - r + 1)·log₂(N/fᵢ)

If the inequality holds, go to Step 2. Otherwise, readjust fᵢ = fᵢ + (m - r + 1)·φ_{m-r+1}, fₙ₊₁ = fₙ₊₁ - φ_{m-r+1}, N = Σ_{j=1}^{n+1} fⱼ (i.e., use the frequencies from the previous step). Set T = m - r + 2 and proceed to produce the Huffman code using the resulting Table A. Encode the file applying the repeat-flag format to aᵢ clumps of size ≥ T.
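The following Python transcription of the algorithm is offered only as an illustration; the dictionary layout of Tables A and B and the frequency figures at the end are invented.

    from math import ceil, log2

    def repeat_threshold(freqs, clumps, ch):
        """freqs: Table A (character -> frequency); clumps: Table B for `ch`
        (clump length k -> phi_k, with clumps[max] > 0); returns (T, adjusted
        Table A, flag frequency). Assumes ch also occurs outside of clumps."""
        m = max(clumps)                       # longest clump of ch in the text
        count_bits = ceil(log2(m))            # fixed-size count field
        f = dict(freqs)
        f_flag = 0                            # frequency of the repeat-flag symbol
        for L in range(m, 1, -1):             # clump lengths m, m-1, ..., 2
            phi = clumps.get(L, 0)
            f[ch] -= L * phi                  # step 3: occurrences absorbed into clumps
            f_flag += phi                     # step 4: one flag per clump
            N = sum(f.values()) + f_flag      # step 5: adjusted total
            flag_cost = log2(N / f_flag) + count_bits
            plain_cost = L * log2(N / f[ch])
            if not flag_cost < plain_cost:    # step 6 fails: flag no longer pays off
                f[ch] += L * phi              # use the frequencies from the previous step
                f_flag -= phi
                return L + 1, f, f_flag       # T = m - r + 2
        return 2, f, f_flag                   # flag pays off down to clumps of length 2

    freqs = {" ": 40_000, "0": 30_000, "A": 20_000, "B": 10_000}
    clumps = {10: 500, 6: 800, 3: 2_000, 2: 3_000}
    T, table_a, flag_freq = repeat_threshold(freqs, clumps, "0")   # T == 5 for these figures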
The above algorithm could be modified so that clumps of
other frequent characters could be evaluated. This will require a table (like Table B) for each of the characters under
consideration and possibly a format:
(flag) (count) (code of repeating character)
Note that to obtain the expected gain in compression one
could, at Step 6, compute the entropy of Table A and use
that to compute the compression ratio.
Another simple addition to Huffman coding is the unrecognized character flag. Suppose a file is to be encoded byte
by byte, as is natural with IBM implementations. In most
files, many of the 256 possible patterns of 8 bits do not occur.
If these patterns are included in the coding, even with a
weight of zero, space needed to store the code table will
increase. Thus, it is more efficient in terms of storage (and
CPU time for code making) to code only those characters
that actually appear in the file, along with a special flag to
mark the presence of an unrecognized character. Then, if a
character not in the code table becomes included in the file
due to an update, it will be coded by the code for the unrecognized-character-flag followed by the character itself
written as 8 bits.
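A minimal sketch of this escape mechanism follows; ESCAPE is an assumed extra symbol that was included when the code table was built, and an 8-bit character set is assumed.

    ESCAPE = "<ESC>"          # the unrecognized-character flag, present in the code table

    def encode(text, code):
        """Characters absent from the code table are sent as the escape code
        followed by their raw 8 bits (8-bit character set assumed)."""
        bits = []
        for ch in text:
            if ch in code:
                bits.append(code[ch])
            else:
                bits.append(code[ESCAPE] + format(ord(ch), "08b"))
        return "".join(bits)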
Unfortunately, this technique is not suitable for Hu-Tucker coding. Hu-Tucker coding is nearly as short as
Huffman coding and it preserves alphabetical order. Thus, it
would seem to be useful in a situation where alphabetical
sorting of records or keys would be necessary. However, if
the file is to be updated at all, the user is faced with two
equally unpleasant alternatives. One is to code every possible
character that could ever occur. Even if the absent characters
were coded with frequency zero, this would greatly increase
the expected code length. (Unlike in the Huffman code tree,
the unused characters cannot be stuck off in one remote
subtree, but must be interspersed with the other characters
in natural alphabetical order, thus increasing the code length
for all.) The other alternative is to use an unrecognized
character flag. But this technique destroys the alphabetic
property which distinguishes Hu-Tucker. Thus, Hu-Tucker
coding is of practical interest in files that are rarely, if ever,
updated or whose character set is fixed.
In conclusion, Huffman coding is the optimal prefix
property bit string coding given a particular choice of alphabet. However, due to patterns and dependencies among
the data, the choice of alphabet itself can make a difference
in the efficiency of the coding.
CONCLUSIONS
A variety of compression techniques were applied to large insurance files, some of which were already in compact form (that is, after a semantic analysis was used to eliminate "redundant" information like long strings of blanks, etc.). The programs were all written in PL/1 and executed on an IBM 370/168 system. CPU measurements given below have only a relative meaning. For production purposes, assembler-code routines should perform roughly 10 times faster.
Differencing
Differencing techniques are limited to files of fixed-format records. Differencing was the most economical method in terms of CPU time. Encoding required about 5
milliseconds (mls) per 1000 characters, and decoding about
3 mls. Sequential differencing, where each record is used as a
pattern for the record succeeding it, yielded good compression
ratios varying between 28 and 44 percent. However, the disadvantages of this technique are apparent. Any update requires a complete decoding of the entire file since the code
for every record depends on all records preceding it. This also
implies that physical damage to a record will propagate and
might hinder complete decoding of succeeding records.
Trying to get around this by using a fixed pattern for the
entire file (or fixed pattern for each block) alleviated the
problem at the expense of yielding worse compression ratios
that ranged around 45 percent.
Differencing is also characterized by the fact that there is
no need for a scanning pass of the data before actual encoding.
This, however, implies that one cannot automatically predict
the compression ratio without actually encoding the file.
Huffman coding
Huffman coding, unlike differencing, can be applied to
variable length as well as fixed length records. Huffman
coding achieved good compression ratios (between 35 and
49 percent) and was surpassed only on one file by sequential
differencing. The CPU time for scanning the file and obtaining the frequency table was negligible. The production of the
code table from the frequency table required less than 50
mls. We have observed that it was enough to sample
only 3-5 percent of a file in order to obtain frequency tables
and produce a code table which was identical to the one obtained by scanning the entire file. The cost in CPU time of
encoding and decoding was about 100 mls per 1000 characters, or roughly 20 times more than differencing (in the case when differencing is applicable). This fact is attributable
to bit level versus byte level manipulation.
Applying Huffman coding together with the repeat-flag
schema improved the compression ratio to between 28 and
43 percent without any detectable change in CPU cost for
encoding or decoding.
Huffman coding requires an initial statistical pass through
the file or through part of it. The frequency table obtained
from the initial pass gives an excellent indication of the
compression ratio that could be achieved if the file is to be
compressed using Huffman coding; i.e., the user could decide
whether it is worthwhile compressing a file without the need
for an actual compression run.
The frequency table that is attached to the file can be updated continuously with every deletion and insertion to the
file (at negligible CPU cost) so that the actual compression
ratio of the file and the maximum achievable compression
are always available to the user or to an automatic monitoring routine for deciding whether a new code table should be
produced.
Hu-Tucker
The Hu-Tucker code, as was mentioned in the previous
section, is a statistical code which preserves alphabetic
ordering.
The CPU costs of using Hu-Tucker coding are the same as
those for Huffman coding. The compression ratios achieved
were only very slightly worse than Huffman. The decline in
compression (as compared to Huffman) never exceeded
7 percent.
Hu-Tucker coding is especially useful for directories and
files where frequent sorting is necessary. The alphabetic
property enables the user to sort a compressed file without
the need to decompress it.
In short, we found statistical compression methods to be
more generally applicable than differencing to a variety of
file structures, alas, at a cost of higher CPU time.
REFERENCES
1. Ling, H., and F. P. Palermo, A Block Oriented Information Compression, IBM San Jose Research Center, Report RJ 1172, No. 19024.
2. Hardgrave, W. T., "The Prospects for Large Capacity Set Support Systems Imbedded within Generalized Data Management Systems," International Computing Symposium, Davos, Switzerland, Sept. 4-7, 1973.
3. Huffman, D. A., "A Method for the Construction of Minimum-Redundancy Codes," Proc. I.R.E., 40, pp. 1098-1101, Sept. 1952.
4. Hu, T. C., and A. C. Tucker, "Optimal Computer Search Trees and Variable Length Alphabetical Codes," S.I.A.M. Journal of Applied Mathematics, 21, 514 (1971).
5. Kreutzer, P. J., Data Compression for Business Applications, Navy Fleet Material Support Office.
6. Tunstall, Brian, "Synthesis of Digital Compression Codes," Hawaii International Conference on System Sciences, Jan. 1968, pp. 266-268.
7. Tunstall, Brian, Synthesis of Noiseless Compression Codes, Research Report #67-7, Georgia Institute of Technology.
8. DeMaine, P. A. D., Principles of the NAPAK Alphanumeric Compressor in the SOLID System, National Bureau of Standards, Tech. Note 413, Part III, August 15, 1967.
9. Gilbert, E. N., and E. F. Moore, "Variable-Length Binary Encodings," The Bell System Technical Journal, July 1959.
10. Ott, Eugene, "Compact Encoding of Stationary Markov Sources," I.E.E.E. Transactions on Information Theory, Vol. IT-13, No. 1, Jan. 1967.
11. Rottwitt, Theodore, Jr., and P. A. D. DeMaine, "Storage Optimization of Tree Structured Files Representing Descriptor Sets."
12. Rice, R. F., The Code Word Wiggle: TV Data Compression, Technical Memorandum 33-428, National Aeronautics and Space Administration, Jet Propulsion Laboratory, Cal. Tech., Pasadena, Calif., June 1969.
13. Knuth, D. E., The Art of Computer Programming, Vol. 3, Addison-Wesley.