UNIVERSAL CODING WITH DIFFERENT MODELERS IN DATA COMPRESSION

by

Reggie ChingPing Kwan

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

MONTANA STATE UNIVERSITY
Bozeman, Montana

July 1987

© Copyright by Reggie ChingPing Kwan (1987)


APPROVAL

of a thesis submitted by Reggie ChingPing Kwan

This thesis has been read by each member of the thesis committee and has been found to be satisfactory regarding content, English usage, format, citation, bibliographic style, and consistency, and is ready for submission to the College of Graduate Studies. Approved by the chairperson of the graduate committee, the head of the major department, and the graduate dean.


STATEMENT OF PERMISSION TO USE

In presenting this thesis in partial fulfillment of the requirements for a master's degree at Montana State University, I agree that the Library shall make it available to borrowers under rules of the Library. Brief quotations from the thesis are allowable without special permission, provided that accurate acknowledgment of source is made.

Permission for extensive quotation from or reproduction of this thesis may be granted by my major professor, or in his/her absence, by the Director of Libraries when, in the opinion of either, the proposed use of the material is for scholarly purposes. Any copying or use of the material in this thesis for financial gain shall not be allowed without my written permission.


TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT

1. INTRODUCTION TO DATA COMPRESSION
   A Basic Communication System
   The Need For Data Compression
   Data Compression and Compaction - A Definition
   Theoretical Basis
   Components of a Data Compression System
      Information Source
      Modeler
      Encoder
      Compressed File
      Decoder

2. UNIVERSAL CODING
   Coding Questions and Efficiency
   Criteria of a Good Compression System
   Codes with Combined Modelers and Coders
      Huffman Code for Zero-memory Source
      Huffman Code for Markov Source
      LZW Compression System
   Universal Coder
      Idea Origin - Elias Code
      The Algorithm
      The Precision Problem
      The Carry-over Problem

3. UNIVERSAL CODING WITH DIFFERENT MODELERS
   Is There a Universal Modeler?
   Nonadaptive Models
      Zero-memory Source
      Markov Source
   Adaptive Models
      Zero-memory Source
      Markov Source
      LZW Code

4. COMPRESSION RESULTS AND ANALYSIS
   Test Data and Testing Method
   Code Analysis
   Model Analysis
      The Best Case
      The Worst Case
   Conclusions
   Future Research

REFERENCES CITED

APPENDICES
   A. Glossary of Symbols and Entropy Expressions
   B. Definitions of Codes


LIST OF TABLES

1. Kikian code
2. Compressed Kikian code
3. Probabilities of the Markov source in figure 7
4. The LZW string table
5. Compression results for a variety of data types


LIST OF FIGURES

1. A basic communication system
2. Coding of an information source
3. Code trees of Code A and Code B
4. Components of a data compression system
5. Reduction process
6. Splitting process
7. State diagram of a first-order Markov source
8. Encoding state diagram
9. Decoding state diagram
10. A compression example for the LZW algorithm
11. A decompression example of figure 10
12. Subdivision of the interval [0, 1] for source sequences
13. Subdivision of the interval [0, 1] with the probabilities in binary digits
14. The encoding process of the universal encoder
15. The decoding process of the universal decoder
16. An example of a model file
17. An example of a model file for a first-order Markov source
18. An example of an adaptive modeler for a zero-memory source
19. An example of an adaptive modeler for a first-order Markov source
20. An example of converting the LZW table into a modeler
21. Subclasses of codes


ABSTRACT

In data compression systems, most existing codes have integer code length because of the nature of block codes. One of the ways to improve compression is to use codes with noninteger length. We developed an arithmetic universal coding scheme to achieve the entropy of an information source with the given probabilities from the modeler. We show how the universal coder can be used for both the probabilistic and nonprobabilistic approaches in data compression. To avoid a model file being too large, nonadaptive models are transformed into adaptive models simply by keeping the appropriate counts of symbols or strings. The idea of the universal coder is from the Elias code, so it is quite close to the arithmetic code suggested by Rissanen and Langdon. The major difference is the way that we handle the precision problem and the carry-over problem. Ten to twenty percent extra compression can be expected as a result of using the universal coder. The process of adaptive modeling, however, may be forty times slower unless parallel algorithms are used to speed up the probability calculating process.


Chapter 1

INTRODUCTION TO DATA COMPRESSION

A Basic Communication System

Communications are used to provide terminal users access to computer systems that may be remotely located, as well as to provide a means for computers to communicate with other computers for the purpose of load sharing and accessing the data and/or programs stored at a distant computer. The conventional communication system is modeled by:

1. An information source
2. An encoding of this source
3. A channel over, or through, which the information is sent
4. A noise (error) source that is added to the signal in the channel
5. A decoding and hopefully a recovery of the original information from the contaminated received signal
6. A sink for the information

This model, shown in figure 1, is a simplified version of any communication system. In (3), we are particularly interested in sending information either from now to then (storage), or from here to there (transmission).

Figure 1. A basic communication system: source (1) -> encode (2) -> channel (3) -> decode (4) -> sink (5).
The Need For Data Compression

There are three major reasons that we bother to code:

1. Error detection and correction
2. Encryption
3. Data compression

To recover from noise, parity check and Hamming (4,7) codes [1] are introduced to detect and correct errors respectively. To avoid unauthorized receivers, encryption schemes like the public key encryption system [2] are used. Finally, to reduce the data storage requirements and/or the data communication costs, there is a need to reduce the size of voluminous amounts of data. Thus, codes like Shannon-Fano [3], Huffman [3], LZW [4], and arithmetic codes [5], [6], etc. are employed.

Data compression techniques are most useful for large archival files, I/O bound systems with spare CPU cycles, and transfer of large files over long distance communication links, especially when only low bandwidth is available. This paper discusses several data compression systems and develops a universal coder for different modelers.

Data Compression and Compaction - A Definition

Data compression/compaction can be defined as the process of creating a condensed version of the source data by reducing redundancies in the data. In general, data compression techniques can be divided into two categories, nonreversible and reversible. In nonreversible techniques (usually called data compaction), the size of the physical representation of data is reduced while a subset of "relevant" information is preserved, e.g. the CCC algorithm [7] and rear compaction [8]. In reversible techniques, all information is considered relevant, and decompression recovers the original data, e.g. the LZW algorithm [4].

Theoretical Basis

The encoding of an information source can be illustrated by figure 2.

Figure 2. Coding of an information source: each symbol Si of the source alphabet S = { s1, ..., sq } is mapped to a code word Xi, a sequence of symbols from the code alphabet X = { x1, ..., xr }.

In fact, figure 2 is the component labeled (2) in figure 1. The simplest and most important coding philosophy in data compression is to assign short code words to those events with high probabilities and long code words to those events with low probabilities [3]. An event, however, can be a value, a source symbol, or a group of source symbols to be encoded.

Tables 1 and 2 are used to illustrate the idea to achieve data compression when the probabilities are known. Consider the language Kikian of a very primitive tribe Kiki. The Kikian can only communicate in four words: # = kill, @ = eat, z = sleep, and & = think. The source alphabet S = { #, @, z, & }. Assume an anthropologist studies Kikian and he encodes their daily activities with a binary code. Then the code alphabet X = { 0, 1 }. Without knowing the probabilities of the tribe's activities, or source symbols, we constructed the code for the source symbols as Code A in table 1. With the probabilities of each activity, or symbol, being known, we can employ our coding philosophy to construct a new code, Code B, as shown in table 2.
Table 1. Kikian code.

  Symbol   Code A
  #        00
  @        01
  z        10
  &        11

Table 2. Compressed Kikian code.

  Symbol   Probability   Code B
  #        0.125         110
  @        0.25          10
  z        0.5           0
  &        0.125         111

The average length L of a code word using any code is defined as

  L = SUM(i=1 to q) |C(Si)| * P(Si)     (1.1)

where

  q       = number of source symbols
  Si      = source symbols
  C(Si)   = code for Si
  |C(Si)| = length of the code for symbol Si
  P(Si)   = probability of symbol Si

The average length La of a code word using Code A is obviously 2 bits, because |C(Si)| = 2 for all i and the probabilities sum to one. On the other hand, the average length Lb of a code word using Code B is 1.75 bits, calculated as Lb = 0.125 * 3 + 0.25 * 2 + 0.5 * 1 + 0.125 * 3. That is, for our encoding system for the Kikian we have found a method using an average of only 1.75 bits per symbol, rather than 2 bits per symbol. Both codes, however, are instantaneous (see Appendix B), and therefore uniquely decodable. It is because all code words from both codes are formed by the leaves of their code trees, as in figure 3.

Figure 3. Code trees of Code A and Code B: in both trees every code word is a leaf.

A typical day of the Kikian could be sleep, eat, sleep, kill, sleep, think, eat, and sleep. If a Kikian kept a diary, z @ z # z & @ z would be the symbols appearing on a typical day. On the anthropologist's note book it would be 1001100010110110 without considering the probabilities of the different activities. With the probabilities being known, however, the daily activities of the Kikian can be represented by 01001100111100. In other words, to represent the same information, the daily activities of the Kikian, we can use either 16 bits or 14 bits. Code B, in this case, is clearly better than Code A even though they are both uniquely decodable, which is the major criterion in data compression. Nevertheless, at this point we cannot conclude that Code B gives us the best compression for the given probabilities, though it turns out to be true.

To determine the best compression we can get for the given probabilities, let us look at the information content of each event first. Let p(Si) be the probability of the symbol Si. We can say that the information content of Si is

  I(Si) = - log p(Si)     (1.2)

bits of information if log base two is used [3]. Code B uses only one bit to represent z and three bits to represent # because the information content of z is less than that of #:

  I(z) = -log2 p(z) = -log2 0.5 = 1 bit
  I(#) = -log2 p(#) = -log2 0.125 = 3 bits

Intuitively, if an event is highly unlikely to happen and it happens, it deserves more attention, or it contains more information.

To determine whether a code has the shortest average code length L, we have to compare L with the entropy H(S). If symbol Si occurs, we obtain an amount of information equal to

  I(Si) = - log2 P(Si) bits

The probability that Si occurs is P(Si), so the average amount of information obtained per symbol from the source is P(Si) I(Si) bits. This quantity, the average amount of information per source symbol,

  H(S) = - SUM(i=1 to q) P(Si) log2 P(Si) bits     (1.3)

is called the entropy H(S) of the source.
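As a quick check of (1.1) and (1.3), the short Python sketch below computes the average code length of Code B and the entropy of the Kikian source. The probabilities and code lengths come from Table 2; the function names are ours and are not part of the thesis.

```python
# A minimal check of equations (1.1) and (1.3) for the Kikian example.
import math

def average_length(probs, lengths):
    # L = sum |C(Si)| * P(Si)              (1.1)
    return sum(l * p for l, p in zip(lengths, probs))

def entropy(probs):
    # H(S) = - sum P(Si) * log2 P(Si)      (1.3)
    return -sum(p * math.log2(p) for p in probs if p > 0)

probs   = [0.125, 0.25, 0.5, 0.125]   # #, @, z, &
lengths = [3, 2, 1, 3]                # Code B word lengths

print(average_length(probs, lengths))  # 1.75 bits/symbol
print(entropy(probs))                  # 1.75 bits/symbol
```

For this source the two values coincide, which is why Code B cannot be improved upon.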
In Code B, Lb = H(S) = 1.75 bits, and therefore Code B provides the best compression with the given probabilities, according to Shannon's first theorem:

  Hr(S) <= L < Hr(S) + 1     (1.4)

The proof of the above theorem can be found in any information theory or coding book. It is obvious that the lower bound of L is the entropy. The upper bound of L, on the other hand, can be explained by the nature of block codes. The length of any block code is always an integer, even though the information content is not necessarily an integer, because |C(Si)| in (1.1) is calculated as |C(Si)| = ceiling(log 1/P(Si)).

Components of a Data Compression System

The fundamental components of a data compression system are:

1. Information source
2. Modeler
3. Encoder
4. Compressed file
5. Decoder

In a digital image compression system, a predictor may be used with the modeler to capture hopefully not-so-obvious redundancies. For the purpose of our discussion, we will concentrate on modelers, encoders and decoders. The components of a data compression system are shown in figure 4.

Figure 4. Components of a data compression system: information source -> modeler -> encoder -> compressed file (plus model file) -> decoder (with modeler).

Information Source

For communication purposes, an information source can be thought of as a random process {s(i)} that emits outcomes of random variables S(i) in a never ending stream. The time index i then runs through all the integers. In addition to the random variables S(i), the process also defines all the finite joint variables (S(i), S(j), ...). The most frequently studied information sources are either independent or Markovian. A source is said to be independent if p(Si) = p(Si|Sj), where j = 1,...,q. Therefore, it is often called a zero-memory source. A more general type of information source with q symbols is one in which the occurrence of a source symbol Si may depend on a finite number m of preceding symbols. Such a source (called Markovian, or an mth-order Markov source) is specified by giving the source alphabet S and the set of conditional probabilities

  P(Si | Sj1, Sj2, ..., Sjm)   for i = 1,2,...,q; jp = 1,2,...,q     (1.5)

For an mth-order Markov source, the probability of a symbol is known if we know the m preceding symbols.

Modeler

The modeler is responsible for capturing the "designated" redundancies of the information source. Designated redundancies can be viewed as the frequency measures used in the coding process. In a straightforward data compression system like the Huffman code [3], the modeler simply counts the occurrence of each symbol Si, for i = 1,2,...,q. In other, more sophisticated systems, e.g. the LZW algorithm [4], the modeler is responsible for putting new strings in the string table. Though that algorithm does not attempt to subdivide the process into modeling and coding, modeling and coding are actually done separately, as will be shown in chapter 3.

Depending on how the probabilities are calculated, we can distinguish between two cases: the adaptive and the nonadaptive models discussed in detail in chapter 3. In a nonadaptive model, a model file is generated as in figure 4.
The model file contains the necessary probabilities for both the encoder and the decoder.

Encoder

The encoder encodes the symbols or strings with the probabilities given by the modeler. The basic data compression philosophy, that symbols or strings with higher probabilities are assigned the shorter code words, applies here. The encoder also ensures the codes are uniquely decodable for reversible compression. In our data compression system, all the encoder needs from the modeler are the probabilities of all symbols and of the previous symbols. The whole encoding process will be adding and shifting.

Compressed File

The compressed file can be thought of as a file with a sequence of code words or code alphabet symbols. With X the set of code alphabet symbols, the compressed file x should simply be a subset of X*.

Decoder

The decoding process also involves the modeler, regardless of the nature of the modeler (adaptive or nonadaptive). If a nonadaptive modeler is used, a model file is required with the compressed file to provide all the necessary statistics. In our data compression system, the decoder goes through a subtract and shift process to recover the original source.


Chapter 2

UNIVERSAL CODING

Coding Questions and Efficiency

The three major questions in coding theory are:

1. How good are the best codes?
2. How can we design "good" codes?
3. How can we decode such codes?

In data compression systems, the first question has been answered by Shannon's first theorem described in the previous chapter. Shannon's first theorem shows that there exists a common yardstick with which we may measure any information source. The value of a symbol (or string, in some cases) from an information source S may be measured in terms of an equivalent number of r-ary digits needed to represent one symbol/string from the source; the theorem says that the average value of a symbol/string from S is H(S).

Let the average length of a uniquely decodable r-ary code for the source S be L. By Shannon's first theorem, L cannot be less than Hr(S). Accordingly, we define E, the efficiency of the code, by

  E = Hr(S) / L     (2.1)

We are trying to answer the other two questions in this thesis. In order to answer them, we need to study several existing data compression systems, because there are no absolute answers for the two questions. Before we do that, we shall look at what we mean by a "good" code in the next section.

Criteria of a Good Data Compression System

A good data compression system has to have the following three qualities:

1. The codes it produces have to be decodable.
2. The average code length should be as close to the entropy as possible, or E close to 1.0.
3. The coding (encoding and decoding) process is done in "acceptable" time.

Decodability can easily be achieved by constructing all the code words from the leaves of the code tree in the case of block codes. We shall develop ways to ensure decodability in later sections.
Later in the chapter, we describe a universal coding technique that guarantees the average code length to be the same as the entropy for the given probabilities provided by the modeler. That is to say, the efficiency of the code equals one according to (2.1). Besides looking at the efficiency of codes, we can also compare their compression ratios, defined by

  PC = ( |OF| - |CF| ) / |OF|     (2.2)

where PC = percentage of compression, OF = original file, and CF = compressed file.

There is no absolute way to measure "acceptable" time. Some codes are relatively more acceptable than others, while it is clear that there is a trade-off between compression ratio and time. The longer the modeler analyzes the information source, the smaller the compressed file is going to be.

Codes with Combined Modelers and Coders

A lot of data compression systems do not separate their processes into modeling and coding. In this chapter, we look at several popular data compression systems and then discuss the universal coding scheme that we use.

Huffman Code for Zero-memory Source

The Huffman code [1] is one of the variable-length encodings that enforces the data compression philosophy - the most frequent symbols have the shortest codes. Consider the source S with the symbols s1, s2, ..., sq and symbol probabilities p1, p2, ..., pq, and the code X with the symbols x1, x2, ..., xr. The generation of the Huffman code can be divided into two processes, namely reduction and splitting. The reduction process involves:

a. Let the symbols be ordered so that p1 >= p2 >= ... >= pq.
b. Combine the last r symbols into one symbol.
c. Rearrange the remaining source, containing q - r + 1 symbols, in decreasing order.
d. Repeat (b) and (c) until there are less than or equal to r symbols.

After the reduction process, we then go backward to perform the splitting process:

a. List the r code symbols in a fixed order.
b. Assign the r code symbols to the final reduced source.
c. Go backward to where one of the symbols splits into r symbols.
d. Append the r code symbols to the last r source symbols.
e. Repeat (c) and (d) until there are q source symbols.

Figures 5 and 6 are the reduction and splitting processes of an example with S = { s1, s2, s3, s4, s5 }, p1 = 0.4, p2 = 0.2, p3 = 0.2, p4 = 0.1, p5 = 0.1, and X = { 0, 1 }. The Huffman code for S is formed where C(s1) = 1, C(s2) = 01, and so on. Since the lengths of the codes are not necessarily the same, the Huffman code is a perfect example of a variable-length code.

The average code length of the Huffman code:

  L = 1(0.4) + 2(0.2) + 3(0.2) + 4(0.1) + 4(0.1) = 2.2 bits/symbol

The entropy of the above source:

  H(S) = (0.4)log(1/0.4) + 2 * (0.2)log(1/0.2) + 2 * (0.1)log(1/0.1) = 2.12 bits/symbol

The efficiency of the code:

  E = 2.12 / 2.2 = 0.96

Figure 5. Reduction process: the source probabilities (0.4, 0.2, 0.2, 0.1, 0.1) are reduced to (0.4, 0.2, 0.2, 0.2), then to (0.4, 0.4, 0.2), then to (0.6, 0.4).

Figure 6. Splitting process: the codes 0 and 1 are assigned to the final reduced source and split backward, giving C(s1) = 1, C(s2) = 01, C(s3) = 000, C(s4) = 0010, C(s5) = 0011.
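For readers who want to reproduce Figures 5 and 6, the following Python sketch builds an optimal binary code for the same probabilities using a priority queue, which is equivalent to the tabular reduction/splitting layout; it is our own illustration, not code from the thesis, and ties may be broken differently while giving the same average length.

```python
# A sketch of binary (r = 2) Huffman code construction for the example above.
import heapq

def huffman_lengths(probs):
    # Each heap entry is (group probability, tie-breaker, symbol indices).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    uid = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)    # combine the two least probable groups
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:                  # every symbol in the merged group
            lengths[s] += 1                # gains one more code bit
        heapq.heappush(heap, (p1 + p2, uid, s1 + s2))
        uid += 1
    return lengths

probs = [0.4, 0.2, 0.2, 0.1, 0.1]          # s1 .. s5
lengths = huffman_lengths(probs)
print(lengths)   # [2, 2, 2, 3, 3] here; Figure 6 gives [1, 2, 3, 4, 4] - both are optimal
print(sum(l * p for l, p in zip(lengths, probs)))   # about 2.2 bits/symbol
```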
Huffman Code for Markov Source

The best way to illustrate the behavior of an mth-order Markov source, defined in chapter 1, is through the use of a state diagram. In a state diagram we represent each of the q^m possible states of the source by a single circle, and the possible transitions from state to state by arrows, as in figure 7. As an example of a simple one-memory source, suppose we have an alphabet of three symbols a, b, and c (S = { a, b, c }). Let the probabilities be:

  p(a|a) = 1/3   p(b|a) = 1/3   p(c|a) = 1/3
  p(a|b) = 1/4   p(b|b) = 1/2   p(c|b) = 1/4
  p(a|c) = 1/4   p(b|c) = 1/4   p(c|c) = 1/2

By solving the following equations formed by the above probabilities,

  pa * p(a|a) + pb * p(a|b) + pc * p(a|c) = pa
  pa * p(b|a) + pb * p(b|b) + pc * p(b|c) = pb
  pa * p(c|a) + pb * p(c|b) + pc * p(c|c) = pc
  pa + pb + pc = 1

we get pa = 3/11, pb = 4/11, pc = 4/11.

Figure 7. State diagram of a first-order Markov source.

There are, of course, three states (3^1) in this case, labeled a, b, and c and indicated by the circles. Each transition is a directed line from one state to another state, whose probability is indicated by the number alongside the line.

For each state in the Markov system we can use the appropriate Huffman code obtained from the corresponding transition probabilities for leaving that state. The generation of the codes at each state is exactly the same as for the zero-memory source. By the same token, we can break down any order of Markov source into q^m zero-memory states. For state a we get the binary encoding

  a = 1    pa = 1/3
  b = 00   pb = 1/3
  c = 01   pc = 1/3     La = 1.67

For state b,

  a = 10   pa = 1/4
  b = 0    pb = 1/2
  c = 11   pc = 1/4     Lb = 1.5

For state c,

  a = 10   pa = 1/4
  b = 11   pb = 1/4
  c = 0    pc = 1/2     Lc = 1.5

The average code length of the above Markov source is

  LM = pa * La + pb * Lb + pc * Lc = 1.55 bits/symbol

Assuming the starting state is a, the encoding process can be shown by another state diagram, as in figure 8.

Figure 8. Encoding state diagram.

To encode, for instance, the source s = b a a b c c c c c c, the result will be 00101001100000. To decode, we use another state diagram, as shown in figure 9.

Figure 9. Decoding state diagram.

Now let us calculate the efficiency of the Huffman code for the above Markov source. The entropy H(S) of an mth-order Markov source [1] is

  H(S) = SUM p(Si, Sj1, Sj2, ..., Sjm) * log ( 1 / p(Si | Sj1, Sj2, ..., Sjm) )     (2.4)

The relevant probabilities are in table 3. The entropy is calculated using (2.4):

  H(S) = SUM p(Sj, Si) * log ( 1 / p(Si|Sj) ) = 1.52 bits/symbol

The efficiency is calculated using (2.1):

  E = 1.52 / 1.55 = 0.98

Table 3. Probabilities of the Markov source in figure 7.

  Sj   Si   p(Si|Sj)   p(Sj)   p(Si,Sj)
  a    a    1/3        3/11    1/11
  a    b    1/3        3/11    1/11
  a    c    1/3        3/11    1/11
  b    a    1/4        4/11    1/11
  b    b    1/2        4/11    2/11
  b    c    1/4        4/11    1/11
  c    a    1/4        4/11    1/11
  c    b    1/4        4/11    1/11
  c    c    1/2        4/11    2/11
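To check the state probabilities and the 1.55 bits/symbol figure numerically, the sketch below solves the stationary equations for the source of Figure 7 and averages the per-state Huffman lengths; numpy and the variable names are our own additions, not part of the thesis.

```python
# Stationary probabilities and average per-state Huffman length for Figure 7.
import numpy as np

# transition matrix T[j][i] = p(Si | Sj), states ordered a, b, c
T = np.array([[1/3, 1/3, 1/3],
              [1/4, 1/2, 1/4],
              [1/4, 1/4, 1/2]])

# stationary distribution: p T = p, with the components of p summing to one
A = np.vstack([T.T - np.eye(3), np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
p, *_ = np.linalg.lstsq(A, b, rcond=None)
print(p)                                  # [3/11, 4/11, 4/11]

state_lengths = [5/3, 3/2, 3/2]           # La, Lb, Lc from the per-state codes
print(float(np.dot(p, state_lengths)))    # about 1.55 bits/symbol
```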
LZW Compression System

The LZW compression algorithm [4] is a nonprobabilistic approach to data compression. The coding process is a string matching process which captures the high usage patterns (strings) suggested in [9] and [10]. Contrary to the previous two Huffman codes, LZW is a typical fixed-length code in which each code represents a different number of source symbols. Welch's modification [4] of Lempel and Ziv's idea in [9] and [10] makes it a lot easier to implement. We implemented the LZW data compression system according to the algorithm given in [4]. The results will be discussed in chapter 4.

The most important feature in the LZW algorithm is the string table, which holds all the strings we encounter in the source file. The only time that we output a code is when a particular symbol/string has been seen twice. The following is the main idea of the encoding process:

1. Initialize the string table to contain single character strings.
2. W <- the first character in the source file.
3. K <- the next character in the source file.
4. If the concatenation WK is in the table, W <- WK,
   else output the code for W, put WK in the table, and W <- K.
5. If not end of file, goto 3.

The basic idea of the decoding process can be stated simply as follows:

1. Initialize the string table to contain single character strings.
2. W <- decode the first code in the compressed file.
3. CODE <- the next code in the compressed file.
4. If CODE exists in the table, K <- the decoding of CODE,
   else decode CODE as W followed by the first character of W.
5. W <- K.
6. If there are more codes, goto step 3.

Table 4 is the string table for the example in figure 10. The same table is reproduced by the decoding process in figure 11.

Table 4. The LZW string table.

  Code   String
  1      a
  2      b
  3      c
  4      d
  5      ab
  6      ba
  7      aa
  8      aac
  9      cd
  10     da
  11     aba
  12     abab

Figure 10. A compression example for the LZW algorithm: the source a b a a a c d a b a b a b is encoded as the codes 1 2 1 7 3 4 5 11 2, while the strings ab, ba, aa, aac, cd, da, aba, abab are added to the table.

Figure 11. A decompression example of figure 10: the codes 1 2 1 7 3 4 5 11 2 are decoded as a, b, a, aa, c, d, ab, aba, b, and the decoder rebuilds the same string table.

The efficiency of the LZW code is a little harder to calculate because the probabilities of symbols/strings are involved, though they are not used in a direct sense. We know, however, that the lower bound of the code is the entropy, by Shannon's first theorem (1.4). Let us take the 12-bit code suggested by Welch in [4]: in the beginning of the encoding process, single symbols are represented by 12 bits while their information content is

  I(single symbol) = - log p(single symbol) = - log (1/128) = 7 bits

It is because the table is initialized with all the ASCII characters and there are 128 of them; the algorithm assumes them to be equally probable, so their probability is 1/128. Since the LZW code and all the others we investigated are concatenation codes with integer code lengths, it leads us to believe that in order to construct a code with efficiency equal to one we have to use nonblock codes.
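The sketch below is a compact, runnable rendering of the two procedures above over a fixed initial alphabet; it reproduces the codes of Figure 10 and the string table of Table 4. Code numbering from 1 and the function names are our own choices, not taken from the thesis.

```python
# A compact LZW encoder/decoder for a known initial alphabet.

def lzw_encode(s, alphabet):
    table = {ch: i + 1 for i, ch in enumerate(alphabet)}
    next_code = len(alphabet) + 1
    w, out = s[0], []
    for k in s[1:]:
        if w + k in table:                 # keep growing the current string W
            w = w + k
        else:
            out.append(table[w])           # output the code for W
            table[w + k] = next_code       # put WK in the table
            next_code += 1
            w = k
    out.append(table[w])
    return out

def lzw_decode(codes, alphabet):
    table = {i + 1: ch for i, ch in enumerate(alphabet)}
    next_code = len(alphabet) + 1
    w = table[codes[0]]
    out = [w]
    for code in codes[1:]:
        entry = table[code] if code in table else w + w[0]   # code not yet in table
        out.append(entry)
        table[next_code] = w + entry[0]
        next_code += 1
        w = entry
    return "".join(out)

codes = lzw_encode("abaaacdababab", "abcd")
print(codes)                           # [1, 2, 1, 7, 3, 4, 5, 11, 2]
print(lzw_decode(codes, "abcd"))       # abaaacdababab
```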
Universal Coder

Both the encoder and the decoder need the probabilities given by the modeler. Thus, a universal encoder needs only the probabilities to produce uniquely decodable codes. The universal decoder also needs the same probabilities to reproduce the source file. The Elias code [3] is the first code that tries to reflect the exact probabilities of the symbols/strings. Rissanen and Langdon in [5], [6], [10], and [11] generalize the concept of the Elias code to embrace other types of nonblock codes called arithmetic codes.

Idea Origin - Elias Code

The Elias code assigns each source sequence to a subdivision of the interval [0, 1]. To illustrate the coding process, consider a zero-memory source with two symbols a and b, occurring with probabilities 0.7 and 0.3 respectively. We may represent an arbitrary infinite sequence of possible source files as in figure 12. According to the figure, sequences starting with "a" are assigned to the interval [0, 0.7]; sequences starting with "aa" are assigned to the interval [0, 0.49]; and so on. To encode the sequence from the source we simply provide a binary expansion for each point in [0, 1], as in figure 13.

Figure 12. Subdivision of the interval [0, 1] for source sequences: for |s| = 1, a = [0, 0.7] and b = [0.7, 1.0]; for |s| = 2, aa = [0, 0.49], ab = [0.49, 0.7], ba = [0.7, 0.91], and bb = [0.91, 1.0].

Figure 13. Subdivision of the interval [0, 1] with the probabilities in binary digits: 00 = [0, 0.25], 01 = [0.25, 0.5], 10 = [0.5, 0.75], 11 = [0.75, 1.0].

Thus, a source that starts with "ba" will have a compressed file that starts with "10". Decoding is clearly a process of successive subtraction. If the binary sequence starts with "1011", the point represented must lie between 0.6875 and 0.75; hence the first three symbols from the source must be "bab".
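As an illustration of this interval idea, and only as a toy, here is a short sketch that maps a sequence to its subinterval and reads symbols back from a point inside it. It uses double-precision floats, so it is limited to short strings; the register-based algorithm in the next section is designed precisely to avoid that limitation. The function names and the particular test points are ours.

```python
# A floating-point sketch of the Elias subdivision of [0, 1).

def interval(seq, probs, cumprobs):
    low, width = 0.0, 1.0
    for ch in seq:
        low += width * cumprobs[ch]     # shift by the cumulative probability before ch
        width *= probs[ch]              # shrink by the probability of ch
    return low, low + width

def decode(point, n, probs, cumprobs, symbols):
    out, low, width = [], 0.0, 1.0
    for _ in range(n):
        for ch in symbols:              # find the subinterval that holds the point
            lo = low + width * cumprobs[ch]
            if lo <= point < lo + width * probs[ch]:
                out.append(ch)
                low, width = lo, width * probs[ch]
                break
    return "".join(out)

probs    = {"a": 0.7, "b": 0.3}
cumprobs = {"a": 0.0, "b": 0.7}         # cumulative probability before each symbol
print(interval("ba", probs, cumprobs))              # (0.7, 0.91), as in Figure 12
print(decode(0.8, 2, probs, cumprobs, "ab"))        # "ba"
```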
The Algorithm

With the idea from Elias, we developed an arithmetic code which is similar to the one described in [10]. The difference is the way that we handle the precision problem and the carry-over problem. The algorithm that we developed accepts any type of input source as long as we know the size of the source alphabet S and the symbols are arranged in lexicographic or any fixed order. The basic idea of our code for any symbol Sj is the cumulative probability before Sj, P(S1) + P(S2) + ... + P(Sj-1). The code can be defined recursively by the equations

  C(s S1) = C(s),
  C(s Si) = C(s) + P(Si-1)     (2.5)

where P(Si-1) denotes this cumulative probability.

Both the encoder and decoder require two binary registers, C and A. The former actually should be a buffer, and the latter is of width n (n will be discussed in the section on precision). The codes are held temporarily in the C register, while the cumulative probability is held in the addend A. C represents the exact probability of the source file, with an imaginary binary point in front of the compressed file. The following is the encoding algorithm:

1. Initialization: C <- all zeros (|C| = 0 initially), Sht <- 0, Previous <- 1, Leftover <- 0.
2. Si <- read from the source file.
3. A <- P(Si-1), Sht <- ceiling(-log Previous).
4. Leftover <- Leftover - noninteger part of (-log Previous).
5. If Leftover < 0 then Leftover <- 1 - Leftover, Sht <- Sht + 1.
6. Shift A by Sht bits from the end of C.
7. C <- C + A, Previous <- p(Si).
8. Goto 2 until the end of the source file.

The encoding process is basically shifting and adding. By the same token, the decoding process should be subtracting and shifting:

1. Initialization: Previous <- 1, Leftover <- 0, C <- code from the compressed file.
2. i <- 2.
3. If C - Previous * P(Si) < 0 then
      decode as symbol Si-1, Previous <- p(Si-1),
      Sht <- ceiling(-log Previous),
      Leftover <- Leftover - noninteger part of (-log Previous),
      if Leftover < 0 then Leftover <- 1 - Leftover and Sht <- Sht + 1,
      shift C by Sht bits to the left, goto (2);
   else i <- i + 1, goto (3).

The C register, during the decoding process, has to have at least as many bits as the A register. To demonstrate the coding processes, consider S = { a, b, c, d } and s = abaaacdababab, the example that we used in figure 10. Assuming it is a zero-memory source, the statistics of s are:

  p(a) = 7/13    P(a) = 0
  p(b) = 4/13    P(b) = 7/13
  p(c) = 1/13    P(c) = 11/13
  p(d) = 1/13    P(d) = 12/13

Figure 14 is the encoding process of a simpler example where S = { a, b }, X = { 0, 1 }, s = a b b b b b b b a a b, and p(a) = 1/8, p(b) = 7/8. Figure 15 is the decoding process of the same example.

Figure 14. The encoding process of the universal encoder for S = { a, b }, s = a b b b b b b b a a b, p(a) = 1/8, p(b) = 7/8, P(a) = 000 (binary), P(b) = 001 (binary); the step-by-step register trace ends with the compressed file 00001111000001.

Figure 15. The decoding process of the universal decoder, which recovers a b b b b b b b a a b from 00001111000001 by repeated subtraction and shifting.

The Precision Problem

The algorithm in the last section omitted one very important thing: the number of bits that is required to represent A to ensure decodability.

Theorem: The number of bits that A needs is n, where n is the number of bits used in counting the occurrence of symbols/strings in the data compression system.

Proof: The smallest number between two addends is 1/n. To represent 1/n, we need exactly n bits in binary with an imaginary binary point in front.

Remarks: Even though n bits are used, the bits per code are by no means n, because codes overlap each other unless the value of the addend is always 1/n. In order for that to happen, the previous symbol would always have to have probability 1/n, which is impossible unless |S| = n = |s|. Our idea is one of the ways of using floating point arithmetic to overcome the requirement for unlimited precision of the Elias code.

The Carry-over Problem

During the encoding process, the algorithms given in [6], [10], [11], and the one above may fail.
When encoding a symbol with high probability, an addition is performed with a large number of 1-bits. If a long sequence of 1-bits has been shifted out of the C register, this will precipitate a "carry-over". If the sequence of 1-bits extends through the C register and into the already written code bits (in the compressed file), it becomes impossible to correctly perform the addition without rereading the compressed file.

Arbitrarily long sequences of 1-bits, and therefore arbitrarily long carry-overs, may be generated. In binary addition, the first 0-bit before a sequence of 1-bits will stop a carry-over. For this reason, inserting 0-bits into the code stream at strategic locations will prevent carry-overs from causing problems.

"Zero-stuffing" is used by Rissanen and Langdon [12] to check the output stream for sequences of 1-bits. When the length of one of these reaches a predefined value x, a 0 is inserted (stuffed) into the code stream just after the offending sequence. The stuffing is undone by the decoder by checking for x consecutive 1-bits and extracting the following bit of the sequence. In order that the decoder can correctly undo the zero-stuffing, 0-bits have to be inserted which are not necessarily needed for the carry-over problem. The way that we determine the value x can also cause problems. If x is too small, we could waste a lot of bits and consequently get poor compression. On the other hand, the size of x is restricted by the size of the C register. The worst thing about zero-stuffing is that we have to constantly check the C register for x consecutive 1-bits.

We use another solution, by simply inserting 0-bits at regular intervals in the code. Instead of using only the register for C, we use an extra 4K buffer with the first bit (we actually use a byte) being the "carry-catcher". We always wait until the buffer is filled before we output it to the compressed file. Wasting one bit (byte) in 4096 saves a lot of time checking for x consecutive 1-bits. Moreover, long carry-overs are somewhat rare. Any that do occur tend to prevent others from occurring, because a long sequence of 1-bits turns into a long sequence of 0-bits when a carry ripples through.

This method works well. A carry-over only happens after a large addend with a small shift, so it will not normally add any significant time to the coding process. Implementation is very simple. The insertions can be done with the normal file-buffering: insert a 0-bit at the beginning of each buffer. The complexity of the process is further reduced by inserting an entire 0-byte at the beginning of each buffer. Byte-level addressing is often easier than bit-level addressing. The decoder then extracts and examines the entire byte rather than a bit.
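To make the buffering concrete, here is a small sketch, entirely our own and with invented names (the thesis gives no code for this), of an output buffer that reserves its first byte as the carry-catcher: a carry rippling out of the C register is absorbed by incrementing bytes already sitting in the buffer, so the compressed file never has to be reread.

```python
# A sketch of a 4096-byte output block whose first byte is reserved
# as the "carry-catcher" described above.
BUF_SIZE = 4096

class CarryCatchingWriter:
    def __init__(self, out):
        self.out = out
        self.buf = bytearray([0])          # reserved carry-catcher byte

    def write_byte(self, b):
        self.buf.append(b & 0xFF)
        if len(self.buf) == BUF_SIZE:      # flush a full block
            self.out.write(bytes(self.buf))
            self.buf = bytearray([0])

    def add_carry(self):
        i = len(self.buf) - 1
        while True:                        # propagate the carry toward the front
            self.buf[i] = (self.buf[i] + 1) & 0xFF
            if self.buf[i] != 0 or i == 0: # a nonzero byte, or the carry-catcher,
                break                      # stops the ripple
            i -= 1
```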
redundancies The or information T h e job of t h e m o d e l e r is to regular can features be viewed in t h e d a t a to be as the probability distribution on the alphabet of the information source. In order to use the universal coder, the coding process has to be d i v i d e d i n t o a m o r e g e n e r a l f o r m as in figure 4. The modeler r e s p o n s i b l e for p r o v i d i n g a p p r o p r i a t e p r o b a b i l i t i e s e n c o d e r a n d t h e decoder. is for b o t h t he Even t h o u g h s o m e of t he d a t a c o m p r e s s i o n systems stay aw a y from the probabilities, like L Z W [4] (an improvement of L e m p e l a n d ziv's id e a s in [9] a n d [10]) or t h e s t r i n g m a t c h i n g techniques suggested by Rodeh w e can a l w a y s [13], r e d u n d a n c y r e d u c t i o n t e c h n i q u e s i n t o our modeler. convert their The conversion actually improves their codes compression-wise because w e enforce our c o d e .to t h e lower b o u n d of S h a n n o n first t h e o r e m (1.4) w i t h t he appropriate probabilities. Is There A Universal Modeler? Since the universal encoder and '■ decoder need only the p r o b a b i l i t i e s o f t h e s y m b o l s / s t r i n g s a n d the p r o b a b i l i t i e s o f t h e previous symbol/string, the most basic modeler should provide these 34 probabilities. T h e e a s i e s t w a y to do that is to k e e p the c o u n t s of each symbol/string: pfS^) = c o u n t (Sj^) / total count. (3.1) However, different information sources have totally different kind of redundancies. redundancies Welsh [4] suggested in c o m m e r c i a l data, that there are na m e l y character four typ e s of distri b u t i o n , c h a r a c t e r r e p e t i t i o n , h i g h - u s a g e patterns and positional redundancy. To capture a particular kind of redundancy, information source to decide techniques have to be used. what kind For instance, w e need to analyze the of redundancy capturing the probabilities w e need in a first order Markov source are P(SX |Sy), which can be given by p ( S x |sy ) = C o u n t (Sx |Sy ) / Count(y ) w h e r e Count(y) denotes (3.2) the n u m b e r of t i m e s c l a s s y o c c u r s in the source s, and Count(Sx )Sy ) denotes the number of times the next symbol at these occurrences is Sx . Modelers can be divided into two. categories, and adaptive. namely nonadaptive The following sections show h o w w e can convert existing models to provide appropriate probabilities for the universal coder so as to improve compression. ■ Nonadaptive Models Nonadaptive models basically scan the information source to get the appropriate probabilities. The probabilities will be put.in the m o d e l f i l e so t h a t b o t h t h e e n c o d e r a n d d e c o d e r c a n use. Therefore, t h e d e c o d i n g p r o c e s s c a n n o t b e p e r f o r m e d w i t h o n l y t he c o m p r e s s e d 35 file, and the compression ratios defined in (2.2) should be redefined as PC = ( IOF I - ICF + M F I) / IOF I. (3.3) T h e m a j o r d i s a d v a n t a g e of n o n a d a p t i v e m o d e l s is t h e fact that they require two passes of the c o m p r e s s e d f i l e c a n b e g e n erated. information source before the Nevertheless, the probabilities p r o v i d e d b y a n o n a d a p t i v e m o d e l is u s u a l l y b e t t e r t h a n an a d a p t i v e model, therefore better compression can be expected if w e can keep the size of the model file down. 
Zero-memory Source

For a zero-memory source, we have been using the Huffman code since the fifties. It has been shown by Shannon's first theorem that the worst such a code will do is one extra bit more than the entropy. Though the Huffman code does not separate the modeling process from the coding, we can view the first step of the following algorithm for the Huffman code as modeling and the rest as the encoding process:

1. Scan the source file, counting the occurrence of each source symbol.
2. Arrange the symbols in descending order with respect to their probabilities.
3. Perform the reduction and splitting processes as in figures 5 and 6 to generate the code.
4. Build a finite state machine as in figure 8.
5. Encode the source file using the machine built in step 4.

For the universal coder, we need the probabilities of each symbol just as in the above algorithm. The modeler is responsible for providing (3.1) for all i by simply counting the occurrence of each symbol. In other words, the modeler for any zero-memory source is step 1 of the algorithm above. For example, with S = { a, b, c, d, e } and s = a b a c d b a e c a, the model file can be represented by figure 16.

Figure 16. An example of a model file: the counts 4, 2, 2, 1, 1 for a, b, c, d, e. (This model file actually represents the statistics of the example in figure 6.)

Markov Source

To design a data compression system for a Markov source, the modeler has to provide the probabilities given in (3.2). From (3.2), we can calculate the information content of the source by

  I(s) = SUM(z=1 to k) Count(z) log Count(z) - SUM(z=1 to k) SUM(x in S) Count(x|z) log Count(x|z)     (3.4)

Proof: Since p(x|z) = Count(x|z) / Count(z), and

  p(s) = PRODUCT(z=1 to k) PRODUCT(x in S) p(x|z) ^ Count(x|z)     (3.5)

we have

  1/p(s) = PRODUCT(z=1 to k) PRODUCT(x in S) ( Count(z) / Count(x|z) ) ^ Count(x|z)

and therefore

  I(s) = log (1/p(s))
       = SUM(z=1 to k) SUM(x in S) Count(x|z) log Count(z) - SUM(z=1 to k) SUM(x in S) Count(x|z) log Count(x|z)
       = SUM(z=1 to k) Count(z) log Count(z) - SUM(z=1 to k) SUM(x in S) Count(x|z) log Count(x|z)

because SUM(x in S) Count(x|z) = Count(z).

Let us illustrate nonadaptive models with the example we used before, where the source s = abbbacccbaaabbbcccabcaac and the source alphabet S = { a, b, c }. By assuming the source to be a first-order Markov source, the modeler counts the symbols conditioned on the previous symbols. With the initial state a, we should get the conditional counts

  Count(a|a) = 3   Count(b|a) = 3   Count(c|a) = 3
  Count(a|b) = 2   Count(b|b) = 4   Count(c|b) = 2
  Count(a|c) = 2   Count(b|c) = 2   Count(c|c) = 4

to produce the model file in figure 17. By (3.4), the ideal code length with this model is

  I(s) = 9 log 9 + 2 * 8 log 8 - (3 * 3 log 3 + 4 * 2 log 2 + 2 * 4 log 4) = 38.26 bits

Figure 17. An example of a model file for a first-order Markov source.
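The following short sketch evaluates (3.4) on the conditional-count table of Figure 17 (the counts are taken exactly as given in the text) and reproduces the 38.26-bit figure; the function name is ours.

```python
# A check of equation (3.4) on the counts of Figure 17.
import math

counts = {"a": {"a": 3, "b": 3, "c": 3},
          "b": {"a": 2, "b": 4, "c": 2},
          "c": {"a": 2, "b": 2, "c": 4}}

def ideal_length(counts):
    total = 0.0
    for z, row in counts.items():
        cz = sum(row.values())
        total += cz * math.log2(cz)                             # Count(z) log Count(z)
        total -= sum(c * math.log2(c) for c in row.values())    # Count(x|z) log Count(x|z)
    return total

print(ideal_length(counts))    # about 38.26 bits
```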
More importantly, the model file c o n t a i n i n g t h e p r o b a b i l i t i e s ha s t o b e t r a n s m i t t e d t o t h e d e c o d i n g u n i t s as a p r e a m b l e in th e c o d e string. S i n c e t he m o d e l file is not n e e d e d in t h e d a t a c o m p r e s s i o n s y s t e m u s i n g an a d a p t i v e m o d e l e r as shown in figure 4, compression could sometimes be better especially for M a r k o v s o u r c e w h e r e t h e m o d e l f i l e is large. Markov source, the model file contains In a f i r s t - o r d e r [s|^ probabilities. When the n u m b e r o f o r d e r s g o e s up, t he n u m b e r o f p r o b a b i l i t i e s in t he m o d e l file goes up exponentially, Number of probabilities needed = |S(orc^e r . (3.6) •D A n A S C I I f i l e w h i c h is a t h i r d - o r d e r M a r k o v s o u r c e r e q u i r e s 2 5 6 J = 1 6 7 7 7 2 1 6 p r o b a b i l i t y values, source, to represent a higher order of Ma r k o v the model file could even be bigger than the source file. To a v o i d p r e s c a n n i n g t he s o u r c e a n d the p r e a m b l e , consider adaptive models in our data compression system. of fact, w e n e e d to As a matter adaptive models also give the modeler a chance to adjust the probabilities as the source is being encoded. Zero-memory Source T h e H u f f m a n c o d e w e l o o k e d at e a r l i e r is a c l a s s i c e x a m p l e of a code using a nonadaptive model. The conversion of such a nonadaptive m o d e l t o an a d a p t i v e o n e is q u i t e s t r a i g h t forward. E q u a t i o n (3.1) 39 still holds- while the probabilities are calculated after each symbol is scan n e d . The modeler of an adaptive model does the followings: 1. . initialize the source symbols to be equally probable. 2. r e a d in a s y m b o l f r o m t h e s o u r c e file, a n d s e n d the pr e s e n t probabilities to the encoder. 3. update the probabilities with the current source symbol. 4. If not end of file, goto step 2. E q u a t i o n (3.1) h a s t o b e r e c a l c u l a t e af t e r e a c h s y m b o l is s c a n n e d b e c a u s e b o t h t h e t o t a l c o u n t a nd the co u n t of t he p r e s e n t s y m b o l change. Figure 18 is what an adaptive modeler should provide to the encoder and decoder. Markov. Source T h e a d a p t i v e m o d e l e r for a M a r k o v s o u r c e is n ot m u c h d i f f e r e n t than a zero-memory source, as each state of the finite state machine of the Markov source is the same as the zero-memory source, f i g u r e 17 a n d f i g u r e 18, combining f i g u r e 19 is an e x a m p l e of a f i r s t - o r d e r .Markov source. LZW C o d e . Though the L Z W compression a l g o r i t h m .[4] does not s e e m to fit our u n i v e r s a l c o d e r b e c a u s e it is a n o n p r o b a b i l i s t i c a p p r o a c h , w e c an use their idea in building such a modeler that provides probabilities of symbol/string for the encoder and the decoder. W e can consider the s t r i n g t a b l e g e n e r a t e d b y t he L Z W a l g o r i t h m as t he m o d e l e r w h i c h 40 S = { a, b, c, d, e }, s = a b a c d b a e c a. After scanning b, Figure 18. An example of an adaptive modeler for zero-memory source. 41 S = { a, b, c }, s = abbbacccbaaabbbcccabcaac. Before scanning. I After the source s, 4 Figure 19. An example of an adaptive modeler for a first-order Markov source. 42 counts the symbols and strings. 
We can build the modeler with a tree k e e p i n g t h e c o u n t o f a l l s y m b o l s a n d s t r i n g s that t h e m o d e l e r h as scanned. built. 1. W e e n c o d e o n e s y m b o l at at t i m e b y p a r s i n g t he tree w e W e build the L Z W modeler as follows: I n i t i a l i z e t h e co u n t of l a m b d a to q ( |S |), a n d th e count of all the symbols to one. Previous — 2. Initialize the pointer W to Lambda. I. Read one symbol f o r m t he s o u r c e file, s e n d P(S^) a nd Previous to the encoder. 3. Previous - P ( S i ) / W, Update W. 4. Move W to point to S i . 5. Read one symbol Sj from the source file, if Sj is a son of W then send P(Sj) and previous to the encoder. Previous — Update P(Sj) / W. W. move W to Sj. else send P(Sj) and previous to the encoder. Previous — P(Sj) / W. Update both W and Pj. Move W to lambda. Figure 20 is the example in table 4 by using such a modeler. beauty of such a modeler The is it converts a nonprobabilistic algorithm 43 S = { a , b , c, d } , s = a b a a a c d a b a b a b . Before scanning, After scanning a. After scanning b, After scanning the source s, Figure 20. An example of converting the LZW table into a modeler. 44 into a probabilistic capturing redundancies. one while preserving th e same' p r o c e s s of The major shortcoming of using such a modeler is the speed of both the encoding and decoding processes. Each time a symbol process, of is encoded, calculating we haye to go a l l .the p r o b a b i l i t i e s algorithm uses an adaptive modeler, through at one th e whole level. Since the LZW the decoding unit simply uses the s a m e modeler and the decoder described in chapter 2. 45 Chapter 4 COMPRESSION RESULTS AND ANALYSIS Test Data and Testing Method T h e test d a t a for t h e u n i v e r s a l c o d e r w i t h d i f f e r e n t m o d e l e r consisted of different types of r e d u n d a n c i e s — scale images, and floating point arrays. E n g l i s h text, g r a y ­ The floating point arrays we u s e d a r e f i l e s w h i c h d o not c o n t a i n a n y "natural" redundancies. use 0, I and 2 Taylor Series approximation (predictor), compression is e x p e c t e d n o m a t t e r predictor is used, w h a t c o d e w e use. We otherwise no Whenever a we use a particular predictor throughout one test of a particular file to compare the results. As w e have discussed in the previous chapter, building a model for a high order Markov source is too expensive memory-wise. compression of our universal co d e r a nd different W e test the modelers by c o m p r e s s i n g t h e s a m e f i l e w i t h H u f f m a n c o d e and L Z W code, a nd then, with the universal coder with z e r o -memory modeler and with LZ W modeler. Table 5 shows the compression results of different codes with and w i t h o u t u s i n g t h e u n i v e r s a l c o d i n g scheme. improvement in compression; section model analysis. W e c an s e e the cl e a r the speed trade-off will be discussed in 46 Table 5. Compression results for a variety of data types. ' with our modeler and universal coder Huffman LZW 25% Floating Arrays Gray-scale images Data Type English text Zero-memory LZW 45% 33% 55% 5% 10% 28% 30 % 5% 35% 30% 50% I • code Analysis T h e c o d i n g u n i t is b a s e d on t h e idea of t he E l i a s c o d e [3]. The ■ c o d e p r o d u c e d b y t h e u n i v e r s a l c o d e r t h a t w e b u i l t c a n b e c a l l e d as Arithmetic Code which in its original practical formulation is due to J. Rissanen [12]. 
Chapter 4

COMPRESSION RESULTS AND ANALYSIS

Test Data and Testing Method

The test data for the universal coder with the different modelers consisted of different types of redundancies — English text, gray-scale images, and floating point arrays. The floating point arrays we used are files which do not contain any "natural" redundancies; we use 0, 1, and 2 Taylor series approximations (predictors), since otherwise no compression is expected no matter what code we use. Whenever a predictor is used, we use that particular predictor throughout one test of a particular file so that the results can be compared.

As we have discussed in the previous chapter, building a model for a high-order Markov source is too expensive memory-wise. We therefore test the compression of our universal coder and the different modelers by compressing the same file with the Huffman code and the LZW code, and then with the universal coder using the zero-memory modeler and the LZW modeler.

Table 5 shows the compression results of the different codes with and without the universal coding scheme. We can see the clear improvement in compression; the speed trade-off will be discussed in the Model Analysis section.

Table 5. Compression results for a variety of data types.

                                          with our modeler and universal coder
Data Type             Huffman     LZW     Zero-memory     LZW
English text            25%       45%         33%         55%
Floating Arrays          5%       10%         28%         30%
Gray-scale images        5%       35%         30%         50%

Code Analysis

The coding unit is based on the idea of the Elias code [3]. The code produced by the universal coder that we built can be called an Arithmetic Code, which in its original practical formulation is due to J. Rissanen [12].

We also constructed the Fast Arithmetic code [10], which is an implementation of Arithmetic coding for sources with binary alphabets and which uses no multiplication. The price paid for the resulting faster coding is a loss of at most 3% in compression. In practice, the Fast Arithmetic code may be a better code to use because it is about twice as fast as our universal code. However, the Fast Arithmetic code can only handle binary source alphabets, so in most cases the universal code has the clear advantage.

There are also several advantages to using the universal code as opposed to other available codes. First, the universal code can be used with any model of the source. Other codes have historically been highly model dependent, and in fact the coding and the model are hard to separate. The universal code enables the designer of a data compression system to emphasize the modeling of the data. Modeling is where the difficulty in data compression really lies, because of the variety of redundancies.

The adaptability of the universal code allows the modular design of a data compression system. If any changes are made to the system, for example changing from floating point numbers to gray-scale images or from a nonadaptive approach to an adaptive one, only the modeler needs to be changed and not the coder. This flexibility allows for the design of one data compression system which is able to compress various types of data under various circumstances without rebuilding the system for each type of data. It also allows the models to use any type of source alphabet.

The most significant gain in the universal code is that it is a noninteger-length code and so can compress as low as the entropy. It is then easier for us to compare one model to another.

Model Analysis

Since our universal coder guarantees codes with the shortest average code length, we can concentrate on finding appropriate models for particular sources. Remember the definition of the entropy — the best possible compression with the given probabilities. The word "given" is the key to a modeler. The several models we discussed in the previous chapter provide totally different kinds of probabilities. It is not possible to conclude which is the best, but rather which is better for a particular source. Intuitively, it is safe to say that the LZW model works best for most sources because LZW works like a variable-order Markov process: only the popular patterns are put in the string table.

Before we study the best and worst cases in data compression, let us consider a general information source. Let the source alphabet be S = { S1, S2, ..., Sq }, let the information source be s in S*, and let s be made up of w1, w2, ..., wi, ..., wm, where wi is in S and |s| = m.

The Best Case

The best case for any data compression system is when the file contains only the repetition of a single symbol of the source alphabet. First of all, let us look at how a nonadaptive model would do. In a zero-memory source or Huffman code, the single symbol Sx will have probability equal to one, and Sx will be represented by one bit in the Huffman code. The percentage compression of the Huffman code by (2.2) is

pcHuffman = 1 - [(m + # of bits in the model file) / (m * # of bits per source symbol)],

so

pcHuffman ~ 1 - (1 / # of bits per source symbol).

By using the universal coder and the zero-memory modeler, the encoding unit should simply send the model file and the number of occurrences of the symbol Sx to the decoder. This is exactly the way that a run-length code would handle the situation.
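To make the comparison concrete, here is a small illustrative calculation; the figures (m = 10,000 occurrences of one 8-bit symbol) are assumed for the example and are not one of the thesis's test files.

```latex
pc_{\mathrm{Huffman}} \;\approx\; 1 - \tfrac{1}{8} \;=\; 87.5\%,
\qquad
\text{universal coder: code length} \;\approx\; |\text{model file}| + \lceil \log_2 m \rceil
 \;=\; |\text{model file}| + 14 \text{ bits.}
```

For the universal coder the compressed size no longer grows with m at all, so its percentage compression approaches 100% as the file gets longer.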
Thus the compression is even better than with the Huffman code, though theoretically zero bits are needed to encode such a source, because

H(s) = 1 * log 1 + (q - 1) * 0 * log 0 = 0.

If we treat such a source as a Markov source, we come up with exactly the same result except for a bigger model file, because both the encoding and decoding processes stay in one state of the finite state machine, as in figures 8 and 9.

For adaptive models, the performance of both zero-memory sources and Markov sources is somewhat worse than that of the nonadaptive models in the best case, because we start off with equal probabilities for all symbols. It is interesting to see how the LZW modeler performs in this situation. By the LZW algorithm, the source s = w1 w2 ... wi ... wm is partitioned into t1, ..., ti, ..., tr, where ti is in S*. The encoded file of s is the concatenation of the codes of the ti, that is, C(s) = C(t1) C(t2) ... C(ti) ... C(tr). In the best case wi = wj for i, j = 1, 2, ..., m, and the partition is t1 = w1, t2 = w2w3, t3 = w4w5w6, and so on, with |t1| = 1, |t2| = 2, |t3| = 3, ... For instance, let S = { ASCII characters }, |S| = 128, and s = aaaaaaaaaaaaaaa, |s| = 15. The partition is a, aa, aaa, aaaa, aaaaa, and the code of s is C(s) = C(a) C(aa) C(aaa) C(aaaa) C(aaaaa). Thus s can be encoded by 5 codes. In general, if |s| = m and the number of codes needed to encode s is N,

(N - 1)(N - 2) / 2 <= m <= N(N - 1) / 2,  so  N ~ ceil( sqrt(2m) ).    (4.1)

If we use a 12-bit code as suggested in [4], the compressed file is about sqrt(2m) * 12 bits. By using the LZW modeler and the universal coder, we need (7 + 12) / 2 bits for each code on the average. In other words, we can save 2.5 bits per code, while the number of codes needed is still about sqrt(2m).

The Worst Case

The worst case for any data compression system is a file in which all the symbols and all possible strings have exactly the same probabilities. For nonadaptive models, p(Si) = p(Sj) for i, j = 1, 2, ..., q in a zero-memory source, and p(Sj | Yi) = p(Sj) for all j and i = 1, 2, ..., q in a Markov source. The best that the universal coder using a zero-memory modeler or a Markov source modeler can then do is

H(S) = q * (1/q) * log q = log q,    (4.2)

I(s) = m * H(S).    (4.3)

For adaptive models, equations (4.2) and (4.3) still hold, because an adaptive model starts with equally probable symbols and the counts of symbols and strings never change much in the worst case. The original LZW algorithm, however, produces a compressed file which is bigger than the source file because of the nature of its fixed-length code. This is the reason why the LZW algorithm gives negative compression (expansion) in some cases.
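As a quick check of (4.2) and (4.3) on an assumed example (not one of the thesis's test files): for a worst-case file over the q = 128 equiprobable ASCII symbols,

```latex
H(S) \;=\; q \cdot \tfrac{1}{q}\,\log_2 q \;=\; \log_2 128 \;=\; 7 \text{ bits per symbol},
\qquad
I(s) \;=\; m \cdot H(S) \;=\; 7m \text{ bits},
```

which is exactly the length of the plain 7-bit representation of the m symbols, so no modeler can buy any compression here.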
Conclusions

Different models produce different ratios of compression on different source files. However, the universal coding scheme described in this paper guarantees the shortest average code length for the given probabilities. More importantly, the universal coder can be used with any model of the source, and the conversion of most data compression systems into a modeler and a coder is quite easy and straightforward.

Nevertheless, using an adaptive model like LZW with the universal coder slows down the whole coding process by as much as 50 times compared with the original LZW code, while 10 to 20 percent extra compression is gained on the average. It is up to the user to decide whether compression is more important than the speed of the coding process. It also points to future research in speeding up the coding process for adaptive models, which at present are much slower than nonadaptive models.

Future Research

Future research in the whole area of data compression may be concentrated on modeling. Since no universal modeling technique can guarantee effective compression, it would be nice if there existed an analyzer which determines which modeler works best for any given source file. Data compression systems which deal with digitized images complicate the problem even further, because some images do not possess any "natural" redundancies. The modelers we looked at in this paper capture only sequential redundancies, while in most images the value of a pixel depends on the surrounding pixels; predictors such as those described in [14] have to be used.

Furthermore, adaptive models using the universal coder perform quite a bit slower than nonadaptive models because of the recalculation of probabilities after each symbol has been scanned. However, parallel algorithms seem to fit very well with the probability calculations: the speed is expected to increase dramatically if such algorithms are employed.

REFERENCES CITED

[1] Hamming, Richard W. Coding and Information Theory, Prentice-Hall, Englewood Cliffs, New Jersey, 1986.
[2] Rivest, R. L.; Shamir, A.; and Adleman, L. "On Digital Signatures and Public Key Cryptosystems", Communications of the ACM, Vol. 21, No. 2, February 1978, p. 267-284.
[3] Abramson, Norman. Information Theory and Coding, McGraw-Hill Book Company, New York, New York, 1963.
[4] Welch, Terry A. "A Technique for High-Performance Data Compression", IEEE Computer, June 1984, p. 8-19.
[5] Langdon, Glen G. "An Introduction to Arithmetic Coding", IBM J. Res. Develop., Vol. 28, No. 2, March 1984, p. 135-149.
[6] Rissanen, Jorma; and Langdon, Glen G. "Arithmetic Coding", IEEE Transactions on Information Theory, Vol. 27, No. 1, March 1979, p. 149-162.
[7] Campbell, Graham. "Two Bit/Pixel Full Color Encoding", ACM SIGGRAPH 1986 Proceedings, Vol. 20, No. 4, 1986, p. 127-135.
[8] Regbati, K. "An Overview of Data Compression Techniques", IEEE Computer, April 1981, p. 71-75.
[9] Lempel, Abraham; and Ziv, Jacob. "On the Complexity of Finite Sequences", IEEE Transactions on Information Theory, Vol. 22, No. 1, January 1976, p. 75-81.
[10] Langdon, Glen G. "A Simple General Binary Source Code", IEEE Transactions on Information Theory, Vol. 28, No. 5, September 1982, p. 800-803.
[11] Langdon, Glen G.; and Rissanen, Jorma. "Compression of Black-White Images with Arithmetic Coding", IEEE Transactions on Information Theory, Vol. 29, No. 6, June 1981, p. 858-867.
[12] Rissanen, Jorma; and Langdon, Glen G. "Universal Modeling and Coding", IEEE Transactions on Information Theory, Vol. 27, No. 1, January 1981, p. 12-23.
[13] Rodeh, Michael; Pratt, Vaughan R.; and Even, Shimon. "Linear Algorithm for Data Compression via String Matching", Journal of the Association for Computing Machinery, Vol. 28, No. 1, January 1981, p. 16-24.
[14] Henry, Joel E. "Model Reduction and Predictor Selection in Image Compression", Master's thesis, Department of Computer Science, Montana State University, Bozeman, Montana, August 1986.

APPENDICES

APPENDIX A

Glossary of Symbols and Entropy Expressions

General Symbols

S          source alphabet
Si         source symbol
q = |S|    number of source symbols
S^n        nth extension of the source alphabet
s          information source, an element of S*
|s|        size of an information source
C(Si)      code of the source symbol Si
C(s)       code of the information source s
X          code alphabet
Xi         code symbol
r          number of code symbols
li         number of code symbols in the code word Xi corresponding to Si
L          average length of a code word for S
Ln         average length of a code word for S^n

Entropy Expressions

L = sum over i = 1 to q of |C(Si)| P(Si)        the average code length
I(Si) = - log P(Si)                             the amount of information obtained when Si is received
H(S) = - sum over S of P(Si) log P(Si)          entropy of the zero-memory source S
I(s) = |s| * H(S)                               amount of information of an information source

APPENDIX B

Definitions of Codes

Figure 21. Subclasses of codes (block / nonblock; singular / nonsingular; uniquely decodable / not uniquely decodable; instantaneous / not instantaneous).

Definitions:

A block code is a code which maps each of the symbols of the source alphabet S into a fixed sequence of symbols of the code alphabet X.

A block code is said to be nonsingular if all the words of the code are distinct.

The nth extension of a block code which maps the symbols Si into the code words Xi is the block code which maps the sequences of source symbols (Si1 Si2 ... Sin) into the sequences of code words (Xi1 Xi2 ... Xin).

A block code is said to be uniquely decodable if, and only if, the nth extension of the code is nonsingular for every finite n.

A uniquely decodable code is said to be instantaneous if it is possible to decode each word in sequence without reference to succeeding code symbols.