Chapter 3 Source Coding

Analog source: a source whose outputs are analog. Discrete source: a source whose outputs are discrete.
(1) Discrete binary source: {0, 1}.
(2) Discrete L-ary source: {x_1, x_2, ..., x_L}.

3.1 Mathematical models for information sources

Consider a discrete L-ary source with alphabet X = {x_1, x_2, ..., x_L}. Each letter in X is assigned a probability of occurrence

p_k = P(X = x_k), \quad 1 \le k \le L, \qquad \text{where } \sum_{k=1}^{L} p_k = 1.

Two types of discrete sources:
(i) The output sequence is statistically independent from letter to letter. A source whose output sequence is statistically independent is called a discrete memoryless source (DMS).
(ii) The output sequence is statistically dependent, for example, English text.

Stationary discrete source: the joint probabilities of two sequences of length n of source output, say a_1, a_2, ..., a_n and a_{1+m}, a_{2+m}, ..., a_{n+m}, are identical for all n >= 1 and for all shifts m. That is, the joint probabilities are invariant under a shift of the time origin.

Conversion from an analog signal to a discrete signal by the sampling theorem

X(t) is assumed to be a stationary stochastic process with autocorrelation function \phi_{xx}(\tau) and power spectral density \Phi_{xx}(f). Suppose X(t) is a band-limited stochastic process, \Phi_{xx}(f) = 0 for |f| \ge W. Then, by applying the sampling theorem, X(t) can be expressed as

X(t) = \sum_{n=-\infty}^{\infty} X\!\left(\frac{n}{2W}\right) \frac{\sin 2\pi W (t - n/2W)}{2\pi W (t - n/2W)}    (3.1-1)

where X(n/2W) denotes the samples of X(t) taken at the sampling (Nyquist) rate of f_s = 2W samples/sec.

(a) The sampled output is characterized statistically by the joint pdf p(x_1, x_2, ..., x_m) for all m >= 1, where X_n = X(n/2W), 1 <= n <= m, are the random variables corresponding to the samples of X(t).
(b) In general, a quantization process is needed to convert the discrete-time signal into a digital signal. Such a process loses some of the information in the original signal, because the original signal cannot be exactly recovered from the quantized samples.

3.2 A logarithmic measure of information

Consider two random variables X in {x_i, i = 1, 2, ..., n} and Y in {y_j, j = 1, 2, ..., m}. Suppose we observe Y = y_j (the occurrence of the event {Y = y_j}) and wish to measure the amount of information this provides about the event {X = x_i}. Two extreme situations are as follows:
(1) When X and Y are statistically independent, the occurrence of Y = y_j provides no information about the occurrence of X = x_i.
(2) When X and Y are fully dependent, the occurrence of Y = y_j determines the occurrence of X = x_i.

A suitable measure satisfying both conditions is the logarithm. The information content provided by the occurrence of the event {Y = y_j} about the event {X = x_i} is defined as

I(x_i; y_j) = \log \frac{P(x_i \mid y_j)}{P(x_i)} = \log \frac{P(x_i, y_j)}{P(x_i) P(y_j)}    (3.2-1)

where P(x_i | y_j) = P(X = x_i | Y = y_j) and P(x_i) = P(X = x_i). I(x_i; y_j) is called the mutual information between x_i and y_j. When the base of the logarithm is 2, the unit of I(x_i; y_j) is the bit; when the base is e, the unit is the nat (natural unit).

(1) When X and Y are statistically independent, P(x_i | y_j) = P(x_i), hence I(x_i; y_j) = 0. The occurrence of Y = y_j tells us nothing about the occurrence of X = x_i.
(2) When X and Y are fully dependent, the occurrence of Y = y_j uniquely determines the occurrence of X = x_i, so that P(x_i | y_j) = 1 (for example, if X = f(Y), then the conditional probability density function is p(x | y) = \delta(x - f(y))), and hence

I(x_i; y_j) = \log \frac{1}{P(x_i)} = -\log P(x_i).    (3.2-2)
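As a numerical check of (3.2-1) and (3.2-2), the short Python sketch below evaluates the mutual information of a single pair of events for the two extreme cases just discussed. The probability values are made-up illustrations, not taken from the text.

```python
import math

def mutual_info_event(p_joint, p_x, p_y, base=2.0):
    """I(x; y) = log[ P(x, y) / (P(x) P(y)) ], per Eq. (3.2-1)."""
    return math.log(p_joint / (p_x * p_y), base)

# Hypothetical case 1: X and Y statistically independent.
# P(x) = 0.25, P(y) = 0.5, so P(x, y) = 0.125 and I(x; y) = 0 bits.
print(mutual_info_event(0.125, 0.25, 0.5))       # 0.0

# Hypothetical case 2: Y fully determines X, i.e. P(x | y) = 1,
# so P(x, y) = P(y) and I(x; y) = -log2 P(x), per Eq. (3.2-2).
p_x, p_y = 0.25, 0.5
print(mutual_info_event(p_y * 1.0, p_x, p_y))    # 2.0 = -log2(0.25)
```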
The self-information of the event {X = x_i} is defined as

I(x_i) = \log \frac{1}{P(x_i)} = -\log P(x_i).    (3.2-3)

Conclusion: a high-probability event conveys less information than a low-probability event. If there is only a single event {x} with P(x) = 1, then I(x) = 0.

Example 3.2-1

Even when the roles of X and Y are interchanged, the information provided is identical; that is,

I(x_i; y_j) = \log \frac{P(x_i \mid y_j)}{P(x_i)} = \log \frac{P(x_i \mid y_j) P(y_j)}{P(x_i) P(y_j)} = \log \frac{P(y_j \mid x_i)}{P(y_j)} = I(y_j; x_i).    (3.2-4)

Example 3.2-2.

The conditional self-information is defined as

I(x_i \mid y_j) = \log \frac{1}{P(x_i \mid y_j)} = -\log P(x_i \mid y_j).    (3.2-5)

The relationship among mutual information, self-information, and conditional self-information is

I(x_i; y_j) = I(x_i) - I(x_i \mid y_j).    (3.2-6)

The conditional self-information I(x_i | y_j) is interpreted as the self-information about the event {X = x_i} after having observed the event {Y = y_j}.

Note: I(x_i) >= 0 and I(x_i | y_j) >= 0, but I(x_i; y_j) may be positive, zero, or negative.

3.2.1 Average mutual information and entropy

There are three kinds of information described above, and their average quantities can be evaluated as follows.

(I) The average mutual information is obtained by weighting the mutual information I(x_i; y_j) by the probability of occurrence of the joint event and summing over all possible joint events:

I(X; Y) = \sum_{i=1}^{n} \sum_{j=1}^{m} P(x_i, y_j) I(x_i; y_j) = \sum_{i=1}^{n} \sum_{j=1}^{m} P(x_i, y_j) \log \frac{P(x_i, y_j)}{P(x_i) P(y_j)}.    (3.2-7)

Notes:
(1) I(X; Y) = 0 if X and Y are statistically independent.
(2) I(X; Y) >= 0.
Proof: Using the inequality \ln u \le u - 1,

-I(X; Y) = \sum_{i=1}^{n} \sum_{j=1}^{m} P(x_i, y_j) \log \frac{P(x_i) P(y_j)}{P(x_i, y_j)} \le (\log e) \sum_{i=1}^{n} \sum_{j=1}^{m} P(x_i, y_j) \left[ \frac{P(x_i) P(y_j)}{P(x_i, y_j)} - 1 \right] = (\log e) \left[ \sum_{i=1}^{n} \sum_{j=1}^{m} P(x_i) P(y_j) - \sum_{i=1}^{n} \sum_{j=1}^{m} P(x_i, y_j) \right] = 0,

hence I(X; Y) >= 0.

(II) The average self-information, denoted by H(X), is

H(X) = \sum_{i=1}^{n} P(x_i) I(x_i) = -\sum_{i=1}^{n} P(x_i) \log P(x_i).    (3.2-8)

Entropy: when X is the alphabet of possible output letters from a source, H(X) is called the entropy of the source and represents the average self-information per source letter. For example, if all the letters from the source are equally likely, i.e., P(x_i) = 1/n for all i, the resulting entropy is

H(X) = -\sum_{i=1}^{n} \frac{1}{n} \log \frac{1}{n} = \log n.    (3.2-9)

In general, H(X) <= \log n. In other words, the entropy of a discrete source is maximum when the output letters are equally probable.

Example 3.2-3

H(X) = H(q) = -q \log q - (1 - q) \log(1 - q).    (3.2-10)

(III) The average conditional self-information, also called the conditional entropy, is defined as

H(X \mid Y) = \sum_{i=1}^{n} \sum_{j=1}^{m} P(x_i, y_j) \log \frac{1}{P(x_i \mid y_j)}.    (3.2-11)

H(X|Y) can be interpreted as the uncertainty in X after Y is observed. Note that uncertainty and information are measured by the same quantity: the information gained by an observation equals the uncertainty it removes. The relationship among these three quantities is

I(X; Y) = H(X) - H(X \mid Y).    (3.2-12)

Since I(X; Y) >= 0, it follows that H(X) >= H(X|Y), with equality if and only if X and Y are statistically independent.

Interpretation from the uncertainty viewpoint:
(a) H(X) represents the average uncertainty in X prior to observation.
(b) H(X|Y) is the average uncertainty in X after Y is observed.
(c) I(X; Y) is the average information about the set X provided by observation of the set Y.
Because the average information gained equals the average uncertainty removed, it is clear that H(X) >= H(X|Y).

Example 3.2-4.
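The average quantities (3.2-7), (3.2-8), (3.2-11) and the identity (3.2-12) can be verified numerically. The sketch below uses an arbitrary, made-up joint pmf (my illustration, not one of the text's examples) to compute H(X), H(X|Y), and I(X;Y) and to confirm that I(X;Y) = H(X) - H(X|Y).

```python
import math

def entropy(p):
    """H = -sum p log2 p, per Eq. (3.2-8)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# Hypothetical joint pmf P(x_i, y_j); rows index x, columns index y.
P = [[0.3, 0.1],
     [0.1, 0.5]]

p_x = [sum(row) for row in P]           # marginal of X
p_y = [sum(col) for col in zip(*P)]     # marginal of Y

H_X = entropy(p_x)

# H(X|Y) = sum_{i,j} P(x_i, y_j) log2 [1 / P(x_i | y_j)], per Eq. (3.2-11)
H_X_given_Y = sum(P[i][j] * math.log2(p_y[j] / P[i][j])
                  for i in range(2) for j in range(2) if P[i][j] > 0)

# I(X;Y) computed directly from Eq. (3.2-7)
I_XY = sum(P[i][j] * math.log2(P[i][j] / (p_x[i] * p_y[j]))
           for i in range(2) for j in range(2) if P[i][j] > 0)

print(H_X, H_X_given_Y, I_XY)
print(abs(I_XY - (H_X - H_X_given_Y)) < 1e-12)   # Eq. (3.2-12) holds
```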
When the conditional entropy H(X|Y) is viewed in terms of a channel whose input is X and whose output is Y, H(X|Y) is called the equivocation and is interpreted as the amount of average uncertainty remaining in X after observation of Y.

Consider more than two random variables, for example a block of k random variables X_1, X_2, ..., X_k with joint probability P(x_1, x_2, ..., x_k) = P(X_1 = x_1, X_2 = x_2, ..., X_k = x_k). The entropy of the block is defined as

H(X_1 X_2 \cdots X_k) = -\sum_{j_1=1}^{n_1} \sum_{j_2=1}^{n_2} \cdots \sum_{j_k=1}^{n_k} P(x_{j_1}, x_{j_2}, \ldots, x_{j_k}) \log P(x_{j_1}, x_{j_2}, \ldots, x_{j_k}).    (3.2-13)

By using the chain rule, the joint probability can be factored as

P(x_1, x_2, \ldots, x_k) = P(x_1) P(x_2 \mid x_1) P(x_3 \mid x_1 x_2) \cdots P(x_k \mid x_1 x_2 \cdots x_{k-1}).    (3.2-14)

Hence

H(X_1 X_2 \cdots X_k) = H(X_1) + H(X_2 \mid X_1) + \cdots + H(X_k \mid X_1 X_2 \cdots X_{k-1}) = \sum_{i=1}^{k} H(X_i \mid X_1 X_2 \cdots X_{i-1}).    (3.2-15)

By applying the result H(X) >= H(X|Y) with X = X_m and Y = X_1 X_2 ... X_{m-1}, we obtain

H(X_1 X_2 \cdots X_k) \le \sum_{m=1}^{k} H(X_m),    (3.2-16)

with equality if and only if the random variables X_1, X_2, ..., X_k are statistically independent.

3.2.2 Information measures for continuous random variables

Suppose X and Y are two continuous random variables with joint PDF p(x, y) and marginal PDFs p(x) and p(y). The average mutual information between X and Y is defined as

I(X; Y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} p(x) p(y \mid x) \log \frac{p(y \mid x) p(x)}{p(x) p(y)} \, dx \, dy.    (3.2-17)

Note: the self-information of a continuous random variable does not exist, because a continuous random variable would require an infinite number of binary digits to be represented exactly; hence its self-information, and therefore its entropy, is infinite. Since P(x) = 0 for any particular x, I(x) = -\log P(x) \to \infty for all x.

Differential entropy: the differential entropy of the continuous random variable X is defined as

H(X) = -\int_{-\infty}^{\infty} p(x) \log p(x) \, dx.    (3.2-18)

Note that this quantity does not have the physical meaning of self-information.

The average conditional entropy of X given Y (over all possible joint events) is

H(X \mid Y) = -\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} p(x, y) \log p(x \mid y) \, dx \, dy.    (3.2-19)

The average mutual information is

I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X).

Now consider the case where one of the random variables is discrete, say X, with outcomes x_i, i = 1, 2, ..., n. When X and Y are statistically dependent,

p(y) = \sum_{i=1}^{n} p(y \mid x_i) P(x_i).

The mutual information provided about the event {X = x_i} by the occurrence of the event {Y = y} is (compare the discrete definition)

I(x_i; y) = \log \frac{p(y \mid x_i) P(x_i)}{p(y) P(x_i)} = \log \frac{p(y \mid x_i)}{p(y)}.    (3.2-20)

The average mutual information between X and Y is

I(X; Y) = \sum_{i=1}^{n} \int_{-\infty}^{\infty} p(y \mid x_i) P(x_i) \log \frac{p(y \mid x_i)}{p(y)} \, dy.    (3.2-21)

Example 3.2-5

3.3 Coding for discrete sources

3.3.1 Coding for discrete memoryless sources

A DMS with alphabet {x_i, i = 1, 2, ..., L} and probabilities P(x_i), i = 1, 2, ..., L, produces an output letter (symbol) every \tau_s seconds. The entropy of the DMS in bits per source symbol satisfies

H(X) = -\sum_{i=1}^{L} P(x_i) \log_2 P(x_i) \le \log_2 L,    (3.3-1)

with equality when the symbols are equally probable.

(1) The average number of bits per source symbol is H(X).
(2) The source rate in bits/sec (bits/s) is H(X) / \tau_s.

3.3.1.1 Fixed-length code words

Noiseless coding (i.e., coding without distortion)
(a) A unique set of R binary digits is assigned to each symbol.
(b) The number of binary digits per symbol required for unique encoding is

R = \log_2 L    (3.3-2)

when L is a power of 2, and

R = \lfloor \log_2 L \rfloor + 1    (3.3-3)

when L is not a power of 2.
(c) R, in bits per symbol, is the code rate. Therefore H(X) <= R.
(d) The efficiency of the encoding for the DMS is defined as the ratio H(X)/R.
(e) When L is a power of 2 and the source letters are equally probable, R = H(X); that is, a fixed-length code of R bits per symbol attains 100% efficiency. If L is not a power of 2 but the source symbols are still equally probable, R differs from H(X) by at most 1 bit per symbol.
(f) When \log_2 L \gg 1, the efficiency is high; but when \log_2 L is small, the efficiency is low and can be improved by encoding a sequence of J symbols at a time (instead of one symbol at a time). In this case, L^J unique code words are required.
(g) For example, when N binary digits are used to encode a J-symbol sequence, there are 2^N code words in total, and N must satisfy the inequality N \ge J \log_2 L (because we must have 2^N \ge L^J). Hence the minimum integer value of N is

N = \lfloor J \log_2 L \rfloor + 1.    (3.3-4)

(h) The average number of bits per symbol is R = N/J. The efficiency is thus improved from H(X)/R for symbol-by-symbol encoding (J = 1) to H(X)/R = J H(X)/N for block encoding, which can be made as close to unity as desired by choosing J sufficiently large.
(i) The encoding process described above, in which the mapping between symbols (or blocks of symbols) and code words is one-to-one (unique), is usually called noiseless coding.

Coding with distortion
(a) Consider an encoding that is not one-to-one, whose purpose is to reduce the code rate R (increase the efficiency). Suppose only a fraction of the L^J blocks of symbols is encoded uniquely; for example, we select the 2^N - 1 most probable J-symbol blocks and encode each of them uniquely, while the remaining L^J - (2^N - 1) J-symbol blocks are represented by the single remaining code word.
(b) Such an encoding procedure results in a decoding failure (error) every time a low-probability block (one of the remaining L^J - (2^N - 1) J-symbol blocks) is mapped into this single code word. The probability of this error is denoted by P_e.
(c) Shannon's source coding theorem I. Let X be the ensemble of letters from a DMS with finite entropy H(X). Blocks of J symbols from the source are encoded into code words of length N from a binary alphabet. For any \epsilon > 0, the probability P_e of a block decoding failure can be made arbitrarily small if

R = \frac{N}{J} \ge H(X) + \epsilon    (3.3-5)

and J is sufficiently large. Conversely, if

R \le H(X) - \epsilon,    (3.3-6)

then P_e becomes arbitrarily close to 1 as J is made sufficiently large.

Observations from this theorem:
(1) The average number of bits per symbol required to encode the output of a DMS with arbitrarily small probability of decoding failure is lower-bounded by the entropy H(X).
(2) If R < H(X), the decoding failure probability approaches 1 as J is increased.

3.3.1.2 Variable-length code words
(1) When the source symbols are not equally probable, a more efficient encoding method is to use variable-length code words.
(2) A good example is the Morse code, which uses short code words for frequent letters and long code words for infrequent letters.
(3) Assigning code words according to the probability of occurrence of the source letters is called entropy coding.

Example. A DMS with output letters and associated probabilities is given in Table 3.3-1, together with three candidate codes. Code I has a flaw: if the source output sequence is 001001..., the first symbol (00 -> a_2) can be decoded correctly, but the following four bits are ambiguous and cannot be decoded uniquely. Code II is uniquely decodable and instantaneously decodable. Note that no code word in Code II is a prefix of any other code word; its code tree is shown in Figure 3.3-1.
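The prefix condition discussed in this example is easy to test mechanically. The sketch below is a minimal illustration; the two candidate codes are hypothetical stand-ins in the spirit of Table 3.3-1, not the codes of the table itself.

```python
def satisfies_prefix_condition(codewords):
    """Return True if no code word is a prefix of any other code word."""
    for i, a in enumerate(codewords):
        for j, b in enumerate(codewords):
            if i != j and b.startswith(a):
                return False
    return True

# Hypothetical candidate codes for a 4-letter source.
code_a = ["0", "10", "110", "111"]    # prefix-free: instantaneously decodable
code_b = ["0", "01", "011", "111"]    # "0" is a prefix of "01": not prefix-free

print(satisfies_prefix_condition(code_a))   # True
print(satisfies_prefix_condition(code_b))   # False
```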
(4) In general, the prefix condition requires that for a given code word C_k of length k having elements (b_1, b_2, ..., b_k), there is no other code word of length l < k with elements (b_1, b_2, ..., b_l) for 1 <= l <= k - 1. In other words, there is no code word of length l < k that is identical to the first l binary digits of another, longer code word. This property makes the code words instantaneously decodable. Code III (code tree in Figure 3.3-2) does not satisfy the prefix condition and is not instantaneously decodable, although it is uniquely decodable.

(5) The objective is to devise a systematic procedure for constructing uniquely decodable variable-length codes that are efficient in the sense that the average number of bits per source letter,

\bar{R} = \sum_{k=1}^{L} n_k P(a_k),    (3.3-7)

is minimized, where n_k denotes the length of the kth code word and P(a_k) is the probability of the source symbol a_k.

(6) The condition for the existence of a code that satisfies the prefix condition is given by the Kraft inequality.

Kraft inequality. A necessary and sufficient condition for the existence of a binary code with code words having lengths n_1 <= n_2 <= ... <= n_L that satisfy the prefix condition is

\sum_{k=1}^{L} 2^{-n_k} \le 1.    (3.3-8)

Proof:
(a) Sufficiency (if \sum_{k=1}^{L} 2^{-n_k} \le 1, then a binary prefix code with these lengths exists):
1. Construct a full binary tree of order n = n_L.
2. Such a tree has 2^n terminal nodes, with two nodes of order k stemming from each node of order k - 1, for 1 <= k <= n.
3. Select any node of order n_1 as the first code word C_1.
4. This choice eliminates 2^{n - n_1} terminal nodes (the fraction 2^{-n_1} of the 2^n terminal nodes).
5. From the remaining available nodes of order n_2, select one node for the second code word C_2.
6. This choice eliminates 2^{n - n_2} terminal nodes (the fraction 2^{-n_2} of the 2^n terminal nodes).
7. This procedure continues until the last code word is assigned at terminal node n = n_L.
8. After j < L code words have been assigned, the fraction of terminal nodes eliminated is

\sum_{k=1}^{j} 2^{-n_k} < \sum_{k=1}^{L} 2^{-n_k} \le 1.

9. Hence there is always a node of order greater than n_j available to be assigned to the next code word.
10. That is, we have constructed a code tree that is embedded in the full tree of 2^n nodes. Figure 3.3-3 shows an example for n_L = 4.

(b) Necessity (if such a binary prefix code exists, then \sum_{k=1}^{L} 2^{-n_k} \le 1): in the code tree of order n = n_L, the number of terminal nodes eliminated from the total of 2^n terminal nodes is

\sum_{k=1}^{L} 2^{n - n_k} \le 2^n.

Hence

\sum_{k=1}^{L} 2^{-n_k} \le 1.

Source coding theorem II. Let X be the ensemble of letters from a DMS with finite entropy H(X) and output letters x_k, 1 <= k <= L, with corresponding probabilities of occurrence p_k, 1 <= k <= L. It is possible to construct a code that satisfies the prefix condition and has an average length \bar{R} that satisfies the inequalities

H(X) \le \bar{R} < H(X) + 1.    (3.3-9)

1. Lower bound. For code words having lengths n_k, 1 <= k <= L, the difference H(X) - \bar{R} may be expressed as

H(X) - \bar{R} = \sum_{k=1}^{L} p_k \log_2 \frac{1}{p_k} - \sum_{k=1}^{L} n_k p_k = \sum_{k=1}^{L} p_k \log_2 \frac{2^{-n_k}}{p_k}.    (3.3-10)

Using the inequality \ln u \le u - 1,

H(X) - \bar{R} \le (\log_2 e) \sum_{k=1}^{L} p_k \left( \frac{2^{-n_k}}{p_k} - 1 \right) = (\log_2 e) \left( \sum_{k=1}^{L} 2^{-n_k} - 1 \right) \le 0,

where the last step follows from the Kraft inequality. Equality holds if and only if p_k = 2^{-n_k}, 1 <= k <= L.

2. Upper bound. Since the n_k, 1 <= k <= L, are integers, we may select them such that 2^{-n_k} \le p_k < 2^{-n_k + 1}. Summing the left-hand inequality p_k \ge 2^{-n_k} over 1 <= k <= L yields the Kraft inequality, for which we have demonstrated that there exists a code satisfying the prefix condition. Taking the logarithm of p_k < 2^{-n_k + 1} gives \log_2 p_k < -n_k + 1, or, equivalently,

n_k < 1 - \log_2 p_k.    (3.3-11)

Multiplying both sides by p_k and summing over 1 <= k <= L yields the upper bound \bar{R} < H(X) + 1.
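The length assignment used in the upper-bound argument (one valid choice is n_k = ceil(-log2 p_k), which satisfies 2^{-n_k} <= p_k < 2^{-n_k+1}) can be checked numerically. The sketch below uses a made-up probability vector (my illustration, not an example from the text) and verifies both the Kraft inequality (3.3-8) and the bounds (3.3-9).

```python
import math

p = [0.4, 0.25, 0.15, 0.1, 0.05, 0.05]       # hypothetical source probabilities

# Integer lengths as in the upper-bound proof: n_k = ceil(-log2 p_k),
# which guarantees 2^{-n_k} <= p_k < 2^{-n_k + 1}.
n = [math.ceil(-math.log2(pk)) for pk in p]

kraft_sum = sum(2.0 ** (-nk) for nk in n)    # left side of Eq. (3.3-8)
H = -sum(pk * math.log2(pk) for pk in p)     # source entropy
R_bar = sum(nk * pk for nk, pk in zip(n, p)) # average length, Eq. (3.3-7)

print(n, kraft_sum)        # kraft_sum <= 1, so a prefix code with these lengths exists
print(H <= R_bar < H + 1)  # True: the bounds of Eq. (3.3-9)
```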
Algorithms for variable-length encoding (for any DMS with source symbols that are not equally probable)

Huffman coding algorithm (Huffman, 1952): an algorithm based on the source letter probabilities P(x_i), i = 1, 2, ..., L, that is optimum in the sense that the average number of binary digits required to represent the source symbols is a minimum. The code words satisfy the prefix condition, which allows the received sequence to be uniquely and instantaneously decoded.

Example 3.3-1
(1) Arrange the seven symbols in decreasing order of probability, as in Figure 3.3-4.
(2) Tie the two symbols having the smallest probabilities into a new node and arbitrarily assign 0 and 1 to the two branches. The probability of the new node is the sum of the probabilities of the two symbols. We now have six symbols.
(3) Again tie the two symbols having the smallest probabilities into a new node, as in step (2), and continue in the same way. At the second-to-last step, three nodes remain, with probabilities 0.35 (x_1), 0.30 (x_2), and 0.35 (the combined node). What can we do? There are two choices: the first ties x_2 with x_1 and then ties the resulting node with the combined node, as in Figure 3.3-4; the second makes the other pairing first, as in Figure 3.3-5. The entropy is H(X) = 2.11. Note that (a) the average number of bits per source symbol is the same for both codes, \bar{R} = 2.21, and (b) the Huffman code is not necessarily unique.

Example 3.3-2. A DMS with symbols and probabilities as shown in Figure 3.3-6. The Huffman code is obtained by the same procedure as in the previous example. The efficiency is H(X)/\bar{R} = 2.63/2.70 = 0.974.

To be more efficient, a block of J symbols can be encoded at a time, instead of encoding symbol by symbol. In this case, the bounds of source coding theorem II become

J H(X) \le \bar{R}_J < J H(X) + 1,    (3.3-12)

since the entropy of a J-symbol block from a DMS is J H(X) and \bar{R}_J is the corresponding average number of bits per J-symbol block. Then

H(X) \le \frac{\bar{R}_J}{J} < H(X) + \frac{1}{J},    (3.3-13)

where \bar{R} = \bar{R}_J / J is the average number of bits per symbol. Thus \bar{R} can be made as close to H(X) as desired by selecting J sufficiently large (that is, the encoding becomes more efficient).

Example 3.3-3. Suppose a DMS output has three possible letters; see Table 3.3-2. Two encoding schemes are compared: the first encodes on a symbol-by-symbol basis, and the second encodes blocks of two symbols at a time. The resulting average numbers of bits per symbol and the efficiencies are shown in Tables 3.3-2 and 3.3-3.
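A compact sketch of the Huffman procedure described above is given below. It is my own minimal implementation with hypothetical probabilities (not those of Figures 3.3-4 to 3.3-6): it repeatedly merges the two least probable nodes, and ties may be broken either way, which is why the resulting code need not be unique even though its average length is.

```python
import heapq
import math

def huffman_code(probs):
    """Build a binary Huffman code; returns {symbol: codeword}."""
    # Each heap entry: (probability, tie-breaker, {symbol: partial codeword}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)   # two least probable nodes
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]

# Hypothetical DMS probabilities.
probs = {"x1": 0.35, "x2": 0.30, "x3": 0.20, "x4": 0.10, "x5": 0.05}
code = huffman_code(probs)
R_bar = sum(probs[s] * len(w) for s, w in code.items())  # Eq. (3.3-7)
H = -sum(p * math.log2(p) for p in probs.values())
print(code)
print(H, R_bar, H / R_bar)   # entropy, average length, efficiency
```

For these probabilities the printout shows H(X) <= R_bar < H(X) + 1, consistent with source coding theorem II.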
3.3.2 Discrete stationary sources

In contrast to the assumption of statistically independent output letters, in this section the source is assumed to be statistically stationary, with a statistically dependent sequence of output letters.

(1) As a starting point, let us evaluate the entropy of any sequence of letters from a stationary source. The entropy of a block of random variables X_1, X_2, ..., X_k is (see Eq. 3.2-15)

H(X_1, X_2, \ldots, X_k) = \sum_{i=1}^{k} H(X_i \mid X_1, X_2, \ldots, X_{i-1}).    (3.3-14)

Define the entropy per letter for the k-symbol block as

H_k(X) = \frac{1}{k} H(X_1, X_2, \ldots, X_k).    (3.3-15)

Define the information content of a stationary source as the entropy per letter in the limit k \to \infty:

H_\infty(X) = \lim_{k \to \infty} H_k(X) = \lim_{k \to \infty} \frac{1}{k} H(X_1, X_2, \ldots, X_k).    (3.3-16)

Alternatively, the entropy per letter from the source can be defined in terms of the conditional entropy H(X_k | X_1, X_2, ..., X_{k-1}) in the limit k \to \infty:

H_\infty(X) = \lim_{k \to \infty} H(X_k \mid X_1, X_2, \ldots, X_{k-1}).    (3.3-17)

It can be shown that H(X_k | X_1 X_2 ... X_{k-1}) is a non-increasing sequence in k:

H(X_k \mid X_1, X_2, \ldots, X_{k-1}) \le H(X_{k-1} \mid X_1, X_2, \ldots, X_{k-2}).    (3.3-18)

Proof:
(1) Using the previous result H(X|Y) <= H(X), i.e., conditioning on a random variable cannot increase entropy,

H(X_k \mid X_1 X_2 \cdots X_{k-1}) \le H(X_k \mid X_2 \cdots X_{k-1}).    (3.3-19)

(2) From the stationarity of the source,

H(X_k \mid X_2 \cdots X_{k-1}) = H(X_{k-1} \mid X_1 X_2 \cdots X_{k-2}).    (3.3-20)

This completes the proof.

In addition, we have the result

H_k(X) \ge H(X_k \mid X_1 X_2 \cdots X_{k-1}).    (3.3-21)

From (3.3-14), (3.3-15), and (3.3-18),

H_k(X) = \frac{1}{k} H(X_1, X_2, \ldots, X_k) = \frac{1}{k} \sum_{i=1}^{k} H(X_i \mid X_1, X_2, \ldots, X_{i-1}) \ge H(X_k \mid X_1, X_2, \ldots, X_{k-1}),

since, by the non-increasing property, each term H(X_i | X_1, ..., X_{i-1}) with i <= k is at least as large as the last term H(X_k | X_1, ..., X_{k-1}).

Next we show that H_k(X) is a non-increasing sequence in k. From the definition of H_k(X), we may write

H_k(X) = \frac{1}{k} H(X_1, X_2, \ldots, X_{k-1}, X_k) = \frac{1}{k} \left[ H(X_1, X_2, \ldots, X_{k-1}) + H(X_k \mid X_1, X_2, \ldots, X_{k-1}) \right]

(treating X_1 X_2 ... X_{k-1} as a unit and using (3.3-14))

= \frac{k-1}{k} H_{k-1}(X) + \frac{1}{k} H(X_k \mid X_1, X_2, \ldots, X_{k-1}) \le \frac{k-1}{k} H_{k-1}(X) + \frac{1}{k} H_k(X),

where the last step uses (3.3-21). Rearranging, we obtain the inequality

H_k(X) \le H_{k-1}(X).    (3.3-22)

Since H_k(X) and the conditional entropy H(X_k | X_1 X_2 ... X_{k-1}) are both non-increasing and non-negative in k, both limits must exist.

By using (3.3-14) and (3.3-15), the entropy H_{k+j}(X) can be expressed as

H_{k+j}(X) = \frac{1}{k+j} H(X_1 X_2 \cdots X_{k+j}) = \frac{1}{k+j} H(X_1 X_2 \cdots X_{k-1}) + \frac{1}{k+j} \left[ H(X_k \mid X_1 X_2 \cdots X_{k-1}) + H(X_{k+1} \mid X_1 X_2 \cdots X_k) + \cdots + H(X_{k+j} \mid X_1 X_2 \cdots X_{k+j-1}) \right].

This expression is obtained by treating X_1 X_2 ... X_{k-1} as a unit and proceeding as in the derivation of (3.3-22). Since the conditional entropy is non-increasing, each of the j + 1 conditional-entropy terms is upper-bounded by the first one, so

H_{k+j}(X) \le \frac{1}{k+j} H(X_1 X_2 \cdots X_{k-1}) + \frac{j+1}{k+j} H(X_k \mid X_1 X_2 \cdots X_{k-1}).    (3.3-23)

For fixed k, taking the limit of (3.3-23) as j \to \infty yields

H_\infty(X) \le H(X_k \mid X_1 X_2 \cdots X_{k-1}).    (3.3-24)

But (3.3-24) is valid for all k; hence it is valid in the limit k \to \infty. That is,

H_\infty(X) \le \lim_{k \to \infty} H(X_k \mid X_1 X_2 \cdots X_{k-1}).    (3.3-25)

On the other hand, from (3.3-21),

H_\infty(X) \ge \lim_{k \to \infty} H(X_k \mid X_1 X_2 \cdots X_{k-1}).    (3.3-26)

Hence the equality stated in (3.3-17) has been proven.

Now suppose a discrete stationary source emits blocks of J letters, with H_J(X) as the entropy per letter. The sequence of J letters is encoded with a variable-length Huffman code that satisfies the prefix condition. The resulting code has an average number of bits per J-letter block that satisfies

H(X_1 X_2 \cdots X_J) \le \bar{R}_J < H(X_1 X_2 \cdots X_J) + 1.    (3.3-27)

Dividing by J, the average number of bits per source letter satisfies

\frac{1}{J} H(X_1 X_2 \cdots X_J) \le \frac{\bar{R}_J}{J} < \frac{1}{J} H(X_1 X_2 \cdots X_J) + \frac{1}{J}.

Hence, with \bar{R} = \bar{R}_J / J,

H_J(X) \le \bar{R} < H_J(X) + \frac{1}{J}.    (3.3-28)

By increasing the block size J, we can approach H_J(X) arbitrarily closely, and in the limit as J \to \infty,

H_\infty(X) \le \bar{R} < H_\infty(X) + \epsilon,    (3.3-29)

where \epsilon \to 0 as J \to \infty.

Conclusions:
(1) The larger the block of symbols that is encoded, the higher the encoding efficiency.
(2) The design of the Huffman code requires knowledge of the joint PDF of the J-symbol blocks.
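To illustrate the behaviour of H_k(X) in (3.3-15) for a source with memory, the sketch below uses a hypothetical two-state Markov source (my illustration; the transition probabilities are arbitrary) and computes the entropy per letter of k-symbol blocks, which is non-increasing in k as stated in (3.3-22).

```python
import math
from itertools import product

# Hypothetical stationary binary Markov source.
P_trans = {0: {0: 0.9, 1: 0.1},   # P(next letter | current letter)
           1: {0: 0.2, 1: 0.8}}
pi = {0: 2/3, 1: 1/3}             # stationary distribution of this chain

def block_prob(seq):
    """Joint probability of a k-letter block from the stationary source."""
    p = pi[seq[0]]
    for a, b in zip(seq, seq[1:]):
        p *= P_trans[a][b]
    return p

def H_k(k):
    """Entropy per letter of a k-symbol block, Eq. (3.3-15)."""
    H = -sum(block_prob(s) * math.log2(block_prob(s))
             for s in product((0, 1), repeat=k))
    return H / k

for k in (1, 2, 3, 4, 6, 8):
    print(k, round(H_k(k), 4))    # a non-increasing sequence, per Eq. (3.3-22)
```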
3.3.3 The Lempel-Ziv algorithm

From the previous discussion, we know the following.
(1) Huffman coding yields an optimal source code in the sense that the code words satisfy the prefix condition and the average block length is a minimum.
(2) Its drawbacks are that (a) for a DMS, one needs to know the probabilities of the source letters, and (b) for a discrete source with memory, one needs to know the joint probabilities of blocks of length n >= 2. In practice, these probabilities are usually not available.
(3) The Lempel-Ziv source-coding algorithm, in contrast, is designed to be independent of the source statistics. It belongs to the class of universal source-coding algorithms and is a variable-to-fixed-length algorithm.

The encoding process of the Lempel-Ziv source-coding algorithm is as follows.
(a) The sequence at the output of the discrete source is parsed into variable-length blocks, called phrases.
(b) A new phrase is introduced every time a block of letters from the source differs from some previous phrase in its last letter.
(c) The phrases are listed in a dictionary, which stores the locations of the existing phrases.
(d) To encode a new phrase, we simply specify the location of the existing phrase in the dictionary and append the new letter.

Example (see page 102).
Notes:
(1) The initial location in the dictionary is set to 0000, so the first phrase (1) is appended to it and encoded as 00001.
(2) The source decoder constructs an identical copy of the dictionary at the receiving end of the communication system and decodes the received sequence in step with the transmitted data sequence.
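A minimal sketch of the parsing and encoding steps (a)-(d) is given below. It is my own illustration of LZ78-style parsing: the input string and the fixed 4-bit dictionary-index width are arbitrary choices, and any incomplete phrase left at the end of the input is simply dropped, so it does not reproduce the exact code words of the example on page 102.

```python
def lz_parse(bits, index_bits=4):
    """Parse a binary string into phrases; emit (dictionary index, new bit) code words."""
    dictionary = {"": 0}          # location 0 is the empty phrase
    codewords = []
    phrase = ""
    for b in bits:
        if phrase + b in dictionary:
            phrase += b           # keep extending until the phrase is new
        else:
            # Encode as: location of the longest previously seen phrase + the new letter.
            codewords.append(format(dictionary[phrase], "0{}b".format(index_bits)) + b)
            dictionary[phrase + b] = len(dictionary)
            phrase = ""
    return codewords

# Hypothetical source output sequence.
print(lz_parse("10101101001001110101"))
# First code word is "00001": empty-phrase location 0000 with the new letter 1 appended.
```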