Notes on Information Theory
by Jeff Steif

1 Entropy, the Shannon-McMillan-Breiman Theorem and Data Compression

These notes will contain some aspects of information theory. We will consider some REAL problems that REAL people are interested in, although we might make some mathematical simplifications so that we can (mathematically) carry this out. Not only are these things inherently interesting (in my opinion, at least) because they have to do with real problems, but these problems or applications allow certain mathematical theorems (most notably the Shannon-McMillan-Breiman Theorem) to "come alive". The results that we will derive have much to do with the entropy of a process (a concept that will be explained shortly) and allow one to see the real importance of entropy, which might not at all be obvious simply from its definition.

The first aspect we want to consider is so-called data compression, which means that we have data and want to compress it in order to save space and hence money. More mathematically, let $X_1, X_2, \ldots, X_n$ be a finite sequence from a stationary stochastic process $\{X_k\}_{k=-\infty}^{\infty}$. Stationarity will always mean strong stationarity, which means that the joint distribution of $X_m, X_{m+1}, X_{m+2}, \ldots, X_{m+k}$ is independent of $m$; in other words, we have time-invariance. We also assume that the variables take on values from a finite set $S$ with $|S| = s$.

Anyway, we have our $X_1, X_2, \ldots, X_n$ and want a rule which, given a sequence, assigns to it a new sequence (hopefully of shorter length) such that different sequences are mapped to different sequences (otherwise one could not recover the data and the whole thing would be useless; e.g., assigning every word to 0 is nice and short but of limited use). Assuming all words have positive probability, we clearly can't code all the words of length $n$ into words of length $n-1$, since there are $s^n$ of the former and $s^{n-1}$ of the latter. However, one can perhaps hope that the expected length of the coded word is less than the length of the original word (and by some fixed fraction). What we will eventually see is that this is always the case EXCEPT in the one case where the stationary process is i.i.d. AND uniform on $S$. It will turn out that one can code things so that the expected length of the output is $H/\ln(s)$ times the length of the input, where $H$ is the entropy of the process. (One should conclude from the above discussion that the only stationary process with entropy $\ln(s)$ is i.i.d. uniform.)

What is the entropy of a stationary process? We define this in 2 steps. First, consider a partition of a probability space into finitely many sets where (after some ordering) the sets have probabilities $p_1, \ldots, p_k$ (which are of course positive and add to 1). The entropy of this partition is defined to be
$$-\sum_{i=1}^{k} p_i \ln p_i.$$
This sounds arbitrary but turns out to be very natural: it has certain natural properties (after one interprets entropy as the amount of information gained by performing an experiment with the above probability distribution) and is in fact the only function (up to a multiplicative constant) which has these natural properties. I won't discuss these but there are many references (ask me if you want). (Here $\ln$ means natural log, but sometimes in these notes it will refer to a different base.)

Now, given a stationary process, look only at the random variables $X_1, X_2, \ldots, X_n$. This breaks the probability space into $s^n$ pieces, since there are this many sequences of length $n$.
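Here is a tiny Python sketch of the two objects just introduced: the entropy of a finite partition, and the entropy $H_n$ of the partition of the space into the cylinder sets determined by $(X_1,\ldots,X_n)$. The two-letter marginal is made up. For an i.i.d. process $H_n$ is simply $n$ times the one-letter entropy, so the normalized quantity $H_n/n$ appearing in the definition below is constant in $n$.

```python
import math
from itertools import product

def partition_entropy(probs):
    """Entropy -sum_i p_i ln p_i of a finite partition (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# An i.i.d. process on S = {0, 1} whose marginal is (1/4, 3/4) -- numbers chosen arbitrarily.
marginal = {0: 0.25, 1: 0.75}

def H_n(n):
    """Entropy of the partition of the space into the s^n cylinder sets given by (X_1, ..., X_n)."""
    word_probs = [math.prod(marginal[x] for x in word)
                  for word in product(marginal, repeat=n)]
    return partition_entropy(word_probs)

for n in range(1, 6):
    print(n, H_n(n) / n)                          # constant in n for an i.i.d. process
print(partition_entropy(marginal.values()))       # ... and equal to the one-letter entropy
```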
Let $H_n$ be the entropy (as defined above) of this partition.

Definition 1.1: The entropy of a stationary process is
$$\lim_{n\to\infty} \frac{H_n}{n}$$
where $H_n$ is defined above (this limit can be shown to always exist using subadditivity).

Hard Exercise: Define the conditional entropy $H(X|Y)$ of $X$ given $Y$ (where both r.v.'s take on only finitely many values) in the correct way and show that the entropy of a process $\{X_n\}$ is also $\lim_{n\to\infty} H(X_0 | X_{-1}, \ldots, X_{-n})$.

Exercise: Compute the entropy first of i.i.d. uniform and then of a general i.i.d. process. If you're then feeling confident, go on and look at Markov chains (they are not that hard).

Exercise: Show that entropy is uniquely maximized at i.i.d. uniform. (The previous exercise showed this to be the case if we restrict to i.i.d.'s, but you need to consider all stationary processes.)

We now discuss the Shannon-McMillan-Breiman Theorem. It might seem dry and the point of it might not be so apparent, but it will come very much alive soon.

Before doing this, we need to make an important (and not unnatural) assumption that our process is ergodic. What does this mean? To motivate it, let's construct a stationary process which does not satisfy the Strong Law of Large Numbers (SLLN) for trivial reasons. Let $\{X_n\}$ be i.i.d. with 0's and 1's, 0 with probability 3/4 and 1 with probability 1/4, and let $\{Y_n\}$ be i.i.d. with 0's and 1's, 0 with probability 1/4 and 1 with probability 3/4. Now construct a stationary process by first flipping a fair coin and then, if it's heads, taking a realization from the $\{X_n\}$ process and, if it's tails, taking a realization from the $\{Y_n\}$ process. (By thinking of stationary processes as measures on sequence space, we are simply taking a convex (1/2,1/2) combination of the measures associated to the two processes $\{X_n\}$ and $\{Y_n\}$.) You should check that this resulting process is stationary with each marginal having mean 1/2. Letting $Z_n$ denote this process, we clearly don't have that
$$\frac{1}{n}\sum_{i=1}^{n} Z_i \to 1/2 \quad \text{a.s.}$$
(which is what the SLLN would tell us if $\{Z_n\}$ were i.i.d.) but rather we clearly have that
$$\frac{1}{n}\sum_{i=1}^{n} Z_i \to 3/4 \ \text{with probability } 1/2 \quad \text{and} \quad \to 1/4 \ \text{with probability } 1/2.$$

The point of ergodicity is to rule out such stupid examples. There are two equivalent definitions of ergodicity. The first is that if $T$ is the transformation which shifts a bi-infinite sequence one step to the left, then any event which is invariant under $T$ has probability 0 or 1. The other definition is that the measure (or distribution) on sequence space associated to the process is not a convex combination of 2 measures each also corresponding to some stationary process (i.e., invariant under $T$). It is not obvious that these are equivalent definitions but they are.

Exercise: Show that an i.i.d. process is ergodic by using the first definition.

Theorem 1.2 (Shannon-McMillan-Breiman Theorem): Consider a stationary ergodic process with entropy $H$. Let $p(X_1, X_2, \ldots, X_n)$ be the probability that the process prints out $X_1, \ldots, X_n$. (You have to think about what this means BUT NOTE that it is a random variable.) Then as $n \to \infty$,
$$\frac{-\ln(p(X_1, X_2, \ldots, X_n))}{n} \to H \quad \text{a.s. and in } L^1.$$

NOTE: The fact that $E\big[\frac{-\ln(p(X_1, X_2, \ldots, X_n))}{n}\big] \to H$ is exactly (check this) the definition of entropy, and therefore general real analysis tells us that the a.s. convergence implies the $L^1$-convergence.

We do not prove this now (we will in §2) but instead see how we can use it to do data compression. To do data compression, we will only need the above convergence in probability (which of course follows from either the a.s.
or L1 convergence). Thinking about what convergence in probability means, we have the following corollary. Corollary 1.3: Consider a stationary ergodic process with entropy H. Then for all > 0, for large n, we have that the words of length n can be divided into two sets with all words C in the first set satisfying e−(H+)n < p(C) < e−(H−)n and with the total measure of all words in the second set being < . One should think of the first set as the “good” set and the second set as the “bad” set. (Later it will be the good words which will be compressed). 5 Except in the i.i.d. uniform case, the bad set will have many more elements (in terms of cardinality) than the good set but its probability will be much lower. (Note that it is only in the i.i.d. uniform case that probabilities and cardinality are the same thing.) Here is a related result which will finally allow us to prove the data compression theorem. Proposition 1.4: Consider a stationary ergodic process with entropy H. Order the words of length n in decreasing order (in terms of their probabilities). Fix λ ∈ (0, 1). We select the words of length n in the above order one at a time until their probabilities sum up to at least λ and we let Nn (λ) be the number of words we take to do this. Then ln(Nn (λ)) = H. n→∞ n lim Exercise: Note that one obtains the same limit for all λ. Think about this. Convince yourself however that the above convergence cannot be uniform in λ. Is the convergence uniform in λ on compact intervals of (0, 1)? Proof: Fix λ ∈ (0, 1). Let > 0 be arbitrary but < 1 − λ and λ. We will show both lim sup n ln(Nn (λ)) ≤H + n and lim inf n ln(Nn (λ)) ≥H − n from which the result follows. By Corollary 1.3, we know that for large n, the words of length n can be broken up into 2 groups with all words C in the first set satisfying e−(H+)n < p(C) < e−(H−)n 6 and with the total measure of all words in the second set being < . We now break the second group (the bad group) into 2 sets depending on whether p(C) ≥ e−(H−)n (the big bad words) or whether p(C) ≤ e−(H+)n (the small bad words). When we order the words in decreasing order, we clearly first go through the big bad words, then the good words and finally the small bad words. We now prove the first inequality. For large n, the total measure of the good words is at least 1 − which is bigger than λ and hence the total measure of the big bad words together with the good words is bigger than λ. Hence Nn (λ) is at most the total number of big bad words plus the total number of good words. Since all these words have probability at least e−(H+)n , there cannot be more than de(H+)n e of them. Hence lim sup n ln(Nn (λ)) ln(e(H+)n + 1) ≤ lim sup = H + . n n n For the other inequality, let Mn (λ) be the number of good words among the Nn (λ) words taken when we accumulated at least λ measure of the space. Since the total measure of the big bad words is at most , the total measure of these Mn (λ) words is at least λ−. Since each of these words has probability at most e−(H−)n , there must be at least (λ − )e(H−)n words in Mn (λ) (if you don’t see this, we have |Mn (λ)|e−(H−)n ≥ λ − ). Hence lim sup n ln(Mn (λ)) ln(Nn (λ)) ≥ lim sup ≥ n n n lim sup n ln((λ − )e(H−)n ) = H − , n and we’re done. 2 Note that by taking λ close to 1, one can see that (as long as we are not i.i.d. 
uniform, i.e., as long as H < ln s), we can cover almost all words (in the 7 sense of probability) by using only a neglible percentage of the total number of words (eHn words instead of the total number of words e(n ln s) ). We finally arrive at data compression, now that we have things set up and have a little feeling about what entropy is. We now make some definitions. Definition 1.5: An n–code is an injective (i.e. 1-1) mapping σ which takes the set of words of length n with alphabet S to words of any finite length (of at least 1) with alphabet S. Definition 1.6: Consider a stationary ergodic process {Xk }∞ k=−∞ . The compressibility of an n–code σ is E[ `(σ(X1n,...,Xn )) ] where `(W ) denotes the length of the word W . This measures how well an n–code compresses on average. Theorem 1.7: Consider a stationary ergodic process {Xk }∞ k=−∞ with entropy H. Let µn be the minimum compressibility over all n–codes. Then the limit µ ≡ limn→∞ µn exists and equals H ln s . The quantity µ is called the compression coefficient of the process {Xk }∞ −∞ . Since H is always strictly less than ln s except in the i.i.d. uniform case, this says that data compression is always possible except in this case. Proof: As usual, we have two directions to prove. We first show that for arbitrary and δ > 0, and for large n, µn ≥ (1 − δ) H−2 ln s . Fix any n–code σ. Call an n–word C short (for σ) if σ(C) ≤ n(H−2) ln s . Since codes are 1-1, the number of short words is at most 2 b s + s + ... + s n(H−2) c ln s b ≤s n(H−2) c ln s ∞ X ( i=1 8 s−i ) ≤ s n(H−2) e . s−1 Next, since ln(Nn (δ)) →H n by Proposition 1.4, for large n, we have that Nn (δ) > en(H−) . This says that for large n, if we take < en(H−) of the most probable words, we won’t get δ total measure and so certainly if we take < en(H−) of any of the words, we won’t get δ total measure. In particular, since the number of short words is at most s n(H−2) e s−1 which is for large n less than en(H−) , the short words can’t cover δ total measure for large n, i.e., P(short word) ≤ δ for large n. Now note that this argument was valid for any n–code. That is, for large n, any n-code satisfies P(short word) ≤ δ. Hence for any n-code σ (H − 2) σ(C) ] ≥ (1 − δ) n ln s E[ and so µn ≥ (1 − δ) H − 2 , ln s as desired. For the other direction, we show that for any > 0, we have that for large n, µn ≤ 1/nd n(H+) ln s e + δ. This will complete the argument. The number of different sequences of length exactly d n(H+) ln s e is at least s n(H+) ln s which is en(H+) . By Proposition 1.4 the number Nn (1 − δ) of most probable n–words needed to cover 1 − δ of the measure is ≤ en(H+) for large n. Since there are at least this many words of length d n(H+) ln s e, we can code these high probability words into words of length d n(H+) ln s e. For the 9 remaining words, we code them to themselves. For such a code σ, we have that E[σ(X1 , . . . , Xn )] is at most d n(H+) ln s e + δn and hence the compression of this code is at most 1/nd n(H+) ln s e + δ. 2 This is all nice and dandy but EXERCISE: Show that for a real person, who really has data and really wants to compress it, the above is all useless. We will see later on (§3) a method which is not useless. 2 Proof of the Shannon-McMillan-Breiman Theorem We provide here the classical proof of this theorem where we will assume two major theorems, namely, the ergodic theorem and the Martingale Convergence Theorem. This section is much more technical than all of the coming sections and requires more mathematical background. 
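Since everything below leans on the ergodic theorem, it is worth having a concrete picture of exactly what is being assumed: for a stationary ergodic process, time averages of an observable converge a.s. to its expectation. A quick numerical illustration in Python (the two-state Markov chain and its numbers are an arbitrary choice, used only for this illustration):

```python
import random

# A two-state ergodic Markov chain (arbitrary transition probabilities).
P = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}
pi = {0: 5 / 6, 1: 1 / 6}          # its stationary distribution (solves pi P = pi)

def sample_path(n, rng):
    """Sample X_1, ..., X_n with X_1 drawn from the stationary distribution."""
    x = 0 if rng.random() < pi[0] else 1
    path = [x]
    for _ in range(n - 1):
        x = 0 if rng.random() < P[x][0] else 1
        path.append(x)
    return path

g = lambda x: float(x == 1)        # the observable whose time average we track
rng = random.Random(0)
path = sample_path(200_000, rng)
time_average = sum(g(x) for x in path) / len(path)
space_average = sum(pi[x] * g(x) for x in pi)      # the integral of g
print(time_average, space_average)                 # close, as the ergodic theorem predicts
```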
Feel free if you wish to move on to §3. As this section is independent of all of the others, you won’t miss anything. We first show that the proof of this result is an easy consequence of the ergodic theorem in the special case of an i.i.d. process or more generally of a multistep Markov chain. We take the middle road and prove it for Markov chains (and the reader will easily see that it extends trivially to multi-step Markov chains). Proof of SMB for Markov Chains: First note that ln(p(X1 , X2 , . . . , Xn )) − =− n 10 Pn 1 ln p(Xj |Xj−1 ) n where the first term ln p(X1 |X0 ) is taken to be ln p(X1 ). If the first term were interpreted as ln p(X1 |X0 ) instead, then the ergodic theorem would immediately tell us this limit is a.s. E[− ln p(X1 |X0 )] which (you have seen as an exercise) is the entropy of the process. Of course this difference in the first term makes an error of at most const/n and the result is proved. 2 For the multistep markov chain, rather than replacing the first term by something else (as above), we need to do so for the first k terms where k is the look-back of the markov chain. Since this k is fixed, there is no problem. To prove the general result, we isolate the main idea into a lemma which is due to Breiman and which is a generalization of the ergodic theorem which is also used in its proof. Actually, to understand the statement of the next result, one really needs to know some ergodic theory. Learn about it on your own, ask me for books, or I’ll discuss what is needed or just forget this whole section. Proposition 2.1: Assume that gn (x) → g(x) a.e. and that Z sup |gn (x)| < ∞. n Then if φ is an ergodic measure preserving transformation, we have that X 1 n−1 lim gj (φj (x)) = n→∞ n 0 Z g a.e. and in L1 . Note if gn = g ∈ L1 for all n, then this is exactly the ergodic theorem. Proof: (as explained in the above remark), the ergodic theorem tells us X 1 n−1 lim g(φj (x)) = n→∞ n 0 11 Z g a.e. and in L1 . Since X X X 1 n−1 1 n−1 1 n−1 gj (φj (x)) = g(φj (x)) + [gj (φj (x)) − g(φj (x))], n 0 n 0 n 0 we need to show that (∗) X 1 n−1 |gj (φj (x)) − g(φj (x))| → 0 a.e. and in L1 . n 0 Let FN (x) = supj≥N |gj (x)−g(x)|. By assumption, each FN is integrable and goes to 0 monotonically a.e. from which dominated convergence gives R fN → 0. Now, −1 X 1 NX 1 n−1 |gj (φj (x)) − g(φj (x))| ≤ |gj (φj (x)) − g(φj (x))|+ n 0 n 0 n−N X−1 1 n−N −1 fN (φj φN (x)). n n−N −1 0 For fixed N , letting n → ∞, the first term goes to 0 a.e. while the second term goes to R fN a.e. by the ergodic theorem. Since R fN → 0, this proves the pointwise part of (*). For the L1 part, again for fixed N , letting n → ∞, the L1 norm of the first term goes to the 0 while the L1 norm of the second term is always at most R fN (which goes to 0) and we’re done. 2 Proof of the SMB Theorem: To properly do this and to apply the previous theorem, we need to set things up in an ergodic theory setting. Our process Xn gives us a measure µ on U = {1, . . . , S}Z (thought of as its distribution function). We also have a transformation T on U , called the left shift, which simply maps an infinite sequence 1 unit to the left. The fact that the process is stationary means that T preserves the measure µ in that for all events A, µ(T A) = µ(A). 12 Next, letting gk (w) = − ln p(X1 |X0 , . . . , X−k+1 ), (with g0 (w) = − ln p(X1 )), we get immediately that − X ln(p(X1 (w), X2 (w), . . . , Xn (w))) 1 n−1 = gj (T j (w)). n n 0 (Note that w here is an infinite sequence in our space X.) 
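To make the objects just defined concrete, here is a small numerical sketch in Python for the Markov case treated above, where $g_j$ stabilizes already at $j = 1$: along a simulated path, $-\ln p(X_1,\ldots,X_n)/n$ is exactly the average $\frac{1}{n}\sum_{j=0}^{n-1} g_j(T^j w)$, and it settles down to $\int g = E[-\ln p(X_1|X_0)]$, the entropy of the chain. The two-state transition matrix is an arbitrary choice.

```python
import math, random

P  = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}   # arbitrary transition matrix
pi = {0: 0.6, 1: 0.4}                             # its stationary distribution

# For a Markov chain the limit g = -ln P(X_1 | X_0, ...) depends only on (X_0, X_1),
# and its integral E[-ln p(X_1 | X_0)] is the entropy of the chain (exercise in Section 1).
H = -sum(pi[i] * P[i][j] * math.log(P[i][j]) for i in P for j in P[i])

rng = random.Random(1)
n = 100_000
x = 0 if rng.random() < pi[0] else 1
log_p = math.log(pi[x])                            # ln p(X_1), i.e. the g_0 term
for _ in range(n - 1):
    y = 0 if rng.random() < P[x][0] else 1
    log_p += math.log(P[x][y])                     # ln p(X_{j+1} | X_j), i.e. the g_j(T^j w) term for j >= 1
    x = y

print(-log_p / n, H)    # SMB: these agree up to random fluctuation for large n
```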
We are now in the set-up of the previous proposition although of course a number of things need to be verified, the first being the convergence of the gk ’s to something. Note that for any i in our alphabet {1, . . . , S}, gk (w) on the cylinder set X1 = i is simply − ln P (X1 = i|X0 , . . . , X−k+1 ) which (here we use a second nontrivial result) converges by the Martingale convergence theorem (and the continuity of ln) to − ln P (X1 = i|X0 , . . .). We therefore get that gk (w) → g(w) where g(w) = − ln P (X1 = i|X0 , . . .) on the cylinder set X1 = i, i.e., g(w) = − ln P (X1 |X0 , . . .). R If we can show that supk |gk (w)| < ∞, the previous proposition will tell us that − ln(p(X1 , X2 , . . . , Xn )) → n Z g a.e. and in L1 . Since, by definition, E[− ln(p(X1 ,Xn2 ,...,Xn )) ] → h({Xi }) by definition of entropy, it follows that R g = h({Xi }) and we would be done. EXERCISE: Show directly that E[g(w)] is h({Xi }). We now proceed with verifying R supk |gk (w)| < ∞. Fix λ > 0. We show that P (supk gk > λ) ≤ se−λ (where s is the alphabet size) which implies the desired integrability. P (supk gk > λ) = P k P (Ek ) where Ek is the set where the first time gj gets larger than λ is k, i.e., Ek = {gj ≤ λ, j = 0, . . . k − 1, gk > λ}. These 13 are obviously disjoint for different k and make up the set above. Now, P (Ek ) = P i P ((X1 = i) ∩ Ek ) = P i P ((X1 = i) ∩ Fki ) where Fki is the set where the first time fji gets larger than λ is k where fji = − ln p(X1 = i|X0 , . . . , X−j+1 ). Since Fki is X−k+1 , . . . , X0 –measurable, we have P ((X1 = i) ∩ Z = Fki ) Z = P (X1 = i|X−k+1 , . . . , X0 ) Fki i e−fk ≤ e−λ P (Fki ). Fki Since the Fki are disjoint for different k (but not for different i), P k P (Ek ) ≤ P P i k e−λ P (Fki ) ≤ se−λ . 2 We finally mention a couple of things in the higher–dimensional case, that is, where we have a stationary random field where entropy can be defined in a completely analogous way. It was unknown for some time if a (pointwise, that is, a.s.) SMB theorem could be obtained in this case, the main obstacle being that it was known that a natural candidate for a multidimensional martingale convergence theorem was in fact false and so one could not proceed as we did above in the 1-d case. (It was however known that this theorem held in “mean” (i.e., in L1 ), see below.) However, recently, (in 1983) Ornstein and Weiss managed to get a pointwise SMB not only for the multidimensional lattice but in the more general setting of something called amenable groups. (They used a method which circumvents the Martingale Convergence Theorem). The result of Ornstein and Weiss is difficult and so we don’t present it. Instead we present the Shannon–McMillan Theorem (note, no McMillan here) which is the L1 convergence in the SMB Theorem. Of course, this is a 14 weaker result than we just proved but the point of presenting it is two-fold, first to see how much easier it is than the pointwise version and also to see a result which generalizes (relatively) easily to higher dimensions (this higher dimensional variant being left to the reader). Proposition 2.2 (Mean SMB Theorem or MB Theorem): The SMB theorem holds in L1 . Proof Let gj be defined as above. The Martingale convergence theorem implies that gj → g∞ in L1 as n → ∞. We then have (with kk denoting the L1 norm), k− X ln(p(X1 , X2 , . . . , Xn )) 1 n−1 gj (T j (w)) − Hk ≤ − Hk = k n n 0 X X 1 n−1 1 n−1 j j k |gj (T (w)) − g∞ (T (w))|k + k g∞ (T j (w)) − Hk. 
n 0 n 0 The first term goes to 0 by the L1 convergence in the Martingale Convergence Theorem (we don’t of course always get L1 convergence in the Martingale Convergence Theorem but we have so in this setting). The second term goes to 0 in L1 by the ergodic theorem together with the fact that R g∞ = H, an easy computation left to the reader. 2 3 Universal Entropy Estimation? This section simply raises an interesting point which will be delved into further later on. One can consider the problem of trying to estimate the entropy of a process {Xn }∞ n=−∞ after we have only seen X1 , X2 , . . . , Xn in such a way 15 that these estimates converge to the true entropy with prob 1. (As above, we know that to have any chance of doing this, one needs to assume ergodicity). Obviously, one could simply take the estimate to be − ln(p(X1 , X2 , . . . , Xn )) n which will (by SMB) converge to the entropy. However, this is clearly undesirable in the sense that in order to use this estimate, we need to know what the distribution of the process is. It certainly would be preferable to find some estimation procedure which does not depend on the distribution since we might not know it. We therefore want a family of functions hn : S n → R (hn (x1 , . . . , xn ) would be our guess of the entropy of the process if we see the data x1 , . . . , xn ) such that for any given ergodic process {Xn }∞ n=−∞ , lim hn (X1 , . . . , Xn ) = h({Xn }∞ n=−∞ ) a.s. . n→∞ (Clearly the suggestion earlier is not of this form since hn depends on the process). We call such a family of functions a “universal entropy estimator”, universal since it holds for all ergodic processes. There is no immediate obvious reason at all why such a thing exists and there are similar types of things which don’t in fact exist. (If entropy were a continuous function defined on processes (the latter being give the usual weak topology), a general theorem (see §9) would tell us that such a thing exists BUT entropy is NOT continuous.) However, as it turns out, a universal entropy estimator does in fact exist. One such universal entropy estimator comes from an algorithm called the “Lempel-Ziv algorithm” which was later improved by Wyner-Ziv. We will see that this algorithm also turns out to be a universal data compressor, which we will describe later. (This method is used, as far as I understand it, in the real world). 16 4 20 questions and Relative Entropy We have all heard problems of the following sort. You have 8 coins one of which has a different weight than all the others which have the same weight. You are given a scale (which can determine which of two given quantities is heavier) and your job is to identify the “bad” coin. How many weighings do you need? Later we will do something more general but it turns out the problem of how many questions one needs to ask to determine something has an extremely simple answer which is given by something called Kraft’s Theorem. The answer will not at all be surprising and will be exactly what you would guess, namely if you ask questions which have D possible answers, then you will need lnD (n) questions where n is the total number of possibilities. (If the latter is not an integer, you need to take the integer right above it). Definitin 4.1: A mapping C from a finite set S to finite words with alphabet {1, . . . 
, D} will be called a prefix-free D–code on S if for any x, y ∈ S, C(x) is not a prefix of C(y) (which implies of course that C is injective) Theorem 4.2 (Kraft’s Inequality): Let C be a prefix-free D–code defined on S. Let {`1 , . . . `|S| } be the lengths of the code words of C. Then |S| X D−`i ≤ 1. i=1 Conversely, if {`1 , . . . `n } are integers satisfying n X D−`i ≤ 1, i=1 then there exists a prefix-free D–code on a set with n elements whose code words have length {`1 , . . . `n }. 17 This turns out to be quite easy to prove so try to figure it out on your own. I’ll present it in class but won’t write it here. Corollary 4.3: There is a prefix–free D–code on a set of n elements whose maximum code word has length dlnD (n)e while there is no such code whose maximum code word is < dlnD (n)e. I claim that this corollary implies that the number of questions with D possible answers one needs to ask to determine something with n possibilities is lnD (n) (where if this is not an integer, it is taken to be the integer right above it). To see this one just needs to think a little and realize that asking questions until we get the right answer is exactly a prefix–free D–code. (For example, if you have the code first, your first question would be “What is the first number in the codeword assigned to x?” which of course has D possible answers.) Actually, going back to the weighing question, we did not really answer that question at all. We understand the situation when we can ask any D–ary question (i.e., any partition of the outcome space into D pieces) but with questions like the weighing question, while any weighing breaks the outcome space into 3 pieces, we are physically limited (I assume, I have not thought about it) from constructing all possible partitions of the outcome space into 3 pieces. So this investigation, while interesting and useful, did not allow us to answer the original question at all but we end there nonetheless. Now we want to move on and do something more general. Let’s say that X is a random variable taking on finitely many values X with probabilities p1 , . . . , pn (so |X | = n). Now we want a successful guessing scheme (i.e., a prefix–free code) where the expected number of questions we need to ask 18 (rather than the maximum number of questions) is not too large. The following theorem answers this question and more. For simplicity, we talk about prefix–free codes. Theorem 4.4: Let C be a prefix–free D–code on X and N the length of the codeword assigned to X (which is a r.v.). Then HD (X) ≤ E[N ] where HD (X) is the D–entropy of X given by − Pk i=1 pi lnD pi . Conversely, there exists a prefix–free D–code satisfying E[N ] < HD (X) + 1. (We will prove this below but must first do a little more development.) EXERCISE: Relate this result to Corollary 4.3. An important concept in information theory which is useful for many purposes and which will allow us to prove the above is the concept of relative entropy. This concept comes up in many other places–below we will explain three other places where it arises which are respectively (1) If I construct a prefix–free code satisfying E[N ] < HD (X) + 1 under the assumption that the true distribution of X is q1 , . . . , qn but it turns out the true distribution is p1 , . . . , pn , how bad will E[N ] be? (2) (level 2) large deviation theory for the Glivenko–Cantelli Theorem (which is called Sanov’s Theorem) (3) universal data compression for i.i.d. processes. 
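Before going on, here is a small Python sketch tying Kraft's Inequality and Theorem 4.4 together: for a made-up distribution, the lengths $\ell_i = \lceil \log_D(1/p_i) \rceil$ (the choice that will be used below to prove the upper bound) satisfy Kraft's inequality, so a prefix-free code with these lengths exists, and its expected length lands between $H_D(X)$ and $H_D(X)+1$.

```python
import math

def kraft_sum(lengths, D=2):
    """Left-hand side of Kraft's inequality: sum_i D^{-l_i}."""
    return sum(D ** (-l) for l in lengths)

def shannon_lengths(probs, D=2):
    """Code word lengths l_i = ceil(log_D(1/p_i))."""
    return [math.ceil(math.log(1 / p, D)) for p in probs]

def entropy_D(probs, D=2):
    return -sum(p * math.log(p, D) for p in probs)

p = [0.4, 0.3, 0.2, 0.1]              # a made-up distribution on a 4-element set
lengths = shannon_lengths(p)
expected_N = sum(pi * li for pi, li in zip(p, lengths))

print(lengths, kraft_sum(lengths) <= 1)              # Kraft holds, so such a prefix-free code exists
print(entropy_D(p), expected_N, entropy_D(p) + 1)    # H_D(X) <= E[N] < H_D(X) + 1
```

(When the $p_i$ happen to be exact powers of $D$, the rounding does nothing and $E[N] = H_D(X)$ exactly.)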
The first 1 of these will be done in this section, the 2nd in the §5 while the last will be done in §6. 19 Definition 4.5: If p = p1 , . . . , pn and q = q1 , . . . , qn are two probability distributions on X, the relative entropy or Kullback Leiber distance between p and q is n X pi ln i=1 pi . qi Note that if for some i, pi > 0 and qi = 0, then the relative entropy is ∞. One should think of this as a metric BUT it’s not-it’s neither symmetric nor satisfies the triangle inequality. One actually needs to specify the base of the logarithm in this definition. The first thing we need is the following whose proof is an easy application of Jensen’s inequality which is left to the reader. Theorem 4.6: Relative entropy (no matter which base we use) is nonnegative and 0 if and only if the two distributions are the same. Proof of Theorem 4.4: Let C be a prefix–free code with `i denoting the length of the ith codeword and N denoting the length of the codeword as a r.v.. Simple manipulation gives E[N ] − HD (X) = X i pi lnD ( pi D−`i / X P iD −`i ) − lnD ( D−`i ). i The first sum is nonnegative since it’s a relative entropy and the the second term is nonnegative by Kraft’s Inequality. To show there is a prefix–free code satisfying E[N ] < HD (X) + 1, we simply construct it. Let `i = dlnD (1/pi )e. A trivial computation shows that the `i ’s satisfy the conditions of Kraft’s Inequality and hence there is a prefix–free D code which assigns the ith account a codeword of length `i . A trivial calculation shows that E[N ] < HD (X) + 1. 2 20 How did we know to choose `i = dlnD (1/pi )e? If the integers `i were allowed to be nonintegers, Lagrange Multipliers show that the `i ’s minimizing E[N ] subject to the contraint of Kraft’s inequality are `i = lnD (1/pi ). There is an easy corollary to Theorem 4.4 whose proof is left to the reader, which we now give. Corollary 4.7: The mimimum expected codeword length L∗n per symbol satisfies H(X1 , . . . , Xn ) H(X1 , . . . , Xn ) 1 ≤ L∗n ≤ + . n n n Moreover, if X1 , . . . is a stationary process with entropy H, then L∗n → H. This particular prefix–free code, called the Shannon Code is not necessarily optimal (although it’s not bad). There is an optimal code called the Huffman code which I will not discuss. It turns out also that the Shannon Code is close to optimal in a certain sense (and in a better sense than that the mean code length is at most 1 from optimal which we know). We now discuss our first application of relative entropy (besides our using it in the proof of Theorem 4.4). We mentioned the following question above. If I contruct the Shannon prefix–free code satisfying E[N ] < HD (X) + 1 under the assumption that the true distribution of X is q1 , . . . , qn but it turns out the true distribution is p1 , . . . , pn , how bad will E[N ] be? The answer is given by the following theorem. Theorem 4.8: The expected length Ep [N ] under p(x) of the Shannon code 21 `i = dlnD (1/qi )e satisfies HD (p) + D(pkq) ≤ Ep [N ] < HD (p) + D(pkq) + 1. Proof: Trivial calculation left to the reader. 2 5 Sanov’s Theorem: another application of relative entropy In this section, we give an important application (or interpretation) of relative entropy, namely (level 2) large deviation theory. Before doing this, let me remind you (or tell you) what (level 1) large deviation theory is (this is the last section of the first chapter in Durrett’s probability book). (Level 1) large deviation theory simply says the following modulo the technical assumptions. Let Xi be i.i.d. 
with finite mean m and have a moment generating function defined in some neighborhood of the origin. First, the WLLN tells us Pn P (| i=1 Xi n − m| ≥ ) → 0 as n → ∞ for any . (For this, we of course do not need the exponential moment). (Level 1) large deviation theory tells us how fast the above goes to 0 and the answer is, it goes to 0 exponentially fast and more precisely, it goes like e−f ()n where f () > 0 is the so-called Frechel transform of the logarithm of the moment generating function given by f () = max(θ − ln(E[eθX ])). θ 22 Level two large deviation theory deals with a similar question but at the level of the empirical distribution function. Let Xi be i.i.d. again and let Fn be the empirical distribution of X1 , . . . , Xn . Glivenko–Cantelli tells us that Fn → F a.s. where F is the the common distribution of the Xi ’s. If C is a closed set of measures not containing F , it follows (we’re now doing weak convergence of probability measures on the space of probability measures on R so you have to think carefully) that P (Fn ∈ C) → 0 and we want to know how fast. The answer is 1 lim − ln(P (Fn ∈ C)) = min D(P kF ). n→∞ n P ∈C The above is called Sanov’s Theorem which we now prove. There will be a fair development before getting to this which I feel will be useful and worth doing (and so this might not be the quickest route to Sanov’s Theorem). All of this material is taken from Chapter 12 of Cover and Thomas’s book. If x = x1 , . . . xn is a sample from an i.i.d. process with distribution Q, then Px , which we call the type of x, is defined to be the empirical distribution (which I will assume you know) of x. Definition 5.1: We will let Pn denote the set of all types (i.e., empirical distributions) from a sample of size n and given P ∈ Pn , we let T (P ) be all x = x1 , . . . xn (i.e., all samples) which have type (empirical distribution) P . EXERCISE: If Q has its support on {1, . . . , 7}, find |T (P )| where P is the type of 11321. 23 Lemma 5.2: |Pn | ≤ (n + 1)|X | where X is the set of possible values are process can take on. This lemma, while important, is trivial and left to the reader. Lemma 5.3: Let X1 , . . . be i.i.d. according to Q and let Px denote the type of x. Then Qn (x) = 2−n(H(Px )+D(Px ||Q)) and so the Qn probability of x depends only on its type. Proof: Trivial calculation (although a few lines) left to the reader. 2 Lemma 5.4: Let X1 , . . . be i.i.d. according to Q. If x is of type Q, then Qn (x) = 2−nH(Q) . Our next result gives us an estimate of the size of a type class T (P ). Lemma 5.5: 1 2nH(P ) ≤ |T (P )| ≤ 2nH(P ) . (n + 1)|X | Proof: One method is to write down the exact cardinality of the set (using elementary combinatorics) and then apply Stirling’s formula. We however proceed differently. The upper bound is trivial. Since each x ∈ T (P ) has (by the previous corollary) under P , probability 2−nH(P ) , there can be at most 2nH(P ) elements in T (P ). For the lower bound, the main step is is to show that type class P has the largest probability of all type classes (note however, that an element from 24 type class P does not necessarily have a higher probability than elements from other type classes as you should check), i.e., P n (T (P )) ≥ P n (T (P 0 )) for all P 0 . I leave this to you (although it takes some work, see the book if you want). Then we argue as follows. Since there are at most (n + 1)|X | type classes, and T (P ) has the largest probability, its P n –probability must be at least 1 . 
(n+1)|X | Since each element of this type class has P n –probability 2−nH(P ) , we obtain the lower bound. 2 Combining Lemmas 5.3 and 5.5, we immediately obtain Corollary 5.6: 1 2−nD(P ||Q) ≤ Qn (T (P )) ≤ 2−nD(P ||Q) . (n + 1)|X | Remark: The above tells us that if we flip n fair coins, the probability that we get exactly 50% heads, while going to 0 as n → ∞, does not go to 0 exponentially fast. The above discussion has set us up for an easy proof of the Glivenko–Cantelli Theorem (analogous to the fact that once one has level 1 large deviation theory set up, the Strong Law of Large Numbers follows trivially). The key that makes the whole thing work is the fact (Corollary 5.6) that the probability under Q that a certain type class (different from Q) arises is exponentially small (the exponent being given by the relative entropy) and the fact that there are only polynomially many type classes. 25 Theorem 5.7: Let X1 , . . . be i.i.d. according to P . Then P (D(Pxn ||P ) > ) ≤ 2−n(−|X | ln(n+1) ) n and hence (by Borel Cantelli) D(Pxn ||P ) → 0 a.s. . Proof: Using 2−nD(Q||P ) ≤ X P n (D(Pxn ||P ) > ) ≤ Q:D(Q||P )≥ (n + 1)|X | 2−n = 2−n(−|X | ln(n+1) ) n . 2 We are now ready to state Sanov’s Theorem. We are doing everything under the assumption that the underlying distribution has finite support (i.e., there are only a finite number of values our r.v. can take on). One can get rid of this assumption but we stick to it here since it gets rid of some technicalities. We can view the set of all prob. measures (call it P) on our finite outcome space (of size |X |) as a subspace of R|X | (or even of R|X |−1 ). (In the former case, it would be simplex in the positive cone.) This immediately gives us a nice topology on these measures (which of course is nothing but the weak topology in a simple setting). Assuming that P gives every x ∈ X positive measure (if not, change X ), note that D(Q||P ) is a continuous function of Q on P and hence on any closed (and hence compact) set E in P, this function assumes a minimum (which is not 0 if P 6∈ E). 26 Theorem 5.8 (Sanov’s Theorem): Let X1 , . . . be i.i.d. with distribution P and E be a closed set in P. Then P n (E)(= P n (E ∩ Pn )) ≤ (n + 1)|X | 2−nf (E) where f (E) = minQ∈E D(Q||P ). (P n (E) means of course P n ({x : Px ∈ E}).) If, in addition, E is the closure of its interior, then the above exponential upper bound also gives a lower bound in that lim n→∞ 1 ln P n (E) = −f (E). n Proof: The derivation of the upper bound follows easily from Theorem 5.7 which we leave to the reader. The lower bound is not much harder but a little care is needed (as should be clear from the topological assumption on E). We need to be able to find guys in E ∩Pn . Since the Pn ’s become “more and more” dense as n gets large, it is easy to see (why?) that E ∩ Pn 6= ∅ for large n (this just uses the fact that E has nonempty interior) and that there exist Qn ∈ E ∩ Pn with Qn → Q∗ where Q∗ is some distribution which minimizes D(Q||P ) over Q ∈ E. This immediately give that D(Qn ||P ) → D(Q∗ ||P )(= f (E)). We now have P n (E) ≥ P n (T (Qn )) ≥ 1 2−nD(Qn ||P ) (n + 1)|X | from which one immediately concludes that lim inf n 1 ln P n (E) ≥ lim inf −D(Qn ||P ) = −f (E). n n 27 The first part gives this as an upper bound and so we’re done. 2 Note that the proof goes through if one of the Q∗ ’s which minimize D(Q||P ) over Q ∈ E (of which there must be at least one due to compactness) is in the closure of the interior of E. 
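Sanov's Theorem can be seen numerically even for a coin. In the Python sketch below, $P$ is a fair coin and $E = \{Q : Q(\text{heads}) \ge 0.7\}$ (the threshold 0.7 is an arbitrary choice). This $E$ is closed and is the closure of its interior, the minimizing $Q^*$ has $Q^*(\text{heads}) = 0.7$, and $-\frac{1}{n}\log_2 P^n(E)$ indeed creeps toward $f(E) = D(Q^*||P)$, slowly because of the polynomial factor $(n+1)^{|\mathcal{X}|}$.

```python
import math

def D(p, q):
    """Relative entropy in bits between Bernoulli(p) and Bernoulli(q)."""
    return sum(a * math.log2(a / b)
               for a, b in ((p, q), (1 - p, 1 - q)) if a > 0)

p, thresh = 0.5, 0.7                 # fair coin; E = {Q : Q(heads) >= 0.7}
f_E = D(thresh, p)                   # min_{Q in E} D(Q||P), attained at Q(heads) = 0.7

for n in (50, 200, 800):
    # Exact probability that the empirical type of n tosses lands in E.
    prob = sum(math.comb(n, k) * p ** n for k in range(math.ceil(thresh * n), n + 1))
    print(n, -math.log2(prob) / n, f_E)    # the exponent slowly approaches f(E)
```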
It is in fact also the case (although not att all obvious and related to other important things) that if E is also convex, then there is a unique minimizing Q∗ and so there is a “Hilbert space like” picture here. (If E is not convex, it is easy to see that there is not necessarily a unique minimizing guy). The fact that there is a unique minimizing guy (in the convex case) is suggested by (by not implied by) the convexity of relative entropy in its two arguments. Hard Exercise: Recover the level 1 large deviation theory from Sanov’s Theorem using Lagrange multipliers. 6 Universal Entropy Estimation and Data Compression We first use relative entropy to prove the possibility of universal data compression for i.i.d. processes. Later on, we will do the much more difficult universal data compression for general ergodic soures, the so–called ZivLempel algorithm which is used in the real world (as far as I understand it). Of course, from a practical point of view, the results of Chapter 1 are useless since typically one does NOT know the process that one is observing and so one wants a “universal” way to compress data. We will now do this for i.i.d. processes where it is quite easy using the things we derived in §5. We have seen previously in §4 that if we know the distribution of a r.v. x, we can encode x by assigning it a word which has length − ln(p(x)). We 28 have also seen if the distribution is really q, we get a penalty of D(p||q). We now find a code “of rate R” which suffices for all i.i.d. processes of entropy less than R. Definition 6.1: A fixed rate block code of rate R for a process X1 , . . . with known state space X consists of two mappings which are the encoder, fn : X n → {1, . . . , 2nR } and a decoder φn : {1, . . . , 2nR } → X n . We let Pe(n,Q) = Qn (φn (fn (X1 , . . . , Xn )) 6= (X1 , . . . , Xn )) be the probability of an error in this coding scheme under the assumption that Q is the true distribution of the xi ’s. Theorem 6.2: There exists a fixed rate block code of rate R such that for all Q with H(Q) < R, Pe(n,Q) → 0 as n → ∞. Proof: Let Rn = R − |X | ln(n+1) and let An = {x ∈ X n : H(Px ) ≤ Rn }. n Then |An | = X |T (P )| ≤ P ∈Pn ,H(P )≤Rn (n + 1)|X | 2nRn = 2nR . 29 Now we index the elements of An in an arbitrary way and we let fn (x) = index of x in An if x ∈ An and 0 otherwise. The decoding map φn takes of course the index into corresponding word. Clearly, we obtain an error (at the nth stage) if and only if x 6∈ An . We then get Pe(n,Q) = 1 − Qn (An ) = X Qn (T (P )) ≤ P :H(P )>Rn X 2−nD(P ||Q) ≤ (n + 1)|X | 2−n minP :H(P )>Rn D(P ||Q) . P :H(P )>Rn Now Rn → R and H(Q) < R and so Rn > H(Q) for large n and so (n,Q) the exponent minP :H(P )>Rn D(P ||Q) is bounded away from 0 and so Pe goes exponentially to 0 for fixed Q. Actually, this last argument was a little sloppy and one should be a little more careful with the details. One first should note that compactness together with the fact that D(P ||Q) is 0 only when P = Q implies that if we stay away from any neighborhood of Q, D(P ||Q) is bounded away from 0. Then note that if Rn > H(Q), the continuity of the entropy function tells us that {P : H(P ) > Rn } misses some neighborhood of Q and hence minP :H(P )>Rn D(P ||Q) > 0 and clearly this is increasing in n (look at the derivative of ln x x ). 2 Remarks: (1) Technically, 0 is not an allowable image. Fix this (it’s trivial). 
(2) For fixed Q, since the above probability goes to exponentially fast, Borel– Cantelli tells us the coding scheme will work eventually forever (i.e., we have an a.s. statement rather than just an “in probability” statement). (3) There is no uniformity in Q but there is a uniformity in Q for H(Q) ≤ R − . (Show this). (4) For Q with H(Q) > R, the above guessing scheme works with probability going to 0. Could there be a different guessing scheme which works? 30 The above theorem produces for us a “universal data compressor” for i.i.d. sequences. Exercise: Construct an entropy estimator which is universal for i.i.d. sequences. We now construct a “universal entropy estimator” which is universal for all ergodic processes. We will now construct something which is not techniquely a “universal entropy estimator” according to the definition given in §2 but seems analogous and can easily be modified to be one. We will later discuss how this procedure is related to a universal data compressor. Theoreom 6.3: Let (x1 , x2 , . . .) be an infinite word and let Rn (x) = min{j ≥ n : x1 , x2 , . . . , xn = xj+1 xj+2 . . . xj+n } (which is the first time after n when the first n–block repeats itself). Then lim n→∞ log Rn (X1 , X2 , . . .) = Ha.s. n where H is the entropy of the process. Rather then writing out the proof of that, we will read the paper “Entropy and Data Compression Schemes” by Ornstein and Weiss. This paper will be much more demanding than these notes have been up to now. 7 Shannon’s Theorems on Capacity In this section, we discuss and prove the fundamental theorem of channel capacity due to Shannon in his groundbreaking work in the field. Before going into the mathematics, we should first give a small discussion. Assume that we have a channel to which one can input a 0 or a 1 and which outputs a 0 or 1 from which we want to recover the input. Let’s assume 31 that the channel transmits the signal correctly with probability 1 − p and corrupts it (and therefore sends the other value) with probability p. Assume that p < 1/2 (if p = 1/2, the channel clearly would be completely useless). Based on the output we receive, our best guess at the input would of course be the output in which case the probability of making an error would be p. If we want to decrease the probability of making an error, we could do the following. We could simply send our input through the channel 3 times and then based on the 3 outputs, we would guess the input to be that output which occurred the most times. This coding/decoding scheme will work as long as the channel does not make 2 or more errors and the probability of this is (for small p) much much smaller (≈ p2 ) than if we were to send just 1 bit through. Great! The probability is much smaller now BUT there is a price to pay for this. Since we have to send 3 bits through the channel to have one input bit guess, the “rate of transmission” has suddenly dropped to 1/3. Anyway, if we are still dissatisfied with the probability of this scheme making a error, we could send the bit through 5 times (using an analogous majority rule decoding scheme) thereby decreasing the probability of a decoding mistake much more (≈ p3 ) but then the transmission rate would go even lower to 1/5. It seems that in order to achieve the probability of error going to 0, we need to drop the rate of transmission closer and closer to 0. The amazing discovery of Shannon is that this very reasonable statement is simply false. 
It turns out that if we are simply willing to use a rate of transmission which is less than “channel capacity” (which is simply some number depending only on the channel (in our example, this number is strictly positive as long as p 6= 1 2 which is reasonable)), then as long as we send long strings of input, we can transmit them with arbitrarily low 32 probability of error. In particular, the rate need not go to 0 to achieve arbitrary reliability. We now enter the mathematics which explains more precisely what the above is all about and proves it. The formal definition of capacity of a channel is based on a concept called mutual information of two r.v.’s which is denoted by I(X; Y ). Definition 7.1: I(X; Y ), called the mutual information of X and Y , is the relative entropy of the joint distribution of X and Y and the product distribution of X and Y . Note that I(X; Y ) and I(Y ; X) are the same and that they are not infinite (although recall in general relative entropy can be infinite). Definition 7.2: The conditional entropy of Y given X, denoted H(Y |X) is defined to be P x P (X = x)H(Y |X = x). Proposition 7.3: (1)H(X, Y ) = H(X) + H(Y |X) (2) I(X; Y ) = H(X) − H(X|Y ). The proof of these are left to the reader. There is again no idea involved, simply computation. Definition 7.4: A discrete memoryless channel (DMC) is a system which takes as input elements of an input alphabet X and prints out elements from an output alphabet Y randomly according to some p(y|x) (i.e., if x is sent in, then y comes out with probability p(y|x)). Of course the numbers p(y|x) are nonnegative and for fixed x give 1 if we sum over y. (Such a thing is sometimes called a kernel in probability theory and could be viewed as 33 a markov transition matrix if the sets X and Y were the same (which they need not be)). Definition 7.5: The information channel capacity of a discrete memoryless channel is C = max I(X; Y ). p(x) Exercise: Compute this for different examples like X = Y = {0, 1} and the channel being p(x|x) = p for all x. We assume now that if we send a sequence through the channel, the outputs are all independently chosen from p(y|x) (i.e., the outputs occur independently, an assumption you might object to). Definition 7.6: An (M, n) code C consists of the following: 1. An index set {1, . . . , M }. 2. An encoding function X n : {1, . . . , M } → X n which yields the codewords X n (1), . . . , X n (M ) which is called the codebook. 3. A decoding function g : Y n → {1, . . . , M }. For i ∈ {1, . . . , n}, we let λni be the probability of an error if we encode and send i over the channel, that is P (g(Y n ) 6= i) when i is encoded and sent over the channel. We also let λn = maxi λni be the maximal error and Pen = 1 M n i λi P be the average probability of error. We attach a superscript C to all of these if the code is not clear from context. Exercise: Given any DMC, find an (M, n) code such that λn1 = 0. Definition 7.7: The rate of an (M, n) code is 34 ln M n (where we use base 2). We now arrive at one of the most important theorems in the field (as far as I understand it), which says in words that all rates below channel capacity can be achieved with arbitrarily small probability of error. Theorem 7.8 (Shannon’s Theorem on Capacity): All rates below capacity C are achievable in that given R < C, there exists a sequence of (2nR , n)–codes with maximum probability λn → 0. Conversely if there exists a sequence of (2nR , n)–codes with maximum probability λn → 0, then R ≤ C. 
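Going back to the exercise above, the information channel capacity of a binary symmetric channel is easy to get at numerically: fix the channel $p(y|x)$, compute $I(X;Y) = H(Y) - H(Y|X)$ for each input distribution $p(x)$, and maximize. A Python sketch (the crossover probability 0.1 is an arbitrary choice):

```python
import math

def H(probs):
    """Entropy in bits of a finite distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(px, channel):
    """I(X;Y) = H(Y) - H(Y|X) in bits; channel[x][y] = p(y|x)."""
    ys = {y for x in channel for y in channel[x]}
    py = {y: sum(px[x] * channel[x].get(y, 0.0) for x in px) for y in ys}
    return H(py.values()) - sum(px[x] * H(channel[x].values()) for x in px)

eps = 0.1                                                # arbitrary crossover probability
bsc = {0: {0: 1 - eps, 1: eps}, 1: {0: eps, 1: 1 - eps}}

# Crude search over input distributions p(x) = (a, 1 - a).
best = max(((mutual_information({0: a, 1: 1 - a}, bsc), a)
            for a in (i / 1000 for i in range(1, 1000))), key=lambda t: t[0])
print(best)                        # maximized near a = 1/2
print(1 - H([eps, 1 - eps]))       # ... giving 1 - H(eps) bits
```

(The maximum sits at the uniform input and equals $1 - H(\epsilon)$ bits, the familiar closed form for this channel.)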
Before presenting the proof of this result, we make a few remarks which should be of help to the reader. In the definition of an (M, n)–code, we have this set {1, . . . , M } and an encoding function X n mapping this set into X n . This way of defining a code is not so essential. The only thing that really matters is what the image of X n is and so one could really have defined an (M, n)–code to be a subset of X n consisting of M elements together with the decoding function g : Y n → X n . Formally, the only reason these definitions are not EXACTLY the same is that X n might not be injective which is certainly a property you want your code to have anyway. The point is one should also think of a code in this way. Proof: The proof of this is quite long. While the mathematics is all very simple, conceptually it is a little more difficult. The method is a randomization method where we choose a code “at random” (according to some distribution) and show that with high probability it will be a good code. The fact that we are choosing the code according to a special distribution will make the computation of this probability tractable. The key concept in the proof is the notion of “joint typicality”. The 35 SMB (for i.i.d.) tells us that a typical sequence x1 , . . . , xn from an ergodic stationary process has − n1 ln(p(x1 , . . . , xn )) being close to the entropy H(X). If we have instead independent samples from a joint distribution p(x, y), we want to know what x1 , y1 , . . . , xn , yn typically looks like and this gives us the concept of joint typicality. An important point is that x1 , . . . , xn and y1 , . . . , yn can be individually typical without being jointly typical. We now give a completely heuristic proof of the theorem. Given the output, we will choose as our guess for the input something (which encodes to something) which is jointly typical with the output. Since the true input and the output will be jointly typical with high probability, there will with high probability be some input which is jointly typical with the output, namely the true input. The problem is maybe there is more than 1 possible input sequence which is jointly typical with the output. (If there is only one, there is no problem.) The probability that a possible input sequence which was not input to the output is in fact jointly typical with the output sequence should be more or less the probability that a possible input and output sequence chosen independently are jointly typical which is 2−nI(X;Y ) by a trivial computation. Hence if we use 2nR with R < I(X; Y ) different possible input sequences, then, with high probability, none of the possible input sequences (other than the true one) will be jointly typical with the output and we will decode correctly. We need the following lemma whose proof is left to the reader. It is easy to prove using the methods in §1 when we looked at consequences of the SMB Theorem. An arbitrary element of X n x1 , . . . , xn will be denoted by xn below. Lemma 7.9: Let p(x, y) be some joint distribution. Let An be the set of 36 –jointly typical sequences (with respect to p(x, y)) of length n, namely, An = {(xn , y n ) ∈ X n × Y n : | − |− 1 p(xn ) − H(X)| < , n 1 1 p(y n ) − H(Y )| < , and | − p(xn , y n ) − H(X, Y )| < } n n where p(xn , y n ) = Q i p(xi , yi ). Then if (X n , Y n ) is drawn i.i.d. according to p(x, y), then 1. P ((X n , Y n ) ∈ An ) → 1 as n → ∞. 2. |An | ≤ 2n(H(X,Y )+) . 3. If (X̃ n , Ỹ n ) is chosen i.i.d. 
according to p(x)p(y), then P ((X̃ n , Ỹ n ) ∈ An ) ≤ 2−n(I(X;Y )−3) and for sufficiently large n P ((X̃ n , Ỹ n ) ∈ An ) ≥ (1 − )2−n(I(X;Y )+3) . We now proceed with the proof. Let R < C. We now construct a sequence of (2nR , n)–codes (2nR means b2nR c here). whose maximum probability of error λ(n) goes to 0. We will first prove the existence of a sequence of codes which have a small average error (in the limit) and then modify it to get something with small maximal error (in the limit). By the definition of capacity, we can choose p(x) so that R < I(X; Y ) and then we can choose so that R + 3 < I(X; Y ). Let Cn be the set of all encoding functions X n : {1, 2, . . . , 2nR } → X n . Our encoding function will be chosen randomly according to p(x) in that each i ∈ {1, 2, . . . , 2nR } is independently sent to xn where xn is chosen according to Qn i=1 p(x). Once we have chosen some encoding function X n , 37 the decoding procedure is as follows. If we receive y n , our decoder guesses that the word W ∈ {1, 2, . . . , 2nR } has been sent if (1) X n (W ) and y n are – jointly typical and (2) there is no other Ŵ ∈ {1, 2, . . . , 2nR } with X n (Ŵ ) and y n being –jointly typical. We also let Cn denote the set of codes obtained by considering all encoding functions and their resulting codes. Let E n denote the event that a mistake is made when W ∈ {1, 2, . . . , 2nR } is chosen uniformly. We first show that P (E n ) → 0 as n → ∞. X P (E n ) = P (C)Pe(n),C = C∈Cn nR X P (C) C∈Cn By symmetry above is P C∈Cn P 1 2X 2nR C∈Cn nR λw (C) = w=1 1 2X X 2nR P (C)λw (C). w=1 C∈Cn P (C)λw (C) does not depend on w and hence the P (C)λ1 (C) = P (E n |W = 1). One should easily see directly anyway that symmetry gives P (E n ) = P (E n |W = 1). Because of the independence in which the encoding function and the word W was chosen, we can instead consider the procedure where we choose the code at random as above but always transmit the first word 1. In the new resulting probability space, we still let E n denote the event of a decoding mistake. We also for each i ∈ {1, 2, . . . , 2nR } have the event Ein = {(X n (i), Y n ) ∈ An } where Y n nR is the output of the channel. We clearly have E n ⊆ ((E1n )c ∪ ∪2i=2 Ein ). Now P (E1n ) → 1 as n → ∞ by the joint typicality lemma. Next the independence of X n (i) and X n (1) (for i 6= 1) implies the independence of X n (i) and Y n (1) (for i 6= 1) which gives (by the joint typicality lemma) P (Ein ) ≤ nR 2 E n ) ≤ 2nR 2−n(I(X;Y )−3) = 2−n(I(X;Y )−3−R) 2−n(I(X;Y )−3) and so P (∪i=2 i which goes to 0 as n → ∞. Since we saw that P (E n ) goes to 0 as n → ∞ and 38 P (E n ) = P C∈Cn (n),C P (C)Pe (n),C C ∈ Cn so that Pe , it follows that for all large n there is a code is small. Now that we know such a code exists, one can find it by exhaustive search. There is much interest in finding such codes. We finally obtain a code with maximum probability of error being small. Since the average error is very small (say δ), it follows that if we take half of the codewords with smallest error, the maximum error of these is also very small (at most 2δ, why?). We therefore obtain a new code by throwing away the larger half of the codewords (in terms of their errors). Since we now have 2nR−1 codewords, the rate has been cut to R − 1 n which is negligible. (Actually, by throwing away this set of codewords, we obtain a new code, and since it is a new code, we should check that the maximal error is still small. 
This is of course obvious but the point is that this non–problem should come to mind). CONVERSE: We are now ready to prove the converse which is somewhat easier. We first need two easy lemmas. Lemma 7.10 (Fano’s inequality): Consider a DMC with code C with (n) input message uniformly chosen. Letting Pe = P (W 6= g(Y n )), we have H(X n |Y n ) ≤ 1 + Pe(n) nR. Hint of Proof: Let E be the event of error and expand H(E, W |Y n ) in two possible ways. Then think a little. 2 Lemma 7.11: Let Y n be the output when inputting X n through a DMC. 39 Then for any distribution p(xn ) on X n (the n bits need not be independent), I(X n ; Y n ) ≤ nC where C is channel capacity. Proof: Left to reader-very straightforward computation with no thought needed. 2 We now complete the converse. Assume that we have a sequence of (2nR , n) codes with maximal error (or even average error) going to 0. Let the word W be chosen uniformly over {1, . . . , 2nR } and consider Pen = P (Ŵ 6= W ) where Ŵ is our guess at the input W . We have nR = H(W ) = H(W |Y n ) + I(W ; Y n ) ≤ H(W |Y n ) + I(X n (W ); Y n ) ≤ 1 + Pe(n) nR + nC. Dividing by n and letting n → ∞ gives R ≤ C. 2 Remarks. (n) (1) The last step gives a lower probability on the average error of Pe 1− C R − 1 nR ≥ which of course is only interesting when the transmission rate R is larger than the capacity C. (2) Note that (in the first half of the proof) we proved that if p(x) is any distribution on X (not necessarily one which maximized mutual information), then choosing a code at random using the distribution p(x) instead, will with high probability result in a code whose probability of error goes to 0 with n providing we use a transmission rate R which is smaller than the mutual information between the input and output when X has distribution 40 p(x). (3) One can prove a stronger converse which says that for rates above capacity, the average error of probability must go exponentially to 1 (What does this exactly mean?) which makes capacity an even stronger dividing line. Exercise: Consider a channel with 1,2,3 and 4 as both input and output and where 3 and 4 are always sent to themselves while both 1 and 2 are sent to 1 and 2 with probability 1/2. What is the capacity of this DMC? Before computing, take a guess first and guess how it would compare to a channel with only 2 possible inputs but with perfect transmission. How would it compare to the same channel where 2 is not allowed as input? Exercise: Given a DMC, construct a new one where we add a new symbol ∗ to both the input and output alphabets and where if we input x 6= ∗, the output has the same distibution as before (and so ∗ can’t come out) and if ∗ is inputted, the output is uniform on all possible outputs (including ∗). What has happened to the capacity of the DMC? We have two important theorems, namely Corollary 4.7 and Theorem 7.8. These two theorems seem to be somewhat disjoint from each other but the following theorem brings them together in a very nice way. We leave this theorem as a difficult exercise for the reader who using the methods developed in these notes should be able to carry out this proof. Theorem 7.12 (Source–channel coding theorem): Let V = V1 , . . . be a stochastic process satisfying the AEP property (stationary ergodic suffices for example). Then if H(V) < C, there exists a source channel code with (n) Pe → 0. 41 Conversely, for any stationary stochastic process, if H(V) > C, then the probability of error is bounded away from 0 (bounded away over all n and all source codes). 
We have two important theorems, namely Corollary 4.7 and Theorem 7.8. These two theorems seem to be somewhat disjoint from each other, but the following theorem brings them together in a very nice way. We leave this theorem as a difficult exercise for the reader, who, using the methods developed in these notes, should be able to carry out the proof.

Theorem 7.12 (Source–channel coding theorem): Let V = V_1, V_2, ... be a stochastic process satisfying the AEP property (stationary ergodic suffices, for example). If H(V) < C, then there exists a source–channel code with P_e^{(n)} → 0. Conversely, for any stationary stochastic process, if H(V) > C, then the probability of error is bounded away from 0 (bounded away over all n and all source–channel codes).

One should think about why this theorem says that there is no better way to send a stationary process over a DMC with small probability of error than to do it in 2 steps: first compress (as we saw we can do), so that only about 2^{nH} strings, i.e. about nH bits, need to be transmitted, and then send the result over the channel using Theorem 7.8.

We end this section with a result which is perhaps somewhat out of place, but we put it here since mutual information was introduced in this section. This result will be needed in the next section.

Theorem 7.13: The mutual information I(X;Y) is a concave function of p(x) for fixed p(y|x) and a convex function of p(y|x) for fixed p(x).

Proof: We give an outline of the proof which contains most of the details. The first thing we need is the so-called log sum inequality, which says that if a_1, ..., a_n and b_1, ..., b_n are nonnegative, then

Σ_i a_i log(a_i/b_i) ≥ (Σ_i a_i) log((Σ_i a_i)/(Σ_i b_i)),

with equality if and only if a_i/b_i is constant. This is proved by Jensen's inequality, as is the nonnegativity of relative entropy, but in fact it also follows easily from the latter, as follows. A trivial computation shows that if the inequality holds for a_1, ..., a_n and b_1, ..., b_n, then it holds if we multiply these vectors by any (possibly different) constants. Now just normalize them to be probability vectors and use the nonnegativity of relative entropy.

The log sum inequality can then be used to demonstrate the convexity of relative entropy in its two arguments, i.e.,

D(λp_1 + (1-λ)p_2 || λq_1 + (1-λ)q_2) ≤ λD(p_1||q_1) + (1-λ)D(p_2||q_2),

as follows. The log sum inequality with n = 2 gives, for fixed x,

(λp_1(x) + (1-λ)p_2(x)) log[(λp_1(x) + (1-λ)p_2(x)) / (λq_1(x) + (1-λ)q_2(x))] ≤ λp_1(x) log(λp_1(x)/(λq_1(x))) + (1-λ)p_2(x) log((1-λ)p_2(x)/((1-λ)q_2(x))).

Now summing over x gives the desired convexity.

Now we prove the theorem. Fixing p(y|x), we first show that I(X;Y) is concave in p(x). We have

I(X;Y) = H(Y) - H(Y|X) = H(Y) - Σ_x p(x) H(Y|X = x).

If p(y|x) is fixed, then p(y) is a linear function of p(x). Since H(Y) is a concave function of p(y) (a fact which follows easily from the convexity of relative entropy in its two arguments, or can be proven more directly), it follows that H(Y) is a concave function of p(x) (a concave function composed with a linear one is concave). The second term, on the other hand, is obviously linear in p(x), and hence I(X;Y) is concave in p(x) (concave minus linear is concave).

For the other direction, fix p(x); we want to show that I(X;Y) is convex in p(y|x). A simple computation shows that the joint distribution obtained by using a convex combination of p_1(y|x) and p_2(y|x) is simply the same convex combination of the joint distributions obtained by using these two conditional distributions. Similarly, the product of the marginals obtained when using a convex combination of p_1(y|x) and p_2(y|x) is simply the same convex combination of the products of the marginals obtained by using these two conditional distributions. Using the definition of mutual information as the relative entropy between the joint distribution and the product of the marginals, these two observations together with the convexity of relative entropy in its two arguments allow an easy computation (left to the reader) which shows that I(X;Y) is convex in p(y|x), as desired. □

EXERCISE: It seems that the second argument above could perhaps also be used to show that for fixed p(y|x), I(X;Y) is convex in p(x). Where does the argument break down?
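Both halves of Theorem 7.13 are easy to test numerically, which can be a useful sanity check when doing the computation left to the reader above. The following check is my own (it proves nothing, and the random 3 × 3 example is arbitrary): it compares I at a mixture with the mixture of the I's, once for input distributions and once for channels.

```python
import numpy as np

def mutual_information(p, W):
    """I(X;Y) in bits for input distribution p and channel matrix W[x, y] = p(y|x)."""
    q = p @ W
    with np.errstate(divide="ignore", invalid="ignore"):
        return float(np.nansum(p[:, None] * W * np.log2(W / q)))

rng = np.random.default_rng(1)

def random_distribution(k):
    v = rng.random(k)
    return v / v.sum()

def random_channel(nx, ny):
    M = rng.random((nx, ny))
    return M / M.sum(axis=1, keepdims=True)   # each row is a conditional distribution p(y|x)

lam = 0.4
p1, p2, W = random_distribution(3), random_distribution(3), random_channel(3, 3)
p, W1, W2 = random_distribution(3), random_channel(3, 3), random_channel(3, 3)

# Concavity in p(x) for fixed p(y|x): I at the mixed input is at least the mixed value.
lhs = mutual_information(lam * p1 + (1 - lam) * p2, W)
rhs = lam * mutual_information(p1, W) + (1 - lam) * mutual_information(p2, W)
print(lhs >= rhs)   # True

# Convexity in p(y|x) for fixed p(x): I at the mixed channel is at most the mixed value.
lhs = mutual_information(p, lam * W1 + (1 - lam) * W2)
rhs = lam * mutual_information(p, W1) + (1 - lam) * mutual_information(p, W2)
print(lhs <= rhs)   # True
```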
8 Rate Distortion Theory

The area of rate distortion theory is very important in information theory, and entire books have been written about it. Our presentation will be brief, but we want to get to one of the main theorems, whose development parallels the theory in the previous section.

The simplest context in which rate distortion theory arises is quantization, which is the following problem. We have a random variable X which we want to quantize, that is, we want to represent the value of X by, say, 1 of 2 possible values. You get to choose the values and assign them to X as you wish; that is, you get to choose a function f : R → R whose image has only two points, such that f(X) (thought of as the quantized version of X) represents X well. What does "represent X well" mean? Perhaps we would want E[(f(X) - X)^2] to be small.

Recall that in the previous section we needed to send a lot of data (i.e., take n large) in order to have a small probability of error at a fixed rate below the channel capacity. Similarly, in this section we will be considering "large n", where n is the amount of data, in order to obtain another elegant result. We let X denote the state space of our random variable and X̂ the set of possible quantized versions (or representations) of our random variable. We now formulate a generalization of the quantization problem.

Definition 8.1: A distortion function is a mapping d : X × X̂ → R_+. Here d(x, x̂) should be thought of as some type of cost of representing the outcome x by x̂.

Given the above distortion function d, we take the distortion between two finite sequences x_1^n and x̂_1^n to be

d(x_1^n, x̂_1^n) = (1/n) Σ_{i=1}^n d(x_i, x̂_i),

which might be something you object to, but which is required for our theory to go through as we will do it.

Definition 8.2: A (2^{nR}, n)–rate distortion code consists of an encoding function f_n : X^n → {1, 2, ..., 2^{nR}} and a decoding (reproduction) function g_n : {1, 2, ..., 2^{nR}} → X̂^n.

If X_1^n ≡ X_1, ..., X_n are random variables taking values in X and d is a distortion function, then the distortion associated with the above code is

D = E[d(X_1^n, g_n(f_n(X_1^n)))].

The way to think of f_n is that it breaks up the set of all possible outcomes X^n into 2^{nR} sets (this is the quantization part), and then each quantized piece is sent to some element of X̂^n which is supposed to "represent" the original elements of X^n. Obviously, the larger we take R, the smaller we can make D (e.g., if |X| = 2 and R = 1, we can describe X_1^n exactly and get D = 0, at least when each x has some x̂ with d(x, x̂) = 0). In the previous section we wanted to take R large, while now we want to take it small.

We now assume that we have a fixed distortion function d and a fixed distribution p(x) on X, and that X_1^n ≡ X_1, ..., X_n are i.i.d. with distribution p(x). As in the previous section, for simplicity we assume that everything (i.e., X and X̂) is finite.

Definition 8.3: The pair (R, D) is achievable if there exists a sequence of (2^{nR}, n)–rate distortion codes (f_n, g_n) with

lim sup_n E[d(X_1^n, g_n(f_n(X_1^n)))] ≤ D.

Definition 8.4: The rate distortion function R(D) (where we still have a fixed distribution p(x) on X and a fixed distortion function d) is defined by

R(D) = inf{R : (R, D) is achievable}.

One should view R(D) as the coarsest possible quantization subject to keeping the distortion at most D. This is clearly a quantity one should be interested in, although it seems (analogous to (non-information) channel capacity) that it would be impossible to compute.
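To make Definition 8.2 and the distortion D concrete, here is a deliberately naive (2^{nR}, n)–rate distortion code, simulated below; everything about it (the source, the scheme, the parameter values) is my own illustration and not taken from the notes. The source is i.i.d. Bernoulli(p) with Hamming distortion; the encoder describes the first k = ⌊nR⌋ source bits exactly and the decoder fills in the remaining n - k positions with the more likely symbol.

```python
import numpy as np

def naive_code_distortion(n=1000, R=0.5, p=0.5, trials=2000, seed=0):
    """Monte Carlo estimate of the expected Hamming distortion of the naive code."""
    rng = np.random.default_rng(seed)
    k = int(np.floor(n * R))                  # the code uses 2^k <= 2^{nR} indices
    fill = 1 if p > 0.5 else 0                # reproduction symbol for the undescribed bits
    total = 0.0
    for _ in range(trials):
        x = (rng.random(n) < p).astype(int)                    # source word X_1, ..., X_n
        xhat = np.concatenate([x[:k], np.full(n - k, fill)])   # g_n(f_n(x))
        total += np.count_nonzero(x != xhat) / n               # d(x, xhat)
    return total / trials

print(naive_code_distortion())   # about (1 - R) * min(p, 1 - p) = 0.25 for p = R = 1/2
```

Its distortion is roughly (1 - R) · min(p, 1 - p), and the point of the theory below (Theorem 8.6) is that at the same rate one can in fact achieve a strictly smaller distortion by quantizing whole blocks rather than individual bits.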
We will now define the "information rate distortion" function, which will play the role in rate distortion theory that the information channel capacity played in the channel capacity theory. This quantity is much easier to compute (as was the information channel capacity), and the main theorem will be that these two functions (the rate distortion function and the information rate distortion function) are the same.

Definition 8.5: The information rate distortion function R^I(D) (where we still have a fixed distribution p(x) on X and a fixed distortion function d) is defined as follows. Let P_D be the set of all distributions p(x, x̂) on X × X̂ whose first marginal is p(x) and which satisfy E_{p(x,x̂)}[d(X, X̂)] ≤ D. Then we define

R^I(D) = min_{p(x,x̂) ∈ P_D} I(X; X̂).

Long Exercise: Let X = X̂ = {0, 1} and take the distortion function to be Hamming distance (which is 1 if the two symbols are not the same and 0 otherwise). Assume that X has distribution pδ_1 + (1-p)δ_0. Compute R^I(D) in this case.

We now state our main theorem and go directly to the proof. Recall that both the rate distortion function and the information rate distortion function are defined relative to the distribution p(x) on X and the distortion function d.

Theorem 8.6: R^I(D) = R(D), i.e., the rate distortion function and the information rate distortion function are the same.

Lemma 8.7: R^I(D) is a nonincreasing convex function of D.

Proof: The nonincreasing part is obvious. For the convexity, consider D_1 and D_2 and λ ∈ (0, 1). Choose (using compactness) p_1(x, x̂) and p_2(x, x̂) which minimize I(X; X̂) over the sets P_{D_1} and P_{D_2} respectively. Then (by linearity of expectation) p_λ ≡ λp_1(x, x̂) + (1-λ)p_2(x, x̂) ∈ P_{λD_1 + (1-λ)D_2}. By the second part of Theorem 7.13, we have

R^I(λD_1 + (1-λ)D_2) ≤ I_{p_λ}(X; X̂) ≤ λI_{p_1}(X; X̂) + (1-λ)I_{p_2}(X; X̂) = λR^I(D_1) + (1-λ)R^I(D_2). □

Proof of R(D) ≥ R^I(D): We need to show that if we have a sequence of (2^{nR}, n)–rate distortion codes (f_n, g_n) with lim sup_n E[d(X_1^n, g_n(f_n(X_1^n)))] ≤ D, then R ≥ R^I(D). The encoding and decoding functions give us a joint distribution on X^n × X̂^n (where the marginal on X^n is ∏_{i=1}^n p(x_i)), and we then have

nR ≥ H(X̂_1^n) ≥ H(X̂_1^n) - H(X̂_1^n | X_1^n) = I(X_1^n; X̂_1^n) = H(X_1^n) - H(X_1^n | X̂_1^n)
   = Σ_{i=1}^n H(X_i) - H(X_1^n | X̂_1^n) ≥ Σ_{i=1}^n H(X_i) - Σ_{i=1}^n H(X_i | X̂_i) = Σ_{i=1}^n I(X_i; X̂_i)
   ≥ Σ_{i=1}^n R^I(E[d(X_i, X̂_i)]) = n · (1/n) Σ_{i=1}^n R^I(E[d(X_i, X̂_i)])
   ≥ n R^I((1/n) Σ_{i=1}^n E[d(X_i, X̂_i)])   (by the convexity of R^I)
   = n R^I(E[d(X_1^n, X̂_1^n)]).

This gives R ≥ R^I(E[d(X_1^n, X̂_1^n)]) for all n. Since lim sup_n E[d(X_1^n, X̂_1^n)] ≤ D and R^I is nonincreasing, we should be able to conclude that R ≥ R^I(D); but note that to make this conclusion we need the continuity of R^I or, if you really think about it, only its right–continuity. Of course, if for some n we had E[d(X_1^n, X̂_1^n)] ≤ D, then we would not have to worry about this technicality.

Exercise: Prove the needed continuity property. □

Remark: Note that we actually proved that even if only lim inf_n E[d(X_1^n, X̂_1^n)] ≤ D, then R ≥ R^I(D).

We now proceed with the more difficult direction, which uses (as in the channel capacity theorem) a randomization procedure. We introduced ε–jointly typical sequences in the previous section; we now need to return to this, but with one additional condition.

Definition 8.8: Let p(x, x̂) be a joint distribution on X × X̂ and d a distortion function on the same set.
We say that (x^n, x̂^n) is ε–jointly typical (relative to p(x, x̂) and d) if

|-(1/n) log p(x^n) - H(X)| < ε,   |-(1/n) log p(x̂^n) - H(X̂)| < ε,
|-(1/n) log p(x^n, x̂^n) - H(X, X̂)| < ε,   and   |d(x^n, x̂^n) - E[d(X, X̂)]| < ε,

and we denote the set of all such pairs by A_{d,ε}^n.

The following three lemmas are left to the reader. The first is immediate, the second takes a little more work, and the third is pure analysis.

Lemma 8.9: Let (X_i, X̂_i) be drawn i.i.d. according to p(x, x̂). Then P((X_1^n, X̂_1^n) ∈ A_{d,ε}^n) → 1 as n → ∞.

Lemma 8.10: For all (x^n, x̂^n) ∈ A_{d,ε}^n,

p(x̂^n) ≥ p(x̂^n | x^n) 2^{-n(I(X;X̂) + 3ε)}.

Lemma 8.11: For 0 ≤ x, y ≤ 1 and n > 0,

(1 - xy)^n ≤ 1 - x + e^{-yn}.

Proof of R(D) ≤ R^I(D): Recall that p(x) (the distribution of X) and d (the distortion function) are fixed and that X_1, ..., X_n are i.i.d. with distribution p(x). We need to show that for any D and any R > R^I(D), the pair (R, D) is achievable.

EXERCISE: Show that it suffices to show that (R, D + ε) is achievable for all ε, and then of course it suffices to show this for those ε with R > R^I(D) + 3ε.

For this fixed D, choose a joint distribution p(x, x̂) which achieves the minimum of I(X; X̂) in the definition of R^I(D) and let p(x̂) be its marginal on x̂. In particular, we then have R > I(X; X̂) + 3ε.

Choose 2^{nR} sequences from X̂^n independently, each with distribution ∏_{i=1}^n p(x̂_i); this gives us a (random) decoding (or reproduction) function g_n : {1, 2, ..., 2^{nR}} → X̂^n. Once we have the decoding function g_n, we define the encoding function f_n as follows. Send x^n to w ∈ {1, 2, ..., 2^{nR}} if (x^n, g_n(w)) ∈ A_{d,ε}^n (if there is more than one such w, send it to any of them), while if there is no such w, send x^n to 1. Our probability space consists of first choosing a (2^{nR}, n)–rate distortion code in the above way and then choosing X^n independently from ∏_{i=1}^n p(x_i).

Consider the event E_n that there does not exist w ∈ {1, 2, ..., 2^{nR}} with (X^n, g_n(w)) ∈ A_{d,ε}^n (an event which depends on both the random code chosen and on X^n). The key step is the following lemma, which we prove afterwards.

Lemma 8.12: P(E_n) → 0 as n → ∞.

Calculating in this probability space, we get (X̂^n is of course g_n(f_n(X^n)) here)

E[d(X^n, X̂^n)] ≤ D + ε + d_max P(E_n),

where d_max is of course the maximum value d can take. The reason this holds is that if E_n^c occurs, then (X^n, X̂^n) is in A_{d,ε}^n, which by definition implies that |d(X^n, X̂^n) - E[d(X, X̂)]| < ε; this gives the result since E[d(X, X̂)] ≤ D. Applying the lemma, we obtain lim sup_n E[d(X^n, X̂^n)] ≤ D + ε. It follows that there must be a sequence of (non-random) (2^{nR}, n)–rate distortion codes (f_n, g_n) with lim sup_n E[d(X_1^n, g_n(f_n(X_1^n)))] ≤ D + ε (for each n, pick a code whose expected distortion is at most the average over the random codes), as desired. □

Proof of Lemma 8.12: This is simply a long computation. Let C_n denote the set of (2^{nR}, n)–rate distortion codes above. For C ∈ C_n, let J(C) be the set of sequences x^n for which there is some codeword x̂^n with (x^n, x̂^n) ∈ A_{d,ε}^n. We have

P(E_n) = Σ_{C ∈ C_n} P(C) Σ_{x^n : x^n ∉ J(C)} p(x^n) = Σ_{x^n} p(x^n) Σ_{C ∈ C_n : x^n ∉ J(C)} P(C).

Letting K(x^n, x̂^n) be the indicator function of (x^n, x̂^n) ∈ A_{d,ε}^n and fixing some x^n, we have, for a single random codeword X̂^n,

P((x^n, X̂^n) ∉ A_{d,ε}^n) = 1 - Σ_{x̂^n} p(x̂^n) K(x^n, x̂^n),

giving

Σ_{x^n} p(x^n) Σ_{C ∈ C_n : x^n ∉ J(C)} P(C) = Σ_{x^n} p(x^n) [1 - Σ_{x̂^n} p(x̂^n) K(x^n, x̂^n)]^{2^{nR}}.

Next, Lemmas 8.10 and 8.11 (in that order) and some algebra give

P(E_n) ≤ 1 - Σ_{x^n, x̂^n} p(x^n, x̂^n) K(x^n, x̂^n) + e^{-2^{-n(I(X;X̂)+3ε)} 2^{nR}}.

Since R > I(X; X̂) + 3ε, the last term goes (super)–exponentially to 0. The first two terms are simply P((A_{d,ε}^n)^c) for a pair drawn i.i.d. from p(x, x̂), which goes to 0 by Lemma 8.9. This completes the proof. □
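Since Theorem 8.6 identifies R(D) with R^I(D), the rate distortion function can at least be approximated numerically for small alphabets by minimizing I(X; X̂) over conditional distributions p(x̂|x) subject to the distortion constraint. The sketch below is my own and uses scipy's general-purpose SLSQP optimizer rather than the classical Blahut algorithm; by the second part of Theorem 7.13 the objective is convex in p(x̂|x), so a local solver should find the global minimum up to numerical error. The uniform ternary example at the end is only a sanity check (its value should come out near 0.66 bits at D = 0.2); all names and parameter choices are mine.

```python
import numpy as np
from scipy.optimize import minimize

def information_rate_distortion(p, d, D):
    """Numerically approximate R^I(D) in bits for source distribution p,
    distortion matrix d[x, xhat] and distortion level D."""
    nx, nxh = d.shape

    def conditional(v):
        return v.reshape(nx, nxh)          # v encodes the conditional q(xhat | x)

    def mutual_info(v):
        q = conditional(v)
        joint = p[:, None] * q
        marginal = joint.sum(axis=0)       # distribution of X-hat
        with np.errstate(divide="ignore", invalid="ignore"):
            return float(np.nansum(joint * np.log2(q / marginal)))

    def expected_distortion(v):
        return float(np.sum(p[:, None] * conditional(v) * d))

    constraints = [{"type": "eq", "fun": lambda v, i=i: conditional(v)[i].sum() - 1.0}
                   for i in range(nx)]                      # each row of q sums to 1
    constraints.append({"type": "ineq", "fun": lambda v: D - expected_distortion(v)})

    # A feasible interior starting point (for small D): mostly the best reproduction
    # of each x, mixed with a little of everything else.
    q0 = np.full((nx, nxh), 0.1 / nxh)
    q0[np.arange(nx), np.argmin(d, axis=1)] += 0.9
    res = minimize(mutual_info, q0.ravel(), method="SLSQP",
                   bounds=[(0.0, 1.0)] * (nx * nxh), constraints=constraints)
    return max(float(res.fun), 0.0)

# Example: uniform source on {0, 1, 2} with Hamming distortion and D = 0.2.
p = np.full(3, 1.0 / 3.0)
d = 1.0 - np.eye(3)
print(information_rate_distortion(p, d, D=0.2))
```

A natural follow-up is to rerun this for the binary source of the Long Exercise above and compare with the formula obtained there.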
9 Other universal and non–universal estimators

The following question, which I find interesting, has been the subject of a number of papers. Given a property of stationary ergodic processes, when does there exist a consistent estimator of it (consistent within the class of all stationary ergodic processes)?

Exercise: Sticking to processes taking on only the values {0, ..., 9}, show that if a property (i.e., a function of the process) is continuous on the set of all ergodic processes (where the latter is given the usual weak topology), then there exists a consistent estimator. (Olle Nerman pointed this out to me when I mentioned the more general question above.)

I won't write anything more in this section; rather, we will go through some recent research papers, xeroxed copies of which I will attach here. The theme of these papers will be, among other things, to learn how to construct stationary processes with certain desired behavior.
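As a closing illustration of the exercise above (not a proof of it): the probability of any fixed finite pattern is a continuous function of the process in the weak topology, and the ergodic theorem makes its empirical frequency a consistent estimator. The sketch below is mine; an i.i.d. uniform sample on {0, ..., 9} merely stands in for a stretch of a generic ergodic process.

```python
import numpy as np

def empirical_pattern_probability(sample, pattern):
    """Empirical frequency of a finite pattern in an observed stretch of the process."""
    n, k = len(sample), len(pattern)
    pattern = tuple(pattern)
    hits = sum(1 for i in range(n - k + 1) if tuple(sample[i:i + k]) == pattern)
    return hits / (n - k + 1)

# For an i.i.d. uniform {0,...,9}-valued process, the pattern (3, 1, 4) has probability
# 10^(-3), and the empirical frequency over a long sample should be close to that.
rng = np.random.default_rng(0)
sample = rng.integers(0, 10, size=200_000)
print(empirical_pattern_probability(sample, (3, 1, 4)))
```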