Information Theory Lecture Notes

Richard Combes

Version 1.0

Université Paris-Saclay, CNRS, CentraleSupélec, Laboratoire des signaux et systèmes, France

Contents

1 Information Measures
  1.1 Entropy
    1.1.1 Definition
    1.1.2 Entropy and Physics
    1.1.3 Positivity of Entropy and Maximal Entropy
  1.2 Joint and Conditional Entropy
    1.2.1 Definition
    1.2.2 Properties
  1.3 Relative Entropy
    1.3.1 Definition
    1.3.2 Positivity of Relative Entropy
    1.3.3 Relative Entropy is Not a Distance
  1.4 Mutual Information
    1.4.1 Definition
    1.4.2 Positivity of Mutual Information
    1.4.3 Conditioning Reduces Entropy

2 Properties of Information Measures
  2.1 Chain Rules
    2.1.1 Chain Rule for Entropy
    2.1.2 Chain Rule for Mutual Information
    2.1.3 Chain Rule for Relative Entropy
  2.2 Log Sum Inequality
    2.2.1 Statement
  2.3 Data Processing and Markov Chains
    2.3.1 Markov Chains
    2.3.2 Data Processing Inequality
  2.4 Fano Inequality
    2.4.1 Estimation Problems
    2.4.2 Statement
  2.5 Asymptotic Equipartition and Typicality
    2.5.1 AEP
    2.5.2 Typicality
    2.5.3 Joint Typicality

3 Data Representation: Fundamental Limits
  3.1 Source Coding
    3.1.1 Definition
    3.1.2 Expected Length
    3.1.3 Non-Singular Codes
    3.1.4 Uniquely Decodable Codes
  3.2 Prefix Codes
    3.2.1 Definition
    3.2.2 Prefix Codes as Trees
    3.2.3 Kraft Inequality
  3.3 Optimal Codes and Entropy
    3.3.1 Lower Bound on the Expected Code Length
    3.3.2 Existence of Nearly Optimal Codes
    3.3.3 Asymptotically Optimal Codes

4 Data Representation: Algorithms
  4.1 The Huffman Algorithm
    4.1.1 Algorithm
    4.1.2 Rationale
    4.1.3 Complexity
    4.1.4 Limitations
    4.1.5 Illustration
    4.1.6 Optimality
  4.2 Markov Coding
    4.2.1 Markov Sources
    4.2.2 The Entropy of English
    4.2.3 Efficient Codes for Markov Sources
  4.3 Universal Coding
    4.3.1 Universality
    4.3.2 A Simple Universal Code for Binary Sequences
    4.3.3 Lempel-Ziv Coding

5 Data Representation: Rate-Distortion Theory
  5.1 Lossy Compression, Quantization and Distortion
    5.1.1 Lossless vs Lossy Compression
    5.1.2 The Quantization Problem
  5.2 Scalar Quantization
    5.2.1 Lloyd-Max Conditions
    5.2.2 Uniform Distribution
    5.2.3 Gaussian Distribution with one bit
    5.2.4 General Distributions
  5.3 Vector Quantization
    5.3.1 Vector Quantization is Better than Scalar Quantization
    5.3.2 Paradoxes of High Dimensions
    5.3.3 Rate Distortion Function
  5.4 Rate Distortion Theorem
    5.4.1 Lower Bound
    5.4.2 Efficient Coding Scheme: Random Coding
  5.5 Rate Distortion for Gaussian Distributions
    5.5.1 Gaussian Random Variables
    5.5.2 Gaussian Vectors

6 Mutual Information and Communication: Discrete Channels
  6.1 Memoryless Channels
    6.1.1 Definition
    6.1.2 Information Capacity of a Channel
    6.1.3 Examples
    6.1.4 Non-Overlapping Outputs Channels
    6.1.5 Binary Symmetric Channel
    6.1.6 Typewriter Channel
    6.1.7 Binary Erasure Channel
  6.2 Channel Coding
    6.2.1 Coding Schemes
    6.2.2 Example of a Code for the BSC
    6.2.3 Achievable Rates
  6.3 Noisy Channel Coding Theorem
    6.3.1 Capacity Upper Bound
    6.3.2 Efficient Coding Scheme: Random Coding
  6.4 Computing the Channel Capacity
    6.4.1 Capacity of Weakly Symmetric Channels
    6.4.2 Concavity of Mutual Information
    6.4.3 Algorithms for Mutual Information Maximization

7 Mutual Information and Communication: Continuous Channels
  7.1 Information Measures for Continuous Variables
    7.1.1 Differential Entropy
    7.1.2 Examples
    7.1.3 Joint and Conditional Entropy, Mutual Information
    7.1.4 Unified Definitions for Information Measures
  7.2 Properties of Information Measures for Continuous Variables
    7.2.1 Chain Rule for Differential Entropy
    7.2.2 Differential Entropy of Affine Transformation
  7.3 Differential Entropy of Multivariate Gaussians
    7.3.1 Computing the Differential Entropy
    7.3.2 The Gaussian Distribution Maximizes Entropy
  7.4 Capacity of Continuous Channels
  7.5 Gaussian Channels
    7.5.1 Gaussian Channel
    7.5.2 The AWGN Channel
    7.5.3 Parallel Gaussian Channels
    7.5.4 Vector Gaussian Channels

8 Portfolio Theory
  8.1 A Model for Investment
    8.1.1 Asset Prices and Portfolios
    8.1.2 Relative Returns
  8.2 Log Optimal Portfolios
    8.2.1 Asymptotic Wealth Distribution
    8.2.2 Growth Rate Maximization
  8.3 Properties of Log Optimal Portfolios
    8.3.1 Kuhn Tucker Conditions
    8.3.2 Asymptotic Optimality
  8.4 Investment with Side Information
    8.4.1 Mismatched Portfolios
    8.4.2 Exploiting Side Information

9 Information Theory for Machine Learning and Statistics
  9.1 Statistics
    9.1.1 Statistical Inference
    9.1.2 Examples of Inference Problems
    9.1.3 Empirical Distributions
  9.2 The Method Of Types
    9.2.1 Probability Distribution of a Sample
    9.2.2 Number of Types
    9.2.3 Size of Type Class
  9.3 Large Deviations and Sanov's Theorem
    9.3.1 Sanov's Theorem
    9.3.2 Examples

10 Mathematical Tools
  10.1 Jensen Inequality
  10.2 Constrained Optimization

Foreword

These lecture notes pertain to the Information Theory course given at CentraleSupélec.
They are based on the book "Cover and Thomas, Elements of Information Theory", which we highly recommend to interested students who wish to go further in the study of this topic. Each chapter corresponds to a lecture, apart from the last chapter, which contains mathematical tools used in the proofs.

Chapter 1

Information Measures

In this chapter we introduce the information measures for discrete random variables which form the basis of all of information theory: entropy, relative entropy and mutual information, and we prove a few elementary properties.

1.1 Entropy

1.1.1 Definition

Definition 1.1.1. The entropy of a discrete random variable X ∈ X with distribution pX is:
\[
H(X) = E\left[\log_2 \frac{1}{p_X(X)}\right] = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{1}{p_X(x)}.
\]

Entropy is arguably the most fundamental information measure. The entropy H(X) is a real number that only depends on the distribution of X, and is expressed in bits. If the base 2 logarithm log2 is replaced by the natural logarithm log, then entropy is expressed in nats, and the two are equivalent up to a multiplicative factor, in the sense that 1 nat is equal to 1/log(2) ≈ 1.44 bits. We shall later see that H(X) both measures the randomness of X, as well as how much information is contained in X.

1.1.2 Entropy and Physics

The notion of entropy originates from statistical physics. If the random variable X is the state of a physical system with distribution pX, then H(X) is called the Gibbs entropy. One of the fundamental ideas is that the Gibbs entropy of an isolated physical system is a non-decreasing function of time, and that its equilibrium distribution must maximize the Gibbs entropy. Therefore, the randomness in an isolated system always increases and is maximized at equilibrium. In fact, one can prove that the Boltzmann distribution:
\[
p_X(x) = \frac{\exp\left(-\frac{E(x)}{k_B T}\right)}{\sum_{x' \in \mathcal{X}} \exp\left(-\frac{E(x')}{k_B T}\right)}
\]
where T is the temperature, E(x) is the energy of state x and k_B is the Boltzmann constant, maximizes the Gibbs entropy under an average energy constraint \(\sum_{x \in \mathcal{X}} p_X(x) E(x) = \bar{E}\).

1.1.3 Positivity of Entropy and Maximal Entropy

Property 1. The entropy of a discrete random variable X ∈ X with distribution X ∼ pX verifies
\[
0 \le H(X) \le \log_2 |\mathcal{X}|
\]
with equality if and only if X is uniform.

Proof: Since 0 ≤ pX(x) ≤ 1:
\[
H(X) = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{1}{p_X(x)} \ge \sum_{x \in \mathcal{X}} p_X(x) \log_2 1 = 0.
\]
The logarithm is strictly concave, so using Jensen's inequality:
\[
H(X) = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{1}{p_X(x)} \le \log_2 \sum_{x \in \mathcal{X}} p_X(x) \frac{1}{p_X(x)} = \log_2 |\mathcal{X}|,
\]
with equality if and only if 1/pX(x) is constant, i.e. X is uniform. Indeed, if X is uniform:
\[
H(X) = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{1}{p_X(x)} = \sum_{x \in \mathcal{X}} p_X(x) \log_2 |\mathcal{X}| = \log_2 |\mathcal{X}|.
\]

Entropy is positive and is upper bounded by the logarithm of the size of the support |X|, with equality if and only if X is uniform. The fact that entropy is positive makes sense since entropy must measure an amount of information, which must be positive. Furthermore, it makes sense to view entropy as a measure of randomness, since it is minimized (H(X) = 0) if X is deterministic, and it is maximized (H(X) = log2 |X|) for the uniform distribution, which are respectively the least and the most random distributions over X.
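To make Definition 1.1.1 concrete, here is a minimal numerical sketch in Python (the helper name `entropy` and the use of NumPy are choices of this sketch, not notation from the notes) that evaluates H(X) in bits for a few distributions on an alphabet of size 4 and checks the bounds of Property 1.

```python
import numpy as np

def entropy(p, base=2.0):
    """Entropy of a discrete distribution p (array of probabilities summing to 1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # terms with p(x) = 0 contribute 0 by convention
    return float(-(p * np.log(p)).sum() / np.log(base))

uniform       = [0.25, 0.25, 0.25, 0.25]
deterministic = [1.0, 0.0, 0.0, 0.0]
skewed        = [0.7, 0.1, 0.1, 0.1]

for name, p in [("uniform", uniform), ("deterministic", deterministic), ("skewed", skewed)]:
    print(f"{name:13s}  H = {entropy(p):.4f} bits  (0 <= H <= log2|X| = {np.log2(len(p)):.4f})")
```

The uniform distribution attains log2 4 = 2 bits and the deterministic one gives 0 bits, as expected.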
1.2 Joint and Conditional Entropy

1.2.1 Definition

Definition 1.2.1. The joint entropy of two discrete random variables X ∈ X and Y ∈ Y with joint distribution (X, Y) ∼ pX,Y is:
\[
H(X,Y) = E\left[\log_2 \frac{1}{p_{X,Y}(X,Y)}\right] = \sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} p_{X,Y}(x,y) \log_2 \frac{1}{p_{X,Y}(x,y)}.
\]

The joint entropy H(X, Y) is simply the entropy of (X, Y) seen as a single random variable. It is important to notice that the joint entropy depends on the full joint distribution of X and Y, not only on the marginal distributions.

Definition 1.2.2. The conditional entropy of X ∈ X knowing Y ∈ Y, two discrete random variables with joint distribution pX,Y and conditional distribution pX|Y, is
\[
H(X|Y) = E\left[\log_2 \frac{1}{p_{X|Y}(X|Y)}\right]
= E\left[\log_2 \frac{1}{p_{X,Y}(X,Y)}\right] - E\left[\log_2 \frac{1}{p_Y(Y)}\right]
= H(X,Y) - H(Y).
\]

The conditional entropy H(X|Y) measures the entropy of X once the value of Y has been revealed. It has several definitions, which are all equivalent by the Bayes rule stating that pX,Y(x, y) = pX|Y(x|y) pY(y). In particular, the last relationship H(X, Y) = H(X|Y) + H(Y) is called a chain rule, and can be interpreted as the fact that the amount of randomness in (X, Y) equals the amount of randomness in Y plus the amount of randomness left in X once Y has been revealed.

1.2.2 Properties

Property 2. If X and Y are independent then H(X|Y) = H(X) and H(X, Y) = H(X) + H(Y).

Proof: If X and Y are independent then pX,Y(x, y) = pX(x) pY(y), and replacing in the definition gives the result immediately.

Entropy is additive for independent random variables, which once again is coherent with its interpretation as a measure of randomness. Indeed, if there is no relationship between X and Y, the randomness of (X, Y) is simply the sum of the randomness in X and Y taken separately. It is also noticed that entropy is not additive if X and Y are correlated: for instance if X = Y then H(X, Y) = H(X) ≠ H(X) + H(Y), unless both X and Y are deterministic.

Property 3. Conditional entropy is not symmetrical unless H(X) = H(Y):
\[
H(Y|X) - H(X|Y) = H(Y) - H(X).
\]

Conditional entropy is not symmetrical, one notable exception being when X and Y have the same entropy, for instance when they have the same distribution.

1.3 Relative Entropy

1.3.1 Definition

Definition 1.3.1. Consider p, q two distributions over a discrete set X, and X with distribution pX = p. The relative entropy between p and q is:
\[
D(p\|q) = E\left[\log_2 \frac{p(X)}{q(X)}\right] = \sum_{x \in \mathcal{X}} p(x) \log_2 \frac{p(x)}{q(x)}
\]
if p is absolutely continuous with respect to q, and D(p||q) = +∞ otherwise.

Relative entropy is another fundamental information measure. A notable difference is that, while entropy measures the randomness of a single distribution, relative entropy measures the dissimilarity between two distributions p, q. It is also noted that if p is not absolutely continuous with respect to q, then p(x) > 0 while q(x) = 0 for some x ∈ X, so that indeed D(p||q) = +∞.

1.3.2 Positivity of Relative Entropy

Property 4. Consider p, q two distributions. Then D(p||q) ≥ 0 with equality if and only if p = q.

Proof: Since z ↦ −log2 z is strictly convex, from Jensen's inequality:
\[
D(p\|q) = -E\left[\log_2 \frac{q(X)}{p(X)}\right] \ge -\log_2 E\left[\frac{q(X)}{p(X)}\right]
= -\log_2 \sum_{x \in \mathcal{X}} p(x) \frac{q(x)}{p(x)} = -\log_2 1 = 0.
\]

Relative entropy (sometimes called Kullback-Leibler divergence) is positive, which makes sense as it measures dissimilarity. We always have D(p||q) ≥ 0, D(p||q) = 0 if p = q, and the larger the value of D(p||q), the more dissimilar p is from q. We shall also see later that there exist many other measures of dissimilarity between distributions in information theory.

1.3.3 Relative Entropy is Not a Distance

Example 1. Consider |X| = 2, p = (1/2, 1/2) and q = (a, 1 − a). Then D(p||q) ≠ D(q||p) if a ≠ 1/2.

It should be noted that relative entropy is not a distance: it is not symmetrical, by the example above, nor does it satisfy the triangle inequality.
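As a quick numerical illustration of Example 1, the following sketch (Python with NumPy; the helper name is mine) evaluates both D(p||q) and D(q||p) for a = 0.9 and shows that the two differ.

```python
import numpy as np

def kl_divergence(p, q, base=2.0):
    """Relative entropy D(p||q) in bits; +inf if p is not absolutely continuous w.r.t. q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.any((q == 0) & (p > 0)):
        return float("inf")
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / q[mask])).sum() / np.log(base))

a = 0.9
p = [0.5, 0.5]
q = [a, 1 - a]
print("D(p||q) =", round(kl_divergence(p, q), 4), "bits")   # approximately 0.737
print("D(q||p) =", round(kl_divergence(q, p), 4), "bits")   # approximately 0.531, a different value
```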
1.4 Mutual Information

1.4.1 Definition

Definition 1.4.1. Let (X, Y) be discrete random variables with joint distribution pX,Y and marginal distributions pX and pY respectively. The mutual information between X and Y is:
\[
\begin{aligned}
I(X;Y) &= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{X,Y}(x,y) \log_2 \frac{p_{X,Y}(x,y)}{p_X(x) p_Y(y)} \\
&= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{X,Y}(x,y) \left( \log_2 \frac{1}{p_X(x)} - \log_2 \frac{1}{p_{X|Y}(x|y)} \right) \\
&= H(X) - H(X|Y) = H(Y) - H(Y|X) \\
&= H(X) + H(Y) - H(X,Y) \\
&= D(p_{X,Y} \| p_X p_Y).
\end{aligned}
\]

The last measure of information we consider is called the mutual information. We provide several definitions, which are all equivalent to each other, and this can be checked by inspection. Mutual information is symmetric by definition. We have that I(X; Y) = H(X) − H(X|Y); therefore, if I(X; Y) is large, one must have that H(X) is large, so that the randomness of X is large, and that H(X|Y) is small, so that the randomness of X knowing Y is small, i.e. it is easy to guess X from Y. We also have that I(X; Y) measures the dissimilarity between the joint distribution of (X, Y), which is pX,Y, and the distribution that (X, Y) would have if X and Y were chosen independently with the same marginals pX, pY. So mutual information can also be seen as a measure of dependency between X and Y. We shall see later that mutual information also quantifies the amount of information that can be exchanged between a sender who selects X and a receiver who observes Y.

1.4.2 Positivity of Mutual Information

Property 5. Let X, Y be discrete random variables. Then I(X; Y) ≥ 0 with equality if and only if X and Y are independent.

Proof: By definition I(X; Y) = D(pX,Y || pX pY) ≥ 0 since relative entropy is positive, with equality if and only if pX,Y = pX pY, so that X, Y are independent.

Mutual information is positive, since it can be written as a relative entropy. This has important consequences, as we shall see.

1.4.3 Conditioning Reduces Entropy

Property 6. Let X, Y be discrete random variables. Then H(X|Y) ≤ H(X), with equality if and only if X, Y are independent, and H(X, Y) ≤ H(X) + H(Y), with equality if and only if X, Y are independent.

Proof: We have I(X; Y) = H(X) − H(X|Y) ≥ 0, with equality if and only if X, Y are independent. From the chain rule, H(X, Y) = H(Y|X) + H(X) ≤ H(X) + H(Y), using the previous result.

From the positivity of mutual information, we deduce two important properties. The first is that conditioning always reduces entropy, which is intuitive since revealing the value of Y reduces the randomness in X. Furthermore, we have already seen that entropy is additive for independent random variables; we now see that it is in fact sub-additive, so that joint entropy is always smaller than the sum of entropies.
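Before moving on, the identities above are easy to check numerically. The sketch below (Python/NumPy; the 2x2 joint distribution is a hypothetical example chosen for illustration) computes H(X), H(X|Y) and I(X;Y) from a joint table and verifies that conditioning reduces entropy and that mutual information is non-negative.

```python
import numpy as np

def H(p):
    """Entropy in bits of a (possibly multi-dimensional) array of probabilities."""
    p = np.asarray(p, float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# A small joint distribution p(x, y) on a 2 x 2 alphabet (rows: x, columns: y).
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

I = H(p_x) + H(p_y) - H(p_xy)       # mutual information
H_x_given_y = H(p_xy) - H(p_y)      # conditional entropy via the chain rule

print(f"H(X) = {H(p_x):.4f}, H(X|Y) = {H_x_given_y:.4f}, I(X;Y) = {I:.4f} bits")
# Conditioning reduces entropy: H(X|Y) <= H(X), and I(X;Y) >= 0.
```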
Chapter 2

Properties of Information Measures

In this chapter we introduce important properties of information measures which enable us to manipulate them efficiently, such as chain rules. We also introduce fundamental inequalities involving information measures: the data processing inequality, the log-sum inequality and Fano's inequality.

2.1 Chain Rules

In general, a chain rule is simply a formula that allows to compute information measures by recursion.

2.1.1 Chain Rule for Entropy

Definition 2.1.1. For any X1, . . . , Xn we have:
\[
H(X_1, \dots, X_n) = \sum_{i=1}^{n} H(X_i | X_{i-1}, \dots, X_1).
\]

Proof: By definition of conditional entropy:
\[
H(X_1, \dots, X_n) = H(X_n | X_{n-1}, \dots, X_1) + H(X_{n-1}, \dots, X_1).
\]
The result follows by induction over n.

The chain rule for entropy allows to compute the entropy of X1, . . . , Xn by successive conditioning, and has the following interpretation: imagine that the values of X1, . . . , Xn are presented to us as a time series, one value after the other; then H(Xi | Xi−1, . . . , X1) is simply the randomness of the current value Xi knowing the full history of the process up to time i − 1, which is Xi−1, . . . , X1.

2.1.2 Chain Rule for Mutual Information

Definition 2.1.2. For any X1, . . . , Xn we have:
\[
I(X_1, \dots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y | X_{i-1}, \dots, X_1).
\]

Proof: Using both the chain rule and the definition of mutual information:
\[
\begin{aligned}
I(X_1, \dots, X_n; Y) &= H(X_1, \dots, X_n) - H(X_1, \dots, X_n | Y) \\
&= \sum_{i=1}^{n} H(X_i | X_{i-1}, \dots, X_1) - \sum_{i=1}^{n} H(X_i | X_{i-1}, \dots, X_1, Y) \\
&= \sum_{i=1}^{n} I(X_i; Y | X_{i-1}, \dots, X_1).
\end{aligned}
\]

The chain rule for mutual information also has a natural interpretation. Imagine that a sender selects X1, . . . , Xn and attempts to communicate with a receiver who observes Y. Then the information that can be exchanged, I(X1, . . . , Xn; Y), is the sum of the terms I(Xi; Y | Xi−1, . . . , X1), which can be interpreted as the sender sending X1, the receiver retrieving X1 from Y, then the sender sending X2 and the receiver retrieving X2 from both Y and X1, etc. This idea of retrieving X1, ..., Xn iteratively is used in many communication systems.

2.1.3 Chain Rule for Relative Entropy

Definition 2.1.3. Consider X, Y discrete random variables with joint distribution pX,Y and marginal distributions pX, pY respectively. We have:
\[
D(p_{X,Y} \| q_{X,Y}) = D(p_X \| q_X) + D(p_{Y|X} \| q_{Y|X}).
\]

Proof: Using the Bayes rule:
\[
\begin{aligned}
D(p_{X,Y} \| q_{X,Y}) &= E\left[\log_2 \frac{p_{X,Y}(X,Y)}{q_{X,Y}(X,Y)}\right] \\
&= E\left[\log_2 \frac{p_X(X)}{q_X(X)}\right] + E\left[\log_2 \frac{p_{Y|X}(Y|X)}{q_{Y|X}(Y|X)}\right] \\
&= D(p_X \| q_X) + D(p_{Y|X} \| q_{Y|X})
\end{aligned}
\]
proving the result.

The interpretation of this chain rule is similar to that for the entropy.

2.2 Log Sum Inequality

In information theory, weighted sums of logarithms are ubiquitous, and the so-called log-sum inequality is a useful tool in many situations.

2.2.1 Statement

Proposition 2.2.1. For any positive numbers (ai)i, (bi)i:
\[
\sum_{i=1}^{n} a_i \log_2 \frac{a_i}{b_i} \ge \left( \sum_{i=1}^{n} a_i \right) \log_2 \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i}
\]
with equality iff ai/bi = c for all i.

Proof: The function f(x) = x log2 x is strictly convex, as f''(x) = 1/(x ln 2) > 0. Using Jensen's inequality with αi = bi / Σ_{j=1}^n bj:
\[
\sum_{i=1}^{n} a_i \log_2 \frac{a_i}{b_i} = \Big( \sum_{j=1}^{n} b_j \Big) \sum_{i=1}^{n} \alpha_i f\Big( \frac{a_i}{b_i} \Big) \ge \Big( \sum_{j=1}^{n} b_j \Big) f\Big( \sum_{i=1}^{n} \alpha_i \frac{a_i}{b_i} \Big) = \Big( \sum_{i=1}^{n} a_i \Big) \log_2 \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i}.
\]

Interestingly, the log-sum inequality implies a variety of other results, as we shall see later.

2.3 Data Processing and Markov Chains

A fundamental idea in information theory, which to a degree justifies the definition of mutual information in itself, is that data processing, even with unlimited computing power, cannot create information. This is formalized by the data processing inequality for Markov chains.

2.3.1 Markov Chains

Definition 2.3.1. X → Y → Z is a Markov chain iff X and Z are independent given Y. Equivalently, (X, Y, Z) ∼ pX,Y,Z with
\[
p_{X,Y,Z}(x,y,z) = p_X(x) p_{Y|X}(y|x) p_{Z|Y,X}(z|y,x) = p_X(x) p_{Y|X}(y|x) p_{Z|Y}(z|y).
\]

Simply said, a Markov chain X → Y → Z is such that one first draws the value of X, then once the value of X is known we draw Y according to some distribution that depends solely on X, and finally one draws Z according to some distribution that depends solely on Y. The key idea is that, in order to generate Z, one can only look at the previously generated value Y, i.e.
we generate the process with a memory of order 1. The simplest, and most often encountered example of a Markov chain X → Y → Z is any X, Y, Z such that Z = g(Y ) where g is a known, deterministic function. 2.3.2 Data Processing Inequality Proposition 2.3.2. If X → Y → Z then I(X; Y ) ≥ I(X; Z). Proof We have: I(X; Y, Z) = I(X; Z) + I(X; Y |Z) = I(X; Y ) + I(X; Z|Y ) since I(X; Y |Z) ≥ 0 and I(X; Z|Y ) = 0 we have I(X; Y ) ≥ I(X; Z). The data processing inequality simply states that mutual information cannot increase along a Markov chain, i.e. data processing cannot create information out of nowhere. An interpretation in the context of communication is that if a sender selects X and a receiver observes Y , and a helper offers to help the receiver by computing the value of g(Y ), then X 7→ Y 7→ g(Y ) and so I(X; g(Y )) ≤ I(X; Y ). I.e. the helper is in fact never helpful. 2.4 Fano Inequality We now derive Fano’s inequality, which establishes a fundamental link between entropy and the probability of error in estimation problems, and is essential in both statistics and communication. 2.4.1 Estimation Problems We call estimation problem a situation in which an agent observes a random variable Y , and attempts to guess another hidden random variable X. The agent is allowed to construct any estimator X̂, without any limitation on his computing power. The goal is to minimize the estimation error P(X ̸= X̂). 2.4. FANO INEQUALITY 2.4.2 21 Statement Proposition 2.4.1. If X → Y → X̂ then: h2 (P(X ̸= X̂)) + P(X ̸= X̂) log2 |X | ≥ H(X|Y ) 1 with h2 (p) = p log p1 + (1 − p) log 1−p the binary entropy. Proof: Since X → Y → X̂ is a Markov chain H(X) − H(X|X̂) = I(X; X̂) ≤ I(X; Y ) = H(X) − H(X|Y ) so that H(X|Y ) ≤ H(X|X̂) Define E = 1{X̂ ̸= X}, using the chain rule in both directions: H(E|X̂) + H(X|E, X̂) = H(X, E|X̂) = H(X|X̂) + H(E|X, X̂) Now H(E|X, X̂) = 0 because E is a deterministic function of X, X̂ which proves: H(X|X̂) = H(E|X̂) + H(X|E, X̂) We have H(X|E, X̂) ≤ P(E = 1) log2 (|X | − 1) + P(E = 0) log2 (1) because if E = 0 then X = X̂ has 1 possible values and if E = 1 X ̸= X̂ has |X | − 1 possible values. Finally, since conditioning reduces entropy: H(E|X̂) ≤ H(E) = h2 (P(E = 1)) which concludes the proof. Fano’s inequality states that the estimation error P(X ̸= X̂) cannot be arbitrarly small, unless the conditional entropy of the hidden variable knowing the observation H(X|Y ) is small too. This a fundamendal limit that is true irrespective of how much computational power is available to perform the estimation. This intuitive since H(X|Y ) is the randomness left in X once Y has been seen by the agent. Fano’s inequality therefore shows that conditional entropy can be used as a measure of how difficult an estimation problem might be. 22 CHAPTER 2. PROPERTIES OF INFORMATION MEASURES 2.5 2.5.1 Asymptotic Equipartition and Typicality AEP Proposition 2.5.1. Consider (Xi )i=1,...,n i.i.d. with common distribution pX . Then n 1X 1 log2 → H(X) in probability. n i=1 pX (Xi ) n→∞ Consider (Xi , Yi )i=1,...,n i.i.d. with common joint distribution pX,Y . Then n 1X 1 log2 → H(X, Y ) in probability. n i=1 pX,Y (Xi , Yi ) n→∞ and n 1X 1 → H(X|Y ) in probability. log2 n i=1 pX|Y (Xi |Yi ) n→∞ and n 1X pX (Xi )pY (Yi ) log2 → I(X; Y ) in probability. n i=1 pX,Y (Xi , Yi ) n→∞ Proof: All statements hold true from the weak law of large numbers. The Asymptotic Equipartition Property (AEP), which in itself is a straightforward consequence of the law of large numbers, roughly states that for large i.i.d. 
samples, the "empirical information measures" behave like the actual information measures. While this is not very useful in itself, a consequence is that i.i.d. samples concentrate with high probability on so-called "typical sets".

2.5.2 Typicality

Proposition 2.5.2. Consider X1, . . . , Xn i.i.d. with common distribution X ∼ p(x). Given ε > 0 define the typical set:
\[
A_\epsilon^n = \left\{ (x_1, \dots, x_n) \in \mathcal{X}^n : \left| \frac{1}{n} \sum_{i=1}^{n} \log_2 \frac{1}{p(x_i)} - H(X) \right| \le \epsilon \right\}.
\]
Then:
(i) |A_ε^n| ≤ 2^{n(H(X)+ε)} for all n
(ii) |A_ε^n| ≥ (1 − ε) 2^{n(H(X)−ε)} for n large enough
(iii) P((X1, . . . , Xn) ∈ A_ε^n) ≥ 1 − ε for n large enough

Proof: By definition, (x1, ..., xn) ∈ A_ε^n if and only if
\[
2^{-n(H(X)+\epsilon)} \le p(x_1) \cdots p(x_n) \le 2^{-n(H(X)-\epsilon)}.
\]
Computing the probability of the typical set:
\[
P((X_1, \dots, X_n) \in A_\epsilon^n) = \sum_{(x_1, \dots, x_n) \in A_\epsilon^n} p(x_1) \cdots p(x_n)
\]
which we bound as
\[
|A_\epsilon^n| \, 2^{-n(H(X)+\epsilon)} \le P((X_1, \dots, X_n) \in A_\epsilon^n) \le |A_\epsilon^n| \, 2^{-n(H(X)-\epsilon)}.
\]
From asymptotic equipartition the typical set is a high probability set, and for n large enough 1 − ε ≤ P((X1, . . . , Xn) ∈ A_ε^n) ≤ 1. The size of the typical set is therefore bounded as
\[
|A_\epsilon^n| \le 2^{n(H(X)+\epsilon)} P((X_1, \dots, X_n) \in A_\epsilon^n) \le 2^{n(H(X)+\epsilon)}
\]
and
\[
|A_\epsilon^n| \ge 2^{n(H(X)-\epsilon)} P((X_1, \dots, X_n) \in A_\epsilon^n) \ge (1-\epsilon) \, 2^{n(H(X)-\epsilon)}.
\]
This concludes the proof.

In essence, if one draws an i.i.d. sample X1, ..., Xn, with high probability it will fall in the so-called "typical set", and this typical set has a size roughly equal to 2^{nH(X)}. This is also fundamental for data compression: imagine that we would like to represent X1, ..., Xn as a sequence of m binary symbols. If we have a small tolerance for error, then if X1, ..., Xn is typical we could represent it by its index in the typical set using m ≈ nH(X) binary symbols, and if X1, ..., Xn is non-typical simply ignore it. This gives a new interpretation of entropy as the number of binary symbols necessary to represent data. We will expand on this in later chapters.

2.5.3 Joint Typicality

Proposition 2.5.3. Consider (X^n, Y^n) = (Xi, Yi)_{i=1,...,n} i.i.d. with distribution p(x, y) and (X̃^n, Ỹ^n) = (X̃i, Ỹi)_{i=1,...,n} i.i.d. with distribution p(x)p(y). Given ε > 0 define the jointly typical set:
\[
A_\epsilon^n = \Big\{ (x^n, y^n) \in \mathcal{X}^n \times \mathcal{Y}^n : \Big| \frac{1}{n} \sum_{i=1}^{n} \log_2 \frac{1}{p(x_i)} - H(X) \Big| + \Big| \frac{1}{n} \sum_{i=1}^{n} \log_2 \frac{1}{p(y_i)} - H(Y) \Big| + \Big| \frac{1}{n} \sum_{i=1}^{n} \log_2 \frac{1}{p(x_i, y_i)} - H(X,Y) \Big| \le \epsilon \Big\}.
\]
Then:
(i) |A_ε^n| ≤ 2^{n(H(X,Y)+ε)} for all n;
(ii) P((X^n, Y^n) ∈ A_ε^n) → 1 as n → ∞;
(iii) (1 − ε) 2^{−n(I(X;Y)+ε)} ≤ P((X̃^n, Ỹ^n) ∈ A_ε^n) ≤ 2^{−n(I(X;Y)−ε)} for n large enough.

Proof: We have:
\[
A_\epsilon^n \subset \Big\{ (x^n, y^n) \in \mathcal{X}^n \times \mathcal{Y}^n : \Big| \frac{1}{n} \sum_{i=1}^{n} \log_2 \frac{1}{p(x_i, y_i)} - H(X,Y) \Big| \le \epsilon \Big\}
\]
and we know that this set has size at most 2^{n(H(X,Y)+ε)}. From the law of large numbers:
\[
P\Big( \Big| \frac{1}{n} \sum_{i=1}^{n} \log_2 \frac{1}{p(X_i)} - H(X) \Big| \ge \frac{\epsilon}{3} \Big) \to 0, \quad
P\Big( \Big| \frac{1}{n} \sum_{i=1}^{n} \log_2 \frac{1}{p(Y_i)} - H(Y) \Big| \ge \frac{\epsilon}{3} \Big) \to 0, \quad
P\Big( \Big| \frac{1}{n} \sum_{i=1}^{n} \log_2 \frac{1}{p(X_i, Y_i)} - H(X,Y) \Big| \ge \frac{\epsilon}{3} \Big) \to 0.
\]
Therefore P((X^n, Y^n) ∈ A_ε^n) → 1 as n → ∞. Since (X̃^n, Ỹ^n) is i.i.d. with distribution p(x)p(y):
\[
P((\tilde{X}^n, \tilde{Y}^n) \in A_\epsilon^n) = \sum_{(x^n, y^n) \in A_\epsilon^n} p(x^n) p(y^n) = \sum_{(x^n, y^n) \in A_\epsilon^n} \frac{p(x^n) p(y^n)}{p(x^n, y^n)} \, p(x^n, y^n).
\]
If (x^n, y^n) ∈ A_ε^n:
\[
2^{-n(I(X;Y)+\epsilon)} \le \frac{p(x^n) p(y^n)}{p(x^n, y^n)} \le 2^{-n(I(X;Y)-\epsilon)}.
\]
Therefore:
\[
2^{-n(I(X;Y)+\epsilon)} \, P((X^n, Y^n) \in A_\epsilon^n) \le P((\tilde{X}^n, \tilde{Y}^n) \in A_\epsilon^n) \le 2^{-n(I(X;Y)-\epsilon)}
\]
and the result is proven as P((X^n, Y^n) ∈ A_ε^n) → 1 as n → ∞.

Joint typicality is similar to typicality, and we will expand on its implications when considering communication over noisy channels.
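The concentration behind the AEP is easy to observe by simulation. The following minimal sketch (Python/NumPy; the source distribution and the values of n and ε are arbitrary choices made for this illustration) draws i.i.d. samples and measures how often the empirical quantity (1/n) Σ log2 1/p(Xi) falls within ε of H(X), i.e. how often the sample is typical.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.25, 0.125, 0.125])    # source distribution
H = float(-(p * np.log2(p)).sum())         # H(X) = 1.75 bits
n, eps, trials = 1000, 0.05, 2000

inside = 0
for _ in range(trials):
    x = rng.choice(len(p), size=n, p=p)    # i.i.d. sample X_1, ..., X_n
    emp = -np.log2(p[x]).mean()            # (1/n) sum of log2 1/p(X_i)
    inside += abs(emp - H) <= eps          # is the sample eps-typical?

print(f"H(X) = {H} bits, fraction of eps-typical samples = {inside / trials:.3f}")
```

For these parameters the empirical average is tightly concentrated around 1.75 bits, and the fraction of typical samples is close to 1, as Proposition 2.5.2 predicts for n large.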
Chapter 3 Data Representation: Fundamental Limits In this chapter we start our exposition of how to represent data efficiently using information theoretic tools. We introduce prefix codes and show that the entropy of the source quantifies the length of the best prefix codes, and how such codes can be constructed. 3.1 Source Coding We consider the problem of source coding, in which we would like to represent a sequence of symbols X1 , .., Xn from some finite set X as a sequence of bits, with the goal of doing so as efficiently as possible. 3.1.1 Definition Definition 3.1.1. Consider X ∈ X and D the set of finite strings on {0, 1}. A source code is a mapping C : X → D. A source code takes as input a symbol X and maps it into a finite sequence of bits. 3.1.2 Expected Length Definition 3.1.2. Let X ∈ X with distribution p(x). The expected length of code C is: X L(C) = Eℓ(X) = p(x)ℓ(x). x∈X with ℓ(x) the length of codeword C(x). 25 26 CHAPTER 3. DATA REPRESENTATION: FUNDAMENTAL LIMITS One of the main measures of efficiency of a source code is its expected length, which is the expected number of bits required to represent a symbol, if this symbol were drawn according to the source distribution. 3.1.3 Non-Singular Codes Definition 3.1.3. A code C is non singular if and only if C(x) = C(x′ ) =⇒ x = x′ for all x, x′ ∈ X . namely if X can be perfectly retrieved from C(X). A code is non-singular if the original symbol be retrieved from its associated codeword, using some sort of decoding procedure which is possible if any only if there exists no pair of symbols that get assigned the same codeword. Therefore, non-singular codes perfom lossless compression, which is the focus of this chapter. There also exist lossy compression techniques, considered in future chapters, where the amount of information lost (also called "distorsion") is controlled in some fashion. 3.1.4 Uniquely Decodable Codes Definition 3.1.4. The extension of a code C is the mapping from finite strings of X to finite strings of D defined as: C(x1 . . . xn ) = C(x1 ) . . . C(xn ) The extension of a code is what we obtain when encoding the sequence of symbols X1 , ..., Xn as the concatenation of the codewords associated to each symbol C(X1 ), ..., C(Xn ). Definition 3.1.5. A code C is uniquely decodable if its extension is non-singular. A critical point is that extension can create ambiuity , even if the code is nonsingular. Indeed, if one only observes the concatated codewords C(X1 ), ..., C(Xn ), it might be difficult to know where one codeword ends and where the next one begins. A simple example would be X = {a, b, c} and a code C(a) = 0, C(b) = 1 and C(c) = 01. We have C(a)C(b) = C(c) so it is impossible to differentiate between ab and c. A uniquely decodable code is such that extension does not create ambiguity, and enables to encode streams of symbols by encoding each symbol separately, without losing any information. 3.2. PREFIX CODES 3.2 3.2.1 27 Prefix Codes Definition Definition 3.2.1. A code C is a prefix code if C(x) is not a prefix of C(x′ ) unless x = x′ for all (x, x′ ) ∈ X 2 . An important class of uniquely decodable codes are prefix codes, where no codeword can be the prefix of another codeword. Those codes are also called selfpuncturing, or instantaneous, because the decoding can be done without looking ahead in the stream of coded bits. Definition 3.2.2. Prefix codes are uniquely decodable. 
Proof: Consider the following decoding algorithm: let C(X1 ), ..., C(Xn ) be a sequence of bits u1 ...um and let ℓ the smallest integer such that u1 ...uℓ = C(x) for some x. Then we must have x = X1 , otherwise C(x) would be the prefix of some other codeword. This yields X1 and repeat the procedure to obtain X1 , ..., Xn . It is understood that prefix codes are uniquely decodable, and uniquely decodable codes are non-singular, but there exists uniquely decodable codes that are not prefix codes, and there exists non-singular codes that are not uniquely decodable. 3.2.2 Prefix Codes as Trees We first introduce a few notions related to binary trees, which are important in order to understand properties of prefix codes. Definition 3.2.3. Given a binary tree G = (V, E), we call the "label" of leaf v the binary sequence encoding the the unique path from the root to v, where 0 stands for "down and left" and 1 for "down and right"). Property 7. Consider a binary tree, then the labels of its leaves form a prefix code. Conversely, for any prefix code, there exists a binary tree whose leaves label are the codewords of that code. Proof: Consider v and v ′ two leaves of G such that the label of v is a prefix of the label of v ′ , then this means that v ′ is a descendent of v which is not a leaf, a contradicton. So the leaves labels form a prefix code. Conversely, consider a prefix code, and the following procedure to build the associated binary tre. Start with G a complete binary tree. If the code is not empty then select one of its codewords C(x), find v the node whose label is C(x) and remove all of the descendents of v from G and remove C(x) from the code. Repeat the procedure until the code is empty. 28 CHAPTER 3. DATA REPRESENTATION: FUNDAMENTAL LIMITS Therefore, there is an identity between binary trees and prefix codes: for every prefix code we can construct a binary tree representation of this code, and every binary tree represents a prefix code. This is fundamental in order to derive lower bounds on the code length and design codes which attain these bounds bound. 3.2.3 Kraft Inequality Proposition 3.2.4. For any prefix code we have: X 2−ℓ(x) ≤ 1. x∈X Also, given any (ℓ(x))x∈X satisfying this inequality one can construct a prefix code with codeword lengths (ℓ(x))x∈X . Proof: Let lm = maxx∈X ℓ(x) the largest codeword length. Let Z(x) ⊂ {0, 1}lm set of words that have C(x) as a prefix. Then |Z(x)| = 2lm −ℓ(x) . Furthermore Z(x) ∩ Z(x′ ) = ∅ as C is a prefix code. Summing over x proves the result: X X 2lm = |{0, 1}lm | ≥ |Z(x)| = 2lm −ℓ(x) . x∈X x∈X Conversely, assume that we are given codeword lengths (ℓ(x))x∈X satisfying the Kraft inequality, and one can construct a prefix code with those codeword lengths. Indeed, if ℓ(x) are sorted in increasing order, we can let C(x) the ℓ(x) first digits P of the binary representation of i<x 2−ℓ(i) . Kraft’s inequality is a fundamental limit and states that there is a constraint on the expected length that must be satisfied by any prefix code. 3.3 3.3.1 Optimal Codes and Entropy Lower Bound on the Expected Code Length Proposition 3.3.1. For any prefix code we have: L(C) ≥ H(X). with equality if and only if 2−ℓ(x) = p(x) for all x ∈ X . Proof: Consider the optimization problem (P1 ) X X Minimize p(x)ℓ(x) s.t. 2−ℓ(x) ≤ 1 , ℓ(x) ∈ N , x ∈ X x∈X x∈X 3.3. OPTIMAL CODES AND ENTROPY 29 Now consider its convex relaxation (P2 ): X X Minimize p(x)ℓ(x) s.t. 
2−ℓ(x) ≤ 1 , ℓ(x) ∈ R , x ∈ X x∈X x∈X From Lagrangian relaxation the solution of (P2 ) must minimize: X X J= p(x)ℓ(x) + λ 2−ℓ(x) x∈X x∈X The first order conditions read: ∂J = p(x) − λ(log 2)2−ℓ(x) = 0 , x ∈ X ∂ℓ(x) The optimal solution is of the form: 2−ℓ(x) = p(x) , x∈X λ(log 2) We find the value of λ by saturating the constraint: X X p(x) 1 1= 2−ℓ(x) = = . λ(log 2) λ(log 2) x∈X x∈X The optimal solution of (P2 ) is 2−ℓ(x) = p(x) , x ∈ X Its value lower bounds that of (P1 ) which concludes the proof: X X 1 p(x) log2 = H(X), p(x)ℓ(x) = p(x) x∈X x∈X A direct consequence of the Kraft inequality is that the source entropy is a lower bound on the expected length of any prefix code. Furthermore, in order to get close to the lower bound, one must make sure 2−ℓ(x) ≈ p(x). This shows that efficient codes assign short/long code words to frequent /infrequent symbols, in order to minimize the expected length. Now, the lower bound is not always attainable: to attain the bound we require 1 that for all x ∈ X : ℓ(x) = log2 ( p(x) ), where ℓ(x) is an integer. For instance, if p = (1/2, 1/4, 1/8, 1/8), then we can select ℓ = (1, 2, 3, 3), but if p = (2/3, 1/6, 1/6) 1 this is impossible, as log2 ( p(x) ) is not an integer. Two natural questions arise: how close to the entropy can the best prefix code perform, and how to derive the best prefix code in a computationally efficient manner ? 30 CHAPTER 3. DATA REPRESENTATION: FUNDAMENTAL LIMITS 3.3.2 Existance of Nearly Optimal Codes Proposition 3.3.2. There exists a prefix code with codeword lengths ℓ(x) = 1 ⌈log2 p(x) ⌉, such that: H(X) ≤ L(C) ≤ H(X) + 1 Proof: Let ℓ(x) = ⌈log2 X x∈X 2−ℓ(x) = X x∈X 1 ⌉ p(x) which satisfies the Kraft Inequality: 1 2−⌈log2 p(x) ⌉ ≤ X 1 2− log2 p(x) = x∈X X p(x) = 1. x∈X Recall that whenever ℓ(x), x ∈ X satisfy the Kraft inequality, then there exists a corresponding prefix code with lenghts ℓ(x), x ∈ X . The length of this code is: L(C) = X x∈X l 1 m p(x) log2 p(x) x∈X X 1 ≤ p(x) log2 +1 p(x) x∈X p(x)ℓ(x) = X = H(X) + 1. which concludes the proof Therefore, it is always possible to construct a prefix code whose length is within 1 bit of the entropic lower bound. Now, this result is only useful if H(X) is much greater than 1. The key idea is then to use this scheme to encode not one individual symbol (with entropy H(X)), but rather blocks of n independent symbols (with entropy nH(X)) for large n. 3.3.3 Asymptotically Optimal Codes Proposition 3.3.3. Let (X1 , ..., Xn ) i.i.d. copies of X. For any prefix code C for (X1 , ..., Xn ): H(X) ≤ L(C) n and there is a prefix code C for (X1 , ..., Xn ) such that: L(C) 1 ≤ H(X) + . n n 3.3. OPTIMAL CODES AND ENTROPY 31 Proof: From independence H(X1 , ..., Xn ) = nH(X), and select C as the optimal prefix code for (X1 , ..., Xn ). If one encodes blocks of independent symbols with length n (X1 , ..., Xn ), we are interested in the rate L(C) which is the average number of bits per source n symbol required to represent the data. Then the rate of any prefix code must be greater than the entropy H(X), and for large n there exists a prefix code whose rate is approximately equal to the entropy (within a factor of 1/n). Therefore this code is asymptotically optimal, and cannot be improved upon (in terms of rate). This result also justifies entropy not only as a measure of randomness but also as a measure of the average description length of a source symbol. 32 CHAPTER 3. 
Chapter 4

Data Representation: Algorithms

In this chapter we introduce algorithms to perform lossless compression under various assumptions, and demonstrate their optimality by comparing their performance to the entropic bound derived in the last chapter.

4.1 The Huffman Algorithm

4.1.1 Algorithm

Algorithm 4.1.1 (Huffman Algorithm). Consider a known distribution p(x), x ∈ X. Start with G = (|X|, E, w) a weighted digraph with |X| nodes, no edges E = ∅, and weights w(x) = p(x). Repeat the following procedure until G is a tree: find i and j the two nodes with no father and minimal weight, add a new node k to G with weight w(k) = w(i) + w(j), and add edges (k, i) and (k, j) to E.

The Huffman algorithm is a greedy algorithm which takes as an input the probability of each symbol p(x), x ∈ X, and iteratively constructs a prefix code with the goal of minimizing the expected code length.

4.1.2 Rationale

The Huffman algorithm is based on the idea that a good prefix code should verify three properties:
• (i) If p(x) ≥ p(y) then ℓ(y) ≥ ℓ(x)
• (ii) The two longest codewords should have the same lengths
• (iii) The two longest codewords differ by only 1 bit and correspond to the two least likely symbols
In fact, these facts will serve to show the optimality of the Huffman algorithm.

4.1.3 Complexity

At each step of the algorithm, one must find the two nodes with the smallest weight. There are |X| steps, and finding the two nodes with smallest weight by sorting the list of nodes by weight at each step requires O(|X| ln |X|). Hence a naive implementation of the algorithm requires time O(|X|² ln |X|). A smarter implementation keeps the list of nodes sorted at each step, so that finding the two nodes with smallest weight can be done in time O(1), and then inserts the new node in the sorted list using binary search in time O(ln |X|). Hence the Huffman algorithm can be implemented in time O(|X| ln |X|), almost linear in the number of symbols.

4.1.4 Limitations

While optimal, for sources with millions of symbols the Huffman algorithm is too complex to implement, and there exist other techniques, such as arithmetic coding (used in JPEG). Also, the Huffman algorithm requires knowing the source distribution p(x), x ∈ X, at the encoder, which is a practical limitation; to solve this problem there exist universal codes, which operate without prior knowledge of p. We will show some simple strategies to design universal codes.

4.1.5 Illustration

[Figure: Huffman tree for a five-symbol source; internal node weights 1, 1/2, 3/10 and 1/5, branches labeled 0 and 1, leaves A, B, C, D, E.]

x      A    B    C     D     E
p(x)   1/2  1/5  1/10  1/10  1/10
C(x)   0    10   110   1110  1111
ℓ(x)   1    2    3     4     4

Above is the result of the Huffman algorithm applied to a given source. One can readily verify that the more probable the symbol, the shorter the codeword, and that the two least probable symbols D and E have been assigned to the two leaves with highest depth. The length of the code is minimal amongst all prefix codes and equals:
\[
\frac{1}{2} \times 1 + \frac{1}{5} \times 2 + \frac{1}{10} \times (3 + 4 + 4) = 2
\]
which is only slightly larger than the source entropy:
\[
\frac{1}{2} \log_2(2) + \frac{1}{5} \log_2(5) + 3 \times \frac{1}{10} \log_2(10) \approx 1.96.
\]
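For illustration, here is a minimal heap-based sketch of Algorithm 4.1.1 in Python (standard library only; the function name and data layout are choices of this sketch). Tie-breaking between equal weights may yield codeword lengths different from the tree above, but the expected length is the same, 2 bits.

```python
import heapq

def huffman_code(probs):
    """Build a Huffman prefix code for a dict {symbol: probability}."""
    # Each heap entry: (total weight, tie-breaker, {symbol: codeword built so far}).
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)          # the two subtrees of minimal weight
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + cw for s, cw in c1.items()}
        merged.update({s: "1" + cw for s, cw in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

p = {"A": 0.5, "B": 0.2, "C": 0.1, "D": 0.1, "E": 0.1}
code = huffman_code(p)
print(code)
print("expected length:", sum(p[s] * len(code[s]) for s in p))  # 2.0 bits for this source
```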
4.1.6 Optimality

Proposition 4.1.2. The Huffman algorithm outputs a prefix code with minimal expected length L(C) amongst all prefix codes.

Proof: Assume that the source symbols are sorted so that p(1) ≤ ... ≤ p(|X|). Consider a code C with minimal length, and x, y two symbols such that x ≤ y and ℓ(x) < ℓ(y). Then construct a new code C′ such that C′(x) = C(y), C′(y) = C(x) and C′(z) = C(z) for z ≠ x, y. Then L(C′) − L(C) = (p(x) − p(y))(ℓ(y) − ℓ(x)) ≤ 0, with strict inequality if p(x) < p(y): in that case C cannot be optimal, a contradiction, and if p(x) = p(y) the exchange does not change the expected length. This shows that we may always assume that for any x, y such that x ≤ y we have ℓ(x) ≥ ℓ(y). Furthermore, since the two least probable symbols must have maximal depth, we can always assume that they are siblings (otherwise simply perform an exchange between symbol 2 and the sibling of symbol 1).

Consider C the prefix code with minimal length, and H the prefix code output by the Huffman algorithm. Further define C′ and H′ the codes obtained by considering C and H and replacing nodes 1 and 2 by their father with weight p(1) + p(2). Then we have:
\[
L(C') = L(C) - (p(1) + p(2)) \quad \text{and} \quad L(H') = L(H) - (p(1) + p(2)).
\]
We also realize that H′ is exactly the output of the Huffman algorithm applied to a source with |X| − 1 symbols. We can then prove the result by recursion. Clearly for |X| = 1 symbol the Huffman algorithm is optimal. Furthermore, if the Huffman algorithm is optimal for |X| − 1 symbols, this implies that L(C′) = L(H′), so that L(C) = L(H); hence the Huffman algorithm is optimal for |X| symbols.

4.2 Markov Coding

4.2.1 Markov Sources

Definition 4.2.1. A source is a Markov source with stationary distribution π(x) and transition matrix P(x|x′) if:
(i) X1, ..., Xn all have distribution π(x)
(ii) For any i we have P(Xi = xi | X1 = x1, ..., Xi−1 = xi−1) = P(xi | xi−1).

So far we have mostly considered memoryless sources, in which the symbols produced by the source X1, ..., Xn are i.i.d. random variables with some fixed distribution. We now consider the much more general case of Markov sources, where the symbols produced by the source X1, ..., Xn are correlated. The Markovian assumption roughly means that the distribution of the current symbol Xn only depends on the value of the previous symbol Xn−1. One can generate the symbols sequentially, by first drawing X0 according to the stationary distribution π, and, once Xn−1 is known, drawing Xn with distribution P(· | Xn−1). The matrix P is called the transition matrix, since P(xi | xi−1) is the probability of transitioning from xi−1 to xi in one time step. In a sense, a Markov process is a stochastic process with order one memory. It is also noted that, for π to actually be a stationary distribution, it has to verify the balance condition π = πP, since if Xn−1 has distribution π, then Xn has distribution πP and the two must be equal.

The simplest model of a Markov source is called the Gilbert-Elliott model, which has two equiprobable states and a given probability α of going from one to the other in one step:
\[
\pi = \left( \frac{1}{2}, \frac{1}{2} \right) \quad \text{and} \quad P = \begin{pmatrix} 1 - \alpha & \alpha \\ \alpha & 1 - \alpha \end{pmatrix}.
\]
[Figure: two-state diagram with states ON and OFF, each switching to the other with probability α.]

To generate the Gilbert-Elliott model, first draw X0 ∈ {0, 1} uniformly at random, and then for each n draw Xn = Xn−1 + Un modulo 2, where U1, ..., Un are i.i.d. Bernoulli with expectation α. In short, flip the value of the process at each step with probability α.

4.2.2 The Entropy of English

One of the initial motivations for studying Markov sources in the context of information theory was to model English text. Namely, consider English text as a sequence of letters Z1, ..., Zn, fix k ≥ 0, and let Xn = (Zn−1, ..., Zn−k) be the k letters that precede the n-th letter. Then, when k is large enough, Xn can be considered a Markov chain, meaning that the distribution of the n-th letter solely depends on the k letters that precede it.
The transition probabilities encode all of the structure of the English language: grammar rules, dictionary, frequency of words and so on. This means that, if we wanted to generate English text automatically, one could simply gather a very large corpus of text, and estimate the transition probabilities by figuring out, for any letter x, the probability that x can be the n-th letter of an English sentence knowing that the k previous letters are Xn−1 , ..., Xn−k . Doing this for a large enough k will create computer generated sentences which look very close to English sentences produced by a human. This also means that one could estimate the entropy of English using the following experiment imagined by Shannon: one person thinks about some english sentence, and another person attempts to guess the sentence letter-by-letter without prior information by asking binary question e.g "Is the next letter an ’a’" or "Is the next letter a vowel". Then the ratio between the number of questions and the number of letters in a phrase is a good estimate of the number of bits per symbol in English text. The entropy of English estimated throguh this experiment is usually about 1 bit per letter, much smaller than log2 (26) bits per letter, which is the entropy of an i.i.d. uniform sequence of letters. 4.2.3 Efficient Codes for Markov Sources Proposition 4.2.2. Let (X1 , ..., Xn ) Markov source and define: R(π, P ) = XX π(x)P (y|x) log2 x∈X y∈X 1 . P (y|x) Then for any prefix code C for (X1 , ..., Xn ): (1 − 1 H(X1 ) L(C) )R(π, P ) + ≤ n n n and there is a prefix code C for (X1 , ..., Xn ) such that: L(C) 1 H(X1 ) + 1 ≤ (1 − )R(π, P ) + . n n n 38 CHAPTER 4. DATA REPRESENTATION: ALGORITHMS Proof: Using the chain rule and the Markov property: H(X1 , ..., Xn ) = n X i=1 H(Xi |Xi−1 , ..., X1 ) = n X i=1 H(Xi |Xi−1 ) Furthermore H(Xi |Xi−1 ) = = XX P(Xi−1 = x, Xi = y) log2 x∈X y∈X XX π(x)P (y|x) log2 x∈X y∈X 1 P(Xi = y|Xi−1 = x) 1 = R(π, P ). P (y|x) Therefore: H(X1 , ..., Xn ) = (n − 1)R(π, P ) + H(X1 ). The lower bound holds as before, and applying Huffman coding to (X1 , ..., Xn ) yields a code with: (n − 1)R(π, P ) + H(X1 ) ≤ L(C) ≤ (n − 1)R(π, P ) + H(X1 ) + 1. We have therefore established that the rate of optimal codes for Markov sources is exactly R(π, P ) bits per symbol. Furthermore, optimal codes can be found using the same algorithms as in the memoryless case. One would first determine the transition probabilities for the Markov source at hand, which would then give us the probability of any sequence (X1 , ..., Xn ) and finally we may apply the Huffman algorithm. One can apply this (for instance) in order to encode English text optimally, since English can be seen as a Markov source. Now, one caveat of our approach is that we require to know the probability distribution of any sequence that can be generated by the source. In the case of memoryless sources this implies to know the distribution of a symbol, and in the case of Markov sources this implies knowing both the stationary distribution and the transition probabilities. This can often be a limitation in practice, and to solve this problem we study the concept of universal codes. 4.3 4.3.1 Universal Coding Universality Definition 4.3.1. Consider X1 , ..., Xn i.i.d. copies of X ∈ X with distribution p, and a coding scheme C : X n → D that does not depend on p. This coding scheme is universal if for all p: 1 Eℓ(C(X1 , ..., Xn )) → H(X) n→∞ n n→∞ lim 4.3. 
UNIVERSAL CODING 39 The idea of a universal code is that the code should have no prior knowledge of the data distribution, and that the code should work well irrespective of the data distribution. This is important in practical scenarious in which nothing is known about the data distribution. In fact, when the data distribution is known, we know that the smallest attainable rate is the entropy H(X), and if a code is universal, then it attains this rate for all distributions. 4.3.2 A Simple Universal Code for Binary Sequences Algorithm P 4.3.2 (Simple Adaptive Binary Code). Consider x1 , ..., xn ∈ {0, 1}n and let k = ni=1 xi . Output the codeword C(x1 , ..., xn ) which is the concatenation of (i) the binary represention of k (ii) the binary represention of the index of x1 , ..., xn in n X Ak = {(x1 , ..., xn ) ∈ {0, 1}n : xi = k} i=1 The main idea behind this P code is that the difficulty of encoding a sequence x1 , ..., xn depends on k = ni=1 xi , which is the number of 1’s in the sequence. Indeed, the number of possible values that x1 , ..., xn can have knowing k is precisely nk . This means that one could first encode the value of k (which requires at most log2 n bits) and subsequently encode the index of x1 , ..., xn amongst the n sequences which have the same value of k (which requires at most log2 k bits). This coding scheme assigns short codewords to sequences with k ≈ 0 and k ≈ n, and longer codewords to sequences with k ≈ n/2. The goal of encoding k along with the sequence is that the decoder will get to know k as well. Proposition 4.3.3. The simple adaptive binary code is universal. Proof: For a given value of k, since Ak has nk elements, the length of the corresponding codeword is n ℓ(C(x1 , ..., xn )) = log2 (n) + log2 k Using Stirling’s approximation log2 n! = n log2 (n/e) + O(log2 n) we have n log2 = n log2 (n/e) − k log2 (k/e) − (n − k) log2 ((n − k)/e) + o(log2 n) k so that 1 ℓ(C(x1 , ..., xn )) = h2 (k/n) + o(1) n 40 CHAPTER 4. DATA REPRESENTATION: ALGORITHMS Consider X ∼Bernoulli(p): n 1X k = Xi → p almost surely n→∞ n n i=1 and since n1 ℓ(C(X1 , ..., Xn )) ≤ 1 dominated convergence yields: 1 Eℓ(C(X1 , ..., Xn )) → h2 (p) = H(X) n→∞ n proving the result. The fact that this simple code is universal not only shows that such codes do exist, but also point to a more general idea for constructing such codes: one can attempt to estimate the value of the underlying distribution, encode that estimate along with the message. P Indeed, if X1 , ..., Xn are i.i.d. Bernoulli with parameter p, then k/n = (1/n) ni=1 Xi is a consistant estimator of p, and the knowledge of k is equivalent to knowing this estimator. In a certain way, universal codes perform both encoding and estimation at the same time (although the estimation might be explicit). Algorithm 4.3.4 (Simple Adaptive Code). Consider x1 , ..., xn ∈ X and let kx = Pn i=1 1{xi = x}. Output the codeword C(x1 , ..., xn ) which is the concatenation of (i) the binary represention of kx for all x ∈ X (ii) the binary represention of the index of x1 , ..., xn in n Ak = {(x1 , ..., xn ) ∈ X : n X i=1 1{xi = x} = kx for all x ∈ X } The simple code can be extended to non-binary sequences, by encoding the empirical distribution of the data (also known as the type of the Pnsequence, see further chapters). It is noted that the empirical distribution kx = i=1 1{xi = x} can be encoded in at most |X | log2 n bits, since for each x kx ∈ {0, ..., n}. 4.3.3 Lempel-Ziv Coding Algorithm 4.3.5 (Lempel Ziv Coding). 
Consider a string x1, ..., xn ∈ X^n and a window W ≥ 1. Start at position i = 1. Then repeat the following until i > n. First, find the largest k such that (xj, ..., xj+k) = (xi, ..., xi+k) for some j ∈ {i − 1 − W, ..., i − 1}. Second, if such a k exists, encode xi, ..., xi+k as the binary representation of (1, i − j, k) and skip to position i + k + 1; if such a k does not exist, encode xi as (0, xi) and skip to position i + 1.

The most famous universal codes are the Lempel-Ziv algorithms, and we present here the variant that uses a sliding window; there exist other versions, such as the one based on trees. The algorithm encodes the sequence by first parsing it into a set of words, and then encodes each word based on the previous words. The central idea behind why this coding scheme works is that if a word (x1, ..., xk) of size k has a relatively high probability, then it is likely to appear in a window of size W when W is large enough. In turn, this word can be represented with 1 + log2 W + log2 k bits instead of k bits. In short, words that are frequent tend to appear repeatedly, and therefore can be encoded by providing a pointer to one of their past occurrences, which drastically reduces the number of bits required.

Example 2. Consider the following string of 30 bits:
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0
After parsing with a window size of W = 4 we get 8 phrases:
0; 0, 0; 1; 0, 0, 0; 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0; 1; 0, 0, 0; 0, 0
Those phrases are then represented as:
(0, 0); (1, 1, 2); (0, 1); (1, 4, 3); (1, 1, 17); (0, 1); (1, 4, 3); (1, 1, 2)

The above example illustrates how the algorithm operates on a binary sequence. The sliding window enables us to encode long runs of consecutive 0's with relatively few bits. Indeed, we manage to encode a run of 17 consecutive 0's by the word (1, 1, 17), which can be represented using roughly 1 + log2(4) + log2(17) ≈ 7 bits: a net gain of 17 − 7 = 10 bits.

Proposition 4.3.6. Lempel-Ziv coding is universal.

Lempel-Ziv coding has the advantages of being very easy to implement, of requiring no knowledge about the data distribution, and of being universal. We do not present the proof here, due to its complexity.
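The parsing step of Algorithm 4.3.5 can be sketched in a few lines of Python. This is a simplified illustration of the sliding-window idea, producing tokens rather than actual bit strings; applied to the string of Example 2 with W = 4 it reproduces the eight phrases listed above (ties between equally long matches are broken in favour of the most recent one).

```python
def lz_window_parse(x, W):
    """Parse sequence x into tokens (0, symbol) or (1, offset, length)."""
    tokens, i, n = [], 0, len(x)
    while i < n:
        best_len, best_off = 0, 0
        for j in range(max(0, i - W), i):            # candidate starts inside the window
            k = 0
            while i + k < n and x[j + k] == x[i + k]:
                k += 1                               # overlapping matches are allowed
            if k >= best_len and k > 0:              # prefer the most recent match on ties
                best_len, best_off = k, i - j
        if best_len > 0:
            tokens.append((1, best_off, best_len))
            i += best_len
        else:
            tokens.append((0, x[i]))
            i += 1
    return tokens

s = [0, 0, 0, 1] + [0] * 20 + [1] + [0] * 5          # the 30-bit string of Example 2
print(lz_window_parse(s, W=4))
```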
DATA REPRESENTATION: RATE-DISTORSION THEORY In fact, in some cases, even if data is already discrete, one may want to represent it using less bits, even at the expense of losing some information. For instance, we might be interested in reducing the size (in bits) of an image or a sound file as long as, after the compression, one can reconstruct them and the reconstructed image or sound looks or sounds similar to a human. This means that most of the information has been preserved. We call this process lossy compression. Since quantization and lossy compression can be understood in the same framework, we will use both terms interchangeably. 5.1.1 Lossless vs Lossy Compression It is noted that lossy compression is different from lossless compression studied in the previous chapters, in the sense that lossless compression allows exact reconstruction of the data. This means that for lossy compression we need some criterion in order to assess how much information is lost in the process, and this criterion is called a distorsion measure. 5.1.2 The Quantization Problem We will study the quantization problem in the information theoretic framework, defined as follows: • The source generates data X n = (X1 , ..., Xn ) ∈ X n drawn i.i.d. from a distribution p(x) • The encoder encodes the data as fn (X n ) ∈ {1, ..., 2nR } with nR bits. • The decoder decodes the data by X̂ n = gn (fn (X n )) The mappings fn and gn define the strategy for encoding the data and decoding the data, and given a rate R the goal is to select these mappings in order to minimize the distorsion defined as: n 1X D= E(d(Xi , X̂i )) n i=1 where d is a positive function, e.g. d(x, x′ ) = (x′ − x)2 . A few remarks are in order. The mapping fn is indeed a quantizer as it maps a vector of n source symbols (whose values may be continouous or discrete) to a finite integer between 1 and 2nR , or equivalently to a string of nR bits, so that R measures the number of bits per source symbols at the quantizer. The mapping gn is a decoder and attempts to reconstruct the original data. The n source symbols 5.2. SCALAR QUANTIZATION 45 X n are quantized as fn (X n ), and subsequently reconstructed as X̂ n = gn (fn (X n )) so that one would like X̂ n to be as close as possible to X n and we do so by minimizing D, which can be seen as a measure of dissimilarity between X n and X̂ n , or a measure of how much information was lost in the process. Of course, the choice of d impacts the strategy we should use, and should be chosen wisely. For instance if one is dealing with images, so that X n is an image and Xi is its i-th pixel, then D being small should imply that X n and X̂ n look similar to a human. 5.2 Scalar Quantization We first study scalar quantization, where n = 1 so that we compress symbols one at a time, with the goal of minimizing the per-symbol distorsion. 5.2.1 Lloyd-Max Conditions There exists a general result to find optimal quantization schemes called the LloydMax conditions, which gives necessary conditions that the optimal quantizer must verify. Proposition 5.2.1 (Lloyd-Max). An optimal codebook must satisfy two conditions: (i) The encoder f should verify for all x ∈ X f (x) ∈ arg min i∈{1,...,2R } d(g(i), x) (ii) The decoder g should verify for all i ∈ {1, ..., 2R } g(i) ∈ arg min E(d(x′ , X)|f (X) = i) ′ x ∈X Proof We have that: D = E d(g(f (X)), X) ≥ E min i∈{1,...,2R } d(g(i), X) and 2R X D = E d(g(f (X)), X) = E(d(g(i), X)|f (X) = i)P(f (X) = i) i=1 R ≥ 2 X i=1 min E(d(x′ , X)|f (X) = i)P(f (X) = i) x′ ∈X 46 CHAPTER 5. 
DATA REPRESENTATION: RATE-DISTORSION THEORY Therefore, if (i) or (ii) are not satisfied, we can decrease the value of the distortion by modifying f or g. The most important insight gained from Lloyd max are twofold. First, to design the quantizer, a point should be mapped to the closest reconstruction point. Second, when designing the decoder, one should select the reconstruction points to minimize the conditional expected distorsion. In fact this shows that if the quantizer f is known, then finding g is easy, and vice-versa, and suggests an iterative algorithm: starting with (f, g) arbitrary and alternatively minimize over f and g until convergence. This algorithm may not always converge to the optimal solution and should be seen as a heuristic. 5.2.2 Uniform Distribution Distribution Quantization Points 1.2 1 p(x) 0.8 0.6 0.4 0.2 0 0 0.1 0.2 0.3 0.4 0.5 x 0.6 0.7 0.8 0.9 1 Proposition 5.2.2. Consider n = 1, X ∼Uniform([0, 1]) and distorsion function d(x, x′ ) = (x − x′ )2 . −2R Then the minimal distorsion is D = 2 12 , the optimal quantization scheme is uniform quantization: f (X) ∈ arg min |X − g(i)| i=1,...,2R with g(i) = i2−R . Proof: Let us assume without loss of generality that g(1) < ... < g(2R ). From Lloyd-Max, the quantization scheme should be such that f (x) ∈ arg min i∈{1,...,2R } d(g(i), x) = arg min i∈{1,...,2R } |g(i) − x| Furthermore, knowing that f (X) = i, X has uniform distribution over interval [ g(i) + g(i − 1) g(i) + g(i + 1) , ] 2 2 This implies that g(i) ∈ arg min E(d(x′ , X)2 |f (X) = i) = E(X 2 |f (X) = i) = ′ x ∈X g(i + 1) − g(i − 1) 2 5.2. SCALAR QUANTIZATION 47 One can readily check by recursion that this implies g(i) = i/2R for i = 1, ..., R. 1 −2R 2 , which which concludes the proof. The distorsion is hence D = 12 When data is uniformly distributed over an interval, then the optimal quantization scheme is uniform quantization, which simply partitions the interval in 1 −2R 2R intervals of equal size, and the distorsion is 12 2 , so that when the rate is increased by 1 bit, the distorsion is divided by 4 (or decreased by 6dB). It is also noted that uniform quantization is equivalent to rounding the data to the nearest integer multiple of 2−R so it is very easy to implement. 5.2.3 Gaussian Distribution with one bit 0.4 Distribution Quantization Points 0.35 0.3 p(x) 0.25 0.2 0.15 0.1 0.05 0 −2 −1.5 −1 −0.5 0 x 0.5 1 1.5 2 Proposition 5.2.3. Consider n = 1, R = 1, X ∼ N (0, σ 2 ) and distorsion function d(x, x′ ) = (x − x′ )2 . Then the minimal distorsion is D = π−2 σ 2 and the optimal quantization scheme π is the sign ( 1 if X < 0 f (X) = 2 if X ≥ 0 and the optimal reconstruction is q − 2σ2 if i = 1 q π g(i) = + 2σ2 if i = 2 π Proof: Let us assume without loss of generality that g(1) < g(2). From Lloyd-Max, the quantization scheme should be such that f (x) ∈ arg min i∈{1,...,2R } d(g(i), x) = arg min |g(i) − x| i∈{1,2} Since X has the same distribution as −X one must have that g(2) = −g(1) hence f (X) = 1 if X < 0 and f (X) = 2 otherwise. Furthermore r g(2) ∈ arg min E(d(x′ , X)2 |f (X) = 2) = E(X|f (X) = 1) = E(X|X ∈ [0, +∞]) = ′ x ∈X 2σ 2 π 48 CHAPTER 5. DATA REPRESENTATION: RATE-DISTORSION THEORY hence r g(2) = E(X|X ∈ [0, +∞]) = 2σ 2 π One may readily check that D = π−2 σ 2 which concludes the proof. π If only R = 1 bit per symbol is available, the most efficient quantizer consists in simply encoding the sign of the data, so that the information ofq the absolute value is lost. 
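To make the Lloyd-Max conditions concrete, here is a minimal Monte-Carlo sketch (assuming NumPy; not part of the original notes) of the alternating scheme suggested in Section 5.2.1, applied to the one-bit Gaussian case: it alternates the nearest-point encoder and the centroid decoder on a large sample, and converges to the values given in Proposition 5.2.3.

import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
x = rng.normal(0.0, sigma, size=200_000)        # large i.i.d. sample of the source

g = np.array([-0.1, 1.3])                       # arbitrary initial reconstruction points
for _ in range(50):
    # Encoder step: map each sample to the nearest reconstruction point.
    idx = np.abs(x[:, None] - g[None, :]).argmin(axis=1)
    # Decoder step: move each reconstruction point to the centroid of its cell.
    g = np.array([x[idx == i].mean() for i in range(2)])

idx = np.abs(x[:, None] - g[None, :]).argmin(axis=1)
D = np.mean((x - g[idx]) ** 2)
print("reconstruction points:", np.sort(g))     # close to ±sqrt(2/pi) ≈ ±0.798
print("distortion:", D, " theory:", (np.pi - 2) / np.pi * sigma**2)   # ≈ 0.363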
It is also noted that the optimal reconstruction points ± expected value of the absolute value of X. 5.2.4 2σ 2 π are the General Distributions It should be noted that even for Gaussian distributions and R ̸= 1, finding the optimal quantization scheme is not straightforward. In fact, for many distributions, the optimal quantization scheme is not known. 5.3 Vector Quantization We now study vector quantization n > 1, where we attempt to encode several source symbols at the same time. 5.3.1 Vector Quantization is Better than Scalar Quantization It would be tempting to think that, if one has a good scalar quantizer, one could simply apply it to sequences of n independent symbols, and this would result in a low distorsion. We propose to illustrate this with an example, showing that this intuition is not only false, but also showing that, in general, randomization can be a very powerful tool in order to perform vector quantization. Consider X n = (X1 , ..., Xn ) i.i.d. uniform in [0, 1], so that X n is uniformly distributed on [0, 1]n and let us apply the optimal scalar quantizer to each of its entries, namely: f s (X n ) = (arg min |X1 − i2−R |, ..., arg min |Xn − i2−R |) i=1,...,2R i=1,...,2R and g s (in ) = (i1 2−R , ..., in 2−R ) Then one may readily check that the reconstruction error g s (f s (X n )) − X n 5.3. VECTOR QUANTIZATION 49 1 R 2 , and therefore the has i.i.d. uniformly distributed entries with variance 12 1 −R achieved distorsion is D = 12 2 . On the other hand, consider another quantization strategy where the quantization points g(1), ..., g(nR) are selected uniformly at random in [0, 1]n . One may readily check that, from independence of g(1), ..., g(nR): P( with rn2 = 1 1 nR min d(X, g(i)) ≥ 2R ) = P(d(X, g(1)) ≥ rn2 )2 nR n i=1,...,2 12 1 n2R . 12 Furthermore (πrn2 )n/2 Γ(n/2 + 1) P(d(X, g(1)) ≤ rn ) ≈ since the probability of d(X, g(1)) ≤ rn2 can be approximated by the Lebesgue measure a ball of radius rn centerered at X. We may then use Stirling’s approximation to show that nR P(d(X, g(1)) ≥ rn )2 → 0 n→∞ Therefore, this quantization strategy has distorsion lower than probabilty, and is superior to scalar quantization. 5.3.2 1 R 2 12 with high Paradoxes of High Dimensions This means that this vector quantization is provably better than scalar quantization, and this also shows that, in some cases, drawing the representation points of the quantizer accoding to some distribution, rather than according to some deterministic rule, can perform better. This is due to the counterintuitive fact that rectangular grids do not fill out space very well in high dimensions, while i.i.d sequences to to fill out the space much better (this is in fact the basis of Monte-Carlo methods). 1 1 Quantization Points 0.8 0.8 0.7 0.7 0.6 0.6 2 0.9 0.5 x x2 Quantization Points 0.9 0.5 0.4 0.4 0.3 0.3 0.2 0.2 0.1 0 0.1 0 0.1 0.2 0.3 0.4 0.5 x1 0.6 0.7 0.8 0.9 1 0 0 0.1 0.2 0.3 0.4 0.5 x1 0.6 0.7 0.8 0.9 1 Another related counterintuitive fact is that even if two random variables X, Y have no relationship with each other, quantizing them together is always better than quantizing them separately. 50 CHAPTER 5. DATA REPRESENTATION: RATE-DISTORSION THEORY 5.3.3 Rate Distorsion Function Definition 5.3.1. A rate distortion pair (R, D) is achievable if and only if there exists a sequence of (2nR , n) distortion codes (fn , gn ) with n 1X lim sup E d(gn (fn (X n ))i , Xi ) ≤ D. n→∞ n i=1 Definition 5.3.2. The rate distortion function R(D) for a given D is the infimum over R such that (R, D) is achievable. 
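Before discussing achievability, it may help to keep in mind one standard closed-form example, quoted here without proof (it is not derived in these notes): for a Bernoulli(p) source with Hamming distortion d(x, x̂) = 1{x ≠ x̂}, the rate distortion function is

\[
R(D) =
\begin{cases}
h_2(p) - h_2(D), & 0 \le D \le \min(p, 1-p), \\
0, & D > \min(p, 1-p).
\end{cases}
\]

In particular, at D = 0 one recovers the lossless compression rate h_2(p), and allowing even a small reconstruction error already reduces the required rate.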
Given a rate R and a distorsion D, we say that (R, D) is achievable if, asymptotically when n is large, there exists a sequence of quantizers whose distorsion is at most D. We insist on the fact that for each value of n, an appropriate quantizer must be found and what matters is the limit behaviour of this sequence. This means that the notion of achievability is asymptotic, and there may not exist quantizers with rate R and distorsion D for small values of n. In a sense achievability quantifies the smallest distorsion for n = +∞. Clearly the larger the allowed distorsion, the smaller the rate can be with an efficient quantizer, and a natural question is: what is the optimal trade-off between distorsion and rate. The answer to this question is called the rate distorsion function. Now computing this function may be difficuly in general, and we will show how this may be done by maximizing the mutual information. 5.4 Rate Distorsion Theorem Definition 5.4.1. Define the information rate function RI (D) = min I(X; X̂) E(d(X,X̂))≤D minimizing over all possible conditional distributions p(x̂|x). Theorem 5.4.2. The information rate function equals the rate distorsion function. We will prove the rate-distorsion theorem, by showing that the information rate function is an information theoretic limit of the problem, and then construct efficient rate-distorsion codes which reach this limit. 5.4.1 Lower Bound Proposition 5.4.3. Consider a memoryless source. Then any rate R < RI (D) is not achievable at distorsion D. 5.4. RATE DISTORSION THEOREM 51 Proof: Let us consider a (2nR , n) distortion code (fn , gn ). Since fn ∈ {1, ..., 2nR }: H(fn (X n )) ≤ nR Using the fact that conditional entropy is positive I(X n ; fn (X n )) ≤ H(fn (X n )) Furthermore, from the data processing inequality I(X n ; X̂ n ) = I(X n ; gn (fn (X n ))) ≤ I(X n ; fn (X n )) Since the source is memoryless I(X n ; X̂ n ) = H(X n ) − H(X n |X̂ n ) with n H(X ) = n X H(Xi ) i=1 and, using the chain rule and the fact that conditoning reduces entropy n H(Xn |X̂ ) = n X i=1 n H(Xi |Xi−1 , ..., X1 , X̂ ) ≤ n X i=1 H(Xi |X̂i ) Putting things together n X i=1 I(Xi ; X̂i ) ≤ I(X n ; X̂ n ) By definition of the information rate function n X i=1 R(Di ) ≤ n X I(Xi ; X̂i ) i=1 with PnDi = E(d(Xi , X̂i )) the distorsion for the i-th symbol. We have D = 1 i=1 Di and since the mutual information is convex, so is the rate distorsion n function, which in turn implies: nR(D) ≤ n X R(Di ) i=1 We have proven that R(D) ≤ R, so that R(D) is indeed a lower bound on the rate that can be achieved at distorsion level D. 52 CHAPTER 5. DATA REPRESENTATION: RATE-DISTORSION THEORY 5.4.2 Efficient Coding Scheme: Random Coding We now propose a scheme known as random coding, so that any rate distorsion pair (R, D) which verifies the previous lower bound can be achieved with this scheme. In this sense, random coding is optimal. We do not provide a complete proof for the optimality of random coding in this context. We will go into further details in the next chapter on channel coding. Algorithm 5.4.4 (Random Coding for Rate Distorsion). Consider the following randomized scheme to construct a rate-distorsion codebook: • (Codebook generation) Let p(x̂|x) a distribution such that R(C) = I(X; X̂) and E(d(X, X̂)) ≤ D. Draw C = {X̂ n (i), i = {1, ..., 2nR }} where X̂ n (i) is an i.i.d. sample of size n from p(x̂) • (Encoding) Encode X n by W ∈ {1, ..., 2nR } with W the smallest W such that (X n , X̂ n (W )) is distortion typical. If such a W does not exist let W = 1. 
• (Decoding) Output the representation point X̂ n (W ) It is noted that this is a randomized strategy, so that both the encoder fn and the decoder gn are in fact random. While it may seem counter-intuitive to select a random codebook, this in fact eases the analysis very much, because it allows us to average over the codebook itself. Furthermore, when performing this averaging, as long as we are able to prove that the codebook has good performance in expectation, it automatically implies that there exists a codebook with good performance. This strategy is common in information theory as well as other fields (for instance random graphs), and is known as the "probabilistic method". The disadvantage of random coding with respect to, for instance, Huffman coding, is that it is much more complex to implement. Proposition 5.4.5. There exists a sequence of codebooks achieving any rate distorsion pair (R, D) with R(D) > D. The main idea centers around the idea of typicality, in that case rate-distorsion typicality. Proposition 5.4.6. Consider (X n , X̂ n ) = (Xi , X̂i )i=1,...,n i.i.d. with p.d.f. p(x, x̂). 5.5. RATE DISTORSION FOR GAUSSIAN DISTRIBUTIONS 53 Given ϵ > 0 define the distortion typical set: Anϵ n n 1 1X n n n n log2 − H(X) = (x , x̂ ) ∈ X × X : n i=1 p(xi ) n n 1X 1X 1 1 + − H(X̂) + − H(X, X̂) log2 log2 n i=1 p(x̂i ) n i=1 p(xi , x̂i ) n o 1X d(xi , x̂i ) − E(d(X, X̂)) ≤ ϵ . + n i=1 Then: P((X n , X̂ n ) ∈ Anϵ ) → 1. n→∞ The point of random coding is that the codewords are drawn in an i.i.d. then the pairs (X n , X̂ n ) will be distorsion typical so that d(X n , X̂ n ) will be arbitrairly close to D with high probability. 5.5 Rate Distorsion for Gaussian Distributions Computing the rate-distorsion function is usually difficult, as it is the solution to a maximization problem, and does not admit a closed-form expression for many distributions. For Gaussian variables and vectors however, the solution can be computed in closed form, and gives several interesting insights into quantization in general. 5.5.1 Gaussian Random Variables 3.5 Rate Distorsion Function 3 2.5 2 1.5 1 0.5 0 0 0.5 1 1.5 Distorsion ′ 2 Proposition 5.5.1. Consider X ∼ N (0, σ 2 ) with d(x, x′ ) = (x − x 2) . The rate distortion function is given by: R(D) = max 21 log2 σD , 0 Proof: We must minimize I(X; X̂) where X ∼ N (0, σ 2 ) and (X, X̂) verifies E((X − X̂)2 ) ≤ D. By definition of the mutual information: I(X; X̂) = h(X) − h(X|X̂). 54 CHAPTER 5. DATA REPRESENTATION: RATE-DISTORSION THEORY Since X ∼ N (0, σ 2 ) we have h(X) = 1 log2 σ 2 2 Furthermore, since conditioning reduces entropy: h(X|X̂) = h(X − X̂|X̂) ≤ h(X − X̂) Now, since the Gaussian distribution maximizes entropy knowing the variance: h(X − X̂) ≤ 1 log2 var(X − X̂) 2 Since var(X − X̂) ≤ D, replacing we have proven that I(X; X̂) ≥ 1 σ2 log2 2 D Now consider the following joint distribution X = X̂ + Z where X̂ and Z are independent and gaussian with respective variances σ 2 − D and D. Then one may readily check that E((X − X̂)2 ) ≤ D and that I(X; X̂) = 1 σ2 log2 2 D which proves the result. The rate-distorsion function for gaussian variables is indeed convex and decreasing, and in particular this function is 0 for any D > σ 2 , due to the fact that, even with no information, one can achieve a distorsion of σ 2 , by representing X by a fixed value equal to E(X). Furthermore, for D < σ 2 , when R is increased by 1, D is divided by 4 so each added bit of quantization decreases the quantization error by 6dB. 
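A short numerical check of this 6 dB-per-bit rule (a sketch assuming NumPy, not part of the original notes): inverting R(D) = (1/2) log_2(σ²/D) gives D(R) = σ² 2^{−2R}, so each extra bit of rate divides the distortion by 4, i.e. reduces it by about 6.02 dB.

import numpy as np

sigma2 = 1.0
R = np.arange(0, 6)                        # rate in bits per symbol
D = sigma2 * 2.0 ** (-2 * R)               # inverse of R(D) = (1/2) log2(sigma^2 / D)
drop_dB = 10 * np.log10(D[:-1] / D[1:])    # distortion reduction per extra bit
print(D)          # 1, 0.25, 0.0625, ...
print(drop_dB)    # ≈ 6.02 dB for every added bit of rate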
Finally, as predicted previously, vector quantization is better than scalar quantization. For instance, consider R = 1, using vector quantization on 2 (X1 , ..., Xn ) with a rate of R = 1 yields a distorsion of D = σ4 , while using scalar quantization on each entry of (X1 , ..., Xn ) with a rate of R = 1 yields a distrorsion σ 2 . Hence in that example vector quantization is 45% more efficient of D = π−2 π than scalar quantization. 5.5.2 Gaussian Vectors Proposition 5.5.2. Consider X1 , ..., Xk independent with Xj ∼ N (0, σj2 ) and P distortion function d(x, x′ ) = kj=1 (xj − x′j )2 . 5.5. RATE DISTORSION FOR GAUSSIAN DISTRIBUTIONS 55 The rate distortion function is given by: R(D) = k X 1 j=1 where λ⋆ is chosen such that Pk j=1 2 log2 σj2 min(λ⋆ , σj2 ) min(λ⋆ , σj2 ) = D. Proof: We must minimize I(X k ; X̂ k ) where X1 , ..., Xk are independent with Xj ∼ N (0, σj2 ) and (X k X̂ k ) verifies E((X k − X̂ k )2 ) ≤ D. By definition of the mutual information: I(X k ; X̂ k ) = h(X k ) − h(X k |X̂ k ). Since X1 , ..., Xk are independent: h(X) = k X h(Xi ) i=1 and since conditioning reduces entropy: k k h(X |X̂ ) = k X i=1 k h(Xi |Xi−1 , ..., X1 , X̂ ) ≤ Therefore I(X k ; X̂ k ) ≥ k X k X i=1 h(Xi |X̂i ) I(Xi ; X̂i ) i=1 Define Di = E((X̂i − Xi )2 ) the distorsion attributed to component i. From the scalar case studied in the previous case: I(Xi ; X̂i ) ≥ Hence 1 σ 2 + log2 i 2 Di k X 1 σ 2 + log2 i I(X ; X̂ ) ≥ 2 Di i=1 k k Furthermore, one can achieve equality by choosing (Xi , X̂i ) independent with distribution as in the scalar case. Hence the rate distorsion function is the solution to the optimization problem Minimize k k X X 1 σ 2 + log2 s.t. Di = D. 2 D i i=1 i=1 56 CHAPTER 5. DATA REPRESENTATION: RATE-DISTORSION THEORY From Lagrangian relaxation the solution of this optimiation problem must be such 2 that there exists Pλk > 0 such that either Di = σi or otherwise Di = λ. Selecting λ to ensure that i=1 Di = D yields the result. For Gaussian vectors with independent entries, the rate distorsion function can be computed as well, and the solution is given by an allocation called "reverse water filling" which attempts to equalize the distortion for each component. Bits are allocated mostly to components with high variance, and components with low variance are simply ignored. This makes sense since, for an equal amount of bits, the larger the variance, the larger the distorsion. This can be generalized to gaussian vectors with non-diagonal covariance matrices by performing reverse waterfilling on the eigenvectors/eigenvalues of the covariance matrix. Chapter 6 Mutual Information and Communication: discrete channels We now move away from data representation, and focus on communication over noisy channels. For this problem, we are concerned with the maximal rate at which information can be reliably sent over the channel, in the sence that the receiver should be able to retrieve the sent information with high probability. As we shall see, information theoretic tools provide a complete characterization of the problem in terms of achievable rates as well as coding strategies. 6.1 6.1.1 Memoryless Channels Definition We will consider the case where a sender selects n inputs X n = (X1 , ..., Xn ) from a finite alphabet X , and a receiver observes corresponding outputs Y n = (Y1 , ..., Yn ) from another finite alphabet Y. The relationship between X n and Y n is called a channel. 
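To make this object concrete, a memoryless channel can be viewed as a conditional sampler applied independently to each input symbol. Below is a minimal sketch (assuming NumPy; the binary symmetric channel used here is formally introduced in Section 6.1.5).

import numpy as np

rng = np.random.default_rng(0)

def bsc(x, p):
    """Binary symmetric channel: flips each input bit independently with probability p."""
    flips = rng.random(x.shape) < p
    return (x + flips) % 2

x = rng.integers(0, 2, size=100_000)       # channel inputs
y = bsc(x, p=0.1)                          # channel outputs, one channel use per symbol
print("empirical crossover frequency:", np.mean(x != y))   # close to 0.1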
The main problem we aim to solve is how much information can be reliably exchanged between the sender and the receiver as a function of n. The ratio between the amount of information exchanged and n is called the rate (in bits per channel use).

Definition 6.1.1. A channel with input X^n = (X_1, ..., X_n) and output Y^n = (Y_1, ..., Y_n) is memoryless with transition matrix p(y|x) = P(Y = y | X = x) if
p_{Y^n|X^n}(y^n | x^n) = \prod_{i=1}^n p(y_i | x_i).

Of course, a channel can model almost any point-to-point communication scenario regardless of the medium: wireless communication, optical communication, and so on. We will focus mostly on memoryless channels, which already constitute a rather rich model. There exist more general models, such as Markovian channels and the most general model of ergodic channels. It is noted that, if a channel is memoryless and X^n = (X_1, ..., X_n) is i.i.d., then Y^n = (Y_1, ..., Y_n) is also i.i.d.

6.1.2 Information Capacity of a Channel

Definition 6.1.2. The information channel capacity of a memoryless channel is defined as
C = max_{p_X} I(X; Y),
where the maximum is taken over all possible input distributions.

The information channel capacity is simply the largest amount of mutual information that can be achieved by selecting the input distribution appropriately. It turns out that this number also represents the number of bits per channel use that can be reliably exchanged between the sender and the receiver, as we shall see later.

6.1.3 Examples

We now compute the information channel capacity for a few simple channel models.

Noiseless Binary Channel

[Diagram: noiseless binary channel; input 0 is mapped to output 0 and input 1 to output 1.]

Since X can be retrieved perfectly from Y we have H(X|Y) = 0 and the mutual information is
I(X; Y) = H(X) − H(X|Y) = H(X).
To maximize I(X; Y) one must maximize H(X), so the optimal input distribution is uniform on {0, 1} and the capacity is
C = log_2 2 = 1.

6.1.4 Non-Overlapping Outputs Channels

[Diagram: each input reaches outputs that no other input can produce; input 0 goes to outputs 0 or 1 with probabilities p_0 and 1 − p_0, input 1 goes to outputs 2 or 3 with probabilities p_1 and 1 − p_1.]

Once again X can be retrieved perfectly from Y, so H(X|Y) = 0 and the mutual information is
I(X; Y) = H(X) − H(X|Y) = H(X).
To maximize I(X; Y) one must maximize H(X), so the optimal input distribution is uniform on X and the capacity is
C = log_2 |X|,
which generalizes the previous case.

6.1.5 Binary Symmetric Channel

[Diagram: binary symmetric channel; each input bit is received correctly with probability 1 − p and flipped with probability p.]

Knowing X, Y equals X with probability 1 − p and its complement with probability p, so H(Y|X) = h_2(p) and the mutual information is
I(X; Y) = H(Y) − H(Y|X) = H(Y) − h_2(p).
To maximize I(X; Y) one must maximize H(Y); choosing X uniform on {0, 1} makes Y uniform as well, so the capacity is
C = log_2 2 − h_2(p) = 1 − h_2(p).

6.1.6 Typewriter Channel

[Diagram: typewriter channel with four inputs; each input is mapped with equal probability to one of two outputs, itself or the next symbol (cyclically).]

Knowing X, Y has two equiprobable values, therefore
I(X; Y) = H(Y) − H(Y|X) = H(Y) − 1.
One would like to maximize H(Y) by selecting the distribution of X appropriately. If we select X uniformly distributed, then Y is also uniformly distributed, so the optimal input distribution is uniform on X and the capacity is
C = log_2 |X| − 1 = log_2 4 − 1 = 1.

6.1.7 Binary Erasure Channel

[Diagram: binary erasure channel; each input bit is received correctly with probability 1 − α and erased (replaced by ×) with probability α.]

Knowing X, Y equals X with probability 1 − α and × with probability α, so the conditional entropy is
H(Y|X) = h_2(α).
On the other hand, Y has 3 possible values 0, × and 1 with respective probabilities (1 − α)(1 − p),
CHANNEL CODING 61 α and p(1 − α) where p = P(X = 0) so the entropy is: 1 (1 − α)(1 − p) 1 1 + α log2 + (1 − α)p log2 α (1 − α)p = h2 (α) + (1 − α)H(X) H(Y ) = (1 − α)(1 − p) log2 Therefore the mutual information is I(X; Y ) = (1 − α)H(X) To maximize I(X; Y ) one must maximize H(X), so the input distribution maximizing X should be unifom in {0, 1} and the capacity is C = (1 − α) log2 2 = (1 − α) 6.2 6.2.1 Channel Coding Coding Schemes We consider coding over blocks of n channel uses. Definition 6.2.1. Consider the following procedure: • The transmitter chooses a message W ∈ {1, ..., M } • She transmits a codeword X n (W ) = (X1 (W ), ..., Xn (W )) ∈ X n • The receiver sees Y n distributed as p(y n |xn ) • She decodes the message using some decoding rule Ŵ = g(Y n ) Any such procedure is called an (M, n) channel code with rate R= 1 log2 M n and error probability: Pen = P(Ŵ ̸= W ). 62 CHAPTER 6. COMMUNICATION: DISCRETE CHANNELS 6.2.2 Example of a Code for the BSC 1−p b 0 b 0 p p 1 b b 1 1−p For the binary symmetric channel, a code given by a subset C of {0, 1}n of size 2nR , along with a decoding rule. The distribution of the channel output y n conditional to transmitting some some codeword xn is given by n p d(xn ,yn ) Y p(y |x ) = (1 − p)1{xi =yi } p1{xi ̸=yi } = (1 − p)n . 1−p i=1 n n where d(xn , y n ) is the so-called Hamming distance between xn and y n , i.e. it is simply the number of entries of xn that are different from that of y n . One can prove that the optimal decoding rule is maximum likelihood decoding (in the sense that this rule minimizes the error probability) which consists in selecting the codeword xn ∈ C that is the most likely to have been transmitted: x̂n = arg max p(xn |y n ) = arg min d(xn , y n ) n n x ∈C x ∈C We notice that this is equivalent to minimizing the Hamming distance between the output and the codeword d(xn , y n ). Also note that, if C is very large, this might be very hard to do computationally. x3 x4 = x1 ⊕ x3 x1 x5 = x2 ⊕ x3 x6 = x1 ⊕ x2 x2 A well known code for the BSC is the so called Hamming code. C = {xn ∈ {0, 1}n : (x4 , x5 , x6 ) = (x1 ⊕ x3 , x2 ⊕ x3 , x1 ⊕ x2 )} It is a code with M = 23 codewords, block size n = 6 so its rate is R = 12 . This code illustrates the idea of appending parity check bits (x4 , x5 , x6 ) at the end of the 6.3. NOISY CHANNEL CODING THEOREM 63 message (x1 , x2 , x3 ) message which adds redundancy in order to allow for error correction. In fact, one can prove that, this code can correct exactly one error and its error probability is given by: Pen = P( n X i=1 6.2.3 1{Xi ̸= Yi } ≥ 2) = 1 − np(1 − p)n−1 − (1 − p)n . Achievable Rates Given any code (M, n) we define the condtionnal error probability: λni = P(g(Y n ) ̸= i|X n = X n (i)), and the maximal error probability: λn = max λi . i=1,...,M Definition 6.2.2. A rate R is achievable if there exists a sequence of (2nR , n) codes with vanishing maximal error probability λn → 0. n→∞ Definition 6.2.3. The capacity of a channel is the supremum of all achievable rates. 6.3 6.3.1 Noisy Channel Coding Theorem Capacity Upper Bound We now show that any rate above the information capacity is not achievable. The main idea is to apply Fano’s inequality to show that if there are too many codewords, then the transmitted codeword cannot be estimated with arbitrary high accuracy. Proposition 6.3.1. Consider a memoryless channel. Then any rate R > C is not achievable. 
Proof: We recall that X, Y we have that H(X|Y ) ≤ H(X) so that using the chain rule, for any X1 , ..., Xn : H(X1 , ..., Xn ) = n X i=1 H(Xi |Xi−1 , ..., X1 ) ≤ n X i=1 H(Xi ). 64 CHAPTER 6. COMMUNICATION: DISCRETE CHANNELS We now upper bound the maximal mutual information with n channel uses. By definition of the capacity: I(X n ; Y n ) = H(Y n ) − H(Y n |X n ) n X n = H(Y ) − H(Yi |Yi−1 , ..., X n ) = H(Y n ) − ≤ = n X i=1 n X i=1 n X i=1 H(Yi |Xi ) H(Yi ) − H(Yi |Xi ) I(Xi ; Yi ). i=1 Therefore I(X n ; Y n ) ≤ nC. For any channel code W → X n (W ) → Y n → Ŵ forms a Markov chain and the data processing inequality yields: I(W ; Ŵ ) ≤ I(X n ; Y n ) ≤ nC. Since message W ∈ {1, ..., 2nR } is chosen uniformly at random we have H(W ) = nR and: H(W |Ŵ ) = H(W ) − I(W ; Ŵ ) ≥ n(R − C). We may now apply Fano’s inequality: h2 (P(W ̸= Ŵ )) + P(W ̸= Ŵ ) log2 2nR ≥ H(W |Ŵ ) ≥ n(R − C) Since h2 ≤ 1 we have proven that: P(W ̸= Ŵ ) ≥ n(R − C) − 1 C → 1− . n→∞ nR R Therefore the probability of error of any family of (n, 2nR ) channel codes does not vanish when R > C. 6.3.2 Efficient Coding Scheme: Random Coding We now show that any rate below the information capacity is achievable, so that, in essence information capacity qunatifies how much information can be reliably 6.3. NOISY CHANNEL CODING THEOREM 65 transmitted over a channel. This is a very strong result because it applies to any communication system. Capacity is the fundamental limit that no scheme can overcome, and reaching this limit can be hard in practice: no good (low complexity) codes were known for 30 years for the BSC and well known examples of such good codes which reach this limit are Turbo Codes and LDPC codes. Proposition 6.3.2. Consider a discrete memory-less channel. Then any rate R < C is achievable. Proof In order to prove the result, we construct a coding scheme called random coding. Random Coding Algorithm 6.3.3 (Random Channel Coding). Consider the following randomized algorithm in order to generate a codebook and transmit data. • (Codebook generation) Let p(x) a distribution such that C = I(X; Y ). Draw C = {X n (i), i = {1, ..., 2nR } where X n (i) is an i.i.d. sample of size n from p(x), and (X n (i))i=1,...,2nR . Reveal C to both the transmitter and receiver. • (Data Transmission) To transmit data, choose W ∈ {1, ..., 2nR } uniformly distributed, and transmit X n (W ) • (Decoding) Observe Y n . If there exists a unique Ŵ such that (X n (Ŵ ), Y n ) are jointly typical, then output Ŵ . Otherwise output an error. Intuition Behind Random Coding Interestingly, the fact that we use a random code ensemble, perhaps counterintuitively, eases analysis. While this analysis is not trivial, the main intuitive idea behind random channel coding is that, if X n (W ) is transmitted and Y n is received, then (X n (W ), Y n ) will be jointly typical with high probability, and for any W ′ ̸= W (X n (W ′ ), Y n ) will not be jointly typical, because X n (W ′ ) is independent from X n (W ) from the random code construction. Error Probability We compute the error probability averaged over C. Define E the event that decoding fails and average over C: nR P(E) = X c P(C = c)Pen (c) = 2 1 XX 2nR i=1 c P(C = c)λi (c). 66 CHAPTER 6. COMMUNICATION: DISCRETE CHANNELS By symmetry P c P(C = c)λi (c) does not depend on i, so: nR P(E) = 2 1 XX 2nR c P(C = c)λ1 (c) = P(E|W = 1). i=1 Define the event that a particular couple is typical: Ei = {(X n (i), Y n ) ∈ Anϵ }. 
If W = 1, decoding fails if either (X n (1), Y n ) is not typical, or there exists i ̸= 1 such that (X n (i), Y n ) is typical, hence: nR P(E|W = 1) ≤ P(E1c |W = 1) + 2 X i=2 P(Ei |W = 1) From joint typicality, for n large P(E1c |W = 1) ≤ ϵ and P(Ei |W = 1) ≤ 2−n(C−ϵ) , i ≥ 2 We conclude that, for n large and R < C − ϵ: P(E|W = 1) ≤ ϵ + 2−n(C−R−ϵ) ≤ 2ϵ. As P(E) ≤ 2ϵ, there exists c⋆ with P(E|C = c⋆ ) ≤ 2ϵ. Since 2nR 1 X ⋆ P(E|C = c ) = nR λi (c⋆ ) ≤ 2ϵ 2 i=1 there are 2nR − 1 indices i such that λi (c⋆ ) ≤ 4ϵ by considering the best half. So we have proven that there exists a sequence of (n, 2nR ) codes with vanishing error probability which concludes the proof. 6.4 Computing the Channel Capacity In general, how does one compute channel capacity ? The problem is usually difficult and for many channels, the computation of their capacity is an open problem. We highlight a two simple strategies here. 6.4. COMPUTING THE CHANNEL CAPACITY 6.4.1 67 Capacity of Weakly Symmetric Channels If the channel has some symmetry features, one can use this to compute the capacity. ′ Definition 6.4.1. A channel is weakly symmetric if (i) for any x,xP , vectors p(.|x), ′ ′ p(.|x ) are equal up to a permutation and (ii) for any y,y we have x∈X p(y|x) = P ′ x∈X p(y |x). If the channel is weakly symmetric, the optimal input distribution is uniform, and the capacity is simply the logarithm of the number of outputs, minus the entropy of a column of the transition matrix. Interestingly, this result generalizes our previous computations. ′ Proposition 6.4.2. Assume that (i) for any x,x′ , vectors p(.|x), p(.|x ) are equal P P ′ up to a permutation and (ii) for any y,y we have x∈X p(y|x) = x∈X p(y ′ |x). Then X 1 , C = log |Y| − p(y|x) log2 p(y|x) y∈Y for any x and the optimal input is uniform. Proof The distribution of Y knowing X = x does not depend on x (up to a permutation), so: I(X; Y ) = H(Y ) − X p(y|x) log2 y∈Y 1 p(y|x) Once again, by symetry As X uniform =⇒ Y uniform, this maximizes H(Y ) and I(X; Y ). 6.4.2 Concavity of Mutual Information Proposition 6.4.3. For any channel we have: (i) 0 ≤ C ≤ log2 (min(|X |, |Y|) (ii) (p(x)) 7→ I(X; Y ) is a concave function. Proof The distribution of Y is: X X p(y) = p(x, y) = p(x)p(y|x). x∈X x∈X 68 CHAPTER 6. COMMUNICATION: DISCRETE CHANNELS Define f (x) = x log2 H(Y ) = X 1 x X X 1 = f p(x)p(y|x) p(y) y∈Y x∈X X X 1 1 p(x, y) log2 = p(x) p(y|x) log2 . p(y|x) x∈X p(y|x) y∈Y p(y) log2 y∈Y H(Y |X) = and X (x,y)∈X ×Y We have that f is concave so (p(x)) 7→ H(Y ) is concave as well. Furthermore, (p(x)) 7→ H(Y |X) is linear. Therefore (p(x)) 7→ I(X; Y ) = H(Y ) − H(Y |X) is concave. 6.4.3 Algorithms for Mutual Information Maximization In general capacity is not known in closed form can its computation can be a hard problem. One can maximize I(X; Y ) numerically using convex optimization techniques such as gradient ascent, which are valid for maximizing any concave function. Specific algorithms taking advantage of particular properties of mutual information also exist, such as the algorithm of Arimoto and Blahut. Chapter 7 Mutual Information and Communication: continuous channels In this chapter, we turn our attention to continuous channels, where both the input and the output are real valued. Such channels are ubiquitous in the physical world, due to its continuous nature. To solve this problem we need to generalize the notions of entropy, relative entropy and mutual information to continuous random variables. 
We compute the capacity and the optimal input distribution gaussian channels, which are found in many applications such as wireless communication. 7.1 7.1.1 Information Mesures for Continous Variables Differential Entropy Definition 7.1.1. Consider X a continuous random variable with p.d.f. pX (x). Its differential entropy is given by h(X) = E log2 1 = pX (X) Z pX (x) log2 X 1 dx pX (x) if the integral exists. Just like entropy, differential entropy is expressed in bits, and is a natural natural extension of the discrete case. It is noted that the integral might not exist. One of the most notable differences with entropy is that differential entropy can be negative. 69 70 7.1.2 CHAPTER 7. COMMUNICATION: CONTINUOUS CHANNELS Examples Uniform Distribution If X ∼ Uniform(X ): h(X) = E(log2 |X |) = log2 |X | It is noted that, if |X | ≤ 1, then h(X) ≤ 0 so that differential entropy can be negative. Also, if X is deterministic, X is a point and h(X) = −∞, which differs from the discrete case where deterministic variables have an entropy of 0. Exponential Distribution If X ∼ Exponential(λ): eXλ λE(X) 1 e h(X) = E log2 = + log2 = log2 λ log 2 λ λ The fact that differential entropy decreases with λ is intuitive, since the smaller λ, the less X is concentrated around 0. Gaussian Distribution If X ∼ N (µ, σ 2 ): √ (X−µ)2 2 2 2σ h(X) = E log2 ( 2πσ e ) = 1 E(X − µ)2 1 log2 (2πσ 2 ) + = log2 (2πeσ 2 ) 2 2 2 log(2)σ 2 This expression will occcur in various places, in particular when computing the capacity of Gaussian channels. Two remarks can be made: first the differential entropy does not depend on µ which illustrates the fact that differential entropy is invariant by translation, second it is increasing in σ 2 , which is intuitive since the larger σ 2 , the less X will be concentrated around its mean µ. 7.1. INFORMATION MESURES FOR CONTINOUS VARIABLES 7.1.3 71 Joint and Conditional Entropy Mutual Information Joint and Conditional Differential Entropy Definition 7.1.2. Let X, Y with joint p.d.f. pX,Y (x, y). The joint differential entropy, conditional differential entropy and mutual information are: Z 1 h(X, Y ) = pX,Y (x, y) log2 dxdy pX,Y (x, y) X ×Y Z pY (y) pX,Y (x, y) log2 h(X|Y ) = dxdy pX,Y (x, y) X ×Y Z pY (y)pX (x) dxdy. I(X; Y ) = pX,Y (x, y) log2 pX,Y (x, y) X ×Y As in the discrete case , one can readily check that h(X|Y ) = h(X, Y ) − h(Y ) I(X; Y ) = h(Y ) − h(Y |X) = h(X) − h(X|Y ) = h(X) + h(Y ) − h(X, Y ). Relative Entropy Definition 7.1.3. Consider two p.d.f.s p(x) and q(x). The relative entropy is Z p(x) D(p||q) = p(x) log2 dx q(x) X Proposition 7.1.4. We have D(p||q) ≥ 0 for any p, q. Proof: Jensen’s inequality. 7.1.4 Unified Definitions for Information Measures In the above presentation, we have presented two distinct set of definitions for information measures for continuous and discrete variables. A natural question is whether or not one can define information measures in such a way that the same definition is applicable to both discrete and continuous variables. The key is to, perhaps counterintuitively, start by defining the relative entropy using the Radon-Nikodym derivative. Definition 7.1.5. Consider P ,Q two distributions over a measurable space X , and assume that P is absolutely continous with respect to Q, then the relative entropy can be defined as: Z P (dx) D(P ||Q) = log2 P (dx) Q(dx) X where P (dx) Q(dx) is the Radon-Nikodym derivative of P with respect to Q. 72 CHAPTER 7. 
The Radon-Nikodym derivative is well defined thanks to absolute continuity, and in turn this allows us to define mutual information in terms of relative entropies.

Definition 7.1.6. Consider (X, Y) random variables with joint distribution P_{(X,Y)}; the mutual information between X and Y is
I(X; Y) = D(P_{(X,Y)} || P_X P_Y).

As a byproduct, we obtain a very instructive interpretation of mutual information I(X; Y) as the dissimilarity between the joint distribution of the vector (X, Y) and that of a vector with independent entries and the same marginals. One can also readily check that this definition generalizes both the discrete and the continuous case.

7.2 Properties of Information Measures for Continuous Variables

7.2.1 Chain Rule for Differential Entropy

Differential entropy obeys a chain rule just like entropy, and the proof follows from the same arguments.

Proposition 7.2.1. For any X_1, ..., X_n we have
h(X_1, ..., X_n) = \sum_{i=1}^n h(X_i | X_{i−1}, ..., X_1).

Proof: By definition of conditional entropy,
h(X_1, ..., X_n) = h(X_n | X_1, ..., X_{n−1}) + h(X_1, ..., X_{n−1}).
The result follows by induction over n.

7.2.2 Differential Entropy of Affine Transformations

For continuous variables over R^d, we are often interested in how differential entropy is affected by simple transformations such as translations and linear maps.

Proposition 7.2.2. For any random variable X ∈ R^d, fixed vector a ∈ R^d and invertible matrix A ∈ R^{d×d} we have
h(a + AX) = h(X) + log_2 |det A|.
If A is not invertible we have h(a + AX) = −∞.

Proof: If A is invertible and X has density p(x), then a + AX has density p(A^{−1}(x − a)) / |det A|, so
h(a + AX) = ∫_{R^d} (p(A^{−1}(x − a)) / |det A|) log_2 (|det A| / p(A^{−1}(x − a))) dx
= ∫_{R^d} p(y) log_2 (|det A| / p(y)) dy
= h(X) + log_2 |det A|,
which proves the first statement. If A is not invertible, then the support of the distribution of a + AX has Lebesgue measure 0, so that h(a + AX) = −∞.

Therefore, an affine transformation changes the differential entropy by an additive term, namely the logarithm of |det A|. If A = I, or more generally if A is a rotation, then log_2 |det A| = 0, so that differential entropy is invariant under both translation and rotation.

7.3 Differential Entropy of Multivariate Gaussians

7.3.1 Computing the Differential Entropy

Our previous result allows us to derive the differential entropy of multivariate Gaussian vectors without any computation: indeed, any Gaussian vector can be expressed as an affine transformation of a vector with i.i.d. standard Gaussian entries.

Proposition 7.3.1. If X ∼ N(µ, Σ) then
h(X) = (1/2) log_2((2πe)^n det(Σ)).

Proof: If X ∼ N(0, I) then X has i.i.d. N(0, 1) entries, so that
h(X) = \sum_{i=1}^n h(X_i) = (n/2) log_2(2πe).
Now consider Y = µ + Σ^{1/2} X; then Y ∼ N(µ, Σ) and
h(Y) = h(X) + log_2 det Σ^{1/2} = (1/2) log_2((2πe)^n det(Σ)),
proving the result.

7.3.2 The Gaussian Distribution Maximizes Entropy

One of the reasons why the multivariate Gaussian distribution is ubiquitous in information theory is that it is an entropy maximizer. Namely, if one knows the mean and covariance of X, then its differential entropy is always upper bounded by the differential entropy of a Gaussian vector with the same mean and covariance matrix.
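A small numerical illustration of this maximum-entropy property, using the closed-form differential entropies of Section 7.1.2 with all three distributions scaled to the same unit variance (a sketch assuming NumPy; the numbers below are exact formulas, not estimates):

import numpy as np

sigma2 = 1.0  # common variance for the comparison

# Closed-form differential entropies (in bits) at variance sigma2.
h_gauss = 0.5 * np.log2(2 * np.pi * np.e * sigma2)    # N(0, sigma2)
h_unif  = np.log2(np.sqrt(12 * sigma2))               # uniform on an interval of width sqrt(12*sigma2)
h_expo  = np.log2(np.e * np.sqrt(sigma2))             # exponential with rate 1/sqrt(sigma2)

print(f"Gaussian   : {h_gauss:.3f} bits")   # ≈ 2.047
print(f"Uniform    : {h_unif:.3f} bits")    # ≈ 1.793
print(f"Exponential: {h_expo:.3f} bits")    # ≈ 1.443

At equal variance the Gaussian indeed has the largest differential entropy of the three.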
This result has interesting applications in statistical modelling: if one must model some incertain parameter by a distribution, and the only information available are its first and second moments, then considering the Gaussian distribution is natural, as it follows the so-called maximum entropy principle for modelling. Another important application of this result is the computation of the capacity of Gaussian channels, and more generally deriving capacity bounds for various types of channels. Proposition 7.3.2. Consider X ∈ Rn with covariance matrix Σ then h(X) ≤ 1 log2 ((2πe)n det(Σ)) 2 with equality if and only if X has a Gaussian distribution. Proof Denote by p(x) the density of X and µ its mean. Define Y ∼ N (µ, Σ) with density q(x), so that p(X) 0 ≤ D(p||q) = E log2 q(X) and h(X) = E log2 Since 1 1 ≤ E log2 p(X) q(X) 1 1 ⊤ −1 e− 2 (x−µ) Σ (x−µ) q(x) = p (2π)n det(Σ) : E log2 1 1 1 = log2 ((2π)n det(Σ)) + E((X − µ)⊤ Σ−1 (X − µ)) q(X) 2 2(log 2) 1 1 = log2 ((2π)n det(Σ)) + E((Y − µ)⊤ Σ−1 (Y − µ)) 2 2(log 2) 1 = E log2 q(Y ) = h(Y ). since the r.h.s only depends on the covariance matrix of X which proves the result. 7.4. CAPACITY OF CONTINUOUS CHANNELS 7.4 75 Capacity of Continuous Channels Consider a continuous, memoryless channel. As in the discrete case, communicating over a continuous channel follows the same paradigm. One may define codebooks, error probabilities and achievable rates. Furthermore, the noisy channel coding theorem still applies: the information capacity of the channel is also the supremum of all achievable rates, and any achievable rate can be attained using the random coding strategy, coupled with typicality decoding. 7.5 7.5.1 Gaussian Channels Gaussian Channel Definition 7.5.1. The Gaussian channel with power P is given by: Y =X +Z where Z ∼ N (0, N ), and the input must satisfy E(X 2 ) ≤ P . The Gaussian channel is the simplest model for communication between a transmitter and a receiver when the only perturbation is additive noise. Gaussian noise is often a good model whenever the perturbation is the result of many small, independent sources of perturbation, from the central limit theorem. We compute the capacity of this channel by maximizing mutual information, and the power constraint E(X 2 ) ≤ P is necessary, otherwise, the capacity of the channel is simply infinite. Proposition 7.5.2. The information capacity of the Gaussian channel with power P is: P 1 1 + C = max I(X; Y ) = log 2 X:E(X 2 )≤P 2 N and the optimal input is X ∼ N (0, P ). Proof: When X is fixed, Y ∼ N (X, N ) so that h(Y |X) = 1 log2 (2πeN ). 2 On the other hand from independence: E(Y 2 ) = E((X + Z)2 ) = E(X 2 ) + 2E(XZ) + E(Z 2 ) = P + N. Therefore h(Y ) ≤ 1 log2 (2πe(N + P )), 2 76 CHAPTER 7. COMMUNICATION: CONTINUOUS CHANNELS with equality iff Y is Gaussian. Finally 1 1 I(X; Y ) = h(Y ) − h(Y |X) ≤ log2 (2πe(N + P )) − log2 (2πeN ) 2 2 P 1 . = log2 1 + 2 N with equality if and only if Y is Gaussian which concludes the proof. The capacity of this channel is an increasing function of the signal-to-noise P ratio (SNR) N . When the SNR is small the capacity is roughly linear in the SNR but when the SNR is large, the capacity is logarithmic. This shows that, when communicating over a Gaussian channel, increasing the power leads to better performance, but one quiclky runs into diminishing returns. 
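As a quick numerical illustration of these diminishing returns (a sketch assuming NumPy, not part of the original notes), one can tabulate C = (1/2) log_2(1 + P/N) over a range of signal-to-noise ratios:

import numpy as np

snr_dB = np.array([-10, 0, 10, 20, 30])
snr = 10 ** (snr_dB / 10)
C = 0.5 * np.log2(1 + snr)        # capacity in bits per channel use

print(np.round(C, 3))             # ≈ [0.069, 0.5, 1.73, 3.33, 4.98]
# Doubling the power (+3 dB) roughly doubles C at low SNR,
# but only adds about 0.5 bit per channel use at high SNR.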
7.5.2 The AWGN Channel A variant of the Gaussian channel is the Additive White Gaussian (AWGN) Noise channel, where the input is a continuous time signal, and this input is perturbed by a continous time process called white noise. For instance, in almost all wireless communication systems, communication is impaired by Johnson-Nyquist noise, which is unwanted noise generated by the thermal agitation of electrons, and Johnson-Nyquist noise usually can be modelled by white noise. Definition 7.5.3. The AWGN (Additive White Gaussian Noise) channel is given by: Y (t) = X(t) + Z(t) where x(t) is bandlimited in [−W, W ] with total power P and Z(t) is white Gaussian noise with spectral power density N0 . Proposition 7.5.4. The capacity of the AWGN channel is given by: C = W log2 1 + P W N0 Proof From the Nyquist sampling theorem, the AWGN channel is equivalent to 2W parallel, identical Gaussian channels, hence the result. We notice that for infinite bandwidth W → ∞ (low SNR): C= P log2 e N0 and that, just like the previous case, there exists a power-bandwidth tradeoff: C is linear in W but logarithmic in WPN0 . This explains why, in most wireless 7.5. GAUSSIAN CHANNELS 77 communication systems, increasing the bandwidth yields much more gains that increasing the power, especially if the SNR of the typical user is already high. Also, the formula of the capacity of the AWGN channel allows to predict the performance of many practical communication systems past and present, and while the capacity is an upper bound of the best performance that can be achieved in ideal conditions (infinite processing power for coding and decoding for instance) the formula allows to roughly predict the typical performance, providing one knows the typical SNR as well as the bandwidth. Here are three illustrative examples: for telephone lines: W = 3.3 kHz, WPN0 = 33 dB, C = 36 Kbits/s. Wifi: W = 40 MHz, WPN0 = 30 dB, C = 400 Mbits/s and for 4G Networks W = 20 MHz, P = 20 dB, C = 133 Mbits/s. W N0 7.5.3 Parallel Gaussian Channels In many communication systems, one can in fact use multiple channels all at once in order to communicate. If those channels are Gaussian, and they are independent from each other, the model is called parallel Gaussian channels. Definition 7.5.5. A set of parallel Gaussian channels with total power P is: Yj = Xj + Zj , j = 1, . . . , k where Zj ∼ N (0, Nj ), j = 1, . . . , k are independent, and the input must satisfy Pk 2 j=1 E(Xj ) ≤ P . In the context of communication, in particular wireless communication, this model covers communication over parallel links, over distinct frequency bands and distinct antennas, all of which are important components of modern wireless systems. The main question is how one should allocate the available power to the various channels. Certainly, if the noise variance is the same across all channels, the problem is trivial and one can simply allocate power uniformly, but in general if some channels are much better than other the problem is non-trivial. We now compute the capacity of parallel Gaussian channels by computing the optimal power allocation across channels. Proposition 7.5.6. The capacity of parallel Gaussian channels is given by: k C= with λ⋆ unique solution to 1X (λ⋆ − Nj )+ log2 1 + 2 j=1 Nj Pk j=1 (λ − Nj )+ = P 78 CHAPTER 7. COMMUNICATION: CONTINUOUS CHANNELS Proof: We need to solve the optimization problem: maximizeP1 ,...,Pk ≥0 k X log2 j=1 k X Pj subject to Pj ≤ P. 
1+ Nj j=1 From Lagrangian duality, this can be done by solving: maximizeP1 ,...,Pk ≥0 k X log2 j=1 k X Pj +µ Pj 1+ Nj j=1 Setting the gradient to 0 above yields: 1 Nj 1+ Pj Nj +µ=0 Therefore either Pj = 0 or Pj + Nj = − µ1 ≡ λ and summing over j to get Pk j=1 Pj = P yields the correct value of λ. This concludes the proof. The optimal power allocation is called the ”water filling” solution, due to the fact that the power allocated to channnel i is either 0, or it should be equal to Ni + λ⋆ , where λ⋆ is selected to make sure that the total power allocated equals P . This implies that very noisy channels are ignored. Having parallel channels enable a multiplexing gain: indeed capacity is linear in the number of channels. Finally, one may show that the result also applies to to time varying channels, and bandlimited channels. 7.5.4 Vector Gaussian Channels A generalization of parallel Gaussian channels is called vector Gaussian channels, where the correlation matrix of the noise vector can be arbitrary. Definition 7.5.7. A vector Gaussian channel with total power P is: Y k = Xk + Zk where Z ∼ N (0, ΣZ ) and the input satisfies E((X k )⊤ X k ) ≤ P . This model for instance allows to describe wireless communication systems with multiple-input multiple-output (MIMO), where both the receiver and the transmitter can use several antennas to communicate. This model can also be used to descrive non-memoryless Gaussian channels, where the entries of X k and Y k would describe the successive values of the input and output across time. As in the case of parallel Gaussian channels, we now derive the optimal input and transmission strategy. 7.5. GAUSSIAN CHANNELS 79 Proposition 7.5.8. The capacity of Vector Gaussian Channels is given by: k C= (ν − λj )+ 1X log2 1 + 2 j=1 λj with (λ1 , ..., λk ) = eig(KZ ), ν unique solution to Pk j=1 (ν − λj )+ = P Proof: Since ΣZ is real and symmetric, there exists U : U ⊤ U = I and ΣZ = U ⊤ diag(λ1 , ..., λk )U Multplying by U : U Y k = U X k + U Zk, This defines a new channel: Ȳ k = X̄ k + Z̄ k , We have (X̄ k )⊤ X̄ = (X k )⊤ U ⊤ U X k = (X k )⊤ X k ΣZ̄ = U ⊤ ΣZ U = diag(λ1 , ..., λk ), This is the same as k parallel Gaussian channels with noises λ1 , ..., λk variances, which concludes the proof. It turns out that the optimal power allocation is to perform waterfilling on the eigenvectors of the noise correlation matrix. The main idea behind this is that one can always reduce a vector Gaussian channel to k parallel channels by rotation, so that after the rotation, the noise correlation matrix becomes diagonal. 80 CHAPTER 7. COMMUNICATION: CONTINUOUS CHANNELS Chapter 8 Portfolio Theory In this chapter, we illustrate how, perhaps surprisingly, information theoretic techniques can be used in order to design investment strategies in financial markets, in the context of the so-called portfolio theory. 8.1 A Model for Investment 8.1.1 Asset Prices and Portfolios We consider the following model for investment in a financial market. At the start of the process, the investor has a s tarting wealth S0 . The process is sequential, and at the start of day n ∈ N the investor has wealth Sn , observes the stock prices at opening denoted by (Pn,1 , ..., Pn,m ), chooses a portfolio (bn,1 , ..., bn,m ) with Pthe m bi Sn i=1 bn,i = 1. He invests bj Sn amount of wealth in asset i by buying Pn,i units of asset i. 
′ ′ At the end of day n, he observes the closing prices (Pn,1 , ..., Pn,m ) realizes his profits and losses so that the amount of wealth available at the start of day n + 1 equals: m ′ Pn,j Sn+1 X = bn,j Sn Pn,j j=1 By recursion, the wealth at any given time can be written as n−1 m ′ Sn Y X Pi,j = bj S0 Pi,j i=1 j=1 81 ! 82 8.1.2 CHAPTER 8. PORTFOLIO THEORY Relative Returns The model can be written in a simpler form by defining (Xn,1 , ..., Xn,m ) with Xn,i = ′ Pn,i Pn,i the relative return of asset i at time n so that the wealth evolution is n−1 m X Sn X = log2 ( bi,j Xi,j ) log2 S0 i=1 j=1 Indeed, the relative returns of each asset are sufficient in order to predict the evolution of the wealth. Throughout the chapter we will assume that the vectors of relative returns Xn = (Xn,1 , ..., Xn,m ) are i.i.d. with some fixed distribution F . 8.2 8.2.1 Log Optimal Portfolios Asymptotic Wealth Distribution The investor wishes to design optimal portfolio strategies that maximizes the distribution of its wealth log Sn in some sense. Since the investor monitors the market on a daily basis, he may choose an investment strategy that depends on the previous returns Xn−1 , ..., X1 as well as his previous decisions. Of course, since the wealth is a random variable, there are several acceptable criteria to maximize, and we will propose one such criterion. Proposition 8.2.1. Consider a constant investment strategy where bn = (bn,1 , ..., bn,m ) does not depend on n. Then Sn 1 log2 → W (b, F ) almost surely n S0 n→∞ where W (b, F ) is called the growth rate of portfolio b: W (b, F ) = EX∼F log2 m X !! bi X i i=1 Proof If the investment strategy is constant, then average of n i.i.d. random variables: n−1 m X Sn 1X log2 = log2 bj Xi,j S0 n i=1 j=1 1 n log SSn0 is an empirical ! 8.3. PROPERTIES OF LOG OPTIMAL PORTFOLIOS 83 with expectation W (b, F ) so the strong law of large numbers yields the result. The above proposition shows that, if the investor chooses a fixed investment strategy across time, then with high probability, wealth will grow exponentially as a function of time: Sn ≈ S0 2nW (b,F ) and the exponent equals the growth rate of the portfolio W (b, F ) ≥ 0. Perhaps surprisingly, if the growth rate is strictly positive, then with high probability, the wealth asymptotically grows to infinity. 8.2.2 Growth Rate Maximization Definition 8.2.2. The optimal growth rate W ⋆ (b, F ) is the value of maximize W (b, F ) subject to m X i=1 bi ≤ 1 and b ≥ 0 and an optimal portfolio b⋆ is an optimal solution to this problem. The previous results suggests that, if the investor knows the distribution of the returns F , than he should select the portfolio maximizing the growth rate, to ensure that its wealth grows as rapidly as possible. While this is not the only possible objective function in porfolio theory, it comes with strong guarantees providing that returns are indeed i.i.d. Other possible objective functions in portfolio theory are for instance linear combinations of the mean and variance of the returns, as there exists a trade-off between high-risk/high-return and low-risk/low-return portfolios. Another interesting observation is that maximizing Pthe growth rate is usually different from maximizing the expected returns E( m i=1 bi Xi ), which can be ⋆ ⋆ achieved by selecting bi⋆ = 1{i = i } where i = arg maxi E(Xi ), i.e. the investor places all of his wealth on the stock with highest average return, a risky strategy indeed. 
Usually, maximizing the growth rate is a much moreP conservative, due to the logarithm which places a heavy penalty on the wealth m i=1 bi Xi becoming very close to 0 . In other words, maximizing the growth rate discourages porfolios that can bankrupt the investor in a day. 8.3 Properties of Log Optimal Portfolios We now show how to compute the optimal portfolio maiximizing the growth rate. 84 8.3.1 CHAPTER 8. PORTFOLIO THEORY Kuhn Tucker Conditions Proposition 8.3.1. The optimal portfolio b⋆ is the only porfolio such that for all j: !( = 1 if bj > 0 Xj E Pk < 1 if bj = 0 i=1 bi Xi Proof: It is noted that b 7→ W (b, F ) is a concave function, by concavity of the logarithm. From the KKT conditions, there exists λ > 0 and µ ≥ 0 such that: ∇W (b⋆ , F ) + λ + µ = 0 Since µ ≥ 0 we have for all i: d W (b⋆ , F ) + λ ≤ 0 b.i and furthermore, if b⋆i ̸= 0 d W (b⋆ , F ) + λ = 0 b.i By definition of W : 1 d W (b⋆ , F ) = E b.i log(2) Xj ! Pk ⋆ i=1 bi Xi Multiplying the above by b⋆i , replacing and summing shows that: Pk ⋆ ! k X 1 i=1 bi Xi +λ b⋆i = 0 E Pk ⋆ log(2) b X i=1 i i i=1 1 Therefore, λ = log(2) , and replacing yields the result. The KKT conditions are necessary and sufficent conditions for the optimality of the portfolio, and if F is known, one can search for the optimal using an iterative scheme such as gradient descent. 8.3.2 Asymptotic Optimality So far, we have only considered constant stratgies where the investor uses the same porfolio at all times, and for such strategies the best achievable wealth is given by the growth rate. One can then whether if it is possible to do better by using history dependent strategies where the investors decision at time n depends on the observed returns up to time n − 1. 8.4. INVESTMENT WITH SIDE INFORMATION 85 Definition 8.3.2. A portfolio strategy is said to be causal if for all n, bn,1 , ..., bn,m is solely a function of (Xn′ ,1 , ..., Xn′ ,m ) for n′ < n. Proposition 8.3.3. For any causal portfolio strategy 1 Sn E log2 ≤ W (b⋆ , F ) n S0 With equality if one selects b⋆ the maximizer of W (b, F ) at all times i.e. constant strategies are optimal. Proof: The expected log wealth is given by n−1 1 Sn 1X E log2 = E log2 n S0 n i=1 m X !! bi,j Xi,j j=1 For any i, when (bi,1 , ..., bi,m ) is an arbitrary function of (Xn′ ,1 , ..., Xn′ ,m ) for n′ < n, the optimal choice is to select the maximizer of: E log2 m X j=1 ! bi,j Xi,j ! |(Xn′ ,1 , ..., Xn′ ,m ), n′ < n = E log2 m X !! bi,j Xi,j j=1 since (Xn′ ,1 , ..., Xn′ ,m ) is independent of (Xn′ ,1 , ..., Xn′ ,m ) , n′ < n. Therefore, for each i, (bi,1 , ..., bi,m ) can be chosen as the maximizer of W (b, F ), and constant strategies are optimal. Interestingly, in our setting causal strategies yield no gains with respect to constant strategies. Therefore, the best achievable performance with causal strategies is still given by the growth rate. Of course, this only true if F is known to the investor, and the returns are i.i.d. If F were unknown, then the investor should change his decisions as more and more returns are observed. Similarly if the returns have a significant correlation in time, then the investment strategy should be time varying, as the returns observed up to time n − 1 can be used to predict the returns at time n and choose a portfolio intelligently. 8.4 Investment with Side Information Finally, we investigate how much side information available to the investor may increase his performance, and how much having imperfect knowledge about the market can decrease his performance. 86 8.4.1 CHAPTER 8. 
8.4.1 Mismatched Portfolios

So far, we have assumed that the investor knows the distribution of the relative returns $F$, and in that case the optimal choice is to select a portfolio maximizing the growth rate $W(b, F)$. However, in practice a full knowledge of $F$ is not available, and $F$ must be somehow estimated, for instance using historical data. Consider the case where the investor knows $G$, an estimate of $F$, and selects the portfolio that would be optimal if $G$ were equal to the unknown $F$. A natural question is how to assess how much wealth is lost due to the imperfect knowledge of $F$.

Proposition 8.4.1. Consider two distributions $F$ and $G$, and the corresponding log optimal portfolios $b^\star_F$ and $b^\star_G$, which maximize $W(b, F)$ and $W(b, G)$ respectively. Then we have that
$$ W(b^\star_F, F) - W(b^\star_G, F) \le D(F||G) $$

In other words, the amount of growth rate lost by the investor due to his imperfect knowledge is upper bounded by the relative entropy between the true distribution $F$ and his estimate $G$. So the wealth of an investor with perfect knowledge will be approximately $2^{n W(b^\star_F, F)}$, while the wealth of an investor with imperfect knowledge will be approximately (at least) $2^{n [W(b^\star_F, F) - D(F||G)]}$. It should also be noted that this bound is tight for some distributions of $X$. This is indeed a surprising link between portfolio theory and information theory.

8.4.2 Exploiting Side Information

Now consider the scenario where the investor may use side information before selecting a portfolio. The goal is still to maximize the growth rate, which depends on the distribution of the returns $X$, however the investor has access to another random variable $Y$, which is hopefully useful in order to predict $X$. Two examples of scenarios include: financial advice, where $Y$ is the prediction of some expert (or experts) that the investor may choose to consult before making a decision, and correlated returns, in which the returns are not i.i.d. anymore, so that $X_n$ can be predicted as a function of $Y = (X_{n'})_{n' < n}$.

Definition 8.4.2. The growth rate of portfolio $b$ with side information $Y$ is:
$$ W(b, F|Y) = \mathbb{E}\left(\log_2\left( \sum_{i=1}^m b_i X_i \right) \,\Big|\, Y\right) $$

If the investor has access to side information $Y$, then he should select the portfolio maximizing $W(b, F|Y)$, and while this certainly yields a better performance compared to the case with no side information, one can wonder how much growth rate is gained with side information (for instance in the case where the investor must pay some premium in order to access the side information). Intuitively, this should depend on how much $X$ and $Y$ are correlated.

Proposition 8.4.3. Consider $b^\star$ the log optimal portfolio maximizing $W(b, F)$ and $b^\star_{|Y}$ the log optimal portfolio with side information maximizing $W(b, F|Y)$. Then we have
$$ 0 \le W(b^\star_{|Y}, F|Y) - W(b^\star, F) \le I(X; Y) $$

Proof: If $Y = y$, from our previous result, the loss of growth rate between an investor who assumes that $X$ has distribution $G = p_X$ and an investor who knows the actual distribution $F = p_{X|Y=y}$ is at most
$$ D(p_{X|Y=y} || p_X) = \sum_{x \in \mathcal{X}} p_{X|Y}(x|y) \log_2 \frac{p_{X|Y}(x|y)}{p_X(x)} $$
Averaging this loss over $Y$ gives:
$$ \sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} p_Y(y)\, p_{X|Y}(x|y) \log_2 \frac{p_{X|Y}(x|y)}{p_X(x)} = I(X; Y) $$
which is the announced result.

Here we discover another surprising connection between portfolio theory and information theory: the amount of growth rate that can be gained by side information is at most the mutual information between the returns and the side information, $I(X; Y)$. This makes sense since $I(X; Y)$ does measure the correlation between $X$ and $Y$. To illustrate, consider two extremes: if $Y$ is independent from $X$, then $I(X; Y) = 0$ and the side information yields no benefit; if $Y = X$, then $I(X; Y) = H(X)$ and the gain is at most the entropy of $X$.
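To make Proposition 8.4.1 concrete, the following self-contained Python sketch compares, on an arbitrary toy market, the growth rate achieved by the log optimal portfolio for the true law $F$ with the one achieved by the portfolio optimized for a mismatched estimate $G$, and checks that the gap does not exceed $D(F||G)$. All numerical values (return matrix, $F$, $G$) are illustrative assumptions.

```python
# Minimal numerical check of Prop. 8.4.1 on an assumed toy market
# (values are illustrative, not from the notes).
import numpy as np

# 2 assets, 3 possible return scenarios (rows), with two candidate laws.
X = np.array([[1.0, 1.5],
              [1.0, 0.6],
              [1.0, 1.1]])
F = np.array([0.5, 0.3, 0.2])   # true distribution of the scenarios
G = np.array([0.3, 0.4, 0.3])   # investor's (mismatched) estimate

def log_optimal(p, n_iter=20000, step=0.05):
    """Exponentiated gradient ascent on W(b, p) = E_p[log2(b^T X)]."""
    b = np.full(X.shape[1], 1 / X.shape[1])
    for _ in range(n_iter):
        grad = (p / (X @ b)) @ X / np.log(2)
        b = b * np.exp(step * grad)
        b = b / b.sum()
    return b

def W(b, p):
    return np.sum(p * np.log2(X @ b))

bF = log_optimal(F)             # optimal under the true law
bG = log_optimal(G)             # optimal under the estimate
loss = W(bF, F) - W(bG, F)      # growth rate actually lost
D_FG = np.sum(F * np.log2(F / G))

print(f"growth rate lost : {loss:.4f} bits per period")
print(f"D(F||G)          : {D_FG:.4f} bits")
print("bound satisfied  :", loss <= D_FG + 1e-9)
```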
Chapter 9
Information Theory for Machine Learning and Statistics

In this chapter, we illustrate how information theoretic techniques can be used to solve problems in statistics and machine learning.

9.1 Statistics

9.1.1 Statistical Inference

Assume that we are given $n$ data points $X_1, \dots, X_n$ in a finite set $\mathcal{X}$, drawn i.i.d. from some unknown distribution $p$. We would like to perform statistical inference, meaning that we would like to learn information about the unknown distribution $p$ solely by observing the data points $X_1, \dots, X_n$. Of course, depending on what kind of information we wish to obtain, the resulting problems can be vastly different. We give a few examples.

9.1.2 Examples of Inference Problems

Density Estimation. We would like to construct $\hat{p}$, an estimator of $p$, with the goal of minimizing $\mathbb{E}(\ell(p, \hat{p}))$ where $\ell$ is some loss function. The loss function quantifies how close the true $p$ is to its estimate $\hat{p}$.

Parameter Estimation. We assume that $p$ is parameterized by some parameter $\theta$ (write it $p_\theta$). We would like to construct $\hat{\theta}$, an estimator of $\theta$, with the goal of minimizing $\mathbb{E}(\ell(\theta, \hat{\theta}))$ where $\ell$ is some loss function.

Binary Hypothesis Testing. We partition the set of distributions as $\mathcal{H}_0 \cup \mathcal{H}_1$ and we would like to know if $p$ lies in $\mathcal{H}_0$ or $\mathcal{H}_1$. We would like to compute a well chosen function of the data $T$ such that both $\mathbb{P}(T = 0 \,|\, p \in \mathcal{H}_0)$ and $\mathbb{P}(T = 1 \,|\, p \in \mathcal{H}_1)$ are close to 1.

9.1.3 Empirical Distributions

To obtain information about $p$, the most natural strategy is to compute the empirical distribution of the data, i.e. the frequency at which each possible symbol $a \in \mathcal{X}$ appears in the data.

Definition 9.1.1. Consider a sequence $x_1, \dots, x_n$ in $\mathcal{X}$; its empirical probability distribution $P_{x^n}$ is given by
$$ P_{x^n}(a) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{x_i = a\} $$
for $a \in \mathcal{X}$. Alternatively, we call $P_{x^n}$ the "type" of the sequence $x^n = (x_1, \dots, x_n)$.

It is noted that the type $P_{x^n}$ is indeed a distribution over $\mathcal{X}$, since it has non-negative entries and sums to 1, and that it is an element of the set of probability distributions over $\mathcal{X}$
$$ \mathcal{P} = \Big\{ p \in (\mathbb{R}^+)^{\mathcal{X}} : \sum_{a \in \mathcal{X}} p_a = 1 \Big\} $$
This set is often called the probability simplex, and has dimension $|\mathcal{X}| - 1$.

The reason why the most natural strategy is to compute the empirical distribution of the data is that it converges to the true distribution when the number of data points grows large, as a consequence of the law of large numbers.

Proposition 9.1.2. If $X^n = (X_1, \dots, X_n)$ are drawn i.i.d. from distribution $Q$, then the type of $X^n$ converges to $Q$ almost surely.

Proof: From the law of large numbers, for any fixed $a \in \mathcal{X}$:
$$ P_{X^n}(a) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{X_i = a\} \underset{n \to \infty}{\longrightarrow} \mathbb{P}(X_i = a) = Q(a) \quad \text{almost surely} $$
This holds for any $a$, which proves the result.
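The following short Python sketch illustrates Definition 9.1.1 and Proposition 9.1.2: it computes the type of i.i.d. samples of increasing size and shows it approaching the true distribution $Q$. The alphabet and the distribution $Q$ below are arbitrary choices made for illustration.

```python
# Minimal illustration (assumed alphabet and Q): the type of an i.i.d.
# sample converges to the true distribution as n grows.
import numpy as np

rng = np.random.default_rng(0)
alphabet = np.array(["a", "b", "c"])
Q = np.array([0.5, 0.3, 0.2])

def empirical_type(xn):
    """P_{x^n}(a) = (1/n) * number of occurrences of a in the sequence."""
    return np.array([np.mean(xn == a) for a in alphabet])

for n in [10, 100, 10000]:
    xn = rng.choice(alphabet, size=n, p=Q)
    print(f"n = {n:6d}  type = {np.round(empirical_type(xn), 3)}")
print("true Q          =", Q)
```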
9.2 The Method Of Types

The method of types is a very powerful information theoretic strategy used to control the behaviour of the empirical distribution, and it works as follows.

9.2.1 Probability Distribution of a Sample

The first step is to show that the distribution of an i.i.d. sample only depends on its type, and that this distribution can be expressed in terms of entropy and relative entropy.

Proposition 9.2.1. Consider $X^n = (X_1, \dots, X_n)$ i.i.d. from distribution $Q$. Then the probability distribution of $X^n$ only depends on its type, and
$$ \mathbb{P}(X^n = x^n) = 2^{-n[H(P_{x^n}) + D(P_{x^n}||Q)]} $$

Proof: Consider $X^n = (X_1, \dots, X_n)$ i.i.d. from distribution $Q$. Then the probability distribution of $X^n$ only depends on its type, in the sense that
$$ \mathbb{P}(X^n = x^n) = \prod_{i=1}^n Q(x_i) = \prod_{a \in \mathcal{X}} Q(a)^{\sum_{i=1}^n \mathbf{1}\{x_i = a\}} = \prod_{a \in \mathcal{X}} Q(a)^{n P_{x^n}(a)} $$
Indeed, the expression above only depends on the type $P_{x^n}$, so that all sequences that have the same type are equally likely to occur. Furthermore, taking logarithms and dividing by $n$:
$$ -\frac{1}{n} \log_2 \mathbb{P}(X^n = x^n) = \sum_{a \in \mathcal{X}} P_{x^n}(a) \log_2 \frac{1}{Q(a)} = \sum_{a \in \mathcal{X}} \left( P_{x^n}(a) \log_2 \frac{P_{x^n}(a)}{Q(a)} + P_{x^n}(a) \log_2 \frac{1}{P_{x^n}(a)} \right) = H(P_{x^n}) + D(P_{x^n}||Q) $$
Hence we have proven that:
$$ \mathbb{P}(X^n = x^n) = 2^{-n[H(P_{x^n}) + D(P_{x^n}||Q)]} $$

So not only does the probability of a sequence depend only on its type, but the exponent is equal to the sum of the entropy of the type and the relative entropy between the type and the true distribution. This implies that the most likely type is the true distribution, and also that, when $n$ is large, types that are far away from the true distribution are very unlikely to occur.

9.2.2 Number of Types

The second step is to show that the number of possible types is not very large, in the sense that there are at most polynomially many types when $n$ grows and $\mathcal{X}$ is fixed.

Proposition 9.2.2. The type $P_{x^n}$ lies in
$$ \mathcal{P}_n = \mathcal{P} \cap \big\{ x \in \mathbb{R}^{\mathcal{X}} : n x_a \in \{0, 1, \dots, n\} \text{ for all } a \in \mathcal{X} \big\} $$
and the number of types is at most:
$$ |\mathcal{P}_n| \le (n+1)^{|\mathcal{X}|} $$

Proof: One can readily check that, by definition, the entries of $P_{x^n}$ are integer multiples of $1/n$ lying in $[0, 1]$, so that $P_{x^n} \in \mathcal{P}_n$. Furthermore, $|\mathcal{P}_n| \le (n+1)^{|\mathcal{X}|}$, since every element of $\mathcal{P}_n$ is a vector whose $|\mathcal{X}|$ components are non-negative integer multiples of $1/n$ not exceeding $1$, and each component can take at most $n+1$ values, namely $0, 1/n, \dots, 1$. For instance, for $\mathcal{X} = \{0, 1\}$ and $n = 4$, the possible types are $(0, 1), (1/4, 3/4), (1/2, 1/2), (3/4, 1/4), (1, 0)$: there are $n+1 = 5$ of them, indeed at most $(n+1)^{|\mathcal{X}|} = 25$.

9.2.3 Size of Type Class

The third step is to estimate the number of sequences which have a given type. For a given type $P \in \mathcal{P}_n$, denote by $T(P) = \{x^n \in \mathcal{X}^n : P_{x^n} = P\}$ the set of sequences of type $P$, called the type class of $P$.

Proposition 9.2.3. For any type $P \in \mathcal{P}_n$ we have:
$$ (n+1)^{-|\mathcal{X}|}\, 2^{nH(P)} \le |T(P)| \le 2^{nH(P)} $$

Proof: Since the probability of a sequence only depends on its type:
$$ 1 = \sum_{x^n \in \mathcal{X}^n} \mathbb{P}(X^n = x^n) = \sum_{P \in \mathcal{P}_n} |T(P)|\, 2^{-n[H(P) + D(P||Q)]} $$
An upper bound for the size of a type class follows from:
$$ 1 \ge |T(P)|\, 2^{-n[H(P) + D(P||Q)]} $$
This holds for any $Q$, so that choosing $Q = P$, for which $D(P||Q) = 0$:
$$ |T(P)| \le 2^{nH(P)} $$
A lower bound can be derived by observing that
$$ 1 \le |\mathcal{P}_n| \max_{P' \in \mathcal{P}_n} |T(P')|\, 2^{-n[H(P') + D(P'||Q)]} $$
One may check that the maximum in the above occurs for $P' = Q$, so that choosing $Q = P$ gives
$$ |T(P)| \ge (n+1)^{-|\mathcal{X}|}\, 2^{nH(P)} $$
where we used the fact that $|\mathcal{P}_n| \le (n+1)^{|\mathcal{X}|}$.

A few important observations. First, the entropy $H(P)$ provides an estimate of the number of sequences with type $P$, and this estimate is accurate in the exponent, in the sense that when $n \to \infty$:
$$ \frac{1}{n} \log_2 |T(P)| = H(P) + o(1) $$
Second, both the type class $T(P)$ and the typical set of an i.i.d. sample with distribution $P$ have approximately the same size. Third, the size of type classes grows exponentially with $n$, but the number of type classes grows only polynomially in $n$. Finally, consider two types $P$ and $P'$ with $H(P) < H(P')$: when $n$ is large, $T(P')$ will be overwhelmingly larger than $T(P)$. We will leverage those observations to derive powerful results.
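The following Python sketch illustrates Propositions 9.2.1 and 9.2.3 on a binary alphabet: the probability of a particular sequence matches $2^{-n[H(P_{x^n}) + D(P_{x^n}||Q)]}$ exactly, and the size of the type class, which equals a binomial coefficient when $|\mathcal{X}| = 2$, lies between the two bounds. The sequence and the distribution $Q$ below are arbitrary choices.

```python
# Minimal check of the method of types on {0, 1} (assumed Q and sequence).
import numpy as np
from math import comb, log2

Q = np.array([0.7, 0.3])                       # true distribution over {0, 1}
xn = np.array([0, 1, 0, 0, 1, 0, 1, 0, 0, 0])  # an arbitrary sequence
n = len(xn)
k = int(xn.sum())                              # number of ones
P = np.array([1 - k / n, k / n])               # type of the sequence

def H(p):
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def D(p, q):
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

exact_prob = np.prod(Q[xn])                    # product of Q(x_i)
formula    = 2 ** (-n * (H(P) + D(P, Q)))      # Proposition 9.2.1
print(f"P(X^n = x^n): exact {exact_prob:.6e}  formula {formula:.6e}")

size_T = comb(n, k)                            # |T(P)| on a binary alphabet
lower  = (n + 1) ** (-2) * 2 ** (n * H(P))     # (n+1)^{-|X|} 2^{nH(P)}
upper  = 2 ** (n * H(P))
print(f"|T(P)| = {size_T},  bounds of Prop. 9.2.3: [{lower:.1f}, {upper:.1f}]")
```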
9.3 Large Deviations and Sanov's Theorem

Using the method of types we can now derive Sanov's theorem, which enables us to control the fluctuations of the empirical distribution $P_{X^n}$ around the true distribution $Q$ when $X^n$ is drawn i.i.d. from $Q$.

9.3.1 Sanov's Theorem

Proposition 9.3.1. Consider $X^n = (X_1, \dots, X_n)$ drawn i.i.d. from distribution $Q$, and consider $E \subset \mathcal{P}$ a set of distributions. Then
$$ \mathbb{P}(P_{X^n} \in E) \le (n+1)^{|\mathcal{X}|}\, 2^{-n D(P^\star||Q)} \quad \text{where} \quad P^\star = \arg\min_{P \in E} D(P||Q) $$
Furthermore, if $E$ is the closure of its interior, then when $n \to \infty$:
$$ -\frac{1}{n} \log_2 \mathbb{P}(P_{X^n} \in E) \to D(P^\star||Q) $$

Proof: Summing over the possible types,
$$ \mathbb{P}(P_{X^n} \in E) = \sum_{P \in \mathcal{P}_n \cap E} \mathbb{P}(P_{X^n} = P) $$
The probability of type $P$ occurring is:
$$ \mathbb{P}(P_{X^n} = P) = |T(P)|\, 2^{-n[H(P) + D(P||Q)]} $$
Using the fact that $|T(P)| \le 2^{nH(P)}$:
$$ \mathbb{P}(P_{X^n} = P) \le 2^{-n D(P||Q)} \le 2^{-n D(P^\star||Q)} $$
using the fact that $P^\star$ minimizes $D(P||Q)$ over $E$. Summing the above over $P$ and using the fact that $|\mathcal{P}_n \cap E| \le (n+1)^{|\mathcal{X}|}$, we get the first result:
$$ \mathbb{P}(P_{X^n} \in E) \le (n+1)^{|\mathcal{X}|}\, 2^{-n D(P^\star||Q)} $$
If $E$ is the closure of its interior, we can find a sequence of types $P_n \in \mathcal{P}_n \cap E$ such that when $n \to \infty$, $D(P_n||Q) \to D(P^\star||Q)$, and in turn $\mathbb{P}(P_{X^n} \in E) \ge \mathbb{P}(P_{X^n} = P_n)$ with
$$ \mathbb{P}(P_{X^n} = P_n) = |T(P_n)|\, 2^{-n[H(P_n) + D(P_n||Q)]} \ge (n+1)^{-|\mathcal{X}|}\, 2^{-n D(P_n||Q)} $$
using the fact that $|T(P_n)| \ge (n+1)^{-|\mathcal{X}|}\, 2^{nH(P_n)}$. Taking logarithms, when $n \to \infty$:
$$ -\frac{1}{n} \log_2 \mathbb{P}(P_{X^n} \in E) \to D(P^\star||Q) $$
which is the second result.

Sanov's theorem enables us to predict the behavior of the empirical distribution of an i.i.d. sample, and is a "large deviation" result, in the sense that it predicts events with exponentially small probability. The empirical distribution typically lies close to the true distribution $Q$, and when $Q \notin E$, this means that $P_{X^n} \in E$ is unlikely. The theorem predicts that the probability of this event only depends on $P^\star$, which can be interpreted as the "closest" distribution to $Q$, where "distance" is measured by relative entropy. We will give several examples that illustrate the power of this result.

9.3.2 Examples

We now highlight a few examples of how Sanov's theorem may be applied to various statistical problems.

Majority Vote. Consider an election with two candidates, where $Q(1)$ and $Q(2)$ are the proportions of people who prefer candidates 1 and 2 respectively. We gather the votes $X_1, \dots, X_n$ of $n$ voters, which we assume to be i.i.d. distributed from $Q$. The candidate who wins is the one who gathers the most votes. Assume that $Q(1) > 1/2$, so that 1 is the favorite candidate. What is the probability that 2 gets elected in place of 1? The votes $X^n = (X_1, \dots, X_n)$ are an i.i.d. sample from $Q$, and 2 gets elected if and only if $P_{X^n}(2) \ge 1/2$, i.e. if he gets at least $n/2$ votes. So 2 gets elected if and only if $P_{X^n} \in E$ where
$$ E = \{P \in \mathcal{P} : P(2) \ge 1/2\} $$
We can then apply Sanov's theorem to conclude that 2 gets elected in place of 1 with probability
$$ \mathbb{P}(P_{X^n} \in E) \approx 2^{-n D(P^\star||Q)} $$
with $P^\star = (1/2, 1/2)$, so that
$$ D(P^\star||Q) = \frac{1}{2} \log_2 \frac{1/2}{Q(2)} + \frac{1}{2} \log_2 \frac{1/2}{1 - Q(2)} $$
Indeed:
$$ D(P||Q) = P(2) \log_2 \frac{P(2)}{Q(2)} + (1 - P(2)) \log_2 \frac{1 - P(2)}{1 - Q(2)} $$
and minimizing this quantity over $P(2)$ under the constraint $P(2) \ge 1/2$ gives $P(2) = 1/2$, since $Q(2) \le 1/2$.
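As a numerical companion to the majority vote example, the following Python sketch computes the exact probability that the minority candidate obtains at least $n/2$ of the $n$ votes and compares it with the Sanov estimate $2^{-nD(P^\star||Q)}$; as expected, the agreement is in the exponent only, up to a polynomial prefactor. The values of $Q(2)$ and $n$ are arbitrary choices.

```python
# Minimal check of the majority vote example (assumed Q(2) and n):
# exact binomial tail probability versus the Sanov approximation.
from math import comb, log2, ceil

q2 = 0.3          # fraction of the population preferring candidate 2
n = 200           # number of sampled voters

# Exact probability that candidate 2 gets at least n/2 votes (binomial tail).
p_exact = sum(comb(n, k) * q2**k * (1 - q2)**(n - k)
              for k in range(ceil(n / 2), n + 1))

# Sanov exponent D(P*||Q) with P* = (1/2, 1/2).
D_star = 0.5 * log2(0.5 / q2) + 0.5 * log2(0.5 / (1 - q2))

print(f"exact probability        : {p_exact:.3e}")
print(f"Sanov estimate 2^(-nD*)  : {2 ** (-n * D_star):.3e}")
print(f"empirical exponent       : {-log2(p_exact) / n:.4f} bits")
print(f"Sanov exponent D(P*||Q)  : {D_star:.4f} bits")
```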
Testing Fairness. Assume that one is given a dice with $k$ faces, and we want to test whether or not the dice is fair, in the sense that it is equally likely to fall on each of its faces. Consider $X^n = (X_1, \dots, X_n)$ the outcomes of casting the dice $n$ times, where $X_i \in \mathcal{X}$ is the index of the face on which the dice has fallen. To test fairness of the dice we compute the empirical distribution $P_{X^n}$ and we compare it to $Q$, the uniform distribution over $\mathcal{X}$. Namely, if $D(P_{X^n}||Q) \le \epsilon$ we deem the dice to be fair, and unfair otherwise. What is the probability that we mistake a fair dice for an unfair dice? If the dice is fair, $X^n = (X_1, \dots, X_n)$ is an i.i.d. sample from $Q$, and we mistake it for an unfair dice if and only if $P_{X^n} \in E$ where
$$ E = \{P \in \mathcal{P} : D(P||Q) \ge \epsilon\} $$
Hence from Sanov's theorem, the probability of a mistake is
$$ \mathbb{P}(P_{X^n} \in E) \approx 2^{-n D(P^\star||Q)} \approx 2^{-n\epsilon} $$
with $D(P^\star||Q) = \min_{P \in E} D(P||Q) = \epsilon$. It is remarkable that Sanov's theorem allows for an easy, explicit computation.

Testing General Distributions. It is also noted that the above works in the more general case where $Q$ is not the uniform distribution, but simply some target distribution: namely, reject the hypothesis that $p = Q$ if $D(P_{X^n}||Q) \ge \epsilon$, and accept it otherwise.

Chapter 10
Mathematical Tools

In this chapter we provide a few results that are instrumental for some proofs. Results are stated without proofs.

10.1 Jensen Inequality

Definition 10.1.1. A function $f : \mathbb{R}^d \to \mathbb{R}$ is said to be convex if for all $x, y \in \mathbb{R}^d$ and all $\lambda \in [0, 1]$:
$$ f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y) $$

Property 8. A twice differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ is convex if and only if its Hessian $(\nabla^2 f)(x)$ is positive semi-definite for all $x \in \mathbb{R}^d$.

Property 9 (Jensen's Inequality). Consider $f : \mathbb{R}^d \to \mathbb{R}$ a convex function and $X$ a random vector in $\mathbb{R}^d$. Then
$$ f(\mathbb{E}(X)) \le \mathbb{E}(f(X)) $$
with equality if and only if $f$ is linear over the support of the distribution of $X$.

10.2 Constrained Optimization

Property 10 (Karush-Kuhn-Tucker Conditions). Consider $f : \mathbb{R}^d \to \mathbb{R}$ a concave, differentiable function, and let $x^\star$ be its maximum over the simplex:
$$ \text{Maximize } f(x) \quad \text{s.t.} \quad \sum_{i=1}^d x_i \le 1 \text{ and } x \ge 0 $$
Then there exist $\lambda \in \mathbb{R}^+$ and $\mu \in (\mathbb{R}^+)^d$ such that
$$ \nabla f(x^\star) - \lambda \mathbf{1} + \mu = 0 $$
where $\lambda \left( \sum_{i=1}^d x^\star_i - 1 \right) = 0$ and $x^\star_i \mu_i = 0$ for all $i$.
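To close, the following minimal Python sketch illustrates Jensen's inequality for the convex function $f(x) = -\log_2(x)$, which is the inequality underlying the positivity of relative entropy. The distribution of $X$ below is an arbitrary choice.

```python
# Minimal illustration of Jensen's inequality for f(x) = -log2(x)
# (the distribution of X is an assumed example).
import numpy as np

values = np.array([0.5, 1.0, 2.0, 4.0])     # support of X
probs = np.array([0.1, 0.4, 0.3, 0.2])      # P(X = value)

lhs = np.sum(probs * (-np.log2(values)))    # E[f(X)]
rhs = -np.log2(np.sum(probs * values))      # f(E[X])

print(f"E[-log2(X)]  = {lhs:.4f}")
print(f"-log2(E[X])  = {rhs:.4f}")
print("Jensen: E[f(X)] >= f(E[X]) ->", lhs >= rhs)
```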