Information Theory Lecture Notes
Richard Combes 1
Version 1.0

1 Université Paris-Saclay, CNRS, CentraleSupélec, Laboratoire des signaux et systèmes, France
Contents

1 Information Measures
  1.1 Entropy
    1.1.1 Definition
    1.1.2 Entropy and Physics
    1.1.3 Positivity of Entropy and Maximal Entropy
  1.2 Joint and Conditional Entropy
    1.2.1 Definition
    1.2.2 Properties
  1.3 Relative Entropy
    1.3.1 Definition
    1.3.2 Positivity of Relative Entropy
    1.3.3 Relative Entropy is Not a Distance
  1.4 Mutual Information
    1.4.1 Definition
    1.4.2 Positivity of Mutual Information
    1.4.3 Conditioning Reduces Entropy

2 Properties of Information Measures
  2.1 Chain Rules
    2.1.1 Chain Rule for Entropy
    2.1.2 Chain Rule for Mutual Information
    2.1.3 Chain Rule for Relative Entropy
  2.2 Log Sum Inequality
    2.2.1 Statement
  2.3 Data Processing and Markov Chains
    2.3.1 Markov Chains
    2.3.2 Data Processing Inequality
  2.4 Fano Inequality
    2.4.1 Estimation Problems
    2.4.2 Statement
  2.5 Asymptotic Equipartition and Typicality
    2.5.1 AEP
    2.5.2 Typicality
    2.5.3 Joint Typicality

3 Data Representation: Fundamental Limits
  3.1 Source Coding
    3.1.1 Definition
    3.1.2 Expected Length
    3.1.3 Non-Singular Codes
    3.1.4 Uniquely Decodable Codes
  3.2 Prefix Codes
    3.2.1 Definition
    3.2.2 Prefix Codes as Trees
    3.2.3 Kraft Inequality
  3.3 Optimal Codes and Entropy
    3.3.1 Lower Bound on the Expected Code Length
    3.3.2 Existence of Nearly Optimal Codes
    3.3.3 Asymptotically Optimal Codes

4 Data Representation: Algorithms
  4.1 The Huffman Algorithm
    4.1.1 Algorithm
    4.1.2 Rationale
    4.1.3 Complexity
    4.1.4 Limitations
    4.1.5 Illustration
    4.1.6 Optimality
  4.2 Markov Coding
    4.2.1 Markov Sources
    4.2.2 The Entropy of English
    4.2.3 Efficient Codes for Markov Sources
  4.3 Universal Coding
    4.3.1 Universality
    4.3.2 A Simple Universal Code for Binary Sequences
    4.3.3 Lempel-Ziv Coding

5 Data Representation: Rate-Distortion Theory
  5.1 Lossy Compression, Quantization and Distortion
    5.1.1 Lossless vs Lossy Compression
    5.1.2 The Quantization Problem
  5.2 Scalar Quantization
    5.2.1 Lloyd-Max Conditions
    5.2.2 Uniform Distribution
    5.2.3 Gaussian Distribution with One Bit
    5.2.4 General Distributions
  5.3 Vector Quantization
    5.3.1 Vector Quantization is Better than Scalar Quantization
    5.3.2 Paradoxes of High Dimensions
    5.3.3 Rate-Distortion Function
  5.4 Rate-Distortion Theorem
    5.4.1 Lower Bound
    5.4.2 Efficient Coding Scheme: Random Coding
  5.5 Rate-Distortion for Gaussian Distributions
    5.5.1 Gaussian Random Variables
    5.5.2 Gaussian Vectors

6 Mutual Information and Communication: Discrete Channels
  6.1 Memoryless Channels
    6.1.1 Definition
    6.1.2 Information Capacity of a Channel
    6.1.3 Examples
    6.1.4 Non-Overlapping Outputs Channels
    6.1.5 Binary Symmetric Channel
    6.1.6 Typewriter Channel
    6.1.7 Binary Erasure Channel
  6.2 Channel Coding
    6.2.1 Coding Schemes
    6.2.2 Example of a Code for the BSC
    6.2.3 Achievable Rates
  6.3 Noisy Channel Coding Theorem
    6.3.1 Capacity Upper Bound
    6.3.2 Efficient Coding Scheme: Random Coding
  6.4 Computing the Channel Capacity
    6.4.1 Capacity of Weakly Symmetric Channels
    6.4.2 Concavity of Mutual Information
    6.4.3 Algorithms for Mutual Information Maximization

7 Mutual Information and Communication: Continuous Channels
  7.1 Information Measures for Continuous Variables
    7.1.1 Differential Entropy
    7.1.2 Examples
    7.1.3 Joint and Conditional Entropy, Mutual Information
    7.1.4 Unified Definitions for Information Measures
  7.2 Properties of Information Measures for Continuous Variables
    7.2.1 Chain Rule for Differential Entropy
    7.2.2 Differential Entropy of Affine Transformation
  7.3 Differential Entropy of Multivariate Gaussians
    7.3.1 Computing the Differential Entropy
    7.3.2 The Gaussian Distribution Maximizes Entropy
  7.4 Capacity of Continuous Channels
  7.5 Gaussian Channels
    7.5.1 Gaussian Channel
    7.5.2 The AWGN Channel
    7.5.3 Parallel Gaussian Channels
    7.5.4 Vector Gaussian Channels

8 Portfolio Theory
  8.1 A Model for Investment
    8.1.1 Asset Prices and Portfolios
    8.1.2 Relative Returns
  8.2 Log Optimal Portfolios
    8.2.1 Asymptotic Wealth Distribution
    8.2.2 Growth Rate Maximization
  8.3 Properties of Log Optimal Portfolios
    8.3.1 Kuhn-Tucker Conditions
    8.3.2 Asymptotic Optimality
  8.4 Investment with Side Information
    8.4.1 Mismatched Portfolios
    8.4.2 Exploiting Side Information

9 Information Theory for Machine Learning and Statistics
  9.1 Statistics
    9.1.1 Statistical Inference
    9.1.2 Examples of Inference Problems
    9.1.3 Empirical Distributions
  9.2 The Method of Types
    9.2.1 Probability Distribution of a Sample
    9.2.2 Number of Types
    9.2.3 Size of Type Class
  9.3 Large Deviations and Sanov's Theorem
    9.3.1 Sanov's Theorem
    9.3.2 Examples

10 Mathematical Tools
  10.1 Jensen Inequality
  10.2 Constrained Optimization
Foreword
These lecture notes pertain to the Information Theory course given at CentraleSupélec. They are based on the book "Cover and Thomas, Elements of Information Theory", which we highly recommend to interested students who wish to go further in the study of this topic. Each chapter corresponds to a lecture, apart from the last chapter, which contains mathematical tools used in the proofs.
Chapter 1
Information Measures
In this chapter we introduce information measures for discrete random variables,
which form the basis of all information theory: entropy, relative entropy and mutual
information, and prove a few elementary properties.
1.1 Entropy

1.1.1 Definition
Definition 1.1.1. The entropy of X ∈ X, a discrete random variable with distribution pX, is:

$$H(X) = \mathbb{E}\left[\log_2 \frac{1}{p_X(X)}\right] = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{1}{p_X(x)}$$
Entropy is arguably the most fundamental information measure. The entropy H(X) is a real number that depends only on the distribution of X, and is expressed in bits. If the base 2 logarithm log2 is replaced by the natural logarithm log, then entropy is expressed in nats, and the two are equivalent up to a multiplicative factor, in the sense that 1 nat is equal to 1/log(2) ≈ 1.44 bits. We shall later see that H(X) both measures the randomness of X, as well as how much information is contained in X.
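As a quick numerical illustration (ours, not part of the original notes), the entropy can be computed directly from the definition; the minimal Python sketch below assumes the distribution is given as a list of probabilities.

```python
import math

def entropy_bits(p):
    """Entropy H(X) in bits of a discrete distribution p given as a list of probabilities."""
    return sum(px * math.log2(1.0 / px) for px in p if px > 0)

p = [0.5, 0.25, 0.125, 0.125]
H = entropy_bits(p)
print(H)                 # 1.75 bits
print(H * math.log(2))   # the same entropy expressed in nats: 1 bit = log(2) nats
```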
1.1.2 Entropy and Physics
The notion of entropy originates from statistical physics. If the random variable X is the state of a physical system with distribution pX, then H(X) is called the Gibbs entropy. One of the fundamental ideas is that the Gibbs entropy of an isolated physical system is a non-decreasing function of time, and that its equilibrium distribution must maximize the Gibbs entropy. Therefore, the randomness in an isolated system always increases and is maximized at equilibrium.
In fact, one can prove that the Boltzmann distribution:

$$p_X(x) = \frac{\exp\left(-\frac{E(x)}{k_B T}\right)}{\sum_{x' \in \mathcal{X}} \exp\left(-\frac{E(x')}{k_B T}\right)}$$

where T is the temperature, E(x) is the energy of state x and kB is the Boltzmann constant, maximizes the Gibbs entropy under the average energy constraint $\sum_{x \in \mathcal{X}} p_X(x) E(x) = \bar{E}$.
1.1.3 Positivity of Entropy and Maximal Entropy
Property 1. The entropy of X ∈ X, a discrete random variable with distribution X ∼ pX, verifies 0 ≤ H(X) ≤ log2 |X|, with H(X) = log2 |X| if and only if X is uniform.
Proof: Since 0 ≤ pX(x) ≤ 1:

$$H(X) = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{1}{p_X(x)} \ge \sum_{x \in \mathcal{X}} p_X(x) \log_2 1 = 0.$$

The logarithm is strictly concave, so using Jensen's inequality:

$$H(X) = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{1}{p_X(x)} \le \log_2 \left( \sum_{x \in \mathcal{X}} p_X(x) \frac{1}{p_X(x)} \right) = \log_2 |\mathcal{X}|,$$

with equality if and only if 1/pX(x) is constant, that is, X is uniform. Indeed, if X is uniform:

$$H(X) = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{1}{p_X(x)} = \sum_{x \in \mathcal{X}} p_X(x) \log_2 |\mathcal{X}| = \log_2 |\mathcal{X}|.$$
Entropy is positive and is upper bounded by the logarithm of the size of the
support |X |, with equality if and only if X is uniform. The fact that entropy is
positive makes sense since entropy must measure an amount of information which
must be positive. Furthermore, it makes sense to view entropy as a measure of
randomness, since it is minimized (H(X) = 0) if X is deterministic, and it is
maximized (H(X) = log2 |X |) for the uniform distribution which are respectively
the least and the most random distributions over X .
1.2 Joint and Conditional Entropy

1.2.1 Definition
Definition 1.2.1. The joint entropy of X ∈ X and Y ∈ Y, two discrete random variables with joint distribution (X, Y) ∼ pX,Y(x, y), is:

$$H(X,Y) = \mathbb{E}\left[\log_2 \frac{1}{p_{X,Y}(X,Y)}\right] = \sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} p_{X,Y}(x,y) \log_2 \frac{1}{p_{X,Y}(x,y)}$$
The joint entropy H(X, Y ) is simply the entropy of (X, Y ) seen as a single
random variable. It is important to notice that the joint entropy depends on the full
joint distribution of X and Y , not only on the marginal distributions.
Definition 1.2.2. The conditional entropy of X ∈ X knowing Y ∈ Y, two discrete random variables with joint distribution pX,Y and conditional distribution pX|Y, is:

$$H(X|Y) = \mathbb{E}\left[\log_2 \frac{1}{p_{X|Y}(X|Y)}\right] = \mathbb{E}\left[\log_2 \frac{1}{p_{X,Y}(X,Y)}\right] - \mathbb{E}\left[\log_2 \frac{1}{p_Y(Y)}\right] = H(X,Y) - H(Y).$$
The conditional entropy H(X|Y) measures the entropy of X once the value of Y has been revealed. The several definitions above are all equivalent, which follows from the Bayes rule stating that

$$p_{X,Y}(x, y) = p_{X|Y}(x|y) \, p_Y(y).$$
In particular, the last relationship
H(X, Y ) = H(X|Y ) + H(Y )
is called a chain rule, and can be interpreted as the fact that the amount of randomness in (X, Y ) equals the amount of randomness in Y plus the amount of
randomness left in X once Y has been revealed.
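The chain rule H(X, Y) = H(X|Y) + H(Y) is easy to verify numerically on a small joint distribution; the sketch below (ours) uses an arbitrary 2×2 joint table as an example.

```python
import math

def H(probs):
    """Entropy in bits of an iterable of probabilities summing to 1."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# joint distribution p_{X,Y}(x, y): rows are values of x, columns values of y
p_xy = [[0.4, 0.1],
        [0.2, 0.3]]
p_y = [sum(col) for col in zip(*p_xy)]          # marginal distribution of Y

# H(X|Y) computed directly as E[log2 (p_Y(Y) / p_{X,Y}(X,Y))]
H_x_given_y = sum(p_xy[x][y] * math.log2(p_y[y] / p_xy[x][y])
                  for x in range(2) for y in range(2) if p_xy[x][y] > 0)

H_xy = H(p for row in p_xy for p in row)        # H(X, Y)
print(H_xy, H_x_given_y + H(p_y))               # equal: H(X,Y) = H(X|Y) + H(Y)
```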
1.2.2 Properties
Property 2. If X and Y are independent then H(X|Y ) = H(X) and
H(X, Y ) = H(X) + H(Y )
Proof: If X and Y are independent then pX,Y(x, y) = pX(x)pY(y), and substituting this into the definitions gives the result immediately.
Entropy is additive for independent random variables, which once again is
coherent with its interpretation as a measure of randomness. Indeed, if there
is no relationship between X and Y , the randomness of (X, Y ) is simply the
sum of the randomness in X and Y taken separately. It is also noticed that
entropy is not additive if X and Y are correlated, for instance if X = Y then
H(X, Y) = H(X) ≠ H(X) + H(Y), unless both X and Y are deterministic.
Property 3. Conditional entropy is not symmetrical unless H(X) = H(Y ):
H(Y |X) − H(X|Y ) = H(Y ) − H(X)
Conditional entropy is not symmetrical, one notable exception being if X and
Y have the same distribution.
1.3 Relative Entropy

1.3.1 Definition
Definition 1.3.1. Consider p, q two distributions over a discrete set X, and X with distribution pX = p. The relative entropy between p and q is:

$$D(p\|q) = \mathbb{E}\left[\log_2 \frac{p(X)}{q(X)}\right] = \sum_{x \in \mathcal{X}} p(x) \log_2 \frac{p(x)}{q(x)},$$

if p is absolutely continuous with respect to q, and D(p||q) = +∞ otherwise.
Relative entropy is another fundamental information measure, and a notable difference is that, while entropy measures the randomness of a single distribution, relative entropy measures the dissimilarity between two distributions p, q. It is also noted that if p is not absolutely continuous with respect to q, then p(x)/q(x) = +∞ for some x ∈ X with p(x) > 0, so that indeed D(p||q) = +∞.
1.3.2 Positivity of Relative Entropy
Property 4. Consider p, q two distributions. Then D(p||q) ≥ 0 with equality if
and only if p = q.
Proof: Since z ↦ −log2 z is strictly convex, from Jensen's inequality:

$$D(p\|q) = -\mathbb{E}\left[\log_2 \frac{q(X)}{p(X)}\right] \ge -\log_2 \mathbb{E}\left[\frac{q(X)}{p(X)}\right] = -\log_2 \left( \sum_{x \in \mathcal{X}} p(x) \frac{q(x)}{p(x)} \right) = -\log_2 1 = 0.$$
Relative entropy (sometimes called Kullback-Leibler divergence) is positive,
which makes sense as it measures dissimilarity. We always have D(p||q) ≥ 0, and
D(p||q) = 0 if p = q and the larger the value of D(p||q), the more dissimilar p is
to q. We shall also see later that there exist many other measures of dissimilarity
between distributions in information theory.
1.3.3 Relative Entropy is Not a Distance
Example 1. Consider |X| = 2, p = (1/2, 1/2) and q = (a, 1 − a). Then D(p||q) ≠ D(q||p) if a ≠ 1/2.
It should be noted that relative entropy is not a distance: it is not symmetrical
by the example above, nor does it satisfy the triangle inequality.
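The asymmetry from Example 1 can be checked numerically; the sketch below (ours) takes a = 0.9.

```python
import math

def kl_bits(p, q):
    """Relative entropy D(p||q) in bits; assumes p is absolutely continuous w.r.t. q."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

a = 0.9
p = [0.5, 0.5]
q = [a, 1 - a]
print(kl_bits(p, q))   # D(p||q)
print(kl_bits(q, p))   # D(q||p): a different value, relative entropy is not symmetric
```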
1.4 Mutual Information

1.4.1 Definition
Definition 1.4.1. Let (X, Y) be discrete random variables with joint distribution pX,Y and marginal distributions pX and pY respectively. The mutual information between X and Y is:

$$\begin{aligned}
I(X;Y) &= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{X,Y}(x,y) \log_2 \frac{p_{X,Y}(x,y)}{p_X(x) p_Y(y)} \\
&= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{X,Y}(x,y) \left( \log_2 \frac{1}{p_X(x)} - \log_2 \frac{1}{p_{X|Y}(x|y)} \right) \\
&= H(X) - H(X|Y) \\
&= H(Y) - H(Y|X) \\
&= H(X) + H(Y) - H(X,Y) \\
&= D(p_{X,Y} \| p_X p_Y)
\end{aligned}$$
The last measure of information we consider is called the mutual information. We provide several definitions, which are all equivalent to each other, and this can be checked by inspection. Mutual information is symmetric by definition. We have that I(X;Y) = H(X) − H(X|Y); therefore, if I(X;Y) is large, one must have that H(X) is large, so that the randomness of X is large, and that H(X|Y) is small, so that the randomness of X knowing Y is small, i.e. it is easy to guess X from Y. We also have that I(X;Y) measures the dissimilarity between the joint distribution of (X, Y), which is pX,Y, and the distribution that (X, Y) would have if X and Y were chosen independently with the same marginals pX, pY. So mutual information can also be seen as a measure of dependency between X and Y.
We shall see later that mutual information also quantifies the amount of information that can be exchanged between a sender who selects X and a receiver that observes Y.
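The equivalence of the expressions above is easy to check numerically; the sketch below (ours) computes I(X;Y) on a small joint table both as H(X) + H(Y) − H(X,Y) and as the relative entropy D(pX,Y || pX pY).

```python
import math

def H(probs):
    """Entropy in bits of an iterable of probabilities summing to 1."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

p_xy = [[0.4, 0.1],
        [0.2, 0.3]]
p_x = [sum(row) for row in p_xy]
p_y = [sum(col) for col in zip(*p_xy)]

# I(X;Y) = H(X) + H(Y) - H(X,Y)
I1 = H(p_x) + H(p_y) - H(p for row in p_xy for p in row)

# I(X;Y) = D(p_{X,Y} || p_X p_Y)
I2 = sum(p_xy[x][y] * math.log2(p_xy[x][y] / (p_x[x] * p_y[y]))
         for x in range(2) for y in range(2) if p_xy[x][y] > 0)

print(I1, I2)   # the two values coincide and are >= 0
```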
1.4.2 Positivity of Mutual Information
Property 5. Let X, Y be discrete random variables. Then I(X;Y) ≥ 0, with equality if and only if X and Y are independent.
Proof: By definition I(X; Y ) = D(pX,Y ||pX pY ) ≥ 0 since relative entropy is
positive, with equality if and only if pX,Y = pX pY so that X,Y are independent.
Mutual information is positive, since it can be written as a relative entropy.
This has important consequences as we shall see.
1.4.3 Conditioning Reduces Entropy
Property 6. Let X, Y be discrete random variables. Then H(X|Y) ≤ H(X), with equality if and only if X, Y are independent, and H(X,Y) ≤ H(X) + H(Y), with equality if and only if X, Y are independent.
Proof: We have 0 ≤ I(X;Y) = H(X) − H(X|Y), with equality if and only if X, Y are independent. From the chain rule, H(X,Y) = H(Y|X) + H(X) ≤ H(X) + H(Y), using the previous result.
From the positivity of mutual information, we deduce two important properties.
The first is that conditioning always reduces entropy, which is intuitive since
revealing the value of Y reduces the randomness in X. Furthermore, we have
already seen that entropy is additive for independent random variables; we now see that it is in fact sub-additive, so that the joint entropy is never larger than the sum of the entropies.
Chapter 2
Properties of Information Measures
In this chapter we introduce important properties of information measures which enable us to manipulate them efficiently, such as chain rules. We also introduce fundamental inequalities involving information measures: the data processing inequality, the log-sum inequality and Fano's inequality.
2.1 Chain Rules
In general, a chain rule is simply a formula that allows one to compute information measures by recursion.
2.1.1 Chain Rule for Entropy
Definition 2.1.1. For any X1, ..., Xn we have:

$$H(X_1,\ldots,X_n) = \sum_{i=1}^{n} H(X_i \,|\, X_{i-1},\ldots,X_1)$$
Proof: By definition of conditional entropy:
H(X1 , ..., Xn ) = H(Xn |Xn−1 , ..., X1 ) + H(Xn−1 , ..., X1 )
The result follows by induction over n.
The chain rule for entropy allows one to compute the entropy of X1, ..., Xn by successive conditioning, and has the following interpretation: imagine that the values of X1, ..., Xn are presented to us as a time series, one value after the other; then H(Xi|Xi−1, ..., X1) is simply the randomness of the current value Xi knowing the full history of the process up to time i − 1, which is Xi−1, ..., X1.
2.1.2 Chain Rule for Mutual Information
Definition 2.1.2. For any X1, ..., Xn we have:

$$I(X_1,\ldots,X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \,|\, X_{i-1},\ldots,X_1)$$
Proof: Using both the chain rule and the definition of mutual information:

$$\begin{aligned}
I(X_1,\ldots,X_n; Y) &= H(X_1,\ldots,X_n) - H(X_1,\ldots,X_n \,|\, Y) \\
&= \sum_{i=1}^{n} H(X_i|X_{i-1},\ldots,X_1) - \sum_{i=1}^{n} H(X_i|X_{i-1},\ldots,X_1,Y) \\
&= \sum_{i=1}^{n} I(X_i; Y \,|\, X_{i-1},\ldots,X_1).
\end{aligned}$$
The chain rule for mutual information also has a natural interpretation. Imagine that a sender selects X1, ..., Xn and attempts to communicate with a receiver who observes Y. Then the information that can be exchanged, I(X1, ..., Xn; Y), is the sum of the terms I(Xi; Y|Xi−1, ..., X1), which can be interpreted as the sender sending X1, the receiver retrieving X1 from Y, then the sender sending X2 and the receiver retrieving X2 from both Y and X1, and so on. This idea of retrieving X1, ..., Xn iteratively is used in many communication systems.
2.1.3 Chain Rule for Relative Entropy
Definition 2.1.3. Consider two joint distributions pX,Y and qX,Y of a pair of discrete random variables (X, Y), with respective marginals pX, qX and conditional distributions pY|X, qY|X. We have:

$$D(p_{X,Y}\|q_{X,Y}) = D(p_X\|q_X) + D(p_{Y|X}\|q_{Y|X})$$
Proof: Using the Bayes rule:

$$\begin{aligned}
D(p_{X,Y}\|q_{X,Y}) &= \mathbb{E}\left[\log_2 \frac{p_{X,Y}(X,Y)}{q_{X,Y}(X,Y)}\right] \\
&= \mathbb{E}\left[\log_2 \frac{p_{Y|X}(Y|X)}{q_{Y|X}(Y|X)}\right] + \mathbb{E}\left[\log_2 \frac{p_X(X)}{q_X(X)}\right] \\
&= D(p_{Y|X}\|q_{Y|X}) + D(p_X\|q_X),
\end{aligned}$$

proving the result.
The interpretation of this chain rule is similar to that for the entropy.
2.2 Log Sum Inequality
In information theory, weighted sums of logarithms are ubiquitous, and the so-called log-sum inequality is a useful tool in many situations.
2.2.1 Statement
Proposition 2.2.1. For any positive numbers (ai)i=1,...,n and (bi)i=1,...,n:

$$\sum_{i=1}^{n} a_i \log_2 \frac{a_i}{b_i} \ge \left(\sum_{i=1}^{n} a_i\right) \log_2 \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i}$$

with equality iff ai/bi = c for all i.
Proof: The function f(x) = x log2 x is strictly convex, as f''(x) = log2(e)/x > 0. Using Jensen's inequality with $\alpha_i = b_i / \sum_{j=1}^{n} b_j$:

$$\sum_{i=1}^{n} a_i \log_2 \frac{a_i}{b_i} = \left(\sum_{j=1}^{n} b_j\right) \sum_{i=1}^{n} \alpha_i f\!\left(\frac{a_i}{b_i}\right) \ge \left(\sum_{j=1}^{n} b_j\right) f\!\left(\sum_{i=1}^{n} \alpha_i \frac{a_i}{b_i}\right) = \left(\sum_{i=1}^{n} a_i\right) \log_2 \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i}.$$
Interestingly, the log-sum inequality implies a variety of other results as we
shall see later.
2.3 Data Processing and Markov Chains
A fundamental idea in information theory, which to a degree justifies the definition
of mutual information in itself, is that data processing, even with unlimited computing power, cannot create information. This is formalized by the data processing
inequality for Markov chains.
2.3.1 Markov Chains
Definition 2.3.1. X → Y → Z is a Markov chain iff X and Z are independent
given Y . Equivalently we have (X, Y, Z) ∼ pX,Y,Z (x, y, z) with
pX,Y,Z (x, y, z) = pX (x)pY |X (y|x)pZ|Y,X (z|y, x) = pX (x)pY |X (y|x)pZ|Y (z|y).
Simply said, a Markov chain X → Y → Z is such that one first draws the value of X, then once the value of X is known one draws Y according to some distribution that depends solely on X, and finally one draws Z according to some distribution that depends solely on Y. The key idea is that, in order to generate Z, one can only look at the previously generated value Y, i.e. we generate the process with a memory of order 1. The simplest, and most often encountered, example of a Markov chain X → Y → Z is any X, Y, Z such that Z = g(Y), where g is a known, deterministic function.
2.3.2 Data Processing Inequality
Proposition 2.3.2. If X → Y → Z then I(X; Y ) ≥ I(X; Z).
Proof We have:
I(X; Y, Z) = I(X; Z) + I(X; Y |Z) = I(X; Y ) + I(X; Z|Y )
since I(X; Y |Z) ≥ 0 and I(X; Z|Y ) = 0 we have I(X; Y ) ≥ I(X; Z).
The data processing inequality simply states that mutual information cannot
increase along a Markov chain, i.e. data processing cannot create information
out of nowhere. An interpretation in the context of communication is that if a
sender selects X and a receiver observes Y , and a helper offers to help the receiver
by computing the value of g(Y), then X → Y → g(Y) is a Markov chain and so I(X; g(Y)) ≤ I(X; Y), i.e. the helper is in fact never helpful.
2.4 Fano Inequality
We now derive Fano’s inequality, which establishes a fundamental link between
entropy and the probability of error in estimation problems, and is essential in both
statistics and communication.
2.4.1 Estimation Problems
We call estimation problem a situation in which an agent observes a random
variable Y , and attempts to guess another hidden random variable X. The agent
is allowed to construct any estimator X̂, without any limitation on his computing
power. The goal is to minimize the estimation error P(X ̸= X̂).
2.4.2 Statement
Proposition 2.4.1. If X → Y → X̂ then:

$$h_2(P(X \neq \hat{X})) + P(X \neq \hat{X}) \log_2 |\mathcal{X}| \ge H(X|Y)$$

with $h_2(p) = p \log_2 \frac{1}{p} + (1-p) \log_2 \frac{1}{1-p}$ the binary entropy.
Proof: Since X → Y → X̂ is a Markov chain
H(X) − H(X|X̂) = I(X; X̂) ≤ I(X; Y ) = H(X) − H(X|Y )
so that
H(X|Y ) ≤ H(X|X̂)
Define E = 1{X̂ ̸= X}, using the chain rule in both directions:
H(E|X̂) + H(X|E, X̂) = H(X, E|X̂) = H(X|X̂) + H(E|X, X̂)
Now H(E|X, X̂) = 0 because E is a deterministic function of X, X̂ which proves:
H(X|X̂) = H(E|X̂) + H(X|E, X̂)
We have
H(X|E, X̂) ≤ P(E = 1) log2 (|X | − 1) + P(E = 0) log2 (1)
because if E = 0 then X = X̂ can take only one value, and if E = 1 then X ≠ X̂ can take at most |X| − 1 values. Finally, since conditioning reduces entropy:
H(E|X̂) ≤ H(E) = h2 (P(E = 1))
which concludes the proof.
Fano's inequality states that the estimation error P(X ≠ X̂) cannot be arbitrarily small unless the conditional entropy of the hidden variable knowing the observation, H(X|Y), is small too. This is a fundamental limit that holds irrespective of how much computational power is available to perform the estimation. It is intuitive since H(X|Y) is the randomness left in X once Y has been seen by the agent. Fano's inequality therefore shows that conditional entropy can be used as a measure of how difficult an estimation problem is.
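As an illustration (ours, not in the notes), one can check Fano's inequality on a small joint distribution by comparing H(X|Y) with the left-hand side evaluated at the error probability of the best possible estimator, which is the MAP estimator x̂(y) = argmax_x pX,Y(x, y).

```python
import math

def H(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def h2(p):
    return H([p, 1 - p])

# joint distribution p_{X,Y}(x, y); rows are values of X, columns values of Y
p_xy = [[0.30, 0.10, 0.05],
        [0.05, 0.25, 0.05],
        [0.05, 0.05, 0.10]]
nx, ny = len(p_xy), len(p_xy[0])
p_y = [sum(p_xy[x][y] for x in range(nx)) for y in range(ny)]

# conditional entropy H(X|Y) = H(X,Y) - H(Y)
H_x_given_y = H(p for row in p_xy for p in row) - H(p_y)

# MAP estimator: for each y, guess the x maximizing p_{X,Y}(x, y); its error probability
p_error = 1.0 - sum(max(p_xy[x][y] for x in range(nx)) for y in range(ny))

# Fano: h2(Pe) + Pe * log2|X| >= H(X|Y)
print(h2(p_error) + p_error * math.log2(nx), ">=", H_x_given_y)
```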
2.5 Asymptotic Equipartition and Typicality

2.5.1 AEP
Proposition 2.5.1. Consider (Xi)i=1,...,n i.i.d. with common distribution pX. Then

$$\frac{1}{n}\sum_{i=1}^{n} \log_2 \frac{1}{p_X(X_i)} \xrightarrow[n\to\infty]{} H(X) \text{ in probability.}$$

Consider (Xi, Yi)i=1,...,n i.i.d. with common joint distribution pX,Y. Then

$$\frac{1}{n}\sum_{i=1}^{n} \log_2 \frac{1}{p_{X,Y}(X_i,Y_i)} \xrightarrow[n\to\infty]{} H(X,Y) \text{ in probability,}$$

$$\frac{1}{n}\sum_{i=1}^{n} \log_2 \frac{1}{p_{X|Y}(X_i|Y_i)} \xrightarrow[n\to\infty]{} H(X|Y) \text{ in probability,}$$

and

$$\frac{1}{n}\sum_{i=1}^{n} \log_2 \frac{p_{X,Y}(X_i,Y_i)}{p_X(X_i)\, p_Y(Y_i)} \xrightarrow[n\to\infty]{} I(X;Y) \text{ in probability.}$$
Proof: All statements hold true from the weak law of large numbers.
The Asymptotic Equipartition Property (AEP), which in itself is a straightforward consequence of the law of large numbers, roughly states that for large i.i.d. samples, the "empirical information measures" behave like the actual information measures. While this is not very useful in itself, a consequence is that i.i.d. samples concentrate on high-probability sets called "typical sets".
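This concentration is easy to observe in simulation; the sketch below (ours) draws i.i.d. samples from a small distribution and shows the empirical quantity (1/n) Σ log2(1/p(Xi)) approaching H(X) = 1.5 bits.

```python
import math
import random

p = {"a": 0.5, "b": 0.25, "c": 0.25}
H = sum(px * math.log2(1.0 / px) for px in p.values())   # H(X) = 1.5 bits

random.seed(0)
symbols = list(p)
weights = [p[s] for s in symbols]

for n in (10, 100, 10_000):
    sample = random.choices(symbols, weights=weights, k=n)
    empirical = sum(math.log2(1.0 / p[x]) for x in sample) / n
    print(n, empirical, "vs H(X) =", H)   # converges to H(X) as n grows
```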
2.5.2 Typicality
Proposition 2.5.2. Consider X1, ..., Xn i.i.d. with common distribution X ∼ p(x). Given ϵ > 0 define the typical set:

$$A_\epsilon^n = \left\{ (x_1,\ldots,x_n) \in \mathcal{X}^n : \left| \frac{1}{n}\sum_{i=1}^{n} \log_2 \frac{1}{p(x_i)} - H(X) \right| \le \epsilon \right\}.$$

Then:
(i) $|A_\epsilon^n| \le 2^{n(H(X)+\epsilon)}$ for all n
(ii) $|A_\epsilon^n| \ge (1-\epsilon)\, 2^{n(H(X)-\epsilon)}$ for n large enough
(iii) $P((X_1,\ldots,X_n) \in A_\epsilon^n) \ge 1-\epsilon$ for n large enough
Proof: By definition, (x1, ..., xn) ∈ A_ϵ^n if and only if

$$2^{-n(H(X)+\epsilon)} \le p(x_1)\cdots p(x_n) \le 2^{-n(H(X)-\epsilon)}.$$

Computing the probability of the typical set:

$$P((X_1,\ldots,X_n) \in A_\epsilon^n) = \sum_{(x_1,\ldots,x_n) \in A_\epsilon^n} p(x_1)\cdots p(x_n),$$

which we bound as

$$|A_\epsilon^n|\, 2^{-n(H(X)+\epsilon)} \le P((X_1,\ldots,X_n) \in A_\epsilon^n) \le |A_\epsilon^n|\, 2^{-n(H(X)-\epsilon)}.$$

From asymptotic equipartition the typical set is a high-probability set, and for n large enough

$$1 - \epsilon \le P((X_1,\ldots,X_n) \in A_\epsilon^n) \le 1.$$

The size of the typical set is therefore bounded as

$$|A_\epsilon^n| \le 2^{n(H(X)+\epsilon)} P((X_1,\ldots,X_n) \in A_\epsilon^n) \le 2^{n(H(X)+\epsilon)},$$
$$|A_\epsilon^n| \ge 2^{n(H(X)-\epsilon)} P((X_1,\ldots,X_n) \in A_\epsilon^n) \ge (1-\epsilon)\, 2^{n(H(X)-\epsilon)}.$$

This concludes the proof.
In essence, if one draws an i.i.d. sample X1, ..., Xn, then with high probability it will fall in the so-called "typical set", and this typical set has a size roughly equal to 2^{nH(X)}. This is also fundamental for data compression: imagine that we would like to represent X1, ..., Xn as a sequence of m binary symbols. If we have a small tolerance for error, then if X1, ..., Xn is typical we could represent it by its index in the typical set using m ≈ nH(X) binary symbols, and if X1, ..., Xn is non-typical simply ignore it. This gives a new interpretation of entropy as the number of binary symbols necessary to represent data. We will expand on this in the later chapters.
2.5.3 Joint Typicality
Proposition 2.5.3. Consider (X^n, Y^n) = (Xi, Yi)i=1,...,n i.i.d. with distribution p(x, y) and (X̃^n, Ỹ^n) = (X̃i, Ỹi)i=1,...,n i.i.d. with distribution p(x)p(y). Given ϵ > 0 define the jointly typical set:

$$A_\epsilon^n = \left\{ (x^n, y^n) \in \mathcal{X}^n \times \mathcal{Y}^n : \left| \frac{1}{n}\sum_{i=1}^{n} \log_2 \frac{1}{p(x_i)} - H(X) \right| + \left| \frac{1}{n}\sum_{i=1}^{n} \log_2 \frac{1}{p(y_i)} - H(Y) \right| + \left| \frac{1}{n}\sum_{i=1}^{n} \log_2 \frac{1}{p(x_i,y_i)} - H(X,Y) \right| \le \epsilon \right\}.$$

Then:
(i) $|A_\epsilon^n| \le 2^{n(H(X,Y)+\epsilon)}$ for all n;
(ii) $P((X^n, Y^n) \in A_\epsilon^n) \to 1$ as $n \to \infty$;
(iii) $(1-\epsilon)\, 2^{-n(I(X;Y)+\epsilon)} \le P((\tilde{X}^n, \tilde{Y}^n) \in A_\epsilon^n) \le 2^{-n(I(X;Y)-\epsilon)}$ for n large enough.
Proof: We have:

$$A_\epsilon^n \subset \left\{ (x^n, y^n) \in \mathcal{X}^n \times \mathcal{Y}^n : \left| \frac{1}{n}\sum_{i=1}^{n} \log_2 \frac{1}{p(x_i,y_i)} - H(X,Y) \right| \le \epsilon \right\},$$

and we know that this set has size at most $2^{n(H(X,Y)+\epsilon)}$.
From the law of large numbers:

$$P\left( \left| \frac{1}{n}\sum_{i=1}^{n} \log_2 \frac{1}{p(X_i)} - H(X) \right| \ge \frac{\epsilon}{3} \right) \xrightarrow[n\to\infty]{} 0,$$
$$P\left( \left| \frac{1}{n}\sum_{i=1}^{n} \log_2 \frac{1}{p(Y_i)} - H(Y) \right| \ge \frac{\epsilon}{3} \right) \xrightarrow[n\to\infty]{} 0,$$
$$P\left( \left| \frac{1}{n}\sum_{i=1}^{n} \log_2 \frac{1}{p(X_i,Y_i)} - H(X,Y) \right| \ge \frac{\epsilon}{3} \right) \xrightarrow[n\to\infty]{} 0.$$

Therefore:

$$P((X^n, Y^n) \in A_\epsilon^n) \xrightarrow[n\to\infty]{} 1.$$

Since (X̃^n, Ỹ^n) is i.i.d. with distribution p(x)p(y):

$$P((\tilde{X}^n, \tilde{Y}^n) \in A_\epsilon^n) = \sum_{(x^n,y^n)\in A_\epsilon^n} p(x^n)p(y^n) = \sum_{(x^n,y^n)\in A_\epsilon^n} \frac{p(x^n)p(y^n)}{p(x^n,y^n)} \, p(x^n,y^n).$$

If $(x^n, y^n) \in A_\epsilon^n$:

$$2^{-n(I(X;Y)+\epsilon)} \le \frac{p(x^n)p(y^n)}{p(x^n,y^n)} \le 2^{-n(I(X;Y)-\epsilon)}.$$

Therefore:

$$2^{-n(I(X;Y)+\epsilon)} \le \frac{P((\tilde{X}^n, \tilde{Y}^n) \in A_\epsilon^n)}{P((X^n, Y^n) \in A_\epsilon^n)} \le 2^{-n(I(X;Y)-\epsilon)},$$

and the result is proven as $P((X^n, Y^n) \in A_\epsilon^n) \to 1$ when $n \to \infty$.
Joint typicality is similar to typicality, and we will expand on its implications
when considering communication over noisy channels.
Chapter 3
Data Representation: Fundamental Limits
In this chapter we start our exposition of how to represent data efficiently using
information theoretic tools. We introduce prefix codes and show that the entropy
of the source quantifies the length of the best prefix codes, and how such codes can
be constructed.
3.1 Source Coding
We consider the problem of source coding, in which we would like to represent a
sequence of symbols X1 , .., Xn from some finite set X as a sequence of bits, with
the goal of doing so as efficiently as possible.
3.1.1 Definition
Definition 3.1.1. Consider X ∈ X and D the set of finite strings on {0, 1}. A
source code is a mapping C : X → D.
A source code takes as input a symbol X and maps it into a finite sequence of
bits.
3.1.2 Expected Length
Definition 3.1.2. Let X ∈ X with distribution p(x). The expected length of code C is:

$$L(C) = \mathbb{E}[\ell(X)] = \sum_{x \in \mathcal{X}} p(x)\ell(x),$$

with ℓ(x) the length of the codeword C(x).
One of the main measures of efficiency of a source code is its expected length,
which is the expected number of bits required to represent a symbol, if this symbol
were drawn according to the source distribution.
3.1.3 Non-Singular Codes
Definition 3.1.3. A code C is non-singular if and only if C(x) = C(x′) =⇒ x = x′ for all x, x′ ∈ X, namely if X can be perfectly retrieved from C(X).
A code is non-singular if the original symbol can be retrieved from its associated codeword using some sort of decoding procedure, which is possible if and only if no pair of symbols gets assigned the same codeword. Therefore, non-singular codes perform lossless compression, which is the focus of this chapter. There also exist lossy compression techniques, considered in future chapters, where the amount of information lost (also called "distortion") is controlled in some fashion.
3.1.4 Uniquely Decodable Codes
Definition 3.1.4. The extension of a code C is the mapping from finite strings of X
to finite strings of D defined as:
C(x1 . . . xn ) = C(x1 ) . . . C(xn )
The extension of a code is what we obtain when encoding the sequence of
symbols X1 , ..., Xn as the concatenation of the codewords associated to each
symbol C(X1 ), ..., C(Xn ).
Definition 3.1.5. A code C is uniquely decodable if its extension is non-singular.
A critical point is that extension can create ambiguity, even if the code is non-singular. Indeed, if one only observes the concatenated codewords C(X1), ..., C(Xn), it might be difficult to know where one codeword ends and where the next one begins. A simple example would be X = {a, b, c} and a code C(a) = 0, C(b) = 1 and C(c) = 01. We have C(a)C(b) = C(c), so it is impossible to differentiate between ab and c.
A uniquely decodable code is such that extension does not create ambiguity,
and enables to encode streams of symbols by encoding each symbol separately,
without losing any information.
3.2 Prefix Codes

3.2.1 Definition
Definition 3.2.1. A code C is a prefix code if C(x) is not a prefix of C(x′ ) unless
x = x′ for all (x, x′ ) ∈ X 2 .
An important class of uniquely decodable codes are prefix codes, where no codeword can be the prefix of another codeword. Those codes are also called self-punctuating, or instantaneous, because the decoding can be done without looking ahead in the stream of coded bits.
Definition 3.2.2. Prefix codes are uniquely decodable.
Proof: Consider the following decoding algorithm: let C(X1), ..., C(Xn) be a sequence of bits u1...um and let ℓ be the smallest integer such that u1...uℓ = C(x) for some x. Then we must have x = X1, since otherwise C(x) would be a prefix of the codeword C(X1). This yields X1; repeating the procedure on the remaining bits yields X2, ..., Xn.
It is understood that prefix codes are uniquely decodable, and uniquely decodable codes are non-singular, but there exist uniquely decodable codes that are not prefix codes, and there exist non-singular codes that are not uniquely decodable.
3.2.2 Prefix Codes as Trees
We first introduce a few notions related to binary trees, which are important in
order to understand properties of prefix codes.
Definition 3.2.3. Given a binary tree G = (V, E), we call the "label" of a leaf v the binary sequence encoding the unique path from the root to v, where 0 stands for "down and left" and 1 for "down and right".
Property 7. Consider a binary tree; then the labels of its leaves form a prefix code. Conversely, for any prefix code, there exists a binary tree whose leaf labels are the codewords of that code.
Proof: Consider v and v′ two leaves of G such that the label of v is a prefix of the label of v′; then v′ is a descendant of v, so v is not a leaf, a contradiction. So the leaf labels form a prefix code.
Conversely, consider a prefix code and the following procedure to build the associated binary tree. Start with G a complete binary tree. If the code is not empty, then select one of its codewords C(x), find v the node whose label is C(x), remove all of the descendants of v from G, and remove C(x) from the code. Repeat the procedure until the code is empty.
Therefore, there is an identity between binary trees and prefix codes: for every
prefix code we can construct a binary tree representation of this code, and every
binary tree represents a prefix code. This is fundamental in order to derive lower
bounds on the code length and design codes which attain these bounds.
3.2.3 Kraft Inequality
Proposition 3.2.4. For any prefix code we have:

$$\sum_{x \in \mathcal{X}} 2^{-\ell(x)} \le 1.$$

Also, given any (ℓ(x))x∈X satisfying this inequality, one can construct a prefix code with codeword lengths (ℓ(x))x∈X.
Proof: Let lm = maxx∈X ℓ(x) be the largest codeword length. Let Z(x) ⊂ {0, 1}^{lm} be the set of words of length lm that have C(x) as a prefix. Then |Z(x)| = 2^{lm − ℓ(x)}. Furthermore Z(x) ∩ Z(x′) = ∅ for x ≠ x′, as C is a prefix code. Summing over x proves the result:

$$2^{l_m} = |\{0,1\}^{l_m}| \ge \sum_{x \in \mathcal{X}} |Z(x)| = \sum_{x \in \mathcal{X}} 2^{l_m - \ell(x)}.$$

Conversely, assume that we are given codeword lengths (ℓ(x))x∈X satisfying the Kraft inequality; then one can construct a prefix code with those codeword lengths. Indeed, if the ℓ(x) are sorted in increasing order, we can let C(x) be the first ℓ(x) digits of the binary representation of $\sum_{x' < x} 2^{-\ell(x')}$.
Kraft's inequality is a fundamental limit: it states a constraint on the codeword lengths that must be satisfied by any prefix code.
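The converse construction used in the proof can be implemented in a few lines; the sketch below (ours) checks the Kraft inequality for a list of lengths and, when it holds, builds the codewords by taking the first ℓ(x) binary digits of the running sums.

```python
def kraft_sum(lengths):
    return sum(2.0 ** (-l) for l in lengths)

def prefix_code_from_lengths(lengths):
    """Build a binary prefix code from lengths satisfying the Kraft inequality:
    codeword i is the first lengths[i] binary digits of sum_{j<i} 2^{-lengths[j]}."""
    assert kraft_sum(lengths) <= 1.0 + 1e-12
    lengths = sorted(lengths)
    code, cumulative = [], 0.0
    for l in lengths:
        # write `cumulative` (a number in [0, 1)) in binary and keep its first l digits
        word, frac = "", cumulative
        for _ in range(l):
            frac *= 2
            bit, frac = divmod(frac, 1.0)
            word += str(int(bit))
        code.append(word)
        cumulative += 2.0 ** (-l)
    return code

print(kraft_sum([1, 2, 3, 3]))                 # 1.0: Kraft holds with equality
print(prefix_code_from_lengths([1, 2, 3, 3]))  # ['0', '10', '110', '111']
```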
3.3 Optimal Codes and Entropy

3.3.1 Lower Bound on the Expected Code Length
Proposition 3.3.1. For any prefix code we have:
L(C) ≥ H(X).
with equality if and only if 2−ℓ(x) = p(x) for all x ∈ X .
Proof: Consider the optimization problem (P1):

$$\text{Minimize } \sum_{x \in \mathcal{X}} p(x)\ell(x) \quad \text{s.t.} \quad \sum_{x \in \mathcal{X}} 2^{-\ell(x)} \le 1, \;\; \ell(x) \in \mathbb{N}, \; x \in \mathcal{X}.$$
Now consider its convex relaxation (P2):

$$\text{Minimize } \sum_{x \in \mathcal{X}} p(x)\ell(x) \quad \text{s.t.} \quad \sum_{x \in \mathcal{X}} 2^{-\ell(x)} \le 1, \;\; \ell(x) \in \mathbb{R}, \; x \in \mathcal{X}.$$

From Lagrangian relaxation the solution of (P2) must minimize:

$$J = \sum_{x \in \mathcal{X}} p(x)\ell(x) + \lambda \sum_{x \in \mathcal{X}} 2^{-\ell(x)}.$$

The first order conditions read:

$$\frac{\partial J}{\partial \ell(x)} = p(x) - \lambda (\log 2)\, 2^{-\ell(x)} = 0, \quad x \in \mathcal{X}.$$

The optimal solution is of the form:

$$2^{-\ell(x)} = \frac{p(x)}{\lambda (\log 2)}, \quad x \in \mathcal{X}.$$

We find the value of λ by saturating the constraint:

$$1 = \sum_{x \in \mathcal{X}} 2^{-\ell(x)} = \sum_{x \in \mathcal{X}} \frac{p(x)}{\lambda(\log 2)} = \frac{1}{\lambda(\log 2)}.$$

The optimal solution of (P2) is therefore

$$2^{-\ell(x)} = p(x), \quad x \in \mathcal{X}.$$

Its value lower bounds that of (P1), which concludes the proof:

$$\sum_{x \in \mathcal{X}} p(x)\ell(x) = \sum_{x \in \mathcal{X}} p(x) \log_2 \frac{1}{p(x)} = H(X).$$
A direct consequence of the Kraft inequality is that the source entropy is a lower bound on the expected length of any prefix code. Furthermore, in order to get close to the lower bound, one must make sure 2^{−ℓ(x)} ≈ p(x). This shows that efficient codes assign short/long codewords to frequent/infrequent symbols, in order to minimize the expected length.
Now, the lower bound is not always attainable: to attain the bound we require that for all x ∈ X: ℓ(x) = log2(1/p(x)), where ℓ(x) is an integer. For instance, if p = (1/2, 1/4, 1/8, 1/8), then we can select ℓ = (1, 2, 3, 3), but if p = (2/3, 1/6, 1/6) this is impossible, as log2(1/p(x)) is not an integer.
Two natural questions arise: how close to the entropy can the best prefix code perform, and how to derive the best prefix code in a computationally efficient manner?
3.3.2 Existence of Nearly Optimal Codes
Proposition 3.3.2. There exists a prefix code with codeword lengths ℓ(x) = ⌈log2(1/p(x))⌉, such that:

$$H(X) \le L(C) \le H(X) + 1.$$

Proof: Let ℓ(x) = ⌈log2(1/p(x))⌉, which satisfies the Kraft inequality:

$$\sum_{x \in \mathcal{X}} 2^{-\ell(x)} = \sum_{x \in \mathcal{X}} 2^{-\lceil \log_2 \frac{1}{p(x)} \rceil} \le \sum_{x \in \mathcal{X}} 2^{-\log_2 \frac{1}{p(x)}} = \sum_{x \in \mathcal{X}} p(x) = 1.$$

Recall that whenever ℓ(x), x ∈ X satisfy the Kraft inequality, then there exists a corresponding prefix code with lengths ℓ(x), x ∈ X. The expected length of this code is:

$$L(C) = \sum_{x \in \mathcal{X}} p(x)\ell(x) = \sum_{x \in \mathcal{X}} p(x) \left\lceil \log_2 \frac{1}{p(x)} \right\rceil \le \sum_{x \in \mathcal{X}} p(x) \left( \log_2 \frac{1}{p(x)} + 1 \right) = H(X) + 1,$$

which concludes the proof.
Therefore, it is always possible to construct a prefix code whose length is within
1 bit of the entropic lower bound. Now, this result is only useful if H(X) is much
greater than 1.
The key idea is then to use this scheme to encode not one individual symbol
(with entropy H(X)), but rather blocks of n independent symbols (with entropy
nH(X)) for large n.
3.3.3 Asymptotically Optimal Codes
Proposition 3.3.3. Let (X1, ..., Xn) be i.i.d. copies of X. For any prefix code C for (X1, ..., Xn):

$$H(X) \le \frac{L(C)}{n},$$

and there is a prefix code C for (X1, ..., Xn) such that:

$$\frac{L(C)}{n} \le H(X) + \frac{1}{n}.$$
Proof: From independence H(X1, ..., Xn) = nH(X); select C as the optimal prefix code for (X1, ..., Xn) and apply the previous results.
If one encodes blocks (X1, ..., Xn) of n independent symbols, we are interested in the rate L(C)/n, which is the average number of bits per source symbol required to represent the data. The rate of any prefix code must be at least the entropy H(X), and for large n there exists a prefix code whose rate is approximately equal to the entropy (within a factor of 1/n). Therefore this code is asymptotically optimal, and cannot be improved upon (in terms of rate). This result also justifies entropy not only as a measure of randomness but also as a measure of the average description length of a source symbol.
Chapter 4
Data Representation: Algorithms
In this chapter we introduce algorithms to perform lossless compression under various assumptions, and demonstrate their optimality by comparing their performance to the entropic bound derived in the last chapter.
4.1 The Huffman Algorithm

4.1.1 Algorithm
Algorithm 4.1.1 (Huffman Algorithm). Consider a known distribution p(x), x ∈ X. Start with G = (V, E, w) a weighted digraph with |X| nodes, no edges (E = ∅), and weights w(x) = p(x). Repeat the following procedure until G is a tree: find i and j the two nodes with no father and minimal weight, add a new node k to G with weight w(k) = w(i) + w(j), and add edges (k, i) and (k, j) to E.
The Huffman algorithm is a greedy algorithm which takes as an input the
probability of each symbol p(x), x ∈ X , and iteratively constructs a prefix code
with the goal of minimizing the expected code length.
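A compact implementation of this greedy procedure (ours, using a binary heap to repeatedly extract the two lowest-weight nodes) is sketched below; it is written for clarity rather than efficiency.

```python
import heapq

def huffman_code(p):
    """Build a binary prefix code for a distribution p: dict symbol -> probability.
    Returns a dict symbol -> codeword (string of '0'/'1')."""
    # each heap entry: (weight, tie_breaker, {symbol: partial codeword})
    heap = [(w, i, {x: ""}) for i, (x, w) in enumerate(p.items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {x: "0" for x in p}
    counter = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)   # two least likely subtrees
        w2, _, c2 = heapq.heappop(heap)
        merged = {x: "0" + code for x, code in c1.items()}
        merged.update({x: "1" + code for x, code in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

p = {"A": 1/2, "B": 1/5, "C": 1/10, "D": 1/10, "E": 1/10}
code = huffman_code(p)
print(code)
print(sum(p[x] * len(code[x]) for x in p))   # expected length L(C), about 2.0 bits here
```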
4.1.2 Rationale
The Huffman algorithm is based on the idea that a good prefix code should verify
three properties:
• (i) If p(x) ≥ p(y) then ℓ(y) ≥ ℓ(x)
• (ii) The two longest codewords should have the same length
• (iii) The two longest codewords differ by only 1 bit and correspond to the
two least likely symbols
In fact, these facts will serve to show the optimality of the Huffman algorithm.
4.1.3 Complexity
At each step of the algorithm, one must find the two nodes with the smallest weight. There are |X| steps, and finding the two nodes with smallest weight requires sorting the list of nodes by weight at each step, which takes O(|X| ln |X|). Hence a naive implementation of the algorithm requires time O(|X|² ln |X|). A smarter implementation is to keep the list of nodes sorted at each step, so that finding the two nodes with smallest weight can be done in time O(1), and then insert the new node in the sorted list using binary search in time O(ln |X|). Hence the Huffman algorithm can be implemented in time O(|X| ln |X|), almost linear in the number of symbols.
4.1.4 Limitations
While optimal, for sources with millions of symbols the Huffman algorithm is too complex to implement, and there exist other techniques, such as arithmetic coding (used in JPEG). Also, the Huffman algorithm requires knowing the source distribution p(x) for x ∈ X at the encoder, which is a practical limitation; to solve this problem there exist universal codes, which operate without prior knowledge of p. We will show some simple strategies to design universal codes.
4.1.5 Illustration
[Figure: the Huffman tree obtained for the source below, with each branch labeled 0 or 1.]

x      A     B     C     D     E
p(x)   1/2   1/5   1/10  1/10  1/10
C(x)   0     10    110   1110  1111
ℓ(x)   1     2     3     4     4
Above is the result of the Huffman algorithm applied to a given source. One can readily verify that the more probable the symbol, the shorter the codeword, and that the two least probable symbols D and E have been assigned to the two leaves with highest depth.
The expected length of the code is minimal amongst all prefix codes and equals:

$$\frac{1}{2} \times 1 + \frac{1}{5} \times 2 + \frac{1}{10} \times (3 + 4 + 4) = 2,$$

which is only slightly larger than the source entropy:

$$\frac{1}{2} \log_2(2) + \frac{1}{5} \log_2(5) + 3 \times \frac{1}{10} \log_2(10) \approx 1.96.$$
4.1.6 Optimality
Proposition 4.1.2. The Huffman algorithm outputs a prefix code with minimal
expected length L(C) amongst all prefix codes.
Proof: Assume that the source symbols are sorted so that p(1) ≤ ... ≤ p(|X|). Consider a code C with minimal expected length, and x, y two symbols such that p(x) < p(y) and ℓ(x) < ℓ(y). Then construct a new code C′ such that C′(x) = C(y), C′(y) = C(x) and C′(z) = C(z) for z ≠ x, y. Then clearly L(C′) < L(C), hence C cannot be optimal, a contradiction. This shows that less probable symbols must receive codewords at least as long as those of more probable symbols. Furthermore, since the two least probable symbols have maximal depth, we can always assume that they are siblings (otherwise simply perform an exchange between symbol 2 and the sibling of symbol 1).
Consider C the prefix code with minimal expected length, and H the prefix code output by the Huffman algorithm. Further define C′ and H′ the codes obtained by considering C and H and replacing nodes 1 and 2 by their father with weight p(1) + p(2). Then we have:

L(C′) = L(C) − (p(1) + p(2))  and  L(H′) = L(H) − (p(1) + p(2)).

We also realize that H′ is exactly the output of the Huffman algorithm applied to a source with |X| − 1 symbols.
We can then prove the result by recursion. Clearly for |X| = 1 symbol the Huffman algorithm is optimal. Furthermore, if the Huffman algorithm is optimal for |X| − 1 symbols, this implies that L(C′) = L(H′), so that L(C) = L(H), hence the Huffman algorithm is optimal for |X| symbols.
4.2 Markov Coding

4.2.1 Markov Sources
Definition 4.2.1. A source is a Markov source with stationary distribution π(x)
and transition matrix P (x|x′ ) if:
(i) X1 , ..., Xn all have distribution π(x)
(ii) For any i we have
P(Xi = xi |X1 = x1 , ..., Xi−1 = xi−1 ) = P (xi |xi−1 ).
So far we have mostly considered memoryless sources, in which the symbols produced by the source X1, ..., Xn are i.i.d. random variables with some fixed distribution. We now consider the much more general case of Markov sources, where the symbols produced by the source X1, ..., Xn are correlated. The Markovian assumption roughly means that the distribution of the current symbol Xn only depends on the value of the previous symbol Xn−1. One can generate the symbols sequentially, by first drawing X0 according to the stationary distribution π, and then, once Xn−1 is known, drawing Xn with distribution P(·|Xn−1). The matrix P is called the transition matrix, since P(xi|xi−1) is the probability of transitioning from xi−1 to xi in one time step. In a sense, a Markov process is a stochastic process with order one memory. It is also noted that, for π to actually be a stationary distribution, it has to verify the balance condition π = πP, since if Xn−1 has distribution π, then Xn has distribution πP and the two must be equal.
The simplest model of a Markov source is called the Gilbert-Elliott model, which has two equiprobable states and a given probability α of going from one to the other in one step:

$$\pi = \left(\tfrac{1}{2}, \tfrac{1}{2}\right) \quad \text{and} \quad P = \begin{pmatrix} 1-\alpha & \alpha \\ \alpha & 1-\alpha \end{pmatrix}$$

[Figure: two-state (ON/OFF) diagram of the Gilbert-Elliott model, with transition probability α between the states.]
To generate the Gilbert-Elliott model, first draw X0 ∈ {0, 1} uniformly at random, and then for each n draw Xn = Xn−1 + Un modulo 2, where U1, ..., Un are i.i.d. Bernoulli with expectation α. In short, flip the value of the process at each step with probability α.
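A short simulation (ours) of the Gilbert-Elliott source: flip the state with probability α at each step, check that the empirical flip frequency is close to α, and compare with h2(α), which is the value of the quantity R(π, P) defined in the next subsection for this source.

```python
import math
import random

def gilbert_elliott(n, alpha, seed=0):
    """Generate n symbols of the two-state Gilbert-Elliott source: at each step
    the state is flipped with probability alpha, otherwise kept."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1)]                  # X_0 uniform on {0, 1}
    for _ in range(n - 1):
        flip = 1 if rng.random() < alpha else 0
        x.append((x[-1] + flip) % 2)
    return x

alpha = 0.1
x = gilbert_elliott(100_000, alpha)
flips = sum(x[i] != x[i - 1] for i in range(1, len(x))) / (len(x) - 1)
h2 = alpha * math.log2(1 / alpha) + (1 - alpha) * math.log2(1 / (1 - alpha))
print(flips)   # close to alpha = 0.1
print(h2)      # entropy rate R(pi, P) = h2(alpha), about 0.469 bits per symbol
```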
4.2.2 The Entropy of English
One of the initial motivations for studying Markov sources in the context of information theory was to model English text. Namely, consider English text as a sequence of letters Z1, ..., Zn, and for k ≥ 0 let Xn = (Zn−1, ..., Zn−k) be the k letters that precede the n-th letter. Then, when k is large enough, Xn can be considered a Markov chain, meaning that the distribution of the n-th letter solely depends on the k letters that precede it. The transition probabilities encode all of the structure of the English language: grammar rules, dictionary, frequency of words and so on. This means that, if we wanted to generate English text automatically, one could simply gather a very large corpus of text, and estimate the transition probabilities by figuring out, for any letter x, the probability that x is the n-th letter of an English sentence knowing that the k previous letters are Zn−1, ..., Zn−k. Doing this for a large enough k will create computer-generated sentences which look very close to English sentences produced by a human.
This also means that one could estimate the entropy of English using the following experiment imagined by Shannon: one person thinks of some English sentence, and another person attempts to guess the sentence letter-by-letter, without prior information, by asking binary questions, e.g. "Is the next letter an 'a'?" or "Is the next letter a vowel?". Then the ratio between the number of questions and the number of letters in the phrase is a good estimate of the number of bits per symbol in English text. The entropy of English estimated through this experiment is usually about 1 bit per letter, much smaller than log2(26) bits per letter, which is the entropy of an i.i.d. uniform sequence of letters.
4.2.3 Efficient Codes for Markov Sources
Proposition 4.2.2. Let (X1, ..., Xn) be a Markov source and define:

$$R(\pi, P) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{X}} \pi(x) P(y|x) \log_2 \frac{1}{P(y|x)}.$$

Then for any prefix code C for (X1, ..., Xn):

$$\left(1 - \frac{1}{n}\right) R(\pi, P) + \frac{H(X_1)}{n} \le \frac{L(C)}{n},$$

and there is a prefix code C for (X1, ..., Xn) such that:

$$\frac{L(C)}{n} \le \left(1 - \frac{1}{n}\right) R(\pi, P) + \frac{H(X_1) + 1}{n}.$$
Proof: Using the chain rule and the Markov property:

$$H(X_1,\ldots,X_n) = \sum_{i=1}^{n} H(X_i|X_{i-1},\ldots,X_1) = H(X_1) + \sum_{i=2}^{n} H(X_i|X_{i-1}).$$

Furthermore, for i ≥ 2:

$$H(X_i|X_{i-1}) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{X}} P(X_{i-1}=x, X_i=y) \log_2 \frac{1}{P(X_i=y|X_{i-1}=x)} = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{X}} \pi(x) P(y|x) \log_2 \frac{1}{P(y|x)} = R(\pi, P).$$

Therefore:

$$H(X_1,\ldots,X_n) = (n-1) R(\pi, P) + H(X_1).$$

The lower bound holds as before, and applying Huffman coding to (X1, ..., Xn) yields a code with:

$$(n-1) R(\pi, P) + H(X_1) \le L(C) \le (n-1) R(\pi, P) + H(X_1) + 1.$$
We have therefore established that the rate of optimal codes for Markov sources
is exactly R(π, P ) bits per symbol. Furthermore, optimal codes can be found
using the same algorithms as in the memoryless case. One would first determine
the transition probabilities for the Markov source at hand, which would then give
us the probability of any sequence (X1 , ..., Xn ) and finally we may apply the
Huffman algorithm. One can apply this (for instance) in order to encode English
text optimally, since English can be seen as a Markov source.
Now, one caveat of our approach is that it requires knowing the probability distribution of any sequence that can be generated by the source. In the case of memoryless sources this means knowing the distribution of a symbol, and in the case of Markov sources this means knowing both the stationary distribution and the transition probabilities. This can often be a limitation in practice, and to solve this problem we study the concept of universal codes.
4.3 Universal Coding

4.3.1 Universality
Definition 4.3.1. Consider X1, ..., Xn i.i.d. copies of X ∈ X with distribution p, and a coding scheme C : X^n → D that does not depend on p. This coding scheme is universal if for all p:

$$\lim_{n\to\infty} \frac{1}{n} \mathbb{E}[\ell(C(X_1,\ldots,X_n))] = H(X).$$
The idea of a universal code is that the code should have no prior knowledge of the data distribution, and that the code should work well irrespective of the data distribution. This is important in practical scenarios in which nothing is known about the data distribution. In fact, when the data distribution is known, we know that the smallest attainable rate is the entropy H(X), and if a code is universal, then it attains this rate for all distributions.
4.3.2 A Simple Universal Code for Binary Sequences
Algorithm 4.3.2 (Simple Adaptive Binary Code). Consider (x1, ..., xn) ∈ {0, 1}^n and let $k = \sum_{i=1}^{n} x_i$. Output the codeword C(x1, ..., xn) which is the concatenation of (i) the binary representation of k and (ii) the binary representation of the index of (x1, ..., xn) in

$$A_k = \left\{ (x_1,\ldots,x_n) \in \{0,1\}^n : \sum_{i=1}^{n} x_i = k \right\}.$$
The main idea behind this code is that the difficulty of encoding a sequence x1, ..., xn depends on $k = \sum_{i=1}^{n} x_i$, which is the number of 1's in the sequence. Indeed, the number of possible values that x1, ..., xn can have knowing k is precisely $\binom{n}{k}$. This means that one could first encode the value of k (which requires at most log2 n bits) and subsequently encode the index of x1, ..., xn amongst the sequences which have the same value of k (which requires at most $\log_2 \binom{n}{k}$ bits). This coding scheme assigns short codewords to sequences with k ≈ 0 and k ≈ n, and longer codewords to sequences with k ≈ n/2. The goal of encoding k along with the sequence is that the decoder will get to know k as well.
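A quick numerical check (ours) of this reasoning: for a Bernoulli(p) sequence, the per-symbol length (log2 n + log2 C(n, k))/n of the simple adaptive binary code approaches h2(p).

```python
import math
import random

def log2_comb(n, k):
    """log2 of the binomial coefficient C(n, k), computed via log-gamma."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)) / math.log(2)

def code_rate(bits):
    """Per-symbol length of the simple adaptive binary code: encode k, then the
    index of the sequence among the C(n, k) sequences with that value of k."""
    n, k = len(bits), sum(bits)
    return (math.log2(n) + log2_comb(n, k)) / n

p = 0.2
h2 = p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))   # about 0.722 bits

random.seed(0)
for n in (100, 1_000, 100_000):
    bits = [1 if random.random() < p else 0 for _ in range(n)]
    print(n, code_rate(bits), "vs h2(p) =", h2)
```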
Proposition 4.3.3. The simple adaptive binary code is universal.
Proof: For a given value of k, since Ak has $\binom{n}{k}$ elements, the length of the corresponding codeword is

$$\ell(C(x_1,\ldots,x_n)) = \log_2(n) + \log_2 \binom{n}{k}.$$

Using Stirling's approximation $\log_2 n! = n \log_2(n/e) + O(\log_2 n)$ we have

$$\log_2 \binom{n}{k} = n \log_2(n/e) - k \log_2(k/e) - (n-k) \log_2((n-k)/e) + O(\log_2 n),$$

so that

$$\frac{1}{n} \ell(C(x_1,\ldots,x_n)) = h_2(k/n) + o(1).$$

Consider X ∼ Bernoulli(p):

$$\frac{k}{n} = \frac{1}{n} \sum_{i=1}^{n} X_i \xrightarrow[n\to\infty]{} p \text{ almost surely,}$$

and since $\frac{1}{n} \ell(C(X_1,\ldots,X_n))$ is bounded, dominated convergence yields:

$$\frac{1}{n} \mathbb{E}[\ell(C(X_1,\ldots,X_n))] \xrightarrow[n\to\infty]{} h_2(p) = H(X),$$

proving the result.
The fact that this simple code is universal not only shows that such codes do exist, but also points to a more general idea for constructing them: one can attempt to estimate the underlying distribution and encode that estimate along with the message. Indeed, if X1, ..., Xn are i.i.d. Bernoulli with parameter p, then $k/n = (1/n)\sum_{i=1}^{n} X_i$ is a consistent estimator of p, and the knowledge of k is equivalent to knowing this estimator. In a certain way, universal codes perform both encoding and estimation at the same time (although the estimation might be implicit).
Algorithm 4.3.4 (Simple Adaptive Code). Consider (x1, ..., xn) ∈ X^n and let $k_x = \sum_{i=1}^{n} 1\{x_i = x\}$. Output the codeword C(x1, ..., xn) which is the concatenation of (i) the binary representation of kx for all x ∈ X and (ii) the binary representation of the index of (x1, ..., xn) in

$$A_k = \left\{ (x_1,\ldots,x_n) \in \mathcal{X}^n : \sum_{i=1}^{n} 1\{x_i = x\} = k_x \text{ for all } x \in \mathcal{X} \right\}.$$
The simple code can be extended to non-binary sequences by encoding the empirical distribution of the data (also known as the type of the sequence, see further chapters). It is noted that the empirical distribution $(k_x)_{x \in \mathcal{X}}$, with $k_x = \sum_{i=1}^{n} 1\{x_i = x\}$, can be encoded in at most |X| log2 n bits, since for each x, kx ∈ {0, ..., n}.
4.3.3 Lempel-Ziv Coding
Algorithm 4.3.5 (Lempel Ziv Coding). Consider a string x1 , ..., xn ∈ X n and a
window W ≥ 1. Start at position i = 1. Then, until i ≥ n repeat the following:
First find the largest k such that (xj , ..., xj+k ) = (xi , ..., xi+k ) for some j ∈
{i − 1 − W, ..., i − 1}. Second, if such a k exists, encode xi , ..., xi+k as the binary
representation of (1, i − j, k) and skip to position i + k + 1; and if such a k does
not exist, encode xi as (0, xi ) and skip to position i + 1.
The most famous universal codes are the Lempel-Ziv algorithms, and we present here the variant that uses a sliding window; there exist other versions, such as the one based on trees. The algorithm encodes the sequence by first parsing it into a set of words, and then encoding each word based on the previous words. The central idea of why this coding scheme works comes from the fact that if a word (x1, ..., xk) of size k has a relatively high probability, then it is likely to appear in a window of size W if W is large enough. In turn this word can be represented with 1 + log2 W + log2 k bits instead of k bits. In short, words that are frequent tend to appear repeatedly, and therefore can be encoded by providing a pointer to one of their past occurrences, which drastically reduces the number of bits required.
Example 2. Consider the following string of 30 bits
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0
After parsing with a window size of W = 4 we get 8 phrases:
0; 0, 0; 1; 0, 0, 0; 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0; 1; 0, 0, 0; 0, 0
Those phrases will then be represented as:
(0, 0) ; (1, 1, 2); (0, 1) ; (1, 4, 3) ; (1, 1, 17); (0, 1); (1, 4, 3); (1, 1, 2)
The above example illustrates how the algorithm operates on a binary sequence. The sliding window enables us to encode long runs of consecutive 0's with relatively few bits. Indeed, we manage to encode a run of 17 consecutive 0's by the word $(1, 1, 17)$, which can be represented using roughly $1 + \log_2(4) + \log_2(17) \approx 7$ bits: a net gain of $17 - 7 = 10$ bits.
Proposition 4.3.6. Lempel-Ziv Coding is universal.
Lempel-Ziv coding has the advantage of being very easy to implement and of requiring no knowledge about the data distribution, while also being universal. We do not present the proof here, due to its complexity.
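To make the parsing step concrete, here is a minimal Python sketch of the sliding-window encoder of Algorithm 4.3.5; it emits the pairs and triples as Python tuples rather than packed bits, and the tie-breaking rule (prefer the smallest offset) is an illustrative choice. Running it on the 30-bit string of Example 2 with $W = 4$ reproduces the eight phrases listed above.

```python
def lz_window_encode(x, W):
    """Sliding-window Lempel-Ziv parsing of the sequence x.

    Emits (0, symbol) when no match is found in the last W positions,
    and (1, offset, length) when the longest match starts offset
    positions back and has the given length."""
    out, i, n = [], 0, len(x)
    while i < n:
        best_len, best_off = 0, 0
        for j in range(i - 1, max(0, i - W) - 1, -1):   # candidate starts in the window
            k = 0
            while i + k < n and x[j + k] == x[i + k]:    # the copy may overlap position i
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        if best_len > 0:
            out.append((1, best_off, best_len))
            i += best_len
        else:
            out.append((0, x[i]))
            i += 1
    return out

x = [0, 0, 0, 1] + [0] * 20 + [1] + [0] * 5
print(lz_window_encode(x, W=4))
```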
Chapter 5
Data Representation: Rate-Distortion Theory
In this chapter we consider the problem of lossy compression and quantization, which is central when dealing with signals from the physical world such as sounds, images and videos. We introduce the notion of distortion, which measures how much information is lost after encoding, and design optimal rate-distortion codes which minimize the rate given a constraint on the distortion.
5.1 Lossy Compression, Quantization and Distortion
Most physical systems produce continuous-valued data such as sound, electromagnetic fields, or currents. On the other hand, information processing systems work
with finite-valued data. For instance, in order to store images, sounds and movies,
one must somehow represent them as sequences of bits. The transformation from
continuous to discrete data is called quantization, and is fundamental for any
information system handling data from the physical world.
[Figure: a continuous-valued signal (Continuous Data) and its quantized version (Quantized Data) plotted against time.]
In fact, in some cases, even if data is already discrete, one may want to represent it using fewer bits, even at the expense of losing some information. For instance, we might want to reduce the size (in bits) of an image or a sound file as long as, after compression, one can reconstruct it and the reconstructed image or sound looks or sounds similar to a human observer. This means that most of the information has been preserved. We call this process lossy compression. Since quantization and lossy compression can be understood in the same framework, we will use both terms interchangeably.
5.1.1 Lossless vs Lossy Compression
It is noted that lossy compression differs from the lossless compression studied in the previous chapters, in that lossless compression allows exact reconstruction of the data. This means that for lossy compression we need some criterion to assess how much information is lost in the process; this criterion is called a distortion measure.
5.1.2 The Quantization Problem
We will study the quantization problem in the information theoretic framework,
defined as follows:
• The source generates data X n = (X1 , ..., Xn ) ∈ X n drawn i.i.d. from a
distribution p(x)
• The encoder encodes the data as fn (X n ) ∈ {1, ..., 2nR } with nR bits.
• The decoder decodes the data by X̂ n = gn (fn (X n ))
The mappings $f_n$ and $g_n$ define the strategy for encoding and decoding the data, and given a rate $R$ the goal is to select these mappings in order to minimize the distortion defined as
$$D = \frac{1}{n} \sum_{i=1}^n \mathbb{E}\big(d(X_i, \hat{X}_i)\big)$$
where $d$ is a positive function, e.g. $d(x, x') = (x' - x)^2$.
A few remarks are in order. The mapping $f_n$ is indeed a quantizer, as it maps a vector of $n$ source symbols (whose values may be continuous or discrete) to an integer between 1 and $2^{nR}$, or equivalently to a string of $nR$ bits, so that $R$ measures the number of bits per source symbol used by the quantizer. The mapping $g_n$ is a decoder and attempts to reconstruct the original data. The $n$ source symbols $X^n$ are quantized as $f_n(X^n)$ and subsequently reconstructed as $\hat{X}^n = g_n(f_n(X^n))$. One would like $\hat{X}^n$ to be as close as possible to $X^n$, and we do so by minimizing $D$, which can be seen as a measure of dissimilarity between $X^n$ and $\hat{X}^n$, or of how much information was lost in the process.
Of course, the choice of d impacts the strategy we should use, and should be
chosen wisely. For instance if one is dealing with images, so that X n is an image
and Xi is its i-th pixel, then D being small should imply that X n and X̂ n look
similar to a human.
5.2 Scalar Quantization
We first study scalar quantization, where $n = 1$, so that we compress symbols one at a time, with the goal of minimizing the per-symbol distortion.
5.2.1 Lloyd-Max Conditions
There is a general result for finding optimal quantization schemes, called the Lloyd-Max conditions, which gives necessary conditions that the optimal quantizer must verify.
Proposition 5.2.1 (Lloyd-Max). An optimal codebook must satisfy two conditions:
(i) The encoder $f$ should verify, for all $x \in \mathcal{X}$,
$$f(x) \in \arg\min_{i \in \{1, \ldots, 2^R\}} d(g(i), x).$$
(ii) The decoder $g$ should verify, for all $i \in \{1, \ldots, 2^R\}$,
$$g(i) \in \arg\min_{x' \in \mathcal{X}} \mathbb{E}\big(d(x', X) \mid f(X) = i\big).$$
Proof: We have that
$$D = \mathbb{E}\big(d(g(f(X)), X)\big) \geq \mathbb{E}\Big(\min_{i \in \{1, \ldots, 2^R\}} d(g(i), X)\Big)$$
and
$$D = \mathbb{E}\big(d(g(f(X)), X)\big) = \sum_{i=1}^{2^R} \mathbb{E}\big(d(g(i), X) \mid f(X) = i\big)\,\mathbb{P}(f(X) = i) \geq \sum_{i=1}^{2^R} \min_{x' \in \mathcal{X}} \mathbb{E}\big(d(x', X) \mid f(X) = i\big)\,\mathbb{P}(f(X) = i).$$
Therefore, if (i) or (ii) is not satisfied, we can decrease the value of the distortion by modifying $f$ or $g$.
The most important insights gained from the Lloyd-Max conditions are twofold. First, in designing the quantizer, a point should be mapped to the closest reconstruction point. Second, when designing the decoder, one should select the reconstruction points to minimize the conditional expected distortion. This shows that if the quantizer $f$ is known, then finding $g$ is easy, and vice versa, and it suggests an iterative algorithm: start with an arbitrary $(f, g)$ and alternately minimize over $f$ and $g$ until convergence. This algorithm may not converge to the optimal solution and should be seen as a heuristic.
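As an illustration, here is a minimal sketch of this alternating heuristic for squared-error distortion, run on samples from the source (essentially the Lloyd/k-means iteration); the sample-based averaging, the initialization and the number of iterations are implementation choices, not part of the notes.

```python
import numpy as np

def lloyd_max(samples, R, n_iter=100, rng=np.random.default_rng(0)):
    """Alternating minimization for a scalar quantizer with 2**R levels
    and squared-error distortion, estimated from samples."""
    levels = rng.choice(samples, size=2 ** R, replace=False).astype(float)
    for _ in range(n_iter):
        # (i) encoder: map each sample to the nearest reconstruction point
        idx = np.argmin(np.abs(samples[:, None] - levels[None, :]), axis=1)
        # (ii) decoder: each reconstruction point becomes the conditional mean of its cell
        for i in range(len(levels)):
            if np.any(idx == i):
                levels[i] = samples[idx == i].mean()
    distortion = np.mean((samples - levels[idx]) ** 2)
    return np.sort(levels), distortion

samples = np.random.default_rng(1).uniform(0, 1, 10_000)
levels, D = lloyd_max(samples, R=2)
print(levels, D)   # for Uniform[0,1] and R = 2, D should be close to 2**(-2*R)/12
```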
5.2.2 Uniform Distribution
[Figure: the uniform density p(x) on [0, 1] together with the quantization points.]
Proposition 5.2.2. Consider $n = 1$, $X \sim \text{Uniform}([0,1])$ and distortion function $d(x, x') = (x - x')^2$. Then the minimal distortion is $D = \frac{2^{-2R}}{12}$, and the optimal quantization scheme is uniform quantization:
$$f(X) \in \arg\min_{i = 1, \ldots, 2^R} |X - g(i)|$$
with $g(i) = i\,2^{-R}$.
Proof: Let us assume without loss of generality that $g(1) < \cdots < g(2^R)$. From Lloyd-Max, the quantization scheme should be such that
$$f(x) \in \arg\min_{i \in \{1, \ldots, 2^R\}} d(g(i), x) = \arg\min_{i \in \{1, \ldots, 2^R\}} |g(i) - x|.$$
Furthermore, knowing that $f(X) = i$, $X$ has uniform distribution over the interval
$$\Big[\frac{g(i-1) + g(i)}{2},\; \frac{g(i) + g(i+1)}{2}\Big].$$
This implies that
$$g(i) \in \arg\min_{x'} \mathbb{E}\big(d(x', X) \mid f(X) = i\big) = \mathbb{E}\big(X \mid f(X) = i\big) = \frac{g(i-1) + 2g(i) + g(i+1)}{4},$$
that is, $g(i) = \frac{g(i-1) + g(i+1)}{2}$. One can readily check by recursion that this implies $g(i) = i\,2^{-R}$ for $i = 1, \ldots, 2^R$. The distortion is hence $D = \frac{1}{12}\,2^{-2R}$, which concludes the proof.
When data is uniformly distributed over an interval, the optimal quantization scheme is uniform quantization, which simply partitions the interval into $2^R$ intervals of equal size, and the distortion is $\frac{1}{12}\,2^{-2R}$, so that when the rate is increased by 1 bit, the distortion is divided by 4 (i.e. decreased by 6 dB). It is also noted that uniform quantization is equivalent to rounding the data to the nearest integer multiple of $2^{-R}$, so it is very easy to implement.
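A quick numerical check of this 6 dB-per-bit rule (a minimal sketch; the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 1_000_000)
for R in (1, 2, 3, 4):
    xhat = np.round(x * 2 ** R) / 2 ** R       # round to the nearest multiple of 2^-R
    D = np.mean((x - xhat) ** 2)
    print(R, D, 2.0 ** (-2 * R) / 12)          # empirical distortion vs. 2^{-2R}/12
```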
5.2.3 Gaussian Distribution with One Bit
[Figure: the Gaussian density p(x) together with the two quantization points.]
Proposition 5.2.3. Consider $n = 1$, $R = 1$, $X \sim \mathcal{N}(0, \sigma^2)$ and distortion function $d(x, x') = (x - x')^2$. Then the minimal distortion is $D = \frac{\pi - 2}{\pi} \sigma^2$, the optimal quantization scheme is the sign
$$f(X) = \begin{cases} 1 & \text{if } X < 0 \\ 2 & \text{if } X \geq 0 \end{cases}$$
and the optimal reconstruction is
$$g(i) = \begin{cases} -\sqrt{2\sigma^2/\pi} & \text{if } i = 1 \\ +\sqrt{2\sigma^2/\pi} & \text{if } i = 2. \end{cases}$$
Proof: Let us assume without loss of generality that $g(1) < g(2)$. From Lloyd-Max, the quantization scheme should be such that
$$f(x) \in \arg\min_{i \in \{1, 2\}} d(g(i), x) = \arg\min_{i \in \{1, 2\}} |g(i) - x|.$$
Since $X$ has the same distribution as $-X$, one must have $g(2) = -g(1)$, hence $f(X) = 1$ if $X < 0$ and $f(X) = 2$ otherwise. Furthermore,
$$g(2) \in \arg\min_{x' \in \mathcal{X}} \mathbb{E}\big(d(x', X) \mid f(X) = 2\big) = \mathbb{E}\big(X \mid f(X) = 2\big) = \mathbb{E}\big(X \mid X \geq 0\big) = \sqrt{\frac{2\sigma^2}{\pi}}.$$
One may readily check that $D = \frac{\pi - 2}{\pi} \sigma^2$, which concludes the proof.
If only $R = 1$ bit per symbol is available, the most efficient quantizer consists in simply encoding the sign of the data, so that the information about the absolute value is lost. It is also noted that the optimal reconstruction points $\pm\sqrt{2\sigma^2/\pi}$ are $\pm$ the expected value of the absolute value of $X$.
5.2.4 General Distributions
It should be noted that even for Gaussian distributions and R ̸= 1, finding the
optimal quantization scheme is not straightforward. In fact, for many distributions,
the optimal quantization scheme is not known.
5.3 Vector Quantization
We now study vector quantization, where $n > 1$ and we attempt to encode several source symbols at the same time.
5.3.1 Vector Quantization is Better than Scalar Quantization
It would be tempting to think that, if one has a good scalar quantizer, one could simply apply it to sequences of $n$ independent symbols, and that this would result in a low distortion. We illustrate with an example that this intuition is false, and also that, in general, randomization can be a very powerful tool for vector quantization.
Consider $X^n = (X_1, \ldots, X_n)$ i.i.d. uniform in $[0,1]$, so that $X^n$ is uniformly distributed on $[0,1]^n$, and let us apply the optimal scalar quantizer to each of its entries, namely
$$f^s(X^n) = \Big(\arg\min_{i=1,\ldots,2^R} |X_1 - i\,2^{-R}|, \ldots, \arg\min_{i=1,\ldots,2^R} |X_n - i\,2^{-R}|\Big)$$
and
$$g^s(i^n) = (i_1 2^{-R}, \ldots, i_n 2^{-R}).$$
Then one may readily check that the reconstruction error $g^s(f^s(X^n)) - X^n$ has i.i.d. uniformly distributed entries with variance $\frac{1}{12}\,2^{-2R}$, and therefore the achieved distortion is $D = \frac{1}{12}\,2^{-2R}$.
On the other hand, consider another quantization strategy where the quantization points $g(1), \ldots, g(2^{nR})$ are selected uniformly at random in $[0,1]^n$. One may readily check, from the independence of $g(1), \ldots, g(2^{nR})$, that
$$\mathbb{P}\Big(\frac{1}{n} \min_{i=1,\ldots,2^{nR}} d(X, g(i)) \geq \frac{1}{12}\,2^{-2R}\Big) = \mathbb{P}\big(d(X, g(1)) \geq r_n^2\big)^{2^{nR}}$$
with $r_n^2 = \frac{n}{12}\,2^{-2R}$. Furthermore,
$$\mathbb{P}\big(d(X, g(1)) \leq r_n^2\big) \approx \frac{(\pi r_n^2)^{n/2}}{\Gamma(n/2 + 1)}$$
since the probability that $d(X, g(1)) \leq r_n^2$ can be approximated by the Lebesgue measure of a ball of radius $r_n$ centered at $X$. We may then use Stirling's approximation to show that
$$\mathbb{P}\big(d(X, g(1)) \geq r_n^2\big)^{2^{nR}} \xrightarrow[n \to \infty]{} 0.$$
Therefore, this quantization strategy has distortion lower than $\frac{1}{12}\,2^{-2R}$ with high probability, and is superior to scalar quantization.
5.3.2 Paradoxes of High Dimensions
This means that this vector quantizer is provably better than scalar quantization, and it also shows that, in some cases, drawing the representation points of the quantizer according to some distribution, rather than according to some deterministic rule, can perform better. This is due to the counterintuitive fact that rectangular grids do not fill out space very well in high dimensions, while i.i.d. sequences tend to fill out the space much better (this is in fact the basis of Monte-Carlo methods).
[Figure: two sets of quantization points in the unit square [0,1]^2 (axes x1, x2).]
Another related counterintuitive fact is that even if two random variables X, Y
have no relationship with each other, quantizing them together is always better
than quantizing them separately.
5.3.3 Rate Distortion Function
Definition 5.3.1. A rate distortion pair $(R, D)$ is achievable if and only if there exists a sequence of $(2^{nR}, n)$ distortion codes $(f_n, g_n)$ with
$$\limsup_{n \to \infty} \frac{1}{n} \sum_{i=1}^n \mathbb{E}\big(d(g_n(f_n(X^n))_i, X_i)\big) \leq D.$$
Definition 5.3.2. The rate distortion function R(D) for a given D is the infimum
over R such that (R, D) is achievable.
Given a rate $R$ and a distortion $D$, we say that $(R, D)$ is achievable if, asymptotically as $n$ grows large, there exists a sequence of quantizers whose distortion is at most $D$. We insist on the fact that for each value of $n$ an appropriate quantizer must be found, and what matters is the limiting behaviour of this sequence. This means that the notion of achievability is asymptotic, and there may not exist quantizers with rate $R$ and distortion $D$ for small values of $n$. In a sense, achievability quantifies the smallest distortion for $n = +\infty$. Clearly, the larger the allowed distortion, the smaller the rate can be with an efficient quantizer, and a natural question is: what is the optimal trade-off between distortion and rate? The answer to this question is called the rate distortion function. Computing this function may be difficult in general, and we will show how this may be done by minimizing the mutual information.
5.4 Rate Distortion Theorem
Definition 5.4.1. Define the information rate function
$$R_I(D) = \min_{p(\hat{x}|x)\,:\; \mathbb{E}(d(X, \hat{X})) \leq D} I(X; \hat{X}),$$
where the minimum is over all possible conditional distributions $p(\hat{x}|x)$.
Theorem 5.4.2. The information rate function equals the rate distorsion function.
We will prove the rate-distortion theorem by showing that the information rate function is an information-theoretic limit of the problem, and then constructing efficient rate-distortion codes which reach this limit.
5.4.1 Lower Bound
Proposition 5.4.3. Consider a memoryless source. Then any rate R < RI (D) is
not achievable at distorsion D.
Proof: Let us consider a $(2^{nR}, n)$ distortion code $(f_n, g_n)$. Since $f_n(X^n) \in \{1, \ldots, 2^{nR}\}$,
$$H(f_n(X^n)) \leq nR.$$
Using the fact that conditional entropy is positive,
$$I(X^n; f_n(X^n)) \leq H(f_n(X^n)).$$
Furthermore, from the data processing inequality,
$$I(X^n; \hat{X}^n) = I(X^n; g_n(f_n(X^n))) \leq I(X^n; f_n(X^n)).$$
Since the source is memoryless,
$$I(X^n; \hat{X}^n) = H(X^n) - H(X^n | \hat{X}^n)$$
with
$$H(X^n) = \sum_{i=1}^n H(X_i)$$
and, using the chain rule and the fact that conditioning reduces entropy,
$$H(X^n | \hat{X}^n) = \sum_{i=1}^n H(X_i | X_{i-1}, \ldots, X_1, \hat{X}^n) \leq \sum_{i=1}^n H(X_i | \hat{X}_i).$$
Putting things together,
$$\sum_{i=1}^n I(X_i; \hat{X}_i) \leq I(X^n; \hat{X}^n).$$
By definition of the information rate function,
$$\sum_{i=1}^n R_I(D_i) \leq \sum_{i=1}^n I(X_i; \hat{X}_i)$$
with $D_i = \mathbb{E}(d(X_i, \hat{X}_i))$ the distortion of the $i$-th symbol. We have $D = \frac{1}{n}\sum_{i=1}^n D_i$, and since the mutual information is convex, so is the information rate function, which in turn implies
$$n R_I(D) \leq \sum_{i=1}^n R_I(D_i).$$
We have proven that $R_I(D) \leq R$, so that $R_I(D)$ is indeed a lower bound on the rate that can be achieved at distortion level $D$.
5.4.2 Efficient Coding Scheme: Random Coding
We now present a scheme known as random coding: any rate distortion pair $(R, D)$ satisfying the previous lower bound can be achieved with this scheme. In this sense, random coding is optimal. We do not provide a complete proof of the optimality of random coding in this context; we will go into further details in the next chapter on channel coding.
Algorithm 5.4.4 (Random Coding for Rate Distortion). Consider the following randomized scheme to construct a rate-distortion codebook:
• (Codebook generation) Let $p(\hat{x}|x)$ be a conditional distribution such that $R_I(D) = I(X; \hat{X})$ and $\mathbb{E}(d(X, \hat{X})) \leq D$. Draw $\mathcal{C} = \{\hat{X}^n(i), i = 1, \ldots, 2^{nR}\}$ where each $\hat{X}^n(i)$ is an i.i.d. sample of size $n$ from the marginal $p(\hat{x})$.
• (Encoding) Encode $X^n$ by $W \in \{1, \ldots, 2^{nR}\}$, the smallest index such that $(X^n, \hat{X}^n(W))$ is distortion typical. If no such index exists, let $W = 1$.
• (Decoding) Output the representation point $\hat{X}^n(W)$.
It is noted that this is a randomized strategy, so that both the encoder fn and
the decoder gn are in fact random. While it may seem counter-intuitive to select a
random codebook, this in fact eases the analysis very much, because it allows us to
average over the codebook itself. Furthermore, when performing this averaging, as
long as we are able to prove that the codebook has good performance in expectation,
it automatically implies that there exists a codebook with good performance. This
strategy is common in information theory as well as other fields (for instance
random graphs), and is known as the "probabilistic method". The disadvantage of
random coding with respect to, for instance, Huffman coding, is that it is much
more complex to implement.
Proposition 5.4.5. There exists a sequence of codebooks achieving any rate distortion pair $(R, D)$ with $R > R_I(D)$.
The main idea centers around typicality, in this case distortion typicality.
Proposition 5.4.6. Consider $(X^n, \hat{X}^n) = (X_i, \hat{X}_i)_{i=1,\ldots,n}$ i.i.d. with p.d.f. $p(x, \hat{x})$. Given $\epsilon > 0$, define the distortion typical set
$$A_\epsilon^n = \Big\{ (x^n, \hat{x}^n) \in \mathcal{X}^n \times \hat{\mathcal{X}}^n : \Big|\frac{1}{n} \sum_{i=1}^n \log_2 \frac{1}{p(x_i)} - H(X)\Big| + \Big|\frac{1}{n} \sum_{i=1}^n \log_2 \frac{1}{p(\hat{x}_i)} - H(\hat{X})\Big| + \Big|\frac{1}{n} \sum_{i=1}^n \log_2 \frac{1}{p(x_i, \hat{x}_i)} - H(X, \hat{X})\Big| + \Big|\frac{1}{n} \sum_{i=1}^n d(x_i, \hat{x}_i) - \mathbb{E}\big(d(X, \hat{X})\big)\Big| \leq \epsilon \Big\}.$$
Then $\mathbb{P}\big((X^n, \hat{X}^n) \in A_\epsilon^n\big) \xrightarrow[n \to \infty]{} 1$.
The point of random coding is that, since the codewords are drawn i.i.d. from $p(\hat{x})$, the pair $(X^n, \hat{X}^n)$ will be distortion typical with high probability, so that the distortion $\frac{1}{n}\sum_{i=1}^n d(X_i, \hat{X}_i)$ will be arbitrarily close to $D$.
5.5 Rate Distortion for Gaussian Distributions
Computing the rate-distortion function is usually difficult, as it is the solution to a minimization problem and does not admit a closed-form expression for many distributions. For Gaussian variables and vectors, however, the solution can be computed in closed form, and gives several interesting insights into quantization in general.
5.5.1 Gaussian Random Variables
[Figure: the rate distortion function plotted against the distortion.]
Proposition 5.5.1. Consider $X \sim \mathcal{N}(0, \sigma^2)$ with $d(x, x') = (x - x')^2$. The rate distortion function is given by
$$R(D) = \max\Big(\frac{1}{2} \log_2 \frac{\sigma^2}{D},\, 0\Big).$$
Proof: We must minimize $I(X; \hat{X})$ where $X \sim \mathcal{N}(0, \sigma^2)$ and $(X, \hat{X})$ verifies $\mathbb{E}((X - \hat{X})^2) \leq D$. By definition of the mutual information,
$$I(X; \hat{X}) = h(X) - h(X | \hat{X}).$$
Since $X \sim \mathcal{N}(0, \sigma^2)$ we have
$$h(X) = \frac{1}{2} \log_2(2\pi e \sigma^2).$$
Furthermore, since conditioning reduces entropy,
$$h(X | \hat{X}) = h(X - \hat{X} | \hat{X}) \leq h(X - \hat{X}).$$
Now, since the Gaussian distribution maximizes entropy for a given variance,
$$h(X - \hat{X}) \leq \frac{1}{2} \log_2\big(2\pi e\, \mathrm{var}(X - \hat{X})\big).$$
Since $\mathrm{var}(X - \hat{X}) \leq D$, we have proven that
$$I(X; \hat{X}) \geq \frac{1}{2} \log_2 \frac{\sigma^2}{D}.$$
Now consider the following joint distribution: $X = \hat{X} + Z$ where $\hat{X}$ and $Z$ are independent and Gaussian with respective variances $\sigma^2 - D$ and $D$. Then one may readily check that $\mathbb{E}((X - \hat{X})^2) \leq D$ and that
$$I(X; \hat{X}) = \frac{1}{2} \log_2 \frac{\sigma^2}{D},$$
which proves the result.
The rate-distortion function of a Gaussian variable is indeed convex and decreasing, and in particular it equals 0 for any $D > \sigma^2$: even with no information, one can achieve a distortion of $\sigma^2$ by representing $X$ by the fixed value $\mathbb{E}(X)$. Furthermore, for $D < \sigma^2$, when $R$ is increased by 1, $D$ is divided by 4, so each added bit of quantization decreases the quantization error by 6 dB. Finally, as predicted previously, vector quantization is better than scalar quantization. For instance, consider $R = 1$: using vector quantization on $(X_1, \ldots, X_n)$ with rate $R = 1$ yields a distortion of $D = \frac{\sigma^2}{4}$, while using scalar quantization on each entry of $(X_1, \ldots, X_n)$ with rate $R = 1$ yields a distortion of $D = \frac{\pi - 2}{\pi} \sigma^2$. Hence, in this example, vector quantization is about 45% more efficient than scalar quantization.
5.5.2 Gaussian Vectors
Proposition 5.5.2. Consider $X_1, \ldots, X_k$ independent with $X_j \sim \mathcal{N}(0, \sigma_j^2)$ and distortion function $d(x, x') = \sum_{j=1}^k (x_j - x_j')^2$. The rate distortion function is given by
$$R(D) = \sum_{j=1}^k \frac{1}{2} \log_2 \frac{\sigma_j^2}{\min(\lambda^\star, \sigma_j^2)}$$
where $\lambda^\star$ is chosen such that $\sum_{j=1}^k \min(\lambda^\star, \sigma_j^2) = D$.
Proof: We must minimize $I(X^k; \hat{X}^k)$ where $X_1, \ldots, X_k$ are independent with $X_j \sim \mathcal{N}(0, \sigma_j^2)$ and $(X^k, \hat{X}^k)$ verifies $\mathbb{E}(\|X^k - \hat{X}^k\|^2) \leq D$. By definition of the mutual information,
$$I(X^k; \hat{X}^k) = h(X^k) - h(X^k | \hat{X}^k).$$
Since $X_1, \ldots, X_k$ are independent,
$$h(X^k) = \sum_{i=1}^k h(X_i),$$
and since conditioning reduces entropy,
$$h(X^k | \hat{X}^k) = \sum_{i=1}^k h(X_i | X_{i-1}, \ldots, X_1, \hat{X}^k) \leq \sum_{i=1}^k h(X_i | \hat{X}_i).$$
Therefore
$$I(X^k; \hat{X}^k) \geq \sum_{i=1}^k I(X_i; \hat{X}_i).$$
Define $D_i = \mathbb{E}((\hat{X}_i - X_i)^2)$, the distortion attributed to component $i$. From the scalar case studied previously,
$$I(X_i; \hat{X}_i) \geq \frac{1}{2} \Big(\log_2 \frac{\sigma_i^2}{D_i}\Big)^+.$$
Hence
$$I(X^k; \hat{X}^k) \geq \sum_{i=1}^k \frac{1}{2} \Big(\log_2 \frac{\sigma_i^2}{D_i}\Big)^+.$$
Furthermore, one can achieve equality by choosing the pairs $(X_i, \hat{X}_i)$ independent across $i$, with the distribution of the scalar case. Hence the rate distortion function is the value of the optimization problem
$$\text{minimize} \quad \sum_{i=1}^k \frac{1}{2} \Big(\log_2 \frac{\sigma_i^2}{D_i}\Big)^+ \quad \text{subject to} \quad \sum_{i=1}^k D_i = D.$$
From Lagrangian relaxation, the solution of this optimization problem must be such that there exists $\lambda > 0$ such that either $D_i = \sigma_i^2$ or $D_i = \lambda$. Selecting $\lambda$ to ensure that $\sum_{i=1}^k D_i = D$ yields the result.
For Gaussian vectors with independent entries, the rate distortion function can be computed as well, and the solution is given by an allocation called "reverse water-filling", which attempts to equalize the distortion across components. Bits are allocated mostly to components with high variance, and components with low variance are simply ignored. This makes sense since, for an equal number of bits, the larger the variance, the larger the distortion. This can be generalized to Gaussian vectors with non-diagonal covariance matrices by performing reverse water-filling on the eigenvalues of the covariance matrix (in the eigenvector basis).
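As an illustration, here is a minimal sketch of reverse water-filling: it finds the level $\lambda^\star$ by bisection and returns the per-component distortions and the resulting rate. The bisection tolerance and the example variances are arbitrary choices.

```python
import numpy as np

def reverse_waterfilling(sigma2, D, tol=1e-12):
    """Per-component distortions D_j = min(lam, sigma_j^2) with sum_j D_j = D,
    and the resulting rate R(D) in bits (requires 0 < D <= sum(sigma2))."""
    sigma2 = np.asarray(sigma2, dtype=float)
    lo, hi = 0.0, sigma2.max()
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if np.minimum(mid, sigma2).sum() > D:    # too much distortion allowed: lower the level
            hi = mid
        else:
            lo = mid
    Dj = np.minimum((lo + hi) / 2, sigma2)
    return Dj, 0.5 * np.log2(sigma2 / Dj).sum()

Dj, R = reverse_waterfilling([4.0, 1.0, 0.25], D=1.5)
print(Dj, R)    # the low-variance component is "ignored": its D_j equals sigma_j^2
```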
Chapter 6
Mutual Information and Communication: Discrete Channels
We now move away from data representation, and focus on communication over
noisy channels. For this problem, we are concerned with the maximal rate at which
information can be reliably sent over the channel, in the sense that the receiver
should be able to retrieve the sent information with high probability. As we shall
see, information theoretic tools provide a complete characterization of the problem
in terms of achievable rates as well as coding strategies.
6.1 Memoryless Channels
6.1.1 Definition
We will consider the case where a sender selects n inputs X n = (X1 , ..., Xn ) from a
finite alphabet X , and a receiver observes corresponding outputs Y n = (Y1 , ..., Yn )
from another finite alphabet Y. The relationship between X n and Y n is called
a channel. The main problem we aim to solve is how much information can be
reliably exchanged between the sender and the receiver as a function of n. The
ratio between the amount of information exchanged and n is called the rate (in bits
per channel use).
Definition 6.1.1. A channel with input $X^n = (X_1, \ldots, X_n)$ and output $Y^n = (Y_1, \ldots, Y_n)$ is memoryless with transition matrix $p(y|x) = \mathbb{P}(Y = y | X = x)$ if
$$p_{Y^n | X^n}(y^n | x^n) = \prod_{i=1}^n p(y_i | x_i).$$
Of course, a channel can model almost any point-to-point communication scenario regardless of the medium: wireless communication, optical communication, and so on. We will focus mostly on memoryless channels, which already constitute a rather rich model. There exist more general models, such as Markovian channels and the most general model of ergodic channels. It is noted that if a channel is memoryless and $X^n = (X_1, \ldots, X_n)$ is i.i.d., then $Y^n = (Y_1, \ldots, Y_n)$ is also i.i.d.
6.1.2 Information Capacity of a Channel
Definition 6.1.2. The information channel capacity of a memoryless channel is defined as
$$C = \max_{p_X} I(X; Y)$$
where the maximum is taken over all possible input distributions.
The information channel capacity is simply the largest amount of mutual
information that can be achieved by selecting the input distribution appropriately.
It turns out that this number also represents the amount of bits per channel use that
can be reliably exchanged between the sender and the receiver, as we shall later
see.
6.1.3 Examples
We now propose to compute the information channel capacity for a few simple
channel models.
Noiseless Binary Channel
[Figure: the noiseless binary channel, where input 0 maps to output 0 and input 1 maps to output 1.]
Since $X$ can be retrieved perfectly from $Y$, we have $H(X|Y) = 0$ and the mutual information is
$$I(X; Y) = H(X) - H(X|Y) = H(X).$$
To maximize $I(X; Y)$ one must maximize $H(X)$, so the optimal input distribution is uniform on $\{0, 1\}$ and the capacity is
$$C = \log_2 2 = 1.$$
6.1.4 Non-Overlapping Outputs Channels
[Figure: a channel with non-overlapping outputs: input 0 maps to output 0 or 1 (with probabilities p0 and 1 − p0), input 1 maps to output 2 or 3 (with probabilities p1 and 1 − p1).]
Once again $X$ can be retrieved perfectly from $Y$, so $H(X|Y) = 0$ and the mutual information is
$$I(X; Y) = H(X) - H(X|Y) = H(X).$$
To maximize $I(X; Y)$ one must maximize $H(X)$, so the optimal input distribution is uniform on $\mathcal{X}$ and the capacity is
$$C = \log_2 |\mathcal{X}|,$$
which generalizes the previous case.
6.1.5 Binary Symmetric Channel
[Figure: the binary symmetric channel: each input bit is flipped with probability p and transmitted correctly with probability 1 − p.]
Knowing Y , X has a Bernoulli(p) distribution, so the mutual information is
I(X; Y ) = H(X) − H(X|Y ) = H(X) − h2 (p)
To maximize $I(X; Y)$ one must maximize $H(X)$, so the optimal input distribution is uniform on $\{0, 1\}$ and the capacity is
$$C = \log_2 2 - h_2(p) = 1 - h_2(p).$$
6.1.6 Typewriter Channel
[Figure: the noisy typewriter channel with inputs and outputs {0, 1, 2, 3}; each input is mapped to one of two outputs with equal probability.]
Knowing $X$, $Y$ has two equiprobable values, therefore
$$I(X; Y) = H(Y) - H(Y|X) = H(Y) - 1.$$
One would like to maximize $H(Y)$ by selecting the distribution of $X$ appropriately. If we select $X$ uniformly distributed, then $Y$ is also uniformly distributed, so the optimal input distribution is uniform on $\mathcal{X}$ and the capacity is
$$C = \log_2 |\mathcal{X}| - 1.$$
6.1.7 Binary Erasure Channel
[Figure: the binary erasure channel: each input bit is erased (output ×) with probability α and transmitted correctly with probability 1 − α.]
Knowing $X$, $Y$ takes the value $X$ or $\times$ with probabilities $1 - \alpha$ and $\alpha$, so the conditional entropy is
$$H(Y|X) = h_2(\alpha).$$
On the other hand, $Y$ has three possible values $0$, $\times$ and $1$ with probabilities $(1-\alpha)(1-p)$, $\alpha$ and $(1-\alpha)p$, where $p = \mathbb{P}(X = 1)$, so the entropy is
$$H(Y) = (1-\alpha)(1-p) \log_2 \frac{1}{(1-\alpha)(1-p)} + \alpha \log_2 \frac{1}{\alpha} + (1-\alpha)p \log_2 \frac{1}{(1-\alpha)p} = h_2(\alpha) + (1-\alpha) H(X).$$
Therefore the mutual information is
$$I(X; Y) = (1-\alpha) H(X).$$
To maximize $I(X; Y)$ one must maximize $H(X)$, so the optimal input distribution is uniform on $\{0, 1\}$ and the capacity is
$$C = (1-\alpha) \log_2 2 = 1 - \alpha.$$
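As a sanity check, here is a minimal sketch that evaluates $I(X; Y)$ for a given input distribution and transition matrix, and recovers the BSC and BEC capacities computed above with a uniform input (which is optimal for both, as argued in the text). The channel parameters are arbitrary.

```python
import numpy as np

def mutual_information(px, P):
    """I(X;Y) in bits for input distribution px and transition matrix P[x, y] = p(y|x)."""
    pxy = px[:, None] * P                   # joint distribution p(x, y)
    py = pxy.sum(axis=0)
    mask = pxy > 0
    return (pxy[mask] * np.log2(pxy[mask] / (px[:, None] * py[None, :])[mask])).sum()

p, alpha = 0.1, 0.3
bsc = np.array([[1 - p, p], [p, 1 - p]])
bec = np.array([[1 - alpha, alpha, 0.0],    # outputs: 0, erasure, 1
                [0.0, alpha, 1 - alpha]])
u = np.array([0.5, 0.5])
h2 = lambda q: -q * np.log2(q) - (1 - q) * np.log2(1 - q)
print(mutual_information(u, bsc), 1 - h2(p))      # both ≈ 0.531
print(mutual_information(u, bec), 1 - alpha)      # both = 0.7
```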
6.2 Channel Coding
6.2.1 Coding Schemes
We consider coding over blocks of n channel uses.
Definition 6.2.1. Consider the following procedure:
• The transmitter chooses a message W ∈ {1, ..., M }
• She transmits a codeword X n (W ) = (X1 (W ), ..., Xn (W )) ∈ X n
• The receiver sees Y n distributed as p(y n |xn )
• She decodes the message using some decoding rule Ŵ = g(Y n )
Any such procedure is called an $(M, n)$ channel code with rate
$$R = \frac{1}{n} \log_2 M$$
and error probability
$$P_e^n = \mathbb{P}(\hat{W} \neq W).$$
6.2.2 Example of a Code for the BSC
[Figure: the binary symmetric channel with crossover probability p.]
For the binary symmetric channel, a code is given by a subset $\mathcal{C}$ of $\{0,1\}^n$ of size $2^{nR}$, along with a decoding rule. The distribution of the channel output $y^n$ conditional on transmitting a codeword $x^n$ is given by
$$p(y^n | x^n) = \prod_{i=1}^n (1-p)^{\mathbf{1}\{x_i = y_i\}}\, p^{\mathbf{1}\{x_i \neq y_i\}} = (1-p)^n \Big(\frac{p}{1-p}\Big)^{d(x^n, y^n)},$$
where $d(x^n, y^n)$ is the Hamming distance between $x^n$ and $y^n$, i.e. the number of entries of $x^n$ that differ from those of $y^n$.
One can prove that the optimal decoding rule is maximum likelihood decoding (in the sense that this rule minimizes the error probability), which consists in selecting the codeword $x^n \in \mathcal{C}$ that is the most likely to have been transmitted:
$$\hat{x}^n = \arg\max_{x^n \in \mathcal{C}} p(x^n | y^n) = \arg\min_{x^n \in \mathcal{C}} d(x^n, y^n).$$
We notice that this is equivalent to minimizing the Hamming distance between the output and the codeword. Also note that, if $\mathcal{C}$ is very large, this might be computationally very hard.
[Figure: the parity bits x4 = x1 ⊕ x3, x5 = x2 ⊕ x3 and x6 = x1 ⊕ x2 of the Hamming code.]
A well-known code for the BSC is the so-called Hamming code:
$$\mathcal{C} = \{x^n \in \{0,1\}^n : (x_4, x_5, x_6) = (x_1 \oplus x_3,\, x_2 \oplus x_3,\, x_1 \oplus x_2)\}.$$
It is a code with $M = 2^3$ codewords and block size $n = 6$, so its rate is $R = \frac{1}{2}$. This code illustrates the idea of appending parity check bits $(x_4, x_5, x_6)$ to the message $(x_1, x_2, x_3)$, which adds redundancy in order to allow for error correction. In fact, one can prove that this code can correct exactly one error and that its error probability is given by
$$P_e^n = \mathbb{P}\Big(\sum_{i=1}^n \mathbf{1}\{X_i \neq Y_i\} \geq 2\Big) = 1 - np(1-p)^{n-1} - (1-p)^n.$$
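To illustrate, here is a minimal Monte-Carlo sketch that estimates the error probability of minimum-distance decoding for this code over a BSC and compares it with the probability of two or more bit flips (the right-hand side above); the two should be close since the code corrects any single flip. The simulation parameters are arbitrary.

```python
import itertools, random

def encode(m):
    """Codeword of the (6,3) code described above for message m = (x1, x2, x3)."""
    x1, x2, x3 = m
    return (x1, x2, x3, x1 ^ x3, x2 ^ x3, x1 ^ x2)

codebook = [encode(m) for m in itertools.product((0, 1), repeat=3)]

def hamming(a, b):
    return sum(ai != bi for ai, bi in zip(a, b))

def simulate(p, trials=100_000, rng=random.Random(0)):
    errors = 0
    for _ in range(trials):
        x = rng.choice(codebook)
        y = tuple(b ^ (rng.random() < p) for b in x)           # BSC: flip each bit w.p. p
        xhat = min(codebook, key=lambda c: hamming(c, y))      # minimum-distance (ML) decoding
        errors += (xhat != x)
    return errors / trials

p, n = 0.05, 6
print(simulate(p), 1 - n * p * (1 - p) ** (n - 1) - (1 - p) ** n)  # empirical vs. P(>= 2 flips)
```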
6.2.3 Achievable Rates
Given any $(M, n)$ code we define the conditional error probabilities
$$\lambda_i^n = \mathbb{P}\big(g(Y^n) \neq i \mid X^n = X^n(i)\big), \quad i = 1, \ldots, M,$$
and the maximal error probability
$$\lambda^n = \max_{i=1,\ldots,M} \lambda_i^n.$$
Definition 6.2.2. A rate $R$ is achievable if there exists a sequence of $(2^{nR}, n)$ codes with vanishing maximal error probability: $\lambda^n \xrightarrow[n \to \infty]{} 0$.
Definition 6.2.3. The capacity of a channel is the supremum of all achievable
rates.
6.3 Noisy Channel Coding Theorem
6.3.1 Capacity Upper Bound
We now show that any rate above the information capacity is not achievable. The
main idea is to apply Fano’s inequality to show that if there are too many codewords,
then the transmitted codeword cannot be estimated with arbitrary high accuracy.
Proposition 6.3.1. Consider a memoryless channel. Then any rate R > C is not
achievable.
Proof: We recall that for any $X, Y$ we have $H(X|Y) \leq H(X)$, so that using the chain rule, for any $X_1, \ldots, X_n$,
$$H(X_1, \ldots, X_n) = \sum_{i=1}^n H(X_i | X_{i-1}, \ldots, X_1) \leq \sum_{i=1}^n H(X_i).$$
We now upper bound the maximal mutual information over $n$ channel uses. By definition of the capacity,
$$\begin{aligned} I(X^n; Y^n) &= H(Y^n) - H(Y^n | X^n) \\ &= H(Y^n) - \sum_{i=1}^n H(Y_i | Y_{i-1}, \ldots, Y_1, X^n) \\ &= H(Y^n) - \sum_{i=1}^n H(Y_i | X_i) \\ &\leq \sum_{i=1}^n \big(H(Y_i) - H(Y_i | X_i)\big) = \sum_{i=1}^n I(X_i; Y_i). \end{aligned}$$
Therefore $I(X^n; Y^n) \leq nC$. For any channel code,
$$W \to X^n(W) \to Y^n \to \hat{W}$$
forms a Markov chain, and the data processing inequality yields
$$I(W; \hat{W}) \leq I(X^n; Y^n) \leq nC.$$
Since the message $W \in \{1, \ldots, 2^{nR}\}$ is chosen uniformly at random we have $H(W) = nR$ and
$$H(W | \hat{W}) = H(W) - I(W; \hat{W}) \geq n(R - C).$$
We may now apply Fano's inequality:
$$h_2\big(\mathbb{P}(W \neq \hat{W})\big) + \mathbb{P}(W \neq \hat{W}) \log_2 2^{nR} \geq H(W | \hat{W}) \geq n(R - C).$$
Since $h_2 \leq 1$ we have proven that
$$\mathbb{P}(W \neq \hat{W}) \geq \frac{n(R - C) - 1}{nR} \xrightarrow[n \to \infty]{} 1 - \frac{C}{R}.$$
Therefore the probability of error of any family of $(2^{nR}, n)$ channel codes does not vanish when $R > C$.
6.3.2 Efficient Coding Scheme: Random Coding
We now show that any rate below the information capacity is achievable, so that, in essence, the information capacity quantifies how much information can be reliably transmitted over a channel. This is a very strong result because it applies to any communication system. Capacity is the fundamental limit that no scheme can overcome, and reaching this limit can be hard in practice: no good (low complexity) codes were known for the BSC for some 30 years, and well-known examples of good codes which reach this limit are Turbo codes and LDPC codes.
Proposition 6.3.2. Consider a discrete memoryless channel. Then any rate $R < C$ is achievable.
Proof In order to prove the result, we construct a coding scheme called random
coding.
Random Coding
Algorithm 6.3.3 (Random Channel Coding). Consider the following randomized algorithm to generate a codebook and transmit data.
• (Codebook generation) Let $p(x)$ be a distribution such that $C = I(X; Y)$. Draw $\mathcal{C} = \{X^n(i), i = 1, \ldots, 2^{nR}\}$ where each $X^n(i)$ is an i.i.d. sample of size $n$ from $p(x)$, the codewords $(X^n(i))_{i=1,\ldots,2^{nR}}$ being drawn independently of each other. Reveal $\mathcal{C}$ to both the transmitter and the receiver.
• (Data Transmission) To transmit data, choose $W \in \{1, \ldots, 2^{nR}\}$ uniformly distributed, and transmit $X^n(W)$.
• (Decoding) Observe $Y^n$. If there exists a unique $\hat{W}$ such that $(X^n(\hat{W}), Y^n)$ is jointly typical, then output $\hat{W}$. Otherwise output an error.
Intuition Behind Random Coding
Interestingly, the fact that we use a random code ensemble, perhaps counter-intuitively, eases the analysis. While this analysis is not trivial, the main intuitive idea behind random channel coding is that, if $X^n(W)$ is transmitted and $Y^n$ is received, then $(X^n(W), Y^n)$ will be jointly typical with high probability, while for any $W' \neq W$, $(X^n(W'), Y^n)$ will not be jointly typical, because $X^n(W')$ is independent of $X^n(W)$ by the random code construction.
Error Probability
We compute the error probability averaged over $\mathcal{C}$. Define $E$ the event that decoding fails, and average over $\mathcal{C}$:
$$\mathbb{P}(E) = \sum_c \mathbb{P}(\mathcal{C} = c)\, P_e^n(c) = \frac{1}{2^{nR}} \sum_{i=1}^{2^{nR}} \sum_c \mathbb{P}(\mathcal{C} = c)\, \lambda_i(c).$$
By symmetry, $\sum_c \mathbb{P}(\mathcal{C} = c) \lambda_i(c)$ does not depend on $i$, so
$$\mathbb{P}(E) = \frac{1}{2^{nR}} \sum_{i=1}^{2^{nR}} \sum_c \mathbb{P}(\mathcal{C} = c)\, \lambda_1(c) = \mathbb{P}(E | W = 1).$$
Define the event that a particular couple is typical:
$$E_i = \{(X^n(i), Y^n) \in A_\epsilon^n\}.$$
If $W = 1$, decoding fails if either $(X^n(1), Y^n)$ is not typical, or there exists $i \neq 1$ such that $(X^n(i), Y^n)$ is typical, hence
$$\mathbb{P}(E | W = 1) \leq \mathbb{P}(E_1^c | W = 1) + \sum_{i=2}^{2^{nR}} \mathbb{P}(E_i | W = 1).$$
From joint typicality, for $n$ large,
$$\mathbb{P}(E_1^c | W = 1) \leq \epsilon \quad \text{and} \quad \mathbb{P}(E_i | W = 1) \leq 2^{-n(C - \epsilon)}, \; i \geq 2.$$
We conclude that, for $n$ large and $R < C - \epsilon$,
$$\mathbb{P}(E | W = 1) \leq \epsilon + 2^{-n(C - R - \epsilon)} \leq 2\epsilon.$$
As $\mathbb{P}(E) \leq 2\epsilon$, there exists $c^\star$ with $\mathbb{P}(E | \mathcal{C} = c^\star) \leq 2\epsilon$. Since
$$\mathbb{P}(E | \mathcal{C} = c^\star) = \frac{1}{2^{nR}} \sum_{i=1}^{2^{nR}} \lambda_i(c^\star) \leq 2\epsilon,$$
there are at least $2^{nR-1}$ indices $i$ such that $\lambda_i(c^\star) \leq 4\epsilon$ (consider the best half). So we have proven that there exists a sequence of $(2^{nR}, n)$ codes with vanishing error probability, which concludes the proof.
6.4 Computing the Channel Capacity
In general, how does one compute the channel capacity? The problem is usually difficult, and for many channels the computation of their capacity is an open problem. We highlight two simple strategies here.
6.4.1 Capacity of Weakly Symmetric Channels
If the channel has some symmetry features, one can use them to compute the capacity.
Definition 6.4.1. A channel is weakly symmetric if (i) for any $x, x'$, the vectors $p(\cdot|x)$ and $p(\cdot|x')$ are equal up to a permutation, and (ii) for any $y, y'$ we have $\sum_{x \in \mathcal{X}} p(y|x) = \sum_{x \in \mathcal{X}} p(y'|x)$.
If the channel is weakly symmetric, the optimal input distribution is uniform,
and the capacity is simply the logarithm of the number of outputs, minus the
entropy of a column of the transition matrix. Interestingly, this result generalizes
our previous computations.
Proposition 6.4.2. Assume that (i) for any $x, x'$, the vectors $p(\cdot|x)$ and $p(\cdot|x')$ are equal up to a permutation, and (ii) for any $y, y'$ we have $\sum_{x \in \mathcal{X}} p(y|x) = \sum_{x \in \mathcal{X}} p(y'|x)$. Then
$$C = \log_2 |\mathcal{Y}| - \sum_{y \in \mathcal{Y}} p(y|x) \log_2 \frac{1}{p(y|x)}$$
for any $x$, and the optimal input is uniform.
Proof: The distribution of $Y$ knowing $X = x$ does not depend on $x$ (up to a permutation), so
$$I(X; Y) = H(Y) - \sum_{y \in \mathcal{Y}} p(y|x) \log_2 \frac{1}{p(y|x)}.$$
Once again by symmetry, $X$ uniform implies $Y$ uniform, which maximizes $H(Y)$ and hence $I(X; Y)$.
6.4.2 Concavity of Mutual Information
Proposition 6.4.3. For any channel we have:
(i) $0 \leq C \leq \log_2\big(\min(|\mathcal{X}|, |\mathcal{Y}|)\big)$;
(ii) $(p(x)) \mapsto I(X; Y)$ is a concave function.
Proof: The distribution of $Y$ is
$$p(y) = \sum_{x \in \mathcal{X}} p(x, y) = \sum_{x \in \mathcal{X}} p(x) p(y|x).$$
Define $f(x) = x \log_2 \frac{1}{x}$. Then
$$H(Y) = \sum_{y \in \mathcal{Y}} p(y) \log_2 \frac{1}{p(y)} = \sum_{y \in \mathcal{Y}} f\Big(\sum_{x \in \mathcal{X}} p(x) p(y|x)\Big)$$
and
$$H(Y|X) = \sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} p(x, y) \log_2 \frac{1}{p(y|x)} = \sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \log_2 \frac{1}{p(y|x)}.$$
Since $f$ is concave, $(p(x)) \mapsto H(Y)$ is concave as well. Furthermore, $(p(x)) \mapsto H(Y|X)$ is linear. Therefore
$$(p(x)) \mapsto I(X; Y) = H(Y) - H(Y|X)$$
is concave.
6.4.3 Algorithms for Mutual Information Maximization
In general the capacity is not known in closed form and its computation can be a hard problem. One can maximize $I(X; Y)$ numerically using convex optimization techniques such as gradient ascent, which are valid for maximizing any concave function. Specific algorithms taking advantage of particular properties of mutual information also exist, such as the algorithm of Arimoto and Blahut.
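For concreteness, here is a minimal sketch of the Blahut-Arimoto iteration for a discrete memoryless channel given by its transition matrix (rows indexed by inputs). The number of iterations and the masking of zero probabilities are implementation choices, not part of the notes.

```python
import numpy as np

def blahut_arimoto(P, n_iter=500):
    """Capacity (in bits) of a channel with transition matrix P[x, y] = p(y|x)."""
    nx = P.shape[0]
    r = np.full(nx, 1.0 / nx)                       # current input distribution
    for _ in range(n_iter):
        q = r[:, None] * P                          # joint p(x, y)
        q /= q.sum(axis=0, keepdims=True)           # posterior q(x|y)
        logq = np.log(q, where=q > 0, out=np.zeros_like(q))
        w = np.exp((P * logq).sum(axis=1))
        r = w / w.sum()                             # updated input distribution
    py = r @ P                                      # output distribution
    ratio = P / py[None, :]
    logr = np.log2(ratio, where=ratio > 0, out=np.zeros_like(P))
    C = (r[:, None] * P * logr).sum()               # resulting mutual information
    return C, r

# binary symmetric channel with p = 0.1: capacity should be 1 - h2(0.1) ≈ 0.531
p = 0.1
P = np.array([[1 - p, p], [p, 1 - p]])
print(blahut_arimoto(P))
```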
Chapter 7
Mutual Information and Communication: Continuous Channels
In this chapter, we turn our attention to continuous channels, where both the input and the output are real-valued. Such channels are ubiquitous, due to the continuous nature of the physical world. To solve this problem we need to generalize the notions of entropy, relative entropy and mutual information to continuous random variables. We compute the capacity and the optimal input distribution of Gaussian channels, which are found in many applications such as wireless communication.
7.1 Information Measures for Continuous Variables
7.1.1 Differential Entropy
Definition 7.1.1. Consider $X$ a continuous random variable with p.d.f. $p_X(x)$. Its differential entropy is given by
$$h(X) = \mathbb{E}\Big(\log_2 \frac{1}{p_X(X)}\Big) = \int_{\mathcal{X}} p_X(x) \log_2 \frac{1}{p_X(x)}\, dx,$$
if the integral exists.
Just like entropy, differential entropy is expressed in bits, and is a natural extension of the discrete case. It is noted that the integral might not exist. One of the most notable differences with entropy is that differential entropy can be negative.
7.1.2 Examples
Uniform Distribution
If $X \sim \text{Uniform}(\mathcal{X})$:
$$h(X) = \mathbb{E}(\log_2 |\mathcal{X}|) = \log_2 |\mathcal{X}|,$$
where $|\mathcal{X}|$ here denotes the Lebesgue measure of $\mathcal{X}$. It is noted that if $|\mathcal{X}| \leq 1$ then $h(X) \leq 0$, so that differential entropy can be negative. Also, if $X$ is deterministic, $\mathcal{X}$ is a point and $h(X) = -\infty$, which differs from the discrete case where deterministic variables have an entropy of 0.
Exponential Distribution
If $X \sim \text{Exponential}(\lambda)$:
$$h(X) = \mathbb{E}\Big(\log_2 \frac{e^{\lambda X}}{\lambda}\Big) = \frac{\lambda\, \mathbb{E}(X)}{\log 2} + \log_2 \frac{1}{\lambda} = \log_2 \frac{e}{\lambda}.$$
The fact that differential entropy decreases with $\lambda$ is intuitive, since the smaller $\lambda$, the less $X$ is concentrated around 0.
Gaussian Distribution
If $X \sim \mathcal{N}(\mu, \sigma^2)$:
$$h(X) = \mathbb{E}\Big(\log_2\Big(\sqrt{2\pi\sigma^2}\, e^{\frac{(X-\mu)^2}{2\sigma^2}}\Big)\Big) = \frac{1}{2} \log_2(2\pi\sigma^2) + \frac{\mathbb{E}(X-\mu)^2}{2 \log(2)\, \sigma^2} = \frac{1}{2} \log_2(2\pi e \sigma^2).$$
This expression will occur in various places, in particular when computing the capacity of Gaussian channels. Two remarks can be made: first, the differential entropy does not depend on $\mu$, which illustrates the fact that differential entropy is invariant by translation; second, it is increasing in $\sigma^2$, which is intuitive since the larger $\sigma^2$, the less $X$ is concentrated around its mean $\mu$.
7.1.3 Joint and Conditional Entropy, Mutual Information
Joint and Conditional Differential Entropy
Definition 7.1.2. Let $X, Y$ with joint p.d.f. $p_{X,Y}(x, y)$. The joint differential entropy, conditional differential entropy and mutual information are
$$h(X, Y) = \int_{\mathcal{X} \times \mathcal{Y}} p_{X,Y}(x, y) \log_2 \frac{1}{p_{X,Y}(x, y)}\, dx\, dy,$$
$$h(X | Y) = \int_{\mathcal{X} \times \mathcal{Y}} p_{X,Y}(x, y) \log_2 \frac{p_Y(y)}{p_{X,Y}(x, y)}\, dx\, dy,$$
$$I(X; Y) = \int_{\mathcal{X} \times \mathcal{Y}} p_{X,Y}(x, y) \log_2 \frac{p_{X,Y}(x, y)}{p_X(x)\, p_Y(y)}\, dx\, dy.$$
As in the discrete case, one can readily check that
$$h(X|Y) = h(X, Y) - h(Y),$$
$$I(X; Y) = h(Y) - h(Y|X) = h(X) - h(X|Y) = h(X) + h(Y) - h(X, Y).$$
Relative Entropy
Definition 7.1.3. Consider two p.d.f.s $p(x)$ and $q(x)$. The relative entropy is
$$D(p \| q) = \int_{\mathcal{X}} p(x) \log_2 \frac{p(x)}{q(x)}\, dx.$$
Proposition 7.1.4. We have $D(p \| q) \geq 0$ for any $p, q$.
Proof: Jensen's inequality.
7.1.4 Unified Definitions for Information Measures
In the above presentation, we have given two distinct sets of definitions of information measures, for continuous and for discrete variables. A natural question is whether one can define information measures in such a way that the same definition applies to both discrete and continuous variables. The key is to, perhaps counterintuitively, start by defining the relative entropy using the Radon-Nikodym derivative.
Definition 7.1.5. Consider $P, Q$ two distributions over a measurable space $\mathcal{X}$, and assume that $P$ is absolutely continuous with respect to $Q$. Then the relative entropy can be defined as
$$D(P \| Q) = \int_{\mathcal{X}} \log_2 \frac{P(dx)}{Q(dx)}\, P(dx)$$
where $\frac{P(dx)}{Q(dx)}$ is the Radon-Nikodym derivative of $P$ with respect to $Q$.
The Radon-Nikodym derivative is well-defined thanks to absolute continuity, and in turn this allows us to define mutual information in terms of relative entropy.
Definition 7.1.6. Consider $(X, Y)$ random variables with joint distribution $P_{(X,Y)}$. Then the mutual information between $X$ and $Y$ is
$$I(X; Y) = D\big(P_{(X,Y)} \,\|\, P_X P_Y\big).$$
As a byproduct, we obtain a very instructive interpretation of mutual information I(X; Y ) as the dissimilarity between the joint distribution of the vector (X, Y )
and another vector with independent entries and the same marginals. Also, one can
readily check that the above definitions generalize both the discrete and continuous
case.
7.2 Properties of Information Measures for Continuous Variables
7.2.1 Chain Rule for Differential Entropy
Differential entropies obey a chain rule just like entropies, and the proof follows from the same arguments.
Proposition 7.2.1. For any $X_1, \ldots, X_n$ we have
$$h(X_1, \ldots, X_n) = \sum_{i=1}^n h(X_i | X_{i-1}, \ldots, X_1).$$
Proof: By definition of conditional entropy,
$$h(X_1, \ldots, X_n) = h(X_n | X_1, \ldots, X_{n-1}) + h(X_1, \ldots, X_{n-1}).$$
The result follows by induction over $n$.
7.2.2 Differential Entropy of Affine Transformations
For continuous variables over $\mathbb{R}^d$, we will often be interested in how their differential entropy is affected by simple transformations such as translations and linear transformations.
Proposition 7.2.2. For any random variable $X \in \mathbb{R}^d$, fixed vector $a \in \mathbb{R}^d$ and invertible matrix $A \in \mathbb{R}^{d \times d}$ we have
$$h(a + AX) = h(X) + \log_2 |\det A|.$$
If $A$ is not invertible we have $h(a + AX) = -\infty$.
Proof: If $A$ is invertible and $X \sim p(x)\,dx$, then $a + AX \sim \frac{p(A^{-1}(x - a))}{|\det A|}\,dx$, so
$$h(a + AX) = \int_{\mathbb{R}^d} \frac{p(A^{-1}(x-a))}{|\det A|} \log_2 \frac{|\det A|}{p(A^{-1}(x-a))}\, dx = \int_{\mathbb{R}^d} p(y) \log_2 \frac{|\det A|}{p(y)}\, dy = h(X) + \log_2 |\det A|,$$
which proves the first result. If $A$ is not invertible, then the support of the distribution of $a + AX$ has Lebesgue measure 0, so that $h(a + AX) = -\infty$.
Therefore, an affine transformation incurs an additive change to the entropy, and this change is the logarithm of the absolute value of the determinant of $A$. If $A = I$, or more generally if $A$ is a rotation, then $\log_2 |\det A| = 0$, so that differential entropy is invariant by both translation and rotation.
7.3 Differential Entropy of Multivariate Gaussians
7.3.1 Computing the Differential Entropy
Our previous result allows us to derive the differential entropy of multivariate Gaussian vectors without any computation: indeed, any Gaussian vector can be expressed as an affine transformation of an i.i.d. vector of standard Gaussians.
Proposition 7.3.1. If $X \sim \mathcal{N}(\mu, \Sigma)$ then
$$h(X) = \frac{1}{2} \log_2\big((2\pi e)^n \det(\Sigma)\big).$$
Proof: If $X \sim \mathcal{N}(0, I)$ then $X$ has i.i.d. $\mathcal{N}(0, 1)$ entries, so that
$$h(X) = \sum_{i=1}^n h(X_i) = \frac{n}{2} \log_2(2\pi e).$$
Now consider $Y = \mu + \Sigma^{\frac{1}{2}} X$. Then $Y \sim \mathcal{N}(\mu, \Sigma)$ and
$$h(Y) = h(X) + \log_2 \det \Sigma^{\frac{1}{2}} = \frac{1}{2} \log_2\big((2\pi e)^n \det(\Sigma)\big),$$
proving the result.
7.3.2 The Gaussian Distribution Maximizes Entropy
One of the reasons why the multivariate Gaussian distribution is ubiquitous in information theory is the fact that it is an entropy maximizer. Namely, if one knows the mean and covariance of $X$, then its differential entropy is always upper bounded by the differential entropy of a Gaussian vector with the same mean and covariance matrix. This result has interesting applications in statistical modelling: if one must model some uncertain parameter by a distribution, and the only information available is its first and second moments, then considering the Gaussian distribution is natural, as it follows the so-called maximum entropy principle for modelling. Another important application of this result is the computation of the capacity of Gaussian channels, and more generally the derivation of capacity bounds for various types of channels.
Proposition 7.3.2. Consider $X \in \mathbb{R}^n$ with covariance matrix $\Sigma$. Then
$$h(X) \leq \frac{1}{2} \log_2\big((2\pi e)^n \det(\Sigma)\big)$$
with equality if and only if $X$ has a Gaussian distribution.
Proof: Denote by $p(x)$ the density of $X$ and by $\mu$ its mean. Define $Y \sim \mathcal{N}(\mu, \Sigma)$ with density $q(x)$, so that
$$0 \leq D(p \| q) = \mathbb{E}\Big(\log_2 \frac{p(X)}{q(X)}\Big)$$
and
$$h(X) = \mathbb{E}\Big(\log_2 \frac{1}{p(X)}\Big) \leq \mathbb{E}\Big(\log_2 \frac{1}{q(X)}\Big).$$
Since
$$q(x) = \frac{1}{\sqrt{(2\pi)^n \det(\Sigma)}}\, e^{-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)},$$
we have
$$\begin{aligned} \mathbb{E}\Big(\log_2 \frac{1}{q(X)}\Big) &= \frac{1}{2} \log_2\big((2\pi)^n \det(\Sigma)\big) + \frac{1}{2 \log 2}\, \mathbb{E}\big((X - \mu)^\top \Sigma^{-1} (X - \mu)\big) \\ &= \frac{1}{2} \log_2\big((2\pi)^n \det(\Sigma)\big) + \frac{1}{2 \log 2}\, \mathbb{E}\big((Y - \mu)^\top \Sigma^{-1} (Y - \mu)\big) = \mathbb{E}\Big(\log_2 \frac{1}{q(Y)}\Big) = h(Y), \end{aligned}$$
where the second equality holds because $\mathbb{E}\big((X - \mu)^\top \Sigma^{-1} (X - \mu)\big)$ only depends on the covariance matrix of $X$, which $Y$ shares. This proves the result.
7.4 Capacity of Continuous Channels
Consider a continuous, memoryless channel. As in the discrete case, communicating over a continuous channel follows the same paradigm. One may define
codebooks, error probabilities and achievable rates. Furthermore, the noisy channel
coding theorem still applies: the information capacity of the channel is also the
supremum of all achievable rates, and any achievable rate can be attained using the
random coding strategy, coupled with typicality decoding.
7.5 Gaussian Channels
7.5.1 Gaussian Channel
Definition 7.5.1. The Gaussian channel with power $P$ is given by
$$Y = X + Z$$
where $Z \sim \mathcal{N}(0, N)$, and the input must satisfy $\mathbb{E}(X^2) \leq P$.
The Gaussian channel is the simplest model for communication between a
transmitter and a receiver when the only perturbation is additive noise. Gaussian
noise is often a good model whenever the perturbation is the result of many small,
independent sources of perturbation, from the central limit theorem. We compute
the capacity of this channel by maximizing mutual information, and the power
constraint E(X 2 ) ≤ P is necessary, otherwise, the capacity of the channel is simply
infinite.
Proposition 7.5.2. The information capacity of the Gaussian channel with power $P$ is
$$C = \max_{X : \mathbb{E}(X^2) \leq P} I(X; Y) = \frac{1}{2} \log_2\Big(1 + \frac{P}{N}\Big)$$
and the optimal input is $X \sim \mathcal{N}(0, P)$.
Proof: When $X$ is fixed, $Y \sim \mathcal{N}(X, N)$, so that
$$h(Y | X) = \frac{1}{2} \log_2(2\pi e N).$$
On the other hand, from independence,
$$\mathbb{E}(Y^2) = \mathbb{E}((X + Z)^2) = \mathbb{E}(X^2) + 2\mathbb{E}(X)\mathbb{E}(Z) + \mathbb{E}(Z^2) \leq P + N.$$
Therefore
$$h(Y) \leq \frac{1}{2} \log_2\big(2\pi e (N + P)\big),$$
with equality iff $Y$ is Gaussian. Finally,
$$I(X; Y) = h(Y) - h(Y|X) \leq \frac{1}{2} \log_2\big(2\pi e(N + P)\big) - \frac{1}{2} \log_2(2\pi e N) = \frac{1}{2} \log_2\Big(1 + \frac{P}{N}\Big),$$
with equality if and only if $Y$ is Gaussian, which concludes the proof.
The capacity of this channel is an increasing function of the signal-to-noise ratio (SNR) $\frac{P}{N}$. When the SNR is small the capacity is roughly linear in the SNR, but when the SNR is large the capacity is logarithmic. This shows that, when communicating over a Gaussian channel, increasing the power leads to better performance, but one quickly runs into diminishing returns.
7.5.2 The AWGN Channel
A variant of the Gaussian channel is the Additive White Gaussian Noise (AWGN) channel, where the input is a continuous-time signal and this input is perturbed by a continuous-time process called white noise. For instance, in almost all wireless communication systems, communication is impaired by Johnson-Nyquist noise, which is unwanted noise generated by the thermal agitation of electrons, and Johnson-Nyquist noise can usually be modelled by white noise.
Definition 7.5.3. The AWGN (Additive White Gaussian Noise) channel is given by
$$Y(t) = X(t) + Z(t)$$
where $X(t)$ is band-limited in $[-W, W]$ with total power $P$ and $Z(t)$ is white Gaussian noise with power spectral density $N_0$.
Proposition 7.5.4. The capacity of the AWGN channel is given by
$$C = W \log_2\Big(1 + \frac{P}{W N_0}\Big).$$
Proof: From the Nyquist sampling theorem, the AWGN channel is equivalent to $2W$ parallel, identical Gaussian channels, hence the result.
We notice that for infinite bandwidth $W \to \infty$ (low SNR),
$$C \xrightarrow[W \to \infty]{} \frac{P}{N_0} \log_2 e,$$
and that, just like in the previous case, there exists a power-bandwidth tradeoff: $C$ is linear in $W$ but logarithmic in $\frac{P}{W N_0}$. This explains why, in most wireless communication systems, increasing the bandwidth yields much larger gains than increasing the power, especially if the SNR of the typical user is already high. Also, the formula for the capacity of the AWGN channel allows us to predict the performance of many practical communication systems, past and present. While the capacity is an upper bound on the best performance that can be achieved in ideal conditions (infinite processing power for coding and decoding, for instance), the formula allows us to roughly predict the typical performance, provided one knows the typical SNR as well as the bandwidth. Here are three illustrative examples. Telephone lines: $W = 3.3$ kHz, $\frac{P}{W N_0} = 33$ dB, $C \approx 36$ kbit/s. WiFi: $W = 40$ MHz, $\frac{P}{W N_0} = 30$ dB, $C \approx 400$ Mbit/s. 4G networks: $W = 20$ MHz, $\frac{P}{W N_0} = 20$ dB, $C \approx 133$ Mbit/s.
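A quick sanity check of these numbers (a minimal sketch; converting dB to a linear SNR is the only assumption):

```python
from math import log2

def awgn_capacity(W_hz, snr_db):
    """C = W log2(1 + SNR) for an AWGN channel of bandwidth W."""
    snr = 10 ** (snr_db / 10)
    return W_hz * log2(1 + snr)

for name, W, snr_db in [("telephone", 3.3e3, 33), ("WiFi", 40e6, 30), ("4G", 20e6, 20)]:
    print(name, f"{awgn_capacity(W, snr_db):,.0f} bit/s")
```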
7.5.3 Parallel Gaussian Channels
In many communication systems, one can in fact use multiple channels all at once
in order to communicate. If those channels are Gaussian, and they are independent
from each other, the model is called parallel Gaussian channels.
Definition 7.5.5. A set of parallel Gaussian channels with total power $P$ is
$$Y_j = X_j + Z_j, \quad j = 1, \ldots, k,$$
where $Z_j \sim \mathcal{N}(0, N_j)$, $j = 1, \ldots, k$, are independent, and the input must satisfy $\sum_{j=1}^k \mathbb{E}(X_j^2) \leq P$.
In the context of communication, in particular wireless communication, this
model covers communication over parallel links, over distinct frequency bands
and distinct antennas, all of which are important components of modern wireless
systems. The main question is how one should allocate the available power to the
various channels. Certainly, if the noise variance is the same across all channels,
the problem is trivial and one can simply allocate power uniformly, but in general
if some channels are much better than others, the problem is non-trivial.
We now compute the capacity of parallel Gaussian channels by computing the
optimal power allocation across channels.
Proposition 7.5.6. The capacity of parallel Gaussian channels is given by
$$C = \frac{1}{2} \sum_{j=1}^k \log_2\Big(1 + \frac{(\lambda^\star - N_j)^+}{N_j}\Big)$$
with $\lambda^\star$ the unique solution to $\sum_{j=1}^k (\lambda^\star - N_j)^+ = P$.
Proof: We need to solve the optimization problem
$$\text{maximize}_{P_1, \ldots, P_k \geq 0} \quad \sum_{j=1}^k \log_2\Big(1 + \frac{P_j}{N_j}\Big) \quad \text{subject to} \quad \sum_{j=1}^k P_j \leq P.$$
From Lagrangian duality, this can be done by solving
$$\text{maximize}_{P_1, \ldots, P_k \geq 0} \quad \sum_{j=1}^k \log_2\Big(1 + \frac{P_j}{N_j}\Big) + \mu \sum_{j=1}^k P_j.$$
Setting the gradient to 0 above yields
$$\frac{1}{N_j} \cdot \frac{1}{1 + \frac{P_j}{N_j}} + \mu = 0.$$
Therefore either $P_j = 0$ or $P_j + N_j = -\frac{1}{\mu} \equiv \lambda$, and summing over $j$ to get $\sum_{j=1}^k P_j = P$ yields the correct value of $\lambda$. This concludes the proof.
The optimal power allocation is called the "water-filling" solution, due to the fact that the power allocated to channel $i$ is either 0 or equal to $\lambda^\star - N_i$, where $\lambda^\star$ is selected to make sure that the total power allocated equals $P$. This implies that very noisy channels are ignored. Having parallel channels enables a multiplexing gain: indeed, capacity is linear in the number of channels. Finally, one may show that the result also applies to time-varying channels and band-limited channels.
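As an illustration, here is a minimal sketch of water-filling by bisection on the level $\lambda$, mirroring the reverse water-filling sketch of Chapter 5; the tolerance and the example noise levels are arbitrary choices.

```python
import numpy as np

def waterfilling(N, P, tol=1e-12):
    """Allocate P_j = (lam - N_j)^+ with sum_j P_j = P; returns (P_j, capacity in bits)."""
    N = np.asarray(N, dtype=float)
    lo, hi = N.min(), N.max() + P
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if np.maximum(mid - N, 0.0).sum() > P:   # too much water: lower the level
            hi = mid
        else:
            lo = mid
    Pj = np.maximum((lo + hi) / 2 - N, 0.0)
    return Pj, 0.5 * np.log2(1 + Pj / N).sum()

Pj, C = waterfilling([1.0, 2.0, 5.0], P=4.0)
print(Pj, C)    # the noisiest channel receives little or no power
```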
7.5.4 Vector Gaussian Channels
A generalization of parallel Gaussian channels is the vector Gaussian channel, where the correlation matrix of the noise vector can be arbitrary.
Definition 7.5.7. A vector Gaussian channel with total power $P$ is
$$Y^k = X^k + Z^k$$
where $Z^k \sim \mathcal{N}(0, \Sigma_Z)$ and the input satisfies $\mathbb{E}\big((X^k)^\top X^k\big) \leq P$.
This model for instance allows us to describe wireless communication systems with multiple-input multiple-output (MIMO) antennas, where both the receiver and the transmitter can use several antennas to communicate. It can also be used to describe non-memoryless Gaussian channels, where the entries of $X^k$ and $Y^k$ describe the successive values of the input and output across time. As in the case of parallel Gaussian channels, we now derive the optimal input and transmission strategy.
Proposition 7.5.8. The capacity of vector Gaussian channels is given by
$$C = \frac{1}{2} \sum_{j=1}^k \log_2\Big(1 + \frac{(\nu - \lambda_j)^+}{\lambda_j}\Big)$$
with $(\lambda_1, \ldots, \lambda_k) = \mathrm{eig}(\Sigma_Z)$ and $\nu$ the unique solution to $\sum_{j=1}^k (\nu - \lambda_j)^+ = P$.
Proof: Since $\Sigma_Z$ is real and symmetric, there exists $U$ with $U^\top U = I$ and
$$\Sigma_Z = U^\top \mathrm{diag}(\lambda_1, \ldots, \lambda_k)\, U.$$
Multiplying by $U$,
$$U Y^k = U X^k + U Z^k.$$
This defines a new channel
$$\bar{Y}^k = \bar{X}^k + \bar{Z}^k.$$
We have
$$(\bar{X}^k)^\top \bar{X}^k = (X^k)^\top U^\top U X^k = (X^k)^\top X^k, \qquad \Sigma_{\bar{Z}} = U \Sigma_Z U^\top = \mathrm{diag}(\lambda_1, \ldots, \lambda_k).$$
This is the same as $k$ parallel Gaussian channels with noise variances $\lambda_1, \ldots, \lambda_k$, which concludes the proof.
It turns out that the optimal power allocation is to perform water-filling on the eigenvalues of the noise correlation matrix. The main idea is that one can always reduce a vector Gaussian channel to $k$ parallel channels by a rotation, so that after the rotation the noise correlation matrix becomes diagonal.
Chapter 8
Portfolio Theory
In this chapter, we illustrate how, perhaps surprisingly, information theoretic
techniques can be used in order to design investment strategies in financial markets,
in the context of the so-called portfolio theory.
8.1 A Model for Investment
8.1.1 Asset Prices and Portfolios
We consider the following model for investment in a financial market. At the start of the process, the investor has a starting wealth $S_0$. The process is sequential: at the start of day $n \in \mathbb{N}$ the investor has wealth $S_n$, observes the stock prices at the opening, denoted by $(P_{n,1}, \ldots, P_{n,m})$, and chooses a portfolio $(b_{n,1}, \ldots, b_{n,m})$ with $\sum_{i=1}^m b_{n,i} = 1$. He invests the amount of wealth $b_{n,i} S_n$ in asset $i$ by buying $\frac{b_{n,i} S_n}{P_{n,i}}$ units of asset $i$.
At the end of day $n$, he observes the closing prices $(P'_{n,1}, \ldots, P'_{n,m})$ and realizes his profits and losses, so that the wealth available at the start of day $n+1$ equals
$$\frac{S_{n+1}}{S_n} = \sum_{j=1}^m b_{n,j}\, \frac{P'_{n,j}}{P_{n,j}}.$$
By recursion, the wealth at any given time can be written as
$$\frac{S_n}{S_0} = \prod_{i=1}^{n-1} \Big( \sum_{j=1}^m b_{i,j}\, \frac{P'_{i,j}}{P_{i,j}} \Big).$$
8.1.2 Relative Returns
The model can be written in a simpler form by defining $(X_{n,1}, \ldots, X_{n,m})$ with
$$X_{n,i} = \frac{P'_{n,i}}{P_{n,i}}$$
the relative return of asset $i$ at time $n$, so that the wealth evolution is
$$\log_2 \frac{S_n}{S_0} = \sum_{i=1}^{n-1} \log_2\Big( \sum_{j=1}^m b_{i,j} X_{i,j} \Big).$$
Indeed, the relative returns of each asset are sufficient in order to predict the evolution of the wealth. Throughout the chapter we will assume that the vectors of relative returns $X_n = (X_{n,1}, \ldots, X_{n,m})$ are i.i.d. with some fixed distribution $F$.
8.2 Log Optimal Portfolios
8.2.1 Asymptotic Wealth Distribution
The investor wishes to design portfolio strategies that maximize his wealth $\log S_n$ in some sense. Since the investor monitors the market on a daily basis, he may choose an investment strategy that depends on the previous returns $X_{n-1}, \ldots, X_1$ as well as his previous decisions. Of course, since the wealth is a random variable, there are several acceptable criteria to maximize, and we will propose one such criterion.
Proposition 8.2.1. Consider a constant investment strategy where $b_n = (b_{n,1}, \ldots, b_{n,m}) = b$ does not depend on $n$. Then
$$\frac{1}{n} \log_2 \frac{S_n}{S_0} \xrightarrow[n \to \infty]{} W(b, F) \quad \text{almost surely},$$
where $W(b, F)$ is called the growth rate of portfolio $b$:
$$W(b, F) = \mathbb{E}_{X \sim F}\Big( \log_2\Big( \sum_{i=1}^m b_i X_i \Big) \Big).$$
Proof: If the investment strategy is constant, then $\frac{1}{n} \log_2 \frac{S_n}{S_0}$ is an empirical average of i.i.d. random variables:
$$\frac{1}{n} \log_2 \frac{S_n}{S_0} = \frac{1}{n} \sum_{i=1}^{n-1} \log_2\Big( \sum_{j=1}^m b_j X_{i,j} \Big),$$
each term having expectation $W(b, F)$, so the strong law of large numbers yields the result.
The above proposition shows that, if the investor chooses a fixed investment strategy across time, then with high probability wealth will grow exponentially as a function of time:
$$S_n \approx S_0\, 2^{n W(b,F)},$$
and the exponent equals the growth rate of the portfolio $W(b, F)$. Perhaps surprisingly, if the growth rate is strictly positive, then with high probability the wealth asymptotically grows to infinity.
8.2.2 Growth Rate Maximization
Definition 8.2.2. The optimal growth rate $W^\star(F)$ is the value of
$$\text{maximize } W(b, F) \quad \text{subject to} \quad \sum_{i=1}^m b_i \leq 1 \text{ and } b \geq 0,$$
and an optimal portfolio $b^\star$ is an optimal solution to this problem.
The previous result suggests that, if the investor knows the distribution of the returns $F$, then he should select the portfolio maximizing the growth rate, to ensure that his wealth grows as rapidly as possible. While this is not the only possible objective function in portfolio theory, it comes with strong guarantees provided that the returns are indeed i.i.d. Other possible objective functions in portfolio theory are for instance linear combinations of the mean and variance of the returns, as there exists a trade-off between high-risk/high-return and low-risk/low-return portfolios.
Another interesting observation is that maximizing the growth rate is usually different from maximizing the expected return $\mathbb{E}(\sum_{i=1}^{m} b_i X_i)$, which is achieved by selecting $b_i = 1\{i = i^\star\}$ where $i^\star = \arg\max_i \mathbb{E}(X_i)$, i.e. the investor places all of his wealth on the stock with the highest average return, a risky strategy indeed. Maximizing the growth rate is usually much more conservative, due to the logarithm which places a heavy penalty on the wealth $\sum_{i=1}^{m} b_i X_i$ becoming very close to 0. In other words, maximizing the growth rate discourages portfolios that can bankrupt the investor in a day.
8.3 Properties of Log Optimal Portfolios
We now show how to compute the optimal portfolio maximizing the growth rate.
8.3.1 Kuhn-Tucker Conditions
Proposition 8.3.1. The optimal portfolio $b^\star$ is the only portfolio such that, for all $j$:
$$\mathbb{E}\left( \frac{X_j}{\sum_{i=1}^{m} b^\star_i X_i} \right) \begin{cases} = 1 & \text{if } b^\star_j > 0 \\ \le 1 & \text{if } b^\star_j = 0 \end{cases}$$
Proof: It is noted that $b \mapsto W(b, F)$ is a concave function, by concavity of the logarithm. From the KKT conditions, there exist $\lambda \ge 0$ and $\mu \ge 0$ with $\mu_i b^\star_i = 0$ for all $i$, such that:
$$\nabla W(b^\star, F) - \lambda \mathbf{1} + \mu = 0$$
Since $\mu \ge 0$, we have for all $i$:
$$\frac{\partial}{\partial b_i} W(b^\star, F) \le \lambda$$
and furthermore, if $b^\star_i \ne 0$, then $\mu_i = 0$ so that:
$$\frac{\partial}{\partial b_i} W(b^\star, F) = \lambda$$
By definition of $W$:
$$\frac{\partial}{\partial b_i} W(b^\star, F) = \frac{1}{\log(2)}\, \mathbb{E}\left( \frac{X_i}{\sum_{j=1}^{m} b^\star_j X_j} \right)$$
Multiplying the above by $b^\star_i$, summing over $i$, and using $\sum_{i=1}^{m} b^\star_i = 1$ shows that:
$$\frac{1}{\log(2)}\, \mathbb{E}\left( \frac{\sum_{i=1}^{m} b^\star_i X_i}{\sum_{i=1}^{m} b^\star_i X_i} \right) = \lambda \sum_{i=1}^{m} b^\star_i = \lambda$$
Therefore $\lambda = \frac{1}{\log(2)}$, and replacing yields the result.
The KKT conditions are necessary and sufficient conditions for the optimality of the portfolio, and if $F$ is known, one can search for the optimal portfolio using an iterative scheme such as projected gradient ascent.
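As an illustration of this remark, here is a small numerical sketch (ours; the market and all names are hypothetical): we maximize an empirical estimate of $W(b,F)$ by projected gradient ascent over the simplex and then check the Kuhn-Tucker conditions, i.e. that $\mathbb{E}(X_j/\sum_i b^\star_i X_i)$ is close to 1 on the support of $b^\star$ and at most 1 elsewhere.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical market with 3 assets, represented by a large i.i.d. sample of
# return vectors; we maximize the empirical growth rate mean(log2(X b)).
N = 20_000
X = np.column_stack([
    np.ones(N),                          # cash
    rng.choice([1.8, 0.6], size=N),      # volatile asset
    rng.choice([1.1, 0.95], size=N),     # mild asset
])

def project_simplex(v):
    # Euclidean projection onto {b >= 0, sum(b) = 1} (sort-based algorithm).
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v + (1.0 - css[rho]) / (rho + 1), 0.0)

b = np.full(X.shape[1], 1.0 / X.shape[1])
for _ in range(3000):                    # projected gradient ascent
    grad = (X / (X @ b)[:, None]).mean(axis=0) / np.log(2)
    b = project_simplex(b + 0.05 * grad)

# Kuhn-Tucker check: E[X_j / sum_i b_i X_i] should be ~1 where b_j > 0, <= 1 otherwise.
print(b, (X / (X @ b)[:, None]).mean(axis=0))
\end{verbatim}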
8.3.2 Asymptotic Optimality
So far, we have only considered constant strategies where the investor uses the same portfolio at all times, and for such strategies the best achievable wealth growth is given by the growth rate. One can then wonder whether it is possible to do better by using history dependent strategies, where the investor's decision at time $n$ depends on the observed returns up to time $n-1$.
Definition 8.3.2. A portfolio strategy is said to be causal if for all n, bn,1 , ..., bn,m
is solely a function of (Xn′ ,1 , ..., Xn′ ,m ) for n′ < n.
Proposition 8.3.3. For any causal portfolio strategy,
$$\frac{1}{n}\, \mathbb{E}\left( \log_2 \frac{S_n}{S_0} \right) \le W(b^\star, F)$$
with equality if one selects $b^\star$, the maximizer of $W(b, F)$, at all times, i.e. constant strategies are optimal.
Proof: The expected log wealth is given by
$$\frac{1}{n}\, \mathbb{E}\left( \log_2 \frac{S_n}{S_0} \right) = \frac{1}{n} \sum_{i=1}^{n-1} \mathbb{E}\left( \log_2\left( \sum_{j=1}^{m} b_{i,j} X_{i,j} \right) \right)$$
For any $i$, when $(b_{i,1}, \dots, b_{i,m})$ is an arbitrary function of $(X_{n',1}, \dots, X_{n',m})$ for $n' < i$, the optimal choice is to select the maximizer of:
$$\mathbb{E}\left( \log_2\left( \sum_{j=1}^{m} b_{i,j} X_{i,j} \right) \Big| (X_{n',1}, \dots, X_{n',m}),\ n' < i \right) = \mathbb{E}\left( \log_2\left( \sum_{j=1}^{m} b_{i,j} X_{i,j} \right) \right)$$
since $(X_{i,1}, \dots, X_{i,m})$ is independent of $(X_{n',1}, \dots, X_{n',m})$, $n' < i$. Therefore, for each $i$, $(b_{i,1}, \dots, b_{i,m})$ can be chosen as the maximizer of $W(b, F)$, and constant strategies are optimal.
Interestingly, in our setting causal strategies yield no gains with respect to constant strategies. Therefore, the best achievable performance with causal strategies is
still given by the growth rate. Of course, this is only true if F is known to the investor,
and the returns are i.i.d. If F were unknown, then the investor should change his
decisions as more and more returns are observed. Similarly if the returns have a
significant correlation in time, then the investment strategy should be time varying,
as the returns observed up to time n − 1 can be used to predict the returns at time
n and choose a portfolio intelligently.
8.4 Investment with Side Information
Finally, we investigate how much side information available to the investor may
increase his performance, and how much having imperfect knowledge about the
market can decrease his performance.
8.4.1 Mismatched Portfolios
So far, we have assumed that the investor knows the distribution of the relative returns $F$, and in that case the optimal choice is to select a portfolio maximizing the growth rate $W(b, F)$. However, in practice full knowledge of $F$ is not available, and $F$ must be estimated somehow, for instance using historical data. Consider the case where the investor knows $G$, an estimate of $F$, and selects the portfolio that would be optimal if $G$ were equal to the unknown $F$. A natural question is how to assess how much wealth is lost due to the imperfect knowledge of $F$.
Proposition 8.4.1. Consider two distributions $F$ and $G$, and the corresponding log optimal portfolios $b^\star_F$ and $b^\star_G$, which maximize $W(b, F)$ and $W(b, G)$ respectively. Then we have that
$$W(b^\star_F, F) - W(b^\star_G, F) \le D(F||G)$$
In other words, the amount of growth rate lost by the investor due to his imperfect knowledge is upper bounded by the relative entropy between the true distribution $F$ and his estimate $G$. So the wealth of an investor with perfect knowledge will be approximately $2^{nW(b^\star_F, F)}$, while the wealth of an investor with imperfect knowledge will be approximately (at least) $2^{n[W(b^\star_F, F) - D(F||G)]}$. It should also be noted that this bound is tight for some distributions of $X$. This is indeed a surprising link between portfolio theory and information theory.
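As a sanity check (ours), the following sketch evaluates the loss $W(b^\star_F,F) - W(b^\star_G,F)$ and the bound $D(F||G)$ on a toy market with one cash asset and one two-outcome risky asset; for this particular market the bound happens to be essentially attained, illustrating the tightness remark above. All numerical values are arbitrary.
\begin{verbatim}
import numpy as np

# Toy market: asset 1 is cash, asset 2 returns u or d; the "up" probability is
# F under the true distribution and G under the investor's estimate.
u, d = 2.0, 0.5
F, G = 0.60, 0.45

grid = np.linspace(0.0, 1.0, 10_001)     # fraction b invested in the risky asset

def W(b, p):
    # Growth rate (bits/day) of investing fraction b in the risky asset when P(up) = p.
    return p * np.log2(1 - b + b * u) + (1 - p) * np.log2(1 - b + b * d)

bF = grid[np.argmax(W(grid, F))]         # log optimal portfolio under the true F
bG = grid[np.argmax(W(grid, G))]         # log optimal portfolio under the estimate G
loss = W(bF, F) - W(bG, F)
D = F * np.log2(F / G) + (1 - F) * np.log2((1 - F) / (1 - G))
print(loss, D)    # loss <= D; here the bound is essentially attained
\end{verbatim}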
8.4.2 Exploiting Side Information
Now consider the scenario where the investor may use side information before selecting a portfolio. The goal is still to maximize the growth rate, which depends on the distribution of the returns $X$; however, the investor has access to another random variable $Y$, which is hopefully useful in order to predict $X$. Two examples of scenarios are: financial advice, where $Y$ is the prediction of some expert (or experts) that the investor may choose to consult before making a decision, and correlated returns, in which the returns are not i.i.d. anymore so that $X_n$ can be predicted as a function of $Y = (X_{n'})_{n' < n}$.
Definition 8.4.2. The growth rate of portfolio $b$ with side information $Y$ is:
$$W(b, F|Y) = \mathbb{E}\left( \log_2\left( \sum_{i=1}^{m} b_i X_i \right) \Big| Y \right)$$
If the investor has access to side information Y , then he should select the portfolio maximizing W (b, F |Y ), and while this certainly yields a better performance
compared to the case with no side information, one can wonder how much growth
rate is gained with side information (for instance in the case where the investor
must pay some premium in order to access the side information). Intuitively, this
should depend on how much X and Y are correlated.
Proposition 8.4.3. Consider $b^\star$ the log optimal portfolio maximizing $W(b, F)$ and $b^\star_{|Y}$ the log optimal portfolio with side information maximizing $W(b, F|Y)$. Then we have
$$0 \le W(b^\star_{|Y}, F|Y) - W(b^\star, F) \le I(X; Y)$$
Proof: If $Y = y$, from our previous result, the loss of growth rate between an investor who assumes that $X$ has distribution $G = p_X$ and an investor who knows the actual distribution $F = p_{X|Y=y}$ is at most
$$D(p_{X|Y=y} || p_X) = \sum_{x \in \mathcal{X}} p_{X|Y}(x|y) \log_2 \frac{p_{X|Y}(x|y)}{p_X(x)}$$
Averaging this loss over $Y$ gives:
$$\sum_{y} \sum_{x \in \mathcal{X}} p_Y(y)\, p_{X|Y}(x|y) \log_2 \frac{p_{X|Y}(x|y)}{p_X(x)} = I(X; Y)$$
which is the announced result.
Here we discover another surprising connection between portfolio theory and information theory: the amount of growth rate that can be gained from side information is at most the mutual information between the returns and the side information, $I(X; Y)$. This makes sense since $I(X; Y)$ measures the dependence between $X$ and $Y$. To illustrate, consider two extremes: if $Y$ is independent of $X$, then $I(X; Y) = 0$ and the side information yields no benefit; if $Y = X$, then $I(X; Y) = H(X)$ and the gain is at most the entropy of $X$.
Chapter 9
Information Theory for Machine Learning and Statistics
In this chapter, we illustrate how information theoretic techniques can be used to
solve problems in statistics and machine learning.
9.1 Statistics
9.1.1 Statistical Inference
Assume that we are given n data points X1 , ..., Xn in a finite set X drawn i.i.d.
from some unknown distribution p. We would like to perform statistical inference,
meaning that we would like to learn information about the unknown distribution p,
solely by observing the data points X1 , ..., Xn . Of course, depending on what kind
of information we wish to obtain, the resulting problems can be vastly different.
We give a few examples.
9.1.2 Examples of Inference Problems
Density Estimation We would like to construct $\hat{p}$, an estimator of $p$, with the goal of minimizing $\mathbb{E}(\ell(p, \hat{p}))$ where $\ell$ is some loss function. The loss function quantifies how close the true $p$ is to its estimate $\hat{p}$.
Parameter Estimation We assume that $p$ is parameterized by some parameter $\theta$ (write it $p_\theta$). We would like to construct $\hat{\theta}$, an estimator of $\theta$, with the goal of minimizing $\mathbb{E}(\ell(\theta, \hat{\theta}))$ where $\ell$ is some loss function.
Binary Hypothesis Testing We partition the set of distributions as $H_0 \cup H_1$ and we would like to know if $p$ lies in $H_0$ or $H_1$. We would like to compute a well chosen function of the data $T$ such that both $\mathbb{P}(T = 0 | p \in H_0)$ and $\mathbb{P}(T = 1 | p \in H_1)$ are close to 1.
9.1.3 Empirical Distributions
To obtain information about p, the most natural strategy is to compute the empirical
distribution of the data, i.e. the frequency at which each possible symbol a ∈ X
appears in the data.
Definition 9.1.1. Consider a sequence $x_1, \dots, x_n$ in $\mathcal{X}$; its empirical probability distribution $P_{x^n}$ is given by
$$P_{x^n}(a) = \frac{1}{n} \sum_{i=1}^{n} 1\{x_i = a\}$$
for $a \in \mathcal{X}$. Alternatively, we call $P_{x^n}$ the "type" of the sequence $x^n = (x_1, \dots, x_n)$.
It is noted that the type $P_{x^n}$ is indeed a distribution over $\mathcal{X}$, since it has nonnegative entries and sums to 1, and that it is an element of the set of probability distributions over $\mathcal{X}$:
$$\mathcal{P} = \Big\{ p \in (\mathbb{R}^+)^{\mathcal{X}} : \sum_{a \in \mathcal{X}} p(a) = 1 \Big\}$$
This set is often called the probability simplex, and has dimension $|\mathcal{X}| - 1$.
Computing the empirical distribution of the data is the most natural strategy because it converges to the true distribution when the number of data points grows large, as a consequence of the law of large numbers.
Proposition 9.1.2. If $X^n = (X_1, \dots, X_n)$ are drawn i.i.d. from distribution $Q$, then the type of $X^n$ converges to $Q$ almost surely.
Proof: From the law of large numbers, for any fixed $a \in \mathcal{X}$:
$$P_{X^n}(a) = \frac{1}{n} \sum_{i=1}^{n} 1\{X_i = a\} \xrightarrow[n \to \infty]{} \mathbb{P}(X_i = a) = Q(a) \quad \text{almost surely}$$
This holds for any $a$, which proves the result.
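A minimal sketch (ours) of Proposition 9.1.2: draw i.i.d. samples from an arbitrary distribution and watch the type converge.
\begin{verbatim}
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# Hypothetical alphabet and distribution Q.
alphabet = ['a', 'b', 'c']
Q = np.array([0.5, 0.3, 0.2])

def type_of(xn, alphabet):
    # Empirical distribution (type) of the sequence xn over the given alphabet.
    counts = Counter(xn)
    return np.array([counts[a] / len(xn) for a in alphabet])

for n in [10, 1_000, 100_000]:
    xn = rng.choice(alphabet, size=n, p=Q)
    print(n, type_of(xn, alphabet))      # approaches Q as n grows
\end{verbatim}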
9.2 The Method of Types
The method of types is a very powerful information-theoretic technique to control the behaviour of the empirical distribution; it works as follows.
9.2.1 Probability Distribution of a Sample
The first step is to show that the distribution of an i.i.d. sample only depends on its
type, and that this distribution can be expressed in terms of entropy and relative
entropy.
Proposition 9.2.1. Consider $X^n = (X_1, \dots, X_n)$ i.i.d. from distribution $Q$; then the probability distribution of $X^n$ only depends on its type, and
$$\mathbb{P}(X^n = x^n) = 2^{-n[H(P_{x^n}) + D(P_{x^n}||Q)]}$$
Proof: Consider $X^n = (X_1, \dots, X_n)$ i.i.d. from distribution $Q$; then the probability distribution of $X^n$ only depends on its type, in the sense that
$$\mathbb{P}(X^n = x^n) = \prod_{i=1}^{n} Q(x_i) = \prod_{a \in \mathcal{X}} Q(a)^{\sum_{i=1}^{n} 1\{x_i = a\}} = \prod_{a \in \mathcal{X}} Q(a)^{n P_{x^n}(a)}$$
Indeed, the expression above only depends on the type Pxn , so that all sequences
that have the same type are equally likely to occur.
Furthermore, taking logarithms and dividing by $n$:
$$-\frac{1}{n} \log_2 \mathbb{P}(X^n = x^n) = \sum_{a \in \mathcal{X}} P_{x^n}(a) \log_2 \frac{1}{Q(a)} = \sum_{a \in \mathcal{X}} \left( P_{x^n}(a) \log_2 \frac{P_{x^n}(a)}{Q(a)} + P_{x^n}(a) \log_2 \frac{1}{P_{x^n}(a)} \right) = H(P_{x^n}) + D(P_{x^n}||Q)$$
Hence we have proven that:
$$\mathbb{P}(X^n = x^n) = 2^{-n[H(P_{x^n}) + D(P_{x^n}||Q)]}$$
So not only does the probability of a sequence only depend on its type, but the
exponent is equal to the sum of the entropy of the type, and the relative entropy
between the type and the true distribution. This implies that the most likely type is
the true distribution, and also that, when n is large, types that are far away from
the true distribution are very unlikely to occur.
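The identity of Proposition 9.2.1 can be checked numerically on any fixed sequence; a short sketch (ours), with an arbitrary alphabet, distribution and sequence:
\begin{verbatim}
import numpy as np

Q = np.array([0.5, 0.3, 0.2])                      # distribution over {0, 1, 2}
xn = np.array([0, 2, 1, 0, 0, 1, 2, 0])            # any fixed sequence
n = len(xn)

P = np.bincount(xn, minlength=len(Q)) / n          # type of the sequence
direct = np.prod(Q[xn])                            # product of Q(x_i)

mask = P > 0                                       # skip 0 log 0 terms
H = -np.sum(P[mask] * np.log2(P[mask]))            # entropy of the type
D = np.sum(P[mask] * np.log2(P[mask] / Q[mask]))   # relative entropy D(P || Q)
via_types = 2.0 ** (-n * (H + D))

print(direct, via_types)                           # the two computations agree
\end{verbatim}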
9.2.2 Number of Types
The second step is to show that the number of possible types is not very large, in the sense that there are at most polynomially many types in $n$ when $\mathcal{X}$ is fixed.
Proposition 9.2.2. The type $P_{x^n}$ lies in
$$\mathcal{P}_n = \mathcal{P} \cap \{ p : n\, p(a) \in \{0, \dots, n\},\ a \in \mathcal{X} \}$$
and the number of types is at most:
$$|\mathcal{P}_n| \le (n+1)^{|\mathcal{X}|}$$
Proof: One can readily check that the entries of $P_{x^n}$ are integer multiples of $1/n$ by definition. Furthermore,
$$|\mathcal{P}_n| \le (n+1)^{|\mathcal{X}|}$$
since $|\mathcal{P}_n|$ is the number of vectors whose components are nonnegative integer multiples of $1/n$ and sum to 1, while $(n+1)^{|\mathcal{X}|}$ is the number of vectors whose components are nonnegative integer multiples of $1/n$ that are at most 1.
9.2.3 Size of Type Class
The third step is to estimate the number of sequences which have a given type. For a given type $P \in \mathcal{P}_n$, denote by
$$T(P) = \{ x^n \in \mathcal{X}^n : P_{x^n} = P \}$$
the set of sequences of type $P$, called the type class of $P$.
Proposition 9.2.3. For any type $P \in \mathcal{P}_n$ we have:
$$(n+1)^{-|\mathcal{X}|}\, 2^{nH(P)} \le |T(P)| \le 2^{nH(P)}$$
Proof: Consider $X^n$ drawn i.i.d. from a distribution $Q$. Since the probability of a sequence only depends on its type:
$$1 = \sum_{x^n \in \mathcal{X}^n} \mathbb{P}(X^n = x^n) = \sum_{P \in \mathcal{P}_n} |T(P)|\, 2^{-n[H(P) + D(P||Q)]}$$
An upper bound on the size of a type class follows, since each term of the sum is at most 1:
$$1 \ge |T(P)|\, 2^{-n[H(P) + D(P||Q)]}$$
This holds for any $Q$, so that taking $Q = P$:
$$|T(P)| \le 2^{nH(P)}$$
A lower bound can be derived by observing that
$$1 \le |\mathcal{P}_n| \max_{P \in \mathcal{P}_n} |T(P)|\, 2^{-n[H(P) + D(P||Q)]}$$
One may check that the maximum in the above occurs for $P = Q$, which gives
$$|T(P)| \ge (n+1)^{-|\mathcal{X}|}\, 2^{nH(P)}$$
where we used the fact that $|\mathcal{P}_n| \le (n+1)^{|\mathcal{X}|}$.
A few important observations are in order. First, the entropy $H(P)$ provides an estimate of the number of sequences with type $P$, and this estimate is accurate in the exponent, in the sense that when $n \to \infty$:
$$\frac{1}{n} \log_2 |T(P)| = H(P) + o(1)$$
Second, both the type class $T(P)$ and the typical set of an i.i.d. sample with distribution $P$ have approximately the same size. Third, the size of type classes grows exponentially with $n$, but the number of type classes grows only polynomially in $n$. Finally, consider two types $P$ and $P'$ with $H(P) < H(P')$; then when $n$ is large, $T(P')$ will be overwhelmingly larger than $T(P)$. We will leverage these observations to derive powerful results.
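For a binary alphabet the type class size is a binomial coefficient, so the bounds of Proposition 9.2.3 can be checked directly; a small sketch (ours):
\begin{verbatim}
import numpy as np
from math import comb

n = 30
for k in [0, 5, 15, 25]:                    # number of ones in the type
    P = np.array([1 - k / n, k / n])
    size = comb(n, k)                       # |T(P)| for the binary alphabet
    mask = P > 0
    H = -np.sum(P[mask] * np.log2(P[mask]))
    lower = (n + 1) ** (-2) * 2 ** (n * H)  # (n+1)^(-|X|) 2^(nH(P)), with |X| = 2
    upper = 2 ** (n * H)
    print(k, lower <= size <= upper, size, upper)
\end{verbatim}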
9.3 Large Deviations and Sanov's Theorem
Using the method of types we can now derive Sanov's theorem, which enables us to control the fluctuations of the empirical distribution $P_{X^n}$ around the true distribution $Q$ when $X^n$ is drawn i.i.d. from $Q$.
9.3.1 Sanov's Theorem
Proposition 9.3.1. Consider $X^n = (X_1, \dots, X_n)$ drawn i.i.d. from distribution $Q$, and consider $E \subset \mathcal{P}$ a set of distributions. Then
$$\mathbb{P}(P_{X^n} \in E) \le (n+1)^{|\mathcal{X}|}\, 2^{-n D(P^\star||Q)}$$
where
$$P^\star = \arg\min_{P \in E} D(P||Q)$$
Furthermore, if $E$ is the closure of its interior, then when $n \to \infty$:
$$-\frac{1}{n} \log_2 \mathbb{P}(P_{X^n} \in E) \to D(P^\star||Q)$$
Proof: Summing over the possible types:
$$\mathbb{P}(P_{X^n} \in E) = \sum_{P \in \mathcal{P}_n \cap E} \mathbb{P}(P_{X^n} = P)$$
The probability of type $P$ occurring is:
$$\mathbb{P}(P_{X^n} = P) = |T(P)|\, 2^{-n[H(P) + D(P||Q)]}$$
Using the fact that $|T(P)| \le 2^{nH(P)}$:
$$\mathbb{P}(P_{X^n} = P) \le 2^{-n D(P||Q)} \le 2^{-n D(P^\star||Q)}$$
using the fact that $P^\star$ minimizes $D(P||Q)$ over $E$. Summing the above over $P$ and using the fact that $|\mathcal{P}_n \cap E| \le (n+1)^{|\mathcal{X}|}$, we get the first result:
$$\mathbb{P}(P_{X^n} \in E) \le (n+1)^{|\mathcal{X}|}\, 2^{-n D(P^\star||Q)}$$
If $E$ is the closure of its interior, we can find a sequence of types $P_n \in \mathcal{P}_n \cap E$ such that, when $n \to \infty$,
$$D(P_n||Q) \to D(P^\star||Q)$$
and in turn
$$\mathbb{P}(P_{X^n} \in E) \ge \mathbb{P}(P_{X^n} = P_n)$$
with
$$\mathbb{P}(P_{X^n} = P_n) = |T(P_n)|\, 2^{-n[H(P_n) + D(P_n||Q)]} \ge (n+1)^{-|\mathcal{X}|}\, 2^{-n D(P_n||Q)}$$
using the fact that
$$|T(P_n)| \ge (n+1)^{-|\mathcal{X}|}\, 2^{nH(P_n)}$$
Taking logarithms, when $n \to \infty$:
$$-\frac{1}{n} \log_2 \mathbb{P}(P_{X^n} \in E) \to D(P^\star||Q)$$
which is the second result.
Sanov’s theorem enables us to predict the behavior of the empirical distribution
of an i.i.d. sample, and is a "large deviation" result, in the sense that it predicts
events with exponentially small probability. The empirical distribution typically
lies close to the true distribution Q, and when Q ̸∈ E, this means that PX n ∈ E is
unlikely. The theorem predicts that the probability of this event only depends on
P ⋆ , which can be interpreted as the "closest" distribution to Q, where "distance"
is measured by relative entropy. We will give several examples that illustrate the
power of this result.
9.3.2 Examples
We now highlight a few examples of how Sanov's theorem may be applied to various statistical problems.
Majority Vote Consider an election with two candidates, where $Q(1)$ and $Q(2)$ are the proportions of people who prefer candidates 1 and 2 respectively. We gather the votes $X_1, \dots, X_n$ of $n$ voters, which we assume to be drawn i.i.d. from $Q$. The candidate who wins is the one who gathers the most votes. Assume that $Q(1) > 1/2$, so that 1 is the favorite candidate. What is the probability that 2 gets elected in place of 1?
The votes $X^n = (X_1, \dots, X_n)$ are an i.i.d. sample from $Q$, and 2 gets elected if and only if $P_{X^n}(2) \ge 1/2$, so that he gets at least $n/2$ votes. So 2 gets elected if and only if $P_{X^n} \in E$ where
$$E = \{ P \in \mathcal{P} : P(2) \ge 1/2 \}$$
We can then apply Sanov's theorem to conclude that 2 gets elected in place of 1 with probability
$$\mathbb{P}(P_{X^n} \in E) \approx 2^{-n D(P^\star||Q)}$$
with $P^\star = (1/2, 1/2)$, so that
$$D(P^\star||Q) = \frac{1}{2} \log_2 \frac{1/2}{Q(2)} + \frac{1}{2} \log_2 \frac{1/2}{1 - Q(2)}$$
Indeed, for a distribution $P$ over the two candidates:
$$D(P||Q) = P(2) \log_2 \frac{P(2)}{Q(2)} + (1 - P(2)) \log_2 \frac{1 - P(2)}{1 - Q(2)}$$
and minimizing this quantity over $P(2)$ under the constraint $P(2) \ge 1/2$ gives $P(2) = 1/2$, since $Q(2) \le 1/2$.
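For a hypothetical value $Q(2) = 0.4$, the exponent evaluates to roughly $0.029$ bits, so the probability of an upset decays like $2^{-0.029\, n}$; a quick check (ours):
\begin{verbatim}
import numpy as np

q2 = 0.4                                   # hypothetical share of voters preferring 2
D = 0.5 * np.log2(0.5 / q2) + 0.5 * np.log2(0.5 / (1 - q2))
print(D, [2.0 ** (-n * D) for n in (100, 1_000, 10_000)])   # D ~ 0.029 bits
\end{verbatim}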
Testing Fairness Assume that one is given a die with $k$ faces, and we want to test whether or not the die is fair, in the sense that it is equally likely to fall on each of its faces. Consider $X^n = (X_1, \dots, X_n)$ the outcomes of casting the die $n$ times, where $X_i \in \mathcal{X}$ is the index of the face on which the die has fallen. To test fairness of the die we compute the empirical distribution $P_{X^n}$ and we compare it to $Q$, the uniform distribution over $\mathcal{X}$. Namely, if $D(P_{X^n}||Q) \le \epsilon$ we deem the die to be fair, and unfair otherwise.
What is the probability that we mistake a fair die for an unfair one?
If the die is fair, $X^n = (X_1, \dots, X_n)$ is an i.i.d. sample from $Q$, and we mistake it for an unfair die if and only if $P_{X^n} \in E$ where
$$E = \{ P \in \mathcal{P} : D(P||Q) \ge \epsilon \}$$
Hence from Sanov's theorem, the probability of a mistake is
$$\mathbb{P}(P_{X^n} \in E) \approx 2^{-n D(P^\star||Q)} \approx 2^{-n\epsilon}$$
with $D(P^\star||Q) = \min_{P \in E} D(P||Q) = \epsilon$. It is remarkable that Sanov's theorem allows for an easy, explicit computation.
Testing General Distributions It is also noted that the above works in the more general case where $Q$ is not the uniform distribution but simply some target distribution: namely, reject the hypothesis that $p = Q$ if $D(P_{X^n}||Q) \ge \epsilon$ and accept it otherwise.
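A sketch (ours) of this test for a hypothetical 6-faced die: compute the type of the sample and accept fairness when $D(P_{X^n}||Q) \le \epsilon$; by Sanov's theorem, a fair die is rejected with probability roughly $2^{-n\epsilon}$.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def relative_entropy(P, Q):
    # D(P || Q) in bits, with the convention 0 log 0 = 0.
    mask = P > 0
    return np.sum(P[mask] * np.log2(P[mask] / Q[mask]))

k, n, eps = 6, 2_000, 0.01
Q = np.full(k, 1.0 / k)                      # uniform (fair) distribution

rolls = rng.integers(0, k, size=n)           # sample from a fair die
P_emp = np.bincount(rolls, minlength=k) / n  # empirical distribution (type)
print(relative_entropy(P_emp, Q) <= eps)     # True except with probability ~ 2^(-n*eps)
\end{verbatim}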
Chapter 10
Mathematical Tools
In this chapter we provide a few results that are instrumental for some proofs.
Results are stated without proofs.
10.1 Jensen Inequality
Definition 10.1.1. A function f : Rd → R is said to be convex if for all x, y ∈ Rd
and all λ ∈ [0, 1]:
f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y)
Property 8. A twice differentiable function f : Rd → R is convex if and only if its Hessian ∇2 f (x) is positive semi-definite for all x ∈ Rd .
Property 9 (Jensen's Inequality). Consider f : Rd → R a convex function and X a random vector in Rd . Then f (E(X)) ≤ E(f (X)) with equality if and only if f is affine over the support of the distribution of X.
10.2 Constrained Optimization
Property 10 (Karush-Kuhn-Tucker Conditions). Consider $f: \mathbb{R}^d \to \mathbb{R}$ a concave, differentiable function, and let $x^\star$ be its maximum over the simplex:
$$\text{Maximize } f(x) \text{ s.t. } \sum_{i=1}^{d} x_i \le 1 \text{ and } x \ge 0$$
Then there exist $\lambda \in \mathbb{R}^+$ and $\mu \in (\mathbb{R}^+)^d$ such that
$$\nabla f(x^\star) - \lambda \mathbf{1} + \mu = 0$$
where $\lambda (1 - \sum_{i=1}^{d} x^\star_i) = 0$ and $x^\star_i \mu_i = 0$ for all $i$.