Notes on Information Theory
by Jeff Steif
1 Entropy, the Shannon-McMillan-Breiman Theorem and Data Compression
These notes will contain some aspects of information theory. We will consider some REAL problems that REAL people are interested in although
we might make some mathematical simplifications so we can (mathematically) carry this out. Not only are these things inherently interesting (in my
opinion, at least) because they have to do with real problems but these problems or applications allow certain mathematical theorems (most notably the
Shannon-McMillan-Breiman Theorem) to “come alive”. The results that we
will derive will have much to do with the entropy of a process (a concept
that will be explained shortly) and allow one to see the real importance of
entropy which might not at all be obvious simply from its definition.
The first aspect we want to consider is so-called data compression which
means we have data and want to compress it in order to save space and
hence money. More mathematically, let $X_1, X_2, \ldots, X_n$ be a finite sequence from a stationary stochastic process $\{X_k\}_{k=-\infty}^{\infty}$. Stationarity will always mean strong stationarity, which means that the joint distribution of $X_m, X_{m+1}, X_{m+2}, \ldots, X_{m+k}$ is independent of $m$; in other words, we have time-invariance. We also assume that the variables take on values from a finite set $S$ with $|S| = s$. Anyway, we have our $X_1, X_2, \ldots, X_n$ and want
a rule which given a sequence assigns to it a new sequence (hopefully of
shorter length) such that different sequences are mapped to different sequences (otherwise one could not recover the data and the whole thing
would be useless, e.g., assigning every word to 0 is nice and short but of
limited use). Assuming all words have positive probability, we clearly can’t
code all the words of length $n$ into words of length $n-1$ since there are $s^n$ of the former and $s^{n-1}$ of the latter. However, one perhaps can hope that
the expected length of the coded word is less than the length of the original
word (and by some fixed fraction).
What we will eventually see is that this is always the case EXCEPT in
the one case where the stationary process is i.i.d. AND uniform on S. It
will turn out that one can code things so that the expected length of the
output is H/ ln(s) times the length of the input where H is the entropy of
the process. (One should conclude from the above discussion that the only
stationary process with entropy ln(s) is i.i.d. uniform.)
What is the entropy of a stationary process? We define this in 2 steps.
First, consider a partition of a probability space into finitely many sets where
(after some ordering), the sets have probabilities, p1 , . . . , pk (which are of
course positive and add to 1). The entropy of this partition is defined to be
$$-\sum_{i=1}^{k} p_i \ln p_i.$$
Sounds arbitrary but turns out to be very natural, has certain natural
properties (after one interprets entropy as the amount of information gained
by performing an experiment with the above probability distribution) and
is in fact the only (up to a multiplicative constant) function which has these
natural properties. I won’t discuss these but there are many references (ask
me if you want). (Here ln means natural ln, but sometimes in these notes,
it will refer to a different base.)
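For concreteness, here is a minimal sketch in Python of this quantity (the function name and the example distribution are my own choices, not from the notes):

```python
import math

def entropy(probs, base=math.e):
    """Entropy -sum p_i log p_i of a finite probability vector.

    Terms with p_i = 0 contribute 0 by convention.
    """
    assert abs(sum(probs) - 1.0) < 1e-9, "probabilities must sum to 1"
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# The partition with probabilities {1/2, 1/4, 1/4} has entropy (3/2) ln 2.
print(entropy([0.5, 0.25, 0.25]))          # ~1.0397 = 1.5 * ln 2
print(entropy([0.5, 0.25, 0.25], base=2))  # 1.5 when using base 2
```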
Now, given a stationary process, look only at the random variables
X1 , X2 , . . . , Xn . This breaks the probability space into $s^n$ pieces since there
are this many sequences of length n. Let Hn be the entropy (as defined
above) of this partition.
Definition 1.1: The entropy of a stationary process is $\lim_{n\to\infty} \frac{H_n}{n}$ where $H_n$ is defined above (this limit can be shown to always exist using subadditivity).
Hard Exercise: Define the conditional entropy H(X|Y ) of X given Y (where
both r.v.’s take on only finitely many values) in the correct way and show
that the entropy of a process $\{X_n\}$ is also $\lim_{n\to\infty} H(X_0 \mid X_{-1}, \ldots, X_{-n})$.
Exercise: Compute the entropy first of i.i.d. uniform and then a general
i.i.d.. If you’re then feeling confident, go on and look at Markov chains
(they are not that hard).
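As a numerical companion to these exercises, here is a sketch (mine, with an arbitrary example chain) computing the entropy of an i.i.d. process, $-\sum_i p_i \ln p_i$, and of a stationary Markov chain, $-\sum_i \pi_i \sum_j P_{ij} \ln P_{ij}$ with $\pi$ the stationary distribution, which is what the exercises ask you to derive:

```python
import numpy as np

def iid_entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p[p > 0] * np.log(p[p > 0]))

def markov_entropy(P):
    """Entropy of a stationary Markov chain with transition matrix P:
    H = -sum_i pi_i sum_j P_ij ln P_ij, pi the stationary distribution."""
    P = np.asarray(P, dtype=float)
    # stationary distribution: left eigenvector of P for eigenvalue 1
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
    pi = pi / pi.sum()
    with np.errstate(divide='ignore', invalid='ignore'):
        logP = np.where(P > 0, np.log(P), 0.0)
    return -np.sum(pi[:, None] * P * logP)

s = 4
print(iid_entropy([1 / s] * s), np.log(s))       # uniform i.i.d.: entropy ln s
print(markov_entropy([[0.9, 0.1], [0.5, 0.5]]))  # example two-state chain
```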
Exercise: Show that entropy is uniquely maximized at i.i.d. uniform. (The
previous exercise showed this to be the case if we restrict to i.i.d.’s but you
need to consider all stationary processes.)
We now discuss the Shannon-McMillan-Breiman Theorem. It might seem
dry and the point of it might not be so apparent but it will come very much
alive soon.
Before doing this, we need to make an important (and not unnatural)
assumption that our process is ergodic. What does this mean? To motivate
it, let’s construct a stationary process which does not satisfy the Strong Law
of Large Numbers (SLLN) for trivial reasons. Let {Xn } be i.i.d. with 0’s
and 1’s, 0 with probability 3/4 and 1 with probability 1/4 and let {Yn } be
i.i.d. with 0’s and 1’s, 0 with probability 1/4 and 1 with probability 3/4.
Now construct a stationary process by first flipping a fair coin and then if
it’s heads, take a realization from the {Xn } process and if it’s tails, take a
realization from the {Yn } process. (By thinking of stationary processes as
measures on sequence space, we are simply taking a convex (1/2,1/2) combination of the measures associated to the two processes {Xn } and {Yn }).
You should check that this resulting process is stationary with each
marginal having mean 1/2. Letting Zn denote this process, we clearly don’t
have that
$$\frac{1}{n}\sum_{i=1}^{n} Z_i \to 1/2 \quad \text{a.s.}$$
(which is what the SLLN would tell us if $\{Z_n\}$ were i.i.d.) but rather we clearly have that
$$\frac{1}{n}\sum_{i=1}^{n} Z_i \to 3/4 \text{ (resp. } 1/4\text{)} \quad \text{with probability } 1/2 \text{ (resp. } 1/2\text{)}.$$
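A quick simulation (my own illustration, with the parameters of the construction above) makes this failure of the SLLN visible: each realization's empirical mean settles near 3/4 or 1/4 depending on the initial coin flip, never near 1/2.

```python
import random

def sample_mixture_path(n, seed=None):
    """One realization of {Z_n}: flip a fair coin, then draw an i.i.d. 0/1
    sequence with P(1) = 1/4 (heads) or P(1) = 3/4 (tails)."""
    rng = random.Random(seed)
    p_one = 0.25 if rng.random() < 0.5 else 0.75
    return [1 if rng.random() < p_one else 0 for _ in range(n)]

for seed in range(5):
    path = sample_mixture_path(100_000, seed=seed)
    print(sum(path) / len(path))   # each line is close to 0.25 or 0.75
```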
The point of ergodicity is to rule out such stupid examples. There are
two equivalent definitions of ergodicity. The first is that if T is the transformation which shifts a bi-infinite sequence one step to the left, then any
event which is invariant under T has probability 0 or 1. The other definition
is that the measure (or distribution) on sequence space associated to the
process is not a convex combination of 2 measures each also corresponding
to some stationary process (i.e., invariant under T ). It is not obvious that
these are equivalent definitions but they are.
Exercise: Show that an i.i.d. process is ergodic by using the first definition.
Theorem 1.2 (Shannon-McMillan-Breiman Theorem): Consider a
stationary ergodic process with entropy H. Let p(X1 , X2 , . . . , Xn ) be the
probability that the process prints out X1 , . . . , Xn . (You have to think about
what this means BUT NOTE that it is a random variable). Then as n → ∞,
$$\frac{-\ln(p(X_1, X_2, \ldots, X_n))}{n} \to H \quad \text{a.s. and in } L^1.$$
NOTE: The fact that $E\big[\frac{-\ln(p(X_1, X_2, \ldots, X_n))}{n}\big] \to H$ is exactly (check this) the definition of entropy and therefore general real analysis tells us that the a.s. convergence implies the $L^1$–convergence.
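To see the statement concretely, here is a sketch (mine) that evaluates $-\ln(p(X_1,\ldots,X_n))/n$ along a sample path of an i.i.d. process, where the probability $p$ can be written down explicitly; the particular marginal distribution is just an example.

```python
import math
import random

def sample_iid(probs, n, rng):
    return rng.choices(range(len(probs)), weights=probs, k=n)

def smb_statistic(word, probs):
    """-(1/n) ln p(x_1,...,x_n) for an i.i.d. process with marginal `probs`."""
    return -sum(math.log(probs[x]) for x in word) / len(word)

probs = [0.5, 0.25, 0.25]
H = -sum(p * math.log(p) for p in probs)
rng = random.Random(0)
for n in [100, 10_000, 1_000_000]:
    word = sample_iid(probs, n, rng)
    print(n, smb_statistic(word, probs), "entropy:", H)
```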
We do not prove this now (we will in §2) but instead see how we can use
it to do data compression. To do data compression, we will only need the
above convergence in probability (which of course follows from either the
a.s. or L1 convergence). Thinking about what convergence in probability
means, we have the following corollary.
Corollary 1.3: Consider a stationary ergodic process with entropy H.
Then for all $\epsilon > 0$, for large $n$, we have that the words of length $n$ can be divided into two sets with all words $C$ in the first set satisfying
$$e^{-(H+\epsilon)n} < p(C) < e^{-(H-\epsilon)n}$$
and with the total measure of all words in the second set being $< \epsilon$.
One should think of the first set as the “good” set and the second set as
the “bad” set. (Later it will be the good words which will be compressed).
Except in the i.i.d. uniform case, the bad set will have many more elements
(in terms of cardinality) than the good set but its probability will be much
lower. (Note that it is only in the i.i.d. uniform case that probabilities and
cardinality are the same thing.) Here is a related result which will finally
allow us to prove the data compression theorem.
Proposition 1.4: Consider a stationary ergodic process with entropy H.
Order the words of length n in decreasing order (in terms of their probabilities). Fix λ ∈ (0, 1). We select the words of length n in the above order one
at a time until their probabilities sum up to at least λ and we let Nn (λ) be
the number of words we take to do this. Then
$$\lim_{n\to\infty} \frac{\ln(N_n(\lambda))}{n} = H.$$
Exercise: Note that one obtains the same limit for all λ. Think about this.
Convince yourself however that the above convergence cannot be uniform in
λ. Is the convergence uniform in λ on compact intervals of (0, 1)?
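For a feeling of what $N_n(\lambda)$ looks like, the following brute-force sketch (mine; the small i.i.d. source and the choice $\lambda = 1/2$ are arbitrary) sorts all words of length $n$ by probability and counts how many are needed to accumulate measure $\lambda$; the ratio $\ln(N_n(\lambda))/n$ creeps toward $H$, slowly, since $n$ must stay small for brute force.

```python
import math
from itertools import product

def N_n(probs, n, lam):
    """Number of most probable length-n words needed to accumulate measure >= lam
    (i.i.d. process with marginal `probs`); brute force, so keep n small."""
    word_probs = sorted(
        (math.prod(probs[x] for x in w) for w in product(range(len(probs)), repeat=n)),
        reverse=True,
    )
    total, count = 0.0, 0
    for p in word_probs:
        total += p
        count += 1
        if total >= lam:
            return count
    return count

probs = [0.7, 0.2, 0.1]
H = -sum(p * math.log(p) for p in probs)
for n in [4, 8, 12]:
    print(n, math.log(N_n(probs, n, 0.5)) / n, "entropy:", H)
```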
Proof: Fix $\lambda \in (0,1)$. Let $\epsilon > 0$ be arbitrary but less than both $1-\lambda$ and $\lambda$. We will show both
$$\limsup_n \frac{\ln(N_n(\lambda))}{n} \le H + \epsilon$$
and
$$\liminf_n \frac{\ln(N_n(\lambda))}{n} \ge H - \epsilon$$
from which the result follows.
By Corollary 1.3, we know that for large $n$, the words of length $n$ can be broken up into 2 groups with all words $C$ in the first set satisfying
$$e^{-(H+\epsilon)n} < p(C) < e^{-(H-\epsilon)n}$$
and with the total measure of all words in the second set being $< \epsilon$.
We now break the second group (the bad group) into 2 sets depending on whether $p(C) \ge e^{-(H-\epsilon)n}$ (the big bad words) or whether $p(C) \le e^{-(H+\epsilon)n}$ (the small bad words). When we order the words in decreasing order, we clearly first go through the big bad words, then the good words and finally the small bad words.
We now prove the first inequality. For large $n$, the total measure of the good words is at least $1-\epsilon$ which is bigger than $\lambda$ and hence the total measure of the big bad words together with the good words is bigger than $\lambda$. Hence $N_n(\lambda)$ is at most the total number of big bad words plus the total number of good words. Since all these words have probability at least $e^{-(H+\epsilon)n}$, there cannot be more than $\lceil e^{(H+\epsilon)n} \rceil$ of them. Hence
$$\limsup_n \frac{\ln(N_n(\lambda))}{n} \le \limsup_n \frac{\ln(e^{(H+\epsilon)n} + 1)}{n} = H + \epsilon.$$
For the other inequality, let $M_n(\lambda)$ be the number of good words among the $N_n(\lambda)$ words taken when we accumulated at least $\lambda$ measure of the space. Since the total measure of the big bad words is at most $\epsilon$, the total measure of these $M_n(\lambda)$ words is at least $\lambda - \epsilon$. Since each of these words has probability at most $e^{-(H-\epsilon)n}$, there must be at least $(\lambda - \epsilon)e^{(H-\epsilon)n}$ words in $M_n(\lambda)$ (if you don't see this, we have $M_n(\lambda)\, e^{-(H-\epsilon)n} \ge \lambda - \epsilon$). Hence
$$\liminf_n \frac{\ln(N_n(\lambda))}{n} \ge \liminf_n \frac{\ln(M_n(\lambda))}{n} \ge \liminf_n \frac{\ln((\lambda-\epsilon)e^{(H-\epsilon)n})}{n} = H - \epsilon,$$
and we're done. 2
Note that by taking $\lambda$ close to 1, one can see that (as long as we are not i.i.d. uniform, i.e., as long as $H < \ln s$), we can cover almost all words (in the sense of probability) by using only a negligible fraction of the total number of words ($e^{Hn}$ words instead of the total number of words $e^{n \ln s}$).
We finally arrive at data compression, now that we have things set up and
have a little feeling about what entropy is. We now make some definitions.
Definition 1.5: An n–code is an injective (i.e. 1-1) mapping σ which takes
the set of words of length n with alphabet S to words of any finite length
(of at least 1) with alphabet S.
Definition 1.6: Consider a stationary ergodic process $\{X_k\}_{k=-\infty}^{\infty}$. The compressibility of an $n$–code $\sigma$ is $E\big[\frac{\ell(\sigma(X_1,\ldots,X_n))}{n}\big]$ where $\ell(W)$ denotes the length of the word $W$.
This measures how well an n–code compresses on average.
Theorem 1.7: Consider a stationary ergodic process $\{X_k\}_{k=-\infty}^{\infty}$ with entropy $H$. Let $\mu_n$ be the minimum compressibility over all $n$–codes. Then the limit $\mu \equiv \lim_{n\to\infty} \mu_n$ exists and equals $\frac{H}{\ln s}$.
The quantity $\mu$ is called the compression coefficient of the process $\{X_k\}_{-\infty}^{\infty}$. Since $H$ is always strictly less than $\ln s$ except in the i.i.d. uniform case, this says that data compression is always possible except in this case.
Proof: As usual, we have two directions to prove.
We first show that for arbitrary $\epsilon$ and $\delta > 0$, and for large $n$, $\mu_n \ge (1-\delta)\frac{H - 2\epsilon}{\ln s}$.
Fix any $n$–code $\sigma$. Call an $n$–word $C$ short (for $\sigma$) if $\ell(\sigma(C)) \le \frac{n(H-2\epsilon)}{\ln s}$. Since codes are 1-1, the number of short words is at most
$$s + s^2 + \cdots + s^{\lfloor \frac{n(H-2\epsilon)}{\ln s} \rfloor} \le s^{\lfloor \frac{n(H-2\epsilon)}{\ln s} \rfloor} \Big( \sum_{i=0}^{\infty} s^{-i} \Big) \le \frac{s}{s-1}\, e^{n(H-2\epsilon)}.$$
Next, since
$$\frac{\ln(N_n(\delta))}{n} \to H$$
by Proposition 1.4, for large $n$ we have that $N_n(\delta) > e^{n(H-\epsilon)}$. This says that for large $n$, if we take fewer than $e^{n(H-\epsilon)}$ of the most probable words, we won't get $\delta$ total measure, and so certainly if we take fewer than $e^{n(H-\epsilon)}$ of any of the words, we won't get $\delta$ total measure. In particular, since the number of short words is at most
$$\frac{s}{s-1}\, e^{n(H-2\epsilon)},$$
which is for large $n$ less than $e^{n(H-\epsilon)}$, the short words can't cover $\delta$ total measure for large $n$, i.e., $P(\text{short word}) \le \delta$ for large $n$. Now note that this argument was valid for any $n$–code. That is, for large $n$, any $n$–code satisfies $P(\text{short word}) \le \delta$. Hence for any $n$–code $\sigma$
$$E\Big[\frac{\ell(\sigma(X_1,\ldots,X_n))}{n}\Big] \ge (1-\delta)\,\frac{H-2\epsilon}{\ln s}$$
and so
$$\mu_n \ge (1-\delta)\,\frac{H-2\epsilon}{\ln s},$$
as desired.
For the other direction, we show that for any $\epsilon > 0$, we have that for large $n$, $\mu_n \le \frac{1}{n}\big\lceil \frac{n(H+\epsilon)}{\ln s} \big\rceil + \delta$. This will complete the argument.
The number of different sequences of length exactly $\big\lceil \frac{n(H+\epsilon)}{\ln s} \big\rceil$ is at least $s^{\frac{n(H+\epsilon)}{\ln s}}$, which is $e^{n(H+\epsilon)}$. By Proposition 1.4 the number $N_n(1-\delta)$ of most probable $n$–words needed to cover $1-\delta$ of the measure is $\le e^{n(H+\epsilon)}$ for large $n$. Since there are at least this many words of length $\big\lceil \frac{n(H+\epsilon)}{\ln s} \big\rceil$, we can code these high probability words into words of length $\big\lceil \frac{n(H+\epsilon)}{\ln s} \big\rceil$. For the remaining words, we code them to themselves. For such a code $\sigma$, we have that $E[\ell(\sigma(X_1, \ldots, X_n))]$ is at most $\big\lceil \frac{n(H+\epsilon)}{\ln s} \big\rceil + \delta n$ and hence the compressibility of this code is at most $\frac{1}{n}\big\lceil \frac{n(H+\epsilon)}{\ln s} \big\rceil + \delta$. 2
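To make the construction in the second half of the proof concrete, here is a sketch (mine, not an actual injective code, and only for an i.i.d. source with known distribution) that just computes the expected relative output length of the scheme: the $N_n(1-\delta)$ most probable words get length $\lceil n(H+\epsilon)/\ln s\rceil$ and everything else is left alone. For small $n$ the value is well above the limit $H/\ln s$, but already below 1.

```python
import math
from itertools import product

def expected_relative_length(probs, n, eps, delta):
    """Expected (length of coded word)/n for the scheme in the proof of Theorem 1.7:
    the most probable words covering 1 - delta of the measure are coded into words
    of length ceil(n(H + eps)/ln s); the rest are coded to themselves (length n)."""
    s = len(probs)
    H = -sum(p * math.log(p) for p in probs)
    short_len = math.ceil(n * (H + eps) / math.log(s))
    word_probs = sorted(
        (math.prod(probs[x] for x in w) for w in product(range(s), repeat=n)),
        reverse=True,
    )
    total, exp_len = 0.0, 0.0
    for p in word_probs:
        exp_len += p * (short_len if total < 1 - delta else n)
        total += p
    return exp_len / n

probs = [0.7, 0.2, 0.1]
H = -sum(p * math.log(p) for p in probs)
print(expected_relative_length(probs, 12, eps=0.05, delta=0.01))
print("target H / ln s:", H / math.log(3))
```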
This is all nice and dandy but
EXERCISE: Show that for a real person, who really has data and really
wants to compress it, the above is all useless.
We will see later on (§3) a method which is not useless.
2 Proof of the Shannon-McMillan-Breiman Theorem
We provide here the classical proof of this theorem where we will assume two
major theorems, namely, the ergodic theorem and the Martingale Convergence Theorem. This section is much more technical than all of the coming
sections and requires more mathematical background. Feel free if you wish
to move on to §3. As this section is independent of all of the others, you
won’t miss anything.
We first show that the proof of this result is an easy consequence of the
ergodic theorem in the special case of an i.i.d. process or more generally of
a multistep Markov chain. We take the middle road and prove it for Markov
chains (and the reader will easily see that it extends trivially to multi-step
Markov chains).
Proof of SMB for Markov Chains: First note that
$$-\frac{\ln(p(X_1, X_2, \ldots, X_n))}{n} = -\frac{1}{n}\sum_{j=1}^{n} \ln p(X_j \mid X_{j-1})$$
where the first term ln p(X1 |X0 ) is taken to be ln p(X1 ). If the first term
were interpreted as ln p(X1 |X0 ) instead, then the ergodic theorem would
immediately tell us this limit is a.s. E[− ln p(X1 |X0 )] which (you have seen
as an exercise) is the entropy of the process. Of course this difference in the
first term makes an error of at most const/n and the result is proved. 2
For the multistep Markov chain, rather than replacing the first term by something else (as above), we need to do so for the first k terms where k is the look-back of the Markov chain. Since this k is fixed, there is no problem.
To prove the general result, we isolate the main idea into a lemma which
is due to Breiman and which is a generalization of the ergodic theorem which
is also used in its proof. Actually, to understand the statement of the next
result, one really needs to know some ergodic theory. Learn about it on
your own, ask me for books, or I’ll discuss what is needed or just forget this
whole section.
Proposition 2.1: Assume that $g_n(x) \to g(x)$ a.e. and that
$$\int \sup_n |g_n(x)| < \infty.$$
Then if $\phi$ is an ergodic measure preserving transformation, we have that
$$\lim_{n\to\infty} \frac{1}{n}\sum_{j=0}^{n-1} g_j(\phi^j(x)) = \int g \quad \text{a.e. and in } L^1.$$
Note if gn = g ∈ L1 for all n, then this is exactly the ergodic theorem.
Proof: (As explained in the above remark) the ergodic theorem tells us
$$\lim_{n\to\infty} \frac{1}{n}\sum_{j=0}^{n-1} g(\phi^j(x)) = \int g \quad \text{a.e. and in } L^1.$$
Since
$$\frac{1}{n}\sum_{j=0}^{n-1} g_j(\phi^j(x)) = \frac{1}{n}\sum_{j=0}^{n-1} g(\phi^j(x)) + \frac{1}{n}\sum_{j=0}^{n-1} \big[g_j(\phi^j(x)) - g(\phi^j(x))\big],$$
we need to show that
$$(*)\qquad \frac{1}{n}\sum_{j=0}^{n-1} \big|g_j(\phi^j(x)) - g(\phi^j(x))\big| \to 0 \quad \text{a.e. and in } L^1.$$
Let $F_N(x) = \sup_{j\ge N} |g_j(x) - g(x)|$. By assumption, each $F_N$ is integrable and goes to 0 monotonically a.e., from which dominated convergence gives $\int F_N \to 0$.
Now,
$$\frac{1}{n}\sum_{j=0}^{n-1} |g_j(\phi^j(x)) - g(\phi^j(x))| \le \frac{1}{n}\sum_{j=0}^{N-1} |g_j(\phi^j(x)) - g(\phi^j(x))| + \frac{n-N}{n}\cdot\frac{1}{n-N}\sum_{j=0}^{n-N-1} F_N(\phi^j\phi^N(x)).$$
For fixed $N$, letting $n \to \infty$, the first term goes to 0 a.e. while the second term goes to $\int F_N$ a.e. by the ergodic theorem. Since $\int F_N \to 0$, this proves the pointwise part of (*). For the $L^1$ part, again for fixed $N$, letting $n \to \infty$, the $L^1$ norm of the first term goes to 0 while the $L^1$ norm of the second term is always at most $\int F_N$ (which goes to 0) and we're done. 2
Proof of the SMB Theorem: To properly do this and to apply the
previous theorem, we need to set things up in an ergodic theory setting.
Our process $X_n$ gives us a measure $\mu$ on $U = \{1, \ldots, S\}^{\mathbb{Z}}$ (thought of as its
distribution function). We also have a transformation T on U , called the
left shift, which simply maps an infinite sequence 1 unit to the left. The
fact that the process is stationary means that T preserves the measure µ in
that for all events A, µ(T A) = µ(A).
Next, letting $g_k(w) = -\ln p(X_1 \mid X_0, \ldots, X_{-k+1})$ (with $g_0(w) = -\ln p(X_1)$), we get immediately that
$$-\frac{\ln(p(X_1(w), X_2(w), \ldots, X_n(w)))}{n} = \frac{1}{n}\sum_{j=0}^{n-1} g_j(T^j(w)).$$
(Note that w here is an infinite sequence in our space X.) We are now in
the set-up of the previous proposition although of course a number of things
need to be verified, the first being the convergence of the gk ’s to something.
Note that for any i in our alphabet {1, . . . , S}, gk (w) on the cylinder set
X1 = i is simply − ln P (X1 = i|X0 , . . . , X−k+1 ) which (here we use a second
nontrivial result) converges by the Martingale convergence theorem (and the
continuity of ln) to − ln P (X1 = i|X0 , . . .). We therefore get that
gk (w) → g(w)
where g(w) = − ln P (X1 = i|X0 , . . .) on the cylinder set X1 = i, i.e.,
g(w) = − ln P (X1 |X0 , . . .).
If we can show that $\int \sup_k |g_k(w)| < \infty$, the previous proposition will tell us that
$$-\frac{\ln(p(X_1, X_2, \ldots, X_n))}{n} \to \int g \quad \text{a.e. and in } L^1.$$
Since $E\big[\frac{-\ln(p(X_1, X_2, \ldots, X_n))}{n}\big] \to h(\{X_i\})$ by definition of entropy, it follows that $\int g = h(\{X_i\})$ and we would be done.
EXERCISE: Show directly that E[g(w)] is h({Xi }).
We now proceed with verifying $\int \sup_k |g_k(w)| < \infty$. Fix $\lambda > 0$. We show that $P(\sup_k g_k > \lambda) \le s e^{-\lambda}$ (where $s$ is the alphabet size) which implies the desired integrability.
$P(\sup_k g_k > \lambda) = \sum_k P(E_k)$ where $E_k$ is the set where the first time $g_j$ gets larger than $\lambda$ is $k$, i.e., $E_k = \{g_j \le \lambda,\ j = 0, \ldots, k-1,\ g_k > \lambda\}$. These are obviously disjoint for different $k$ and make up the set above. Now, $P(E_k) = \sum_i P((X_1 = i) \cap E_k) = \sum_i P((X_1 = i) \cap F_k^i)$ where $F_k^i$ is the set where the first time $f_j^i$ gets larger than $\lambda$ is $k$, where
$$f_j^i = -\ln p(X_1 = i \mid X_0, \ldots, X_{-j+1}).$$
Since $F_k^i$ is $X_{-k+1}, \ldots, X_0$–measurable, we have
$$P((X_1 = i) \cap F_k^i) = \int_{F_k^i} P(X_1 = i \mid X_{-k+1}, \ldots, X_0) = \int_{F_k^i} e^{-f_k^i} \le e^{-\lambda} P(F_k^i).$$
Since the $F_k^i$ are disjoint for different $k$ (but not for different $i$),
$$\sum_k P(E_k) \le \sum_i \sum_k e^{-\lambda} P(F_k^i) \le s e^{-\lambda}. \quad 2$$
We finally mention a couple of things in the higher–dimensional case, that
is, where we have a stationary random field where entropy can be defined
in a completely analogous way. It was unknown for some time if a (pointwise, that is, a.s.) SMB theorem could be obtained in this case, the main
obstacle being that it was known that a natural candidate for a multidimensional martingale convergence theorem was in fact false and so one could
not proceed as we did above in the 1-d case. (It was however known that
this theorem held in “mean” (i.e., in L1 ), see below.) However, recently, (in
1983) Ornstein and Weiss managed to get a pointwise SMB not only for the
multidimensional lattice but in the more general setting of something called
amenable groups. (They used a method which circumvents the Martingale
Convergence Theorem).
The result of Ornstein and Weiss is difficult and so we don’t present
it. Instead we present the Shannon–McMillan Theorem (note, no Breiman here) which is the $L^1$ convergence in the SMB Theorem. Of course, this is a
weaker result than we just proved but the point of presenting it is two-fold,
first to see how much easier it is than the pointwise version and also to see a
result which generalizes (relatively) easily to higher dimensions (this higher
dimensional variant being left to the reader).
Proposition 2.2 (Mean SMB Theorem or MB Theorem): The SMB
theorem holds in L1 .
Proof: Let $g_j$ be defined as above. The Martingale Convergence Theorem implies that $g_j \to g_\infty$ in $L^1$ as $j \to \infty$. We then have (with $\|\cdot\|$ denoting the $L^1$ norm)
$$\Big\| -\frac{\ln(p(X_1, X_2, \ldots, X_n))}{n} - H \Big\| = \Big\| \frac{1}{n}\sum_{j=0}^{n-1} g_j(T^j(w)) - H \Big\| \le \Big\| \frac{1}{n}\sum_{j=0}^{n-1} |g_j(T^j(w)) - g_\infty(T^j(w))| \Big\| + \Big\| \frac{1}{n}\sum_{j=0}^{n-1} g_\infty(T^j(w)) - H \Big\|.$$
The first term goes to 0 by the $L^1$ convergence in the Martingale Convergence Theorem (we don't of course always get $L^1$ convergence in the Martingale Convergence Theorem but we do in this setting). The second term goes to 0 by the $L^1$ convergence in the ergodic theorem together with the fact that $\int g_\infty = H$, an easy computation left to the reader. 2
3 Universal Entropy Estimation?
This section simply raises an interesting point which will be delved into
further later on.
One can consider the problem of trying to estimate the entropy of a
process $\{X_n\}_{n=-\infty}^{\infty}$ after we have only seen $X_1, X_2, \ldots, X_n$ in such a way that these estimates converge to the true entropy with probability 1. (As above, we know that to have any chance of doing this, one needs to assume ergodicity).
Obviously, one could simply take the estimate to be
$$\frac{-\ln(p(X_1, X_2, \ldots, X_n))}{n}$$
which will (by SMB) converge to the entropy. However, this is clearly undesirable in the sense that in order to use this estimate, we need to know what the distribution of the process is. It certainly would be preferable to find some estimation procedure which does not depend on the distribution since we might not know it. We therefore want a family of functions $h_n : S^n \to \mathbb{R}$ ($h_n(x_1, \ldots, x_n)$ would be our guess of the entropy of the process if we see the data $x_1, \ldots, x_n$) such that for any given ergodic process $\{X_n\}_{n=-\infty}^{\infty}$,
$$\lim_{n\to\infty} h_n(X_1, \ldots, X_n) = h(\{X_n\}_{n=-\infty}^{\infty}) \quad \text{a.s.}$$
(Clearly the suggestion earlier is not of this form since hn depends on the
process). We call such a family of functions a “universal entropy estimator”,
universal since it holds for all ergodic processes.
There is no immediate obvious reason at all why such a thing exists
and there are similar types of things which don’t in fact exist. (If entropy
were a continuous function defined on processes (the latter being given the
usual weak topology), a general theorem (see §9) would tell us that such a
thing exists BUT entropy is NOT continuous.) However, as it turns out, a
universal entropy estimator does in fact exist. One such universal entropy
estimator comes from an algorithm called the “Lempel-Ziv algorithm” which
was later improved by Wyner-Ziv. We will see that this algorithm also turns
out to be a universal data compressor, which we will describe later. (This
method is used, as far as I understand it, in the real world).
4 20 questions and Relative Entropy
We have all heard problems of the following sort.
You have 8 coins one of which has a different weight than all the others
which have the same weight. You are given a scale (which can determine
which of two given quantities is heavier) and your job is to identify the “bad”
coin. How many weighings do you need?
Later we will do something more general but it turns out the problem
of how many questions one needs to ask to determine something has an extremely simple answer which is given by something called Kraft’s Theorem.
The answer will not at all be surprising and will be exactly what you would
guess, namely if you ask questions which have D possible answers, then you
will need lnD (n) questions where n is the total number of possibilities. (If
the latter is not an integer, you need to take the integer right above it).
Definition 4.1: A mapping C from a finite set S to finite words with alphabet {1, . . . , D} will be called a prefix-free D–code on S if for any distinct x, y ∈ S, C(x) is not a prefix of C(y) (which implies of course that C is injective).
Theorem 4.2 (Kraft's Inequality): Let C be a prefix-free D–code defined on S. Let $\ell_1, \ldots, \ell_{|S|}$ be the lengths of the code words of C. Then
$$\sum_{i=1}^{|S|} D^{-\ell_i} \le 1.$$
Conversely, if $\ell_1, \ldots, \ell_n$ are integers satisfying
$$\sum_{i=1}^{n} D^{-\ell_i} \le 1,$$
then there exists a prefix-free D–code on a set with n elements whose code words have lengths $\ell_1, \ldots, \ell_n$.
This turns out to be quite easy to prove so try to figure it out on your own.
I’ll present it in class but won’t write it here.
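For the converse direction, here is a sketch (mine) of the standard greedy construction: sort the lengths, and assign to each length the next available codeword; Kraft's inequality guarantees we never run out.

```python
def kraft_code(lengths, D=2):
    """Given lengths satisfying sum(D**-l) <= 1, build a prefix-free code
    whose i-th codeword (a string over digits 0..D-1) has the i-th length."""
    assert sum(D ** (-l) for l in lengths) <= 1 + 1e-12, "Kraft's inequality fails"
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    codewords = [None] * len(lengths)
    w, prev_len = 0, 0
    for i in order:
        l = lengths[i]
        w = w * D ** (l - prev_len)      # shift previous value up to length l
        assert w < D ** l, "Kraft's inequality guarantees this cannot happen"
        digits = []
        x = w
        for _ in range(l):
            digits.append(str(x % D))
            x //= D
        codewords[i] = "".join(reversed(digits))
        w, prev_len = w + 1, l
    return codewords

print(kraft_code([1, 2, 3, 3], D=2))   # ['0', '10', '110', '111']
```

The assertion inside the loop is exactly the point where Kraft's inequality is used: it guarantees the next codeword still fits in the prescribed number of digits.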
Corollary 4.3: There is a prefix–free D–code on a set of n elements whose maximum code word has length $\lceil \ln_D(n) \rceil$ while there is no such code whose maximum code word length is less than $\lceil \ln_D(n) \rceil$.
I claim that this corollary implies that the number of questions with D
possible answers one needs to ask to determine something with n possibilities
is lnD (n) (where if this is not an integer, it is taken to be the integer right
above it). To see this one just needs to think a little and realize that asking
questions until we get the right answer is exactly a prefix–free D–code. (For
example, if you have the code first, your first question would be “What is
the first number in the codeword assigned to x?” which of course has D
possible answers.) Actually, going back to the weighing question, we did not
really answer that question at all. We understand the situation when we
can ask any D–ary question (i.e., any partition of the outcome space into
D pieces) but with questions like the weighing question, while any weighing
breaks the outcome space into 3 pieces, we are physically limited (I assume,
I have not thought about it) from constructing all possible partitions of the
outcome space into 3 pieces. So this investigation, while interesting and
useful, did not allow us to answer the original question at all but we end
there nonetheless.
Now we want to move on and do something more general. Let's say that X is a random variable taking values in a finite set $\mathcal{X}$ with probabilities $p_1, \ldots, p_n$ (so $|\mathcal{X}| = n$). Now we want a successful guessing scheme (i.e., a prefix–free code) where the expected number of questions we need to ask (rather than the maximum number of questions) is not too large. The following theorem answers this question and more. For simplicity, we talk about prefix–free codes.
Theorem 4.4: Let C be a prefix–free D–code on $\mathcal{X}$ and N the length of the codeword assigned to X (which is a r.v.). Then
$$H_D(X) \le E[N]$$
where $H_D(X)$ is the D–entropy of X given by $-\sum_{i=1}^{n} p_i \ln_D p_i$.
Conversely, there exists a prefix–free D–code satisfying
E[N ] < HD (X) + 1.
(We will prove this below but must first do a little more development.)
EXERCISE: Relate this result to Corollary 4.3.
An important concept in information theory which is useful for many purposes and which will allow us to prove the above is the concept of relative
entropy. This concept comes up in many other places; below we will explain three other places where it arises, which are respectively
(1) If I construct a prefix–free code satisfying E[N ] < HD (X) + 1 under the
assumption that the true distribution of X is q1 , . . . , qn but it turns out the
true distribution is p1 , . . . , pn , how bad will E[N ] be?
(2) (level 2) large deviation theory for the Glivenko–Cantelli Theorem (which
is called Sanov’s Theorem)
(3) universal data compression for i.i.d. processes.
The first of these will be done in this section, the second in §5, while the last will be done in §6.
Definition 4.5: If $p = p_1, \ldots, p_n$ and $q = q_1, \ldots, q_n$ are two probability distributions on $\mathcal{X}$, the relative entropy or Kullback–Leibler distance between p and q is
$$D(p\|q) = \sum_{i=1}^{n} p_i \ln \frac{p_i}{q_i}.$$
Note that if for some i, $p_i > 0$ and $q_i = 0$, then the relative entropy is $\infty$. One should think of this as a metric BUT it's not: it's neither symmetric nor does it satisfy the triangle inequality. One actually needs to specify the base of the logarithm in this definition.
The first thing we need is the following whose proof is an easy application
of Jensen’s inequality which is left to the reader.
Theorem 4.6: Relative entropy (no matter which base we use) is nonnegative and 0 if and only if the two distributions are the same.
Proof of Theorem 4.4: Let C be a prefix–free code with $\ell_i$ denoting the length of the ith codeword and N denoting the length of the codeword as a r.v.. Simple manipulation gives
$$E[N] - H_D(X) = \sum_i p_i \ln_D\Big(\frac{p_i}{D^{-\ell_i}/\sum_j D^{-\ell_j}}\Big) - \ln_D\Big(\sum_i D^{-\ell_i}\Big).$$
The first sum is nonnegative since it's a relative entropy and the second term is nonnegative by Kraft's Inequality.
To show there is a prefix–free code satisfying $E[N] < H_D(X) + 1$, we simply construct it. Let $\ell_i = \lceil \ln_D(1/p_i) \rceil$. A trivial computation shows that the $\ell_i$'s satisfy the conditions of Kraft's Inequality and hence there is a prefix–free D–code which assigns the ith element a codeword of length $\ell_i$. A trivial calculation shows that $E[N] < H_D(X) + 1$. 2
How did we know to choose $\ell_i = \lceil \ln_D(1/p_i) \rceil$? If the $\ell_i$ were allowed to be non-integers, Lagrange multipliers show that the $\ell_i$'s minimizing $E[N]$ subject to the constraint of Kraft's inequality are $\ell_i = \ln_D(1/p_i)$.
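The following sketch (mine; the distributions are arbitrary examples) carries out the construction from the proof: the lengths $\ell_i = \lceil \ln_D(1/p_i)\rceil$ satisfy Kraft's inequality and give $H_D(X) \le E[N] < H_D(X) + 1$. Combined with the greedy construction after Theorem 4.2, they yield an explicit prefix-free code (the Shannon code).

```python
import math

def shannon_code_lengths(p, D=2):
    return [math.ceil(math.log(1 / pi, D)) for pi in p]

def check(p, D=2):
    lengths = shannon_code_lengths(p, D)
    kraft = sum(D ** (-l) for l in lengths)
    EN = sum(pi * l for pi, l in zip(p, lengths))
    HD = -sum(pi * math.log(pi, D) for pi in p)
    print("Kraft sum:", kraft, "E[N]:", EN, "H_D:", HD, "H_D + 1:", HD + 1)

check([0.4, 0.3, 0.2, 0.1], D=2)
check([0.4, 0.3, 0.2, 0.1], D=3)
```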
There is an easy corollary to Theorem 4.4 whose proof is left to the
reader, which we now give.
Corollary 4.7: The minimum expected codeword length $L_n^*$ per symbol satisfies
$$\frac{H(X_1, \ldots, X_n)}{n} \le L_n^* \le \frac{H(X_1, \ldots, X_n)}{n} + \frac{1}{n}.$$
Moreover, if $X_1, \ldots$ is a stationary process with entropy H, then
$$L_n^* \to H.$$
This particular prefix–free code, called the Shannon Code is not necessarily
optimal (although it’s not bad). There is an optimal code called the Huffman
code which I will not discuss. It turns out also that the Shannon Code is
close to optimal in a certain sense (and in a better sense than that the mean
code length is at most 1 from optimal which we know).
We now discuss our first application of relative entropy (besides our using
it in the proof of Theorem 4.4). We mentioned the following question above.
If I construct the Shannon prefix–free code satisfying E[N ] < HD (X) + 1
under the assumption that the true distribution of X is q1 , . . . , qn but it
turns out the true distribution is p1 , . . . , pn , how bad will E[N ] be? The
answer is given by the following theorem.
Theorem 4.8: The expected length $E_p[N]$ under p(x) of the Shannon code $\ell_i = \lceil \ln_D(1/q_i) \rceil$ satisfies
$$H_D(p) + D(p\|q) \le E_p[N] < H_D(p) + D(p\|q) + 1.$$
Proof: Trivial calculation left to the reader. 2
5 Sanov's Theorem: another application of relative entropy
In this section, we give an important application (or interpretation) of relative entropy, namely (level 2) large deviation theory.
Before doing this, let me remind you (or tell you) what (level 1) large
deviation theory is (this is the last section of the first chapter in Durrett’s
probability book).
(Level 1) large deviation theory simply says the following modulo the
technical assumptions. Let Xi be i.i.d. with finite mean m and have a
moment generating function defined in some neighborhood of the origin.
First, the WLLN tells us
$$P\Big(\Big|\frac{\sum_{i=1}^{n} X_i}{n} - m\Big| \ge \epsilon\Big) \to 0$$
as $n \to \infty$ for any $\epsilon$. (For this, we of course do not need the exponential moment.) (Level 1) large deviation theory tells us how fast the above goes to 0 and the answer is, it goes to 0 exponentially fast; more precisely, it goes like $e^{-f(\epsilon)n}$ where $f(\epsilon) > 0$ is the so-called Fenchel (Legendre) transform of the logarithm of the moment generating function, given by
$$f(\epsilon) = \max_{\theta}\big(\theta\epsilon - \ln(E[e^{\theta X}])\big).$$
Level two large deviation theory deals with a similar question but at the
level of the empirical distribution function. Let Xi be i.i.d. again and let Fn
be the empirical distribution of X1 , . . . , Xn . Glivenko–Cantelli tells us that
Fn → F a.s. where F is the common distribution of the Xi ’s. If C is a
closed set of measures not containing F , it follows (we’re now doing weak
convergence of probability measures on the space of probability measures on
R so you have to think carefully) that
P (Fn ∈ C) → 0
and we want to know how fast. The answer is
$$\lim_{n\to\infty} -\frac{1}{n}\ln(P(F_n \in C)) = \min_{P \in C} D(P\|F).$$
The above is called Sanov’s Theorem which we now prove.
There will be a fair development before getting to this which I feel will
be useful and worth doing (and so this might not be the quickest route to
Sanov’s Theorem). All of this material is taken from Chapter 12 of Cover
and Thomas’s book.
If x = x1 , . . . xn is a sample from an i.i.d. process with distribution
Q, then Px , which we call the type of x, is defined to be the empirical
distribution (which I will assume you know) of x.
Definition 5.1: We will let Pn denote the set of all types (i.e., empirical
distributions) from a sample of size n and given P ∈ Pn , we let T (P ) be all
x = x1 , . . . xn (i.e., all samples) which have type (empirical distribution) P .
EXERCISE: If Q has its support on {1, . . . , 7}, find |T (P )| where P is the
type of 11321.
Lemma 5.2: $|\mathcal{P}_n| \le (n+1)^{|\mathcal{X}|}$ where $\mathcal{X}$ is the set of possible values our process can take on.
This lemma, while important, is trivial and left to the reader.
Lemma 5.3: Let X1 , . . . be i.i.d. according to Q and let Px denote the type
of x. Then
$$Q^n(x) = 2^{-n(H(P_x) + D(P_x\|Q))}$$
and so the $Q^n$ probability of x depends only on its type.
Proof: Trivial calculation (although a few lines) left to the reader. 2
Lemma 5.4: Let X1 , . . . be i.i.d. according to Q. If x is of type Q, then
$$Q^n(x) = 2^{-nH(Q)}.$$
Our next result gives us an estimate of the size of a type class T (P ).
Lemma 5.5:
$$\frac{1}{(n+1)^{|\mathcal{X}|}}\, 2^{nH(P)} \le |T(P)| \le 2^{nH(P)}.$$
Proof: One method is to write down the exact cardinality of the set (using
elementary combinatorics) and then apply Stirling’s formula. We however
proceed differently.
The upper bound is trivial. Since each $x \in T(P)$ has (by the previous lemma) under $P^n$, probability $2^{-nH(P)}$, there can be at most $2^{nH(P)}$ elements in T(P).
For the lower bound, the main step is to show that type class P has the
largest probability of all type classes (note however, that an element from
type class P does not necessarily have a higher probability than elements
from other type classes as you should check), i.e.,
$$P^n(T(P)) \ge P^n(T(P'))$$
for all P 0 . I leave this to you (although it takes some work, see the book if
you want).
Then we argue as follows. Since there are at most $(n+1)^{|\mathcal{X}|}$ type classes, and T(P) has the largest probability, its $P^n$–probability must be at least $\frac{1}{(n+1)^{|\mathcal{X}|}}$. Since each element of this type class has $P^n$–probability $2^{-nH(P)}$, we obtain the lower bound. 2
Combining Lemmas 5.3 and 5.5, we immediately obtain
Corollary 5.6:
$$\frac{1}{(n+1)^{|\mathcal{X}|}}\, 2^{-nD(P\|Q)} \le Q^n(T(P)) \le 2^{-nD(P\|Q)}.$$
Remark: The above tells us that if we flip n fair coins, the probability that
we get exactly 50% heads, while going to 0 as n → ∞, does not go to 0
exponentially fast.
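To back up this remark numerically, here is a small sketch (mine) comparing the exact probability of exactly 50% heads with the polynomial lower bound $(n+1)^{-|\mathcal{X}|}$ coming from Corollary 5.6 (here the type equals $Q$, so $D(P\|Q) = 0$ and $|\mathcal{X}| = 2$):

```python
import math

def prob_exactly_half_heads(n):
    """Exact probability of n/2 heads in n fair flips (n even)."""
    return math.comb(n, n // 2) / 2 ** n

for n in [10, 100, 1000]:
    exact = prob_exactly_half_heads(n)
    lower_bound = 1 / (n + 1) ** 2        # Corollary 5.6 with |X| = 2, D = 0
    print(n, exact, lower_bound)          # exact ~ 1/sqrt(n): not exponential decay
```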
The above discussion has set us up for an easy proof of the Glivenko–Cantelli
Theorem (analogous to the fact that once one has level 1 large deviation
theory set up, the Strong Law of Large Numbers follows trivially).
The key that makes the whole thing work is the fact (Corollary 5.6) that
the probability under Q that a certain type class (different from Q) arises is
exponentially small (the exponent being given by the relative entropy) and
the fact that there are only polynomially many type classes.
Theorem 5.7: Let X1 , . . . be i.i.d. according to P . Then
$$P\big(D(P_{x^n} \| P) > \epsilon\big) \le 2^{-n\big(\epsilon - |\mathcal{X}|\frac{\ln(n+1)}{n}\big)}$$
and hence (by Borel–Cantelli)
$$D(P_{x^n} \| P) \to 0 \quad \text{a.s.}$$
Proof: Using Corollary 5.6,
$$P^n\big(D(P_{x^n} \| P) > \epsilon\big) \le \sum_{Q \in \mathcal{P}_n :\, D(Q\|P) \ge \epsilon} 2^{-nD(Q\|P)} \le (n+1)^{|\mathcal{X}|}\, 2^{-n\epsilon} = 2^{-n\big(\epsilon - |\mathcal{X}|\frac{\ln(n+1)}{n}\big)}. \quad 2$$
We are now ready to state Sanov’s Theorem. We are doing everything under
the assumption that the underlying distribution has finite support (i.e., there
are only a finite number of values our r.v. can take on). One can get rid of
this assumption but we stick to it here since it gets rid of some technicalities.
We can view the set of all probability measures (call it $\mathcal{P}$) on our finite outcome space (of size $|\mathcal{X}|$) as a subspace of $\mathbb{R}^{|\mathcal{X}|}$ (or even of $\mathbb{R}^{|\mathcal{X}|-1}$). (In the former case, it would be a simplex in the positive cone.) This immediately gives us a nice topology on these measures (which of course is nothing but the weak topology in a simple setting). Assuming that P gives every $x \in \mathcal{X}$ positive measure (if not, change $\mathcal{X}$), note that $D(Q\|P)$ is a continuous function of Q on $\mathcal{P}$ and hence on any closed (and hence compact) set E in $\mathcal{P}$, this function assumes a minimum (which is not 0 if $P \notin E$).
Theorem 5.8 (Sanov’s Theorem): Let X1 , . . . be i.i.d. with distribution
P and E be a closed set in $\mathcal{P}$. Then
$$P^n(E)\ \big(= P^n(E \cap \mathcal{P}_n)\big) \le (n+1)^{|\mathcal{X}|}\, 2^{-nf(E)}$$
where $f(E) = \min_{Q\in E} D(Q\|P)$. ($P^n(E)$ means of course $P^n(\{x : P_x \in E\})$.)
If, in addition, E is the closure of its interior, then the above exponential upper bound also gives a lower bound in that
$$\lim_{n\to\infty} \frac{1}{n}\ln P^n(E) = -f(E).$$
Proof: The derivation of the upper bound follows easily from Theorem 5.7
which we leave to the reader. The lower bound is not much harder but a
little care is needed (as should be clear from the topological assumption on
E). We need to be able to find guys in E ∩Pn . Since the Pn ’s become “more
and more” dense as n gets large, it is easy to see (why?) that $E \cap \mathcal{P}_n \neq \emptyset$ for large n (this just uses the fact that E has nonempty interior) and that there exist $Q_n \in E \cap \mathcal{P}_n$ with $Q_n \to Q^*$ where $Q^*$ is some distribution which minimizes $D(Q\|P)$ over $Q \in E$. This immediately gives that $D(Q_n\|P) \to D(Q^*\|P)\ (= f(E))$.
We now have
$$P^n(E) \ge P^n(T(Q_n)) \ge \frac{1}{(n+1)^{|\mathcal{X}|}}\, 2^{-nD(Q_n\|P)}$$
from which one immediately concludes that
$$\liminf_n \frac{1}{n}\ln P^n(E) \ge \liminf_n\, \big[-D(Q_n\|P)\big] = -f(E).$$
The first part gives this as an upper bound and so we’re done. 2
Note that the proof goes through if one of the Q∗ ’s which minimize D(Q||P )
over Q ∈ E (of which there must be at least one due to compactness) is in
the closure of the interior of E. It is in fact also the case (although not at
all obvious and related to other important things) that if E is also convex,
then there is a unique minimizing Q∗ and so there is a “Hilbert space like”
picture here. (If E is not convex, it is easy to see that there is not necessarily
a unique minimizing guy). The fact that there is a unique minimizing guy
(in the convex case) is suggested by (but not implied by) the convexity of
relative entropy in its two arguments.
Hard Exercise: Recover the level 1 large deviation theory from Sanov’s
Theorem using Lagrange multipliers.
6 Universal Entropy Estimation and Data Compression
We first use relative entropy to prove the possibility of universal data compression for i.i.d. processes. Later on, we will do the much more difficult
universal data compression for general ergodic sources, the so–called Ziv–Lempel algorithm which is used in the real world (as far as I understand
it). Of course, from a practical point of view, the results of Chapter 1 are
useless since typically one does NOT know the process that one is observing
and so one wants a “universal” way to compress data. We will now do this
for i.i.d. processes where it is quite easy using the things we derived in §5.
We have seen previously in §4 that if we know the distribution of a r.v. x, we can encode x by assigning it a word which has length roughly $-\ln(p(x))$. We have also seen that if the distribution is really q, we get a penalty of $D(p\|q)$. We
now find a code “of rate R” which suffices for all i.i.d. processes of entropy
less than R.
Definition 6.1: A fixed rate block code of rate R for a process $X_1, \ldots$ with known state space $\mathcal{X}$ consists of two mappings, the encoder
$$f_n : \mathcal{X}^n \to \{1, \ldots, 2^{nR}\}$$
and a decoder
$$\phi_n : \{1, \ldots, 2^{nR}\} \to \mathcal{X}^n.$$
We let
$$P_e^{(n,Q)} = Q^n\big(\phi_n(f_n(X_1, \ldots, X_n)) \neq (X_1, \ldots, X_n)\big)$$
be the probability of an error in this coding scheme under the assumption that Q is the true distribution of the $X_i$'s.
Theorem 6.2: There exists a fixed rate block code of rate R such that for
all Q with H(Q) < R,
$$P_e^{(n,Q)} \to 0 \text{ as } n \to \infty.$$
Proof: Let $R_n = R - |\mathcal{X}|\frac{\ln(n+1)}{n}$ and let $A_n = \{x \in \mathcal{X}^n : H(P_x) \le R_n\}$. Then
$$|A_n| = \sum_{P \in \mathcal{P}_n,\, H(P) \le R_n} |T(P)| \le (n+1)^{|\mathcal{X}|}\, 2^{nR_n} = 2^{nR}.$$
Now we index the elements of $A_n$ in an arbitrary way and we let $f_n(x) =$ index of x in $A_n$ if $x \in A_n$ and 0 otherwise. The decoding map $\phi_n$ takes of course the index into the corresponding word. Clearly, we obtain an error (at the nth stage) if and only if $x \notin A_n$. We then get
$$P_e^{(n,Q)} = 1 - Q^n(A_n) = \sum_{P : H(P) > R_n} Q^n(T(P)) \le \sum_{P : H(P) > R_n} 2^{-nD(P\|Q)} \le (n+1)^{|\mathcal{X}|}\, 2^{-n \min_{P : H(P) > R_n} D(P\|Q)}.$$
Now $R_n \to R$ and $H(Q) < R$ and so $R_n > H(Q)$ for large n, and so the exponent $\min_{P : H(P) > R_n} D(P\|Q)$ is bounded away from 0 and so $P_e^{(n,Q)}$ goes exponentially to 0 for fixed Q. Actually, this last argument was a little sloppy and one should be a little more careful with the details. One first should note that compactness together with the fact that $D(P\|Q)$ is 0 only when P = Q implies that if we stay away from any neighborhood of Q, $D(P\|Q)$ is bounded away from 0. Then note that if $R_n > H(Q)$, the continuity of the entropy function tells us that $\{P : H(P) > R_n\}$ misses some neighborhood of Q and hence $\min_{P : H(P) > R_n} D(P\|Q) > 0$ and clearly this is increasing in n (look at the derivative of $\frac{\ln x}{x}$). 2
Remarks: (1) Technically, 0 is not an allowable image. Fix this (it’s trivial).
(2) For fixed Q, since the above probability goes to 0 exponentially fast, Borel–
Cantelli tells us the coding scheme will work eventually forever (i.e., we have
an a.s. statement rather than just an “in probability” statement).
(3) There is no uniformity in Q but there is a uniformity in Q for H(Q) ≤
$R - \epsilon$. (Show this).
(4) For Q with H(Q) > R, the above guessing scheme works with probability
going to 0. Could there be a different guessing scheme which works?
The above theorem produces for us a “universal data compressor” for
i.i.d. sequences.
Exercise: Construct an entropy estimator which is universal for i.i.d. sequences.
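One natural solution to this exercise (a sketch of mine, not necessarily the intended one) is the plug-in estimator $h_n(x_1,\ldots,x_n) = H(P_x)$, the entropy of the type of the observed word; by Theorem 5.7 (or just the SLLN plus continuity of entropy) it converges a.s. to $H(Q)$ for every i.i.d. Q.

```python
import math
import random
from collections import Counter

def empirical_entropy(word):
    """Entropy of the empirical distribution (type) of the observed word."""
    n = len(word)
    counts = Counter(word)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

probs = [0.6, 0.3, 0.1]
true_H = -sum(p * math.log(p) for p in probs)
rng = random.Random(1)
for n in [100, 10_000, 1_000_000]:
    word = rng.choices(range(len(probs)), weights=probs, k=n)
    print(n, empirical_entropy(word), "true entropy:", true_H)
```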
We now construct a “universal entropy estimator” which works for all ergodic processes. What we construct is not technically a “universal entropy estimator” according to the definition given in §3, but it is analogous and can easily be modified to be one. We will
later discuss how this procedure is related to a universal data compressor.
Theorem 6.3: Let $(x_1, x_2, \ldots)$ be an infinite word and let
$$R_n(x) = \min\{j \ge n : x_1, x_2, \ldots, x_n = x_{j+1} x_{j+2} \ldots x_{j+n}\}$$
(which is the first time after n when the first n–block repeats itself). Then
$$\lim_{n\to\infty} \frac{\log R_n(X_1, X_2, \ldots)}{n} = H \quad \text{a.s.}$$
where H is the entropy of the process.
Rather than writing out the proof of that, we will read the paper “Entropy and Data Compression Schemes” by Ornstein and Weiss. This paper
will be much more demanding than these notes have been up to now.
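Here is a brute-force rendering of the quantity in Theorem 6.3 (my sketch; the i.i.d. test source is arbitrary and no attempt is made at efficiency, so the sample must be long enough for the first block to recur):

```python
import math
import random

def R_n(x, n):
    """First j >= n with x[0:n] == x[j:j+n] (indices shifted to start at 0)."""
    block = x[:n]
    j = n
    while x[j:j + n] != block:
        j += 1
        if j + n > len(x):
            raise ValueError("sequence too short; generate a longer sample")
    return j

probs = [0.5, 0.25, 0.25]
H = -sum(p * math.log(p) for p in probs)
rng = random.Random(0)
x = rng.choices(range(len(probs)), weights=probs, k=4_000_000)
for n in [4, 8, 12]:
    print(n, math.log(R_n(x, n)) / n, "entropy:", H)
```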
7 Shannon's Theorems on Capacity
In this section, we discuss and prove the fundamental theorem of channel
capacity due to Shannon in his groundbreaking work in the field. Before
going into the mathematics, we should first give a small discussion.
Assume that we have a channel to which one can input a 0 or a 1 and
which outputs a 0 or 1 from which we want to recover the input. Let’s assume
that the channel transmits the signal correctly with probability 1 − p and
corrupts it (and therefore sends the other value) with probability p. Assume
that p < 1/2 (if p = 1/2, the channel clearly would be completely useless).
Based on the output we receive, our best guess at the input would of course
be the output in which case the probability of making an error would be
p. If we want to decrease the probability of making an error, we could
do the following. We could simply send our input through the channel 3
times and then based on the 3 outputs, we would guess the input to be
that output which occurred the most times. This coding/decoding scheme
will work as long as the channel does not make 2 or more errors and the
probability of this is (for small p) much much smaller (≈ p2 ) than if we were
to send just 1 bit through. Great! The probability is much smaller now BUT
there is a price to pay for this. Since we have to send 3 bits through the
channel to have one input bit guess, the “rate of transmission” has suddenly
dropped to 1/3. Anyway, if we are still dissatisfied with the probability of
this scheme making an error, we could send the bit through 5 times (using an
analogous majority rule decoding scheme) thereby decreasing the probability
of a decoding mistake much more (≈ p3 ) but then the transmission rate
would go even lower to 1/5. It seems that in order to achieve the probability
of error going to 0, we need to drop the rate of transmission closer and
closer to 0. The amazing discovery of Shannon is that this very reasonable
statement is simply false. It turns out that if we are simply willing to use a
rate of transmission which is less than “channel capacity” (which is simply
some number depending only on the channel (in our example, this number
is strictly positive as long as $p \neq \frac{1}{2}$ which is reasonable)), then as long as
we send long strings of input, we can transmit them with arbitrarily low
probability of error. In particular, the rate need not go to 0 to achieve
arbitrary reliability.
We now enter the mathematics which explains more precisely what the
above is all about and proves it. The formal definition of capacity of a
channel is based on a concept called mutual information of two r.v.’s which
is denoted by I(X; Y ).
Definition 7.1: I(X; Y ), called the mutual information of X and Y , is
the relative entropy of the joint distribution of X and Y and the product
distribution of X and Y .
Note that I(X; Y ) and I(Y ; X) are the same and that they are not infinite
(although recall in general relative entropy can be infinite).
Definition 7.2: The conditional entropy of Y given X, denoted $H(Y|X)$, is defined to be
$$H(Y|X) = \sum_x P(X = x)\, H(Y \mid X = x).$$
Proposition 7.3:
(1)H(X, Y ) = H(X) + H(Y |X)
(2) I(X; Y ) = H(X) − H(X|Y ).
The proof of these are left to the reader. There is again no idea involved,
simply computation.
Definition 7.4: A discrete memoryless channel (DMC) is a system
which takes as input elements of an input alphabet X and prints out elements
from an output alphabet Y randomly according to some p(y|x) (i.e., if x is
sent in, then y comes out with probability p(y|x)). Of course the numbers
p(y|x) are nonnegative and for fixed x give 1 if we sum over y. (Such a thing
is sometimes called a kernel in probability theory and could be viewed as a Markov transition matrix if the sets $\mathcal{X}$ and $\mathcal{Y}$ were the same (which they
need not be)).
Definition 7.5: The information channel capacity of a discrete memoryless channel is
$$C = \max_{p(x)} I(X;Y).$$
Exercise: Compute this for different examples like X = Y = {0, 1} and the
channel being p(x|x) = p for all x.
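For the exercise, here is a sketch (mine) that computes $I(X;Y)$ for the binary symmetric channel with crossover probability $p$ over a grid of input distributions; the maximum, attained at the uniform input, is $1 - H(p)$ bits, which is the capacity.

```python
import numpy as np

def mutual_information(px, channel):
    """I(X;Y) in bits for input distribution px and channel matrix p(y|x)."""
    px = np.asarray(px, float)
    channel = np.asarray(channel, float)
    joint = px[:, None] * channel                  # p(x, y)
    py = joint.sum(axis=0)                         # p(y)
    mask = joint > 0
    return np.sum(joint[mask] * np.log2(joint[mask] / (px[:, None] * py[None, :])[mask]))

p = 0.1                                            # crossover probability
bsc = [[1 - p, p], [p, 1 - p]]
best = max((mutual_information([a, 1 - a], bsc), a) for a in np.linspace(0.01, 0.99, 99))
H_p = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
print("max I(X;Y):", best[0], "at p(0) =", best[1])
print("1 - H(p)  :", 1 - H_p)
```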
We assume now that if we send a sequence through the channel, the outputs
are all independently chosen from p(y|x) (i.e., the outputs occur independently, an assumption you might object to).
Definition 7.6: An (M, n) code C consists of the following:
1. An index set {1, . . . , M }.
2. An encoding function X n : {1, . . . , M } → X n which yields the codewords
X n (1), . . . , X n (M ) which is called the codebook.
3. A decoding function g : Y n → {1, . . . , M }.
For $i \in \{1, \ldots, M\}$, we let $\lambda_i^n$ be the probability of an error if we encode and send i over the channel, that is $P(g(Y^n) \neq i)$ when i is encoded and sent over the channel. We also let $\lambda^n = \max_i \lambda_i^n$ be the maximal error and
$$P_e^n = \frac{1}{M}\sum_i \lambda_i^n$$
be the average probability of error. We attach a superscript
C to all of these if the code is not clear from context.
Exercise: Given any DMC, find an (M, n) code such that λn1 = 0.
Definition 7.7: The rate of an (M, n) code is $\frac{\ln M}{n}$ (where we use base 2).
We now arrive at one of the most important theorems in the field (as far as
I understand it), which says in words that all rates below channel capacity
can be achieved with arbitrarily small probability of error.
Theorem 7.8 (Shannon's Theorem on Capacity): All rates below capacity C are achievable in that given R < C, there exists a sequence of $(2^{nR}, n)$–codes with maximum probability of error $\lambda^n \to 0$. Conversely, if there exists a sequence of $(2^{nR}, n)$–codes with maximum probability of error $\lambda^n \to 0$, then $R \le C$.
Before presenting the proof of this result, we make a few remarks which
should be of help to the reader. In the definition of an (M, n)–code, we have
this set {1, . . . , M } and an encoding function X n mapping this set into X n .
This way of defining a code is not so essential. The only thing that really
matters is what the image of X n is and so one could really have defined an
(M, n)–code to be a subset of X n consisting of M elements together with the
decoding function g : Y n → X n . Formally, the only reason these definitions
are not EXACTLY the same is that X n might not be injective which is
certainly a property you want your code to have anyway. The point is one
should also think of a code in this way.
Proof: The proof of this is quite long. While the mathematics is all very
simple, conceptually it is a little more difficult. The method is a randomization method where we choose a code “at random” (according to some
distribution) and show that with high probability it will be a good code.
The fact that we are choosing the code according to a special distribution
will make the computation of this probability tractable.
The key concept in the proof is the notion of “joint typicality”. The
SMB (for i.i.d.) tells us that a typical sequence $x_1, \ldots, x_n$ from an ergodic stationary process has $-\frac{1}{n}\ln(p(x_1, \ldots, x_n))$ close to the entropy H(X).
If we have instead independent samples from a joint distribution p(x, y), we
want to know what x1 , y1 , . . . , xn , yn typically looks like and this gives us
the concept of joint typicality. An important point is that x1 , . . . , xn and
y1 , . . . , yn can be individually typical without being jointly typical. We now
give a completely heuristic proof of the theorem.
Given the output, we will choose as our guess for the input something
(which encodes to something) which is jointly typical with the output. Since
the true input and the output will be jointly typical with high probability,
there will with high probability be some input which is jointly typical with
the output, namely the true input. The problem is maybe there is more
than 1 possible input sequence which is jointly typical with the output. (If
there is only one, there is no problem.) The probability that a possible input
sequence which was not input to the output is in fact jointly typical with
the output sequence should be more or less the probability that a possible
input and output sequence chosen independently are jointly typical which is
$2^{-nI(X;Y)}$ by a trivial computation. Hence if we use $2^{nR}$ with $R < I(X;Y)$
different possible input sequences, then, with high probability, none of the
possible input sequences (other than the true one) will be jointly typical
with the output and we will decode correctly.
We need the following lemma whose proof is left to the reader. It is easy
to prove using the methods in §1 when we looked at consequences of the
SMB Theorem. An arbitrary element $x_1, \ldots, x_n$ of $\mathcal{X}^n$ will be denoted by $x^n$ below.
Lemma 7.9: Let p(x, y) be some joint distribution. Let $A_\epsilon^n$ be the set of $\epsilon$–jointly typical sequences (with respect to p(x, y)) of length n, namely,
$$A_\epsilon^n = \Big\{(x^n, y^n) \in \mathcal{X}^n \times \mathcal{Y}^n : \big|-\tfrac{1}{n}\ln p(x^n) - H(X)\big| < \epsilon,\ \big|-\tfrac{1}{n}\ln p(y^n) - H(Y)\big| < \epsilon, \text{ and } \big|-\tfrac{1}{n}\ln p(x^n, y^n) - H(X,Y)\big| < \epsilon\Big\}$$
where $p(x^n, y^n) = \prod_i p(x_i, y_i)$. If $(X^n, Y^n)$ is drawn i.i.d. according to p(x, y), then
1. $P((X^n, Y^n) \in A_\epsilon^n) \to 1$ as $n \to \infty$.
2. $|A_\epsilon^n| \le 2^{n(H(X,Y)+\epsilon)}$.
3. If $(\tilde{X}^n, \tilde{Y}^n)$ is chosen i.i.d. according to p(x)p(y), then
$$P((\tilde{X}^n, \tilde{Y}^n) \in A_\epsilon^n) \le 2^{-n(I(X;Y)-3\epsilon)}$$
and for sufficiently large n
$$P((\tilde{X}^n, \tilde{Y}^n) \in A_\epsilon^n) \ge (1-\epsilon)\, 2^{-n(I(X;Y)+3\epsilon)}.$$
We now proceed with the proof.
Let R < C. We now construct a sequence of $(2^{nR}, n)$–codes ($2^{nR}$ means $\lfloor 2^{nR} \rfloor$ here) whose maximum probability of error $\lambda^n$ goes to 0. We will first prove the existence of a sequence of codes which have a small average error (in the limit) and then modify it to get something with small maximal error (in the limit). By the definition of capacity, we can choose p(x) so that $R < I(X;Y)$ and then we can choose $\epsilon$ so that $R + 3\epsilon < I(X;Y)$.
Let $\mathcal{C}_n$ be the set of all encoding functions $X^n : \{1, 2, \ldots, 2^{nR}\} \to \mathcal{X}^n$. Our encoding function will be chosen randomly according to p(x) in that each $i \in \{1, 2, \ldots, 2^{nR}\}$ is independently sent to $x^n$ where $x^n$ is chosen according to $\prod_{i=1}^{n} p(x_i)$. Once we have chosen some encoding function $X^n$, the decoding procedure is as follows. If we receive $y^n$, our decoder guesses that the word $W \in \{1, 2, \ldots, 2^{nR}\}$ has been sent if (1) $X^n(W)$ and $y^n$ are $\epsilon$–jointly typical and (2) there is no other $\hat{W} \in \{1, 2, \ldots, 2^{nR}\}$ with $X^n(\hat{W})$ and $y^n$ being $\epsilon$–jointly typical. We also let $\mathcal{C}_n$ denote the set of codes obtained by considering all encoding functions and their resulting codes.
Let $E^n$ denote the event that a mistake is made when $W \in \{1, 2, \ldots, 2^{nR}\}$ is chosen uniformly. We first show that $P(E^n) \to 0$ as $n \to \infty$.
$$P(E^n) = \sum_{C \in \mathcal{C}_n} P(C)\, P_e^{(n),C} = \sum_{C \in \mathcal{C}_n} P(C)\, \frac{1}{2^{nR}}\sum_{w=1}^{2^{nR}} \lambda_w(C) = \frac{1}{2^{nR}}\sum_{w=1}^{2^{nR}} \sum_{C \in \mathcal{C}_n} P(C)\, \lambda_w(C).$$
By symmetry $\sum_{C \in \mathcal{C}_n} P(C)\lambda_w(C)$ does not depend on w and hence the above is $\sum_{C \in \mathcal{C}_n} P(C)\lambda_1(C) = P(E^n \mid W = 1)$. One should easily see directly anyway that symmetry gives $P(E^n) = P(E^n \mid W = 1)$. Because of the independence with which the encoding function and the word W were chosen, we can instead consider the procedure where we choose the code at random as above but always transmit the first word 1. In the new resulting probability space, we still let $E^n$ denote the event of a decoding mistake. We also for each $i \in \{1, 2, \ldots, 2^{nR}\}$ have the event $E_i^n = \{(X^n(i), Y^n) \in A_\epsilon^n\}$ where $Y^n$ is the output of the channel. We clearly have $E^n \subseteq (E_1^n)^c \cup \bigcup_{i=2}^{2^{nR}} E_i^n$. Now $P(E_1^n) \to 1$ as $n \to \infty$ by the joint typicality lemma. Next, the independence of $X^n(i)$ and $X^n(1)$ (for $i \neq 1$) implies the independence of $X^n(i)$ and $Y^n$ (for $i \neq 1$), which gives (by the joint typicality lemma) $P(E_i^n) \le 2^{-n(I(X;Y)-3\epsilon)}$ and so
$$P\Big(\bigcup_{i=2}^{2^{nR}} E_i^n\Big) \le 2^{nR}\, 2^{-n(I(X;Y)-3\epsilon)} = 2^{-n(I(X;Y)-3\epsilon-R)}$$
which goes to 0 as $n \to \infty$.
Since we saw that $P(E^n)$ goes to 0 as $n \to \infty$ and $P(E^n) = \sum_{C \in \mathcal{C}_n} P(C)\, P_e^{(n),C}$, it follows that for all large n there is a code $C \in \mathcal{C}_n$ so that $P_e^{(n),C}$ is small.
Now that we know such a code exists, one can find it by exhaustive
search. There is much interest in finding such codes. We finally obtain
a code with maximum probability of error being small. Since the average
error is very small (say δ), it follows that if we take half of the codewords
with smallest error, the maximum error of these is also very small (at most
2δ, why?). We therefore obtain a new code by throwing away the larger
half of the codewords (in terms of their errors). Since we now have $2^{nR-1}$ codewords, the rate has been cut to $R - \frac{1}{n}$, which is negligible.
(Actually, by throwing away this set of codewords, we obtain a new code,
and since it is a new code, we should check that the maximal error is still
small. This is of course obvious but the point is that this non–problem
should come to mind).
CONVERSE:
We are now ready to prove the converse which is somewhat easier. We first
need two easy lemmas.
Lemma 7.10 (Fano's inequality): Consider a DMC with code C with input message uniformly chosen. Letting $P_e^{(n)} = P(W \neq g(Y^n))$, we have
$$H(X^n \mid Y^n) \le 1 + P_e^{(n)}\, nR.$$
Hint of Proof: Let E be the event of error and expand H(E, W |Y n ) in
two possible ways. Then think a little. 2
Lemma 7.11: Let $Y^n$ be the output when inputting $X^n$ through a DMC. Then for any distribution $p(x^n)$ on $\mathcal{X}^n$ (the n coordinates need not be independent),
$$I(X^n; Y^n) \le nC$$
where C is the channel capacity.
Proof: Left to the reader; a very straightforward computation with no thought needed. 2
We now complete the converse. Assume that we have a sequence of $(2^{nR}, n)$ codes with maximal error (or even average error) going to 0.
Let the word W be chosen uniformly over $\{1, \ldots, 2^{nR}\}$ and consider $P_e^{(n)} = P(\hat{W} \neq W)$ where $\hat{W}$ is our guess at the input W. We have
$$nR = H(W) = H(W \mid Y^n) + I(W; Y^n) \le H(W \mid Y^n) + I(X^n(W); Y^n) \le 1 + P_e^{(n)}\, nR + nC.$$
Dividing by n and letting $n \to \infty$ gives $R \le C$. 2
Remarks.
(1) The last step gives a lower bound on the average error probability:

P_e^{(n)} ≥ 1 − \frac{C}{R} − \frac{1}{nR},

which of course is only interesting when the transmission rate R is larger than the capacity C.
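For a feel for the numbers (all of them made up): with capacity C = 0.5 bits, rate R = 0.75 bits and block length n = 1000, this bound already forces roughly a third of the messages to be decoded incorrectly on average.

# Plugging made-up numbers into P_e^(n) >= 1 - C/R - 1/(nR).
C, R, n = 0.5, 0.75, 1000
print(1 - C / R - 1 / (n * R))      # about 0.332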
(2) Note that (in the first half of the proof) we proved that if p(x) is any distribution on X (not necessarily one which maximizes the mutual information), then choosing a code at random using the distribution p(x) instead will, with high probability, result in a code whose probability of error goes to 0 with n, provided we use a transmission rate R which is smaller than the mutual information between the input and output when X has distribution p(x).
(3) One can prove a stronger converse which says that for rates above capacity, the average probability of error must go exponentially to 1 (what does this exactly mean?), which makes capacity an even sharper dividing line.
Exercise: Consider a channel with 1, 2, 3 and 4 as both input and output, where 3 and 4 are always sent to themselves while each of 1 and 2 is sent to 1 or 2 with probability 1/2 each. What is the capacity of this DMC? Before computing, take a guess, and guess how it would compare to a channel with only 2 possible inputs but with perfect transmission. How would it compare to the same channel where 2 is not allowed as input?
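If you want to check your guess numerically, the sketch below runs the Blahut–Arimoto iteration on the transition matrix of this exercise. The number of iterations is an arbitrary choice (no convergence test is made), and the iteration itself is the standard one rather than anything specific to these notes.

# Blahut-Arimoto iteration for the channel of the exercise: inputs 3 and 4 go
# through untouched, inputs 1 and 2 each come out as 1 or 2 with probability 1/2.
import numpy as np

P = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])           # P[x, y] = p(y | x)

r = np.full(4, 0.25)                           # current guess at the optimal input law
for _ in range(200):
    q = r[:, None] * P                         # unnormalized q(x | y)
    q /= q.sum(axis=0, keepdims=True)
    logq = np.zeros_like(P)
    logq[P > 0] = np.log(q[P > 0])
    r = np.exp((P * logq).sum(axis=1))         # r(x) ~ exp(sum_y p(y|x) log q(x|y))
    r /= r.sum()

py = r @ P                                     # output distribution under r
mask = P > 0
capacity = float(np.sum((r[:, None] * P)[mask] * np.log2((P / py)[mask])))
print("optimal input distribution:", r)
print("capacity in bits:", capacity)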
Exercise: Given a DMC, construct a new one where we add a new symbol ∗ to both the input and output alphabets and where, if we input x ≠ ∗, the output has the same distribution as before (and so ∗ can't come out), while if ∗ is inputted, the output is uniform on all possible outputs (including ∗). What has happened to the capacity of the DMC?
We have two important theorems, namely Corollary 4.7 and Theorem 7.8.
These two theorems seem to be somewhat disjoint from each other but
the following theorem brings them together in a very nice way. We leave this theorem as a difficult exercise for the reader, who, using the methods developed in these notes, should be able to carry out the proof.
Theorem 7.12 (Source–channel coding theorem): Let V = V1 , . . . be
a stochastic process satisfying the AEP property (stationary ergodic suffices
for example). Then if H(V) < C, there exists a source channel code with P_e^{(n)} → 0.
Conversely, for any stationary stochastic process, if H(V) > C, then the
probability of error is bounded away from 0 (bounded away over all n and
all source codes).
One should think about why this theorem says that there is no better way to send a stationary process over a DMC with small probability of error than to do it in 2 steps: first compressing (as we saw we can do), so that only on the order of 2^{nH} strings need to be handled, and then sending the result over the channel using Theorem 7.8.
We end this section with a result which is perhaps somewhat out of place
but we place it here since mutual information was introduced in this section.
This result will be needed in the next section.
Theorem 7.13: The mutual information I(X; Y ) is a concave function of
p(x) for fixed p(y|x) and a convex function of p(y|x) for fixed p(x).
Proof: We give an outline of the proof which contains most of the details.
The first thing we need is the so-called log sum inequality, which says that if a_1, ..., a_n and b_1, ..., b_n are nonnegative then

\sum_i a_i \log \frac{a_i}{b_i} ≥ \Big(\sum_i a_i\Big) \log \frac{\sum_i a_i}{\sum_i b_i},

with equality if and only if a_i/b_i is constant.
This is proved by Jensen's inequality, as is the nonnegativity of relative entropy, but in fact it also follows easily from the latter, as follows. A trivial computation shows that if the above holds for a_1, ..., a_n and b_1, ..., b_n, then it holds if we multiply these vectors by any (possibly different) constants. Now just normalize them to be probability vectors and use the nonnegativity of relative entropy.
The log sum inequality can then be used to demonstrate the convexity
of relative entropy in its two arguments, i.e.,
D(λp_1 + (1 − λ)p_2 || λq_1 + (1 − λ)q_2) ≤ λD(p_1 || q_1) + (1 − λ)D(p_2 || q_2),

as follows.
The log sum inequality with n = 2 gives, for fixed x,

(λp_1(x) + (1 − λ)p_2(x)) \log \frac{λp_1(x) + (1 − λ)p_2(x)}{λq_1(x) + (1 − λ)q_2(x)} ≤ λp_1(x) \log \frac{λp_1(x)}{λq_1(x)} + (1 − λ)p_2(x) \log \frac{(1 − λ)p_2(x)}{(1 − λ)q_2(x)}.
Now summing over x gives the desired convexity.
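A quick numerical sanity check of this convexity statement, with made-up distributions p_1, p_2, q_1, q_2 on a five-point space and λ = 0.3:

# The mixed divergence should not exceed the mixture of the divergences.
import numpy as np

def D(p, q):                                   # relative entropy in nats
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(3)
p1, p2, q1, q2 = [v / v.sum() for v in (rng.random(5) + 0.1 for _ in range(4))]
lam = 0.3
lhs = D(lam * p1 + (1 - lam) * p2, lam * q1 + (1 - lam) * q2)
rhs = lam * D(p1, q1) + (1 - lam) * D(p2, q2)
print(lhs, "<=", rhs, lhs <= rhs)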
Now we prove the theorem.
Fixing p(y|x), we want to first show that I(X; Y ) is concave in p(x). We
have
I(X; Y) = H(Y) − H(Y | X) = H(Y) − \sum_x p(x) H(Y | X = x).
If p(y|x) is fixed, then p(y) is a linear function of p(x). Since H(Y ) is
a concave function of p(y) (a fact which follows easily from the convexity
of relative entropy in its two arguments or can be proven more directly),
it follows that H(Y ) is a concave function of p(x) (since linear composed
with concave is concave). The second term on the other hand is obviously
linear in p(x) and hence I(X; Y ) is concave in p(x) (concave minus linear is
concave).
For the other side, fix p(x); we want to show that I(X; Y) is convex in p(y|x). A simple computation shows that the joint distribution obtained by using a convex combination of p_1(y|x) and p_2(y|x) is simply the same convex combination of the joint distributions obtained by using these two conditional distributions. Similarly, the product measure obtained when using a convex combination of p_1(y|x) and p_2(y|x) is simply the same convex combination of the product measures obtained by using these two conditional distributions. Using the definition of mutual information as a relative entropy (between the joint distribution and the product of the marginals), these two observations together with the convexity of relative entropy in its two arguments allow an easy computation (left to the reader) which shows that I(X; Y) is convex in p(y|x), as desired. □
EXERCISE: It seems that the second argument above could perhaps also be
used to show that for fixed p(y|x), I(X; Y ) is convex in p(x). Where does
the argument break down?
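Here is a small numerical illustration of both halves of Theorem 7.13 via midpoint checks; the binary channels and input laws below are made up, and this is of course a check rather than a proof.

# Midpoint concavity of I(X;Y) in p(x) for a fixed channel, and midpoint
# convexity in p(y|x) for a fixed input law.
import numpy as np

def mutual_information(px, W):
    # I(X;Y) in bits for input law px and channel matrix W[x, y] = p(y|x)
    joint = px[:, None] * W
    py = joint.sum(axis=0)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2((joint / (px[:, None] * py[None, :]))[mask])))

W = np.array([[0.9, 0.1], [0.3, 0.7]])         # fixed channel
p1, p2 = np.array([0.2, 0.8]), np.array([0.7, 0.3])
mid = 0.5 * (p1 + p2)
print(mutual_information(mid, W) >= 0.5 * (mutual_information(p1, W) + mutual_information(p2, W)))

px = np.array([0.4, 0.6])                      # fixed input law
W1 = np.array([[0.8, 0.2], [0.1, 0.9]])
W2 = np.array([[0.6, 0.4], [0.5, 0.5]])
Wmid = 0.5 * (W1 + W2)
print(mutual_information(px, Wmid) <= 0.5 * (mutual_information(px, W1) + mutual_information(px, W2)))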
8 Rate Distortion Theory
The area of rate distortion theory is very important in information theory
and entire books have been written about it. Our presentation will be brief
but we want to get to one of the main theorems, whose development parallels
the theory in the previous section.
The simplest context in which rate distortion theory arises is quantization, which is the following problem. We have a random variable X which we want to quantize, that is, we want to represent the value of X by, say, one of two possible values. You get to choose the values and assign them to X as
you wish, that is, you get to choose a function f : R → R whose image has
only two points such that f (X) (thought of as the quantized version of X)
represents X well. What does “represent X well” mean? Perhaps we would
want E[(f (X) − X)2 ] to be small.
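As a concrete instance (my own illustration, with X a standard normal and only two quantization levels): take f(x) = sign(x)·sqrt(2/π), so that each half-line is represented by its conditional mean; this is in fact the best two-level quantizer for a standard normal, and the resulting mean squared error is 1 − 2/π ≈ 0.363. The Monte Carlo sketch below estimates it.

# Two-level quantization of a standard normal at the levels +/- sqrt(2/pi).
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(1_000_000)
level = np.sqrt(2 / np.pi)
fx = np.where(x >= 0, level, -level)
print(np.mean((fx - x) ** 2), 1 - 2 / np.pi)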
Recall that in the previous section, we needed to send a lot of data (i.e., take n large) in order to have small probability of error for a fixed rate which is below the channel capacity. Similarly, in this section, we will be considering “large n”, where n is the amount of data, in order to obtain another elegant result.
We let X denote the state space for our random variable and X̂ denote
the possible quantized versions (or representations) of our random variable.
We will now consider a generalization of the quantization problem which we
now formulate.
Definition 8.1: A distortion function is a mapping
d : X × X̂ → R+ .
d(x, x̂) should be thought of as some type of cost of representing the outcome
x by x̂. Given the above distortion function d, we define the distortion between two finite sequences x_1^n and x̂_1^n to be

d(x_1^n, x̂_1^n) = \frac{1}{n} \sum_{i=1}^n d(x_i, x̂_i),
which might be something you object to but which is required for our theory
to go through as we will do it.
Definition 8.2: A (2^{Rn}, n)–rate distortion code consists of an encoding function

f_n : X^n → {1, 2, ..., 2^{nR}}

and a decoding (reproduction) function

g_n : {1, 2, ..., 2^{nR}} → X̂^n.
If X1n ≡ X1 , . . . , Xn are random variables taking values in X and d is a
distortion function, then the distortion associated with the above code is
D = E[d(X1n , gn (fn (X1n )))].
The way to think of fn is that it breaks up the set of all possible outcomes
X n into 2nR sets (this is the quantization part) and then each quantized
piece is sent to some element of X̂ n which is supposed to “represent” the
original element of X n . Obviously, the larger we take R, the smaller we can
make D (e.g. if |X | = 2 and R = 1, we can get D = 0). In the previous
section, we wanted to take R large, while now we want to take it small.
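To make the definitions concrete, here is a toy (2^{Rn}, n)–rate distortion code of my own with n = 4, R = 1/2, X = X̂ = {0, 1} and Hamming distortion: the encoder keeps the first and third bits, and the decoder repeats each kept bit twice. For fair coin flips the distortion is 1/4, which the enumeration below confirms.

# A toy rate distortion code: 4 input bits, 2^{nR} = 4 indices.
from itertools import product

def f(x):                       # encoder: 4 bits -> index in {0, 1, 2, 3}
    return 2 * x[0] + x[2]

def g(w):                       # decoder: index -> reproduction word
    b0, b2 = w // 2, w % 2
    return (b0, b0, b2, b2)

def d(x, xhat):                 # per-word average Hamming distortion
    return sum(a != b for a, b in zip(x, xhat)) / len(x)

D = sum(d(x, g(f(x))) for x in product((0, 1), repeat=4)) / 16
print(D)                        # 0.25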
We now assume that we have a fixed distortion function d and a fixed distribution p(x) on X. We will assume that X_1^n ≡ X_1, ..., X_n are i.i.d. with distribution p(x). As we did in the last chapter, for simplicity, we will assume that everything (i.e., X and X̂) is finite.
Definition 8.3: The pair (R, D) is achievable if there exists a sequence of (2^{Rn}, n)–rate distortion codes (f_n, g_n) with

lim sup_n E[d(X_1^n, g_n(f_n(X_1^n)))] ≤ D.
Definition 8.4: The rate distortion function R(D), (where we still have
a fixed distribution p(x) on X and a fixed distortion function d) is defined
by
R(D) = inf{R : (R, D) is achievable }.
One should view R(D) as the smallest rate at which one can quantize while keeping the distortion at most D. This is clearly a quantity that one should be interested in, although it seems (analogous to (non-information) channel
capacity) that this would be impossible to compute. We will now define
something called the “information rate distortion” function which will play
the role in our rate distortion theory that the information channel capacity
did in the channel capacity theory. This quantity is something much easier
to compute (as was the information channel capacity) and the main theorem
will be that these two (the rate distortion function and the information rate
distortion function) are the same.
Definition 8.5: The information rate distortion function R^I(D) (where we still have a fixed distribution p(x) on X and a fixed distortion function d) is defined as follows. Let P_D be the set of all distributions p(x, x̂) on X × X̂ whose first marginal is p(x) and which satisfy E_{p(x,x̂)}[d(x, x̂)] ≤ D. Then we define

R^I(D) = \min_{p(x,x̂) ∈ P_D} I(X; X̂).
Long Exercise: Let X = X̂ = {0, 1} and consider the distortion function given by Hamming distance (which is 1 if the two guys are not the same and 0 otherwise). Assume that X has distribution pδ_1 + (1 − p)δ_0. Compute R^I(D) in this case.
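If you want to check your answer to the Long Exercise numerically, the brute-force grid search below approximates R^I(D) by ranging over the two conditional probabilities P(X̂ = 1 | X = 0) and P(X̂ = 1 | X = 1); the grid resolution and the particular values of p and D are arbitrary choices.

# Grid search for R^I(D): minimize I(X; Xhat) over joint laws with first
# marginal Bernoulli(p) and expected Hamming distortion at most D.
import numpy as np

def RI(p, D, steps=200):
    px = np.array([1 - p, p])
    best = np.inf
    grid = np.linspace(0.0, 1.0, steps + 1)
    for a in grid:                          # P(Xhat = 1 | X = 0)
        for b in grid:                      # P(Xhat = 1 | X = 1)
            if (1 - p) * a + p * (1 - b) > D:
                continue                    # violates the distortion constraint
            W = np.array([[1 - a, a], [1 - b, b]])
            joint = px[:, None] * W
            py = joint.sum(axis=0)
            mask = joint > 0
            I = float(np.sum(joint[mask] * np.log2(joint[mask] / (px[:, None] * py[None, :])[mask])))
            best = min(best, I)
    return best

print(RI(p=0.3, D=0.1))                     # compare with your closed-form answer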
We now state our main theorem and go directly to the proof. Recall that
both the rate distortion function and the information rate distortion function
are defined relative to the distribution p(x) on X and the distortion function
d.
Theorem 8.6: RI (D) = R(D), i.e., the rate distortion function and the
information rate distortion function are the same.
Lemma 8.7: RI (D) is a nonincreasing convex function of D.
Proof: The nonincreasing part is obvious. For the convexity, consider D_1 and D_2 and λ ∈ (0, 1). Choose (using compactness) p_1(x, x̂) and p_2(x, x̂) which minimize I(X; X̂) over the sets P_{D_1} and P_{D_2}. Then (by linearity of expectation)

p_λ ≡ λp_1(x, x̂) + (1 − λ)p_2(x, x̂) ∈ P_{λD_1 + (1−λ)D_2}.

By the second part of Theorem 7.13, we have

R^I(λD_1 + (1 − λ)D_2) ≤ I_{p_λ}(X; X̂) ≤ λI_{p_1}(X; X̂) + (1 − λ)I_{p_2}(X; X̂) = λR^I(D_1) + (1 − λ)R^I(D_2). □
Proof of R(D) ≥ R^I(D): We need to show that if we have a sequence of (2^{Rn}, n)–rate distortion codes (f_n, g_n) with

lim sup_n E[d(X_1^n, g_n(f_n(X_1^n)))] ≤ D,

then R ≥ R^I(D). The encoding and decoding functions clearly give us a joint distribution on X^n × X̂^n (where the marginal on X^n is \prod_{i=1}^n p(x)), and we then have
nR ≥ H(X̂_1^n) ≥ H(X_1^n) − H(X_1^n | X̂_1^n) ≥ \sum_{i=1}^n H(X_i) − \sum_{i=1}^n H(X_i | X̂_i) = \sum_{i=1}^n I(X_i; X̂_i)
≥ \sum_{i=1}^n R^I(E[d(X_i, X̂_i)]) = n \cdot \frac{1}{n} \sum_{i=1}^n R^I(E[d(X_i, X̂_i)])
≥ n R^I\Big(\frac{1}{n} \sum_{i=1}^n E[d(X_i, X̂_i)]\Big) (by the convexity of R^I) = n R^I(E[d(X_1^n, X̂_1^n)]).
This gives R ≥ R^I(E[d(X_1^n, X̂_1^n)]) for all n. Since lim sup_n E[d(X_1^n, X̂_1^n)] ≤ D and R^I is nonincreasing, we should be able to conclude that R ≥ R^I(D); but note that to make this conclusion we need the continuity of R^I, or, if you really think about it, only its right–continuity. Of course if for some n we had that E[d(X_1^n, X̂_1^n)] ≤ D, then we wouldn't have to worry about this technicality.
Exercise: Prove the needed continuity property. □
Remark: Note that we actually proved that even if lim inf n E[d(X1n , X̂1n )] ≤
D, then R ≥ RI (D).
We now proceed with the more difficult direction which uses (as in the
channel capacity theorem) a randomization procedure. We introduced the
definition of ε–jointly typical sequences in the previous section. We now
need to return to this again but with one other condition.
Definition 8.8: Let p(x, x̂) be a joint distribution on X × X̂ and d a distortion function on the same set. We say (x^n, y^n) is ε–jointly typical (relative to p(x, x̂) and d) if

| −\frac{1}{n} \log p(x^n) − H(X) | < ε,   | −\frac{1}{n} \log p(y^n) − H(X̂) | < ε,

| −\frac{1}{n} \log p(x^n, y^n) − H(X, X̂) | < ε,   and   |d(x^n, y^n) − E[d(X, X̂)]| < ε,

and we denote the set of all such pairs by A^n_{d,ε}.
The following three lemmas are left to the reader. The first is immediate,
the second takes a little more work and the third is pure analysis.
Lemma 8.9: Let (X_i, X̂_i) be drawn i.i.d. according to p(x, x̂). Then P(A^n_{d,ε}) → 1 as n → ∞.
Lemma 8.10: For all (x^n, x̂^n) ∈ A^n_{d,ε},

p(x̂^n) ≥ p(x̂^n | x^n) 2^{−n(I(X;X̂)+3ε)}.
Lemma 8.11: For 0 ≤ x, y ≤ 1 and n > 0,

(1 − xy)^n ≤ 1 − x + e^{−yn}.
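A quick numerical spot check of Lemma 8.11 over a grid of x, y and a few values of n (again a check, not a proof):

# The inequality (1 - xy)^n <= 1 - x + exp(-yn) on a grid of x, y in [0, 1].
import numpy as np

xs = np.linspace(0, 1, 101)
X, Y = np.meshgrid(xs, xs)
ok = all(np.all((1 - X * Y) ** n <= 1 - X + np.exp(-Y * n) + 1e-12) for n in (1, 5, 50))
print(ok)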
Proof of R(D) ≤ RI (D): Recall that p(x) (the distribution of X) and d
(the distortion function) are fixed and X1 , . . . , Xn are i.i.d. with distribution
p(x). We need to show that for any D and any R > RI (D), (R, D) is
achievable.
EXERCISE: Show that it suffices to show that (R, D + ε) is achievable for all ε, and then of course it suffices to show this for ε with R > R^I(D) + 3ε.
For this fixed D, choose a joint distribution p(x, x̂) which minimizes
I(X; X̂) in the definition of RI (D) and let p(x̂) be the marginal on x̂. In
particular, we then have that R > I(X; X̂) + 3ε.
Choose 2^{nR} sequences from X̂^n independently, each with distribution \prod_{i=1}^n p(x̂), which then gives us a (random) decoding (or reproduction) function g_n : {1, 2, ..., 2^{nR}} → X̂^n. Once we have the decoding function g_n, we define the encoding function f_n as follows. Send X^n to w ∈ {1, 2, ..., 2^{nR}} if (X^n, g_n(w)) ∈ A^n_{d,ε} (if there is more than one such w, send it to any of them), while if there is no such w, send X^n to 1.
Our probability space consists of first choosing a (2^{Rn}, n)–rate distortion code in the above way and then choosing X^n independently from \prod_{i=1}^n p(x). Consider the event E_n that there does not exist w ∈ {1, 2, ..., 2^{nR}} with (X^n, g_n(w)) ∈ A^n_{d,ε} (an event which depends on both the random code chosen and X^n). The key step is the following, which we prove afterwards.
Lemma 8.12: P (En ) → 0 as n → ∞.
Calculating things in this probability space, we get (X̂ n is of course
gn (fn (X n )) here)
E[d(X^n, X̂^n)] ≤ D + ε + d_{max} P(E_n),

where d_{max} is of course the maximum value d can take on. The reason why this holds is that if E_n^c occurs, then (X^n, X̂^n) is in A^n_{d,ε}, which implies by definition that |d(X^n, X̂^n) − E[d(X, X̂)]| < ε, which gives the result since E[d(X, X̂)] ≤ D. Applying the lemma, we obtain lim sup_n E[d(X^n, X̂^n)] ≤ D + ε.
It follows that there must be a sequence of (2^{Rn}, n)–rate distortion codes (f_n, g_n) with lim sup_n E[d(X_1^n, g_n(f_n(X_1^n)))] ≤ D + ε, as desired. □
Proof of Lemma 8.12: This is simply a long computation. Let Cn denote
the set of (2Rn , n)–rate distortion codes. For C ∈ Cn , let J(C) be the set of
sequences x^n for which there is some codeword x̂^n with (x^n, x̂^n) ∈ A^n_{d,ε}. We have

P(E_n) = \sum_{C \in C_n} P(C) \sum_{x^n : x^n \notin J(C)} p(x^n) = \sum_{x^n} p(x^n) \sum_{C \in C_n : x^n \notin J(C)} P(C).
Letting K(x^n, x̂^n) be the indicator function of (x^n, x̂^n) ∈ A^n_{d,ε} and fixing some x^n, we have

P((x^n, X̂^n) \notin A^n_{d,ε}) = 1 − \sum_{x̂^n} p(x̂^n) K(x^n, x̂^n)
giving

\sum_{x^n} p(x^n) \sum_{C \in C_n : x^n \notin J(C)} P(C) = \sum_{x^n} p(x^n) \Big[1 − \sum_{x̂^n} p(x̂^n) K(x^n, x̂^n)\Big]^{2^{nR}}.
Next, Lemmas 8.10 and 8.11 (in that order) and some algebra give

P(E_n) ≤ 1 − \sum_{x^n, x̂^n} p(x^n, x̂^n) K(x^n, x̂^n) + e^{−2^{−n(I(X;X̂)+3ε)} 2^{nR}}.

Since R > I(X; X̂) + 3ε, the last term goes (super)-exponentially to 0. The first two terms are simply P((A^n_{d,ε})^c), which goes to 0 by Lemma 8.9. This completes the proof. □
9 Other universal and non–universal estimators
I find the following question interesting, and a number of papers have been written about it.
Given a property of ergodic stationary processes, when does there exist
a consistent estimator of it (consistent in the class of all ergodic stationary
processes)?
Exercise: Sticking to processes taking on only the values {0, . . . , 9}, show
that if a property (i.e., a function) is continuous on the set of all ergodic
processes (where the latter is given the usual weak topology), then there
exists a consistent estimator.
(Olle Nerman pointed this out to me when I mentioned the more general
question above).
I won't write anything in this section, but rather we will go through some recent research papers. I will attach xeroxed copies of some papers here.
The theme of these papers will be, among other things, to learn how to
construct stationary processes with certain desired behavior.