Information theory exercises
Solutions
Winter 2011/2012
1. Let us denote the ranges of $A$, $B$, $C$ by $\mathcal A$, $\mathcal B$, $\mathcal C$, respectively, and their (joint, conditional) distributions by symbols like $P_A(\cdot)$, $P_{AB}(\cdot,\cdot)$, $P_{A|B}(\cdot|\cdot)$. Then
\[
\begin{aligned}
H(A,B) + H(B,C) - H(A,B,C) - H(B)
&= -\sum_{a,b} P_{AB}(a,b)\log P_{AB}(a,b)
   - \sum_{b,c} P_{BC}(b,c)\log P_{BC}(b,c) \\
&\quad + \sum_{a,b,c} P_{ABC}(a,b,c)\log P_{ABC}(a,b,c)
   + \sum_{b} P_B(b)\log P_B(b) \\
&= -\sum_{a,b} P_B(b)P_{A|B}(a|b)\bigl(\log P_B(b) + \log P_{A|B}(a|b)\bigr) \\
&\quad - \sum_{b,c} P_B(b)P_{C|B}(c|b)\bigl(\log P_B(b) + \log P_{C|B}(c|b)\bigr) \\
&\quad + \sum_{a,b,c} P_B(b)P_{AC|B}(a,c|b)\bigl(\log P_B(b) + \log P_{AC|B}(a,c|b)\bigr)
   + \sum_{b} P_B(b)\log P_B(b)
\end{aligned}
\]
Now consider the terms containing $\log P_B(b)$. In these the sums over $a$ and $c$ can be carried out separately for each fixed $b$, and their result is $1$ independently of $b$. What remains of these terms is the entropy of $B$, twice with positive and twice with negative sign, so they cancel. Treating the sum over $b$ separately in the remaining terms, one has
\[
\cdots = \sum_{b} P_B(b)\left( -\sum_{a} P_{A|B}(a|b)\log P_{A|B}(a|b)
 - \sum_{c} P_{C|B}(c|b)\log P_{C|B}(c|b)
 + \sum_{a,c} P_{AC|B}(a,c|b)\log P_{AC|B}(a,c|b) \right)
\]
For each $b$, $P_{AC|B}(\cdot\,|b)$ is a probability distribution on $\mathcal A\times\mathcal C$ with $P_{A|B}(\cdot|b)$ and $P_{C|B}(\cdot|b)$ as its marginals. Applying subadditivity of the entropy to these distributions shows that each bracket is nonnegative, which gives the desired inequality. (Alternatively, we can repeat the steps of the proof of subadditivity inside the sum over $b$.)
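As a sanity check (not part of the original argument), the inequality $H(A,B) + H(B,C) \ge H(A,B,C) + H(B)$ can be tested numerically on random joint distributions. The following Python sketch assumes only NumPy; the alphabet sizes are arbitrary choices.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (base 2) of a probability vector."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)
for _ in range(1000):
    # random joint distribution of (A, B, C) on a 3 x 4 x 5 alphabet
    p = rng.random((3, 4, 5))
    p /= p.sum()
    H_AB  = entropy(p.sum(axis=2).ravel())
    H_BC  = entropy(p.sum(axis=0).ravel())
    H_ABC = entropy(p.ravel())
    H_B   = entropy(p.sum(axis=(0, 2)))
    assert H_AB + H_BC >= H_ABC + H_B - 1e-10
```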
2. By strict concavity of $H(\cdot)$ there is at most one maximum on a convex subset of a linear space, and a stationary point can only be a maximum. The maximum remains a maximum when we restrict the function to any finite dimensional affine subspace containing it, so we can find a candidate using the Lagrange multiplier method. We seek a stationary point of
\[
-\sum_{i=0}^{\infty} p_i \log p_i + \lambda \sum_{i=0}^{\infty} p_i + \beta \sum_{i=0}^{\infty} i\, p_i
\]
subject to the constraints
\[
\sum_{i=0}^{\infty} p_i = 1 \qquad\text{and}\qquad \sum_{i=0}^{\infty} i\, p_i = A.
\]
The derivative with respect to $p_j$ is
\[
0 = -\log p_j - 1 + \lambda + \beta j,
\]
that is,
\[
p_j = e^{-1+\lambda+\beta j}.
\]
Introducing the new variable $q := 1 - e^{\beta}$ and taking into account the normalization constraint, we get $p_j = (1-q)^j q$. The expected value of this distribution is $\frac{1-q}{q}$, therefore $q = (1+A)^{-1}$. It remains to see that if $p'_j$ is an arbitrary other probability distribution with expected value $A$, then along the line segment joining $p$ and $p'$ the maximum of the entropy is at $p$.
\[
\begin{aligned}
-\frac{d}{dt} H\bigl((1-t)p + t p'\bigr)\Big|_{t=0}
&= \frac{d}{dt} \sum_{i=0}^{\infty} \bigl(p_i + t(p'_i - p_i)\bigr)\log\bigl(p_i + t(p'_i - p_i)\bigr)\Big|_{t=0} \\
&= \sum_{i=0}^{\infty} \left( (p'_i - p_i)\log p_i + p_i\,\frac{p'_i - p_i}{p_i} \right)
 = \lambda \sum_{i=0}^{\infty} (p'_i - p_i) + \beta \sum_{i=0}^{\infty} i\,(p'_i - p_i) = 0,
\end{aligned}
\]
where we used that $\log p_i = -1 + \lambda + \beta i$, that both distributions sum to $1$ and have expected value $A$, and that the convergence of the first sum and of its termwise derivative is uniform on some neighbourhood $t \in [0,\varepsilon)$. (Check!)
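To illustrate the result numerically (an illustration, not part of the solution), one can compare the entropy of the truncated geometric distribution with $q = (1+A)^{-1}$ against other distributions on $\{0,1,2,\dots\}$ with the same mean; the competitors below (a two-point law and a Poisson law) are arbitrary choices.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

A = 3.0                        # prescribed expected value
j = np.arange(200)             # truncation of {0, 1, 2, ...}; the neglected tail mass is tiny here

q = 1.0 / (1.0 + A)            # q = (1 + A)^(-1) from the solution
geom = (1 - q) ** j * q        # p_j = (1 - q)^j q

# two competitors with the same mean A: a two-point law and a Poisson law
two_point = np.zeros_like(geom)
two_point[0], two_point[6] = 1 - A / 6, A / 6
poisson = np.empty_like(geom)
poisson[0] = np.exp(-A)
for k in range(1, len(j)):
    poisson[k] = poisson[k - 1] * A / k

for name, p in [("geometric", geom), ("two-point", two_point), ("Poisson", poisson)]:
    print(f"{name:10s} mean = {np.sum(j * p):.4f}   H = {entropy(p):.4f} bits")
# the geometric distribution has the largest entropy among the three
```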
3. a) A permutation of the arguments means reordering terms in a finite
sum.
b) $0^\alpha = 0$ (note that we use the convention $0^0 = 0$ to ensure continuity in $\alpha$ at $0$)
c) For independent $X$ and $Y$,
\[
\begin{aligned}
H_\alpha(X,Y) &= \frac{1}{1-\alpha}\log\left(\sum_{\substack{x\in\mathcal X\\ y\in\mathcal Y}} P_{XY}(x,y)^\alpha\right)
= \frac{1}{1-\alpha}\log\left(\sum_{\substack{x\in\mathcal X\\ y\in\mathcal Y}} P_X(x)^\alpha P_Y(y)^\alpha\right) \\
&= \frac{1}{1-\alpha}\log\left(\left(\sum_{x\in\mathcal X} P_X(x)^\alpha\right)\left(\sum_{y\in\mathcal Y} P_Y(y)^\alpha\right)\right)
= \frac{1}{1-\alpha}\left(\log\sum_{x\in\mathcal X} P_X(x)^\alpha + \log\sum_{y\in\mathcal Y} P_Y(y)^\alpha\right) \\
&= H_\alpha(X) + H_\alpha(Y)
\end{aligned}
\]
d) The largest term in the sum is at least $1/n$, therefore the argument of the logarithm is separated from $0$ by at least $n^{-\alpha}$. The result follows as the composition of continuous functions is continuous.
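A quick numerical check of part c) (an illustrative sketch, not from the exercise sheet): for independent $X$ and $Y$ the Rényi entropies add up.

```python
import numpy as np

def renyi(p, alpha):
    """Rényi entropy of order alpha (base 2), alpha != 1."""
    p = np.asarray(p, dtype=float)
    return np.log2(np.sum(p ** alpha)) / (1 - alpha)

rng = np.random.default_rng(1)
px = rng.random(4); px /= px.sum()
py = rng.random(6); py /= py.sum()
pxy = np.outer(px, py)                   # joint distribution of independent X and Y

for alpha in (0.5, 2.0, 3.0):
    lhs = renyi(pxy.ravel(), alpha)
    rhs = renyi(px, alpha) + renyi(py, alpha)
    print(alpha, lhs, rhs)               # the two values agree
```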
4. Observe that $H(X_i) = H(X_\Theta|\Theta=i)$. Using this we have
\[
pH(X_0) + (1-p)H(X_1) = pH(X_\Theta|\Theta=0) + (1-p)H(X_\Theta|\Theta=1) = H(X_\Theta|\Theta) \le H(X_\Theta)
\]
5. a) Let $A \subseteq \mathcal A$ be the support of $P$. Then
\[
-D(P\|Q) = \sum_{a\in A} P(a)\log\frac{Q(a)}{P(a)}
\le \log \sum_{a\in A} P(a)\,\frac{Q(a)}{P(a)}
\le \log \underbrace{\sum_{a\in\mathcal A} Q(a)}_{=1} = 0,
\]
where the first inequality is Jensen's inequality applied to the concave function $\log$.
b)
\[
\begin{aligned}
D(P_1\times P_2\|Q_1\times Q_2)
&= \sum_{(a_1,a_2)\in\mathcal A_1\times\mathcal A_2} (P_1\times P_2)(a_1,a_2)\log\frac{(P_1\times P_2)(a_1,a_2)}{(Q_1\times Q_2)(a_1,a_2)} \\
&= \sum_{(a_1,a_2)\in\mathcal A_1\times\mathcal A_2} P_1(a_1)P_2(a_2)\left(\log\frac{P_1(a_1)}{Q_1(a_1)} + \log\frac{P_2(a_2)}{Q_2(a_2)}\right) \\
&= \sum_{a_1\in\mathcal A_1} P_1(a_1)\log\frac{P_1(a_1)}{Q_1(a_1)} + \sum_{a_2\in\mathcal A_2} P_2(a_2)\log\frac{P_2(a_2)}{Q_2(a_2)}
= D(P_1\|Q_1) + D(P_2\|Q_2)
\end{aligned}
\]
c) For nonnegative numbers $a_1,\dots,a_n$, $b_1,\dots,b_n$ the log-sum inequality
\[
\sum_{i=1}^n a_i\log\frac{a_i}{b_i} \ge \left(\sum_{i=1}^n a_i\right)\log\frac{\sum_{i=1}^n a_i}{\sum_{i=1}^n b_i}
\]
holds, with equality iff the ratio $a_i/b_i$ is the same for all $i$. This can be seen by applying Jensen's inequality $\sum_{i=1}^n \alpha_i f(t_i) \ge f\!\left(\sum_{i=1}^n \alpha_i t_i\right)$ to the strictly convex function $f(t) = t\log t$ with $\alpha_i = \frac{b_i}{\sum_{j=1}^n b_j}$ and $t_i = \frac{a_i}{b_i}$.
Now use this inequality with $n=2$, $a_1 = \lambda P(i)$, $a_2 = (1-\lambda)P'(i)$, $b_1 = \lambda Q(i)$, $b_2 = (1-\lambda)Q'(i)$ to get the inequalities
\[
\bigl[\lambda P(i) + (1-\lambda)P'(i)\bigr]\log\frac{\lambda P(i) + (1-\lambda)P'(i)}{\lambda Q(i) + (1-\lambda)Q'(i)}
\le \lambda P(i)\log\frac{\lambda P(i)}{\lambda Q(i)} + (1-\lambda)P'(i)\log\frac{(1-\lambda)P'(i)}{(1-\lambda)Q'(i)},
\]
and take their sum over $i$; this yields the joint convexity
\[
D\bigl(\lambda P + (1-\lambda)P'\,\big\|\,\lambda Q + (1-\lambda)Q'\bigr) \le \lambda D(P\|Q) + (1-\lambda)D(P'\|Q').
\]
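The three properties proved above (nonnegativity, additivity on product distributions, joint convexity) are easy to spot-check numerically; the sketch below (an illustration only) uses randomly drawn distributions.

```python
import numpy as np

def D(p, q):
    """Relative entropy D(p||q) in bits (assumes supp(p) is contained in supp(q))."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

rng = np.random.default_rng(2)
def rand_dist(n):
    v = rng.random(n)
    return v / v.sum()

P, Q, P2, Q2 = (rand_dist(5) for _ in range(4))
lam = 0.3

assert D(P, Q) >= 0                                             # a) nonnegativity
assert np.isclose(D(np.outer(P, P2).ravel(), np.outer(Q, Q2).ravel()),
                  D(P, Q) + D(P2, Q2))                          # b) additivity
assert D(lam * P + (1 - lam) * P2, lam * Q + (1 - lam) * Q2) \
       <= lam * D(P, Q) + (1 - lam) * D(P2, Q2) + 1e-12         # c) joint convexity
```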
6. Let $\mathcal X$ denote the range of the random variable $X$, and let $P_X$ be its distribution. Let $P_U$ denote the uniform distribution on $\mathcal X$, i.e. $P_U(A) = \frac{|A|}{|\mathcal X|}$ for $A \subseteq \mathcal X$. Then
\[
H(X) = -\sum_{x\in\mathcal X} P_X(x)\log P_X(x)
= -\sum_{x\in\mathcal X} P_X(x)\left(\log\frac{P_X(x)}{P_U(x)} + \log P_U(x)\right)
= \log|\mathcal X| - D(P_X\|P_U)
\]
7. Let $X, Y$ be a pair of random variables with range $\mathcal X$ and $\mathcal Y$, respectively. Their mutual information can be written as
\[
\begin{aligned}
I(X:Y) &= H(X) + H(Y) - H(X,Y)
= \sum_{\substack{x\in\mathcal X\\ y\in\mathcal Y}} P_{XY}(x,y)\bigl(-\log P_X(x) - \log P_Y(y) + \log P_{XY}(x,y)\bigr) \\
&= \sum_{\substack{x\in\mathcal X\\ y\in\mathcal Y}} P_{XY}(x,y)\log\frac{P_{XY}(x,y)}{P_X(x)P_Y(y)}
= D(P_{XY}\|P_X\times P_Y)
\end{aligned}
\]
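Numerically (as an illustration only), the identity $I(X:Y) = D(P_{XY}\|P_X\times P_Y)$ can be checked on a random joint distribution:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def D(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

rng = np.random.default_rng(3)
pxy = rng.random((4, 5)); pxy /= pxy.sum()        # random joint distribution
px, py = pxy.sum(axis=1), pxy.sum(axis=0)         # marginals

I = entropy(px) + entropy(py) - entropy(pxy.ravel())
assert np.isclose(I, D(pxy.ravel(), np.outer(px, py).ravel()))
```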
8. a) An element of Pn can be uniquely identified by an r-tuple of rational numbers between 0 and 1 with denominator n. Even without
the normalization constraint the number of possible choices for the
numerators is (n + 1)r , and adding a constraint cannot increase this.
b) For $x \in T_P^n$ (so that the type of $x$ is $P_x = P$),
\[
Q^n(x) = \prod_{i=1}^n Q(x_i) = \prod_{a\in\mathcal A} Q(a)^{nP_x(a)} = \prod_{a\in\mathcal A} 2^{nP_x(a)\log Q(a)}
= 2^{-n\sum_{a\in\mathcal A}\left(-P_x(a)\log P(a) + P_x(a)\log P(a) - P_x(a)\log Q(a)\right)}
= 2^{-n\left(H(P) + D(P\|Q)\right)}
\]
c) For the upper bound, calculate the probability of $T_P^n$ with respect to the distribution $P^n$:
\[
1 \ge P^n(T_P^n) = \sum_{x\in T_P^n} P^n(x) = |T_P^n| \cdot 2^{-nH(P)}
\]
For the lower bound, we first show that $P^n(T_P^n) \ge P^n(T_Q^n)$ for any $Q \in \mathcal P_n$:
\[
\frac{P^n(T_P^n)}{P^n(T_Q^n)}
= \frac{|T_P^n| \prod_{a\in\mathcal A} P(a)^{nP(a)}}{|T_Q^n| \prod_{a\in\mathcal A} P(a)^{nQ(a)}}
= \prod_{a\in\mathcal A} \frac{(nQ(a))!}{(nP(a))!}\, P(a)^{n(P(a)-Q(a))}
\ge \prod_{a\in\mathcal A} (nP(a))^{nQ(a)-nP(a)}\, P(a)^{n(P(a)-Q(a))}
= n^{\,n\sum_{a\in\mathcal A}(Q(a)-P(a))} = n^0 = 1,
\]
where we used that $\frac{m!}{n!} \ge n^{m-n}$. Now
\[
1 = \sum_{Q\in\mathcal P_n} P^n(T_Q^n) \le |\mathcal P_n| \max_{Q\in\mathcal P_n} P^n(T_Q^n) = |\mathcal P_n|\, P^n(T_P^n) = |\mathcal P_n|\cdot|T_P^n|\cdot 2^{-nH(P)}
\]
d)
\[
Q^n(T_P^n) = \sum_{x\in T_P^n} Q^n(x) = \sum_{x\in T_P^n} 2^{-n(H(P)+D(P\|Q))} = |T_P^n|\, 2^{-n(H(P)+D(P\|Q))}
\]
Now use the estimates of the cardinality of $T_P^n$.
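The identities and bounds of parts b) and c) can be checked numerically for a small binary example (an illustrative sketch; the particular $n$, $P$ and $Q$ are arbitrary choices):

```python
import numpy as np
from math import comb, log2

n, r = 12, 2                               # block length and alphabet size (binary)
P = np.array([0.25, 0.75])                 # a type: nP(a) is an integer
Q = np.array([0.4, 0.6])

H_P  = -np.sum(P * np.log2(P))
D_PQ = np.sum(P * np.log2(P / Q))

# size of the type class T_P^n (binary alphabet: a binomial coefficient)
k = int(round(n * P[1]))
T_size = comb(n, k)

# part c): (n+1)^{-r} 2^{nH(P)} <= |T_P^n| <= 2^{nH(P)}
assert (n + 1) ** (-r) * 2 ** (n * H_P) <= T_size <= 2 ** (n * H_P)

# part b): every x of type P has Q^n(x) = 2^{-n(H(P)+D(P||Q))}
x = [1] * k + [0] * (n - k)
Qn_x = np.prod([Q[a] for a in x])
assert np.isclose(log2(Qn_x), -n * (H_P + D_PQ))
```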
9.
\[
\begin{aligned}
h(p_e) &= h\!\left(\frac{1}{n}\sum_{i=1}^n \mathrm{Prob}\bigl(X_i \ne f_i(Y)\bigr)\right)
\ge \frac{1}{n}\sum_{i=1}^n h\bigl(\mathrm{Prob}(X_i \ne f_i(Y))\bigr)
\ge \frac{1}{n}\sum_{i=1}^n H(X_i|Y) \\
&\ge \frac{1}{n}\bigl(H(X_1|Y) + H(X_2|Y X_1) + \dots + H(X_n|Y X_1\dots X_{n-1})\bigr)
= \frac{1}{n} H(X|Y)
\end{aligned}
\]
Here we first used Jensen’s inequality, next the Fano inequality, then
strong subadditivity (in the form H(A|BC) ≤ H(A|B)), and finally the
chain rule.
10. Use the data processing inequality, the expansion of mutual information
in terms of entropy and conditional entropy, positivity of conditional
entropy and the bound on the entropy of a probability distribution on a
set of cardinality |Y| to get
I(X : Z) ≤ I(X : Y ) = H(Y ) − H(Y |X) ≤ H(Y ) ≤ log|Y|
11. It is convenient to introduce the conditional relative entropy of two joint probability distributions $p(x,y)$ and $q(x,y)$ on the same set $\mathcal X\times\mathcal Y$ as follows:
\[
D\bigl(p(y|x)\,\big\|\,q(y|x)\bigr) = \sum_{x\in\mathcal X} p(x) \sum_{y\in\mathcal Y} p(y|x)\log\frac{p(y|x)}{q(y|x)}
\]
This quantity is nonnegative and equals $0$ iff $p(y|x) = q(y|x)$ for all $(x,y)\in\mathcal X\times\mathcal Y$ such that $p(x) > 0$. There is a chain rule for relative entropy involving conditional relative entropy:
\[
D\bigl(p(x,y)\|q(x,y)\bigr) = D\bigl(p(x)\|q(x)\bigr) + D\bigl(p(y|x)\|q(y|x)\bigr)
\]
Using the chain rule in two ways for the relative entropy of the distributions $P_{X_n,X_{n+1}}$ and $P_{X'_n,X'_{n+1}}$ one has
\[
\begin{aligned}
D\bigl(P_{X_n,X_{n+1}}\big\|P_{X'_n,X'_{n+1}}\bigr)
&= D\bigl(P_{X_n}\big\|P_{X'_n}\bigr) + D\bigl(P_{X_{n+1}|X_n}\big\|P_{X'_{n+1}|X'_n}\bigr) \\
&= D\bigl(P_{X_{n+1}}\big\|P_{X'_{n+1}}\bigr) + D\bigl(P_{X_n|X_{n+1}}\big\|P_{X'_n|X'_{n+1}}\bigr)
\end{aligned}
\]
Now the conditional distributions $P_{X_{n+1}|X_n}$ and $P_{X'_{n+1}|X'_n}$ are the transition probabilities of the two Markov chains, and hence are the same, so the corresponding conditional relative entropy vanishes. By nonnegativity of conditional relative entropy we then have
\[
D\bigl(P_{X_{n+1}}\big\|P_{X'_{n+1}}\bigr) \le D\bigl(P_{X_n}\big\|P_{X'_n}\bigr)
\]
as claimed. If the Markov chains are homogeneous and $\mu = P_{X'_1}$ is a stationary distribution, then $\mu = P_{X'_n}$ for all $n \ge 1$; in this case $D(P_{X_n}\|\mu)$ decreases as $n$ grows.
For example, if the uniform distribution PU on X is a stationary distribution of the Markov chain, then
H(Xn ) = log |X | − D(PXn ||PU ) ≤ log |X | − D(PXn+1 ||PU ) = H(Xn+1 )
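The monotonicity of $D(P_{X_n}\|\mu)$ can also be observed numerically; the following sketch (an illustration, not part of the solution) iterates a random transition matrix.

```python
import numpy as np

def D(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

rng = np.random.default_rng(4)
T = rng.random((4, 4))
T /= T.sum(axis=1, keepdims=True)                 # row-stochastic transition matrix

# stationary distribution: left eigenvector of T for eigenvalue 1
w, v = np.linalg.eig(T.T)
mu = np.real(v[:, np.argmin(np.abs(w - 1))])
mu /= mu.sum()

p = np.array([1.0, 0.0, 0.0, 0.0])                # arbitrary initial distribution
prev = np.inf
for _ in range(20):
    d = D(p, mu)
    assert d <= prev + 1e-12                      # D(P_{X_n} || mu) never increases
    prev = d
    p = p @ T                                     # one step of the chain
```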
12. Order the elements of $\mathcal X$ according to the (supposed) length of the corresponding codeword: $l_{x_1} \le l_{x_2} \le \dots \le l_{x_{|\mathcal X|}} = l$. First draw a complete $|\mathcal A|$-ary tree of height $l$, and for each non-leaf vertex, label the $|\mathcal A|$ edges going towards the leaves with the elements of $\mathcal A$ (arbitrarily). Now in each step take the first unused element $x_i$ of $\mathcal X$, pick a vertex at distance $l_{x_i}$ from the root, attach the label $x_i$ to it and remove the vertices of the subtree below this vertex. Repeat these steps until all the symbols are used. Finally, the codewords can be read off from the paths from the root to the labelled vertices. It is clear that when the algorithm terminates, it gives a prefix code with the given lengths. What we have to show is that it is indeed possible to pick a vertex at distance $l_{x_i}$ from the root in the $i$-th step. This is clearly the case if there are vertices at distance $l$ from the root left. The number of leaves removed during the first $i-1$ steps is
\[
\sum_{1\le j<i} |\mathcal A|^{\,l - l_{x_j}} < |\mathcal A|^{l} \sum_{x\in\mathcal X} |\mathcal A|^{-l_x} \le |\mathcal A|^{l}
\]
using the assumption. On the right hand side we have the total number
of leaves in the initial tree.
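The tree construction can be phrased arithmetically: processing the lengths in non-decreasing order, the $i$-th codeword is the $|\mathcal A|$-ary address of the first still-free vertex at depth $l_{x_i}$. The following Python sketch implements this equivalent formulation (the function name and interface are ad hoc, not from the exercise):

```python
def kraft_prefix_code(lengths, q=2):
    """Build a prefix code with the given codeword lengths over a q-ary alphabet,
    assuming sum(q**-l for l in lengths) <= 1 (Kraft's inequality).  The i-th
    codeword is the q-ary address, written with l_i digits, of the first vertex at
    depth l_i whose subtree has not been removed yet."""
    assert sum(q ** -l for l in lengths) <= 1 + 1e-12
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    lmax = max(lengths)
    code, used = [None] * len(lengths), 0          # 'used' counts removed leaves of the full tree
    for i in order:
        l = lengths[i]
        node = used // q ** (lmax - l)             # first free vertex at depth l
        digits = []
        for _ in range(l):                         # q-ary expansion of 'node' with l digits
            digits.append(node % q)
            node //= q
        code[i] = ''.join(str(d) for d in reversed(digits))
        used += q ** (lmax - l)                    # remove the subtree below this vertex
    return code

print(kraft_prefix_code([1, 2, 3, 3]))             # ['0', '10', '110', '111']
```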
13. Let $\mathcal X$ denote the range of $X$ and $X'$, and let $P_X$ be their (common) distribution. Applying Jensen's inequality to the convex function $x \mapsto 2^x$ gives
\[
2^{-H(X)} = 2^{\sum_{x\in\mathcal X} P_X(x)\log P_X(x)}
\le \sum_{x\in\mathcal X} P_X(x)\, 2^{\log P_X(x)}
= \sum_{x\in\mathcal X} P_X(x)^2
= \sum_{x\in\mathcal X} \mathrm{Prob}(X = X' = x)
\]
14. The function $f(t) = -t\log t$ is strictly concave and its value is $0$ at $t=0$ and $t=1$, therefore it is positive for $0 < t < 1$. For $\nu \le \frac12$ consider the chords of $f$ from $t$ to $t+\nu$, where $0 \le t$ and $t+\nu \le 1$. The maximum of the absolute value of their slope is attained at $t = 0$ or $t = 1-\nu$, hence
\[
|f(t) - f(t+\nu)| \le \max\{f(\nu), f(1-\nu)\} = -\nu\log\nu
\]
Using this we have
\[
\begin{aligned}
|H(P) - H(Q)| &= \Bigl|\sum_{x\in\mathcal X}\bigl(f(P(x)) - f(Q(x))\bigr)\Bigr|
\le \sum_{x\in\mathcal X}\bigl|f(P(x)) - f(Q(x))\bigr|
\le \sum_{x\in\mathcal X} f\bigl(|P(x)-Q(x)|\bigr) \\
&= \|P-Q\|_1 \sum_{x\in\mathcal X}\left(-\frac{|P(x)-Q(x)|}{\|P-Q\|_1}\log\frac{|P(x)-Q(x)|}{\|P-Q\|_1}\right) - \|P-Q\|_1\log\|P-Q\|_1 \\
&= -\|P-Q\|_1\log\|P-Q\|_1 + \|P-Q\|_1\, H(R)
\le -\|P-Q\|_1\bigl(\log\|P-Q\|_1 - \log|\mathcal X|\bigr),
\end{aligned}
\]
where we have introduced the probability distribution
\[
R(x) = \frac{|P(x)-Q(x)|}{\|P-Q\|_1}
\]
on the set $\mathcal X$.
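The resulting continuity bound $|H(P)-H(Q)| \le -\|P-Q\|_1\bigl(\log\|P-Q\|_1 - \log|\mathcal X|\bigr)$, which by the chord estimate above applies when $\|P-Q\|_1 \le \tfrac12$, can be spot-checked numerically, for instance as follows (an illustration only):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(5)
n = 8
for _ in range(1000):
    P = rng.random(n); P /= P.sum()
    Q = np.clip(P + rng.normal(scale=0.01, size=n), 1e-9, None); Q /= Q.sum()
    t = np.sum(np.abs(P - Q))                      # ||P - Q||_1
    if 0 < t <= 0.5:
        bound = -t * (np.log2(t) - np.log2(n))     # -||P-Q||_1 (log||P-Q||_1 - log|X|)
        assert abs(entropy(P) - entropy(Q)) <= bound + 1e-12
```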
15. Let $A$ and $B$ be random variables with distributions $P$ and $Q$, and let $f : \mathcal X \to I$ be defined as $f(x) = i$ where $x \in \mathcal X_i$. Then $A \to f(A)$ and $B \to f(B)$ can be viewed as two Markov chains with identical transition probabilities, and hence $D(P_A\|P_B) \ge D(P_{f(A)}\|P_{f(B)})$. Now observe that $P_A = P$, $P_B = Q$, $P_{f(A)} = P_X$ and $P_{f(B)} = Q_X$.
16. Let $\mathcal X_1 = \{x\in\mathcal X \mid P(x) \le Q(x)\}$ and $\mathcal X_2 = \{x\in\mathcal X \mid P(x) > Q(x)\}$. $\{\mathcal X_1,\mathcal X_2\}$ is clearly a partition of $\mathcal X$, and therefore $D(P\|Q) \ge D(P_X\|Q_X)$. Also we have
\[
\|P-Q\|_1 = \sum_{x\in\mathcal X} |P(x)-Q(x)|
= \sum_{x\in\mathcal X_1} \bigl(Q(x)-P(x)\bigr) + \sum_{x\in\mathcal X_2} \bigl(P(x)-Q(x)\bigr)
= \bigl(Q_X(1)-P_X(1)\bigr) + \bigl(P_X(2)-Q_X(2)\bigr)
= \|P_X - Q_X\|_1,
\]
implying that it suffices to prove that $D(P_X\|Q_X) \ge \frac{1}{2\ln 2}\|P_X - Q_X\|_1^2$.
Now let $P_X = (p, 1-p)$ and $Q_X = (q, 1-q)$, and consider the function
\[
f_{c,p}(q) = p\ln\frac{p}{q} + (1-p)\ln\frac{1-p}{1-q} - 4c(p-q)^2
\]
with parameters $p$ and $c$. Clearly $f_{c,p}(p) = 0$, and for $0 < q < 1$
\[
f'_{c,p}(q) = -\frac{p}{q} + \frac{1-p}{1-q} + 8c(p-q) = (q-p)\left(\frac{1}{q(1-q)} - 8c\right).
\]
For $c \le \frac12$ the second factor is nonnegative as $q(1-q) \le \frac14$. This ensures that $f_{c,p}(q)$ attains its minimum at $q = p$. Therefore under this condition we have
\[
0 \le f_{c,p}(q) = (\ln 2)\,D(P_X\|Q_X) - c\bigl(|p-q| + |(1-p)-(1-q)|\bigr)^2 = (\ln 2)\,D(P_X\|Q_X) - c\,\|P_X - Q_X\|_1^2.
\]
By setting $c = \frac12$ we get Pinsker's inequality.
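Pinsker's inequality $D(P\|Q) \ge \frac{1}{2\ln 2}\|P-Q\|_1^2$ can be spot-checked on random pairs of distributions (an illustration only):

```python
import numpy as np

def D(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

rng = np.random.default_rng(6)
for _ in range(10000):
    P = rng.random(6); P /= P.sum()
    Q = rng.random(6); Q /= Q.sum()
    t = np.sum(np.abs(P - Q))
    assert D(P, Q) >= t ** 2 / (2 * np.log(2)) - 1e-12
```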
17. a) It is enough to show that intervals of the form $[0,a)$ where $0 < a < 1$ can be expressed in terms of the $I_{(x_1,\dots,x_n)}$ with $\sigma$-algebra operations. We claim that $[0,a)$ is the union of all the intervals $I_{(x_1,\dots,x_n)}$ which are contained in $[0,a)$. For this we need only to show that for $0 \le a' < a$ there exists an interval $I = I_{(x_1,\dots,x_n)}$ such that $a' \in I \subseteq [0,a)$. Observe first that for a given $n$ there is precisely one sequence $(x_1,\dots,x_n) \in \mathcal X^n$ such that $a' \in I_{(x_1,\dots,x_n)}$, since these intervals were defined by iteratively partitioning subsets of $[0,1)$. Moreover, if $(x_1,\dots,x_n)$ and $(x'_1,\dots,x'_{n'})$ are sequences such that $a' \in I_{(x_1,\dots,x_n)}$ and $a' \in I_{(x'_1,\dots,x'_{n'})}$, then one is a prefix of the other. Thus $a'$ determines a unique infinite sequence $(x_1,x_2,\dots) \in \mathcal X^{\mathbb N}$ such that $a'$ is contained in the interval corresponding to any of its (finite) prefixes. Let us denote the diameter of a subset of $\mathbb R$ by $d(\cdot)$, i.e. $d(A) = \sup\{|x-y| : x,y\in A\}$. In particular, for $a < b$ we have $d([a,b]) = d([a,b)) = d((a,b]) = d((a,b)) = b-a$. By construction and using the assumption
\[
d\bigl(I_{(x_1,\dots,x_n)}\bigr) = P_{X_1}(x_1)\, P_{X_2|X_1}(x_2|x_1)\cdots P_{X_n|X_1\dots X_{n-1}}(x_n|x_1,\dots,x_{n-1}) = P_{X_1\dots X_n}(x_1,\dots,x_n) \to 0
\]
as $n\to\infty$, therefore $d(I_{(x_1,\dots,x_n)}) < a - a'$ for some $n$. It follows that $a' \in I_{(x_1,\dots,x_n)} \subseteq \bigl(a' - (a-a'),\, a' + (a-a')\bigr) \cap [0,1) \subseteq [0,a)$.
b) The family of subsets in question is a nested sequence of compact
intervals, hence it has the finite intersection property. By compactness
the intersection of all sets in the family is nonempty. If a and b are
in the intersection, then |a − b| cannot be greater than the diameter
of any interval in the family. As limn→∞ d(I(x1 ,...,xn ) ) = 0, we have
|a − b| = 0, i.e. a = b.
c) It suffices to verify that the preimages of a set of generators of the Borel $\sigma$-algebra are in $\mathcal F$:
\[
f^{-1}\bigl(I_{(\hat x_1,\dots,\hat x_n)}\bigr) = \{(x_1,x_2,\dots)\in\mathcal X^{\mathbb N} \mid x_1 = \hat x_1,\dots,x_n = \hat x_n\} \in \mathcal F
\]
d)
\[
\mathrm{Prob}\bigl(\{(x_1,x_2,\dots)\in\mathcal X^{\mathbb N} \mid x_1 = \hat x_1,\dots,x_n = \hat x_n\}\bigr) = P_{X_1\dots X_n}(\hat x_1,\dots,\hat x_n) = d\bigl(I_{(\hat x_1,\dots,\hat x_n)}\bigr),
\]
which is equal to the measure of $I_{(\hat x_1,\dots,\hat x_n)}$ with respect to the uniform measure on $[0,1]$.
18. For $P_X(0) = p = 1 - P_X(1)$ the output distribution is $P_Y(0) = p + \frac{1-p}{2} = \frac{1+p}{2}$ and $P_Y(1) = \frac{1-p}{2}$. The mutual information is therefore
\[
I(X:Y) = H(Y) - H(Y|X) = h\!\left(\frac{1+p}{2}\right) - p\cdot 0 - (1-p)\cdot 1,
\]
where $h(x) = -x\log x - (1-x)\log(1-x)$. At the stationary point the derivative is $0$:
\[
0 = \frac{d}{dp} I(X:Y) = \frac12\, h'\!\left(\frac{1+p}{2}\right) + 1 = \frac12\log\frac{1-p}{1+p} + 1,
\]
as $h'(x) = \log\frac{1-x}{x}$, so
\[
2^2 = \frac{1+p}{1-p},
\]
and hence $p = \frac35$ gives the optimal input distribution. The capacity is $C = h\!\left(\frac45\right) - \frac25 = \log\frac54 = 0.321928\dots$
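The optimum can be confirmed by a direct numerical maximization (an illustrative sketch; the channel is the one implied by the output distribution above, i.e. input $0$ is received faithfully and input $1$ is flipped to $0$ with probability $\tfrac12$):

```python
import numpy as np

def h(x):
    """Binary entropy in bits."""
    return 0.0 if x in (0.0, 1.0) else -x * np.log2(x) - (1 - x) * np.log2(1 - x)

def I(p):
    """Mutual information for input distribution (p, 1-p): input 0 is received
    faithfully, input 1 is flipped to 0 with probability 1/2."""
    return h((1 + p) / 2) - (1 - p)

ps = np.linspace(0, 1, 100001)
vals = np.array([I(p) for p in ps])
print(ps[np.argmax(vals)], vals.max())   # about 0.6 = 3/5 and about 0.321928 = log2(5/4)
```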
19. The mutual information between the input and the output can be expressed as $I(X:Y) = H(Y) - H(Y|X)$. Now
\[
H(Y|X) = -\sum_{x\in\mathcal X} P_X(x) \sum_{y\in\mathcal X} P_{Y|X}(y|x)\log P_{Y|X}(y|x)
= -\sum_{x\in\mathcal X} P_X(x) \sum_{n\in\mathcal X} P_N(n)\log P_N(n) = H(N)
\]
implies that $I(X:Y) = H(Y) - H(N) \le \log k - H(N)$, and for uniform input distribution the distribution of $Y$ is
\[
P_Y(y) = \sum_{x\in\mathcal X} P_X(x) P_{Y|X}(y|x) = \sum_{x\in\mathcal X} P_X(x) P_N(y-x) = \frac1k \sum_{x\in\mathcal X} P_N(y-x) = \frac1k
\]
with $H(Y) = \log k$. Therefore the capacity is $C = \log k - H(N)$.
It follows that the capacity of the identity channel is $\log k$, for the binary symmetric channel with bit flip probability $p$ it is $1 - H(p)$, and for the noisy typewriter $C = \log\frac{k}{m}$.
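The formula $C = \log k - H(N)$ makes these special cases one-liners; the following sketch (an illustration only) evaluates a few of them:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def additive_noise_capacity(k, PN):
    """C = log k - H(N) for the channel Y = X + N (mod k)."""
    return np.log2(k) - entropy(PN)

print(additive_noise_capacity(2, [1, 0]))               # identity channel on bits: 1
print(additive_noise_capacity(2, [0.9, 0.1]))            # BSC(0.1): 1 - H(0.1), about 0.531
print(additive_noise_capacity(5, [0.5, 0.5, 0, 0, 0]))   # noise uniform on two of five symbols, as in exercise 21: log2(5/2)
```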
20. For the input random variable $X$ the mutual information between the input and the output is
\[
\begin{aligned}
I(X:Y) &= \sum_{x,y} P_{XY}(x,y)\log\frac{P_{XY}(x,y)}{P_X(x)P_Y(y)}
= \sum_{x,y} P_X(x) P_{Y|X}(y|x)\log\frac{P_{Y|X}(y|x)}{P_Y(y)} \\
&= \sum_{x} P_X(x)\left( p\log\frac{p}{p\,P_X(x)} + (1-p)\log\frac{1-p}{\sum_{x'} P_X(x')(1-p)} \right)
= -p\sum_{x} P_X(x)\log P_X(x) = p\,H(X)
\end{aligned}
\]
Its maximum is $p\log|\mathcal X|$, and the optimal input distribution is uniform.
21. Since this is an additive noise channel, the Shannon capacity is given by $C = \log|\mathcal X| - H(N) = \log 5 - \log 2 = \log\frac52 = 1.321928\dots$
a) Let $x = (x_1,\dots,x_k),\ y = (y_1,\dots,y_k) \in V(G)^k \cong V(G^k)$ be two possible messages. These two vertices are joined in $G^k$ iff $(x_1,\dots,x_{k-1})$ and $(y_1,\dots,y_{k-1})$ are adjacent in $G^{k-1}$ and also $x_k$ and $y_k$ are adjacent in $G$. By induction it follows that $x$ and $y$ are adjacent iff the corresponding messages can produce the same output with nonzero probability when their letters are sent through the channel over $k$ uses. An element of a set of such messages can be transmitted with zero probability of error precisely when no two of them are joined by an edge. The largest possible size of such a set is $\alpha_0(G^k)$.
b) By $a_k = \frac1k\log\alpha_0(G^k) = \log\sqrt[k]{\alpha_0(G^k)} \le \log\sqrt[k]{|G^k|} = \log|G|$ the sequence is bounded from above, and by $\frac1k\log\alpha_0(G^k) \ge \frac1k\log 1 = 0$ it is also nonnegative. In particular, $0 \le \liminf_{k\to\infty} a_k \le \limsup_{k\to\infty} a_k \le \log|G|$, and we only need to show that in the middle we have equality.
For simplicity, in what follows “independent” will always mean “independent disregarding the loops”. If $S_1 \subseteq V(G^{k_1})$ and $S_2 \subseteq V(G^{k_2})$ are two independent sets then $S_1\times S_2 \subseteq V(G^{k_1})\times V(G^{k_2}) \cong V(G^{k_1+k_2})$ is also independent, and therefore
\[
a_{k_1+k_2} = \frac{1}{k_1+k_2}\log\alpha_0(G^{k_1+k_2})
\ge \frac{1}{k_1+k_2}\log\bigl(\alpha_0(G^{k_1})\,\alpha_0(G^{k_2})\bigr)
= \frac{1}{k_1+k_2}\log 2^{k_1 a_{k_1} + k_2 a_{k_2}}
= \frac{k_1 a_{k_1} + k_2 a_{k_2}}{k_1+k_2}
\]
In particular, it follows by induction that $a_{mk} \ge a_k$, and for $n = mk+l$ with $m\in\mathbb N$ and $0 \le l \le k-1$ the inequality
\[
a_{mk+l} \ge \frac{mk\,a_k + l\,a_1}{mk+l} \ge \frac{mk\,a_k + k\,a_1}{mk+k} = \frac{m}{m+1}a_k + \frac{1}{m+1}a_1
\]
holds. As $m = \lfloor n/k\rfloor$, for all $\varepsilon > 0$ there exists $N_\varepsilon\in\mathbb N$ such that $n \ge N_\varepsilon \implies a_n \ge a_k - \varepsilon$, i.e. $a_k \le \liminf_{n\to\infty} a_n$. From this we have that $\limsup_{k\to\infty} a_k \le \sup_{k\in\mathbb N} a_k \le \liminf_{k\to\infty} a_k \le \limsup_{k\to\infty} a_k$, therefore we have equality everywhere, and the limit exists.
c) The vertices $S = \{(0,0), (1,2), (2,4), (3,1), (4,3)\}$ are pairwise non-adjacent in $G^2$, hence $a_2 \ge \frac12\log|S| = \log\sqrt5$.
d) By rotational symmetry we only need to observe that $\langle u_0,u_0\rangle \ne 0$ as $u_0 \ne 0$, and to verify that $\langle u_0,u_1\rangle = 0$ and $\langle u_0,u_2\rangle \ne 0$. We have
\[
\langle u_0, u_i\rangle = (5^{-1/4})^2 + (1 - 5^{-1/2})\cos\!\left(\frac{4\pi}{5}\,i\right).
\]
We need to find the values of $\cos\frac{4\pi}{5} = \cos\bigl(2\cdot\frac{2\pi}{5}\bigr)$ and $\cos\frac{8\pi}{5} = \cos\frac{2\pi}{5}$.
These are solutions of the equation $\cos 5\alpha = 1$, which is a polynomial in $\cos\alpha$. Using $\cos(\alpha+\beta) = \cos\alpha\cos\beta - \sin\alpha\sin\beta$, $\cos 2\alpha = \cos^2\alpha - \sin^2\alpha = 2\cos^2\alpha - 1$ and $\sin 2\alpha = 2\sin\alpha\cos\alpha$ we have that
\[
\begin{aligned}
0 = \cos 5\alpha - 1 &= \cos(\alpha + 4\alpha) - 1 = \cos\alpha\cos 4\alpha - \sin\alpha\sin 4\alpha - 1 \\
&= \cos\alpha\,(2\cos^2 2\alpha - 1) - 2\sin\alpha\sin 2\alpha\cos 2\alpha - 1 \\
&= \cos\alpha\bigl(2(2\cos^2\alpha-1)^2 - 1\bigr) - 4\sin^2\alpha\cos\alpha\,(2\cos^2\alpha-1) - 1 \\
&= 16\cos^5\alpha - 20\cos^3\alpha + 5\cos\alpha - 1
= 16y^5 - 20y^3 + 5y - 1 = T_5(y) - 1,
\end{aligned}
\]
where we have introduced $y = \cos\alpha$. Of course one of the solutions is $\alpha = 0$, which corresponds to $y = 1$, therefore we can factor the polynomial as follows:
\[
16y^5 - 20y^3 + 5y - 1 = (y-1)(16y^4 + 16y^3 - 4y^2 - 4y + 1)
\]
As it turns out, the second factor is the square of a polynomial. In general, $T_n$ denotes the unique polynomial for which $\cos n\alpha = T_n(\cos\alpha)$. Taking the derivative we get
\[
-n\sin n\alpha = -T'_n(\cos\alpha)\sin\alpha,
\]
and hence $T'_n(\cos\alpha) = 0$ iff $\sin n\alpha = 0$ and $\sin\alpha \ne 0$, which is equivalent to $|\cos n\alpha| = 1$ and $\alpha\notin\pi\mathbb Z$. In particular, all zeros of $T_n - 1$ (or $T_n + 1$) with absolute value less than $1$ are multiple. But there are $\lfloor\frac{n-1}{2}\rfloor$ distinct roots in $(-1,1)$ and $1$ is always a root, while $-1$ is a root for even $n$, and hence all the roots in $(-1,1)$ must have multiplicity $2$.
In our case $T_5(y) - 1 = (y-1)(-1 + 2y + 4y^2)^2$, and hence the roots are
\[
\cos\frac{4\pi}{5} = \frac{-1-\sqrt5}{4} \qquad\text{and}\qquad \cos\frac{8\pi}{5} = \frac{-1+\sqrt5}{4},
\]
therefore
\[
\langle u_0,u_1\rangle = \frac{1}{\sqrt5} + \left(1 - \frac{1}{\sqrt5}\right)\frac{-1-\sqrt5}{4}
= \frac{1}{\sqrt5} + \frac{1}{\sqrt5}\cdot\frac{(\sqrt5-1)(-1-\sqrt5)}{4}
= \frac{1}{\sqrt5} - \frac{1}{\sqrt5}\cdot\frac{5-1}{4} = 0
\]
and
\[
\langle u_0,u_2\rangle = \frac{1}{\sqrt5} + \left(1 - \frac{1}{\sqrt5}\right)\frac{-1+\sqrt5}{4}
= \frac{1}{\sqrt5} + \frac{1}{\sqrt5}\cdot\frac{(\sqrt5-1)^2}{4} \ne 0,
\]
as claimed.
e)
\[
\langle u_{i_1}\otimes\dots\otimes u_{i_k},\, u_{j_1}\otimes\dots\otimes u_{j_k}\rangle
= \langle u_{i_1},u_{j_1}\rangle\cdots\langle u_{i_k},u_{j_k}\rangle
\]
is equal to $0$ iff at least one factor is $0$. This happens precisely when at least one of the pairs $\{i_l,j_l\}$ ($1\le l\le k$) is not an edge of $G$, which is in turn equivalent to $\{(i_1,\dots,i_k),(j_1,\dots,j_k)\}$ not being an edge of $G^k$. An independent set $S$ is therefore mapped into a set of orthogonal vectors. Also $\|u_i\| = 1$, and therefore this set is a subset of an orthonormal basis $B$. Now for any vector $v$ we have
\[
\sum_{b\in B}\langle v,b\rangle^2 = \|v\|^2,
\]
and removing nonnegative terms from the left hand side does not increase the sum. Now
\[
1 = \|e\|^2 \ge \sum_{(i_1,\dots,i_k)\in S}\langle e\otimes\dots\otimes e,\, u_{i_1}\otimes\dots\otimes u_{i_k}\rangle^2
= \sum_{(i_1,\dots,i_k)\in S}\prod_{j=1}^k\langle e,u_{i_j}\rangle^2 = |S|\,5^{-k/2},
\]
and hence $a_k \le \frac1k\log 5^{k/2} = \log\sqrt5$. Together with $\Theta(G) \ge \log\sqrt5 = 1.160964\dots$ we have that here equality holds.
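Parts b) and c) can be illustrated by brute force for $k = 2$ (a sketch only; it assumes the confusability relation of the pentagon channel is cyclic adjacency): the code from part c) is independent, and no independent set of size $6$ exists, so $\alpha_0(G^2) = 5$.

```python
from itertools import combinations, product

def confusable(a, b):
    """Two pentagon symbols are confusable iff they are equal or cyclically adjacent."""
    return (a - b) % 5 in (0, 1, 4)

def independent(S):
    """No two distinct elements of S are confusable in every coordinate (strong product)."""
    return all(not all(confusable(a, b) for a, b in zip(x, y))
               for x, y in combinations(S, 2))

verts = list(product(range(5), repeat=2))
S = [(i, 2 * i % 5) for i in range(5)]                        # the code from part c)
print(independent(S))                                         # True
print(any(independent(T) for T in combinations(verts, 6)))    # False (brute force over ~1.8e5 subsets)
```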
22. Let PX be the distribution of the input random variable, PY and PZ
the corresponding output distributions such that PY |X and PZ|Y are the
appropriate conditional probabilities and let C1 and C2 denote the capacities of the two channels, respectively, and C that of the composite
channel. Using the data processing inequalities we have
\[
C = \max_{P_X} I(X:Z) \le \max_{P_X} I(X:Y) = C_1
\]
and
\[
C = \max_{P_X} I(X:Z) \le \max_{P_X} I(Y:Z) \le \max_{P_Y} I(Y:Z) = C_2
\]
Both C1 and C2 are bounded by log |Y|, and therefore also C is. By
induction it follows that the composition of k channels with capacities
C1 , C2 , . . . , Ck has capacity at most min{Ci |1 ≤ i ≤ k}, and if the intermediate alphabets are Y1 , . . . , Yk−1 , then the capacity of the composition
is at most min{log |Yj ||1 ≤ j ≤ k − 1}.
23. a) If $P_{X_1\dots X_n}$ is uniform on $C \subseteq \mathcal X^n$, then $P_{X_1\dots X_n|Y_1\dots Y_n}(x|y) = 0$ if $x\notin C$, and
\[
P_{X_1\dots X_n|Y_1\dots Y_n}(x|y) = \frac{P_{Y_1\dots Y_n|X_1\dots X_n}(y|x)}{\sum_{x'\in C} P_{Y_1\dots Y_n|X_1\dots X_n}(y|x')}
\]
otherwise. This attains its maximum iff $x\in C$ and the likelihood $P_{Y_1\dots Y_n|X_1\dots X_n}(y|x)$ is maximal. Also, either both or none of these two maxima are unique, therefore the two decodings give the same result.
b) The Hamming distance
\[
d(x_1\dots x_n, x'_1\dots x'_n) = |\{i\in\{1,\dots,n\} \mid x_i \ne x'_i\}| = |\{i\in\{1,\dots,n\} \mid x'_i \ne x_i\}| = d(x'_1\dots x'_n, x_1\dots x_n)
\]
is symmetric, and $0$ iff $\forall i : x_i = x'_i$, i.e. when $x_1\dots x_n = x'_1\dots x'_n$. As $x_i \ne x''_i \implies x_i \ne x'_i \lor x'_i \ne x''_i$, the triangle inequality
\[
\begin{aligned}
d(x_1\dots x_n, x'_1\dots x'_n) + d(x'_1\dots x'_n, x''_1\dots x''_n)
&= |\{i \mid x_i \ne x'_i\}| + |\{i \mid x'_i \ne x''_i\}| \\
&\ge |\{i \mid x_i \ne x'_i\} \cup \{i \mid x'_i \ne x''_i\}|
\ge |\{i \mid x_i \ne x''_i\}| = d(x_1\dots x_n, x''_1\dots x''_n)
\end{aligned}
\]
also holds.
The likelihood
\[
P_{Y_1\dots Y_n|X_1\dots X_n}(y|x) = \prod_{i=1}^n P_{Y|X}(y_i|x_i)
= (1-p)^{|\{i \mid x_i = y_i\}|}\left(\frac{p}{|\mathcal X|-1}\right)^{|\{i \mid x_i \ne y_i\}|}
= (1-p)^n\left(\frac{p}{(|\mathcal X|-1)(1-p)}\right)^{d(x,y)}
\]
decreases monotonically with increasing $d(x,y)$ as long as $p < (1-p)(|\mathcal X|-1)$, i.e. $p < 1 - |\mathcal X|^{-1}$. Then maximum likelihood decoding amounts to minimal Hamming distance decoding. If $p = 1 - |\mathcal X|^{-1}$, then the likelihood becomes independent of $x$, and the decoder always gives failure (the special symbol $*$). If $p > 1 - |\mathcal X|^{-1}$ and $|\mathcal X| > 2$ then the maximum is again nonunique for every output $y$, and the decoding results in $*$, while for $|\mathcal X| = 2$ we get the unique codeword which maximizes the distance.
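The equivalence of maximum likelihood and minimum distance decoding for $p < 1 - |\mathcal X|^{-1}$ can be checked exhaustively on a toy binary code (an illustrative sketch; the codebook is an arbitrary choice):

```python
from itertools import product

p, A = 0.1, 2                                 # BSC(0.1); here p < 1 - 1/|X|
codebook = [(0, 0, 0, 0, 0), (1, 1, 1, 1, 1), (0, 0, 1, 1, 1)]

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def likelihood(y, x):
    d = hamming(x, y)
    return (p / (A - 1)) ** d * (1 - p) ** (len(x) - d)

for y in product(range(A), repeat=5):
    ml  = max(codebook, key=lambda x: likelihood(y, x))
    mdd = min(codebook, key=lambda x: hamming(y, x))
    # both rules select codewords of the same (maximal) likelihood
    assert likelihood(y, ml) == likelihood(y, mdd)
```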
24. By the AEP there is a typical set $A^n_\varepsilon \subseteq \mathcal X^n$ for each $\varepsilon$ and large enough $n$ with $|A^n_\varepsilon| \le 2^{n(H+\varepsilon)}$ and $\mathrm{Pr}(X_1\dots X_n \in A^n_\varepsilon) \ge 1-\varepsilon$. Its elements can be indexed by bit strings of length $n(H+\varepsilon)$, which can be transmitted by $n$ uses of the channel with error probability at most $\varepsilon$ if $H + \varepsilon = R < C$. Then
\[
\mathrm{Pr}(X^n \ne X'^n) \le \mathrm{Pr}(X^n\notin A^n_\varepsilon) + \mathrm{Pr}\bigl(g_n(Z) \ne X^n \,\big|\, X^n\in A^n_\varepsilon\bigr) \le 2\varepsilon,
\]
therefore we can reconstruct the sequence with low error probability if n
is sufficiently large.
For the converse part we have to show that $\mathrm{Pr}(X^n \ne X'^n) \to 0$ implies $H \le C$ for any source-channel code $(f_n,g_n)_{n\in\mathbb N}$. By Fano's inequality we have
\[
H(X^n|X'^n) \le 1 + \mathrm{Pr}(X^n \ne X'^n)\log|\mathcal X^n| = 1 + n\,\mathrm{Pr}(X^n \ne X'^n)\log|\mathcal X|.
\]
For such a code therefore
\[
\begin{aligned}
H &\le \frac{H(X_1,\dots,X_n)}{n} = \frac{H(X^n)}{n}
= \frac1n H(X^n|X'^n) + \frac1n I(X^n : X'^n) \\
&\le \frac1n\bigl(1 + n\,\mathrm{Pr}(X^n\ne X'^n)\log|\mathcal X|\bigr) + \frac1n I(Y^n : Z^n)
\le \frac1n + \mathrm{Pr}(X^n\ne X'^n)\log|\mathcal X| + C.
\end{aligned}
\]
Now we let n → ∞ and get H ≤ C.
25. Using the chain rule for entropy we have
\[
\begin{aligned}
I(X_1,\dots,X_n; Y_1,\dots,Y_n)
&= H(X_1,\dots,X_n) - H(X_1,\dots,X_n|Y_1,\dots,Y_n) \\
&= H(X_1,\dots,X_n) - H(Z_1,\dots,Z_n|Y_1,\dots,Y_n)
\ge H(X_1,\dots,X_n) - H(Z_1,\dots,Z_n) \\
&= H(X_1,\dots,X_n) - \bigl(H(Z_1) + H(Z_2|Z_1) + \dots + H(Z_n|Z_1,\dots,Z_{n-1})\bigr) \\
&\ge H(X_1,\dots,X_n) - \bigl(H(Z_1) + H(Z_2) + \dots + H(Z_n)\bigr)
= H(X_1,\dots,X_n) - nH(p),
\end{aligned}
\]
where the second equality holds because, given $Y_1,\dots,Y_n$, knowing $X_1,\dots,X_n$ is the same as knowing $Z_1,\dots,Z_n$. The supremum of the right hand side is $n - nH(p) = nC$; this gives a lower bound on the supremum of the mutual information, i.e. on the capacity.
26. An $(M,n)$-code $f : \{1,\dots,M\}\to\mathcal X^n$, $g : \mathcal Y^n\to\{1,\dots,M\}$ can be viewed as an $(M,n)$-code with shared randomness by composing with the projections $\{1,\dots,M\}\times\mathcal Z\to\{1,\dots,M\}$ and $\mathcal Y^n\times\mathcal Z\to\mathcal Y^n$, respectively. Therefore $C_{SR} \ge C$.
For a given value of $n$ and shared random variable $Z$, let $W$ be uniformly distributed over $\{1,\dots,M = 2^{nR}\}$, $X = f(W,Z)$ the input random variable (with values in $\mathcal X^n$), $Y$ the corresponding output, and $W' = g(Y,Z)$ the decoded message. For the Markov chain $W \to (X,Z) \to (Y,Z) \to W'$ we have that
\[
I(W;W') \le I(W;Y,Z) = \underbrace{I(W;Z)}_{0} + I(W;Y|Z) \le I(X;Y|Z) \le nC,
\]
where $I(W;Z) = 0$ as $W$ is independent of $Z$, we have used a conditional version of the data processing inequality, $I(W;Y|Z) \le I(X;Y|Z)$, and finally that
\[
I(X;Y|Z) = \sum_{z\in\mathcal Z} P_Z(z)\, I(X;Y|Z=z),
\]
where for each $z$ the conditional distributions $P_{Y|Z=z}$ and $P_{X|Z=z}$ are related by the same channel, as it is independent of $Z$. Therefore
\[
nR = H(W) = H(W|W') + I(W;W') \le 1 + P_e\, nR + I(W;W') \le 1 + P_e\, nR + nC,
\]
where the first inequality is Fano's inequality. Dividing by $n$ we get
\[
R \le P_e R + \frac1n + C.
\]
Now let $n\to\infty$ and assume that $P_e\to 0$ to get $R \le C$.