Information theory exercises
Solutions, Winter 2011/2012

1. Let us denote the ranges of $A$, $B$, $C$ by $\mathcal{A}$, $\mathcal{B}$, $\mathcal{C}$, respectively, and their (joint, conditional) distributions by symbols like $P_A(\cdot)$, $P_{AB}(\cdot,\cdot)$, $P_{A|B}(\cdot|\cdot)$. Then
\[
H(A,B) + H(B,C) - H(A,B,C) - H(B) = -\sum_{a,b} P_{AB}(a,b)\log P_{AB}(a,b) - \sum_{b,c} P_{BC}(b,c)\log P_{BC}(b,c) + \sum_{a,b,c} P_{ABC}(a,b,c)\log P_{ABC}(a,b,c) + \sum_{b} P_B(b)\log P_B(b).
\]
Write $P_{AB}(a,b) = P_B(b)P_{A|B}(a|b)$, $P_{BC}(b,c) = P_B(b)P_{C|B}(c|b)$ and $P_{ABC}(a,b,c) = P_B(b)P_{AC|B}(a,c|b)$, and split each logarithm as $\log P_B(b) + \log P_{\,\cdot\,|B}(\,\cdot\,|b)$. Now consider the terms containing $\log P_B(b)$. In these the sums over $a$ and $c$ can be carried out separately for each fixed $b$, and each gives 1 independently of $b$. What remains of these terms is the entropy of $B$, twice with positive and twice with negative sign, so they cancel. Treating the remaining sum over $b$ separately, one has
\[
\cdots = \sum_{b} P_B(b)\Big(-\sum_{a} P_{A|B}(a|b)\log P_{A|B}(a|b) - \sum_{c} P_{C|B}(c|b)\log P_{C|B}(c|b) + \sum_{a,c} P_{AC|B}(a,c|b)\log P_{AC|B}(a,c|b)\Big).
\]
For each $b$, $P_{AC|B}(\cdot,\cdot|b)$ is a probability distribution on $\mathcal{A}\times\mathcal{C}$ with $P_{A|B}(\cdot|b)$ and $P_{C|B}(\cdot|b)$ as its marginals. Applying subadditivity of the entropy to these distributions shows that every term of the sum over $b$ is nonnegative, which gives the desired inequality. (Alternatively, we can repeat the steps of the proof of subadditivity inside the sum over $b$.)

2. By strict concavity of $H(\cdot)$ there is at most one maximum on a convex subset of a linear space, and a stationary point can only be a maximum. The maximum remains a maximum when we restrict the function to any finite dimensional affine subspace containing it, so we can find a candidate using the Lagrange multiplier method. We seek a stationary point of
\[
-\sum_{i=0}^{\infty} p_i\log p_i + \lambda\sum_{i=0}^{\infty} p_i + \beta\sum_{i=0}^{\infty} i p_i
\]
subject to the constraints
\[
\sum_{i=0}^{\infty} p_i = 1 \qquad\text{and}\qquad \sum_{i=0}^{\infty} i p_i = A.
\]
The derivative with respect to $p_j$ gives
\[
0 = -\log p_j - 1 + \lambda + \beta j, \qquad\text{that is,}\qquad p_j = e^{-1+\lambda+\beta j}.
\]
Introducing the new variable $q := 1 - e^{\beta}$ and taking the normalization constraint into account, we get the geometric distribution $p_j = (1-q)^j q$. Its expected value is $\frac{1-q}{q}$, therefore $q = (1+A)^{-1}$. It remains to see that if $p'$ is an arbitrary other probability distribution with expected value $A$, then along the line segment joining $p$ and $p'$ the maximum of the entropy is at $p$. At $t=0$,
\[
\frac{d}{dt}\Big(-H((1-t)p + tp')\Big)\Big|_{t=0} = \frac{d}{dt}\sum_{i=0}^{\infty}\big(p_i + t(p'_i-p_i)\big)\log\big(p_i + t(p'_i-p_i)\big)\Big|_{t=0} = \sum_{i=0}^{\infty}(p'_i-p_i)\log p_i + \sum_{i=0}^{\infty}(p'_i-p_i) = \lambda\sum_{i=0}^{\infty}(p'_i-p_i) + \beta\sum_{i=0}^{\infty} i(p'_i-p_i) = 0,
\]
where in the last step we used $\log p_i = -1+\lambda+\beta i$ together with the fact that both $p$ and $p'$ are normalized and have expectation $A$, and where we used that the convergence of the first sum and of its termwise derivative is uniform on some neighbourhood $t\in[0,\varepsilon)$. (Check!) Since $H$ is strictly concave along the segment and its derivative vanishes at $t=0$, the maximum over $t\in[0,1]$ is attained at $t=0$, i.e. $H(p')\le H(p)$.

3. a) A permutation of the arguments means reordering the terms of a finite sum, which does not change its value.

b) Adding an outcome of probability 0 does not change $H_\alpha$, since $0^\alpha = 0$ (note that we use the convention $0^0 = 0$ to ensure continuity in $\alpha$ at 0).

c) If $X$ and $Y$ are independent, then
\[
H_\alpha(X,Y) = \frac{1}{1-\alpha}\log\sum_{x\in\mathcal{X},\,y\in\mathcal{Y}} P_{XY}(x,y)^\alpha = \frac{1}{1-\alpha}\log\sum_{x,y} P_X(x)^\alpha P_Y(y)^\alpha = \frac{1}{1-\alpha}\log\Big(\sum_{x\in\mathcal{X}} P_X(x)^\alpha\Big)\Big(\sum_{y\in\mathcal{Y}} P_Y(y)^\alpha\Big) = \frac{1}{1-\alpha}\Big(\log\sum_{x\in\mathcal{X}} P_X(x)^\alpha + \log\sum_{y\in\mathcal{Y}} P_Y(y)^\alpha\Big) = H_\alpha(X) + H_\alpha(Y).
\]

d) The largest term in the sum is at least $n^{-\alpha}$ (the largest probability is at least $1/n$), therefore the argument of the logarithm is separated from 0 by at least $n^{-\alpha}$. The result follows as the composition of continuous functions is continuous.

4. Observe that $H(X_i) = H(X_\Theta|\Theta = i)$. Using this we have
\[
pH(X_0) + (1-p)H(X_1) = pH(X_\Theta|\Theta=0) + (1-p)H(X_\Theta|\Theta=1) = H(X_\Theta|\Theta) \le H(X_\Theta).
\]
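As a quick numerical sanity check, not part of the original solutions, the following Python sketch tests the inequality of Exercise 1, $H(A,B)+H(B,C)-H(A,B,C)-H(B)\ge0$, on randomly generated joint distributions; the alphabet sizes and helper functions are arbitrary illustrative choices.

\begin{verbatim}
# Numerical sanity check of Exercise 1 (strong subadditivity / submodularity).
import random
from math import log2

def entropy(probs):
    """Shannon entropy (base 2) of an iterable of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

def random_joint(na, nb, nc):
    """A random joint distribution P(a, b, c) as a dict keyed by triples."""
    w = {(a, b, c): random.random()
         for a in range(na) for b in range(nb) for c in range(nc)}
    total = sum(w.values())
    return {k: v / total for k, v in w.items()}

def marginal(p, keep):
    """Marginalize the joint distribution onto the coordinates listed in keep."""
    out = {}
    for key, v in p.items():
        k = tuple(key[i] for i in keep)
        out[k] = out.get(k, 0.0) + v
    return out

for _ in range(1000):
    p = random_joint(3, 4, 2)
    h_ab = entropy(marginal(p, (0, 1)).values())
    h_bc = entropy(marginal(p, (1, 2)).values())
    h_b = entropy(marginal(p, (1,)).values())
    h_abc = entropy(p.values())
    assert h_ab + h_bc - h_abc - h_b >= -1e-12
print("H(A,B) + H(B,C) >= H(A,B,C) + H(B) held in all random trials")
\end{verbatim}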
5. a) Let $\mathcal{A}_0\subseteq\mathcal{A}$ be the support of $P$. Then, by concavity of the logarithm (Jensen's inequality),
\[
-D(P\|Q) = \sum_{a\in\mathcal{A}_0} P(a)\log\frac{Q(a)}{P(a)} \le \log\sum_{a\in\mathcal{A}_0} P(a)\frac{Q(a)}{P(a)} = \log\underbrace{\sum_{a\in\mathcal{A}_0} Q(a)}_{\le 1} \le \log 1 = 0.
\]

b)
\[
D(P_1\times P_2\|Q_1\times Q_2) = \sum_{(a_1,a_2)\in\mathcal{A}_1\times\mathcal{A}_2} (P_1\times P_2)(a_1,a_2)\log\frac{(P_1\times P_2)(a_1,a_2)}{(Q_1\times Q_2)(a_1,a_2)} = \sum_{(a_1,a_2)} P_1(a_1)P_2(a_2)\Big(\log\frac{P_1(a_1)}{Q_1(a_1)} + \log\frac{P_2(a_2)}{Q_2(a_2)}\Big) = \sum_{a_1\in\mathcal{A}_1} P_1(a_1)\log\frac{P_1(a_1)}{Q_1(a_1)} + \sum_{a_2\in\mathcal{A}_2} P_2(a_2)\log\frac{P_2(a_2)}{Q_2(a_2)} = D(P_1\|Q_1) + D(P_2\|Q_2).
\]

c) For nonnegative numbers $a_1,\dots,a_n$, $b_1,\dots,b_n$ the log-sum inequality
\[
\sum_{i=1}^n a_i\log\frac{a_i}{b_i} \ge \Big(\sum_{i=1}^n a_i\Big)\log\frac{\sum_{i=1}^n a_i}{\sum_{i=1}^n b_i}
\]
holds, with equality iff the ratio $a_i/b_i$ is the same for all $i$. This can be seen by applying Jensen's inequality $\sum_{i=1}^n\alpha_i f(t_i)\ge f(\sum_{i=1}^n\alpha_i t_i)$ to the strictly convex function $f(t) = t\log t$ with $\alpha_i = b_i/\sum_j b_j$ and $t_i = a_i/b_i$. Now use this inequality with $n=2$, $a_1 = \lambda P(i)$, $a_2 = (1-\lambda)P'(i)$, $b_1 = \lambda Q(i)$, $b_2 = (1-\lambda)Q'(i)$ to get, for every $i$,
\[
\big[\lambda P(i) + (1-\lambda)P'(i)\big]\log\frac{\lambda P(i) + (1-\lambda)P'(i)}{\lambda Q(i) + (1-\lambda)Q'(i)} \le \lambda P(i)\log\frac{\lambda P(i)}{\lambda Q(i)} + (1-\lambda)P'(i)\log\frac{(1-\lambda)P'(i)}{(1-\lambda)Q'(i)},
\]
and take the sum over $i$; this gives the joint convexity $D(\lambda P + (1-\lambda)P'\,\|\,\lambda Q + (1-\lambda)Q') \le \lambda D(P\|Q) + (1-\lambda)D(P'\|Q')$.

6. Let $\mathcal{X}$ denote the range of the random variable $X$, and let $P_X$ be its distribution. Let $P_U$ denote the uniform distribution on $\mathcal{X}$, i.e. $P_U(A) = \frac{|A|}{|\mathcal{X}|}$ for $A\subseteq\mathcal{X}$. Then
\[
H(X) = -\sum_{x\in\mathcal{X}} P_X(x)\log P_X(x) = -\sum_{x\in\mathcal{X}} P_X(x)\Big(\log\frac{P_X(x)}{P_U(x)} + \log P_U(x)\Big) = \log|\mathcal{X}| - D(P_X\|P_U).
\]

7. Let $X$, $Y$ be a pair of random variables with ranges $\mathcal{X}$ and $\mathcal{Y}$, respectively. Their mutual information can be written as
\[
I(X:Y) = H(X) + H(Y) - H(X,Y) = \sum_{x\in\mathcal{X},\,y\in\mathcal{Y}} P_{XY}(x,y)\big(-\log P_X(x) - \log P_Y(y) + \log P_{XY}(x,y)\big) = \sum_{x,y} P_{XY}(x,y)\log\frac{P_{XY}(x,y)}{P_X(x)P_Y(y)} = D(P_{XY}\|P_X\times P_Y).
\]

8. a) An element of $\mathcal{P}_n$ can be uniquely identified by an $r$-tuple of rational numbers between 0 and 1 with denominator $n$. Even without the normalization constraint the number of possible choices for the numerators is $(n+1)^r$, and adding a constraint cannot increase this.

b)
\[
Q^n(x) = \prod_{i=1}^n Q(x_i) = \prod_{a\in\mathcal{A}} Q(a)^{nP_x(a)} = \prod_{a\in\mathcal{A}} 2^{nP_x(a)\log Q(a)} = 2^{-n\sum_{a\in\mathcal{A}}\left(-P_x(a)\log P_x(a) + P_x(a)\log P_x(a) - P_x(a)\log Q(a)\right)} = 2^{-n(H(P_x)+D(P_x\|Q))},
\]
which for $x$ of type $P$ equals $2^{-n(H(P)+D(P\|Q))}$.

c) For the upper bound, calculate the probability of $T_P^n$ with respect to the distribution $P^n$:
\[
1 \ge P^n(T_P^n) = \sum_{x\in T_P^n} P^n(x) = |T_P^n|\cdot 2^{-nH(P)}.
\]
For the lower bound, we first show that $P^n(T_P^n)\ge P^n(T_Q^n)$ for any $Q\in\mathcal{P}_n$:
\[
\frac{P^n(T_P^n)}{P^n(T_Q^n)} = \frac{|T_P^n|\prod_{a\in\mathcal{A}} P(a)^{nP(a)}}{|T_Q^n|\prod_{a\in\mathcal{A}} P(a)^{nQ(a)}} = \prod_{a\in\mathcal{A}}\frac{(nQ(a))!}{(nP(a))!}\,P(a)^{n(P(a)-Q(a))} \ge \prod_{a\in\mathcal{A}} (nP(a))^{nQ(a)-nP(a)}\,P(a)^{n(P(a)-Q(a))} = n^{n\sum_{a\in\mathcal{A}}(Q(a)-P(a))} = n^0 = 1,
\]
where we used that $\frac{m!}{n!}\ge n^{m-n}$. Now
\[
1 = \sum_{Q\in\mathcal{P}_n} P^n(T_Q^n) \le |\mathcal{P}_n|\max_{Q\in\mathcal{P}_n} P^n(T_Q^n) = |\mathcal{P}_n|\,P^n(T_P^n) = |\mathcal{P}_n|\cdot|T_P^n|\cdot 2^{-nH(P)},
\]
so $|T_P^n|\ge|\mathcal{P}_n|^{-1}2^{nH(P)}\ge(n+1)^{-r}2^{nH(P)}$.

d)
\[
Q^n(T_P^n) = \sum_{x\in T_P^n} Q^n(x) = \sum_{x\in T_P^n} 2^{-n(H(P)+D(P\|Q))} = |T_P^n|\,2^{-n(H(P)+D(P\|Q))}.
\]
Now use the estimates of the cardinality of $T_P^n$ from part c).

9.
\[
h(p_e) = h\Big(\frac1n\sum_{i=1}^n \mathrm{Prob}(X_i\ne f_i(Y))\Big) \ge \frac1n\sum_{i=1}^n h\big(\mathrm{Prob}(X_i\ne f_i(Y))\big) \ge \frac1n\sum_{i=1}^n H(X_i|Y) \ge \frac1n\big(H(X_1|Y) + H(X_2|YX_1) + \dots + H(X_n|YX_1\dots X_{n-1})\big) = \frac1n H(X|Y).
\]
Here we first used Jensen's inequality (concavity of $h$), next the Fano inequality, then strong subadditivity (in the form $H(A|BC)\le H(A|B)$), and finally the chain rule.

10. Use the data processing inequality, the expansion of mutual information in terms of entropy and conditional entropy, nonnegativity of conditional entropy and the bound on the entropy of a probability distribution on a set of cardinality $|\mathcal{Y}|$ to get
\[
I(X:Z) \le I(X:Y) = H(Y) - H(Y|X) \le H(Y) \le \log|\mathcal{Y}|.
\]
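The estimates in Exercise 8 are easy to check numerically. The following sketch, not part of the original solutions, works on a binary alphabet, where $|T_P^n|$ is a binomial coefficient; the particular values of $n$, $k$ and $q$ are arbitrary choices.

\begin{verbatim}
# Numerical illustration of the type-class bounds of Exercise 8 (binary case).
from math import comb, log2

def h(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

n, k = 60, 20                 # type P = (k/n, 1 - k/n) on a binary alphabet
P = k / n
size = comb(n, k)             # |T_P^n| is exactly a binomial coefficient
upper = 2 ** (n * h(P))
lower = upper / (n + 1) ** 2  # at most (n+1)^2 type classes for |A| = 2
assert lower <= size <= upper
print(f"lower = {lower:.3e}, |T_P^n| = {size:.3e}, upper = {upper:.3e}")

# Parts b) and d): every sequence of type P has Q-probability
# 2^(-n(H(P)+D(P||Q))), so Q^n(T_P^n) = |T_P^n| * 2^(-n(H(P)+D(P||Q))).
q = 0.25
D = P * log2(P / q) + (1 - P) * log2((1 - P) / (1 - q))
prob_class = comb(n, k) * q ** k * (1 - q) ** (n - k)
print(f"Q^n(T_P^n) = {prob_class:.3e}, "
      f"|T_P^n| * 2^(-n(H+D)) = {size * 2 ** (-n * (h(P) + D)):.3e}")
\end{verbatim}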
11. It is convenient to introduce the conditional relative entropy of two joint probability distributions $p(x,y)$ and $q(x,y)$ on the same set $\mathcal{X}\times\mathcal{Y}$ as follows:
\[
D(p(y|x)\|q(y|x)) = \sum_{x\in\mathcal{X}} p(x)\sum_{y\in\mathcal{Y}} p(y|x)\log\frac{p(y|x)}{q(y|x)}.
\]
This quantity is nonnegative and equals 0 iff $p(y|x) = q(y|x)$ for all $(x,y)\in\mathcal{X}\times\mathcal{Y}$ such that $p(x)>0$. There is a chain rule for relative entropy involving conditional relative entropy:
\[
D(p(x,y)\|q(x,y)) = D(p(x)\|q(x)) + D(p(y|x)\|q(y|x)).
\]
Using the chain rule in two ways for the relative entropy of the distributions $P_{X_n,X_{n+1}}$ and $P_{X'_n,X'_{n+1}}$ one has
\[
D(P_{X_n,X_{n+1}}\|P_{X'_n,X'_{n+1}}) = D(P_{X_n}\|P_{X'_n}) + D(P_{X_{n+1}|X_n}\|P_{X'_{n+1}|X'_n}) = D(P_{X_{n+1}}\|P_{X'_{n+1}}) + D(P_{X_n|X_{n+1}}\|P_{X'_n|X'_{n+1}}).
\]
Now the conditional distributions $P_{X_{n+1}|X_n}$ and $P_{X'_{n+1}|X'_n}$ are the transition probabilities of the two Markov chains, and hence are the same, so the corresponding conditional relative entropy is 0. By nonnegativity of conditional relative entropy we have
\[
D(P_{X_{n+1}}\|P_{X'_{n+1}}) \le D(P_{X_n}\|P_{X'_n})
\]
as claimed. If the Markov chain is homogeneous and $\mu = P_{X'_1}$ is a stationary distribution, then $\mu = P_{X'_n}$ for all $n\ge1$; in this case $D(P_{X_n}\|\mu)$ decreases as $n$ grows. For example, if the uniform distribution $P_U$ on $\mathcal{X}$ is a stationary distribution of the Markov chain, then
\[
H(X_n) = \log|\mathcal{X}| - D(P_{X_n}\|P_U) \le \log|\mathcal{X}| - D(P_{X_{n+1}}\|P_U) = H(X_{n+1}).
\]

12. Order the elements of $\mathcal{X}$ according to the (prescribed) length of the corresponding codeword: $l_{x_1}\le l_{x_2}\le\dots\le l_{x_{|\mathcal{X}|}} = l$. First draw a complete $|\mathcal{A}|$-ary tree of height $l$, and for each non-leaf vertex, label the $|\mathcal{A}|$ edges going towards the leaves with the elements of $\mathcal{A}$ (arbitrarily). Now in each step take the first unused element $x_i$ of $\mathcal{X}$, pick a remaining vertex at distance $l_{x_i}$ from the root, attach the label $x_i$ to it and remove the vertices of the subtree under this vertex. Repeat these steps until all the symbols are used. Finally, the codewords can be read off from the paths from the root to the labelled vertices.

It is clear that when the algorithm terminates, it gives a prefix code with the given lengths. What we have to show is that it is indeed possible to pick a vertex at distance $l_{x_i}$ from the root in the $i$-th step. This is clearly the case if there are leaves of the original tree (vertices at distance $l$ from the root) left, since then the ancestor of such a leaf at distance $l_{x_i}$ is still available. The number of leaves removed before the $i$-th step is
\[
\sum_{1\le j<i} |\mathcal{A}|^{\,l-l_{x_j}} < \sum_{x\in\mathcal{X}} |\mathcal{A}|^{\,l-l_x} = |\mathcal{A}|^{\,l}\sum_{x\in\mathcal{X}} |\mathcal{A}|^{-l_x} \le |\mathcal{A}|^{\,l}
\]
using the assumption (Kraft's inequality). On the right hand side we have the total number of leaves in the initial tree, so a leaf, and therefore a suitable vertex, always remains.
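The pruning argument of Exercise 12 can be realized concretely: process the lengths in increasing order and give the $i$-th word the base-$|\mathcal{A}|$ expansion, padded to $l_i$ digits, of the partial Kraft sum of the previously assigned words. The sketch below, not part of the original solution, implements this arithmetic variant in Python; the example lengths are an arbitrary choice satisfying Kraft's inequality.

\begin{verbatim}
# Construct a prefix code with prescribed lengths satisfying Kraft's inequality.
from fractions import Fraction

def prefix_code(lengths, D=2):
    """Return codewords (strings over digits 0..D-1) with the given lengths,
    assuming the sum of D^(-l) over the lengths is at most 1."""
    assert sum(Fraction(1, D ** l) for l in lengths) <= 1, "Kraft violated"
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    words = [None] * len(lengths)
    acc = Fraction(0)                    # partial Kraft sum of assigned words
    for i in order:
        l = lengths[i]
        v = int(acc * D ** l)            # an integer, since lengths are sorted
        digits = []
        for _ in range(l):
            digits.append(str(v % D))
            v //= D
        words[i] = "".join(reversed(digits))
        acc += Fraction(1, D ** l)
    return words

code = prefix_code([2, 3, 3, 2, 2])
print(code)                              # ['00', '110', '111', '01', '10']
# sanity check: no codeword is a prefix of another one
assert all(not a.startswith(b) for a in code for b in code if a != b)
\end{verbatim}

Processing the lengths in increasing order is what makes the partial Kraft sum an integer multiple of $D^{-l_i}$ at every step; this mirrors the availability argument in the proof above.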
13. Let $\mathcal{X}$ denote the common range of $X$ and $X'$, and let $P_X$ be their (common) distribution. Applying Jensen's inequality to the convex function $x\mapsto 2^x$ gives
\[
2^{-H(X)} = 2^{\sum_{x\in\mathcal{X}} P_X(x)\log P_X(x)} \le \sum_{x\in\mathcal{X}} P_X(x)\,2^{\log P_X(x)} = \sum_{x\in\mathcal{X}} P_X(x)^2 = \sum_{x\in\mathcal{X}} \mathrm{Prob}(X = X' = x) = \mathrm{Prob}(X = X').
\]

14. The function $f(t) = -t\log t$ is strictly concave and its value is 0 at $t=0$ and $t=1$, therefore it is positive for $0<t<1$. For $\nu\le\frac12$ consider the chords from $t$ to $t+\nu$, where $0\le t\le 1-\nu$. The maximum of the absolute value of their slopes is attained at $t=0$ or $t=1-\nu$, hence
\[
|f(t) - f(t+\nu)| \le \max\{f(\nu), f(1-\nu)\} = -\nu\log\nu.
\]
Using this with $\nu = |P(x)-Q(x)|$ (which is at most $\frac12$ by the assumption on $\|P-Q\|_1$) we have
\[
|H(P)-H(Q)| = \Big|\sum_{x\in\mathcal{X}}\big(f(P(x))-f(Q(x))\big)\Big| \le \sum_{x\in\mathcal{X}}\big|f(P(x))-f(Q(x))\big| \le \sum_{x\in\mathcal{X}} f\big(|P(x)-Q(x)|\big)
\]
\[
= -\sum_{x\in\mathcal{X}} \|P-Q\|_1\,\frac{|P(x)-Q(x)|}{\|P-Q\|_1}\Big(\log\frac{|P(x)-Q(x)|}{\|P-Q\|_1} + \log\|P-Q\|_1\Big) = -\|P-Q\|_1\log\|P-Q\|_1 + \|P-Q\|_1 H(R) \le -\|P-Q\|_1\big(\log\|P-Q\|_1 - \log|\mathcal{X}|\big),
\]
where we have introduced the probability distribution
\[
R(x) = \frac{|P(x)-Q(x)|}{\|P-Q\|_1}
\]
on the set $\mathcal{X}$.

15. Let $A$ and $B$ be random variables with distributions $P$ and $Q$, and let $f:\mathcal{X}\to I$ be defined as $f(x) = i$ where $x\in\mathcal{X}_i$. Then $A\to f(A)$ and $B\to f(B)$ can be viewed as two Markov chains with identical transition probabilities, and hence $D(P_A\|P_B)\ge D(P_{f(A)}\|P_{f(B)})$ by Exercise 11. Now observe that $P_A = P$, $P_B = Q$, $P_{f(A)} = P_{\mathcal{X}}$ and $P_{f(B)} = Q_{\mathcal{X}}$.

16. Let $\mathcal{X}_1 = \{x\in\mathcal{X} \mid P(x)\le Q(x)\}$ and $\mathcal{X}_2 = \{x\in\mathcal{X} \mid P(x)>Q(x)\}$. $\{\mathcal{X}_1,\mathcal{X}_2\}$ is clearly a partition of $\mathcal{X}$, and therefore $D(P\|Q)\ge D(P_{\mathcal{X}}\|Q_{\mathcal{X}})$ by Exercise 15. Also we have
\[
\|P-Q\|_1 = \sum_{x\in\mathcal{X}}|P(x)-Q(x)| = \sum_{x\in\mathcal{X}_1}(Q(x)-P(x)) + \sum_{x\in\mathcal{X}_2}(P(x)-Q(x)) = (Q_{\mathcal{X}}(1)-P_{\mathcal{X}}(1)) + (P_{\mathcal{X}}(2)-Q_{\mathcal{X}}(2)) = \|P_{\mathcal{X}}-Q_{\mathcal{X}}\|_1,
\]
implying that it suffices to prove $D(P_{\mathcal{X}}\|Q_{\mathcal{X}})\ge\frac{1}{2\ln2}\|P_{\mathcal{X}}-Q_{\mathcal{X}}\|_1^2$. Now let $P_{\mathcal{X}} = (p,1-p)$ and $Q_{\mathcal{X}} = (q,1-q)$, and consider the function
\[
f_{c,p}(q) = p\ln\frac pq + (1-p)\ln\frac{1-p}{1-q} - 4c(p-q)^2
\]
with parameters $p$ and $c$. Clearly $f_{c,p}(p) = 0$, and for $0<q<1$
\[
f'_{c,p}(q) = -\frac pq + \frac{1-p}{1-q} + 8c(p-q) = (q-p)\Big(\frac{1}{q(1-q)} - 8c\Big).
\]
For $c\le\frac12$ the second factor is nonnegative, as $q(1-q)\le\frac14$. This ensures that $f_{c,p}(q)$ attains its minimum at $q=p$. Therefore under this condition we have
\[
0 \le f_{c,p}(q) = (\ln2)D(P_{\mathcal{X}}\|Q_{\mathcal{X}}) - c\big(|p-q| + |(1-p)-(1-q)|\big)^2 = (\ln2)D(P_{\mathcal{X}}\|Q_{\mathcal{X}}) - c\|P_{\mathcal{X}}-Q_{\mathcal{X}}\|_1^2.
\]
By setting $c = \frac12$ we get Pinsker's inequality.

17. a) It is enough to show that intervals of the form $[0,a)$ where $0<a<1$ can be expressed in terms of the $I_{(x_1,\dots,x_n)}$ with $\sigma$-algebra operations. We claim that $[0,a)$ is the union of all the intervals $I_{(x_1,\dots,x_n)}$ which are contained in $[0,a)$. For this we need only show that for $0\le a'<a$ there exists an interval $I = I_{(x_1,\dots,x_n)}$ such that $a'\in I\subseteq[0,a)$. Observe first that for a given $n$ there is precisely one sequence $(x_1,\dots,x_n)\in\mathcal{X}^n$ such that $a'\in I_{(x_1,\dots,x_n)}$, since these intervals were defined by iteratively partitioning subsets of $[0,1)$. Moreover, if $(x_1,\dots,x_n)$ and $(x'_1,\dots,x'_{n'})$ are sequences such that $a'\in I_{(x_1,\dots,x_n)}$ and $a'\in I_{(x'_1,\dots,x'_{n'})}$, then one is a prefix of the other. Thus $a'$ determines a unique infinite sequence $(x_1,x_2,\dots)\in\mathcal{X}^{\mathbb{N}}$ such that $a'$ is contained in the interval corresponding to any of its (finite) prefixes. Let us denote the diameter of a subset of $\mathbb{R}$ by $d(\cdot)$, i.e. $d(A) = \sup\{|x-y| : x,y\in A\}$; in particular, for $a<b$ we have $d([a,b]) = d([a,b)) = d((a,b]) = d((a,b)) = b-a$. By construction and using the assumption,
\[
d(I_{(x_1,\dots,x_n)}) = P_{X_1}(x_1)P_{X_2|X_1}(x_2|x_1)\cdots P_{X_n|X_1\dots X_{n-1}}(x_n|x_1,\dots,x_{n-1}) = P_{X_1\dots X_n}(x_1,\dots,x_n) \to 0
\]
as $n\to\infty$, therefore $d(I_{(x_1,\dots,x_n)}) < a-a'$ for some $n$. It follows that $a'\in I_{(x_1,\dots,x_n)}\subseteq(a'-(a-a'),\,a'+(a-a'))\cap[0,1)\subseteq[0,a)$.

b) The family of subsets in question is a nested sequence of compact intervals, hence it has the finite intersection property, and by compactness the intersection of all sets in the family is nonempty. If $a$ and $b$ are in the intersection, then $|a-b|$ cannot be greater than the diameter of any interval in the family. As $\lim_{n\to\infty} d(I_{(x_1,\dots,x_n)}) = 0$, we have $|a-b| = 0$, i.e. $a = b$.

c) It suffices to verify that the preimages of a set of generators of the Borel $\sigma$-algebra are in $\mathcal{F}$:
\[
f^{-1}(I_{(\hat x_1,\dots,\hat x_n)}) = \{(x_1,x_2,\dots)\in\mathcal{X}^{\mathbb{N}} \mid x_1 = \hat x_1,\dots,x_n = \hat x_n\}\in\mathcal{F}.
\]

d)
\[
\mathrm{Prob}\big(\{(x_1,x_2,\dots)\in\mathcal{X}^{\mathbb{N}} \mid x_1 = \hat x_1,\dots,x_n = \hat x_n\}\big) = P_{X_1\dots X_n}(\hat x_1,\dots,\hat x_n) = d(I_{(\hat x_1,\dots,\hat x_n)}),
\]
which is equal to the measure of $I_{(\hat x_1,\dots,\hat x_n)}$ with respect to the uniform measure on $[0,1]$.

18. For $P_X(0) = p = 1 - P_X(1)$ the output distribution is $P_Y(0) = p + \frac{1-p}{2} = \frac{1+p}{2}$ and $P_Y(1) = \frac{1-p}{2}$. The mutual information is therefore
\[
I(X:Y) = H(Y) - H(Y|X) = h\Big(\frac{1+p}{2}\Big) - p\cdot 0 - (1-p)\cdot 1,
\]
where $h(x) = -x\log x - (1-x)\log(1-x)$. At the stationary point the derivative is 0:
\[
0 = \frac{d}{dp}I(X:Y) = \frac12 h'\Big(\frac{1+p}{2}\Big) + 1 = \frac12\log\frac{1-p}{1+p} + 1,
\]
as $h'(x) = \log\frac{1-x}{x}$, so $2^2 = \frac{1+p}{1-p}$ and hence $p = \frac35$ gives the optimal input distribution. The capacity is $C = h(\frac45) - \frac25 = 0.321928\ldots$
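As a cross-check of the optimum found in Exercise 18, the short Python sketch below (not part of the original solution) maximizes $I(X:Y) = h(\frac{1+p}{2}) - (1-p)$ over a grid of input distributions and compares the result with the analytic values $p = \frac35$ and $C = h(\frac45) - \frac25$.

\begin{verbatim}
# Grid search for the capacity of the channel in Exercise 18.
from math import log2

def h(x):
    """Binary entropy in bits."""
    return 0.0 if x in (0.0, 1.0) else -x * log2(x) - (1 - x) * log2(1 - x)

def mutual_information(p):
    # I(X:Y) = H(Y) - H(Y|X) with P_X(0) = p, as derived in the solution
    return h((1 + p) / 2) - (1 - p)

best_p = max((i / 10000 for i in range(10001)), key=mutual_information)
print(best_p, mutual_information(best_p))   # approximately 0.6 and 0.3219
print(3 / 5, h(4 / 5) - 2 / 5)              # analytic optimum and capacity
\end{verbatim}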
19. The mutual information between the input and the output can be expressed as $I(X:Y) = H(Y) - H(Y|X)$. Now
\[
H(Y|X) = -\sum_{x\in\mathcal{X}} P_X(x)\sum_{y\in\mathcal{X}} P_{Y|X}(y|x)\log P_{Y|X}(y|x) = -\sum_{x\in\mathcal{X}} P_X(x)\sum_{n\in\mathcal{X}} P_N(n)\log P_N(n) = H(N)
\]
implies that $I(X:Y) = H(Y) - H(N) \le \log k - H(N)$, and for the uniform input distribution the distribution of $Y$ is
\[
P_Y(y) = \sum_{x\in\mathcal{X}} P_X(x)P_{Y|X}(y|x) = \sum_{x\in\mathcal{X}} P_X(x)P_N(y-x) = \frac1k\sum_{x\in\mathcal{X}} P_N(y-x) = \frac1k,
\]
with $H(Y) = \log k$. Therefore the capacity is $C = \log k - H(N)$. It follows that the capacity of the identity channel is $\log k$, for the binary symmetric channel with bit flip probability $p$ it is $1 - H(p)$, and for the noisy typewriter (where the noise is uniform on $m$ values) it is $C = \log\frac km$.

20. For the input random variable $X$ the mutual information between the input and the output is
\[
I(X:Y) = \sum_{x,y} P_{XY}(x,y)\log\frac{P_{XY}(x,y)}{P_X(x)P_Y(y)} = \sum_{x,y} P_X(x)P_{Y|X}(y|x)\log\frac{P_{Y|X}(y|x)}{P_Y(y)} = \sum_{x} P_X(x)\Big(p\log\frac{p}{pP_X(x)} + (1-p)\log\frac{1-p}{\sum_{x'}P_X(x')(1-p)}\Big) = -p\sum_{x} P_X(x)\log P_X(x) = pH(X),
\]
where the two terms correspond to the symbol being transmitted intact (probability $p$) and being erased (probability $1-p$). Its maximum is $p\log|\mathcal{X}|$, and the optimal input distribution is uniform.
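The identity $I(X:Y) = pH(X)$ of Exercise 20 is easy to confirm numerically. The sketch below, not part of the original solutions, computes the mutual information of an erasure channel directly from the joint distribution for a few random input distributions; the alphabet size is an arbitrary choice.

\begin{verbatim}
# Numerical check of I(X:Y) = p * H(X) for the channel of Exercise 20.
import random
from math import log2

def entropy(px):
    return -sum(q * log2(q) for q in px if q > 0)

def mutual_information(px, p):
    """I(X:Y) when the input passes unchanged w.p. p and is erased w.p. 1-p."""
    total = 0.0
    for q in px:
        if q > 0:
            # joint P(x, y=x) = q*p, marginals P_X(x) = q and P_Y(y=x) = q*p,
            # so the log-ratio is -log2(q); the erasure output contributes 0
            total += q * p * (-log2(q))
    return total

random.seed(1)
for _ in range(5):
    w = [random.random() for _ in range(6)]
    px = [v / sum(w) for v in w]
    p = random.random()
    print(round(mutual_information(px, p), 6), round(p * entropy(px), 6))
\end{verbatim}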
21. Being an additive noise channel, its Shannon capacity is given by $C = \log|\mathcal{X}| - H(N) = \log5 - \log2 = \log\frac52 = 1.321928\ldots$

a) Let $x = (x_1,\dots,x_k)$, $y = (y_1,\dots,y_k)\in V(G)^k\simeq V(G^k)$ be two possible messages. These two vertices are joined in $G^k$ iff $(x_1,\dots,x_{k-1})$ and $(y_1,\dots,y_{k-1})$ are adjacent in $G^{k-1}$ and also $x_k$ and $y_k$ are adjacent in $G$. By induction it follows that $x$ and $y$ are adjacent iff the corresponding messages can produce the same output with nonzero probability when their letters are sent through the channel by $k$ uses. A set of such messages can be transmitted with zero probability of error precisely when no two of them are joined by an edge. The largest possible size of such a set is $\alpha_0(G^k)$.

b) By $a_k = \frac1k\log\alpha_0(G^k) = \log\sqrt[k]{\alpha_0(G^k)} \le \log\sqrt[k]{|G^k|} = \log|G|$ the sequence is bounded from above, and by $a_k\ge\frac1k\log1 = 0$ it is also nonnegative. In particular, $0\le\liminf_{k\to\infty} a_k\le\limsup_{k\to\infty} a_k\le\log|G|$, and we only need to show that in the middle we have equality. For simplicity, in what follows "independent" will always mean "independent disregarding the loops". If $S_1\subseteq V(G^{k_1})$ and $S_2\subseteq V(G^{k_2})$ are two independent sets then $S_1\times S_2\subseteq V(G^{k_1})\times V(G^{k_2})\simeq V(G^{k_1+k_2})$ is also independent, and therefore
\[
a_{k_1+k_2} = \frac{1}{k_1+k_2}\log\alpha_0(G^{k_1+k_2}) \ge \frac{1}{k_1+k_2}\log\big(\alpha_0(G^{k_1})\alpha_0(G^{k_2})\big) = \frac{1}{k_1+k_2}\log 2^{k_1a_{k_1}+k_2a_{k_2}} = \frac{k_1a_{k_1}+k_2a_{k_2}}{k_1+k_2}.
\]
In particular, it follows by induction that $a_{mk}\ge a_k$, and for $n = mk+l$ with $m\in\mathbb{N}$ and $0\le l\le k-1$ the inequality
\[
a_{mk+l} \ge \frac{mk\,a_k + l\,a_1}{mk+l} \ge \frac{mk\,a_k}{mk+k} = \frac{m}{m+1}a_k
\]
holds, using $a_1\ge0$ and $l\le k$. As $m = \lfloor\frac nk\rfloor$, for all $\varepsilon>0$ there exists $N_\varepsilon\in\mathbb{N}$ such that $n\ge N_\varepsilon \Rightarrow a_n\ge a_k-\varepsilon$, i.e. $a_k\le\liminf_{n\to\infty}a_n$. From this we have $\limsup_{k\to\infty}a_k \le \sup_{k\in\mathbb{N}}a_k \le \liminf_{k\to\infty}a_k \le \limsup_{k\to\infty}a_k$, therefore we have equality everywhere, and the limit exists.

c) The vertices $S = \{(0,0),(1,2),(2,4),(3,1),(4,3)\}$ are pairwise non-adjacent in $G^2$, hence $a_2\ge\frac12\log|S| = \log\sqrt5$.

d) By rotational symmetry we only need to observe that $\langle u_0,u_0\rangle\ne0$ as $u_0\ne0$, and to verify that $\langle u_0,u_1\rangle = 0$ and $\langle u_0,u_2\rangle\ne0$. We have
\[
\langle u_0,u_i\rangle = (5^{-1/4})^2 + (1-5^{-1/2})\cos\Big(\frac{4\pi}{5}i\Big).
\]
We need to find the values of $\cos\frac{4\pi}{5}$ and $\cos\frac{8\pi}{5} = \cos\frac{2\pi}{5}$. These are solutions of the equation $\cos5\alpha = 1$, which is a polynomial equation in $\cos\alpha$. Using $\cos(\alpha+\beta) = \cos\alpha\cos\beta - \sin\alpha\sin\beta$, $\cos2\alpha = \cos^2\alpha - \sin^2\alpha = 2\cos^2\alpha - 1$ and $\sin2\alpha = 2\sin\alpha\cos\alpha$ we have that
\[
0 = \cos5\alpha - 1 = \cos(\alpha+4\alpha) - 1 = \cos\alpha\cos4\alpha - \sin\alpha\sin4\alpha - 1 = \cos\alpha(2\cos^22\alpha - 1) - 2\sin\alpha\sin2\alpha\cos2\alpha - 1
\]
\[
= \cos\alpha\big(2(2\cos^2\alpha-1)^2 - 1\big) - 4\sin^2\alpha\cos\alpha(2\cos^2\alpha-1) - 1 = 16\cos^5\alpha - 20\cos^3\alpha + 5\cos\alpha - 1 = 16y^5 - 20y^3 + 5y - 1 = T_5(y) - 1,
\]
where we have introduced $y = \cos\alpha$. Of course one of the solutions is $\alpha = 0$, which corresponds to $y = 1$, therefore we can factor the polynomial as follows:
\[
16y^5 - 20y^3 + 5y - 1 = (y-1)(16y^4 + 16y^3 - 4y^2 - 4y + 1).
\]
As it turns out, the second factor is the square of a polynomial. In general, $T_n$ denotes the unique polynomial for which $\cos n\alpha = T_n(\cos\alpha)$. Taking the derivative we get $-n\sin n\alpha = -T_n'(\cos\alpha)\sin\alpha$, and hence $T_n'(\cos\alpha) = 0$ iff $\sin n\alpha = 0$ and $\sin\alpha\ne0$, which is equivalent to $|\cos n\alpha| = 1$ and $\alpha\notin\pi\mathbb{Z}$. In particular, all zeros of $T_n-1$ (or $T_n+1$) with absolute value less than 1 are multiple. But there are $\lfloor\frac{n-1}{2}\rfloor$ distinct roots in $(-1,1)$, while 1 is always a root and $-1$ is a root for even $n$, and hence all the roots in $(-1,1)$ must have multiplicity 2. In our case $T_5(y) - 1 = (y-1)(4y^2+2y-1)^2$, and hence the roots are
\[
\cos\frac{4\pi}{5} = \frac{-1-\sqrt5}{4} \qquad\text{and}\qquad \cos\frac{2\pi}{5} = \frac{-1+\sqrt5}{4}.
\]
Therefore
\[
\langle u_0,u_1\rangle = \frac{1}{\sqrt5} + \Big(1-\frac{1}{\sqrt5}\Big)\frac{-1-\sqrt5}{4} = \frac{1}{\sqrt5} - \frac{(\sqrt5-1)(\sqrt5+1)}{4\sqrt5} = \frac{1}{\sqrt5} - \frac{1}{\sqrt5} = 0
\]
and
\[
\langle u_0,u_2\rangle = \frac{1}{\sqrt5} + \Big(1-\frac{1}{\sqrt5}\Big)\frac{-1+\sqrt5}{4} = \frac{1}{\sqrt5} + \frac{(\sqrt5-1)^2}{4\sqrt5} = \frac{\sqrt5-1}{2} \ne 0,
\]
as claimed.

e) $\langle u_{i_1}\otimes\cdots\otimes u_{i_k}, u_{j_1}\otimes\cdots\otimes u_{j_k}\rangle = \langle u_{i_1},u_{j_1}\rangle\cdots\langle u_{i_k},u_{j_k}\rangle$ is equal to 0 iff at least one factor is 0. This happens precisely when at least one of the pairs $\{i_l,j_l\}$ ($1\le l\le k$) is not an edge of $G$, which is in turn equivalent to $\{(i_1,\dots,i_k),(j_1,\dots,j_k)\}$ not being an edge of $G^k$. An independent set $S$ is therefore mapped into a set of pairwise orthogonal vectors. Also $\|u_i\| = 1$, and therefore this set is a subset of an orthonormal basis $B$. Now for any vector $v$ we have
\[
\sum_{b\in B}\langle v,b\rangle^2 = \|v\|^2,
\]
and removing nonnegative terms from the left hand side does not increase the sum. Now
\[
1 = \|e\otimes\cdots\otimes e\|^2 \ge \sum_{(i_1,\dots,i_k)\in S}\langle e\otimes\cdots\otimes e, u_{i_1}\otimes\cdots\otimes u_{i_k}\rangle^2 = \sum_{(i_1,\dots,i_k)\in S}\prod_{j=1}^k\langle e,u_{i_j}\rangle^2 = |S|\,5^{-k/2},
\]
and hence $a_k\le\frac1k\log5^{k/2} = \log\sqrt5$. Together with $\Theta(G)\ge\log\sqrt5 = 1.160964\ldots$ from part c) we have that here equality holds.
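The combinatorial facts used in Exercise 21 can be verified by brute force for small $k$. The following Python sketch, not part of the original solutions, checks that the set $S$ listed in part c) is independent in $G^2$ (disregarding loops) and that no 6-element independent set exists, so $\alpha_0(G^2) = 5$ for the pentagon.

\begin{verbatim}
# Brute-force verification that alpha_0(C5^2) = 5 for the pentagon graph.
from itertools import combinations, product

def confusable(x, y):
    """Two pentagon letters can produce the same output iff equal or adjacent."""
    return x == y or (x - y) % 5 in (1, 4)

def confusable_words(xs, ys):
    """Words are confusable iff they are confusable in every coordinate."""
    return all(confusable(a, b) for a, b in zip(xs, ys))

def independent(words):
    return all(not confusable_words(x, y) for x, y in combinations(words, 2))

S = [(0, 0), (1, 2), (2, 4), (3, 1), (4, 3)]       # the set from part c)
assert independent(S)

# no 6-element independent set exists among the 25 two-letter words
all_words = list(product(range(5), repeat=2))
assert not any(independent(T) for T in combinations(all_words, 6))
print("alpha_0(C5^2) = 5, i.e. a_2 = log sqrt(5), matching Theta(C5)")
\end{verbatim}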
22. Let $P_X$ be the distribution of the input random variable, $P_Y$ and $P_Z$ the corresponding intermediate and output distributions, with $P_{Y|X}$ and $P_{Z|Y}$ the appropriate conditional probabilities; let $C_1$ and $C_2$ denote the capacities of the two channels, respectively, and $C$ that of the composite channel. Using the data processing inequality we have
\[
C = \max_{P_X} I(X:Z) \le \max_{P_X} I(X:Y) = C_1
\]
and
\[
C = \max_{P_X} I(X:Z) \le \max_{P_X} I(Y:Z) \le \max_{P_Y} I(Y:Z) = C_2.
\]
Both $C_1$ and $C_2$ are bounded by $\log|\mathcal{Y}|$, and therefore so is $C$. By induction it follows that the composition of $k$ channels with capacities $C_1,C_2,\dots,C_k$ has capacity at most $\min\{C_i \mid 1\le i\le k\}$, and if the intermediate alphabets are $\mathcal{Y}_1,\dots,\mathcal{Y}_{k-1}$, then the capacity of the composition is at most $\min\{\log|\mathcal{Y}_j| \mid 1\le j\le k-1\}$.

23. a) If $P_{X_1\dots X_n}$ is uniform on $\mathcal{C}\subseteq\mathcal{X}^n$, then $P_{X_1\dots X_n|Y_1\dots Y_n}(x|y) = 0$ if $x\notin\mathcal{C}$, and
\[
P_{X_1\dots X_n|Y_1\dots Y_n}(x|y) = \frac{P_{Y_1\dots Y_n|X_1\dots X_n}(y|x)}{\sum_{x'\in\mathcal{C}} P_{Y_1\dots Y_n|X_1\dots X_n}(y|x')}
\]
otherwise. This attains its maximum iff $x\in\mathcal{C}$ and the likelihood $P_{Y_1\dots Y_n|X_1\dots X_n}(y|x)$ is maximal. Also, either both or neither of these two maxima are unique, therefore the two decodings give the same result.

b) The Hamming distance
\[
d(x_1\dots x_n, x'_1\dots x'_n) = |\{i\in\{1,\dots,n\} \mid x_i\ne x'_i\}| = |\{i\in\{1,\dots,n\} \mid x'_i\ne x_i\}| = d(x'_1\dots x'_n, x_1\dots x_n)
\]
is symmetric, and it is 0 iff $\forall i: x_i = x'_i$, i.e. when $x_1\dots x_n = x'_1\dots x'_n$. As $x_i\ne x''_i \Rightarrow x_i\ne x'_i \vee x'_i\ne x''_i$, the triangle inequality
\[
d(x_1\dots x_n, x'_1\dots x'_n) + d(x'_1\dots x'_n, x''_1\dots x''_n) = |\{i \mid x_i\ne x'_i\}| + |\{i \mid x'_i\ne x''_i\}| \ge |\{i \mid x_i\ne x'_i\}\cup\{i \mid x'_i\ne x''_i\}| \ge |\{i \mid x_i\ne x''_i\}| = d(x_1\dots x_n, x''_1\dots x''_n)
\]
also holds. The likelihood
\[
P_{Y_1\dots Y_n|X_1\dots X_n}(y|x) = \prod_{i=1}^n P_{Y|X}(y_i|x_i) = (1-p)^{|\{i \mid x_i = y_i\}|}\Big(\frac{p}{|\mathcal{X}|-1}\Big)^{|\{i \mid x_i\ne y_i\}|} = (1-p)^n\Big(\frac{p}{(|\mathcal{X}|-1)(1-p)}\Big)^{d(x,y)}
\]
decreases monotonically with increasing $d(x,y)$ as long as $p < (1-p)(|\mathcal{X}|-1)$, i.e. $p < 1-|\mathcal{X}|^{-1}$. Then maximizing the likelihood amounts to minimizing the Hamming distance. If $p = 1-|\mathcal{X}|^{-1}$, then the likelihood becomes independent of $x$, and the decoder always reports failure (the special symbol $*$). If $p > 1-|\mathcal{X}|^{-1}$ and $|\mathcal{X}|>2$ then the maximum is again nonunique for every output $y$, and the decoding results in $*$, while for $|\mathcal{X}| = 2$ we get the unique word which maximizes the distance.
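To illustrate part b) of Exercise 23, the following simulation sketch (not part of the original solutions) compares maximum-likelihood and minimum-distance decoding for a symmetric channel with error probability $p < 1 - 1/|\mathcal{X}|$; the code, alphabet and parameters are arbitrary illustrative choices.

\begin{verbatim}
# Check that ML decoding and minimum-distance decoding pick equally likely
# (equally distant) words for a symmetric channel with small error probability.
import random

ALPHABET, p, n = (0, 1, 2), 0.2, 6               # p < 1 - 1/|X| = 2/3
CODE = [(0, 0, 0, 0, 0, 0), (1, 1, 1, 2, 2, 2), (2, 0, 1, 2, 0, 1)]

def likelihood(x, y):
    """P(y|x): each letter is correct w.p. 1-p, otherwise uniformly wrong."""
    prob = 1.0
    for xi, yi in zip(x, y):
        prob *= (1 - p) if xi == yi else p / (len(ALPHABET) - 1)
    return prob

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

random.seed(0)
for _ in range(2000):
    y = tuple(random.choice(ALPHABET) for _ in range(n))
    ml = max(CODE, key=lambda c: likelihood(c, y))
    md = min(CODE, key=lambda c: hamming(c, y))
    assert hamming(ml, y) == hamming(md, y)      # the two rules agree up to ties
print("ML and minimum-distance decoding agreed on all sampled outputs")
\end{verbatim}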
24. By the AEP there is a typical set $A_\varepsilon^n\subseteq\mathcal{X}^n$ for each $\varepsilon$ and large enough $n$ with $|A_\varepsilon^n|\le 2^{n(H+\varepsilon)}$ and $\mathrm{Prob}(X_1\dots X_n\in A_\varepsilon^n)\ge 1-\varepsilon$. Its elements can be indexed by bit strings of length $n(H+\varepsilon)$, which can be transmitted by $n$ uses of the channel with error probability at most $\varepsilon$ if $H+\varepsilon = R < C$. Then
\[
\mathrm{Prob}(X^n\ne X'^n) \le \mathrm{Prob}(X^n\notin A_\varepsilon^n) + \mathrm{Prob}(g_n(Z)\ne X^n \mid X^n\in A_\varepsilon^n) \le 2\varepsilon,
\]
therefore we can reconstruct the sequence with low error probability if $n$ is sufficiently large.

For the converse part we have to show that $\mathrm{Prob}(X^n\ne X'^n)\to0$ implies $H\le C$ for any source-channel code $(f_n,g_n)_{n\in\mathbb{N}}$. By Fano's inequality we have
\[
H(X^n|X'^n) \le 1 + \mathrm{Prob}(X^n\ne X'^n)\log|\mathcal{X}^n| = 1 + n\,\mathrm{Prob}(X^n\ne X'^n)\log|\mathcal{X}|.
\]
For such a code therefore
\[
H \le \frac{H(X_1,\dots,X_n)}{n} = \frac{H(X^n)}{n} = \frac1n H(X^n|X'^n) + \frac1n I(X^n:X'^n) \le \frac1n\big(1 + n\,\mathrm{Prob}(X^n\ne X'^n)\log|\mathcal{X}|\big) + \frac1n I(Y^n:Z^n) \le \frac1n + \mathrm{Prob}(X^n\ne X'^n)\log|\mathcal{X}| + C,
\]
where $Y^n$ and $Z^n$ denote the channel input and output and we used the data processing inequality $I(X^n:X'^n)\le I(Y^n:Z^n)\le nC$. Now we let $n\to\infty$ and get $H\le C$.

25. Using the chain rule for entropy we have
\[
I(X_1,\dots,X_n ; Y_1,\dots,Y_n) = H(X_1,\dots,X_n) - H(X_1,\dots,X_n|Y_1,\dots,Y_n) = H(X_1,\dots,X_n) - H(Z_1,\dots,Z_n|Y_1,\dots,Y_n)
\]
(given the output, the remaining uncertainty about the input equals the remaining uncertainty about the noise)
\[
\ge H(X_1,\dots,X_n) - H(Z_1,\dots,Z_n) = H(X_1,\dots,X_n) - \big(H(Z_1) + H(Z_2|Z_1) + \dots + H(Z_n|Z_1,\dots,Z_{n-1})\big) \ge H(X_1,\dots,X_n) - \big(H(Z_1) + H(Z_2) + \dots + H(Z_n)\big) = H(X_1,\dots,X_n) - nH(p).
\]
The supremum of the right hand side is $n - nH(p) = nC$ (attained for the uniform input distribution), and this gives a lower bound on the supremum of the mutual information, i.e. on the capacity.

26. An $(M,n)$-code $f:\{1,\dots,M\}\to\mathcal{X}^n$, $g:\mathcal{Y}^n\to\{1,\dots,M\}$ can be viewed as an $(M,n)$-code with shared randomness by composing with the projections $\{1,\dots,M\}\times\mathcal{Z}\to\{1,\dots,M\}$ and $\mathcal{Y}^n\times\mathcal{Z}\to\mathcal{Y}^n$, respectively. Therefore $C_{SR}\ge C$.

For a given value of $n$ and shared random variable $Z$, let $W$ be uniformly distributed over $\{1,\dots,M = 2^{nR}\}$, $X = f(W,Z)$ the input random variable (with values in $\mathcal{X}^n$), $Y$ the corresponding output, and $W' = g(Y,Z)$ the decoded message. For the Markov chain $W\to(X,Z)\to(Y,Z)\to W'$ we have that
\[
I(W;W') \le I(W;Y,Z) = \underbrace{I(W;Z)}_{0} + I(W;Y|Z) \le I(X;Y|Z) \le nC,
\]
where first we have used a conditional version of the data processing inequality, $I(W;Y|Z)\le I(X;Y|Z)$, and then that
\[
I(X;Y|Z) = \sum_{z\in\mathcal{Z}} P_Z(z)\,I(X;Y|Z=z)
\]
and for each $z$ the conditional distributions $P_{X|Z=z}$ and $P_{Y|Z=z}$ are related by the same channel, as the channel is independent of $Z$. Therefore, by Fano's inequality,
\[
nR = H(W) = H(W|W') + I(W;W') \le 1 + P_e\,nR + I(W;W') \le 1 + P_e\,nR + nC.
\]
Dividing by $n$ we get
\[
R \le P_e R + \frac1n + C.
\]
Now let $n\to\infty$ and assume that $P_e\to0$ to get $R\le C$. Hence $C_{SR}\le C$, and together with the first paragraph $C_{SR} = C$.
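The converse arguments of Exercises 24 and 26 both rest on the Fano-type bound $H(W|W')\le 1 + P_e\log M$. The following Python sketch, not part of the original solutions, checks this bound on random joint distributions; the value of $M$ is an arbitrary choice.

\begin{verbatim}
# Numerical check of the bound H(W|W') <= 1 + P_e * log M used above.
import random
from math import log2

M = 8

def rand_joint(m):
    w = [[random.random() for _ in range(m)] for _ in range(m)]
    s = sum(map(sum, w))
    return [[x / s for x in row] for row in w]

for _ in range(2000):
    joint = rand_joint(M)
    pe = sum(joint[i][j] for i in range(M) for j in range(M) if i != j)
    h_joint = -sum(p * log2(p) for row in joint for p in row if p > 0)
    marg = [sum(joint[i][j] for i in range(M)) for j in range(M)]   # P_{W'}
    h_marg = -sum(p * log2(p) for p in marg if p > 0)
    cond = h_joint - h_marg                                         # H(W|W')
    assert cond <= 1 + pe * log2(M) + 1e-9
print("Fano bound H(W|W') <= 1 + P_e log M held in all random trials")
\end{verbatim}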