Lecture 7:

Prop: If X is iid, then h(X) = H(X_1).

Proof: In this case, H(X_1, X_2, ..., X_n) = Σ_{i=1}^n H(X_i) = nH(X_1).

Prop: If X is stationary Markov of order k, then h(X) = H(X_{k+1} | X_k, ..., X_1). In particular, if k = 1, then

h(X) = H(X_2|X_1) = −Σ_{ij} p_i P_{ij} log P_{ij},

where P is the probability transition matrix and p is the stationary vector.

Proof: For n ≥ k + 1, p(x_n | x_{n−1} ... x_1) = p(x_n | x_{n−1} ... x_{n−k}), and so

H(X_n | X_{n−1} ... X_1) = H(X_n | X_{n−1} ... X_{n−k}) = H(X_{k+1} | X_k, ..., X_1)

(or use Property 10). For k = 1,

h(X) = H(X_2|X_1) = −Σ_{ij} p(X_1 = i, X_2 = j) log p(X_2 = j | X_1 = i) = −Σ_{ij} p_i P_{ij} log P_{ij}.

Note: when P is irreducible, i.e., for all i, j there exists n such that (P^n)_{ij} > 0, the stationary vector is unique.

Definition: Let X be a stationary Markov chain and f a function on the range of X_1. Let Y be the stationary process defined by Y_i = f(X_i). Then Y is called a function of a Markov chain (FMC).

Example: an FMC which is not Markov of any order. Let X be the Markov chain on states {1, 2, 3} with probability transition matrix

P = [ 0    2/3  1/3 ]
    [ 1/3  2/3  0   ]
    [ 2/3  0    1/3 ]

and stationary vector [2/7, 4/7, 1/7]. Let f(1) = a and f(2) = f(3) = b. So, 1222133121213312221 maps to abbbabbabababbabbba.

p(y_0 = b | y_{−1} = b, y_{−2} = b, ..., y_{−k} = b, y_{−k−1} = b) is a weighted average of 2/3 and 1/3 with relative weights (4/7)(2/3)^k and (1/7)(1/3)^k, while

p(y_0 = b | y_{−1} = b, y_{−2} = b, ..., y_{−k} = b, y_{−k−1} = a) is a weighted average of 2/3 and 1/3 with relative weights (2/7)(2/3)^k and (2/7)(1/3)^k.

These two conditional probabilities differ for every k, so Y is not Markov of order k for any k.

Note: There is no known closed-form expression for the entropy of an FMC.

Theorem (Birch bounds for FMC): Let X be first-order Markov and Y a function of X.

1. H(Y_n | Y_{n−1} ... Y_1, X_1) ≤ h(Y) ≤ H(Y_n | Y_{n−1} ... Y_1).
2. The bounds are tight: both converge to h(Y) as n → ∞.
3. The lower bounds monotonically increase and the upper bounds monotonically decrease.

Proof:

1. We have already established the upper bound. For the lower bound: fix arbitrary m. By stationarity, Prop. 11 and Prop. 8,

H(Y_n | Y_{n−1} ... Y_1, X_1) = H(Y_{n+m} | Y_{n+m−1} ... Y_{m+1}, X_{m+1})
= H(Y_{n+m} | Y_{n+m−1} ... Y_{m+1}, X_{m+1}, X_m ... X_1)
≤ H(Y_{n+m} | Y_{n+m−1} ... Y_{m+1}, Y_m ... Y_1).

Let m → ∞; the right-hand side is H(Y_{n+m} | Y_{n+m−1} ... Y_1), which converges to h(Y). (For the use of Prop. 11, observe that (Y_{n+m}, Y_{n+m−1}, ..., Y_{m+1}) ⊥_{X_{m+1}} (X_m, ..., X_1), by first-order Markovity.)

Lecture 8:

(Note: the course webpage is linked off my webpage.)

2. H(Y_n | Y_{n−1} ... Y_1) − H(Y_n | Y_{n−1} ... Y_1, X_1)
= (H(Y_1 ... Y_n) − H(Y_1 ... Y_{n−1})) − (H(X_1, Y_1 ... Y_n) − H(X_1, Y_1 ... Y_{n−1}))
= H(X_1 | Y_1 ... Y_{n−1}) − H(X_1 | Y_1 ... Y_n).

Let a_n denote this last quantity. Then Σ_n a_n is an infinite series of nonnegative terms whose n-th partial sums are bounded, by telescoping, by 2H(X_1). Thus the series converges, and so a_n → 0. Since the upper bounds converge to h(Y) and the gap a_n → 0, the lower bounds converge to h(Y) as well.

Note: there is another, less slick, proof.

3. We already know this for the upper bound. For the lower bound, by Property 11, stationarity and Property 8:

H(Y_n | Y_{n−1} ... Y_1, X_1) = H(Y_n | Y_{n−1} ... Y_1, X_1, X_0) = H(Y_{n+1} | Y_n ... Y_2, X_2, X_1) ≤ H(Y_{n+1} | Y_n ... Y_2, Y_1, X_1).

Note: there is an analogous result for a function of a k-th order Markov chain. Convergence is exponential in n if all transition probabilities are strictly positive.
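To see the Birch bounds in action, here is a minimal numerical sketch (my own Python illustration, not part of the notes; all function names are mine) for the three-state example above. It uses the chain-rule identities H(Y_n | Y_{n−1} ... Y_1) = H(Y_1 ... Y_n) − H(Y_1 ... Y_{n−1}) and H(Y_n | Y_{n−1} ... Y_1, X_1) = H(X_1, Y_1 ... Y_n) − H(X_1, Y_1 ... Y_{n−1}), computing the block probabilities by brute-force enumeration of output words.

```python
import itertools
import numpy as np

# Three-state chain from the FMC example, with f(1) = a, f(2) = f(3) = b.
P = np.array([[0.0, 2/3, 1/3],
              [1/3, 2/3, 0.0],
              [2/3, 0.0, 1/3]])
pi = np.array([2/7, 4/7, 1/7])     # stationary vector
f = ['a', 'b', 'b']                # f applied to states 1, 2, 3

def prob_y(y):
    """p(Y_1 ... Y_n = y): sum of p(x_1 ... x_n) over state paths x with f(x_i) = y_i."""
    v = np.array([pi[s] if f[s] == y[0] else 0.0 for s in range(3)])
    for sym in y[1:]:
        v = v @ P
        v = np.array([v[s] if f[s] == sym else 0.0 for s in range(3)])
    return v.sum()

def prob_x1_y(x1, y):
    """p(X_1 = x1, Y_1 ... Y_n = y)."""
    if f[x1] != y[0]:
        return 0.0
    v = np.zeros(3)
    v[x1] = pi[x1]
    for sym in y[1:]:
        v = v @ P
        v = np.array([v[s] if f[s] == sym else 0.0 for s in range(3)])
    return v.sum()

def H(probs):
    """Entropy (base 2) of a list of probabilities; zero terms are dropped."""
    p = np.array([q for q in probs if q > 0])
    return float(-(p * np.log2(p)).sum())

def block_entropies(n):
    """Return H(Y_1 ... Y_n) and H(X_1, Y_1 ... Y_n), by brute-force enumeration."""
    ys = list(itertools.product('ab', repeat=n))
    Hy = H([prob_y(y) for y in ys])
    Hxy = H([prob_x1_y(x1, y) for x1 in range(3) for y in ys])
    return Hy, Hxy

for n in range(2, 9):
    Hy_n, Hxy_n = block_entropies(n)
    Hy_m, Hxy_m = block_entropies(n - 1)
    upper = Hy_n - Hy_m      # H(Y_n | Y_{n-1} ... Y_1)
    lower = Hxy_n - Hxy_m    # H(Y_n | Y_{n-1} ... Y_1, X_1)
    print(f"n = {n}:  {lower:.6f} <= h(Y) <= {upper:.6f}")
```

By the theorem, the printed lower bounds increase, the upper bounds decrease, and together they squeeze h(Y), for which no closed form is known.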
Example: a random function of a Markov chain. Let X be a first-order, stationary, binary-valued Markov chain. (Draw the BSC block diagram with input X, output Y and additive noise E.) Then Y is a random function of X. Assume that E is iid(1 − ε, ε) and X ⊥ E: E_n(ε) denotes iid binary noise with p_E(0) = 1 − ε and p_E(1) = ε, and Y_n(ε) = X_n ⊕ E_n(ε), where ⊕ denotes binary addition and X_n denotes the binary input. Let X have probability transition matrix

[ P_00  P_01 ]
[ P_10  P_11 ]

and view E as Markov with probability transition matrix

[ 1−ε  ε ]
[ 1−ε  ε ]

IN FACT, Y is an FMC with underlying Markov chain (X_n, E_n(ε)) (direct product), which is Markov with probability transition matrix (a Kronecker product), rows and columns indexed by (0,0), (0,1), (1,0), (1,1):

[ P_00(1−ε)  P_00 ε  P_01(1−ε)  P_01 ε ]
[ P_00(1−ε)  P_00 ε  P_01(1−ε)  P_01 ε ]
[ P_10(1−ε)  P_10 ε  P_11(1−ε)  P_11 ε ]
[ P_10(1−ε)  P_10 ε  P_11(1−ε)  P_11 ε ]

with f((0,0)) = 0 = f((1,1)) and f((0,1)) = 1 = f((1,0)).

Below, processes will be 2-sided or 1-sided (it doesn't matter). We will define "ergodic" more precisely in a moment. We will later see that any iid or irreducible stationary Markov process is ergodic.

Theorem (AEP (SMB)): Let X be a stationary ergodic process. Then

lim_{n→∞} −(1/n) log p(x_1 ... x_n) = h(X) a.e.

This implies convergence in probability, since the measure space is finite. We will prove this later in the context of ergodic MPT's.

Proof in the iid case: p(x_1 ... x_n) = p(x_1) ··· p(x_n). By the (strong) law of large numbers,

lim_{n→∞} −(1/n) log p(x_1 ... x_n) = −E_p(log p(X_1)) = H(X_1) = h(X) a.e.

Lecture 9:

Re-state the AEP (SMB): convergence a.e., which implies convergence in measure (probability) since the measure space is finite; it also holds in L^1, and in fact in L^p.

Defn: A stationary process X (with values in a finite set 𝒳) is ergodic if whenever A ⊂ 𝒳^Z is measurable with µ(A) > 0, then for a.e. x ∈ 𝒳^Z there exists n > 0 such that σ^n(x) ∈ A (here, µ is the probability measure on 𝒳^Z determined by X, and σ is the left shift).

Ergodicity will turn out to be equivalent to a version of the strong law of large numbers: if f is an integrable function on (𝒳^Z, Borel, µ), then

lim_{n→∞} (f(x) + f∘σ(x) + ... + f∘σ^{n−1}(x))/n = ∫ f dµ a.e.

(a version of the ergodic theorem, to be proved later). In fact, it also holds in L^p.

We will use this for data compression.

Defn: Let X be stationary ergodic. A word x_1 ... x_n is called (n, ε)-typical if

2^{−n(h_2(X)+ε)} ≤ p(x_1 ... x_n) ≤ 2^{−n(h_2(X)−ε)},

equivalently,

h_2(X) − ε ≤ −(1/n) log_2 p(x_1 ... x_n) ≤ h_2(X) + ε.

Let A_ε^n be the set of all (n, ε)-typical words.

Typicality is critical for both data compression and channel coding.

AEP Corollary:
1. For sufficiently large n, µ(A_ε^n) > 1 − ε.
2. |A_ε^n| ≤ 2^{n(h_2(X)+ε)}.
3. For sufficiently large n, |A_ε^n| ≥ (1 − ε) 2^{n(h_2(X)−ε)}.

Proof:
1. Follows from the AEP (we need only convergence in probability).
2. 1 ≥ µ(A_ε^n) ≥ |A_ε^n| 2^{−n(h_2(X)+ε)}.
3. By 1, for large n, |A_ε^n| 2^{−n(h_2(X)−ε)} ≥ µ(A_ε^n) > 1 − ε.

Note: If X is binary valued, then h_2(X) ≤ 1 (Proof: h_2(X) ≤ H_2(X_1 ... X_n)/n ≤ (log_2 2^n)/n = 1), and it will typically be < 1. Thus |A_ε^n|/2^n → 0 exponentially fast, yet A_ε^n has measure at least 1 − ε. (There is a direct combinatorial proof for iid(p, 1 − p), p ≠ 1/2.)

For data compression, we will need:

Defn: K-L distance (relative entropy): Let p and q be probability vectors of the same length. Then

D(p, q) = Σ_i p_i log(p_i/q_i) = E_p log(p(x)/q(x)).

Note: D is not a metric.

Prop: D(p, q) ≥ 0, with equality iff p = q.

Proof: Apply Jensen:

−D(p, q) = Σ_i p_i log(q_i/p_i) ≤ log(Σ_i p_i (q_i/p_i)) = log(Σ_i q_i) = 0,

with equality iff q_i/p_i is constant, and thus each q_i/p_i = 1. Note: technically one must deal with zeros of p.
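As a quick numerical companion to this Prop (my own sketch, not from the notes; the helper D is just the definition above with log base 2), the following checks D(p, q) ≥ 0 and D(p, p) = 0 on random distributions and illustrates the asymmetry that is one reason D is not a metric.

```python
import numpy as np

def D(p, q):
    """K-L distance D(p, q) = sum_i p_i log2(p_i / q_i); terms with p_i = 0 contribute 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

rng = np.random.default_rng(0)
for _ in range(5):
    p = rng.random(4); p /= p.sum()
    q = rng.random(4); q /= q.sum()
    print(f"D(p, q) = {D(p, q):.4f} >= 0,   D(p, p) = {D(p, p):.4f}")

# D is not symmetric, one reason it is not a metric:
p, q = [0.9, 0.1], [0.5, 0.5]
print(D(p, q), D(q, p))
```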
Defn: A prefix-free code (or prefix code) over an alphabet 𝒜 is a finite collection C of words with symbols in 𝒜 such that no element of C is a proper prefix of any other element. We are interested mainly in 𝒜 = {0, 1}.

Prefix code example: {0, 10, 110} (and earlier examples).

Kraft Inequality: Let ℓ_1, ..., ℓ_k be the lengths of the codewords of a prefix code. Then

Σ_{i=1}^k 2^{−ℓ_i} ≤ 1.

Proof: Let L be the length of the longest codeword. Consider a full binary tree with branches 0 and 1 (say 0 left and 1 right) of depth L. The tree has 2^L leaves, i.e., paths of depth L. We can represent each codeword in C as a path in the tree from the root (of depth at most L), so we can view C as a set of paths P in the tree. (Draw the paths for the code above.) By the prefix property, no element of P can be an initial path of any other element. Any path that represents a codeword of depth ℓ_i ≤ L has exactly 2^{L−ℓ_i} extensions of depth L, and these sets of extensions are disjoint. Thus

Σ_{i=1}^k 2^{L−ℓ_i} ≤ 2^L,  and so  Σ_{i=1}^k 2^{−ℓ_i} ≤ 1.

Notes:
1. The same argument works with 2 replaced by any integer ≥ 2.
2. The converse holds.
3. Equality holds iff every binary word of length L has a (unique) prefix in the code.

Lecture 10 (Sept. 29):

Defn: Let 𝒳 be a finite set. A (binary) prefix encoder on 𝒳 is a 1-1 mapping φ : 𝒳 → {0,1}* whose image is a prefix code.

Prop: If φ is a prefix encoder, define its m-th extension φ^m : 𝒳^m → {0,1}* by φ^m(x_1 ... x_m) = φ(x_1) ··· φ(x_m). Then φ^m is 1-1; in fact, φ^m is a prefix encoder.

Proof: If φ^m(x_1 ... x_m) = φ^m(x′_1 ... x′_m), then φ(x_1) and φ(x′_1) are prefixes of the same binary word. Thus φ(x_1) = φ(x′_1), and since φ is 1-1, x_1 = x′_1. Induct.

So, with a prefix code we can invertibly encode words of arbitrary length.

Lemma: Let X ∼ p be a r.v. with values in a finite set 𝒳, and let φ be a prefix encoder on 𝒳. Then

E_p(ℓ(φ(X))) = Σ_x p(x) ℓ(φ(x)) ≥ H_2(X).

Proof: Let L_φ = E_p(ℓ(φ(X))) and write ℓ_i for the length of the i-th codeword. Then

L_φ − H_2(X) = Σ_i p_i ℓ_i + Σ_i p_i log_2 p_i = −Σ_i p_i log_2 2^{−ℓ_i} + Σ_i p_i log_2 p_i.

Let c = Σ_j 2^{−ℓ_j} and r_i = 2^{−ℓ_i}/c. Then the above equals

−Σ_i p_i log_2(c r_i) + Σ_i p_i log_2 p_i = (Σ_i p_i log_2(p_i/r_i)) − log_2 c = D(p, r) − log_2 c,

which is ≥ 0 since D ≥ 0 and c ≤ 1 (by Kraft).

Note: the proof is based on comparing p with the "correct" distribution r.

Note: equality holds iff p_i = 2^{−ℓ_i}.

Data Compression Theorem: Let X be a stationary ergodic process with entropy rate h_2(X). For each n, let

c_n = (1/n) min_φ E_{p(X_1...X_n)}(ℓ(φ(x_1 ... x_n))),

where the min is taken over all (binary) prefix encoders φ on 𝒳^n (so c_n is the minimal expected number of coded binary bits per process sample). Then

1. (negative part) each c_n ≥ h_2(X);
2. (positive part) lim_{n→∞} c_n = h_2(X).

So h_2(X) is the best possible compression rate.

Proof:

1 (Negative): By the Lemma, for any prefix encoder φ on 𝒳^n,

E_{p(X_1...X_n)}(ℓ(φ(x_1 ... x_n))) ≥ H_2(X_1 ... X_n).

So c_n ≥ H_2(X_1 ... X_n)/n ≥ h_2(X). NOTE how useful the normalized entropy formula is!

2 (Positive): Let ε > 0. For sufficiently large n, construct a prefix encoder on 𝒳^n as follows. By the AEP Corollary (2), the number of (n, ε)-typical sequences is at most 2^{n(h_2(X)+ε)}. So there is a set C_0 of binary words, of size |A_ε^n|, each of length exactly ⌈n(h_2(X) + ε)⌉. There is also a set C_1 of binary words, of size |(A_ε^n)^c|, each of length exactly ⌈n log_2 |𝒳|⌉. Prepend each element of C_0 with a 0 and each element of C_1 with a 1, yielding sets D_0 and D_1. Then D = D_0 ∪ D_1 is a prefix code. Define a prefix encoder φ on 𝒳^n such that φ|_{A_ε^n} is a 1-1 mapping onto D_0 and φ|_{(A_ε^n)^c} is a 1-1 mapping onto D_1.

Now choose n such that µ(A_ε^n) > 1 − ε. Write h = h_2(X). Then

E_{p(X_1...X_n)}(ℓ(φ(x_1 ... x_n))) ≤ Σ_{x_1...x_n ∈ A_ε^n} p(x_1 ... x_n)(n(h + ε) + 2) + Σ_{x_1...x_n ∈ (A_ε^n)^c} p(x_1 ... x_n)(n log_2 |𝒳| + 2)
= µ(A_ε^n)(n(h + ε) + 2) + µ((A_ε^n)^c)(n log_2 |𝒳| + 2)
≤ (n(h + ε) + 2) + ε(n log_2 |𝒳| + 2) = n(h + ε′),

where ε′ = ε + 2/n + ε log_2 |𝒳| + 2ε/n. Thus lim sup_{n→∞} c_n ≤ h + ε(1 + log_2 |𝒳|); since ε > 0 was arbitrary, lim sup_{n→∞} c_n ≤ h. By part 1, lim inf_{n→∞} c_n ≥ h.

Notes:
– Generalizes with uniquely decodable (UD) codes replacing prefix codes.
– Optimal codes for a given finite probability distribution: Huffman code (see the sketch below).
– Ziv–Lempel data compression: universal coding; it learns the probability distribution "on the fly".
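Since the Huffman code is only name-checked in the notes, here is a minimal sketch of Huffman's construction (my own illustration; the function and variable names are mine). For the sample distribution it reports the expected codeword length, which for an optimal prefix code lies between H_2(X) and H_2(X) + 1, and the Kraft sum.

```python
import heapq
from math import log2

def huffman_code(probs):
    """Return a dict symbol -> binary codeword for a {symbol: probability} dict."""
    # Heap entries: (group probability, tie-breaker, {symbol: partial codeword}).
    heap = [(p, i, {sym: ''}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p0, _, code0 = heapq.heappop(heap)      # two least probable groups
        p1, _, code1 = heapq.heappop(heap)
        merged = {s: '0' + w for s, w in code0.items()}
        merged.update({s: '1' + w for s, w in code1.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]

probs = {'a': 0.4, 'b': 0.3, 'c': 0.2, 'd': 0.1}
code = huffman_code(probs)
H2 = -sum(p * log2(p) for p in probs.values())
L = sum(p * len(code[s]) for s, p in probs.items())
kraft = sum(2 ** -len(w) for w in code.values())
print(code)
print(f"H2(X) = {H2:.3f}, expected length = {L:.3f}, Kraft sum = {kraft}")
```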
Lecture 11 (Sept. 30):

CHANNEL CODING

Defn: Let (X, Y) be jointly distributed, finite-valued r.v.'s. The mutual information is defined by

I(X; Y) = H(X) − H(X|Y).

(Discuss the meaning in terms of Venn diagrams and common information. Should I(X; Y) be symmetric in X and Y?)

Prop:
1. I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X).
2. I(X; Y) = H(X) + H(Y) − H(X, Y).

Proof: 1 is equivalent to H(X) + H(Y|X) = H(Y) + H(X|Y), which is true since both sides equal H(X, Y). For 2,

H(X) + H(Y) − H(X, Y) = H(X) + H(Y) − (H(Y) + H(X|Y)) = H(X) − H(X|Y).

Prop: I(X; Y) ≥ 0, with equality iff X ⊥ Y.

Proof: H(X) + H(Y) − H(X, Y) ≥ 0, with equality iff H(X) + H(Y) = H(X, Y), iff X ⊥ Y, by entropy property 6.

So the only way that X and Y have no common information is if they are independent.

Defn: The mutual information (rate) of a jointly distributed stationary pair (X, Y) of stationary processes is

I(X; Y) = lim_{n→∞} I(X_1, ..., X_n; Y_1, ..., Y_n)/n.

Prop: The limit exists and I(X; Y) = h(X) + h(Y) − h((X, Y)).

Proof:

I(X_1, ..., X_n; Y_1, ..., Y_n)/n = (H(X_1, ..., X_n) + H(Y_1, ..., Y_n) − H((X_1, ..., X_n), (Y_1, ..., Y_n)))/n,

and each of the three normalized terms converges.

Prop:
1. I(X; Y) ≥ 0.
2. If X ⊥ Y, then I(X; Y) = 0, but the converse need not hold.

Proof: H((X_1, ..., X_n), (Y_1, ..., Y_n)) ≤ H(X_1, ..., X_n) + H(Y_1, ..., Y_n), with equality if X ⊥ Y.

Failure of the converse: join two zero-entropy processes in a skew way. Example: Let X be defined by p(0^n) = 1/2 = p(1^n) for all n, where a^n is the sequence of n repetitions of the letter a. Let Y = X, with joint distribution p(x, y) = p(x) if y = x and 0 otherwise. One can also get positive-entropy examples.

Defn: A stationary channel is defined by an input alphabet 𝒳, an output alphabet 𝒴 (both finite), and, for each x ∈ 𝒳^Z, a (conditional) probability measure p(·|x) on 𝒴^Z, such that
1. for each y_i ... y_j ∈ 𝒴^{j−i+1}, p(y_i ... y_j | x) is measurable as a function of x;
2. for each x, p(y_{i+1} ... y_{j+1} | σ(x)) = p(y_i ... y_j | x).

Note: only the conditional probability of an output given an input is specified. (Black box diagram.)

Idea: A stationary process X is input to the channel, and this yields an induced stationary output process Y:

p(y_i ... y_j) = ∫ p(y_i ... y_j | x) dµ(x),

and in fact a stationary joint process:

p(x_i ... x_j, y_i ... y_j) = ∫_{ {u ∈ 𝒳^Z : u_i ... u_j = x_i ... x_j} } p(y_i ... y_j | u) dµ(u).

This is a bit too general. We will treat only the simplest case (which is a bit too simple).

Defn: The channel capacity of a channel is

C = sup_X I(X; Y),

where the sup is over all stationary processes X on 𝒳.

Idea: you can transmit only information that is common to input and output. Note: we will make this more meaningful very soon.

Defn: A Discrete Memoryless Channel (DMC) is defined by a finite input alphabet 𝒳, an output alphabet 𝒴, and, for each x ∈ 𝒳, channel transition probabilities p(y|x) on 𝒴; define p(·|x) by

p(y_1 ... y_n | x_1 ... x_n) = p(y_1|x_1) ··· p(y_n|x_n).

Note: "memoryless" in that you multiply transition probabilities from time instant to time instant.

Given a stationary input process X, the corresponding joint and output processes are defined by

p(x_1 ... x_n, y_1 ... y_n) = p(y_1|x_1) ··· p(y_n|x_n) p(x_1 ... x_n)

and

p(y_1 ... y_n) = Σ_{x_1 ... x_n} p(y_1|x_1) ··· p(y_n|x_n) p(x_1 ... x_n).

Review of Example: BSC(ε): 𝒳 = {0, 1} = 𝒴, with p(y|x) = 1 − ε if x = y and p(y|x) = ε if x ≠ y.

Note: the DMC is a very simple kind of channel.
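To make the definitions concrete, here is a small sketch (my own illustration, not from the notes; the function names are mine) computing I(X; Y) = H(X) + H(Y) − H(X, Y) from a joint distribution, applied to a single use of the BSC(ε) just reviewed with input X ∼ (1 − q, q).

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint pmf given as a 2-D array."""
    joint = np.asarray(joint, float)
    return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint)

def bsc_joint(q, eps):
    """Joint pmf of (X, Y) for one use of BSC(eps) with input X ~ (1-q, q)."""
    px = np.array([1 - q, q])
    channel = np.array([[1 - eps, eps],
                        [eps, 1 - eps]])   # rows give p(y | x)
    return px[:, None] * channel

eps = 0.1
for q in (0.1, 0.3, 0.5):
    print(f"q = {q}: I(X;Y) = {mutual_information(bsc_joint(q, eps)):.4f} bits")
# q = 1/2 gives the largest value, 1 - H2(eps): the BSC capacity computed below.
```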
Lemma: For a DMC {p(·|x)}_{x∈𝒳} and an iid input process X on 𝒳, let Y be the induced output process. Then Y is iid and I(X; Y) = I(X_1; Y_1).

Proof: In fact, (X, Y) is iid:

p(x_1 ... x_n, y_1 ... y_n) = p(y_1 ... y_n | x_1 ... x_n) p(x_1 ... x_n) = p(y_1|x_1)p(x_1) ··· p(y_n|x_n)p(x_n) = p(x_1, y_1) ··· p(x_n, y_n).

Now sum over all x_1 ... x_n to see that Y is iid. Thus

I(X_1, ..., X_n; Y_1, ..., Y_n) = H(X_1 ... X_n) + H(Y_1 ... Y_n) − H(X_1 ... X_n, Y_1 ... Y_n)
= Σ_{i=1}^n (H(X_i) + H(Y_i) − H(X_i, Y_i)) = n(H(X_1) + H(Y_1) − H(X_1, Y_1)).

Lecture 12 (Oct. 3):

(Pick a time to review information theory and discuss homework. Wed., Oct. 12 at 1 or 2?)

Even for a BSC with stationary first-order Markov input X, I(X; Y) can be difficult to compute, since Y is an HMP. Nevertheless:

Prop: For a DMC,

C = sup_X I(X; Y)  ( = sup over iid processes X of I(X; Y) ),

where the first sup is taken over all random variables X on 𝒳 (and corresponding r.v.'s Y on 𝒴).

Proof: Let C′ be the RHS. Given a r.v. X, consider the corresponding iid input process, its output r.v. Y and output process. By the Lemma, the mutual information rate of the pair of processes equals I(X; Y). Thus C′ ≤ C.

Conversely, given a stationary process X on 𝒳, we claim I(X; Y) ≤ I(X_1; Y_1) (where Y_1 is the output r.v. corresponding to X_1). Proof (by Hannah and Daniel):

(1/n) I(X_1 ... X_n; Y_1 ... Y_n) = (1/n) (H(Y_1 ... Y_n) − H(Y_1 ... Y_n | X_1 ... X_n)).

Now

H(Y_1 ... Y_n) ≤ Σ_{i=1}^n H(Y_i) = nH(Y_1),

and

H(Y_1 ... Y_n | X_1 ... X_n) = H(Y_1 | X_1 ... X_n) + H(Y_2 | X_1 ... X_n, Y_1) + ... + H(Y_n | X_1 ... X_n, Y_1, ..., Y_{n−1})
= H(Y_1|X_1) + H(Y_2|X_2) + ... + H(Y_n|X_n) = nH(Y_1|X_1),

the next-to-last equality holding since, given X_i, Y_i ⊥ (X_1 ... X_{i−1}, X_{i+1} ... X_n, Y_1, ..., Y_{i−1}) (because the channel is memoryless). Thus

I(X; Y) = lim_{n→∞} (1/n) I(X_1 ... X_n; Y_1 ... Y_n) ≤ H(Y_1) − H(Y_1|X_1) = I(X_1; Y_1).

So C ≤ C′.

Prop: For BSC(ε), C = log 2 − H(1 − ε, ε) (which equals 1 − H_2(1 − ε, ε) if the base of the log is 2).

Proof: Let X be a r.v. Then

I(X; Y) = H(Y) − H(Y|X) = H(Y) − Σ_x p(x) H(Y|X = x) = H(Y) − Σ_x p(x) H(1 − ε, ε) = H(Y) − H(1 − ε, ε) ≤ log 2 − H(1 − ε, ε).

Let X ∼ (1/2, 1/2). Then Y ∼ (1/2, 1/2):

p(Y = 0) = P(Y = 0|X = 0)P(X = 0) + P(Y = 0|X = 1)P(X = 1) = ((1 − ε) + ε)(1/2) = 1/2,

and so H(Y) = log 2, thereby achieving the alleged capacity.

The previous result generalizes to symmetric DMC's. But there is no known closed form for

C(ε) = sup_p I(X; Y),  where X is the stationary Markov input with transition matrix
[ p  1−p ]
[ 1   0  ]

and Y is the output over BSC(ε) resulting from X.

For a general DMC there is no closed-form solution for C either; however, computing C is amenable to finite-dimensional optimization, in CONTRAST to the problem of computing the entropy of an FMC. Mention the Blahut–Arimoto algorithm for the DMC (a possible topic for a student talk); it is a finite-dimensional optimization problem (a sketch follows).
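Since Blahut–Arimoto is mentioned only as a possible talk topic, here is a minimal sketch of the algorithm (my own illustration, not from the notes; it assumes strictly positive transition probabilities, and the function names are mine). It alternates between updating the conditional distribution q(x|y) and the input distribution r(x), and its capacity estimate converges to C for a DMC; for BSC(ε) it should reproduce 1 − H_2(ε) at the uniform input.

```python
import numpy as np

def blahut_arimoto(W, iters=500):
    """Approximate C = max I(X;Y) for a DMC whose transition matrix W[x, y] = p(y|x)
    has strictly positive entries. Returns (capacity in bits, optimizing input dist)."""
    nx, ny = W.shape
    r = np.full(nx, 1.0 / nx)                    # start from the uniform input
    for _ in range(iters):
        q = r[:, None] * W                       # unnormalized p(x | y)
        q /= q.sum(axis=0, keepdims=True)
        r = np.exp((W * np.log(q)).sum(axis=1))  # r(x) prop. to exp(sum_y p(y|x) log q(x|y))
        r /= r.sum()
    joint = r[:, None] * W
    py = joint.sum(axis=0)
    I = float((joint * np.log2(joint / (r[:, None] * py[None, :]))).sum())
    return I, r

eps = 0.1
W = np.array([[1 - eps, eps],
              [eps, 1 - eps]])
C, r = blahut_arimoto(W)
H2eps = -(eps * np.log2(eps) + (1 - eps) * np.log2(1 - eps))
print(f"C approx {C:.6f} bits at input {np.round(r, 4)};  1 - H2(eps) = {1 - H2eps:.6f}")
```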
Defn: An (M, n) code for a DMC consists of
• a codebook: a set of codewords C ⊂ 𝒳^n;
• an encoder: a map E from {1, ..., M} onto C;
• a decoder: a map D from 𝒴^n to {1, ..., M}.
We write (C, E, D), which we sometimes denote simply by C.

For i = 1, ..., M, let

p_i^C = P(error | i) = Σ_{y_1...y_n : D(y_1...y_n) ≠ i} p(y_1 ... y_n | E(i))

(here, p(·|·) is given by the DMC channel transition probabilities). The (maximum) error probability of the code is

p_max = p_max^C = max_{i=1,...,M} p_i^C.

The (average) error probability of the code is

p_avg = p_avg^C = (1/M) Σ_{i=1}^M p_i^C.

The rate of the code is R = (log_2 M)/n (e.g., R = k/n if M = 2^k).

Channel Coding Theorem for a DMC:

(positive part) Let R < C. Then there exists a sequence of (⌊2^{nR}⌋, n) codes C_n such that p_max^{C_n} → 0 as n → ∞.

(negative part) Let R > C. Then there exists b > 0 such that for any code C with rate R, p_avg^C ≥ b.

Rough idea of the negative part: Assume that the statistics of a code correspond to some iid process X generated by a r.v. X on the input letters. Then the statistics for (X, Y) are iid and hence so are those for Y. By the AEP for Y, there are roughly 2^{nH(Y)} typical output sequences of length n. "By the AEP for Y|X", for a given typical input x_1 ... x_n there are roughly 2^{nH(Y|X)} high-probability output sequences O_{x_1...x_n} ⊂ 𝒴^n. In order to decode with low error probability, we want the O_{x_1...x_n} to be roughly disjoint. So the maximal number of input sequences that give rise to distinguishable sets of outputs is roughly

2^{nH(Y)} / 2^{nH(Y|X)} ≈ 2^{nI(X;Y)},

and the code rate would be ≈ I(X; Y) ≤ C.

Lecture 13 (Oct. 5):

Defn: Let (X, Y) be jointly distributed r.v.'s. Let p be the distribution of the iid process generated by (X, Y). Let Ā_ε^n be the set of sequences (x_1 ... x_n, y_1 ... y_n) of length n such that
1. |−(1/n) log_2 p(x_1 ... x_n, y_1 ... y_n) − H_2(X, Y)| < ε;
2. |−(1/n) log_2 p(x_1 ... x_n) − H_2(X)| < ε;
3. |−(1/n) log_2 p(y_1 ... y_n) − H_2(Y)| < ε.

Theorem (Joint AEP for iid): Let (X, Y) be jointly distributed. Let µ be the measure on 𝒳^Z × 𝒴^Z defined by the iid process generated by (X, Y) ∼ p(x, y). Let µ_⊥ be the measure on 𝒳^Z × 𝒴^Z defined by the iid process generated by (X, Y) ∼ p(x)p(y).
1. For sufficiently large n, µ(Ā_ε^n) ≥ 1 − ε.
2. For sufficiently large n, µ_⊥(Ā_ε^n) ≈ 2^{−nI(X;Y)}; precisely,

(1 − ε) 2^{−n(I(X;Y)+3ε)} ≤ µ_⊥(Ā_ε^n) ≤ 2^{−n(I(X;Y)−3ε)}.

Proof:

1. Roughly the same proof as AEP Corollary (1): intersect the typical sets A_ε^n for (X, Y), X and Y to get Ā_ε^n.

2. From roughly the same proof as AEP Corollary (2, 3), for sufficiently large n,

|Ā_ε^n| ≈ 2^{nH(X,Y)};  precisely,  (1 − ε) 2^{n(H(X,Y)−ε)} ≤ |Ā_ε^n| ≤ 2^{n(H(X,Y)+ε)}

(using 1 and the first part of the defn. of Ā_ε^n). Thus

µ_⊥(Ā_ε^n) = Σ_{(x_1...x_n, y_1...y_n) ∈ Ā_ε^n} p(x_1 ... x_n) p(y_1 ... y_n) ≈ 2^{nH(X,Y)} 2^{−nH(X)} 2^{−nH(Y)} = 2^{−nI(X;Y)}.

Outline of the positive part: Let R < C, let ε > 0 be small, let n be large, let M = ⌊2^{nR}⌋, and let C be an (M, n) code with codeword array (x_j^i), i = 1, ..., M, j = 1, ..., n, whose i-th row is the codeword x_1^i ... x_n^i.

Encoder: E(i) = x_1^i ... x_n^i. Decoder: D(y_1 ... y_n) = i iff there is exactly one i such that (x_1^i ... x_n^i, y_1 ... y_n) ∈ Ā_ε^n (otherwise, declare an error).

Let X be a r.v. such that R < I(X; Y) − 3ε. Consider a random code, which generates a random array whose i-th row is a codeword x_1^i ... x_n^i together with a received word y_1^i ... y_n^i, drawn with probability Π_{i,j} p(x_j^i) p(y_j^i | x_j^i). That is, the codewords are drawn iid at random and each received word is drawn according to the channel transition probabilities applied to its codeword. In particular, the rows are independent.

We will show that E_C p_avg^C < 2ε for large n, and hence there must be at least one code C such that p_avg^C < 2ε. (With a modification, one can show that for some code C, p_max^C < 4ε.)

Proof:

E_C p_avg^C = E_C (1/M) Σ_{i=1}^M p_i^C = (1/M) Σ_{i=1}^M E_C p_i^C.

By symmetry, E_C p_i^C does not depend on i. So

E_C p_avg^C = E_C p_1^C = Pr(F_1^c ∪ F_2 ∪ ... ∪ F_M) ≤ Pr(F_1^c) + Σ_{i=2}^M Pr(F_i),

where

F_i = {(x_1^i ... x_n^i, y_1^1 ... y_n^1) ∈ Ā_ε^n}.

Now Pr(F_1^c) = µ(F_1^c), and for i = 2, ..., M, Pr(F_i) = µ_⊥(F_i).

By the joint AEP (1), for large n the first term is < ε. By the joint AEP (2), for large n and i = 2, ..., M, the i-th term is < 2^{−n(I(X;Y)−3ε)}. Thus

E_C p_avg^C < ε + 2^{nR} 2^{−n(I(X;Y)−3ε)} ≤ ε + 2^{−n(I(X;Y)−R−3ε)},

and the second term tends to 0 as n → ∞ (since R < I(X;Y) − 3ε), so E_C p_avg^C < 2ε for large n.

Modification: if p_avg^C < 2ε, take C′ to be the half of the codewords with the lowest p_i^C. Then p_max^{C′} < 4ε; for otherwise p_i^C ≥ 4ε for every x_1^i ... x_n^i ∈ C \ C′, and so p_avg^C ≥ 2ε. And C′ is a (⌊M/2⌋, n) code (and so has rate R − 1/n).
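Here is a small brute-force check of the joint AEP used above (my own illustration, not from the notes; the parameter choices δ = 0.11, ε = 0.06, n = 1200 are mine). Take (X, Y) iid per coordinate with X ∼ (1/2, 1/2) and Y = X ⊕ E over a BSC(δ). Then the joint probability of a pair depends only on the number k of coordinates where x and y disagree, so µ(Ā_ε^n) and µ_⊥(Ā_ε^n) can be summed over k and compared with the bounds in part 2.

```python
import math

delta, eps, n = 0.11, 0.06, 1200   # BSC(delta), typicality slack eps, block length n

H2 = lambda t: -t * math.log2(t) - (1 - t) * math.log2(1 - t)
Hx = Hy = 1.0                      # X and Y are both Bernoulli(1/2)
Hxy = 1.0 + H2(delta)              # H(X,Y) = H(X) + H(E) since Y = X xor E, E independent of X
I = Hx + Hy - Hxy                  # = 1 - H2(delta)

def log2_binom(n, k):
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)) / math.log(2)

mu, mu_perp = 0.0, 0.0
for k in range(n + 1):             # k = number of coordinates where x_i != y_i
    # -(1/n) log2 p(x, y) is the same for every pair with k disagreements:
    per_letter = 1.0 - ((n - k) / n) * math.log2(1 - delta) - (k / n) * math.log2(delta)
    # Conditions 2 and 3 hold automatically here: p(x) = p(y) = 2^(-n), so their
    # per-letter values are exactly Hx = Hy = 1.  Condition 1 is checked below.
    if abs(per_letter - Hxy) < eps:
        # 2^n * C(n, k) pairs share this probability.
        mu += 2 ** (log2_binom(n, k) + (n - k) * math.log2(1 - delta) + k * math.log2(delta))
        mu_perp += 2 ** (log2_binom(n, k) - n)

print(f"I(X;Y) = {I:.4f} bits")
print(f"mu(A)      = {mu:.4f}   (theorem: >= 1 - eps = {1 - eps:.2f} for large n)")
print(f"(1-eps) 2^(-n(I+3eps)) = {(1 - eps) * 2 ** (-n * (I + 3 * eps)):.3e}")
print(f"mu_perp(A)             = {mu_perp:.3e}")
print(f"2^(-n(I-3eps))         = {2 ** (-n * (I - 3 * eps)):.3e}")
```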
Mention vast generalizations of the channel coding theorem to channels with memory, e.g., an RLL-constrained input over a BSC. But there is no closed form for the capacity of this channel (worse than the DMC case, since you need to consider all possible input processes of all possible orders).