Information Theory Mike Brookes E4.40, ISE4.51, SO20 Jan 2008 2 Lectures Entropy Properties 1 Entropy - 6 2 Mutual Information – 19 Losless Coding 3 Symbol Codes -30 4 Optimal Codes - 41 5 Stochastic Processes - 55 6 Stream Codes – 68 Channel Capacity 7 Markov Chains - 83 8 Typical Sets - 93 9 Channel Capacity - 105 10 Joint Typicality - 118 11 Coding Theorem - 128 12 Separation Theorem – 135 Continuous Variables 13 Differential Entropy - 145 14 Gaussian Channel - 159 15 Parallel Channels – 172 Lossy Coding 16 Lossy Coding - 184 17 Rate Distortion Bound – 198 Revision 18 Revision - 212 19 20 Jan 2008 3 Claude Shannon • “The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point.” (Claude Shannon 1948) • Channel Coding Theorem: It is possible to achieve near perfect communication of information over a noisy channel • In this course we will: – Define what we mean by information – Show how we can compress the information in a source to its theoretically minimum value and show the tradeoff between data compression and distortion. – Prove the Channel Coding Theorem and derive the information capacity of different channels. 1916 - 2001 Jan 2008 4 Textbooks Book of the course: • Elements of Information Theory by T M Cover & J A Thomas, Wiley 2006, 978-0471241959 £30 (Amazon) Alternative book – a denser but entertaining read that covers most of the course + much else: • Information Theory, Inference, and Learning Algorithms, D MacKay, CUP, 0521642981 £28 or free at http://www.inference.phy.cam.ac.uk/mackay/itila/ Assessment: Exam only – no coursework. Acknowledgement: Many of the examples and proofs in these notes are taken from the course textbook “Elements of Information Theory” by T M Cover & J A Thomas and/or the lecture notes by Dr L Zheng based on the book. Jan 2008 5 Notation • Vectors and Matrices – v=vector, V=matrix, ~=elementwise product • Scalar Random Variables – x = R.V, x = specific value, X = alphabet • Random Column Vector of length N – x= R.V, x = specific value, XN = alphabet – xi and xi are particular vector elements • Ranges – a:b denotes the range a, a+1, …, b Jan 2008 6 Discrete Random Variables • A random variable x takes a value x from the alphabet X with probability px(x). The vector of probabilities is px . Examples: X = [1;2;3;4;5;6], px = [1/6; 1/6; 1/6; 1/6; 1/6; 1/6] pX is a “probability mass vector” “english text” X = [a; b;…, y; z; <space>] px = [0.058; 0.013; …; 0.016; 0.0007; 0.193] Note: we normally drop the subscript from px if unambiguous Jan 2008 7 Expected Values • If g(x) is real valued and defined on X then Ex g ( x ) = ∑ p ( x ) g ( x ) often write E for EX x∈X Examples: X = [1;2;3;4;5;6], px = [1/6; 1/6; 1/6; 1/6; 1/6; 1/6] E x = 3.5 = μ E x 2 = 15.17 = σ 2 + μ 2 E sin(0.1x ) = 0.338 E − log 2 ( p(x )) = 2.58 This is the “entropy” of X Jan 2008 8 Shannon Information Content • The Shannon Information Content of an outcome with probability p is –log2p • Example 1: Coin tossing – X = [Heads; Tails], p = [½; ½], SIC = [1; 1] bits • Example 2: Is it my birthday ? – X = [No; Yes], p = [364/365; 1/365], SIC = [0.004; 8.512] bits Unlikely outcomes give more information Jan 2008 10 Minesweeper • Where is the bomb ? • 16 possibilities – needs 4 bits to specify Guess Prob 1. 
15/ No 16 SIC 0.093 bits Jan 2008 11 Entropy The entropy, H (x ) = E − log 2 ( px (x )) = −pTx log 2 p x – H(x) = the average Shannon Information Content of x – H(x) = the average information gained by knowing its value – the average number of “yes-no” questions needed to find x is in the range [H(x),H(x)+1) We use log(x) ≡ log2(x) and measure H(x) in bits – if you use loge it is measured in nats – 1 nat = log2(e) bits = 1.44 bits • log 2 ( x) = ln( x) ln(2) d log 2 x log 2 e = dx x H(X ) depends only on the probability vector pX not on the alphabet X, so we can write H(pX) Jan 2008 12 Entropy Examples 1 (1) Bernoulli Random Variable 0.6 H(p) X = [0;1], px = [1–p; p] H (x ) = −(1 − p) log(1 − p) − p log p Very common – we write H(p) to mean H([1–p; p]). 0.8 0.4 0.2 0 0 0.2 0.4 0.6 0.8 1 p (2) Four Coloured Shapes X = [z; ; ; V], px = [½; ¼; 1/8; 1/8] H (x ) = H (p x ) = ∑ − log( p( x)) p( x) = 1 × ½ + 2 × ¼ + 3 × 18 + 3 × 18 = 1.75 bits Jan 2008 13 Bernoulli Entropy Properties X = [0;1], px = [1–p; p] H ( p) = −(1 − p) log(1 − p) − p log p H ′( p ) = log(1 − p) − log p H ′′( p) = − p −1 (1 − p) −1 log e Bounds on H(p) 1 Quadratic Bounds 0.6 H ( p ) ≤ 1 − 2 log e( p − ½) 2 H(p) 0.8 1-2.89(p-0.5)2 H(p) 1-4(p-0.5)2 2min(p,1-p) 0.4 0.2 0 0 = 1 − 2.89( p − ½) 2 0.2 0.4 0.6 p H ( p ) ≥ 1 − 4( p − ½) 2 ≥ 2 min( p,1 − p ) 0.8 1 Proofs in problem sheet Jan 2008 14 Joint and Conditional Entropy p(x,y) y=0 y=1 x=0 ½ ¼ H (x , y ) = E − log p (x , y ) = −½ log ½ − ¼ log ¼ − 0 log 0 − ¼ log ¼ = 1.5 bits x=1 0 ¼ Conditional Entropy : H(y | x) p(y|x) Joint Entropy: H(x,y) Note: 0 log 0 = 0 y=0 y=1 x=0 2/ 3 1/ 3 x=1 0 1 Note: rows sum to 1 H (y | x ) = E − log p (y | x ) = −½ log 2 3 − ¼ log 13 − 0 log 0 − ¼ log 1 = 0.689 bits Jan 2008 15 Conditional Entropy – view 1 p(x, y) y=0 y=1 p(x) Additional Entropy: x=0 ½ ¼ ¾ p (y | x ) = p ( x , y ) ÷ p ( x ) 0 ¼ x=1 ¼ H (y | x ) = E − log p (y | x ) = E {− log p (x , y )}− E {− log p (x )} = H (x , y ) − H (x ) = H (½,¼,0,¼) − H (¼) = 0.689 bits H(Y|X) is the average additional information in Y when you know X H(x ,y) H(y |x) H(x) Jan 2008 16 Conditional Entropy – view 2 Average Row Entropy: p(x, y) y=0 y=1 H(y | x=x) p(x) x=0 ½ ¼ H(1/3) ¾ x=1 0 ¼ H(1) ¼ H (y | x ) = E − log p (y | x ) = ∑ − p ( x, y ) log p ( y | x) x, y = ∑ − p ( x) p ( y | x) log p ( y | x) = ∑ p ( x)∑ − p ( y | x) log p ( y | x) x∈ X x, y y∈Y = ∑ p ( x) H (y | x = x ) = ¾ × H ( 1 3 ) + ¼ × H (0) = 0.689 bits x∈ X Take a weighted average of the entropy of each row using p(x) as weight Jan 2008 17 Chain Rules • Probabilities p ( x , y , z ) = p (z | x , y ) p (y | x ) p ( x ) • Entropy H ( x , y , z ) = H (z | x , y ) + H (y | x ) + H ( x ) n H (x 1:n ) = ∑ H (x i | x 1:i −1 ) i =1 The log in the definition of entropy converts products of probability into sums of entropy Jan 2008 18 Summary • Entropy: H (x ) = ∑ − log 2 ( p( x)) p( x) = E − log 2 ( p X ( x)) x∈X – Bounded 0 ≤ H (x ) ≤ log | X | ♦ H(x ,y) • Chain Rule: H ( x , y ) = H (y | x ) + H ( x ) H(x |y) H(y |x) H(x) • Conditional Entropy: H(y) H (y | x ) = H (x , y ) − H (x ) = ∑ p ( x) H (y | x ) x∈X – Conditioning reduces entropy H (y | x ) ≤ H (y ) ♦ ♦ = inequalities not yet proved Jan 2008 19 Lecture 2 • Mutual Information – If x and y are correlated, their mutual information is the average information that y gives about x • E.g. 
Communication Channel: x transmitted but y received • Jensen’s Inequality • Relative Entropy – Is a measure of how different two probability mass vectors are • Information Inequality and its consequences – Relative Entropy is always positive • Mututal information is positive • Uniform bound • Conditioning and Correlation reduce entropy Jan 2008 20 Mutual Information The mutual information is the average amount of information that you get about x from observing the value of y I ( x ; y ) = H ( x ) − H ( x | y ) = H ( x ) + H (y ) − H ( x , y ) Information in x Information in x when you already know y H(x ,y) Mutual information is symmetrical H(x |y) I(x ;y) H(y |x) H(x) I ( x ; y ) = I (y ; x ) H(y) Use “;” to avoid ambiguities between I(x;y,z) and I(x,y;z) Jan 2008 21 Mutual Information Example p(x,y) y=0 y=1 x=0 ½ ¼ x=1 0 ¼ • If you try to guess y you have a 50% chance of being correct. • However, what if you know x ? – Best guess: choose y = x =0 (p=0.75) then 66% correct prob – If x =1 (p=0.25) then 100% correct prob – If x – Overall 75% correct probability H(x ,y)=1.5 I (x ; y ) = H (x ) − H (x | y ) H(x |y) =0.5 I(x ;y) =0.311 H(x)=0.811 H(y |x) =0.689 H(y)=1 = H ( x ) + H (y ) − H ( x , y ) H (x ) = 0.811, H (y ) = 1, H (x , y ) = 1.5 I (x ; y ) = 0.311 Jan 2008 22 Conditional Mutual Information Conditional Mutual Information I (x ; y | z ) = H (x | z ) − H (x | y , z ) = H ( x | z ) + H (y | z ) − H ( x , y | z ) Note: Z conditioning applies to both X and Y Chain Rule for Mutual Information I (x 1 , x 2 , x 3 ; y ) = I (x 1; y ) + I (x 2 ; y | x 1 ) + I (x 3 ; y | x 1 , x 2 ) n I (x 1:n ; y ) = ∑ I (x i ; y | x 1:i −1 ) i =1 H(x ,y) Jan 2008 23 Review/Preview • Entropy: H(x |y) I(x ;y) H(y |x) H(x) H(y) H (x ) = ∑ - log 2 ( p ( x)) p( x) = E − log 2 ( p X ( x)) x∈X – Always positive • Chain Rule: H (x ) ≥ 0 H ( x , y ) = H ( x ) + H (y | x ) ≤ H ( x ) + H (y ) ♦ – Conditioning reduces entropy H (y | x ) ≤ H (y ) ♦ • Mutual Information: I (y ; x ) = H (y ) − H (y | x ) = H ( x ) + H (y ) − H ( x , y ) – Positive and Symmetrical I (x ; y ) = I (y ; x ) ≥ 0 – x and y independent ⇔ H (x , y ) = H (y ) + H (x ) ♦ = inequalities not yet proved ⇔ I (x ; y ) = 0 ♦ ♦ Jan 2008 24 Convex & Concave functions f(x) is strictly convex over (a,b) if f (λu + (1 − λ )v) < λf (u ) + (1 − λ ) f (v) ∀u ≠ v ∈ (a, b), 0 < λ < 1 f(x) lies above f(x) – f(x) is concave ⇔ –f(x) is convex – every chord of Concave is like this • Examples – Strictly Convex: x , x , e , x log x [ x ≥ 0] – Strictly Concave: log x, x [ x ≥ 0] – Convex and Concave: x 2 – Test: 4 x d2 f > 0 ∀x ∈ (a, b) dx 2 ⇒ f(x) is strictly convex “convex” (not strictly) uses “≤” in definition and “≥” in test Jan 2008 25 Jensen’s Inequality Jensen’s Inequality: (a) f(x) convex ⇒ Ef(x) ≥ f(Ex) (b) f(x) strictly convex ⇒ Ef(x) > f(Ex) unless x constant Proof by induction on |X| These sum to 1 – |X|=1: E f (x ) = f ( E x ) = f ( x1 ) k k −1 pi f ( xi ) i =1 1 − pk – |X|=k: E f (x ) = ∑ pi f ( xi ) = pk f ( xk ) + (1 − pk )∑ i =1 ⎞ ⎛ k −1 pi xi ⎟⎟ ≥ pk f ( xk ) + (1 − pk ) f ⎜⎜ ∑ Assume JI is true ⎝ i =1 1 − pk ⎠ for |X|=k–1 k −1 ⎞ ⎛ pi xi ⎟⎟ = f (E x ) ≥ f ⎜⎜ pk xk + (1 − pk )∑ i =1 1 − pk ⎠ ⎝ Can replace by “>” if f(x) is strictly convex unless pk∈{0,1} or xk = E(x | x∈{x1:k-1}) Jan 2008 26 Jensen’s Inequality Example Mnemonic example: f(x) f(x) = x2 : strictly convex X = [–1; +1] 4 p = [½; ½] 3 Ex =0 2 f(E x)=0 1 E f(x) = 1 > f(E x) 0 -2 -1 0 x 1 2 Jan 2008 27 Relative Entropy Relative Entropy or Kullback-Leibler Divergence between two probability 
mass vectors p and q D(p || q) = ∑ p ( x) log x∈X p( x) p(x ) = Ep log = Ep (− log q(x ) ) − H (x ) q( x) q(x ) where Ep denotes an expectation performed using probabilities p D(p||q) measures the “distance” between the probability mass functions p and q. We must have pi=0 whenever qi=0 else D(p||q)=∞ Beware: D(p||q) is not a true distance because: – – (1) it is asymmetric between p, q and (2) it does not satisfy the triangle inequality. Jan 2008 28 Relative Entropy Example X = [1 2 3 4 5 6]T [6 q = [1 10 p= 1 1 1 1 1 1 ] ⇒ H (p) = 2.585 6 6 6 6 6 1 1 1 1 1 ⇒ H (q) = 2.161 10 10 10 10 2 D(p || q) = Ep (− log qx ) − H (p) = 2.935 − 2.585 = 0.35 ] D(q || p) = Eq (− log px ) − H (q) = 2.585 − 2.161 = 0.424 Jan 2008 29 Information Inequality Information (Gibbs’) Inequality: D(p || q) ≥ 0 • Define A = {x : p( x) > 0} ⊆ X p( x ) q( x ) • Proof − D(p || q) = − ∑ p( x) log q( x) = ∑ p( x ) log p( x) x∈A x∈A ⎛ q( x ) ⎞ ⎛ ⎞ ⎛ ⎞ ≤ log⎜⎜ ∑ p ( x ) ⎟⎟ = log⎜ ∑ q( x ) ⎟ ≤ log⎜ ∑ q( x ) ⎟ = log 1 = 0 p( x ) ⎠ ⎝ x∈A ⎠ ⎝ x∈X ⎠ ⎝ x∈A If D(p||q)=0: Since log( ) is strictly concave we have equality in the proof only if q(x)/p(x), the argument of log, equals a constant. But ∑ p( x) =∑ q( x) =1 x∈X x∈X so the constant must be 1 and p ≡ q Jan 2008 30 Information Inequality Corollaries • Uniform distribution has highest entropy – Set q = [|X|–1, …, |X|–1]T giving H(q)=log|X| bits D(p || q) = Ep {− log q(x )} − H (p) = log | X | − H (p) ≥ 0 • Mutual Information is non-negative I (y ; x ) = H (x ) + H (y ) − H (x , y ) = E log p(x , y ) p ( x ) p (y ) = D (p x , y || p x ⊗ p y ) ≥ 0 with equality only if p(x,y) ≡ p(x)p(y) ⇔ x and y are independent. Jan 2008 31 More Corollaries • Conditioning reduces entropy 0 ≤ I ( x ; y ) = H (y ) − H (y | x ) ⇒ H (y | x ) ≤ H (y ) with equality only if x and y are independent. • Independence Bound n n i =1 i =1 H (x 1:n ) = ∑ H (x i | x 1:i −1 ) ≤ ∑ H (x i ) with equality only if all xi are independent. E.g.: If all xi are identical H(x1:n) = H(x1) Jan 2008 32 Conditional Independence Bound • Conditional Independence Bound n n H (x 1:n | y 1:n ) = ∑ H (x i | x 1:i −1 , y 1:n ) ≤ ∑ H (x i | y i ) i =1 i =1 • Mutual Information Independence Bound If all xi are independent or, by symmetry, if all yi are independent: I (x 1:n ; y 1:n ) = H (x 1:n ) − H (x 1:n | y 1:n ) n n n i =1 i =1 i =1 ≥ ∑ H (x i ) − ∑ H (x i | y i ) = ∑ I (x i ; y i ) E.g.: If n=2 with xi i.i.d. Bernoulli (p=0.5) and y1=x2 and y2=x1, then I(xi;yi)=0 but I(x1:2; y1:2) = 2 bits. Jan 2008 33 Summary • Mutual Information • Jensen’s Inequality: I (x ; y ) = H (x ) − H (x | y ) ≤ H (x ) f(x) convex ⇒ Ef(x) ≥ f(Ex) • Relative Entropy: – D(p||q) = 0 iff p ≡ q D(p || q) = Ep log • Corollaries p(x ) ≥0 q(x ) – Uniform Bound: Uniform p maximizes H(p) – I(x ; y) ≥ 0 ⇒ Conditioning reduces entropy – Indep bounds: H (x 1:n ) ≤ n ∑ H (x ) H (x 1:n | y 1:n ) ≤ i i =1 I (x 1:n ; y 1:n ) ≥ ∑ I (x ; y ) i =1 ∑ H (x i =1 n i n i if x i or y i are indep i | yi) Jan 2008 34 Lecture 3 • Symbol codes – uniquely decodable – prefix • Kraft Inequality • Minimum code length • Fano Code Jan 2008 35 Symbol Codes • Symbol Code: C is a mapping X→D+ – D+ = set of all finite length strings from D – e.g. {E, F, G} →{0,1}+ : C(E)=0, C(F)=10, C(G)=11 • Extension: C+ is mapping X+ →D+ formed by concatenating C(xi) without punctuation • e.g. 
C+(EFEEGE) =01000110 • Non-singular: x1≠ x2 ⇒ C(x1) ≠ C(x2) • Uniquely Decodable: C+ is non-singular – that is C+(x+) is unambiguous Jan 2008 36 Prefix Codes • Instantaneous or Prefix Code: No codeword is a prefix of another • Prefix ⇒ Uniquely Decodable ⇒ Non-singular Examples: – C(E,F,G,H) = (0, 1, 00, 11) – C(E,F) = (0, 101) – C(E,F) = (1, 101) – C(E,F,G,H) = (00, 01, 10, 11) – C(E,F,G,H) = (0, 01, 011, 111) PU PU PU PU PU Jan 2008 37 Code Tree Prefix code: C(E,F,G,H) = (00, 11, 100, 101) Form a D-ary tree where D = |D| – D branches at each node 0 – Each node along the path to a leaf is a prefix of the leaf 1 ⇒ can’t be a leaf itself – Some leaves may be unused all used ⇒ | X | −1 is a multiple of D − 1 0 E 1 G 0 0 1 1 H F 111011000000→ FHGEE Jan 2008 38 Kraft Inequality (binary prefix) • Label each node at depth l with 2–l • Each node equals the sum of all its leaves • Codeword lengths: | X| l1, l2, …, l|X| ⇒ ∑ 2−l 0 ½ 0 1 E 1/ ¼ 8 0 G ¼ 1 i ¼ ≤1 0 ½ 1 1 1 i =1 • Equality iff all leaves are utilised • Total code budget = 1 Code 00 uses up ¼ of the budget Code 100 uses up 1/8 of the budget 1 ¼ /8 H F Same argument works with D-ary tree Jan 2008 40 Kraft Inequality If uniquely decodable C has codeword lengths | X| l1, l2, …, l|X| , then ∑ D −l | X| i ≤1 i =1 Proof: Let S = ∑ D −li and M = max li then for any N , i =1 N | X| | X| ⎛ | X| − li ⎞ S = ⎜⎜ ∑ D ⎟⎟ = ∑∑ i1 =1 i 2 =1 ⎝ i =1 ⎠ N NM | X| ∑D ( − l i 1 + l i2 + i N =1 = ∑ D | x : l = length{C (x)} | ≤ l =1 −l + + l iN )= ∑D − length{C + ( x)} x∈ X N NM ∑D l =1 −l NM D = ∑1 = NM l l =1 If S > 1 then SN > NM for some N. Hence S ≤ 1 Jan 2008 41 Converse to Kraft Inequality | X| If ∑ D −l i =1 Proof: i then ∃ a prefix code with codeword lengths l1, l2, …, l|X| ≤1 – Assume li ≤ li+1 and think of codewords as base-D decimals 0.d1d2…dli k −1 – Let codeword ck = ∑ D −l i i =1 with lk digits k −1 −l – For any j < k we have ck = c j + ∑ D −l ≥ c j + D i j i= j – So cj cannot be a prefix of ck because they differ in the first lj digits. ⇒ non-prefix symbol codes are a waste of time Jan 2008 42 Kraft Converse Example 5| Suppose l = [2; 2; 3; 3; 3] ⇒ ∑ 2−l i = 0.875 ≤ 1 i =1 k −1 ck = ∑ D − li Code 2 0.0 = 0.002 00 2 0.25 = 0.012 01 3 0.5 = 0.1002 100 3 0.625 = 0.1012 101 3 0.75 = 0.1102 110 lk i =1 Each ck is obtained by adding 1 to the LSB of the previous row For code, express ck in binary and take the first lk binary places Jan 2008 44 Minimum Code Length If l(x) = length(C(x)) then C is optimal if LC=E l(X) is as small as possible. Uniquely decodable code ⇒ LC ≥ H(X)/log2D Proof: We define q by q( x) = c −1D −l ( x ) where c = ∑ D −l ( x ) ≤ 1 LC − H (x ) / log 2 D = E l (x ) + E log D p (x ) ( x ) = E − log D D − l ( x ) + log D p(x ) = E (− log D cq(x ) + log D p(x ) ) ⎛ p(x ) ⎞ ⎟⎟ − log D c = log D 2(D(p || q ) − log c ) ≥ 0 = E ⎜⎜ log D q ( ) x ⎝ ⎠ with equality only if c=1 and D(p||q) = 0 ⇒ p = q ⇒ l(x) = –logD(x) Jan 2008 45 Fano Code Fano Code (also called Shannon-Fano code) 1. 2. 
Put probabilities in decreasing order Split as close to 50-50 as possible; repeat with each half a 0.20 b 0.19 c 0.17 d 0.15 e 0.14 f 0.06 g 0.05 h 0.04 0 00 0 1 0 1 1 0 010 1 011 0 1 100 0 110 1 101 0 1110 1 1111 H(p) = 2.81 bits LSF = 2.89 bits Always H (p) ≤ LF ≤ H (p) + 1 − 2 min(p) ≤ H (p) + 1 Not necessarily optimal: the best code for this p actually has L = 2.85 bits Jan 2008 46 Summary • Kraft Inequality for D-ary |codes: X| – any uniquely decodable C has ∑ D −l ≤ 1 | X| i =1 −l – If ∑ D ≤ 1 then you can create a prefix code i i i =1 • Uniquely decodable ⇒ LC ≥ H(X)/log2D • Fano code – Order the probabilities, then repeatedly split in half to form a tree. – Intuitively natural but not optimal Jan 2008 47 Lecture 4 • Optimal Symbol Code – Optimality implications – Huffman Code • Optimal Symbol Code lengths – Entropy Bound • Shannon Code Jan 2008 48 Huffman Code An Optimal Binary prefix code must satisfy: 1. p ( xi ) > p ( x j ) ⇒ li ≤ l j (else swap codewords) 2. The two longest codewords have the same length (else chop a bit off the longer codeword) 3. ∃ two longest codewords differing only in the last bit (else chop a bit off all of them) Huffman Code construction 1. Take the two smallest p(xi) and assign each a different last bit. Then merge into a single symbol. 2. Repeat step 1 until only one symbol remains Jan 2008 49 Huffman Code Example X = [a, b, c, d, e], px = [0.25 0.25 0.2 0.15 0.15] a 0.25 0.25 b 0.25 0.25 c 0.2 0.2 d 0.15 0 0.3 e 0.15 1 0 0.25 0 0.45 0.55 0.45 0 1.0 1 1 0.3 1 Read diagram backwards for codewords: C(X) = [00 10 11 010 011], LC = 2.3, H(x) = 2.286 For D-ary code, first add extra zero-probability symbols until |X|–1 is a multiple of D–1 and then group D symbols at a time Jan 2008 51 Huffman Code is Optimal Prefix Code Huffman traceback gives codes for progressively larger alphabets: 0 0 a 0.25 0.25 b 0.25 p2=[0.55 0.45], c 0.2 c2=[0 1], L2=1 d 0.15 p3=[0.25 0.45 0.3], e 0.15 c3=[00 1 01], L3=1.55 p4=[0.25 0.25 0.2 0.3], c4=[00 10 11 01], L4=2 p5=[0.25 0.25 0.2 0.15 0.15], c5=[00 10 11 010 011], L5=2.3 0.25 0.2 0 0 0.25 0.55 0.45 0.45 1.0 1 1 0.3 0.3 1 1 We want to show that all these codes are optimal including C5 Jan 2008 52 Huffman Optimality Proof Suppose one of these codes is sub-optimal: – ∃m>2 with cm the first sub-optimal code (note c 2 is definitely optimal) – An optimal c′m must have LC'm < LCm – Rearrange the symbols with longest codes in c′m so the two lowest probs pi and pj differ only in the last digit (doesen’t change optimality) – Merge xi and xj to create a new code c′m −1 as in Huffman procedure – L C'm–1 =L C'm– pi– pj since identical except 1 bit shorter with prob pi+ pj – But also L Cm–1 =L Cm– pi– pj hence L C'm–1 < LCm–1 which contradicts assumption that cm is the first sub-optimal code Note: Huffman is just one out of many possible optimal codes Jan 2008 53 How short are Optimal Codes? Huffman is optimal but hard to estimate its length. If l(x) = length(C(x)) then C is optimal if LC=E l(x) is as small as possible. We want to minimize 1. ∑D −l ( x ) ∑ p( x)l ( x) subject to x∈X ≤1 x∈X 2. all the l(x) are integers Simplified version: Ignore condition 2 and assume condition 1 is satisfied with equality. 
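Before continuing with the simplified (non-integer length) bound, here is a minimal Python sketch of the Huffman construction described above (the function name, the heap-based bookkeeping and the symbol labels are mine, not from the notes). On the five-symbol example X = [a, b, c, d, e], px = [0.25 0.25 0.2 0.15 0.15] it reproduces LC = 2.3 bits against H(x) ≈ 2.286 bits; the individual codewords may differ from those on the slide because ties can be broken either way, but the code lengths, and hence LC, agree.

```python
import heapq
from math import log2

def huffman_code(probs):
    """Binary Huffman code: repeatedly merge the two least probable nodes,
    assigning each a different last bit, as in the construction above.
    probs: dict symbol -> probability.  Returns dict symbol -> codeword."""
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)                     # tie-break counter for the heap
    while len(heap) > 1:
        p0, _, code0 = heapq.heappop(heap)   # two smallest probabilities
        p1, _, code1 = heapq.heappop(heap)
        # give each half a different bit (prepended, since later merges
        # sit closer to the root of the code tree), then merge the nodes
        merged = {s: "0" + c for s, c in code0.items()}
        merged.update({s: "1" + c for s, c in code1.items()})
        heapq.heappush(heap, (p0 + p1, count, merged))
        count += 1
    return heap[0][2]

if __name__ == "__main__":
    p = {"a": 0.25, "b": 0.25, "c": 0.2, "d": 0.15, "e": 0.15}
    code = huffman_code(p)
    L = sum(p[s] * len(code[s]) for s in p)
    H = sum(-q * log2(q) for q in p.values())
    print(code)   # exact codewords depend on tie-breaking; lengths are 2,2,2,3,3
    print(f"L = {L:.3f} bits, H = {H:.3f} bits")   # expect L = 2.300, H = 2.286
```

Prepending the new bit at each merge corresponds to reading the code tree from the root: bits assigned in later merges are nearer the root and so come first in the codeword.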
less restrictive so lengths may be shorter than actually possible ⇒ lower bound Jan 2008 54 Optimal Codes (non-integer li) | X| • Minimize ∑ p( x )l i i =1 i subject to | X| ∑D − li =1 i =1 Use lagrange multiplier: Define | X| | X| i =1 i =1 J = ∑ p ( xi )li + λ ∑ D −li and set ∂J =0 ∂li ∂J = p ( xi ) − λ ln( D) D − li = 0 ⇒ D − li = p ( xi ) / λ ln( D ) ∂li | X| also ∑D − li = 1 ⇒ λ = 1 / ln( D) ⇒ i =1 with these li , E l (x ) = E − log D ( p (x ) ) = li = –logD(p(xi)) E − log 2 ( p (x ) ) H (x ) = log 2 D log 2 D no uniquely decodable code can do better than this (Kraft inequality) Jan 2008 55 Shannon Code Round up optimal code lengths: li = ⎡− log D p ( xi )⎤ • li are bound to satisfy the Kraft Inequality (since the optimum lengths do) • Hence prefix code exists: put li into ascending order and set k −1 ck = ∑ D to li places − li i =1 • Average length: or k −1 ck = ∑ p( xi ) i =1 H (X ) H (X ) ≤ LC < +1 log 2 D log 2 D equally good since p ( xi ) ≥ D − li (since we added <1 to optimum values) Note: since Huffman code is optimal, it also satisfies these limits Jan 2008 56 Shannon Code Examples Example 1 (good) p x = [0.5 0.25 0.125 0.125] − log 2 p x = [1 2 3 3] l x = ⎡− log 2 p x ⎤ = [1 2 3 3] LC = 1.75 bits, H (x ) = 1.75 bits Example 2 (bad) p x = [0.99 0.01] Dyadic probabilities − log 2 p x = [0.0145 6.64] l x = ⎡− log 2 p x ⎤ = [1 7] (obviously stupid to use 7) LC = 1.06 bits, H (x ) = 0.08 bits We can make H(x)+1 bound tighter by encoding longer blocks as a super-symbol Jan 2008 57 Shannon versus Huffman Shannon p x = [0.36 0.34 0.25 0.05] ⇒ H (x ) = 1.78 bits − log 2 p x = [1.47 1.56 2 4.32] l S = ⎡− log 2 p x ⎤ = [2 2 2 5] LS = 2.15 bits Huffman l H = [1 2 3 3] LH = 1.94 bits a 0.36 0.36 b 0.34 0.34 c 0.25 d 0.05 0 0 0.36 0.64 0 1.0 1 1 0.3 1 Individual codewords may be longer in Huffman than Shannon but not the average Jan 2008 58 Shannon Competitive Optimality • l(x) is length of a uniquely decodable code • lS ( x) = ⎡− log p ( x) ⎤ is length of Shannon code then p(l (x ) ≤ lS (x ) − c ) ≤ 21−c Proof: Define A = {x : p ( x) < 2 −l ( x ) −c +1} x with especially short l(x) p (l (x ) ≤ ⎡− log p (x ) ⎤ − c ) ≤ p (l (x ) < − log p (x ) − c + 1) = p ( x ∈ A) = ∑ p( x) ≤ x∈ A ∑ max( p( x) | x ∈ A) < ∑ 2 x∈ A x∈ A ≤ ∑ 2− l ( x ) − c +1 = 2− ( c −1) ∑ 2− l ( x ) ≤ 2− ( c −1) x now over all x − l ( x ) − c +1 Kraft inequality x No other symbol code can do much better than Shannon code most of the time Jan 2008 59 Dyadic Competitive optimality ⇒ Shannon is optimal If p is dyadic ⇔ log p(xi) is integer, ∀i then p(l (x ) < lS (x ) ) ≤ p(l (x ) > lS (x ) ) with equality iff l(x) ≡ lS(x) Proof: sgn(x)={–1,0,+1} for {x<0, x=0, x>0} – Note: sgn(i) ≤ 2i –1 for all integers i – Define equality iff i=0 or 1 p(lS (x ) > l (x ) ) − p (lS (x ) < l (x ) ) = ∑ p ( x) sgn (lS ( x) − l ( x) ) ( A x ) ≤ ∑ p ( x) 2l S ( x ) − l ( x ) − 1 = −1 + ∑ 2− l S ( x ) 2l S ( x ) − l ( x ) x = −1 + ∑ 2 B −l ( x) ≤ −1 + 1 = 0 x Rival code cannot be shorter than Shannon more than half the time. x sgn() property dyadic ⇒ p=2–l Kraft inequality equality @ A ⇒ l(x) = lS(x) – {0,1} but l(x) < lS(x) would violate Kraft @ B since Shannon has Σ=1 Jan 2008 60 Shannon with wrong distribution If the real distribution of x is p but you assign Shannon lengths using the distribution q what is the penalty ? 
Answer: D(p||q) Proof: E l ( X ) = ∑ pi ⎡− log qi ⎤ < ∑ pi (1 − log qi ) i i ⎛ ⎞ p = ∑ pi ⎜⎜1 + log i − log pi ⎟⎟ qi i ⎝ ⎠ = 1 + D(p || q) + H (p) Therefore H (p) + D (p || q) ≤ E l (x ) < H (p) + D(p || q) + 1 If you use the wrong distribution, the penalty is D(p||q) Proof of lower limit is similar but without the 1 Jan 2008 61 Summary • Any uniquely decodable code: E l (x ) ≥ H D (x ) = • Fano Code: H D (x ) ≤ E l (x ) ≤ H D (x ) + 1 H (x ) log 2 D – Intuitively natural top-down design • Huffman Code: – Bottom-up design – Optimal ⇒ at least as good as Shannon/Fano • Shannon Code: li = ⎡− log D pi ⎤ H D (x ) ≤ E l (x ) ≤ H D (x ) + 1 – Close to optimal and easier to prove bounds Note: Not everyone agrees on the names of Shannon and Fano codes Jan 2008 62 Lecture 5 • • • • Stochastic Processes Entropy Rate Markov Processes Hidden Markov Processes Jan 2008 63 Stochastic process Stochastic Process {xi} = x1, x2, … often Entropy: H ({x }) = H (x ) + H (x | x ) + … = ∞ 1 2 1 i 1 n →∞ n Entropy Rate: H ( X ) = lim H (x 1:n ) if limit exists – Entropy rate estimates the additional entropy per new sample. – Gives a lower bound on number of code bits per sample. – If the xi are not i.i.d. the entropy rate limit may not exist. Examples: – xi i.i.d. random variables: H ( X ) = H (x i ) – xi indep, H(xi) = 0 1 00 11 0000 1111 00000000 … no convergence Jan 2008 64 Lemma: Limit of Cesàro Mean 1 n ∑ ak → b n k =1 an → b ⇒ Proof: • Choose • Set • Then ε >0 N 0 such that | an − b |< ½ε and find ∀n > N 0 N1 = 2 N 0ε −1 max(| ar − b |) for r ∈ [1, N 0 ] ∀n > N1 n −1 n ∑a k =1 k −b = n −1 N0 ∑a k =1 k −b +n ( −1 n ∑a k = N 0 +1 k −b ) ≤ N1−1 N 0 ½ N 0−1 N1ε + n −1n(½ε ) = ½ ε + ½ε = ε The partial means of ak are called Cesàro Means Jan 2008 65 Stationary Process Stochastic Process {xi} is stationary iff p (x 1:n = a1:n ) = p (x k + (1:n ) = a1:n ) ∀k , n, ai ∈ X If {xi} is stationary then H(X) exists and 1 H ( X ) = lim H (x 1:n ) = lim H (x n | x 1:n −1 ) n →∞ n n →∞ Proof: (a ) (b) 0 ≤ H (x n | x 1:n −1 ) ≤ H (x n | x 2:n −1 ) = H (x n −1 | x 1:n − 2 ) (a) conditioning reduces entropy, (b) stationarity Hence H(xn|x1:n-1) is +ve, decreasing ⇒ tends to a limit, say b Hence from Cesàro Mean lemma: H (x k | x 1:k −1 ) → b ⇒ 1 1 n H (x 1:n ) = ∑ H (x k | x 1:k −1 ) → b = H ( X ) n n k =1 Jan 2008 66 Block Coding If xi is a stochastic process – encode blocks of n symbols – 1-bit penalty of Shannon/Huffman is now shared between n symbols n −1 H (x 1:n ) ≤ n −1 E l (x 1:n ) ≤ n −1 H (x 1:n ) + n −1 If entropy rate of xi exists (⇐ xi is stationary) n −1H (x 1:n ) → H ( X ) ⇒ n −1E l (x 1:n ) → H ( X ) The extra 1 bit inefficiency becomes insignificant for large blocks Jan 2008 67 Block Coding Example 1 n-1E l X = [A;B], px = [0.9; 0.1] H ( xi ) = 0.469 0.8 0.6 0.4 B 0.1 1 4 6 8 10 12 n • n=1 sym A prob 0.9 code 0 • n=2 sym AA AB BA BB prob 0.81 0.09 0.09 0.01 code 0 11 100 101 • n=3 sym prob code AAA 0.729 0 2 n −1E l = 1 AAB 0.081 101 … … … n −1E l = 0.645 BBA BBB 0.009 0.001 10010 10011 n −1E l = 0.583 Jan 2008 68 Markov Process Discrete-valued Stochastic Process {xi} is • Independent iff p(xn|x0:n–1)=p(xn) • Markov iff p(xn|x0:n–1)=p(xn|xn–1) – time-invariant iff p(xn=b|xn–1=a) = pab indep of n – Transition matrix: T = {tab} • Rows sum to 1: T1 = 1 where 1 is a vector of 1’s • pn = TTpn–1 1 • Stationary distribution: p$ = TTp$ t13 3 t12 2 t14 t34 t43 t24 4 Independent Stochastic Process is easiest to deal with, Markov is next easiest Jan 2008 69 Stationary Markov Process If a Markov process is a) irreducible: 
you can go from any a to any b in a finite number of steps • irreducible iff (I+TT)|X|–1 has no zero entries b) aperiodic: ∀a, the possible times to go from a to a have highest common factor = 1 then it has exactly one stationary distribution, p$. T – p$ is the eigenvector of TT with λ = 1: T p $ = p $ 1pT$ where 1 = [1 1 1]T – T n n→ →∞ Jan 2008 71 Chess Board H(p )=0, 1 Random Walk H(p | p )=0 • Move ÙÚ ÛÜÝÞ equal prob • p1 = [1 0 … 0]T – H(p1) = 0 • p$ = 1/40 × [3 5 3 5 8 5 3 5 3]T – H(p$) = 3.0855 • H ( X ) = lim H (x n | x n −1 ) = ∑ − p$,i ti , j log(ti , j ) n →∞ i, j H ( X ) = 2.2365 Time-invariant and p1 = p$ ⇒ stationary – 1 0 Jan 2008 72 Chess Board Frames H(p )=0, 1 H(p )=3.111, 5 H(p | p )=0 1 0 H(p | p )=2.30177 5 4 H(p )=1.58496, H(p | p )=1.58496 H(p )=3.10287, H(p | p )=2.54795 H(p )=2.99553, H(p | p )=2.09299 H(p )=3.07129, H(p | p )=2.20683 H(p )=3.09141, H(p | p )=2.24987 H(p )=3.0827, H(p | p )=2.23038 2 6 2 6 1 5 3 7 3 7 2 6 4 8 4 8 Jan 2008 3 7 73 ALOHA Wireless Example M users share wireless transmission channel – For each user independently in each timeslot: • if its queue is non-empty it transmits with prob q • a new packet arrives for transmission with prob p – If two packets collide, they stay in the queues – At time t, queue sizes are xt = (n1, …, nM) • {xt} is Markov since p(xt) depends only on xt–1 Transmit vector is yt : ⎧0 p ( yi ,t = 1) = ⎨ ⎩q xi ,t = 0 xi ,t > 0 – {yt} is not Markov since p(yt) is determined by xt but is not determined by yt–1. {yt} is called a Hidden Markov Process. Jan 2008 75 ALOHA example Waiting Packets h g f TX en ) o e n d c ) y x: 3 0 2 1 1 1 2 1 1 1 0 2 1 2 2 1 TX enable, e: 1 1 0 1 0 0 1 0 1 1 1 1 1 0 1 0 y: 1 0 0 1 0 0 1 0 1 1 0 1 1 0 1 0 = (x>0)e is a deterministic function of the Markov [x; e] Jan 2008 76 Hidden Markov Process If {xi} is a stationary Markov process and y=f(x) then {yi} is a stationary Hidden Markov process. What is entropy rate H(Y) ? – Stationarity ⇒ H (y n | y 1:n −1 ) ≥ H (Y) and → H (Y) – Also (1) n →∞ (2) H (y n | y 1:n −1 , x 1 ) ≤ H (Y) and → H (Y) n →∞ So H(Y) is sandwiched between two quantities which converge to the same value for large n. Proof of (1) and (2) on next slides Jan 2008 77 Hidden Markov Process – (1) Proof (1): H (y n | y 1:n −1 , x 1 ) ≤ H (Y) H (y n | y 1:n −1 , x 1 ) = H (y n | y 1:n −1 , x − k :1 ) ∀k x markov = H (y n | y 1:n −1 , x − k :1 , y − k :1 ) = H (y n | y − k :n −1 , x − k :1 ) ≤ H (y n | y − k :n −1 ) ∀k y=f(x) conditioning reduces entropy = H (y k + n | y 0:k + n −1 ) → H (Y) y stationary k →∞ Just knowing x1 in addition to y1:n–1 reduces the conditional entropy to below the entropy rate. Jan 2008 78 Hidden Markov Process – (2) Proof (2): H (y n | y 1:n −1 , x 1 ) → H (Y) n →∞ k Note that ∑ I (x ; y n =1 Hence So 1 n | y 1:n −1 ) = I (x 1; y 1:k ) ≤ H (x 1 ) I (x 1; y n | y 1:n −1 ) → 0 n→∞ chain rule defn of I(A;B) bounded sum of non-negative terms H (y n | y 1:n −1 , x 1 ) = H (y n | y 1:n −1 ) − I (x 1; y n | y 1:n −1 ) → H ( Y) − 0 n →∞ The influence of x1 on yn decreases over time. 
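As a numerical check of the chessboard random walk above, the sketch below (my own code; it assumes, as the stationary weights 3, 5 and 8 suggest, that the piece moves like a king to any adjacent square with equal probability) builds the 9-state transition matrix, confirms that p$ is proportional to the number of neighbours of each square, and evaluates H(p$) ≈ 3.0855 bits and the entropy rate H(X) = Σ −p$,i ti,j log ti,j ≈ 2.2365 bits, matching the figures quoted on the slide.

```python
from math import log2

def king_walk_3x3():
    """Transition matrix of a random walk on a 3x3 board where the piece
    moves to any of its (up to 8) neighbouring squares with equal prob."""
    squares = [(r, c) for r in range(3) for c in range(3)]
    def neighbours(sq):
        r, c = sq
        return [(r + dr, c + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0) and 0 <= r + dr < 3 and 0 <= c + dc < 3]
    T = [[0.0] * 9 for _ in range(9)]
    for i, sq in enumerate(squares):
        nbrs = neighbours(sq)
        for nb in nbrs:
            T[i][squares.index(nb)] = 1.0 / len(nbrs)
    return T

T = king_walk_3x3()
# stationary distribution: for a random walk on a graph it is proportional
# to the number of neighbours of each square, here (3,5,3,5,8,5,3,5,3)/40
deg = [sum(1 for p in row if p > 0) for row in T]
pi = [d / sum(deg) for d in deg]
# check stationarity: pi T = pi
pi_T = [sum(pi[i] * T[i][j] for i in range(9)) for j in range(9)]
assert max(abs(a - b) for a, b in zip(pi, pi_T)) < 1e-12

H_pi = -sum(p * log2(p) for p in pi)
H_rate = sum(pi[i] * sum(-t * log2(t) for t in T[i] if t > 0) for i in range(9))
print(f"H(p$) = {H_pi:.4f} bits")    # expect 3.0855
print(f"H(X)  = {H_rate:.4f} bits")  # expect 2.2365 (entropy rate)
```

The last two lines give H(p$) and the entropy rate of any stationary Markov chain once its transition matrix and stationary distribution are known.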
defn of I(A;B) Jan 2008 79 Summary • Entropy Rate: 1 H (x 1:n ) n →∞ n H ( X ) = lim if it exists – {xi} stationary: H ( X ) = lim H (x n | x 1:n −1 ) n →∞ – {xi} stationary Markov: H ( X ) = H (x n | x n −1 ) = ∑ − p$,i ti , j log(ti , j ) – y = f(x): Hidden Markov: i, j H (y n | y 1:n −1 , x 1 ) ≤ H (Y) ≤ H (y n | y 1:n −1 ) with both sides tending to H(Y) Jan 2008 80 Lecture 6 • Stream Codes • Arithmetic Coding • Lempel-Ziv Coding Jan 2008 81 Huffman: Good and Bad Good – Shortest possible symbol code Bad H (x ) H (x ) ≤ LS ≤ +1 log 2 D log 2 D – Redundancy of up to 1 bit per symbol • Expensive if H(x) is small • Less so if you use a block of N symbols • Redundancy equals zero iff p(xi)=2–k(i) ∀i – Must recompute entire code if any symbol probability changes • A block of N symbols needs |X|N pre-calculated probabilities Jan 2008 82 Arithmetic Coding • Take all possible blocks of N symbols and sort into lexical order, xr for r=1: |X|N • Calculate cumulative probabilities in binary: Qr = ∑ p (x i ), Q0 = 0 i≤r • To encode xr, transmit enough binary places to define the interval (Qr–1, Qr) unambiguously. • Use first lr places of mr 2 − lr where lr is least integer with Qr −1 ≤ mr 2 − l < (mr + 1)2 − l ≤ Qr r r X=[a b], p=[0.6 0.4], N=3 Code Jan 2008 83 Arithmetic Coding – Code lengths Δ −d • The interval corresponding to xr has width p ( xr ) = Qr − Qr −1 = 2 r k r = ⎡d r ⎤ ⇒ d r ≤ k r < d r + 1 ⇒ ½ p( xr ) < 2 − k ≤ p( xr ) Qr–1 rounded up to kr bits • Set mr = ⎡2 k Qr −1 ⎤ ⇒ Qr −1 ≤ mr 2 − k • Define r r • If r (mr + 1)2 − k ≤ Qr r then set lr = kr; otherwise ⎡ – set lr = k r + 1 and redefine mr = 2lr Qr −1 – now ⎤ ⇒ (mr − 1)2 −l < Qr −1 ≤ mr 2 −l r r (mr + 1)2 − l = (mr − 1)2 − l + 2 − k < Qr −1 + p( xr ) = Qr r r r • We always have lr ≤ k r + 1 < d r + 2 = − log( pr ) + 2 – Always within 2 bits of the optimum code for the block (kr is Shannon len) d 6,7 = − log p ( x6,7 ) = − log 0.096 = 3.38 bits Q7 = 0.9360 = 0.111011 bbb m8=1111 bba m7=11011 Q6 = 0.8400 = 0.110101 bab Q5 = 0.7440 = 0.101111 m6=1100 k7 = 4, m7 = 14, 15 × 2 −4 > Q7 8 ⇒ l7 = 5, m7 = 27, 28 × 2 −5 ≤ Q7 9 k6 = 4, m6 = 12, 13 × 2 −4 < Q6 Jan 2008 9 84 Arithmetic Coding - Advantages • Long blocks can be used – Symbol blocks are sorted lexically rather than in probability order – Receiver can start decoding symbols before the entire code has been received – Transmitter and receiver can work out the codes on the fly • no need to store entire codebook • Transmitter and receiver can use identical finiteprecision arithmetic – rounding errors are the same at transmitter and receiver – rounding errors affect code lengths slightly but not transmission accuracy Jan 2008 85 Arithmetic Coding Receiver Qr probabilities b bbb bab babbaa bb b bba bab ba baab baa abb 1 ab 1 0 0 1 1 1 abbb abba aab 0.111110 0.111011 0.111001 0.110101 0.110011 0.7440 = 0.101111 0.6864 = 0.101011 0.6000 = 0.100110 0.5616 = 0.100011 0.5040 = 0.100000 0.4464 = 0.011100 abaa aabb a 0.9744 = 0.9360 = 0.8976 = 0.8400 = 0.8016 = baaa abab aba Each additional bit received narrows down the possible interval. 
bbba bbab bbaa babb baba 0.3600 = 0.010111 0.3024 = 0.010011 aaba 0.2160 = 0.001101 aa aaab 0.1296 = 0.001000 aaa aaaa X=[a b], p=[0.6 0.4] 0.0000 = 0.000000 Jan 2008 86 Arithmetic Coding/Decoding b a b b a a a Transmitter Min Max 00000000 11111111 10011001 11111111 10011001 11010111 10111110 11010111 11001101 11010111 11001101 11010011 11001101 11010000 11001101 11001111 b 11001110 11001111 … … … Input • • Send 1 10 011 1 Min 00000000 Receiver Test 10011001 10011001 10011001 10011001 11010111 10011001 10111110 11001101 11001101 11001101 … 11010111 10111110 11001101 11010011 11010000 11001111 … Max Output 11111111 11111111 11010111 11010111 11010111 11010011 11010000 … b a b b a a … Min/Max give the limits of the input or output interval; identical in transmitter and receiver. Blue denotes transmitted bits - they are compared with the corresponding bits of the receiver’s test value and Red bit show the first difference. Gray identifies unchanged words. Jan 2008 87 Arithmetic Coding Algorithm Input Symbols: X = [a b], p = [p q] [min , max] = Input Probability Range Note: only keep untransmitted bits of min and max Coding Algorithm: Initialize [min , max] = [000…0 , 111…1] For each input symbol, s If s=a then max=min+p(max–min) else min=min+p(max–min) while min and max have the same MSB transmit MSB and set min=(min<<1) and max=(max<<1)+1 end while end for • Decoder is almost identical. Identical rounding errors ⇒ no symbol errors. • Simple to modify algorithm for |X|>2 and/or D>2. • Need to protect against range underflow when [x y] = [011111…, 100000…]. Jan 2008 88 Adaptive Probabilities Number of guesses for next letter (a-z, space): o r a n g e s a n d l e mo n s 17 7 8 4 1 1 2 1 1 5 1 1 1 1 1 1 1 1 We can change the input symbol probabilities based on the context (= the past input sequence) Example: Bernoulli source with unknown p. 
Adapt p based on symbol frequencies so far: X = [a b], pn = [1–pn pn], pn = 1 + count ( xi = b) 1≤ i < n 1+ n Jan 2008 89 Adaptive Arithmetic Coding 1.0000 = 0.000000 pn = 1 + count ( xi = b) 1≤ i < n 1+ n bbb bbbb bb bbba b bba p1 = 0.5 p2 = 1/3 or 2/3 p3 = ¼ or ½ or ¾ p4 = … bab ba baa abb ab aba aab a bbab bbaa babb baba baab baaa abbb abba abab abaa aabb aaba aaab 0.8000 = 0.110011 0.7500 = 0.110000 0.7000 = 0.101100 0.6667 = 0.101010 0.6167 = 0.100111 0.5833 = 0.100101 0.5500 = 0.100011 0.5000 = 0.011111 0.4500 = 0.011100 0.4167 = 0.011010 0.3833 = 0.011000 0.3333 = 0.010101 0.3000 = 0.010011 0.2500 = 0.010000 0.2000 = 0.001100 aa aaa aaaa 0.0000 = 0.000000 Coder and decoder only need to calculate the probabilities along the path that actually occurs Jan 2008 90 Lempel-Ziv Coding Memorize previously occurring substrings in the input data – parse input into the shortest possible distinct ‘phrases’ – number the phrases starting from 1 (0 is the empty string) 1011010100010… 12_3_4__5_6_7 – each phrase consists of a previously occurring phrase (head) followed by an additional 0 or 1 (tail) – transmit code for head followed by the additional bit for tail 01001121402010… – for head use enough bits for the max phrase number so far: 100011101100001000010… – decoder constructs an identical dictionary prefix codes are underlined Jan 2008 91 Lempel-Ziv Example Input = 1011010100010010001001010010 Dictionary 000 00 0 0000 001 01 1 0001 010 10 0010 011 11 0011 0100 100 0101 101 0110 110 111 0111 1000 1001 φ 1 0 11 01 010 00 10 0100 01001 Send Decode 1 00 011 101 1000 0100 0010 1010 10001 10010 1 0 11 01 010 00 10 0100 01001 010010 Improvement • Each head can only be used twice so at its second use we can: – Omit the tail bit – Delete head from the dictionary and re-use dictionary entry Jan 2008 92 LempelZiv Comments Dictionary D contains K entries D(0), …, D(K–1). We need to send M=ceil(log K) bits to specify a dictionary entry. Initially K=1, D(0)= φ = null string and M=ceil(log K) = 0 bits. Input 1 0 1 1 0 1 0 1 0 Action “1” ∉D so send “1” and set D(1)=“1”. Now K=2 ⇒ M=1. “0” ∉D so split it up as “φ”+”0” and send “0” (since D(0)= φ) followed by “0”. Then set D(2)=“0” making K=3 ⇒ M=2. “1” ∈ D so don’t send anything yet – just read the next input bit. “11” ∉D so split it up as “1” + “1” and send “01” (since D(1)= “1” and M=2) followed by “1”. Then set D(3)=“11” making K=4 ⇒ M=2. “0” ∈ D so don’t send anything yet – just read the next input bit. “01” ∉D so split it up as “0” + “1” and send “10” (since D(2)= “0” and M=2) followed by “1”. Then set D(4)=“01” making K=5 ⇒ M=3. “0” ∈ D so don’t send anything yet – just read the next input bit. “01” ∈ D so don’t send anything yet – just read the next input bit. “010” ∉D so split it up as “01” + “0” and send “100” (since D(4)= “01” and M=3) followed by “0”. Then set D(5)=“010” making K=6 ⇒ M=3. So far we have sent 1000111011000 where dictionary entry numbers are in red. Jan 2008 93 Lempel-Ziv properties • Widely used – many versions: compress, gzip, TIFF, LZW, LZ77, … – different dictionary handling, etc • Excellent compression in practice – many files contain repetitive sequences – worse than arithmetic coding for text files • Asymptotically optimum on stationary ergodic source (i.e. 
achieves entropy rate) – {Xi} stationary ergodic ⇒ lim sup n −1l ( X 1:n ) ≤ H ( X ) with prob 1 • Proof: C&T chapter 12.10 n →∞ – may only approach this for an enormous file Jan 2008 94 Summary • Stream Codes – Encoder and decoder operate sequentially • no blocking of input symbols required – Not forced to send ≥1 bit per input symbol • can achieve entropy rate even when H(X)<1 • Require a Perfect Channel – A single transmission error causes multiple wrong output symbols – Use finite length blocks to limit the damage Jan 2008 95 Lecture 7 • Markov Chains • Data Processing Theorem – you can’t create information from nothing • Fano’s Inequality – lower bound for error in estimating X from Y Jan 2008 96 Markov Chains If we have three random variables: x, y, z p ( x, y , z ) = p ( z | x, y ) p ( y | x ) p ( x ) they form a Markov chain x→y→z if p ( z | x, y ) = p ( z | y ) ⇔ p ( x, y , z ) = p ( z | y ) p ( y | x ) p ( x ) A Markov chain x→y→z means that – the only way that x affects z is through the value of y – if you already know y, then observing x gives you no additional information about z, i.e. I ( x ; z | y ) = 0 ⇔ H (z | y ) = H (z | x , y ) – if you know y, then observing z gives you no additional information about x. A common special case of a Markov chain is when z = f(y) Jan 2008 97 Markov Chain Symmetry Iff x→y→z p ( x, y , z ) ( a ) p ( x, y ) p ( z | y ) p ( x, z | y ) = = = p( x | y ) p( z | y ) p( y) p( y) (a) p ( z | x, y ) = p ( z | y ) Hence x and z are conditionally independent given y Also x→y→z iff z→y→x since p ( z | y ) p( y ) (a) p ( x, z | y ) p ( y ) p ( x, y, z ) p( x | y) = p( x | y) = = p( y, z ) p ( y, z ) p( y, z ) = p( x | y, z ) (a) p( x, z | y ) = p ( x | y ) p ( z | y ) Markov chain property is symmetrical Jan 2008 98 Data Processing Theorem If x→y→z then I(x ;y) ≥ I(x ; z) – processing y cannot add new information about x If x→y→z then I(x ;y) ≥ I(x ; y | z) – Knowing z can only decrease the amount x tells you about y Proof: I (x ; y , z ) = I (x ; y ) + I (x ; z | y ) = I (x ; z ) + I (x ; y | z ) (a) but I (x ; z | y ) = 0 hence I (x ; y ) = I (x ; z ) + I (x ; y | z ) so I ( x ; y ) ≥ I ( x ; z ) and I ( x ; y ) ≥ I ( x ; y | z ) (a) I(x ;z)=0 iff x and z are independent; Markov ⇒ p(x,z |y)=p(x |y)p(z |y) Jan 2008 99 Non-Markov: Conditioning can increase I Noisy Channel: z =x +y – X=Y=[0,1]T pX=pY=[½,½]T – I(x ;y)=0 since independent – but I(x ;y | z)=½ H(x |z) = H(y |z) = H(x, y |z) = 0×¼+1×½+0×¼ = ½ since in each case z≠1 ⇒ H()=0 I(x; y |z) = H(x |z)+H(y |z)–H(x, y |z) = ½+½–½ = ½ x z + y XY 00 01 10 11 0 ¼ Z 1 2 ¼ ¼ ¼ If you know z, then x and y are no longer independent Jan 2008 100 Long Markov Chains If x1→ x2 → x3 → x4 → x5 → x6 then Mutual Information increases as you get closer together: – e.g. I(x3, x4) ≥ I(x2, x4) ≥ I(x1, x5) ≥ I(x1, x6) Jan 2008 101 Sufficient Statistics If pdf of x depends on a parameter θ and you extract a statistic T(x) from your observation, then θ → x → T (x ) ⇒ I (θ ; T (x )) ≤ I (θ ; x ) T(x) is sufficient for θ if the stronger condition: θ → x → T (x ) → θ ⇔ I (θ ; T (x )) = I (θ ; x ) ⇔ θ → T (x ) → x → θ ⇔ p (x | T (x ),θ ) = p (x | T (x )) Example: xi ~ Bernoulli(θ ), n T (x 1:n ) = ∑ x i i =1 p (X 1:n ⎧⎪ n Ck −1 if = x1:n | θ , ∑ x i = k ) = ⎨ if ⎪⎩ 0 ∑x ∑x i =k i ≠k independent of θ ⇒ sufficient Jan 2008 102 Fano’s Inequality If we estimate x from y, what is pe = p(xˆ ≠ x ) ? 
H (x | y ) ≤ H ( pe ) + pe log(| X | −1) ⇒ ( H (x | y ) − H ( pe ) ) (a) (H (x | y ) − 1) pe ≥ ≥ log(| X | −1) log(| X | −1) x y x^ (a) the second form is weaker but easier to use Proof: Define a random variable e = (xˆ ≠ x ) ?1 : 0 H (e , x | y ) = H (x | y ) + H (e | x , y ) = H (e | y ) + H (x | e , y ) chain rule H≥0; H(e |y)≤H(e) ⇒ H (x | y ) + 0 ≤ H (e ) + H (x | e , y ) = H (e ) + H (x | y , e = 0)(1 − pe ) + H (x | y , e = 1) pe ≤ H ( pe ) + 0 × (1 − pe ) + log(| X | −1) pe H(e)=H(p ) e Fano’s inequality is used whenever you need to show that errors are inevitable Jan 2008 103 Fano Example X = {1:5}, px = [0.35, 0.35, 0.1, 0.1, 0.1]T Y = {1:2} if x ≤2 then y =x with probability 6/7 while if x >2 then y =1 or 2 with equal prob. Our best strategy is to guess xˆ = y – px |y=1 = [0.6, 0.1, 0.1, 0.1, 0.1]T – actual error prob: pe = 0.4 Fano bound: pe ≥ H (x | y ) − 1 1.771 − 1 = = 0.3855 log(| X | −1) log(4) Main use: to show when error free transmission is impossible since pe > 0 Jan 2008 104 Summary • Markov: x → y → z ⇔ p( z | x, y ) = p( z | y ) ⇔ I (x ; z | y ) = 0 • Data Processing Theorem: if x→y→z then – I(x ; y) ≥ I(x ; z) – I(x ; y) ≥ I(x ; y | z) • Fano’s Inequality: if pe ≥ can be false if not Markov x → y → xˆ then H ( x | y ) − H ( pe ) H ( x | y ) − 1 H ( x | y ) − 1 ≥ ≥ log(| X | −1) log(| X | −1) log | X | weaker but easier to use since independent of pe Jan 2008 105 Lecture 8 • Weak Law of Large Numbers • The Typical Set – Size and total probability • Asymptotic Equipartition Principle Jan 2008 106 Strong and Weak Typicality X = {a, b, c, d}, p = [0.5 0.25 0.125 0.125] –log p = [1 2 3 3] ⇒ H(p) = 1.75 bits Sample eight i.i.d. values • strongly typical ⇒ correct proportions aaaabbcd –log p(x) = 14 = 8×1.75 • [weakly] typical ⇒ log p(x) = nH(x) aabbbbbb –log p(x) = 14 = 8×1.75 • not typical at all ⇒ log p(x) ≠ nH(x) dddddddd –log p(x) = 24 Strongly Typical ⇒ Typical Jan 2008 107 Convergence of Random Numbers • Convergence x n → y ⇒ ∀ε > 0, ∃m such that ∀n > m, | x n − y |< ε n →∞ Example: x n = ±2 – n , p = [½; ½] choose m = 1 − log ε • Convergence in probability (weaker than convergence) prob xn → y Example: ⇒ ∀ε > 0, P( | x n − y |> ε ) → 0 xn ∈ {0; 1}, p = [1 − n −1 ; n −1 ] →∞ for any small ε , p (| xn |> ε ) = n −1 ⎯n⎯ ⎯→ 0 Note: y can be a constant or another random variable Jan 2008 108 Weak law of Large Numbers Given i.i.d. {xi} ,Cesáro mean – E sn = E x = μ 1 n sn = ∑xi n i =1 Var s n = n −1 Var x = n −1σ 2 As n increases, Var sn gets smaller and the values become clustered around the mean WLLN: prob sn → μ ⇔ ∀ε > 0, P(| s n − μ |> ε ) → 0 n →∞ The “strong law of large numbers” says that convergence is actually almost sure provided that X has finite variance Jan 2008 109 Proof of WLLN • Chebyshev’s Inequality 2 2 Var y = E (y − μ ) = ∑ ( y − μ ) p( y ) y∈Y ≥ ∑μ (εy − μ ) 2 p( y ) ≥ y:| y − | > • WLLN ∑μ εε 2 p( y) = ε 2 p( y − μ > ε ) y:| y − | > For any choice of ε 1 n s n = ∑ x i where E x i = μ and Var x i = σ 2 n i =1 2 ε 2 p ( s n − μ > ε ) ≤ Var s n = σ prob Hence sn → μ →0 n n →∞ Actually true even if σ = ∞ Jan 2008 110 Typical Set xn is the i.i.d. sequence {xi} for 1 ≤ i ≤ n n – Prob of a particular sequence is p( x ) = ∏ p(x i ) i =1 – E − log p( x ) = n E − log p( xi ) = nH (x ) – Typical set: Tε( n ) = {x ∈ X n : − n −1 log p( x ) − H (x ) < ε } Example: N=4, =41% N=128, p=0.2,e=0.1, e=0.1,ppTpT=29% =85% N=8, N=2,p=0.2, N=64, =0% =72% N=1, N=32, p=0.2, e=0.1, N=16, =45% T=62% T – xi Bernoulli with p(xi =1)=p – e.g. 
p([0 1 1 0 0 0])=p2(1–p)4 – For p=0.2, H(X)=0.72 bits – Red bar shows T0.1(n) -2.5 -2 -1.5 -1 -1 N log p(x) -0.5 0 Jan 2008 111 Typical Set Frames N=1, p=0.2, e=0.1, pT=0% -2.5 -2 -1.5 -1 N-1log p(x) N=2, p=0.2, e=0.1, pT=0% -0.5 0 -2.5 N=16, p=0.2, e=0.1, pT=45% -2.5 -2 -1.5 -1 N-1log p(x) -2 -1.5 -1 N-1log p(x) N=4, p=0.2, e=0.1, pT=41% -0.5 0 -2.5 0 -2.5 -2 -1.5 -1 N-1log p(x) -0.5 -1.5 -1 N-1log p(x) -0.5 0 -2.5 N=64, p=0.2, e=0.1, pT=72% N=32, p=0.2, e=0.1, pT=62% -0.5 -2 N=8, p=0.2, e=0.1, pT=29% 0 -2.5 -2 -1.5 -1 N-1log p(x) -0.5 -2 -1.5 -1 N-1log p(x) -0.5 0 N=128, p=0.2, e=0.1, pT=85% 0 -2.5 -2 -1.5 -1 N-1log p(x) -0.5 0 1111, 00110011, 11, 0010001100010010, 1, 010,1101, 00 00110010, 1010, 0100, 0001001010010000, 00010001, 0000 00010000, 0001000100010000, 00000000 0000100001000000, 0000000010000000 Jan 2008 112 Typical Set: Properties 1. Individual prob: x ∈ Tε( n ) ⇒ log p( x ) = − nH (x ) ± nε 2. Total prob: 3. Size: p ( x ∈ Tε( n ) ) > 1 − ε for n > N ε (1 − ε )2 n ( H ( x ) −ε ) n> Nε < Tε( n ) ≤ 2 n ( H ( x ) +ε ) Proof 2: − n −1 log p( x ) = n −1 n − log p( x ) prob → E − log p ( xi ) = H (x ) ∑ i i =1 Hence ∀ε > 0 ∃N ε s.t. ∀n > N ε Proof 3a: Proof 3b: p ( − n −1 log p ( x ) − H (x ) > ε ) < ε f.l.e. n, 1 − ε < p ( x ∈ Tε( n ) ) ≤ 1 = ∑ p( x ) ≥ x ∑2 x∈Tε( n ) ∑ p( x ) ≥ ∑ 2 x∈Tε( n ) − n ( H ( x ) −ε ) x∈Tε( n ) − n ( H ( x ) +ε ) = 2 − n ( H ( x ) −ε ) Tε( n ) = 2 − n ( H ( x ) +ε ) Tε( n ) Jan 2008 113 Asymptotic Equipartition Principle • for any ε and for n > Nε “Almost all events are almost equally surprising” (n ) • p( x ∈ Tε ) > 1 − ε and log p ( x ) = − nH (x ) ± nε Coding consequence |X|n elements – x ∈ Tε(n ) : ‘0’ + at most 1+n(H+ε) bits (n ) – x ∉ Tε : ‘1’ + at most 1+nlog|X| bits – L = Average code length ≤ 2 + n( H + ε ) + ε (n log X ) ( = n H + ε + ε log X + 2n −1 ) ≤ 2 n ( H ( x ) +ε ) elements Jan 2008 114 Source Coding & Data Compression For any choice of δ > 0, we can, by choosing block size, n, large enough, do either of the following: • make a lossless code using only H(x)+δ bits per symbol on average: L ≤ n H + ε + ε log X + 2n −1 ( ) • make a code with an error probability < ε using H(x)+ δ bits for each symbol – just code Tε using n(H+ε+n–1) bits and use a random wrong code if x∉Tε ( Jan 2008 ) 115 What N ε ensures that p x ∈ Tε(n ) > 1 − ε ? 2 From WLLN, if Var(− log p ( xi ) ) = σ then for any n and ⎛1 n ⎞ σ2 ⎜ ε p⎜ ∑ − log p( xi ) − H ( X ) > ε ⎟⎟ ≤ ⎝ n i =1 ⎠ n Choose N ε = σ 2ε −3 ⇒ ( ⇒ 2 p x ∉ Tε ( (n) σ2 ) ≤ nε 2 ε Chebyshev ) p x ∈ Tε( n ) > 1 − ε Nε increases radidly for small ε For this choice of Nε , if x∈Tε(n) log p( x ) = − nH (x ) ± nε = −nH (x ) ± σ 2ε −2 So within Tε(n), p(x) can vary by a factor of 2 2σ 2ε −2 Within the Typical Set, p(x) can actually vary a great deal when ε is small Jan 2008 116 Smallest high-probability Set Tε(n) is a small subset of Xn containing most of the probability mass. Can you get even smaller ? 
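Before answering that question, the pT percentages quoted in the typical-set frames above are easy to verify. The sketch below (my own helper; the parameter names are illustrative) enumerates, for the i.i.d. Bernoulli(0.2) source with ε = 0.1, the exact probability that a length-n sequence falls in Tε(n). The values reproduce the 0%, 41%, 29%, …, 85% figures and eventually climb towards 1 (non-monotonically at small n); a small tolerance is added because some sequences sit exactly on the ±ε boundary, which the frames evidently count as typical.

```python
from math import comb, log2

p, eps = 0.2, 0.1
H = -(p * log2(p) + (1 - p) * log2(1 - p))        # ≈ 0.7219 bits

def prob_typical(n):
    """P( x ∈ T_eps^(n) ) for an i.i.d. Bernoulli(p) sequence of length n:
    sum the binomial probabilities of the k-values (number of ones) whose
    per-symbol log-probability lies within eps of H.  A tiny tolerance is
    added because some k sit exactly on the boundary."""
    total = 0.0
    for k in range(n + 1):
        per_symbol = -(k * log2(p) + (n - k) * log2(1 - p)) / n
        if abs(per_symbol - H) < eps + 1e-9:
            total += comb(n, k) * p**k * (1 - p)**(n - k)
    return total

for n in (1, 2, 4, 8, 16, 32, 64, 128):
    print(f"n = {n:3d}   P(x in T) = {prob_typical(n):.2f}")
# expect roughly 0.00, 0.00, 0.41, 0.29, 0.45, 0.62, 0.72, 0.85 as in the frames
```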
For any 0 < ε < 1, choose N0 = –ε–1log ε , then for any n>max(N0,Nε) and any subset S(n) satisfying S ( n) < 2 n(H ( x ) − 2ε) ( ) ( ) ( p x ∈ S ( n ) = p x ∈ S ( n ) ∩ Tε( n ) + p x ∈ S ( n ) ∩ Tε( n ) ( (n) < S ( n ) max p ( x ) + p x ∈ T ε (n) x∈Tε ) < 2 n ( H − 2ε ) 2 − n ( H − ε ) + ε = 2−nε + ε < 2ε for n > N 0, ) for n > N ε 2 − nε < 2 log ε = ε Answer: No Jan 2008 117 Summary • Typical Set – Individual Prob – Total Prob – Size x ∈ Tε( n ) ⇒ log p ( x ) = − nH (x ) ± nε p ( x ∈ Tε( n ) ) > 1 − ε for n > N ε n > Nε (1 − ε )2 n ( H ( x ) −ε ) < Tε( n ) ≤ 2 n ( H ( x ) +ε ) • No other high probability set can be much smaller than Tε(n ) • Asymptotic Equipartition Principle – Almost all event sequences are equally surprising Jan 2008 118 Lecture 9 • Source and Channel Coding • Discrete Memoryless Channels – Symmetric Channels – Channel capacity • Binary Symmetric Channel • Binary Erasure Channel • Asymmetric Channel Jan 2008 119 Source and Channel Coding Source Coding Channel Coding In Compress Encode Noisy Channel Decode Decompress Out • Source Coding – Compresses the data to remove redundancy • Channel Coding – Adds redundancy to protect against channel errors Jan 2008 120 Discrete Memoryless Channel • Input: x∈X, Output y∈Y x Noisy Channel y • Time-Invariant Transition-Probability Matrix (Q y x ) | i, j = p (y = y j | x = xi ) – Hence p y = Q y |x p x – Q: each row sum = 1, average column sum = |X||Y|–1 T • Memoryless: p(yn|x1:n, y1:n–1) = p(yn|xn) • DMC = Discrete Memoryless Channel Jan 2008 121 Binary Channels ⎛1 − f ⎜⎜ ⎝ f • Binary Symmetric Channel – X = [0 1], Y = [0 1] • Binary Erasure Channel – X = [0 1], Y = [0 ? 1] ⎛1 − f ⎜⎜ ⎝ 0 • Z Channel – X = [0 1], Y = [0 1] f ⎞ ⎟ 1 − f ⎟⎠ f f ⎞ ⎟⎟ ⎠ 0 1 1 0 0 x 0 ⎞ ⎟ 1 − f ⎟⎠ 0 ⎛1 ⎜⎜ ⎝ f 1− f 0 x x y ? y 1 1 0 0 1 1 y Symmetric: rows are permutations of each other; columns are permutations of each other Weakly Symmetric: rows are permutations of each other; columns have the same sum Jan 2008 122 Weakly Symmetric Channels Weakly Symmetric: 1. All columns of Q have the same sum = |X||Y|–1 – If x is uniform (i.e. p(x) = |X|–1) then y is uniform p ( y ) = ∑ p ( y | x) p ( x) = X x∈X 2. −1 ∑ p( y | x) = X −1 ×XY −1 =Y −1 x∈X All rows are permutations of each other – Each row of Q has the same entropy so H (y | x ) = ∑ p ( x) H (y | x = x) = H (Q1,: ) ∑ p ( x) = H (Q1,: ) x∈X x∈X where Q1,: is the entropy of the first (or any other) row of the Q matrix Symmetric: 1. All rows are permutations of each other 2. 
All columns are permutations of each other Symmetric ⇒ weakly symmetric Jan 2008 123 Channel Capacity • Capacity of a DMC channel: C = max I (x ; y ) px – Maximum is over all possible input distributions px – ∃ only one maximum since I(x ;y) is concave in px for fixed py|x – We want to find the px that maximizes I(x ;y) – Limits on C: H(x ,y) 0 ≤ C ≤ min(H (x ), H (y ) ) ≤ min (log X , log Y ) • Capacity for n uses of channel: C (n) = ♦ H(x |y) I(x ;y) H(y |x) H(y) H(x) 1 max I (x 1:n ; y 1:n ) n px 1:n ♦ = proved in two pages time Jan 2008 124 Mutual Information Plot Binary Symmetric Channel Bernoulli Input 0 1–p I(X;Y) 1–f f f x p 1 0 1–f y 1 1 0.5 0 1 -0.5 0 0.8 0.2 0.6 0.4 0.4 0.6 0.2 0.8 1 Input Bernoulli Prob (p) 0 I ( x ; y ) = H (y ) − H ( x | y ) = H ( f − 2 pf + p) − H ( f ) Channel Error Prob (f) Jan 2008 125 Mutual Information Concave in pX Mutual Information I(x;y) is concave in px for fixed py|x Proof: Let u and v have prob mass vectors u and v – Define z: bernoulli random variable with p(1) = λ – Let x = u if z=1 and x=v if z=0 ⇒ px=λu+(1–λ)v I ( x , z ; y ) = I ( x ; y ) + I (z ; y | x ) = I (z ; y ) + I ( x ; y | z ) U but I (z ; y | x ) = H (y | x ) − H (y | x , z ) = 0 so 1 I (x ; y ) ≥ I (x ; y | z ) X V Z 0 p(Y|X) Y z = Deterministic = λI (x ; y | z = 1) + (1 − λ ) I (x ; y | z = 0) = λI (u ; y ) + (1 − λ ) I (v ; y ) Special Case: y=x ⇒ I(x; x)=H(x) is concave in px Jan 2008 126 Mutual Information Convex in pY|X Mutual Information I(x ;y) is convex in py|x for fixed px Proof (b) define u, v, x etc: – p y |x = λpu |x + (1 − λ )pv |x I (x ; y , z ) = I (x ; y | z ) + I (x ; z ) = I (x ; y ) + I (x ; z | y ) but I (x ; z ) = 0 and I ( x ; z | y ) ≥ 0 so z = Deterministic I (x ; y ) ≤ I (x ; y | z ) = λI (x ; y | z = 1) + (1 − λ ) I (x ; y | z = 0) = λI (x ;u ) + (1 − λ ) I (x ;v ) Jan 2008 127 n-use Channel Capacity For Discrete Memoryless Channel: I (x 1:n ; y 1:n ) = H (y 1:n ) − H (y 1:n | x 1:n ) n n = ∑ H (y i | y 1:i −1 ) − ∑ H (y i | x i ) i =1 n n n i =1 i =1 i =1 Chain; Memoryless i =1 ≤ ∑ H (y i ) − ∑ H (y i | x i ) = ∑ I ( x i ; y i ) Conditioning Reduces Entropy with equality if xi are independent ⇒ yi are independent We can maximize I(x;y) by maximizing each I(xi;yi) independently and taking xi to be i.i.d. – We will concentrate on maximizing I(x; y) for a single channel use Jan 2008 128 Capacity of Symmetric Channel If channel is weakly symmetric: I (x ; y ) = H (y ) − H (y | x ) = H (y ) − H (Q1,: ) ≤ log | Y | − H (Q1,: ) with equality iff input distribution is uniform ∴ Information Capacity of a WS channel is log|y|–H(Q1,:) For a binary symmetric channel (BSC): – |Y| = 2 – H(Q1,:) = H(f) – I(x;y) ≤ 1 – H(f) ∴ Information Capacity of a BSC is 1–H(f) 0 1–f f f x 1 0 1–f y 1 Jan 2008 129 Binary Erasure Channel (BEC) ⎛1 − f ⎜⎜ ⎝ 0 f f 0 1− f 1–f 0 ⎞ ⎟⎟ ⎠ 0 f x ? y f 1 I (x ; y ) = H (x ) − H (x | y ) = H (x ) − p (y = 0) × 0 − p(y = ?) 
H (x ) − p (y = 1) × 0 1 1–f H(x |y) = 0 when y=0 or y=1 = H (x ) − H (x ) f = (1 − f ) H (x ) ≤ 1− f since max value of H(x) = 1 with equality when x is uniform since a fraction f of the bits are lost, the capacity is only 1–f and this is achieved when x is uniform Jan 2008 130 Asymmetric Channel Capacity Let px = [a a 1–2a]T ⇒ py=QTpx = px H (y ) = −2a log a − (1 − 2a ) log(1 − 2a ) H (y | x ) = 2aH ( f ) + (1 − 2a ) H (1) = 2aH ( f ) 0: a x To find C, maximize I(x ;y) = H(y) – H(y |x) 1–f 0 f f 1: a 1–f 2: 1–2a 1 y 2 I = −2a log a − (1 − 2a ) log(1 − 2a ) − 2aH ( f ) f 0⎞ ⎛1 − f ⎜ ⎟ dI Q = f 1 − f 0 ⎜ ⎟ = −2 log e − 2 log a + 2 log e + 2 log(1 − 2a ) − 2 H ( f ) = 0 ⎜ 0 da 0 1 ⎟⎠ ⎝ 1 − 2a −1 log = log a −1 − 2 = H ( f ) ⇒ a = (2 + 2 H ( f ) ) Note: a d(log x) = x–1 log e ⇒ C = −2a log a 2 H ( f ) − (1 − 2a ) log(1 − 2a ) = − log(1 − 2a ) ( ) ( Examples: ) f = 0 ⇒ H(f) = 0 ⇒ a = 1/3 ⇒ C = log 3 = 1.585 bits/use f = ½ ⇒ H(f) = 1 ⇒ a = ¼ ⇒ C = log 2 = 1 bits/use Jan 2008 131 Lecture 10 • Jointly Typical Sets • Joint AEP • Channel Coding Theorem – Random Coding – Jointly typical decoding Jan 2008 132 Significance of Mutual Information • Consider blocks of n symbols: x1:n Noisy Channel 2nH(x) 2nH(y|x) 2nH(y) y1:n – An average input sequence x1:n corresponds to about 2nH(y|x) typical output sequences – There are a total of 2nH(y) typical output sequences – For nearly error free transmission, we select a number of input sequences whose corresponding sets of output sequences hardly overlap – The maximum number of distinct sets of output sequences is 2n(H(y)–H(y|x)) = 2nI(y ;x) Channel Coding Theorem: for large n can transmit at any rate < C with negligible errors Jan 2008 133 Jointly Typical Set xyn is the i.i.d. sequence {xi,yi} for 1 ≤ i ≤ n N – Prob of a particular sequence is p (x, y ) = ∏ p ( xi , yi ) i =1 – E − log p( x , y ) = n E − log p( xi , yi ) = nH (x , y ) – Jointly Typical set: { J ε( n ) = x, y ∈ XY n : − n −1 log p (x) − H (x ) < ε , − n −1 log p (y ) − H (y ) < ε , − n −1 log p (x, y ) − H (x , y ) < ε } Jan 2008 134 Jointly Typical Example Binary Symmetric Channel f = 0.2, p x = (0.75 0.25) T ⎛ 0.6 0.15 ⎞ T ⎟⎟ p y = (0.65 0.35) , Pxy = ⎜⎜ 0 . 05 0 . 2 ⎝ ⎠ 0 1–f f f x 1 0 1–f y 1 Jointly Typical example (for any ε): x=11111000000000000000 y=11110111000000000000 all combinations of x and y have exactly the right frequencies Jan 2008 135 Jointly Typical Diagram Each point defines both an x sequence and a y sequence 2nlog|Y| 2nH(Y) Typical in both x, y: 2n(H(x)+H(y)) 2nlog|X| (x, nH 2nH(x) J tly oi n ty a p ic Inner rectangle represents pairs that are typical in x or y but not necessarily jointly typical y) l: 2 2nH(x|y) 2nH(y|x) All sequences: 2nlog(|X||Y|) • • • • Dots represent jointly typical pairs (x,y) There are about 2nH(x) typical x’s in all Each typical y is jointly typical with about 2nH(x|y) of these typical x’s The jointly typical pairs are a fraction 2–nI(x ;y) of the inner rectangle Channel Code: choose x’s whose J.T. y’s don’t overlap; use J.T. for decoding Jan 2008 136 Joint Typical Set Properties (n) 1. Indiv Prob: x, y ∈ J ε ⇒ log p(x , y ) = −nH (x , y ) ± nε 2. Total Prob: p(x , y ∈ J ε( n ) ) > 1 − ε for n > N ε n> N n ( H ( x , y ) −ε ) 3. 
Size: (1 − ε )2 < J ε( n ) ≤ 2 n (H ( x ,y ) +ε ) ε Proof 2: (use weak law of large numbers) ( ) p − n −1 log p( x ) − H (x ) > ε < Choose N1 such that ∀n > N1 , ε 3 Similarly choose N 2 , N 3 for other conditions and set N ε = max( N1 , N 2 , N 3 ) Proof 3: 1 − ε < ∑ p(x, y ) ≤ J ε (n) x , y∈J ε( n ) 1≥ ∑ p(x, y ) ≥ J ε (n) (n) x , y∈J ε max( n ) p ( x , y ) = J ε( n ) 2 − n ( H ( x ,y ) −ε ) n>Nε x , y∈J ε min p ( x , y ) = J ε( n ) 2 − n ( H ( x ,y ) +ε ) x , y∈J ε( n ) ∀n Jan 2008 137 Joint AEP If px'=px and py' =py with x' and y' independent: ( ) (1 − ε )2 − n ( I ( x ,y ) +3ε ) ≤ p x ' , y '∈ J ε( n ) ≤ 2 − n ( I ( x ,y ) −3ε ) for n > N ε Proof: |J| × (Min Prob) ≤ Total Prob ≤ |J| × (Max Prob) ( ) ∑ p( x ' , y ' ) = ∑ p( x ' ) p( y ' ) ) ≤ J max p(x ' ) p(y ' ) p x ' , y '∈ J ε( n ) = ( p x ' , y '∈ J ε( n ) ( ) x ', y '∈J ε( n ) x ', y '∈J ε( n ) (n) ε x ', y '∈J ε( n ) ≤ 2 n ( H ( x ,y ) +ε )2 − n ( H ( x ) −ε )2 − n ( H ( y ) −ε ) = 2 − n ( I ( x ;y ) −3ε ) p x ' , y '∈ J ε( n ) ≥ J ε( n ) min( n ) p ( x ' ) p( y ' ) x ', y '∈J ε ≥ (1 − ε )2 − n ( I ( x ;y ) +3ε ) for n > N ε Jan 2008 138 Channel Codes • Assume Discrete Memoryless Channel with known Qy|x • An (M,n) code is – A fixed set of M codewords x(w)∈Xn for w=1:M – A deterministic decoder g(y)∈1:M • Error probability λw = p(g (y (w)) ≠ w) = ∑ p(y | x(w))δ g (y)≠ w y∈Y n – Maximum Error Probability λ( n ) = max λw – Average Error probability Pe( n ) = 1≤ w≤ M 1 M M ∑λ w w =1 δC = 1 if C is true or 0 if it is false Jan 2008 139 Achievable Code Rates • The rate of an (M,n) code: R=(log M)/n bits/transmission • A rate, R, is achievable if – ∃ a sequence of ( ⎡2 nR ⎤ , n ) codes for n=1,2,… – max prob of error λ(n)→0 as n→∞ – Note: we will normally write ( 2 nR , n ) to mean ( ⎡2 ⎤ , n) nR • The capacity of a DMC is the sup of all achievable rates • Max error probability for a code is hard to determine – – – – Shannon’s idea: consider a randomly chosen code show the expected average error probability is small Show this means ∃ at least one code with small max error prob Sadly it doesn’t tell you how to find the code Jan 2008 140 Channel Coding Theorem • A rate R is achievable if R<C and not achievable if R>C – If R<C, ∃ a sequence of (2nR,n) codes with max prob of error λ(n)→0 as n→∞ – Any sequence of (2nR,n) codes with max prob of error λ(n)→0 as n→∞ must have R ≤ C A very counterintuitive result: Despite channel errors you can get arbitrarily low bit error rates provided that R<C Bit error probability 0.2 0.15 Achievable 0.1 Impossible 0.05 0 0 1 2 Rate R/C 3 Jan 2008 141 Lecture 11 • Channel Coding Theorem Jan 2008 142 Channel Coding Principle • Consider blocks of n symbols: x1:n Noisy Channel 2nH(x) 2nH(y|x) 2nH(y) y1:n – An average input sequence x1:n corresponds to about 2nH(y|x) typical output sequences – Random Codes: Choose 2nR random code vectors x(w) • their typical output sequences are unlikely to overlap much. – Joint Typical Decoding: A received vector y is very likely to be in the typical output set of the transmitted x(w) and no others. Decode as this w. 
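The Joint AEP can be checked numerically on the jointly typical example above (BSC with f = 0.2 and px = (0.75, 0.25)). The sketch below (my own code; it enumerates joint types exactly rather than sampling, so it is O(n³) but deterministic, taking a few seconds for n = 100) computes the probability that a pair produced by the channel lands in Jε(n), and the probability that an independently drawn pair with the same marginals does. The first climbs with n while the second falls exponentially, which is exactly the gap that the joint-typicality decoder described above exploits.

```python
from math import comb, log2
from itertools import product

# Jointly typical example from the slides: BSC with f = 0.2, p_x = (0.75, 0.25)
pxy = {(0, 0): 0.6, (0, 1): 0.15, (1, 0): 0.05, (1, 1): 0.2}
px, py = (0.75, 0.25), (0.65, 0.35)
Hx = -sum(p * log2(p) for p in px)
Hy = -sum(p * log2(p) for p in py)
Hxy = -sum(p * log2(p) for p in pxy.values())
I = Hx + Hy - Hxy                                  # ≈ 0.212 bits
eps = 0.1

def joint_typicality_probs(n):
    """Exact P((x,y) ∈ J_eps^(n)) for a pair sent through the channel and for
    an independent pair (x', y') with the same marginals.  The typicality
    tests depend on a pair-sequence only through its counts of (0,0), (0,1),
    (1,0), (1,1), so we sum multinomial terms over all joint types."""
    p_dep = p_ind = 0.0
    for n00, n01, n10 in product(range(n + 1), repeat=3):
        n11 = n - n00 - n01 - n10
        if n11 < 0:
            continue
        lp_xy = (n00 * log2(pxy[0, 0]) + n01 * log2(pxy[0, 1])
                 + n10 * log2(pxy[1, 0]) + n11 * log2(pxy[1, 1]))
        lp_x = (n00 + n01) * log2(px[0]) + (n10 + n11) * log2(px[1])
        lp_y = (n00 + n10) * log2(py[0]) + (n01 + n11) * log2(py[1])
        if (abs(-lp_x / n - Hx) < eps and abs(-lp_y / n - Hy) < eps
                and abs(-lp_xy / n - Hxy) < eps):
            mult = comb(n, n00) * comb(n - n00, n01) * comb(n - n00 - n01, n10)
            p_dep += mult * 2.0 ** lp_xy           # pair drawn through the channel
            p_ind += mult * 2.0 ** (lp_x + lp_y)   # independent x', y'
    return p_dep, p_ind

print(f"I(x;y) = {I:.3f} bits")
for n in (25, 50, 100):
    dep, ind = joint_typicality_probs(n)
    print(f"n = {n:3d}   P(channel pair in J) = {dep:.3f}   P(indep pair in J) = {ind:.2e}")
# The first probability climbs (slowly, towards 1) while the second falls
# exponentially with n -- the separation that joint-typicality decoding uses.
```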
Channel Coding Theorem: for large n, can transmit at any rate R < C with negligible errors Jan 2008 143 Random (2nR,n) Code • Choose ε ≈ error prob, joint typicality ⇒ Nε , choose n>Nε • Choose px so that I(x ;y)=C, the information capacity • Use px to choose a code C with random x(w)∈Xn, w=1:2nR – the receiver knows this code and also the transition matrix Q • Assume (for now) the message W∈1:2nR is uniformly distributed • If received value is y; decode the message by seeing how many x(w)’s are jointly typical with y – if x(k) is the only one then k is the decoded message – if there are 0 or ≥2 possible k’s then 1 is the decoded message – we calculate error probability averaged over all C and all W p(E ) = ∑ p (C )2 − nR C 2 nR ∑λ w =1 w (C ) = 2 − nR ( a) 2 nR ∑∑ p(C)λ w =1 C w (C ) = ∑ p (C )λ1 (C ) = p(E | w = 1) C (a) since error averaged over all possible codes is independent of w Jan 2008 144 Channel Coding Principle • Assume we transmit x(1) and receive y • Define the events e w = {x ( w), y ∈ J ε( n) } for w ∈ 1 : 2 nR x1:n Noisy Channel y1:n • We have an error if either e1 false or ew true for w ≥ 2 • The x(w) for w ≠ 1 are independent of x(1) and hence also − n ( I ( x ,y ) −3ε ) for any w ≠ 1 independent of y. So p (ew true) < 2 Joint AEP Jan 2008 145 Error Probability for Random Code • We transmit x(1), receive y and decode using joint typicality • We have an error if either e1 false or ew true for w ≥ 2 ( ) ( ) p(E | W = 1) = p e1 ∪ e 2 ∪ e 3 ∪ 2 nR ∪ e 2nR ≤ p e1 + ∑ e w p(A∪B)≤p(A)+p(B) w= 2 2 nR (1) Joint typicality ≤ ε + ∑ 2 − n ( I ( x ;y ) −3ε ) = ε + 2 nR 2 − n ( I ( x ;y ) −3ε ) i =2 ≤ ε + 2 − n ( I ( x ;y ) − R −3ε ) ≤ 2ε for R < C − 3ε and n > − log ε C − R − 3ε (2) Joint AEP • Since average of Pe(n) over all codes is ≤ 2ε there must be at least one code for which this is true: this code has 2 − nR λw ≤ 2ε ∑ ♦ w • Now throw away the worst half of the codewords; the remaining ones must all have λw ≤ 4ε. The resultant code has rate R–n–1 ≅ R. ♦ ♦ = proved on next page Jan 2008 146 Code Selection & Expurgation • Since average of Pe(n) over all codes is ≤ 2ε there must be at least one code for which this is true. Proof: 2ε ≥ K −1 K ∑ Pe(,ni ) ≥K i =1 −1 ∑ min(P ) = min(P ) K i =1 (n) e ,i i i (n) e ,i K = num of codes • Expurgation: Throw away the worst half of the codewords; the remaining ones must all have λw≤ 4ε. Proof: Assume λw are in descending order 2ε ≥ M −1 M ∑λ w =1 ⇒ λ½ M ≤ 4ε w ≥M −1 ½M ∑λ w w =1 ⇒ λw ≤ 4ε ≥M −1 ½M ∑λ ½M ≥ ½ λ½ M w =1 ∀ w > ½M M ′ = ½ × 2 nR messages in n channel uses ⇒ R ′ = n −1 log M ′ = R − n −1 Jan 2008 147 Summary of Procedure • Given R´<C, choose ε<¼(C–R´) and set R=R´+ε ⇒ R<C – 3ε { −1 • Set n = max N ε , − (log ε ) /(C − R − 3ε ), ε } see (a),(b),(c) below • Find the optimum pX so that I(x ; y) = C • Choosing codewords randomly (using pX) and using joint typicality (a) as the decoder, construct codes with 2nR codewords • Since average of Pe(n) over all codes is ≤ 2ε there must be at least one code for which this is true. Find it by exhaustive search. • Throw away the worst half of the codewords. Now the worst codeword has an error prob ≤ 4ε with rate R´ = R – n–1 > R – ε (b) (c) • The resultant code transmits at a rate R´ with an error probability that can be made as small as desired (but n unnecessarily large). 
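The random-coding argument can also be watched working numerically. The sketch below (my own illustration, with arbitrary parameters) draws random (2^nR, n) codebooks for a BSC with f = 0.05, so C = 1 − H(0.05) ≈ 0.71, and uses rate R = 0.25 < C; minimum-Hamming-distance decoding stands in for the joint-typicality decoder (for the BSC it is the maximum-likelihood rule). The block-error rate, averaged over random codes and uniform messages, falls towards zero as n grows, as the theorem predicts for R < C.

```python
import numpy as np

rng = np.random.default_rng(1)
f, R = 0.05, 0.25        # BSC crossover and code rate; C = 1 - H(0.05) ~ 0.71 > R

def avg_block_error(n, trials=200):
    """Average block-error rate over random (2^{nR}, n) codes and uniform messages."""
    M = int(round(2 ** (n * R)))                      # number of codewords
    errors = 0
    for _ in range(trials):
        code = rng.integers(0, 2, size=(M, n))        # random codebook, bits i.i.d.
        w = rng.integers(M)                           # message chosen uniformly
        y = code[w] ^ (rng.random(n) < f)             # send x(w) through the BSC
        dist = (code != y).sum(axis=1)                # Hamming distance to every codeword
        errors += int(np.argmin(dist) != w)           # minimum-distance (ML) decoding
    return errors / trials

for n in (10, 20, 30, 40, 50):
    print(f"n = {n:2d}:  average block-error rate ~ {avg_block_error(n):.3f}")
```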
Note: ε determines both error probability and closeness to capacity Jan 2008 148 Lecture 12 • Converse of Channel Coding Theorem – Cannot achieve R>C – Minimum bit-error rate • Capacity with feedback – no gain but simpler encode/decode • Joint Source-Channel Coding – No point for a DMC Jan 2008 149 Converse of Coding Theorem • Fano’s Inequality: if Pe(n) is error prob when estimating w from y, H (w | y ) ≤ 1 + Pe( n ) log W = 1 + nRPe( n ) nR = H (w ) = H (w | y ) + I (w ; y ) • Hence Definition of I ≤ H (w | y ) + I ( x (w ); y ) Markov : w → x → y → wˆ ≤ 1 + nRPe( n ) + I ( x; y ) Fano ≤ 1 + nRPe( n ) + nC ⇒ Pe( n ) ≥ • R −C −n R −1 n-use DMC capacity → n →∞ R −C R Hence for large n, Pe(n) has a lower bound of (R–C)/R if w equiprobable – If R>C was achievable for small n, it could be achieved also for large n by concatenation. Hence it cannot be achieved for any n. Jan 2008 150 Minimum Bit-error Rate Suppose – v1:nR is i.i.d. bits with H(vi)=1 Δ ˆ – The bit-error rate is Pb = E p (v i ≠ v i ) = E{p (e i )} i (a) Then nC ≥ I (x 1:n ; y 1:n ) { } i (b) ≥ I (v 1:nR ;vˆ1:nR ) = H (v 1:nR ) − H (v 1:nR |vˆ1:nR ) ( ) = nR − ∑ H (v i |vˆ1:nR ,v 1:i −1 ) ≥ nR − ∑ H (v i |vˆi ) = nR 1 − Ei {H (v i |vˆi )} nR (d) ( nR (c) ) i =1 ( i =1 ) (e) ( ) = nR 1 − E{H (e i |vˆi )} ≥ nR 1 − E{H (e i )} ≥ nR 1 − H ( E Pb ,i ) = nR(1 − H ( Pb ) ) i (c) i (a) n-use capacity (b) Data processing theorem (c) Conditioning reduces entropy (d) e i = v i ⊕vˆi R ≤ C (1 − H ( Pb ) ) −1 Pb ≥ H −1 (1 − C / R ) Bit error probability 0.2 Hence 0.15 0.1 0.05 0 0 i 1 2 Rate R/C 3 (e) Jensen: E H(x) ≤ H(E x) Jan 2008 151 Channel with Feedback w∈1:2nR y1:i–1 Encoder xi Noisy Channel ^ w yi Decoder • Assume error-free feedback: does it increase capacity ? • A (2nR,n) feedback code is – A sequence of mappings xi = xi(w,y1:i–1) for i=1:n – A decoding function wˆ = g (y 1:n ) • A rate R is achievable if ∃ a sequence of (2nR,n) feedback ⎯→ 0 codes such that Pe( n ) = P(wˆ ≠ w ) ⎯n⎯ →∞ • Feedback capacity, CFB≥C, is the sup of achievable rates Jan 2008 152 Capacity with Feedback w∈1:2nR I (w ; y ) = H ( y ) − H ( y | w ) n = H ( y ) − ∑ H (y i | y 1:i −1 ,w ) y1:i–1 Encoder xi Noisy Channel yi ^ w Decoder i =1 n = H ( y ) − ∑ H (y i | y 1:i −1 ,w , x i ) since xi=xi(w,y1:i–1) i =1 n = H ( y ) − ∑ H (y i | x i ) i =1 n since yi only directly depends on xi n n i =1 i =1 ≤ ∑ H (y i ) − ∑ H (y i | x i ) = ∑ I (x i ; y i ) ≤ nC i =1 cond reduces ent Hence nR = H (w ) = H (w | y ) + I (w ; y ) ≤ 1 + nRPe( n ) + nC ⇒P (n) e R − C − n −1 ≥ R Fano The DMC does not benefit from feedback: CFB = C Jan 2008 153 Example: BEC with feedback • Capacity is 1–f • Encode algorithm – If yi=?, retransmit bit i = 0 f x – Average number of transmissions per bit: 1+ f + f 2 + 1–f 0 ? y f 1 1 1− f 1–f 1 • Average number of bits per transmission = 1–f • Capacity unchanged but encode/decode algorithm much simpler. 
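The bit-error bound Pb ≥ H⁻¹(1 − C/R) derived a few slides above is easy to evaluate. The short sketch below (my own, with arbitrarily chosen ratios R/C) inverts the binary entropy function by bisection and tabulates the resulting lower bound on the bit-error rate.

```python
from math import log2

def Hb(p):
    """Binary entropy in bits, with Hb(0) = Hb(1) = 0."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def Hb_inv(h):
    """Inverse of Hb restricted to [0, 0.5], found by bisection."""
    lo, hi = 0.0, 0.5
    for _ in range(60):                       # 60 halvings give ample precision
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if Hb(mid) < h else (lo, mid)
    return 0.5 * (lo + hi)

# Lower bound on the bit-error rate when signalling at rate R > C
for ratio in (1.0, 1.1, 1.5, 2.0, 3.0):       # illustrative values of R/C
    pb = 0.0 if ratio <= 1.0 else Hb_inv(1.0 - 1.0 / ratio)
    print(f"R/C = {ratio:.1f}:  Pb >= {pb:.3f}")
```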
Jan 2008 154 Joint Source-Channel Coding v1:n Encoder x1:n Noisy Channel y1:n ^ Decoder v1:n • Assume vi satisfies AEP and |V|<∞ – Examples: i.i.d.; markov; stationary ergodic • Capacity of DMC channel is C – if time-varying: C = lim n −1 I ( x; y ) n →∞ • Joint Source Channel Coding Theorem: ⎯→ 0 iff H(V) < C ∃ codes with Pe( n ) = P(vˆ1:n ≠ v 1:n ) ⎯n⎯ →∞ ♦ – errors arise from incorrect (i) encoding of V or (ii) decoding of Y • Important result: source coding and channel coding might as well be done separately since same capacity ♦ = proved on next page Jan 2008 155 Source-Channel Proof (⇐) • For n>Nε there are only 2n(H(V)+ε) v’s in the typical set: encode using n(H(V)+ε) bits – encoder error < ε • Transmit with error prob less than ε so long as H(V)+ ε < C • Total error prob < 2ε Jan 2008 156 Source-Channel Proof (⇒) v1:n Encoder Fano’s Inequality: x1:n Noisy Channel y1:n entropy rate of stationary process = n −1 H (v 1:n |vˆ1:n ) + n −1 I (v 1:n ;vˆ1:n ) ) ≤ n −1 1 + Pe( n ) n log V + n −1 I (x 1:n ; y 1:n ) ≤ n −1 + Pe( n ) log V + C Let Decoder H ( v | vˆ ) ≤ 1 + Pe( n ) n log V H (V ) ≤ n −1 H (v 1:n ) ( ^ v1:n n → ∞ ⇒ Pe( n ) → 0 ⇒ H (V ) ≤ C definition of I Fano + Data Proc Inequ Memoryless channel Jan 2008 157 Separation Theorem • For a (time-varying) DMC we can design the source encoder and the channel coder separately and still get optimum performance • Not true for – Correlated Channel and Source – Multiple access with correlated sources • Multiple sources transmitting to a single receiver – Broadcasting channels • one source transmitting possibly different information to multiple receivers Jan 2008 158 Lecture 13 • Continuous Random Variables • Differential Entropy – can be negative – not a measure of the information in x – coordinate-dependent • Maximum entropy distributions – Uniform over a finite range – Gaussian if a constant variance Jan 2008 159 Continuous Random Variables Changing Variables • pdf: f x (x) CDF: Fx ( x) = ∫ x −∞ f x (t )dt −1 • For g(x) monotonic: y = g (x ) ⇔ x = g (y ) ( Fy ( y ) = Fx g −1 ( y ) ) ( or 1 − Fx g −1 ( y ) ) according to slope of g(x) dFY ( y ) dg −1 ( y ) dx −1 fy ( y) = = f x g ( y) = f x (x ) dy dy dy ( ) where x = g −1 ( y ) • Examples: Suppose f x ( x) = 0.5 for (a) y = 4x ⇒ x = 0.25y (b) z =x 4 ⇒ x =z¼ ⇒ x ∈ (0,2) ⇒ Fx ( x) = 0.5 x ⇒ f y ( y ) = 0.5 × 0.25 = 0.125 for f z ( z ) = 0.5 × ¼ z −¾ = 0.125 z −¾ y ∈ (0,8) for z ∈ (0,16) Jan 2008 160 Joint Distributions Distributions Joint pdf: f x , y ( x, y ) ∞ f x ( x) = ∫ f x ,y ( x, y )dy Marginal pdf: −∞ Independence: ⇔ f x ,y ( x, y ) = f x ( x) f y ( y ) f ( x, y ) Conditional pdf: f x |y ( x) = x ,y y y fy ( y) f x ,y = 1 for y ∈ (0,1), x ∈ ( y, y + 1) f x |y = 1 for x ∈ ( y, y + 1) f y |x = x fX 1 for y ∈ (max(0, x − 1), min( x,1) ) min( x,1 − x) x fY Example: Jan 2008 161 Quantised Random Variables • Given a continuous pdf f(x), we divide the range of x into bins of width Δ ( i +1) Δ – For each i, ∃xi with f ( xi )Δ = ∫ f ( x)dx iΔ mean value theorem • Define a discrete random variable Y – Y = {xi} and py = {f(xi)Δ} – Scaled,quantised version of f(x) with slightly unevenly spaced xi • H (y ) = −∑ f ( xi )Δ log( f ( xi )Δ ) = − log Δ − ∑ f ( xi ) log( f ( xi ) )Δ → Δ →0 − log Δ − ∫ ∞ −∞ f ( x) log f ( x)dx = − log Δ + h(x ) ∞ • Differential entropy: h(x ) = − ∫ f x ( x) log f x ( x)dx −∞ Jan 2008 162 Differential Entropy Δ ∞ Differential Entropy: h(x ) =− ∫ f x ( x) log f x ( x)dx = E − log f x ( x) −∞ Bad News: – h(x) does not give the amount of information in x – h(x) is not 
necessarily positive – h(x) changes with a change of coordinate system Good News: – h1(x)–h2(x) does compare the uncertainty of two continuous random variables provided they are quantised to the same precision – Relative Entropy and Mutual Information still work fine – If the range of x is normalized to 1 and then x is quantised to n bits, the entropy of the resultant discrete random variable is approximately h(x)+n Jan 2008 163 Differential Entropy Examples • Uniform Distribution: x ~ U (a, b) – f ( x) = (b − a ) −1 for x ∈ (a, b) and f ( x) = 0 elsewhere – h(x ) = − ∫ (b − a ) −1 log(b − a ) −1 dx = log(b − a ) b a – Note that h(x) < 0 if (b–a) < 1 • Gaussian Distribution: x ~ N ( μ , σ 2 ) – f ( x) = (2πσ 2 ) exp(− ½( x − μ ) 2 σ − 2 ) −½ ∞ – h(x ) = −(log e) ∫ f ( x) ln f ( x)dx −∞ ( ∞ = −(log e) ∫ f ( x) − ½ ln(2πσ 2 ) − ½( x − μ ) 2 σ − 2 −∞ ( = ½(log e)(ln(2πσ ( = ½(log e) ln(2πσ 2 ) + σ −2 E ( x − μ ) 2 2 ) )) 2 ) + 1 = ½ log(2πeσ ) ≅ log(4.1σ ) bits ) Jan 2008 164 Multivariate Gaussian Given mean, m, and symmetric +ve definite covariance matrix K, x 1:n ~ N(m, K ) ⇔ f (x) = 2πK −½ ( exp − ½(x − m)T K −1 (x − m) ( ) ) h( f ) = −(log e )∫ f (x) × − ½(x − m)T K −1 (x − m) − ½ ln 2πK dx ( ( )) = ½ log(e )× (ln 2πK + E tr ((x − m)(x − m) K )) = ½ log(e )× (ln 2πK + tr (E (x − m )(x − m ) K )) = ½ log(e )× (ln 2πK + tr (KK )) = ½ log(e )× (ln 2πK + n ) = ½ log(e ) + ½ log( 2πK ) = ½ log(e )× ln 2πK + E (x − m)T K −1 (x − m) −1 n = ½ log( 2π eK ) bits T −1 T −1 Jan 2008 165 Other Differential Quantities Joint Differential Entropy h(x , y ) = − ∫∫ f x ,y ( x, y ) log f x ,y ( x, y )dxdy = E − log f x ,y ( x, y ) x, y Conditional Differential Entropy h(x | y ) = − ∫∫ f x ,y ( x, y ) log f x ,y ( x | y )dxdy = h(x , y ) − h(y ) x, y Mutual Information I (x ; y ) = ∫∫ f x ,y ( x, y ) log x, y f x , y ( x, y ) f x ( x) f y ( y ) dxdy = h(x ) + h(y ) − h(x , y ) Relative Differential Entropy of two pdf’s: f ( x) dx g ( x) = −h f (x ) − E f log g (x ) D( f || g ) = ∫ f ( x) log (a) must have f(x)=0 ⇒ g(x)=0 (b) continuity ⇒ 0 log(0/0) = 0 Jan 2008 166 Differential Entropy Properties Chain Rules h ( x , y ) = h ( x ) + h (y | x ) = h (y ) + h ( x | y ) I ( x , y ; z ) = I ( x ; z ) + I (y ; z | x ) Information Inequality: D( f || g ) ≥ 0 Proof: Define S = {x : f (x) > 0} − D ( f || g ) = ∫ x∈S f (x) log ⎛ g ( x) g (x ) ⎞ ⎟⎟ dx = E ⎜⎜ log f ( x) f ( x ) ⎠ ⎝ ⎛ g ( x) ⎞ ⎛ g (x ) ⎞ dx ⎟⎟ ⎟⎟ = log⎜⎜ ∫ f (x) ≤ log⎜⎜ E f ( x ) ⎝ f (x) ⎠ ⎝S ⎠ ⎛ ⎞ = log⎜⎜ ∫ g (x)dx ⎟⎟ ≤ log 1 = 0 ⎝S ⎠ Jensen + log() is concave all the same as for H() Jan 2008 167 Information Inequality Corollaries Mutual Information ≥ 0 I (x ; y ) = D( f x ,y || f x f y ) ≥ 0 Conditioning reduces Entropy h( x ) − h( x | y ) = I ( x ; y ) ≥ 0 Independence Bound n n i =1 i =1 h(x 1:n ) = ∑ h(x i | x 1:i −1 ) ≤ ∑ h(x i ) all the same as for H() Jan 2008 168 Change of Variable Change Variable: y = g (x ) from earlier ( f y ( y ) = f x g −1 ( y ) −1 ) dg dy( y) h(y ) = − E log( f y (y )) = − E log( f x ( g −1 (y )) ) − E log = − E log( f x (x ) ) − E log dx dy dx dx = h(x ) − E log dy dy Examples: – Translation: – Scaling: y = x + a ⇒ dy / dx = 1 ⇒ h(y ) = h(x ) y = cx ⇒ dy / dx = c ⇒ h(y ) = h(x ) − log c −1 – Vector version: y 1:n = Ax 1:n ⇒ h( y ) = h( x ) + log det( A) not the same as for H() Jan 2008 169 Concavity & Convexity • Differential Entropy: – h(x) is a concave function of fx(x) ⇒ ∃ a maximum • Mutual Information: – I(x ; y) is a concave function of fx (x) for fixed fy|x (y) – I(x ; y) is a convex function of fy|x (y) for fixed 
fx (x) Proofs: Exactly the same as for the discrete case: pz = [1–λ, λ]T U 1 X V U Z 0 Z H (x ) ≥ H (x | z ) U 1 X X V 0 f(u|x) 1 f(y|x) Y I (x ;y ) ≥ I (x ;y |z ) Y f(v|x) V 0 Z I (x ;y ) ≤ I (x ;y |z ) Jan 2008 170 Uniform Distribution Entropy What distribution over the finite range (a,b) maximizes the entropy ? Answer: A uniform distribution u(x)=(b–a)–1 Proof: Suppose f(x) is a distribution for x∈(a,b) 0 ≤ D( f || u ) = −h f (x ) − E f log u (x ) = −h f (x ) + log(b − a ) ⇒ h f (x ) ≤ log(b − a) Jan 2008 171 Maximum Entropy Distribution What zero-mean distribution maximizes the entropy on (–∞, ∞)n for a given covariance matrix K ? −½ Answer: A multivariate Gaussian φ (x) = 2πK exp(− ½ xT K −1x ) Proof: 0 ≤ D ( f || φ ) = −h f ( x ) − E f log φ ( x ) ( ⇒ h f ( x ) ≤ −(log e )E f − ½ ln ( 2πK ) − ½ x T K −1x ( ( = ½(log e ) ln ( 2πK ) + tr E f xx T K −1 )) ) = ½(log e )(ln ( 2πK ) + tr (I )) = ½ log( 2πeK ) = hφ (x ) EfxxT = K tr(I)=n=ln(en) Since translation doesn’t affect h(X), we can assume zero-mean w.l.o.g. Jan 2008 172 Lecture 14 • Discrete-time Gaussian Channel Capacity – Sphere packing • Continuous Typical Set and AEP • Gaussian Channel Coding Theorem • Bandlimited Gaussian Channel – Shannon Capacity – Channel Codes Jan 2008 173 Capacity of Gaussian Channel Discrete-time channel: yi = xi + zi Zi – Zero-mean Gaussian i.i.d. zi ~ N(0,N) Xi n Yi + −1 2 – Average power constraint n ∑ xi ≤ P i =1 Ey = E (x + z ) = Ex + 2 E (x ) E (z ) + Ez 2 ≤ P + N 2 2 2 X,Z indep and EZ=0 Information Capacity I (x ; y ) – Define information capacity: C = max 2 Ex ≤ P I ( x ; y ) = h (y ) − h ( y | x ) = h (y ) − h ( x + z | x ) (a) = h ( y ) − h ( z | x ) = h (y ) − h (z ) (a) Translation independence ≤ ½ log 2πe(P + N ) − ½ log 2πeN = ½ log(1 + PN −1 ) = 1 6 Gaussian Limit with equality when x~N(0,P) [ P+NN ]dB The optimal input is Gaussian & the worst noise is Gaussian Jan 2008 174 Gaussian Channel Code Rate z w∈1:2nR Encoder x y + ^ Decoder w∈0:2nR • An (M,n) code for a Gaussian Channel with power constraint is – A set of M codewords x(w)∈Xn for w=1:M with x(w)Tx(w) ≤ nP ∀w – A deterministic decoder g(y)∈0:M where 0 denotes failure max : λ( n ) average : Pe( n ) – Errors: codeword : λi i i • Rate R is achievable if ∃ seq of (2nR,n) codes with λ( n ) → 0 n →∞ • Theorem: R achievable iff R < C = ½ log(1 + PN −1 ) ♦ ♦ = proved on next pages Jan 2008 175 Sphere Packing • Each transmitted xi is received as a probabilistic cloud yi – cloud ‘radius’ = Var (y | x ) = nN • Energy of yi constrained to n(P+N) so clouds must fit into a hypersphere of radius n(P + N ) • Volume of hypersphere ∝ r n • Max number of non-overlapping clouds: (nP + nN )½ n ½ n log (1+ PN −1 ) = 2 (nN )½ n • Max rate is ½log(1+PN–1) Jan 2008 176 Continuous AEP Typical Set: Continuous distribution, discrete time i.i.d. For any ε>0 and any n, the typical set with respect to f(x) is { Tε( n ) = x ∈ S n : − n −1 log f (x) − h(x ) ≤ ε where } S is the support of f ⇔ {x : f(x)>0} n f (x) = ∏ f ( xi ) since xi are independent i =1 h(x ) = E − log f (x ) = −n −1 E log f ( x ) Typical Set Properties 1. p ( x ∈ Tε( n ) ) > 1 − ε for n > N ε 2. 
(1 − ε )2 n ( h ( x ) −ε ) ≤ Vol(Tε( n ) ) ≤ 2 n ( h ( x ) +ε ) n> Nε where Vol( A) = ∫ dx x∈A Proof: WLLN Proof: Integrate max/min prob Jan 2008 177 Continuous AEP Proof Proof 1: By weak law of large numbers n prob − n −1 log f (x 1:n ) = −n −1 ∑ log f (x i ) → E − log f (x ) = h(x ) i =1 prob Reminder: x n → y ⇒ ∀ε > 0, ∃N ε such that ∀n > N ε , P(| x n − y |> ε ) < ε Proof 2a: 1− ε ≤ ∫ f (x)dx for n > N ε Tε( n ) Property 1 ( ) max f(x) within T ( ) min f(x) within T ≤ 2 − n (h ( X ) −ε ) ∫ dx = 2 − n (h ( X ) −ε ) Vol Tε( n ) (n) Tε Proof 2b: 1= ∫ f (x)dx ≥ ∫ f (x)dx Sn Tε( n ) ≥ 2 − n (h ( X ) +ε ) ∫ dx = 2 − n (h ( X ) +ε ) Vol Tε( n ) (n) Tε Jan 2008 178 Jointly Typical Set Jointly Typical: xi,yi i.i.d from ℜ2 with fx,y (xi,yi) { J ε( n ) = x, y ∈ ℜ 2 n : − n −1 log f X (x) − h(x ) < ε , − n −1 log fY (y ) − h(y ) < ε , − n −1 log f X ,Y (x, y ) − h(x , y ) < ε Properties: } 1. Indiv p.d.: x, y ∈ J ε( n ) ⇒ log f x ,y (x, y ) = − nh(x , y ) ± nε 2. Total Prob: p x , y ∈ J ε( n ) > 1 − ε 3. Size: 4. Indep x′,y′: ( (1 − ε )2 ) n ( h ( x , y ) −ε ) (1 − ε )2 n> Nε for n > N ε ( ) ≤ Vol J ε( n ) ≤ 2 n (h ( x ,y ) +ε ) − n ( I ( x ;y ) + 3ε ) n> Nε ( ) ≤ p x ' , y '∈ J ε( n ) ≤ 2 − n ( I ( x ;y ) −3ε ) Proof of 4.: Integrate max/min f(x´, y´)=f(x´)f(y´), then use known bounds on Vol(J) Jan 2008 179 Gaussian Channel Coding Theorem ( R is achievable iff R < C = ½ log 1 + PN −1 Proof (⇐): ) Choose ε > 0 Random codebook: x w ∈ ℜn for w = 1 : 2nR where xw are i.i.d. ~ N (0, P − ε ) Use Joint typicality decoding Errors: 1. Power too big p (x T x > nP ) → 0 ⇒ ≤ ε for n > M ε p ( x , y ∉ J ε( n ) ) < ε for n > N ε 2. y not J.T. with x 3. another x J.T. with y 2 nR ∑ p( x j =2 − n ( I ( X ;Y ) − R −3ε ) j , y i ∈ J ε( n ) ) ≤ (2 nR − 1) × 2 − n ( I ( x ;y ) −3ε ) ≤ 3ε for large n if R < I ( X ; Y ) − 3ε Total Err Pε ≤ ε + ε + 2 Expurgation: Remove half of codebook*: λ( n ) < 6ε now max error We have constructed a code achieving rate R–n–1 (n) *:Worst codebook half includes xi: xiTxi >nP ⇒ λi=1 Jan 2008 180 Gaussian Channel Coding Theorem Proof (⇒): Assume Pε( n ) → 0 and n -1xT x < P for each x( w) nR = H (w ) = I (w ; y 1:n ) + H (w | y 1:n ) ≤ I (x 1:n ; y 1:n ) + H (w | y 1:n ) Data Proc Inequal = h(y 1:n ) − h(y 1:n | x 1:n ) + H (w | y 1:n ) n ≤ ∑ h(y i ) − h(z 1:n ) + H (w | y 1:n ) Indep Bound + Translation i =1 n ≤ ∑ I (x i ; y i ) + 1 + nRPε( n ) Z i.i.d + Fano, |W|=2nR i =1 ( n ) ≤ ∑ ½ log 1 + PN −1 + 1 + nRPε( n ) i =1 ( ) max Information Capacity ( −1 R ≤ ½ log 1 + PN −1 + n −1 + RPε( n ) → ½ log 1 + PN ) Jan 2008 181 Bandlimited Channel • Channel bandlimited to f∈(–W,W) and signal duration T • Nyquist: Signal is completely defined by 2WT samples • Can represent as a n=2WT-dimensional vector space with prolate spheroidal functions as an orthonormal basis – white noise with double-sided p.s.d. 
½N0 becomes i.i.d. Gaussian N(0, ½N0) added to each coefficient
– Signal power constraint = P ⇒ signal energy ≤ PT
• Energy constraint per coefficient: n⁻¹xᵀx < PT/(2WT) = ½W⁻¹P
• Capacity: C = ½ log(1 + ½W⁻¹P × (½N0)⁻¹) × 2W = W log(1 + P/(N0W)) bits/second
Compare the discrete-time version: ½ log(1 + P/N) bits per channel use

Jan 2008 182 Shannon Capacity
Bandwidth = W Hz, signal variance per coefficient = ½W⁻¹P, noise variance = ½N0
Signal power = P, noise power = N0W, minimum bit energy Eb = P/C
Capacity: C = W log(1 + P/(N0W)) bits per second
Define W0 = P/N0 ⇒ C/W0 = (W/W0) log(1 + (W/W0)⁻¹) → log e as W → ∞
⇒ Eb/N0 = (C/W0)⁻¹ → ln 2 = −1.6 dB as W → ∞
• For fixed power, high bandwidth is better
– Ultra wideband
[Figures: Capacity/W0 vs Bandwidth/W0 at constant P; Eb/N0 (dB) vs Bandwidth/Capacity (Hz/bps) at constant C]

Jan 2008 183 Practical Channel Codes
Code classification:
– Very good: arbitrarily small error at any rate up to the capacity
– Good: arbitrarily small error only up to some rate less than capacity
– Bad: arbitrarily small error only at zero rate (or never)
Coding Theorem: nearly all codes are very good
– but nearly all codes need encode/decode computation ∝ 2ⁿ
Practical good codes:
– Practical: computation & memory ∝ nᵏ for some k
– Convolution codes: convolve the bit stream with a filter
• Concatenation, interleaving, turbo codes (1993)
– Block codes: encode a block at a time
• Hamming, BCH, Reed–Solomon, low-density parity check (1995)

Jan 2008 184 Channel Code Performance
• Power limited
– High bandwidth
– Spacecraft, pagers
– Use QPSK/4-QAM
– Block/convolution codes
• Bandwidth limited
– Modems, DVB, mobile phones
– 16-QAM to 256-QAM
– Convolution codes
• Value of 1 dB for space
– Better range, lifetime, weight, bit rate
– $80 M (1999)
Diagram from “An overview of Communications” by John R Barry; TTC = Japanese Technology Telecommunications Council

Jan 2008 185 Lecture 15
• Parallel Gaussian Channels
– Waterfilling
• Gaussian Channel with Feedback

Jan 2008 186 Parallel Gaussian Channels
• n Gaussian channels (or one channel used n times)
– e.g. digital audio, digital TV, broadband ADSL
• Noise is independent: zi ~ N(0, Ni) and yi = xi + zi on each channel
• Average power constraint: E xᵀx ≤ P
• Information capacity: C = max I(x ; y) over f(x) with E xᵀx ≤ P
• R < C ⇔ R achievable – proof as before
• What is the optimal f(x)? (see the water-filling sketch below)
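As a preview of the water-filling answer derived on the next few slides, here is a minimal numerical sketch (my own, in Python; the noise levels and power budget are arbitrary illustrative values): bisect on the common water level, give each channel Pi = max(level − Ni, 0), and compare the resulting Σ ½ log(1 + Pi/Ni) with a naive equal-power split.

```python
import numpy as np

def waterfill(N, P, iters=100):
    """Power allocation P_i = max(level - N_i, 0) whose total is P (bisection on the level)."""
    N = np.asarray(N, dtype=float)
    lo, hi = N.min(), N.max() + P                 # the water level must lie in this range
    for _ in range(iters):
        level = 0.5 * (lo + hi)
        if np.maximum(level - N, 0.0).sum() > P:  # too much power used: lower the level
            hi = level
        else:
            lo = level
    return np.maximum(0.5 * (lo + hi) - N, 0.0)

def total_capacity(Palloc, N):
    return 0.5 * np.log2(1.0 + np.asarray(Palloc) / np.asarray(N)).sum()

N = np.array([1.0, 2.0, 6.0, 10.0])   # illustrative noise variances N_i
P = 8.0                               # illustrative total power budget
Pw = waterfill(N, P)
print("water-filling allocation:", Pw.round(3))
print("water-filling:", round(total_capacity(Pw, N), 3), "bits per vector use")
print("equal split:  ", round(total_capacity(np.full(4, P / 4), N), 3), "bits per vector use")
```

With these numbers the two noisiest channels receive no power at all, and water-filling gains roughly 0.3 bits per vector use over the equal split.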
Jan 2008 187 Parallel Gaussian: Max Capacity
Need to find the f(x) achieving C = max I(x ; y) over f(x) with E xᵀx ≤ P
I(x ; y) = h(y) − h(y | x) = h(y) − h(z | x)   [translation invariance]
= h(y) − h(z) = h(y) − Σi h(zi)   [x, z independent; the zi independent]
≤ Σi (h(yi) − h(zi))   (a)
≤ Σi ½ log(1 + Pi/Ni)   (b)
(a) independence bound; (b) single-channel capacity limit
Equality when: (a) the yi independent ⇒ the xi independent; (b) xi ~ N(0, Pi)
We need to find the Pi that maximise Σi ½ log(1 + Pi/Ni)

Jan 2008 188 Parallel Gaussian: Optimal Powers
We need to find the Pi that maximise log(e) Σi ½ ln(1 + Pi/Ni)
– subject to the power constraint Σi Pi = P – use a Lagrange multiplier
J = Σi ½ ln(1 + Pi/Ni) − λ Σi Pi
∂J/∂Pi = ½(Pi + Ni)⁻¹ − λ = 0 ⇒ Pi + Ni = ½λ⁻¹
Also Σi Pi = P ⇒ λ = ½n(P + Σi Ni)⁻¹, i.e. ½λ⁻¹ = n⁻¹(P + Σi Ni)
Water filling: put most power into the least noisy channels so that power + noise is the same (= ½λ⁻¹) in every channel
[Figure: water-filling diagram, level ½λ⁻¹ above noise floors N1, N2, N3 giving powers P1, P2, P3]

Jan 2008 189 Very Noisy Channels
• Must have Pj ≥ 0 ∀j
• If ½λ⁻¹ < Nj then set Pj = 0 and recalculate λ
[Figure: water level ½λ⁻¹ below N3, so only P1 and P2 are non-zero]
Kuhn–Tucker conditions (not examinable):
– Max f(x) subject to Ax + b = 0 and gi(x) ≥ 0 for i ∈ 1:M with f, gi concave
– Set J(x) = f(x) − Σi μi gi(x) − λᵀAx
– x0, λ, μi are a solution iff ∇J(x0) = 0, Ax0 + b = 0, gi(x0) ≥ 0, μi ≥ 0, μi gi(x0) = 0

Jan 2008 190 Correlated Noise
• Suppose y = x + z where E zzᵀ = KZ and E xxᵀ = KX
• We want to find KX to maximize capacity subject to the power constraint E Σi xi² ≤ nP ⇔ tr(KX) ≤ nP
– Find the noise eigenvectors: KZ = QDQᵀ with QQᵀ = I
– Now Qᵀy = Qᵀx + Qᵀz = Qᵀx + w where E wwᵀ = E QᵀzzᵀQ = QᵀKZQ = D is diagonal ⇒ the wi are now independent
– The power constraint is unchanged: tr(QᵀKXQ) = tr(KXQQᵀ) = tr(KX)
– Choose QᵀKXQ = LI − D where L = P + n⁻¹tr(D) ⇒ KX = Q(LI − D)Qᵀ

Jan 2008 191 Power Spectrum Water Filling
• If z is from a stationary process then diag(D) → the noise power spectrum N(f) as n → ∞
– To achieve capacity, waterfill on the noise power spectrum:
P = ∫₋W^W max(L − N(f), 0) df
C = ∫₋W^W ½ log(1 + max(L − N(f), 0)/N(f)) df
[Figures: noise spectrum N(f) with water level L, and the resulting transmit spectrum, both plotted against frequency]

Jan 2008 192 Gaussian Channel + Feedback
Does feedback add capacity?
– White noise – no
– Coloured noise – not much
[Diagram: w → coder → xi; yi = xi + zi; the coder also sees the past outputs y1:i−1]
I(w ; y) = h(y) − h(y | w) = h(y) − Σi h(yi | w, y1:i−1)   [chain rule]
= h(y) − Σi h(yi | w, y1:i−1, x1:i, z1:i−1)   [xi = xi(w, y1:i−1) and z = y − x are determined by the existing conditioning]
= h(y) − Σi h(zi | w, y1:i−1, x1:i, z1:i−1)   [z = y − x and translation invariance]
= h(y) − Σi h(zi | z1:i−1)   [zi depends only on z1:i−1]
= h(y) − h(z) = ½ log(|Ky| / |Kz|)   [chain rule; h(z) = ½ log|2πeKZ| bits]
⇒ maximize I(w ; y) by maximizing h(y) ⇒ y Gaussian ⇒ we can take z and x = y − z to be jointly Gaussian

Jan 2008 193 Gaussian Feedback Coder
x and z jointly Gaussian ⇒ x = Bz + v(w), where v is independent of z and B is strictly lower triangular since xi is independent of zj for j > i.
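Before the feedback derivation continues below, here is a small sketch of the correlated-noise recipe just described (my own illustration; the 2×2 covariance and power budget are arbitrary, not the notes' example): waterfill on the eigenvalues of KZ, rebuild KX = Q(LI − D)Qᵀ, and evaluate the per-use rate ½n⁻¹ log(|KX + KZ| / |KZ|).

```python
import numpy as np

Kz = np.array([[3.0, 1.0],            # illustrative noise covariance (not the notes' example)
               [1.0, 2.0]])
n = Kz.shape[0]
P = 2.0                               # power per channel use, i.e. tr(Kx) <= n*P

d, Q = np.linalg.eigh(Kz)             # Kz = Q diag(d) Q^T, eigenvalues ascending
L = P + d.mean()                      # water level chosen so that tr(L*I - D) = n*P
assert (L - d).min() >= 0             # otherwise the noisiest eigenchannel must be dropped
Kx = Q @ np.diag(L - d) @ Q.T         # optimal input covariance
Ky = Kx + Kz

C = 0.5 / n * np.log2(np.linalg.det(Ky) / np.linalg.det(Kz))
print("eigenvalues of Kz:", d.round(3))
print("tr(Kx) =", round(np.trace(Kx), 3), "(power budget =", n * P, ")")
print("C =", round(C, 3), "bits per channel use")
```

Here the water level is high enough that both eigenchannels receive power; with a smaller budget the assert would fail and the noisier eigenchannel would have to be dropped, exactly as on the "Very Noisy Channels" slide above.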
y = x + z = (B + I )z + v ( = E (Bzz B ) K y = Eyy T = E (B + I )zzT (B + I )T + vv T = (B + I )K z (B + I )T + K v T K x = Exx T T ½n Capacity: Cn, FB = max K ,B + vv T −1 v subject to ( | Ky | | Kz | ) = BK z BT + K v −1 = max ½ n log (B + I )K z (B + I )T + K v | Kz | K v ,B ) K x = tr BK z BT + K v ≤ nP hard to solve / Jan 2008 194 Gaussian Feedback: Toy Example ⎛2 1⎞ ⎛ 0 0⎞ ⎟⎟, B = ⎜⎜ ⎟⎟ n = 2, P = 2, K Z = ⎜⎜ b 1 2 0 ⎝ ⎠ ⎝ ⎠ 25 ( det(K Y ) = det (B + I )K z (B + I ) + K v T det(KY) x = Bz + v Goal: Maximize (w.r.t. Kv and b) 15 10 ) 5 Subject to: 0 -1.5 ( ) Power constraint : tr BK z BT + K v ≤ 4 Solution (via numerically search): b=0: det(Ky)=16 b=0.69: det(Ky)=20.7 C=0.604 bits C=0.697 bits -0.5 0 b 0.5 4 PV2 3 1 1.5 PbZ1 2 PV1 1 0 -1.5 Feedback increases C by 16% -1 5 Power: V1 V2 bZ1 K v must be positive definite det(Ky) 20 -1 -0.5 Px 1 = Pv 1 0 b 0.5 1 Px 2 = Pv 2 + Pbz 1 1.5 Jan 2008 195 Max Benefit of Feedback: Lemmas Lemma 1: Kx+z+Kx–z = 2(Kx+Kz) K x + z + K x −z = E (x + z )(x + z ) + E (x − z )(x − z ) T ( = E (2 xx T = E xx T + xzT + zx T + zzT T ) ) + xx T − xzT − zx T + zzT + 2zzT = 2(K x + K z ) Lemma 2: If F,G are +ve definite then det(F+G)≥det(F) Consider two indep random vectors f~N(0,F), g~N(0,G) ( ) ½ log (2πe ) det( F + G ) = h(f + g) n ≥ h(f + g | g) = h(f | g) ( Conditioning reduces h() Translation invariance = h(f ) = ½ log (2πe ) det(F) n ) f,g independent Hence: det(2(Kx+Kz)) = det(Kx+z+Kx–z) ≥ det(Kx+z) = det(Ky) Jan 2008 196 Maximum Benefit of Feedback zi w Coder Cn , FB ≤ max ½ n −1 log + yi det(K y ) det(K z ) det (2(K x + K z )) ≤ max ½ n −1 log tr( K x ) ≤ nP det(K z ) tr( K x ) ≤ nP no constraint on B ⇒ ≤ Lemmas 1 & 2: det(2(Kx+Kz)) ≥ det(Ky) 2 n det (K x + K z ) = max ½ n log tr( K x ) ≤ nP det(K z ) −1 = ½ + max ½ n −1 log tr( K x ) ≤ nP xi det(kA)=kndet(A) det (K x + K z ) = ½ + Cn bits/transmission det(K z ) Ky = Kx+Kz if no feedback Having feedback adds at most ½ bit per transmission Jan 2008 197 Lecture 16 • Lossy Coding • Rate Distortion – Bernoulli Scurce – Gaussian Source • Channel/Source Coding Duality Jan 2008 198 Lossy Coding x1:n f(x1:n)∈1:2nR Encoder f( ) Decoder g( ) Distortion function: d ( x, xˆ ) ≥ 0 – sequences: d (x, xˆ ) = n −1 ⎧0 x = xˆ ⎩1 x ≠ xˆ (ii) d H ( x, xˆ ) = ⎨ – examples: (i) d S ( x, xˆ ) = ( x − xˆ ) 2 Δ n ∑ d ( x , xˆ ) i =1 x^1:n i i Distortion of Code fn(),gn(): D = Ex∈X d (x , xˆ ) = E d (x , g ( f ( x )) ) n Rate distortion pair (R,D) is achievable for source X if ∃ a sequence fn() and gn() such that lim Ex∈X d ( x , g n ( f n ( x ))) ≤ D n →∞ n Jan 2008 199 Rate Distortion Function Rate Distortion function for {xi} where p(x) is known is R( D) = inf R such that (R,D) is achievable Theorem: R ( D ) = min I (x ; xˆ ) over all p (x , xˆ ) such that : p (x ) is correct (b) Ex ,xˆ d (x , xˆ ) ≤ D (a) – this expression is the Information Rate Distortion function for X We will prove this next lecture If D=0 then we have R(D)=I(x ; x)=H(x) Jan 2008 200 R(D) bound for Bernoulli Source Bernoulli: X = [0,1], pX = [1–p , p] assume p ≤ ½ d ( x, xˆ ) = x ⊕ xˆ 1 0.8 – If D≥p, R(D)=0 since we can set g( )≡0 – For D < p ≤ ½, if E d ( x, xˆ ) ≤ D then p=0.5 0.4 0 0 = H ( p ) − H (x ⊕ xˆ | xˆ ) ≥ H ( p ) − H (x ⊕ xˆ ) Hence R(D) ≥ H(p) – H(D) 0.6 0.2 I (x ; xˆ ) = H (x ) − H (x | xˆ ) ≥ H ( p) − H ( D) R(D), p=0.2,0.5 – Hamming Distance: p=0.2 0.1 0.2 0.3 0.4 0.5 D ⊕ is one-to-one Conditioning reduces entropy p ( x ⊕ xˆ ) ≤ D and so for D ≤ ½ H ( x ⊕ xˆ ) ≤ H ( D ) as H ( p ) monotonic Jan 2008 201 R(D) for Bernoulli source We know 
optimum satisfies R(D) ≥ H(p) – H(D) – We show we can find a p(xˆ , x ) that attains this. – Peculiarly, we consider a channel with xˆ as the input and error probability D Now choose r to give x the correct 1–r 0 1–D 0 D probabilities: ^ x r (1 − D ) + (1 − r ) D = p r 1 −1 ⇒ r = ( p − D)(1 − 2 D ) Now I (x ; xˆ ) = H (x ) − H (x | xˆ ) = H ( p ) − H ( D) and p (x ≠ xˆ ) = D D 1–D 1 1–p x p Hence R(D) = H(p) – H(D) Jan 2008 202 R(D) bound for Gaussian Source • Assume X ~ N (0, σ 2 ) and d ( x, xˆ ) = ( x − xˆ ) 2 • Want to minimize I (x ; xˆ ) subject to E (x − xˆ ) 2 ≤ D I (x ; xˆ ) = h(x ) − h(x | xˆ ) = ½ log 2πeσ 2 − h(x − xˆ | xˆ ) ≥ ½ log 2πeσ 2 − h(x − xˆ ) Translation Invariance Conditioning reduces entropy ≥ ½ log 2πeσ 2 − ½ log(2πe Var (x − xˆ )) ≥ ½ log 2πeσ 2 − ½ log 2πeD ⎛ σ2 ⎞ ˆ I (x ; x ) ≥ max⎜⎜ ½ log ,0 ⎟⎟ D ⎝ ⎠ Gauss maximizes entropy for given covariance require Var (x − xˆ ) ≤ E (x − xˆ ) 2 ≤ D I(x ;y) always positive Jan 2008 203 R(D) for Gaussian Source To show that we can find a p ( xˆ , x) that achieves the bound, we construct a test channel that introduces distortion D<σ2 z ~ N(0,D) x^ ~ N(0,σ2–D) x ~ N(0,σ2) + I (x ; xˆ ) = h(x ) − h(x | xˆ ) 3.5 = ½ log 2πeσ 2 − h(x − xˆ | xˆ ) = ½ log 2πeσ − h(z | xˆ ) 3 2.5 = ½ log ⎛σ ⎞ ⇒ D( R) = ⎜ R ⎟ ⎝2 ⎠ σ2 R(D) 2 D achievable 2 1.5 1 2 0.5 impossible 0 0 0.5 1 D/σ 1.5 2 Jan 2008 204 Lloyd Algorithm Problem: Find optimum quantization levels for Gaussian pdf a. b. Bin boundaries are midway between quantization levels Each quantization level equals the mean value of its own bin Lloyd algorithm: Pick random quantization levels then apply conditions (a) and (b) in turn until convergence. 0.2 2 0.15 1 MSE Quantization levels 3 0 0.1 -1 0.05 -2 -3 0 10 1 10 Iteration 10 2 Solid lines are bin boundaries. Initial levels uniform in [–1,+1] . 0 0 10 1 10 Iteration Best mean sq error for 8 levels = 0.0345σ2. Predicted D(R) = (σ/8)2 = 0.0156σ2 10 2 Jan 2008 205 Vector Quantization To get D(R), you have to quantize many values together – True even if the values are independent D = 0.035 D = 0.028 2.5 2.5 2 2 1.5 1.5 1 1 0.5 0.5 0 0 0.5 1 1.5 2 0 0 2.5 0.5 1 1.5 2 2.5 Two gaussian variables: one quadrant only shown – Independent quantization puts dense levels in low prob areas – Vector quantization is better (even more so if correlated) Jan 2008 206 Multiple Gaussian Variables • Assume x1:n are independent gaussian sources with different variances. How should we apportion the available total distortion between the sources? • Assume x i ~ N (0,σ i2 ) and d (x, xˆ ) = n −1 (x − xˆ ) (x − xˆ ) ≤ D T n I (x 1:n ; xˆ1:n ) ≥ ∑ I (x i ; xˆ i ) Mut Info Independence Bound for independent xi i =1 ⎛ σ i2 ⎞ ≥ ∑ R ( Di ) = ∑ max⎜⎜ ½ log ,0 ⎟⎟ D i =1 i =1 i ⎝ ⎠ n n ⎛ σ i2 ⎞ We must find the Di that minimize ∑ max⎜⎜ ½ log ,0 ⎟ Di ⎟⎠ i =1 ⎝ n R(D) for individual Gaussian Jan 2008 207 Reverse Waterfilling n ⎛ σ i2 ⎞ Minimize ∑ max⎜⎜ ½ log ,0 ⎟⎟ subject to ∑ Di ≤ nD D i =1 i =1 i ⎝ ⎠ n Use a lagrange multiplier: n n σ2 J = ∑ ½ log i + λ ∑ Di i =1 Di i =1 ∂J −1 = −½ Di−1 + λ = 0 ⇒ Di = ½λ = D0 ∂Di n ∑D i =1 i = nD0 = nD ⇒ D0 = D Choose Ri for equal distortion • If σi2 <D then set Ri=0 and increase D0 to maintain the average distortion equal to D • If xi are correlated then reverse waterfill the eigenvectors of the correlation matrix Ri = ½ log σ i2 D ½logσi2 ½logD R1 X1 R2 R3 X2 X3 ½logσi2 ½logD0 R1 R =0 2 R3 X1 X3 ½logD X2 Jan 2008 208 Channel/Source Coding Duality • Channel Coding – Find codes separated enough to give non-overlapping output images. 
– Image size = channel noise – The maximum number (highest rate) is when the images just fill the sphere. Noisy channel Sphere Packing Lossy encode • Source Coding – Find regions that cover the sphere – Region size = allowed distortion – The minimum number (lowest rate) is when they just don’t overlap Sphere Covering Jan 2008 209 Channel Decoder as Source Coder z1:n ~ N(0,D) w∈2nR Encoder y1:n ~ N(0,σ2–D) ( ( ^ ∈2nR w x1:n ~ N(0,σ2) + Decoder ) ) • For R ≅ C = ½ log 1 + σ 2 − D D −1 , we can find a channel encoder/decoder so that p(wˆ ≠ w ) < ε and E (x i − y i ) 2 = D • Reverse the roles of encoder and decoder. Since p(w ≠ wˆ ) < ε , also p (xˆ ≠ y ) < ε and E (x i − xˆ i ) 2 ≅ E (x i − y i ) 2 = D Z1:n ~ N(0,D) W Encoder Y1:n X1:n ~ N(0,σ2) + Decoder ^ ∈2nR W Encoder ^ X 1:n We have encoded x at rate R=½log(σ2D–1) with distortion D Jan 2008 210 High Dimensional Space In n dimensions – “Vol” of unit hypercube: 1 – “Vol” of unit-diameter hypersphere: ⎧π ½ n −½ (½ n − ½ )! / n! n odd Vn = ⎨ ½ n −n n even ⎩ π 2 /(½ n)! – “Area” of unit-diameter hypersphere: d n An = (2r ) V n = 2nVn dr r =½ n Vn An 1 1 2 2 0.79 3.14 3 0.52 3.14 4 0.31 2.47 10 2.5e–3 5e–2 100 1.9e–70 3.7e–68 – >63% of Vn is in shell (1 − n −1 ) R ≤ r ≤ R ( Proof : 1 - n-1 ) n < → e −1 = 0.3679 n →∞ Most of n-dimensional space is in the corners Jan 2008 211 Lecture 17 • Rate Distortion Theorem Jan 2008 212 Review x1:n Encoder fn( ) f(x1:n)∈1:2nR Decoder gn( ) ^ x1:n Rate Distortion function for x whose px(x) is known is R ( D ) = inf R such that ∃f n , g n with lim Ex∈X d ( x , xˆ ) ≤ D Rate Distortion Theorem: n n →∞ R ( D) = min I (x ; xˆ ) over all p ( xˆ | x) such that Ex ,xˆ d (x , xˆ ) ≤ D 3.5 We will prove this theorem for discrete X and bounded d(x,y)≤dmax 3 R(D) 2.5 2 achievable 1.5 1 R(D) curve depends on your choice of d(,) 0.5 0 0 impossible 0.5 1 D/σ 2 1.5 Jan 2008 213 Rate Distortion Bound Suppose we have found an encoder and decoder at rate R0 with expected distortion D for independent xi (worst case) We want to prove that R0 ≥ R( D) = R(E d ( x; xˆ ) ) – We show first that R0 ≥ n −1 ∑ I (x i ; xˆ i ) i – We know that I (x i ; xˆ i ) ≥ R (E d (x i ; xˆ i ) ) Defn of R(D) – and use convexity to show n −1 ∑ R(E d (x i ; xˆ i ) ) ≥ R(E d ( x; xˆ ) ) = R( D) i We prove convexity first and then the rest Jan 2008 214 Convexity of R(D) If p1 ( xˆ | x ) and p2 ( xˆ | x ) are 3.5 3 associated with ( D1 , R1 ) and ( D2 , R2 ) Then E pλ d ( x, xˆ ) = λD1 + (1 − λ ) D2 = Dλ R(D) on the R ( D ) curve we define pλ ( xˆ | x ) = λp1 ( xˆ | x ) + (1 − λ ) p2 ( xˆ | x ) 2.5 Rλ 2 (D1,R1) 1.5 1 0.5 (D2,R2) 0 0 R ( Dλ ) ≤ I pλ (x ; xˆ ) ≤ λI p1 (x ; xˆ ) + (1 − λ ) I p2 (x ; xˆ ) = λR( D1 ) + (1 − λ ) R( D2 ) Dλ 0.5 1 1.5 D/σ2 R( D) = min I (x ; xˆ ) p ( xˆ |x ) I (x ; xˆ ) convex w.r.t. p( xˆ | x) p1 and p2 lie on the R(D) curve Jan 2008 215 Proof that R ≥ R(D) nR0 ≥ H (xˆ1:n ) = H (xˆ1:n ) − H (xˆ1:n | x 1:n ) Uniform bound; H (xˆ | x ) = 0 = I (xˆ1:n ; x 1:n ) Definition of I(;) n ≥ ∑ I (x i ; xˆ i ) xi indep: Mut Inf Independence Bound i =1 n n ≥ ∑ R(E d (x i ; xˆ i ) ) = n∑ n −1 R(E d (x i ; xˆ i ) ) i =1 definition of R i =1 ⎛ −1 n ⎞ ≥ nR⎜ n ∑ E d (x i ; xˆ i ) ⎟ = nR(E d (x 1:n ; xˆ1:n ) ) ⎝ i =1 ⎠ ≥ nR(D) convexity defn of vector d( ) original assumption that E(d) ≤ D and R(D) monotonically decreasing Jan 2008 216 Rate Distortion Achievability x1:n Encoder fn( ) f(x1:n)∈1:2nR Decoder gn( ) ^ x1:n We want to show that for any D, we can find an encoder and decoder that compresses x1:n to nR(D) bits. 
• pX is given • Assume we know the p( xˆ | x ) that gives I (x ; xˆ ) = R( D) • Random Decoder: Choose 2nR random xˆ i ~ p xˆ – There must be at least one code that is as good as the average • Encoder: Use joint typicality to design – We show that there is almost always a suitable codeword First define the typical set we will use, then prove two preliminary results. Jan 2008 217 Distortion Typical Set Distortion Typical: (x i , xˆ i )∈ X × Xˆ drawn i.i.d. ~p(x,xˆ) { J d( n,ε) = x, xˆ ∈ X n × Xˆ n : − n −1 log p (x) − H (x ) < ε , − n −1 log p (xˆ ) − H (xˆ ) < ε , − n −1 log p(x, xˆ ) − H (x , xˆ ) < ε d (x, xˆ ) − E d (x , xˆ ) < ε } new condition Properties: 1. Indiv p.d.: x, xˆ ∈ J d( n,ε) ⇒ log p (x, xˆ ) = −nH (x , xˆ ) ± nε 2. Total Prob: p x, xˆ ∈ J d( n,ε) > 1 − ε ( ) for n > N ε weak law of large numbers; d (x i , xˆ i ) are i.i.d. Jan 2008 218 Conditional Probability Bound Lemma: Proof: x, xˆ ∈ J d( n,ε) ⇒ p(xˆ ) ≥ p (xˆ | x )2 − n (I ( x ;x ) +3ε ) ˆ p (xˆ , x ) p(x ) p(xˆ , x ) = p(xˆ ) p (xˆ ) p(x ) p (xˆ | x ) = take max of top and min of bottom 2 − n ( H ( x , x ) −ε ) ≤ p (xˆ ) − n ( H ( x ) +ε ) − n (H ( xˆ ) +ε ) 2 2 ˆ = p(xˆ )2 n (I ( x ;x ) +3ε ) ˆ bounds from defn of J defn of I Jan 2008 219 Curious but necessary Inequality Lemma: u, v ∈ [0,1], m > 0 ⇒ (1 − uv) m ≤ 1 − u + e − vm Proof: u=0: e − vm ≥ 0 ⇒ (1 − 0) m ≤ 1 − 0 + e − vm u=1: Define f (v) = e − v − 1 + v ⇒ f ′(v) = 1 − e − v f (0) = 0 and f ′(v) > 0 for v > 0 ⇒ Hence for v ∈ [0,1], 0 ≤ 1 − v ≤ e − v 0<u<1: f (v) ≥ 0 for v ∈ [0,1] ⇒ (1 − v ) ≤ e − vm m Define g v (u ) = (1 − uv) m ⇒ g v′′( x) = m(m − 1)v 2 (1 − uv) n − 2 ≥ 0 ⇒ g v (u ) convex (1 − uv) m = g v (u ) ≤ (1 − u ) g v (0) + ug v (1) for u,v ∈ [0,1] convexity for u,v ∈ [0,1] = (1 − u )1 + u (1 − v) m ≤ 1 − u + ue − vm ≤ 1 − u + e − vm Jan 2008 220 Achievability of R(D): preliminaries x1:n Encoder fn( ) f(x1:n)∈1:2nR Decoder gn( ) ^ x1:n • Choose D and find a p( xˆ | x) such that I (x ; xˆ ) = R( D); E d ( x, xˆ ) ≤ D Choose δ > 0 and define p xˆ = { p ( xˆ ) = ∑ p ( x) p( xˆ | x) } x • Decoder: For each w ∈1 : 2 choose g n ( w) = xˆ w drawn i.i.d. ~ p nxˆ nR • Encoder: f n (x) = min w such that (x,xˆ w ) ∈ J d( n,ε) else 1 if no such w • Expected Distortion: D = Ex , g d (x, xˆ ) – over all input vectors x and all random decode functions, g – for large n we show D = D + δ code so there must be one good Jan 2008 221 Expected Distortion We can divide the input vectors x into two categories: a) if ∃w such that (x,xˆ w ) ∈ J d( n,ε) then d (x, xˆ w ) < D + ε since E d (x, xˆ ) ≤ D b) if no such w exists we must have d ( x, xˆ w ) < d max since we are assuming that d(,) is bounded. Supose the probability of this situation is Pe. 
Hence D = Ex , g d (x, xˆ ) ≤ (1 − Pe )( D + ε ) + Pe d max ≤ D + ε + Pe d max We need to show that the expected value of Pe is small Jan 2008 222 Error Probability Define the set of valid inputs for code g V ( g ) = { x : ∃w with ( x, g ( w)) ∈ J d( n,ε) } We have Pe = ∑ p( g ) g ∑ p(x ) = ∑ p(x ) ∑ p( g ) x∉V ( g ) x g:x∉V ( g ) Define K (x, xˆ ) = 1 if (x, xˆ ) ∈ J d( n,ε) else 0 Prob that a random xˆ does not match x is 1 − ∑ p (xˆ ) K (x, xˆ ) xˆ ⎛ ⎞ Prob that an entire code does not match is ⎜1 − ∑ p (xˆ ) K (x, xˆ ) ⎟ xˆ ⎝ ⎠ ⎛ ⎞ Hence Pe = ∑ p (x)⎜1 − ∑ p (xˆ ) K (x, xˆ ) ⎟ x xˆ ⎝ ⎠ 2 nR 2 nR Jan 2008 223 Achievability for average code (n) − n ( I ( x ; x ) + 3ε ) Since x, xˆ ∈ J d ,ε ⇒ p(xˆ ) ≥ p (xˆ | x )2 ˆ 2 nR ⎛ ⎞ Pe = ∑ p(x)⎜1 − ∑ p(xˆ ) K (x, xˆ ) ⎟ x xˆ ⎝ ⎠ 2 nR ⎛ ⎞ ˆ ≤ ∑ p(x)⎜1 − ∑ p (xˆ | x) K (x, xˆ ) 2 − n (I ( x ;x ) +3ε ) ⎟ x xˆ ⎝ ⎠ Using (1 − uv ) m ≤ 1 − u + e − vm with u = ∑ p ( xˆ | x ) K ( x , xˆ ); v = 2 − nI ( x ; x ) − 3 n ε ; ˆ m = 2 nR xˆ ( ) ⎛ ⎞ ˆ ≤ ∑ p (x)⎜1 − ∑ p (xˆ | x) K (x, xˆ ) + exp − 2 − n (I ( x ;x ) +3ε )2 nR ⎟ x xˆ ⎝ ⎠ Note : 0 ≤ u , v ≤ 1 as required Jan 2008 224 Achievability for average code ( ) ⎛ ⎞ ˆ Pe ≤ ∑ p (x)⎜1 − ∑ p (xˆ | x) K (x, xˆ ) + exp − 2 − n (I ( X ; X ) +3ε )2 nR ⎟ x xˆ ⎝ ⎠ ( ( take out terms not involving x )) = 1 + exp − 2 − n (I ( X ; X ) +3ε )2 nR − ∑ p (x) p (xˆ | x) K (x, xˆ ) ˆ ( x , xˆ ˆ = 1 − ∑ p (x, xˆ ) K (x, xˆ ) + exp − 2 n (R − I ( X ; X ) −3ε ) { x , xˆ } ( = P (x, xˆ ) ∉ J d( n,ε) + exp − 2 n (R − I ( X ; X ) −3ε ) ˆ ) ) both terms → 0 as n → ∞ provided nR > I ( x , xˆ ) + 3ε ⎯n⎯ ⎯→ 0 →∞ Hence ∀δ > 0, D = Ex , g d (x, xˆ ) can be made ≤ D + δ Jan 2008 225 Achievability Since ∀δ > 0, D = Ex , g d (x, xˆ ) can be made ≤ D + δ there must be at least one g with Ex d (x, xˆ ) ≤ D + δ Hence (R,D) is achievable for any R>R(D) x1:n Encoder fn( ) f(x1:n)∈1:2nR Decoder gn ( ) ^ x1:n that is lim E X 1:n ( x, xˆ ) ≤ D n →∞ In fact a stronger result is true: ∀δ > 0, D and R > R ( D), ∃f n , g n with p (d (x, xˆ ) ≤ D + δ ) → 1 n →∞ Jan 2008 226 Lecture 18 • Revision Lecture Jan 2008 227 Summary (1) • Entropy: H (x ) = ∑ p ( x) × − log 2 p ( x) = E − log 2 ( p X ( x)) x∈X 0 ≤ H (x ) ≤ log X – Bounds: – Conditioning reduces entropy: H (y | x ) ≤ H (y ) n n H (x 1:n ) = ∑ H (x i | x 1:i −1 ) ≤ ∑ H (x i ) – Chain Rule: i =1 i =1 n • Relative Entropy: H (x 1:n | y 1:n ) ≤ ∑ H (x i | y i ) i =1 D(p || q ) = Ep log( p (x ) / q (x ) ) ≥ 0 Jan 2008 228 Summary (2) H(x ,y) • Mutual Information: H(x |y) I(x ;y) H(y |x) H(x) I (y ; x ) = H (y ) − H (y | x ) = H (x ) + H (y ) − H (x , y ) = D (p x ,y || p x p y ) H(y) – Positive and Symmetrical: I (x ; y ) = I (y ; x ) ≥ 0 – x, y indep ⇔ H (x , y ) = H (y ) + H (x ) ⇔ I (x ; y ) = 0 n – Chain Rule: I (x 1:n ; y ) = ∑ I (x i ; y | x 1:i −1 ) i =1 n x i independent ⇒ I (x 1:n ; y 1:n ) ≥ ∑ I (x i ; y i ) i =1 n p (y i | x 1:n ; y 1:i −1 ) = p (y i | x i ) ⇒ I (x 1:n ; y 1:n ) ≤ ∑ I (x i ; y i ) i =1 Jan 2008 229 Summary (3) • Convexity: f’’ (x) ≥ 0 ⇒ f(x) convex ⇒ Ef(x) ≥ f(Ex) – H(p) concave in p – I(x ; y) concave in px for fixed py|x – I(x ; y) convex in py|x for fixed px • Markov: x → y → z ⇔ p ( z | x, y ) = p ( z | y ) ⇔ I ( x ; z | y ) = 0 ⇒ I(x ; y) ≥ I(x ; z) and I(x ; y) ≥ I(x; y | z) • Fano: H (x | y ) − 1 log(| X | −1) −1 H ( X ) = lim n H (x 1:n ) x → y → xˆ ⇒ p(xˆ ≠ x ) ≥ • Entropy Rate: n →∞ – Stationary process – Markov Process: – Hidden Markov: H ( X ) ≤ H (x n | x 1:n −1 ) H ( X ) = lim H (x n | x n −1 ) = as n → ∞ = if stationary n →∞ H (y n | y 1:n −1 , x 1 ) ≤ H (Y) ≤ H 
(yn | y1:n−1), with the two bounds meeting H(Y) as n → ∞

Jan 2008 230 Summary (4)
• Kraft: uniquely decodable ⇒ Σi D^(−li) ≤ 1 ⇒ ∃ a prefix code with the same lengths
• Average length: uniquely decodable ⇒ LC = E l(x) ≥ HD(x)
• Shannon–Fano: top-down 50% splits. LSF ≤ HD(x) + 1
• Shannon: lx = ⌈−logD p(x)⌉ gives LS ≤ HD(x) + 1
• Huffman: bottom-up design. Optimal. LH ≤ HD(x) + 1
– Designing with the wrong probabilities q ⇒ penalty of D(p||q)
– Long blocks disperse the 1-bit overhead
• Arithmetic coding: C(x^N) = Σ_{x̃^N < x^N} p(x̃^N)
– Long blocks reduce the 2-bit overhead
– Efficient algorithm without calculating all possible probabilities
– Can have adaptive probabilities

Jan 2008 231 Summary (5)
• Typical set
– Individual prob: x ∈ Tε^(n) ⇒ log p(x) = −nH(x) ± nε
– Total prob: p(x ∈ Tε^(n)) > 1 − ε for n > Nε
– Size: (1 − ε)2^(n(H(x)−ε)) < |Tε^(n)| ≤ 2^(n(H(x)+ε)) for n > Nε
– No other high-probability set can be much smaller
• Asymptotic Equipartition Principle
– Almost all event sequences are equally surprising

Jan 2008 232 Summary (6)
• DMC channel capacity: C = max_px I(x ; y)
• Coding theorem
– Can achieve capacity: random codewords, joint typical decoding
– Cannot beat capacity: Fano
• Feedback doesn’t increase capacity but simplifies the coder
• Joint source-channel coding doesn’t increase capacity

Jan 2008 233 Summary (7)
• Differential entropy: h(x) = E −log fx(x)
– Not necessarily positive
– h(x+a) = h(x), h(ax) = h(x) + log|a|, h(x|y) ≤ h(x)
– I(x ; y) = h(x) + h(y) − h(x,y) ≥ 0, D(f||g) = E log(f/g) ≥ 0
• Bounds:
– Finite range: the uniform distribution has max h(x) = log(b−a)
– Fixed covariance: the Gaussian has max h(x) = ½ log((2πe)^n |K|)
• Gaussian channel
– Discrete time: C = ½ log(1 + P/N)
– Bandlimited: C = W log(1 + P/(N0W))
– For constant C: Eb/N0 = P/(C N0) = (W/C)(2^(C/W) − 1) → ln 2 = −1.6 dB as W → ∞
– Feedback: adds at most ½ bit per transmission for coloured noise

Jan 2008 234 Summary (8)
• Parallel Gaussian channels: total power constraint Σi Pi = P
– White noise: waterfilling: Pi = max(P0 − Ni, 0)
– Correlated noise: waterfill on the noise eigenvectors
• Rate distortion: R(D) = min I(x ; x̂) over p(x̂|x) s.t. E d(x,x̂) ≤ D
– Bernoulli source with Hamming d: R(D) = max(H(px) − H(D), 0)
– Gaussian source with mean-square d: R(D) = max(½ log(σ²/D), 0)
– Can encode at rate R: random decoder, joint typical encoder
– Can’t encode below rate R: independence bound
• Lloyd algorithm: iterative optimal vector quantization
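Finally, a minimal sample-based sketch of the Lloyd algorithm mentioned in Summary (8) (my own illustration: an 8-level scalar quantizer of a unit-variance Gaussian; the sample size, iteration count and initial levels are arbitrary choices). The resulting mean squared error should come out near 0.035σ², the figure quoted on the Lloyd-algorithm slide, noticeably above the rate-distortion value (σ/8)² ≈ 0.0156σ² that only vector quantization of long blocks can approach.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(200_000)            # unit-variance Gaussian source samples
levels = np.linspace(-1.0, 1.0, 8)          # initial quantization levels

for _ in range(100):                        # Lloyd iterations
    edges = 0.5 * (levels[:-1] + levels[1:])   # (a) boundaries midway between levels
    bins = np.digitize(x, edges)               # nearest-level index for each sample
    levels = np.array([x[bins == k].mean()     # (b) each level -> mean of its own bin
                       for k in range(len(levels))])

edges = 0.5 * (levels[:-1] + levels[1:])
mse = ((x - levels[np.digitize(x, edges)]) ** 2).mean()
print("levels:", levels.round(3))
print("MSE ~", round(mse, 4), "sigma^2;  rate-distortion bound (1/8)^2 =", (1 / 8) ** 2)
```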