Chapter 6: Entropy and Shannon's First Theorem

Information

A quantitative measure of the amount of information an event represents.
I(p) = the amount of information in the occurrence of an event of probability p.

Axioms:
A. I(p) ≥ 0 for any event of probability p.
B. I(p1·p2) = I(p1) + I(p2) when p1 and p2 are independent events (the Cauchy functional equation).
C. I(p) is a continuous function of p.

Existence: I(p) = log(1/p) satisfies the axioms.
Units of information: base 2 = a bit, base e = a nat, base 10 = a Hartley.

6.2 Uniqueness

Suppose I′(p) satisfies the axioms. Since I′(p) ≥ 0, take any 0 < p0 < 1 and use the base k = (1/p0)^{1/I′(p0)}. Then k^{I′(p0)} = 1/p0, and hence log_k(1/p0) = I′(p0). Now any z ∈ (0,1) can be written as z = p0^r with r ∈ R⁺ (namely r = log_{p0} z). The Cauchy functional equation implies I′(p0^n) = n·I′(p0) and, for m ∈ Z⁺, I′(p0^{1/m}) = (1/m)·I′(p0), which gives I′(p0^{n/m}) = (n/m)·I′(p0); hence by continuity I′(p0^r) = r·I′(p0). Therefore

  I′(z) = r \log_k \frac{1}{p_0} = \log_k \frac{1}{p_0^r} = \log_k \frac{1}{z}.

Note: in this proof we introduce an arbitrary p0, show how any z relates to it, and then eliminate the dependency on that particular p0.

6.2 Entropy

The average amount of information received on a per-symbol basis from a source S = {s1, …, sq} of symbols, where si has probability pi; it measures the information rate. In radix r, when the symbol probabilities are independent:

  H_r(S) = \sum_{i=1}^{q} p_i \log_r \frac{1}{p_i} = \log_r \prod_{i=1}^{q} \left( \frac{1}{p_i} \right)^{p_i}

(the weighted arithmetic mean of the information equals the information of the weighted geometric mean).
• Entropy is the amount of information in the probability distribution.

Alternative approach: consider a long message of N symbols from S = {s1, …, sq} with probabilities p1, …, pq. You expect si to appear N·pi times, and the probability of this typical message is

  P = \prod_{i=1}^{q} p_i^{N p_i},   whose information is   \log \frac{1}{P} = N \sum_{i=1}^{q} p_i \log \frac{1}{p_i} = N \cdot H(S).

6.3 The function f(p) = p ln(1/p)

(The analysis works for any base, not just e.)

  f′(p) = (−p ln p)′ = −p·(1/p) − ln p = −1 + ln(1/p)
  f″(p) = −1/p < 0 for p ∈ (0,1), so f is concave down.
  f′(1/e) = 0, f(1/e) = 1/e, f′(1) = −1, f(1) = 0, f′(0⁺) = +∞, and by L'Hôpital

  \lim_{p \to 0^+} f(p) = \lim_{p \to 0^+} \frac{\ln(1/p)}{1/p} = \lim_{p \to 0^+} \frac{-1/p}{-1/p^2} = \lim_{p \to 0^+} p = 0.

[Figure: plot of f(p) on (0,1), rising from 0 to its maximum value 1/e at p = 1/e and returning to 0 at p = 1.]

6.3 Basic information about the logarithm function

Tangent line to y = ln x at x = 1: y − ln 1 = (ln x)′|_{x=1}·(x − 1), i.e. y = x − 1.
(ln x)″ = (1/x)′ = −1/x² < 0, so ln x is concave down.
[Figure: ln x lying below its tangent line y = x − 1, touching it at x = 1.]
Conclusion: ln x ≤ x − 1, with equality only at x = 1.

6.4 Fundamental Gibbs inequality

Let x1, …, xq and y1, …, yq be two probability distributions (Σ xi = 1, Σ yi = 1). Then

  \sum_{i=1}^{q} x_i \log \frac{y_i}{x_i} \le 0,   with equality only when x_i = y_i for all i,

since

  \sum_{i=1}^{q} x_i \log \frac{y_i}{x_i} \le \sum_{i=1}^{q} x_i \left( \frac{y_i}{x_i} - 1 \right) = \sum_{i=1}^{q} y_i - \sum_{i=1}^{q} x_i = 1 - 1 = 0.

• Minimum entropy occurs when one pi = 1 and all the others are 0.
• Maximum entropy occurs when? Consider Gibbs with the uniform distribution yi = 1/q:

  \sum_{i=1}^{q} p_i \log \frac{1/q}{p_i} \le 0 \implies H(S) = \sum_{i=1}^{q} p_i \log \frac{1}{p_i} \le \sum_{i=1}^{q} p_i \log q = \log q.

• Hence H(S) ≤ log q, and equality occurs only when every pi = 1/q.

6.4 Entropy Examples

  S = {s1},        p1 = 1:              H(S) = 0   (no information)
  S = {s1, s2},    p1 = p2 = ½:         H2(S) = 1  (1 bit per symbol)
  S = {s1, …, sr}, p1 = … = pr = 1/r:   Hr(S) = 1, but H2(S) = log2 r.

• Run-length coding (for instance, in binary predictive coding): let q be the probability of a 1 and p = 1 − q the probability of a 0. Then H2(S) = p log2(1/p) + q log2(1/q), and as q → 0 the term q log2(1/q) dominates (compare the slopes). Cf. the average run length 1/q and the average number of bits needed to code a run, log2(1/q): so q log2(1/q) is the average amount of information per bit of the original code.
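These quantities are easy to check numerically. Below is a minimal Python sketch (illustrative, not part of the original notes; the helper name entropy is mine) that computes H_r(S), confirms H(S) ≤ log q with equality for the uniform distribution, and shows the q·log2(1/q) term dominating H2(S) as q → 0 in the run-length example.

```python
import math

def entropy(probs, base=2):
    """H_r(S) = sum of p_i * log_r(1/p_i); terms with p_i = 0 contribute 0 (lim p log 1/p = 0)."""
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

# Maximum entropy: for q equally likely symbols, H(S) = log q.
q = 6
print(entropy([1.0 / q] * q), math.log2(q))   # both ~2.585

# Entropy examples from the notes:
print(entropy([1.0]))                         # 0.0  (no information)
print(entropy([0.5, 0.5]))                    # 1.0  (1 bit per symbol)

# Run-length coding: q1 = P(1); as q1 -> 0 the term q1*log2(1/q1) dominates H2(S).
for q1 in (0.5, 0.1, 0.01, 0.001):
    print(q1, entropy([1 - q1, q1]), q1 * math.log2(1.0 / q1))
```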
Entropy as a Lower Bound for Average Code Length

Given an instantaneous code with lengths li in radix r, let

  K = \sum_{i=1}^{q} \frac{1}{r^{l_i}} \le 1 \ \text{(Kraft)}, \qquad Q_i = \frac{r^{-l_i}}{K}, \qquad \sum_{i=1}^{q} Q_i = 1.

By Gibbs, \sum_i p_i \log_r (Q_i / p_i) \le 0; applying log(ab) = log a + log b,

  H_r(S) = \sum_{i=1}^{q} p_i \log_r \frac{1}{p_i}
         \le \sum_{i=1}^{q} p_i \log_r \frac{1}{Q_i}
         = \sum_{i=1}^{q} p_i (\log_r K + l_i \log_r r)
         = \log_r K + \sum_{i=1}^{q} p_i l_i .

Since K ≤ 1, log_r K ≤ 0, and hence H_r(S) ≤ L. By the McMillan inequality, this holds for all uniquely decodable codes. Equality occurs when K = 1 (the decoding tree is complete) and p_i = r^{-l_i}.

6.5 Shannon-Fano Coding

The simplest variable-length method. It is less efficient than Huffman coding, but it lets one code symbol si with a length li obtained directly from the probability pi:

  l_i = \left\lceil \log_r \frac{1}{p_i} \right\rceil, \quad\text{so}\quad \log_r \frac{1}{p_i} \le l_i < \log_r \frac{1}{p_i} + 1,
  \quad\text{i.e.}\quad \frac{1}{p_i} \le r^{l_i} < \frac{r}{p_i}, \quad\text{i.e.}\quad p_i \ge \frac{1}{r^{l_i}} > \frac{p_i}{r} .

Summing this inequality over i:

  1 = \sum_{i=1}^{q} p_i \ge \sum_{i=1}^{q} \frac{1}{r^{l_i}} > \sum_{i=1}^{q} \frac{p_i}{r} = \frac{1}{r} .

The Kraft inequality is satisfied, therefore there is an instantaneous code with these lengths.

6.6

Also, summing li < log_r(1/pi) + 1 multiplied by pi gives

  H_r(S) \le \sum_{i=1}^{q} p_i l_i = L < H_r(S) + 1 .

Example: p's: ¼, ¼, ⅛, ⅛, ⅛, ⅛; l's: 2, 2, 3, 3, 3, 3; K = 1; H2(S) = 2.5; L = 5/2.
[Figure: the complete binary decoding tree for these six codewords.]

6.6 The Entropy of Code Extensions

Recall: the nth extension of a source S = {s1, …, sq} with probabilities p1, …, pq is the set of symbols

  T = S^n = \{ s_{i_1} \cdots s_{i_n} : s_{i_j} \in S, \ 1 \le j \le n \},

where concatenation plays the role of multiplication: ti = s_{i1} ⋯ s_{in} has probability Q_i = p_{i1} ⋯ p_{in}, assuming independent probabilities, with the index i = (i1 − 1, …, in − 1)_q + 1 read as an n-digit number base q. The entropy is

  H(T) = \sum_{i=1}^{q^n} Q_i \log \frac{1}{Q_i}
       = \sum_{i=1}^{q^n} Q_i \log \frac{1}{p_{i_1} \cdots p_{i_n}}
       = \sum_{i=1}^{q^n} Q_i \left( \log \frac{1}{p_{i_1}} + \cdots + \log \frac{1}{p_{i_n}} \right)
       = \sum_{k=1}^{n} \sum_{i=1}^{q^n} Q_i \log \frac{1}{p_{i_k}} .

Consider the kth term:

  \sum_{i=1}^{q^n} Q_i \log \frac{1}{p_{i_k}}
    = \sum_{i_1=1}^{q} \cdots \sum_{i_n=1}^{q} p_{i_1} \cdots p_{i_n} \log \frac{1}{p_{i_k}}
    = \left( \sum_{i_k=1}^{q} p_{i_k} \log \frac{1}{p_{i_k}} \right)
      \sum_{i_1, \ldots, \widehat{i_k}, \ldots, i_n} p_{i_1} \cdots \widehat{p_{i_k}} \cdots p_{i_n}
    = H(S) \cdot 1 = H(S),

since each p_{i1} ⋯ \widehat{p_{ik}} ⋯ p_{in} is just a probability in the (n−1)st extension, and adding them all up gives 1. Hence

  H(S^n) = n·H(S).

Therefore the average Shannon-Fano code length Ln for T satisfies

  H(T) \le L_n < H(T) + 1 \implies n H(S) \le L_n < n H(S) + 1 \implies H(S) \le \frac{L_n}{n} < H(S) + \frac{1}{n},

and now let n go to infinity: the average length per original symbol approaches the entropy.

6.8 Extension Example

S = {s1, s2}, p1 = 2/3, p2 = 1/3, so H2(S) = (2/3)log2(3/2) + (1/3)log2 3 ≈ 0.9182958….
Huffman: s1 = 0, s2 = 1; average coded length = (2/3)·1 + (1/3)·1 = 1.
Shannon-Fano: l1 = 1, l2 = 2; average length = (2/3)·1 + (1/3)·2 = 4/3.
2nd extension: p11 = 4/9, p12 = p21 = 2/9, p22 = 1/9.
S-F: l11 = ⌈log2(9/4)⌉ = 2, l12 = l21 = ⌈log2(9/2)⌉ = 3, l22 = ⌈log2 9⌉ = 4.
L_SF(2) = average coded length = (4/9)·2 + (2/9)·3·2 + (1/9)·4 = 24/9 = 2.666… (i.e. 4/3 per original symbol).

In general S^n = (s1 + s2)^n, with probabilities given by the corresponding terms of (p1 + p2)^n. So there are \binom{n}{i} symbols of probability

  \left( \frac{2}{3} \right)^i \left( \frac{1}{3} \right)^{n-i} = \frac{2^i}{3^n},

and the corresponding S-F length is

  \left\lceil \log_2 \frac{3^n}{2^i} \right\rceil = \lceil n \log_2 3 - i \rceil = \lceil n \log_2 3 \rceil - i .

6.9 Extension cont.

  L_{SF}(n) = \sum_{i=0}^{n} \binom{n}{i} \frac{2^i}{3^n} \left( \lceil n \log_2 3 \rceil - i \right)
            = \lceil n \log_2 3 \rceil \frac{1}{3^n} \sum_{i=0}^{n} \binom{n}{i} 2^i
              - \frac{1}{3^n} \sum_{i=0}^{n} i \binom{n}{i} 2^i
            = \lceil n \log_2 3 \rceil - \frac{2n \cdot 3^{n-1}}{3^n}
            = \lceil n \log_2 3 \rceil - \frac{2n}{3},

using \sum_{i=0}^{n} \binom{n}{i} 2^i = (2+1)^n = 3^n and, for the second sum, differentiating (2 + x)^n = \sum_{i=0}^{n} \binom{n}{i} 2^i x^{n-i}:

  n (2+x)^{n-1} = \sum_{i=0}^{n} \binom{n}{i} 2^i (n-i) x^{n-i-1};  at x = 1,
  n \cdot 3^{n-1} = n \sum_{i=0}^{n} \binom{n}{i} 2^i - \sum_{i=0}^{n} i \binom{n}{i} 2^i = n \cdot 3^n - \sum_{i=0}^{n} i \binom{n}{i} 2^i,

so \sum_{i=0}^{n} i \binom{n}{i} 2^i = n \cdot 3^n - n \cdot 3^{n-1} = 2n \cdot 3^{n-1}. Hence

  \frac{L_{SF}(n)}{n} = \frac{\lceil n \log_2 3 \rceil}{n} - \frac{2}{3} \longrightarrow \log_2 3 - \frac{2}{3} = H_2(S) \quad\text{as } n \to \infty,

since H2(S) = (2/3)log2(3/2) + (1/3)log2 3 = log2 3 − 2/3.
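To see this convergence numerically, here is a short Python sketch (illustrative only, not from the notes; the helper names are mine). It builds the nth extension of the source p1 = 2/3, p2 = 1/3, assigns Shannon-Fano lengths ⌈log2(1/Q_i)⌉, checks the Kraft inequality, and shows the per-symbol average length approaching H2(S) ≈ 0.918.

```python
import math
from itertools import product

def shannon_fano_lengths(probs, r=2):
    """Shannon-Fano lengths l_i = ceil(log_r(1/p_i)); they always satisfy the Kraft inequality."""
    return [math.ceil(math.log(1.0 / p, r)) for p in probs]

def extension(probs, n):
    """Symbol probabilities of the nth extension S^n (independent symbols)."""
    return [math.prod(t) for t in product(probs, repeat=n)]

p = [2 / 3, 1 / 3]
H = sum(pi * math.log2(1 / pi) for pi in p)        # ~0.9182958
for n in (1, 2, 4, 8, 16):
    Q = extension(p, n)
    L = shannon_fano_lengths(Q)
    assert sum(2.0 ** -l for l in L) <= 1.0        # Kraft inequality
    avg = sum(q * l for q, l in zip(Q, L)) / n     # average code length per original symbol
    print(n, round(avg, 4), round(H, 4))
```

The convergence is slow but guaranteed: by the derivation above, L_SF(n)/n = ⌈n log2 3⌉/n − 2/3, so the excess over H2(S) is at most 1/n.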
Markov Process Entropy

Let p(si | s_{i1} ⋯ s_{im}) be the conditional probability that si follows s_{i1} ⋯ s_{im}. For an mth-order process, think of letting the state be s = (s_{i1}, …, s_{im}). Hence

  I(s_i \mid s) = \log \frac{1}{p(s_i \mid s)}, \qquad H(S \mid s) = \sum_{s_i \in S} p(s_i \mid s) \, I(s_i \mid s).

Now let p(s) be the probability of being in state s. Then

  H(S) = \sum_{s \in S^m} p(s) H(S \mid s)
       = \sum_{s \in S^m} p(s) \sum_{s_i \in S} p(s_i \mid s) I(s_i \mid s)
       = \sum_{s, s_i \in S^{m+1}} p(s, s_i) I(s_i \mid s)
       = \sum_{s, s_i \in S^{m+1}} p(s, s_i) \log \frac{1}{p(s_i \mid s)} .

6.10 Example (second-order binary Markov process)

[Figure: state diagram on the four previous-state pairs (0,0), (0,1), (1,0), (1,1), with transition probabilities 0.8/0.2 out of (0,0) and (1,1) and 0.5/0.5 out of (0,1) and (1,0).]

Equilibrium probabilities: p(0,0) = p(1,1) = 5/14, p(0,1) = p(1,0) = 2/14.

  s_{i1} s_{i2} s_i   p(s_i | s_{i1}, s_{i2})   p(s_{i1}, s_{i2})   p(s_{i1}, s_{i2}, s_i)
  0  0  0             0.8                       5/14                4/14
  0  0  1             0.2                       5/14                1/14
  0  1  0             0.5                       2/14                1/14
  0  1  1             0.5                       2/14                1/14
  1  0  0             0.5                       2/14                1/14
  1  0  1             0.5                       2/14                1/14
  1  1  0             0.2                       5/14                1/14
  1  1  1             0.8                       5/14                4/14

  H_2(S) = \sum_{\{0,1\}^3} p(s_{i_1}, s_{i_2}, s_i) \log_2 \frac{1}{p(s_i \mid s_{i_1}, s_{i_2})}
         = 2 \cdot \frac{4}{14} \log_2 \frac{1}{0.8} + 2 \cdot \frac{1}{14} \log_2 \frac{1}{0.2} + 4 \cdot \frac{1}{14} \log_2 \frac{1}{0.5}
         \approx 0.801377 .

6.11 The Fibonacci numbers

Let f0 = 1, f1 = 2, f2 = 3, f3 = 5, f4 = 8, …, defined by f_{n+1} = f_n + f_{n−1}. Then

  \lim_{n \to \infty} \frac{f_{n+1}}{f_n} = \frac{1 + \sqrt{5}}{2} = \varphi,

the golden ratio, a root of the equation x² = x + 1. Use these as the weights for a system of number representation with digits 0 and 1 and no adjacent 1's (allowed because (100)_φ = (011)_φ).

Base Fibonacci Representation Theorem: every number from 0 to f_n − 1 can be uniquely written as an n-bit string with no adjacent ones.

Existence. Basis: n = 0, 0 ≤ i ≤ 0; 0 is represented by the empty string ε. Induction: let 0 ≤ i < f_{n+1}. If i < f_n, we are done by the induction hypothesis. Otherwise f_n ≤ i < f_{n+1} = f_{n−1} + f_n, so 0 ≤ i − f_n < f_{n−1}, which by the induction hypothesis is representable as i − f_n = (b_{n−2} … b_0)_φ with b_j ∈ {0, 1} and ¬(b_j = b_{j+1} = 1). Hence i = (1 0 b_{n−2} … b_0)_φ, which also has no adjacent ones.

Uniqueness. Let i be the smallest number ≥ 0 with two distinct representations, written without leading zeros. The leading bits cannot both be 1, since stripping them would give two distinct representations of the smaller number i − f_{n−1}; so the two representations have different widths. Without loss of generality the first is the wider one, with n bits and b_{n−1} = 1, while the second, padded to n bits, has b′_{n−1} = 0. But then (b′_{n−2} … b′_0)_φ = i ≥ f_{n−1}, which can't be true, since an (n−1)-bit string with no adjacent ones represents at most f_{n−1} − 1.

Base Fibonacci

The golden ratio φ = (1 + √5)/2 is a solution of x² − x − 1 = 0 and equals the limit of the ratio of adjacent Fibonacci numbers.

[Figure: for comparison, a source of r equally likely symbols 0, …, r − 1 has H2 = log2 r; next to it, a first-order Markov process on {0, 1} in which a 1 is always followed by a 0.]

Think of this constrained source as emitting variable-length symbols: "0" with probability 1/φ and "10" with probability 1/φ² (note 1/φ + 1/φ² = 1). Its entropy per emitted bit is

  \frac{1}{\varphi} \log_2 \varphi + \frac{1}{2} \cdot \frac{1}{\varphi^2} \log_2 \varphi^2 = \log_2 \varphi,

which is maximal for binary sequences with no adjacent 1's; the factor ½ takes the variable symbol lengths into account.
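A small Python sketch (illustrative, not part of the notes; the helper names are mine) ties the last two slides together: the greedy base-Fibonacci encoder gives each of the f_n numbers 0, …, f_n − 1 a distinct n-bit string with no adjacent 1's, and the entropy per bit of such strings approaches log2 φ, the value computed for the variable-length Markov source above.

```python
import math
from itertools import product

def fib(n):
    """Weights f_0 = 1, f_1 = 2, f_2 = 3, f_3 = 5, ... with f_{k+1} = f_k + f_{k-1}."""
    f = [1, 2]
    while len(f) <= n:
        f.append(f[-1] + f[-2])
    return f

def to_base_fib(i, n, f):
    """Greedy n-bit base-Fibonacci representation of 0 <= i < f[n]; never produces adjacent 1's."""
    bits = []
    for k in range(n - 1, -1, -1):          # weights f[n-1], ..., f[0]
        if f[k] <= i:
            bits.append(1)
            i -= f[k]
        else:
            bits.append(0)
    return tuple(bits)

n = 12
f = fib(n)

# Existence and uniqueness: the f[n] numbers 0 .. f[n]-1 map one-to-one onto
# exactly the n-bit strings with no adjacent 1's.
reps = [to_base_fib(i, n, f) for i in range(f[n])]
valid = {b for b in product((0, 1), repeat=n)
         if not any(x == y == 1 for x, y in zip(b, b[1:]))}
assert len(set(reps)) == f[n] and set(reps) == valid

# Entropy: log2(f[n])/n -> log2(phi), and the variable-length-symbol computation
# (1/phi)*log2(phi) + (1/2)*(1/phi^2)*log2(phi^2) equals log2(phi).
phi = (1 + math.sqrt(5)) / 2
print(math.log2(f[n]) / n, math.log2(phi))
print((1 / phi) * math.log2(phi) + 0.5 * (1 / phi ** 2) * math.log2(phi ** 2))
```

Here log2(f_n)/n tends to log2 φ because f_n grows like a constant times φ^n, which is why log2 φ is the maximum entropy per bit achievable under the no-adjacent-1's constraint.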