ENTROPY

Entropy measures the uncertainty in a random experiment.

Let X be a discrete random variable with range $S_X = \{1, 2, \ldots, K\}$ and pmf $p_k = P(X = k)$.

Let $A = \{X = k\}$. Uncertainty of A:
$I(X = k) = \ln \frac{1}{p_k}$
Thus $p_k = 1 \Rightarrow$ uncertainty $= 0$, and $p_k \to 0 \Rightarrow$ uncertainty $\to \infty$.

Entropy of X ≡ expected uncertainty of the outcomes:
$H_X = E[I(X)] = -\sum_{k=1}^{K} p_k \ln p_k$

• If $\log_2$ is used, the units are bits; with $\ln$, the units are nats.
• By convention, $-0 \log 0 = 0$, so outcomes with $P(X = x) = 0$ contribute nothing (just as $-1 \log 1 = 0$ when $P(X = x) = 1$).

For a binary random variable $X \in \{0, 1\}$, let $p \equiv P(X = 1)$:
$H_X = -(1 - p)\log(1 - p) - p\log p$
• $H_X$ is maximum when $p = 0.5$ ↔ 0 and 1 are equally probable ↔ maximum uncertainty.
• $p = 1$ or $p = 0$ ↔ no uncertainty → $H_X = 0$.

[Figure: the binary entropy function $H_X$ versus p. Image: http://en.wikipedia.org/wiki/GNU_Free_Documentation_License]

e.g. Let $S_X = \{00, 01, 11, 10\}$, equally probable:
$H_X = -\sum_{k=1}^{4} p_k \log_2 p_k = -4 \cdot \frac{1}{4}\log_2\frac{1}{4} = 2$ bits
If it is given that the first bit is 1, only two outcomes remain:
$H_{X \mid \text{1st bit} = 1} = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1$ bit (one term for each of the two remaining outcomes)

In general, $H_X$ of $2^n$ equally probable outcomes $= n$ bits, e.g. n-bit equiprobable numbers → n bits. As each bit is specified, $H_X$ decreases by 1 bit; when all n bits are specified, $H_X = 0$.

Relative Entropy:

If $p = (p_1, p_2, \ldots, p_K)$ and $q = (q_1, q_2, \ldots, q_K)$ are two pmfs, with $X \sim p$, $Y \sim q$, and K outcomes for both X and Y, then $H(p; q)$ ≡ relative entropy of q with respect to p:
$H(p; q) = \sum_{k=1}^{K} p_k \ln\frac{p_k}{q_k} = -\sum_{k=1}^{K} p_k \ln q_k - H_{X \sim p}$
Properties:
$H(p; q) \ge 0$, and $H(p; q) = 0 \iff p_k = q_k$ for $k = 1, \ldots, K$.
H(p; q) is often used as a measure of distance between probability distributions (it is not symmetric, hence not a true metric) and is called the Kullback–Leibler distance.

To prove the assertions, use $\ln\frac{1}{x} \ge 1 - x$ for $x > 0$, with equality iff $x = 1$:
$H(p; q) = \sum_{k=1}^{K} p_k \ln\frac{p_k}{q_k} \ge \sum_{k=1}^{K} p_k\left(1 - \frac{q_k}{p_k}\right) = \sum_{k=1}^{K} p_k - \sum_{k=1}^{K} q_k = 0$
and to get $H(p; q) = 0$ we need $\frac{q_k}{p_k} = 1$ for $k = 1, \ldots, K$, i.e. $q = p$.

If $q_k = \frac{1}{K}$ (uniform):
$H(p; q) = \sum_{k=1}^{K} p_k \ln(K p_k) = \ln K - H_{X \sim p} \ge 0$
$\Rightarrow H_X \le \ln K$, with $H_X = \ln K$ iff $p_k = \frac{1}{K}$ for all k.
This is called the maximum entropy (ME) or minimum relative entropy (MRE) situation. Thus
$0 \le H_X \le \ln K$
where the lower bound corresponds to only one possible outcome and the upper bound to K equally probable outcomes.

Differential Entropy:

For a continuous random variable every exact outcome has probability zero, so every outcome is maximally uncertain and entropy cannot be defined as for discrete random variables. Instead, differential entropy is used:
$H_X = -\int f_X(x) \ln f_X(x)\, dx = -E[\ln f_X(X)]$
In fact, the integral extends only over the region where $f_X(x) > 0$; we set $f_X(x)\ln f_X(x) = 0$ wherever $f_X(x) = 0$.

e.g. If $f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2 / 2\sigma^2}$:
$-E[\ln f_X(X)] = E\left[\frac{(X-\mu)^2}{2\sigma^2}\right] + \ln\sqrt{2\pi\sigma^2} = \frac{1}{2} + \frac{1}{2}\ln(2\pi\sigma^2) = \frac{1}{2}\ln(2\pi e \sigma^2)$
$H_{X \sim \text{Gaussian}} = \frac{1}{2}\ln(2\pi e \sigma^2)$

Relative entropy for continuous random variables X and Y:
$H(f_X; f_Y) = \int f_X(x) \ln\frac{f_X(x)}{f_Y(x)}\, dx$

Information Theory

Let X be a random variable with $S_X = \{x_1, \ldots, x_K\}$. Information about the outcomes of X is to be sent over a channel:

Source → (X) → Channel → Receiver → Destination

How can the outcomes $\{x_1, \ldots, x_K\}$ be coded so that all the information is carried with maximal efficiency?
• Best code → minimum expected codeword length.
• The code must be instantaneously decodable, i.e. no codeword is a prefix of any other → construct a code tree.

e.g. $S = \{x_1, x_2, x_3, x_4, x_5\}$ with the code tree whose leaves give
x1 = 00, x2 = 01, x3 = 10, x4 = 110, x5 = 111
(x1, x2, x3 are leaves at depth 2; x4 and x5 are leaves at depth 3).

If $l_k$ = length of the codeword for $x_k$:
$E(\text{codeword length}) = E[l_k] = \sum_{k=1}^{K} p(x_k)\, l_k$
For instantaneous binary codes
$\sum_{k=1}^{K} 2^{-l_k} \le 1$
and for a D-ary code
$\sum_{k=1}^{K} D^{-l_k} \le 1$   (Kraft inequality)

Consider
$E[l_k] - H_X = \sum_{k=1}^{K} p_k l_k + \sum_{k=1}^{K} p_k \log_2 p_k = \sum_{k=1}^{K} p_k \log_2\frac{p_k}{2^{-l_k}}$
which is the relative entropy of $p_k$ and $q_k = 2^{-l_k}$, and hence $\ge 0$. Therefore
$E[l_k] \ge H_X$, with $E[l_k] = H_X$ iff $p_k = 2^{-l_k}$, i.e. $l_k = \log_2\frac{1}{p_k}$.
This is Shannon's source coding theorem, i.e.
1. the minimum average codeword length = the entropy of X;
2. the most efficient code is obtained when length($x_k$) $= -\log_2 p_k$, i.e. codeword lengths grow as the outcome probabilities shrink.
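Before moving on to code construction, here is a minimal numerical sketch of the entropy and relative-entropy formulas above. The function names and example pmfs are my own illustrative choices, not from the notes; base 2 is used so the results are in bits.

```python
# Minimal sketch: discrete entropy and relative entropy (Kullback-Leibler
# distance) computed directly from their definitions, in bits.
import math

def entropy(p, base=2.0):
    """H_X = -sum_k p_k log p_k, using the convention 0*log(0) = 0."""
    return -sum(pk * math.log(pk, base) for pk in p if pk > 0)

def relative_entropy(p, q, base=2.0):
    """H(p;q) = sum_k p_k log(p_k/q_k); assumes q_k > 0 wherever p_k > 0."""
    return sum(pk * math.log(pk / qk, base) for pk, qk in zip(p, q) if pk > 0)

# Four equally probable 2-bit outcomes -> 2 bits, as in the example above.
print(entropy([0.25] * 4))                          # 2.0

# Binary entropy: 0 bits at p = 0 or 1, maximal (1 bit) at p = 0.5.
for p1 in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p1, round(entropy([p1, 1.0 - p1]), 4))

# H(p;q) >= 0, and against the uniform pmf it equals log2(K) - H_X,
# which gives the bound H_X <= log K.
p = [0.1, 0.3, 0.25, 0.2, 0.15]
print(round(relative_entropy(p, [0.2] * 5), 4),
      round(math.log2(5) - entropy(p), 4))          # same value
```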
Consequences of the source coding theorem:
1. The number of bits of information in X = the entropy of X.
2. A maximally efficient code can always be found when all $p_k$ are powers of 2; otherwise the best code satisfies $H_X \le E[l_k]_{\text{best}} \le H_X + 1$.

One such optimal code is the Huffman code, constructed by a Huffman tree.

e.g. Let $S_X = \{A, B, C, D, E\}$ with pmf $\{0.1, 0.3, 0.25, 0.2, 0.15\}$.

At every step, combine the two nodes whose probabilities have the minimal sum:
1) A (0.1) + E (0.15) → 0.25
2) D (0.2) + C (0.25) → 0.45
3) (A, E) (0.25) + B (0.3) → 0.55
4) 0.55 + 0.45 → 1 (root)
Labeling the two branches at each node with 0 and 1 gives the codewords
A = 000, B = 01, C = 10, D = 11, E = 001
(For this code $E[l_k] = 0.1\cdot 3 + 0.3\cdot 2 + 0.25\cdot 2 + 0.2\cdot 2 + 0.15\cdot 3 = 2.25$ bits, just above $H_X \approx 2.23$ bits, consistent with $H_X \le E[l_k]_{\text{best}} \le H_X + 1$.)

To prove $\sum_k 2^{-l_k} \le 1$ for any binary tree code (each leaf a codeword):
Let $l_{\max}$ = the level of the longest codeword leaf (root = level 0).
If all leaves were at level $l_{\max}$, the number of leaves would be $2^{l_{\max}}$.
A leaf at level $l_k < l_{\max}$ eliminates $2^{l_{\max} - l_k}$ leaves from the full tree. Because no codeword is a prefix of another, each leaf of the full tree is eliminated by at most one codeword, so
$\sum_k 2^{l_{\max} - l_k} \le 2^{l_{\max}} \Rightarrow \sum_k 2^{-l_k} \le 1$

e.g. A tree with leaf A at level 1, leaf B at level 2, and leaves C, D at level 3 ($l_{\max} = 3$):
A eliminates $2^{3-1} = 4$ leaves, B eliminates $2^{3-2} = 2$ leaves, and C, D do not eliminate any leaves:
$\sum_k 2^{-l_k} = 2^{-1} + 2^{-2} + 2^{-3} + 2^{-3} = 1$
In general, if the tree is complete, $\sum_k 2^{-l_k} = 1$; if not, it is $< 1$.
e.g. With only A at level 1 and B, C at level 3:
$\sum_k 2^{-l_k} = 2^{-1} + 2^{-3} + 2^{-3} = \frac{3}{4} < 1$

Maximum Entropy Method

Given a random variable X with $S_X = \{x_1, \ldots, x_K\}$, an unknown pmf $p_k = p(x_k)$, and the constraint
$E[g(X)] = r$    (1)
estimate $p_k$.

Hypothesis: $p_k = c\, e^{-\lambda g(x_k)}$ is the maximum entropy pmf.

Proof: Suppose a pmf $q \ne p$ also satisfies (1). Then
$0 \le H(q; p) = \sum_k q_k \ln\frac{q_k}{p_k} = \sum_k q_k \ln q_k - \sum_k q_k(\ln c - \lambda g(x_k)) = -H_{X \sim q} - \ln c + \lambda r = -H_{X \sim q} + H_{X \sim p}$
(the last step uses the fact that p also satisfies (1), so $H_{X \sim p} = -\ln c + \lambda r$)
$\Rightarrow H_{X \sim q} \le H_{X \sim p}$

In general, given n constraints
$E[g_1(X)] = r_1, \ \ldots, \ E[g_n(X)] = r_n$    (1-1) … (1-n)
the ME pmf has the form
$p_k = c\, e^{-\lambda_1 g_1(x_k) - \cdots - \lambda_n g_n(x_k)}$
where c and the $\lambda_i$ are chosen to satisfy (1-1) … (1-n) and $\sum_k p_k = 1$.

If X is continuous, the ME pdf is of the form
$f_X(x) = c\, e^{-\lambda_1 g_1(x) - \cdots - \lambda_n g_n(x)}$

Note that the $g_i(x)$ may be moments; the ME method allows pmf/pdf estimates when some moments are known.
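A minimal numerical sketch of the ME fit for a single constraint: it solves for $\lambda$ (and the normalizer $c = 1/Z$) so that $p_k = c\, e^{-\lambda g(x_k)}$ matches $E[g(X)] = r$. The function name, the support $\{1, \ldots, 5\}$, the choice $g(x) = x$, and the target $r = 2.0$ are made-up illustration values, not from the notes.

```python
# Minimal sketch: fit the maximum-entropy pmf p_k = c*exp(-lam*g(x_k))
# on a finite support, given one moment constraint E[g(X)] = r.
import math

def me_pmf(xs, g, r, lo=-20.0, hi=20.0, iters=100):
    """Solve for lam by bisection so that sum_k p_k g(x_k) = r,
    with p_k = exp(-lam*g(x_k)) / Z (c = 1/Z enforces sum_k p_k = 1)."""
    def pmf(lam):
        w = [math.exp(-lam * g(x)) for x in xs]
        z = sum(w)
        return [wk / z for wk in w]

    def moment(lam):
        return sum(pk * g(x) for pk, x in zip(pmf(lam), xs))

    # E[g(X)] is monotonically decreasing in lam, so bisect on the sign of
    # moment(lam) - r (r must lie between min g and max g over the support).
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if moment(mid) > r:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return lam, pmf(lam)

# Example: support {1,...,5}, constraint E[X] = 2.0 (illustrative values).
lam, p = me_pmf(range(1, 6), g=lambda x: x, r=2.0)
print(lam, [round(pk, 4) for pk in p])
print(sum(pk * x for pk, x in zip(p, range(1, 6))))   # ~2.0
```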