The maximum entropy framework

The maximum entropy principle — an example

Suppose we have a random variable $X$ with known states (the values of the observations, $x_1, \dots, x_n$) but unknown probabilities $p_1, \dots, p_n$, plus some extra constraints, e.g. $\langle X \rangle$ is known. The task is to make a good guess for the probabilities.

Example: $X$ can take the values 1, 2 or 3 with unknown probabilities, and $\langle X \rangle = x$ is known. What is the "best guess" for the probabilities?

We need to find the maximum of $H(p_1, p_2, p_3)$ as a function of $p_1, p_2, p_3$ under two constraints: $\langle X \rangle = 1p_1 + 2p_2 + 3p_3 = x$ and $p_1 + p_2 + p_3 = 1$. Introducing Lagrange multipliers $\lambda$ and $\mu$ for the two constraints,
\[
0 = d\left[ H(p_1,p_2,p_3) - \lambda\left(\sum_{i=1}^{3} i\,p_i - x\right) - \mu\left(\sum_{i=1}^{3} p_i - 1\right) \right]
  = d\left[ -\sum_{i=1}^{3} p_i \log p_i - \lambda \sum_{i=1}^{3} i\,p_i - \mu \sum_{i=1}^{3} p_i \right]
  = \sum_{i=1}^{3} \left\{ -\log p_i - 1 - \lambda i - \mu \right\} dp_i .
\]
The curly brackets need to be zero:
\[
-\log p_i - 1 - \lambda i - \mu = 0, \qquad i = 1, 2, 3,
\]
which with the notation $\lambda_0 = \mu + 1$ gives $p_i = e^{-\lambda_0 - \lambda i}$.

The constraint on the sum of probabilities:
\[
1 = \sum_{i=1}^{3} p_i = e^{-\lambda_0} \sum_{i=1}^{3} e^{-\lambda i}
\quad\Rightarrow\quad
e^{-\lambda_0} = \frac{1}{e^{-\lambda} + e^{-2\lambda} + e^{-3\lambda}} ,
\]
so
\[
p_i = \frac{e^{-\lambda i}}{e^{-\lambda} + e^{-2\lambda} + e^{-3\lambda}}
    = \frac{e^{\lambda(1-i)}}{1 + e^{-\lambda} + e^{-2\lambda}} .
\]
The other constraint, $\langle X \rangle = x$:
\[
x = \sum_{i=1}^{3} i\,p_i = \frac{1 + 2e^{-\lambda} + 3e^{-2\lambda}}{1 + e^{-\lambda} + e^{-2\lambda}} .
\tag{1}
\]
Multiplying the equation by the denominator gives a second-degree equation for $e^{-\lambda}$:
\[
(x-3)\,e^{-2\lambda} + (x-2)\,e^{-\lambda} + (x-1) = 0 .
\]
With this, $p_2$ becomes
\[
p_2 = \frac{-1 + \sqrt{4 - 3(x-2)^2}}{3},
\qquad
p_1 = \frac{3 - x - p_2}{2},
\qquad
p_3 = \frac{x - 1 - p_2}{2} .
\]

Maximum entropy principle — general form

Suppose we have a random variable $X$ taking (known) values $x_1, \dots, x_n$ with unknown probabilities $p_1, \dots, p_n$. In addition, we have $m$ constraint functions $f_k(x)$ with $1 \le k \le m < n$, where
\[
\langle f_k(X) \rangle = F_k ,
\]
and the $F_k$ are fixed. The maximum entropy principle then assigns the probabilities that maximise the information entropy of $X$ under the above constraints. With Lagrange multipliers $\lambda_1, \dots, \lambda_m$ and $\mu$ (written as $\mu = \lambda_0 - 1$),
\[
0 = d\left[ H(p_1, \dots, p_n) - \sum_{k=1}^{m} \lambda_k \left( \sum_{i=1}^{n} f_k(x_i)\,p_i - F_k \right) - \underbrace{\mu}_{\lambda_0 - 1} \left( \sum_{i=1}^{n} p_i - 1 \right) \right]
  = \sum_{i=1}^{n} \left\{ -\log p_i - 1 - \sum_{k=1}^{m} \lambda_k f_k(x_i) - (\lambda_0 - 1) \right\} dp_i .
\]
Since this is zero for any $dp_i$, all $n$ braces have to be zero:
\[
p_i = \exp\left( -\lambda_0 - \sum_{k=1}^{m} \lambda_k f_k(x_i) \right) .
\tag{2}
\]
The sum of the probabilities gives
\[
1 = \sum_{i=1}^{n} p_i = e^{-\lambda_0} \sum_{i=1}^{n} \exp\left( - \sum_{k=1}^{m} \lambda_k f_k(x_i) \right) .
\]
Introducing the partition function
\[
Z(\lambda_1, \dots, \lambda_m) := \sum_{i=1}^{n} \exp\left( - \sum_{k=1}^{m} \lambda_k f_k(x_i) \right) ,
\tag{3}
\]
with this notation
\[
e^{-\lambda_0} = \frac{1}{Z(\lambda_1, \dots, \lambda_m)} ,
\qquad
\lambda_0 = \log Z(\lambda_1, \dots, \lambda_m) .
\tag{4}
\]
The other constraints are
\[
F_k = \sum_{i=1}^{n} f_k(x_i)\,p_i
    = e^{-\lambda_0} \sum_{i=1}^{n} f_k(x_i) \exp\left( - \sum_{\ell=1}^{m} \lambda_\ell f_\ell(x_i) \right)
    = -\frac{1}{Z} \frac{\partial Z(\lambda_1, \dots, \lambda_m)}{\partial \lambda_k}
    = -\frac{\partial \log Z(\lambda_1, \dots, \lambda_m)}{\partial \lambda_k} ,
\tag{5}
\]
and the probabilities take the form
\[
p_i = \frac{1}{Z(\lambda_1, \dots, \lambda_m)} \exp\left( - \sum_{k=1}^{m} \lambda_k f_k(x_i) \right) .
\tag{6}
\]
The value of the maximised information entropy:
\[
S(F_1, \dots, F_m) = H(p_1, \dots, p_n)
  = -\sum_{i=1}^{n} p_i \log p_i
  \overset{\text{from (6)}}{=} -\sum_{i=1}^{n} p_i \left( -\lambda_0 - \sum_{k=1}^{m} \lambda_k f_k(x_i) \right)
  = \lambda_0 + \sum_{k=1}^{m} \lambda_k \sum_{i=1}^{n} f_k(x_i)\,p_i
  = \log Z(\lambda_1, \dots, \lambda_m) + \sum_{k=1}^{m} \lambda_k F_k .
\tag{7}
\]
Now calculate the partial derivatives of $S$ with respect to the $F_k$, being careful about what is kept constant in each partial derivative:
\[
\left.\frac{\partial S}{\partial F_k}\right|_{\{F\}}
  = \sum_{\ell=1}^{m} \underbrace{\left.\frac{\partial \log Z}{\partial \lambda_\ell}\right|_{\{\lambda\}}}_{-F_\ell} \left.\frac{\partial \lambda_\ell}{\partial F_k}\right|_{\{F\}}
  + \sum_{\ell=1}^{m} F_\ell \left.\frac{\partial \lambda_\ell}{\partial F_k}\right|_{\{F\}}
  + \lambda_k
  = \lambda_k .
\tag{8}
\]
The first two sums cancel because of (5), leaving $\partial S / \partial F_k = \lambda_k$.
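The general construction in (2)–(6) lends itself to a short numerical routine: tabulate the $f_k(x_i)$, build $Z$ as in (3), and solve (5) for the multipliers. The following is a minimal sketch in Python, assuming NumPy and SciPy are available; the function names and the use of scipy.optimize.fsolve are illustrative choices, not part of the notes.
\begin{verbatim}
# Sketch: numerical maximum entropy fit, following Eqs. (2)-(6).
# Assumes NumPy and SciPy; names are illustrative, not from the notes.
import numpy as np
from scipy.optimize import fsolve

def partition_function(lam, f_vals):
    """Z(lambda_1,...,lambda_m), Eq. (3).
    f_vals has shape (m, n): f_vals[k, i] = f_k(x_i)."""
    return np.sum(np.exp(-lam @ f_vals))

def probabilities(lam, f_vals):
    """p_i from Eq. (6): exp(-sum_k lambda_k f_k(x_i)) / Z."""
    w = np.exp(-lam @ f_vals)
    return w / w.sum()

def solve_maxent(f_vals, F, lam0=None):
    """Solve F_k = -d log Z / d lambda_k, Eq. (5), for the multipliers."""
    m = f_vals.shape[0]
    if lam0 is None:
        lam0 = np.zeros(m)
    def residual(lam):
        p = probabilities(lam, f_vals)
        return f_vals @ p - F          # <f_k(X)> - F_k
    lam = fsolve(residual, lam0)
    return lam, probabilities(lam, f_vals)

# Example: X in {1, 2, 3}, single constraint f_1(x) = x with <X> = 2.4
x_vals = np.array([1.0, 2.0, 3.0])
f_vals = x_vals[None, :]               # one constraint function, m = 1
lam, p = solve_maxent(f_vals, np.array([2.4]))
print("lambda =", lam, " p =", p, " <X> =", p @ x_vals)
\end{verbatim}
For the single constraint $f_1(x) = x$ this reproduces the worked example of the previous subsection, so its output can be compared directly with the closed form derived there.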
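Returning to the three-state worked example, the closed form for $p_1, p_2, p_3$ can be cross-checked against a direct numerical solution of the quadratic for $e^{-\lambda}$. Again a sketch under the same assumptions (Python with NumPy); the helper names are made up for illustration.
\begin{verbatim}
# Sketch: closed-form check of the three-state example; assumes NumPy.
import numpy as np

def three_state_maxent(x):
    """Maximum entropy p1, p2, p3 for X in {1,2,3} with <X> = x, 1 < x < 3."""
    p2 = (-1.0 + np.sqrt(4.0 - 3.0 * (x - 2.0) ** 2)) / 3.0
    p1 = (3.0 - x - p2) / 2.0
    p3 = (x - 1.0 - p2) / 2.0
    return p1, p2, p3

def three_state_via_lambda(x):
    """Solve (x-3) t^2 + (x-2) t + (x-1) = 0 for t = exp(-lambda)
    and rebuild p_i = t^i / (t + t^2 + t^3)."""
    roots = np.real(np.roots([x - 3.0, x - 2.0, x - 1.0]))
    t = roots[roots > 0][0]            # physical root: t = exp(-lambda) > 0
    w = np.array([t, t ** 2, t ** 3])
    return w / w.sum()

x = 2.4
print(three_state_maxent(x))           # closed form
print(three_state_via_lambda(x))       # via the quadratic for exp(-lambda)
\end{verbatim}
Both routes give the same distribution, and for $x = 2$ they reduce to the uniform $p_1 = p_2 = p_3 = 1/3$, as expected from symmetry.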