Lecture 1:

Pass out sign-up sheet. The course is NOT a comprehensive introduction to any of the subjects; rather, it will stress the connections that entropy makes among the subjects. Pass out the outline and briefly go over it.

– BRIEF discussion of the origins of entropy in statistical mechanics ([K], ch. 1)
– Entropy in information theory: concrete, using only probability theory, no measure theory ([CT], ch. 2, 3, 4)
– Entropy in ergodic theory: more abstract, using measure theory; will emphasize single measure-preserving transformations on probability spaces (actions of Z_+), but will consider more general actions ([W], ch. 1, 2, 4; [K], ch. 1, 2, 3)
– Entropy in topological dynamics: continuous transformations of compact metric spaces ([W], ch. 5, 7; [LM], ch. 2, 4)
– Connections between entropy in ergodic theory and topological dynamics; thermodynamic formalism ([W], ch. 6, 8, 9; [K], ch. 4, 5)

Describe the four texts. Will mention exercises and then go over them in class.

Regular time for class: MWF 11 (or 12, 1 or 2)???

——————————————————

(Shannon) ENTROPY: H

1. Entropy of a r.v. X with finite range, X ∼ p(x), i.e., p(x) = P(X = x):

H(X) = H_b(X) = − Σ_x p(x) log_b p(x)

The base b of the log is irrelevant; changing it only changes H by a multiplicative constant. In information theory, use base 2 (bits); in ergodic theory, use base e (nats). We will mostly use base e.

We take 0 log 0 ≡ 0, since lim_{x→0} x log x = 0.

2. Entropy of a probability distribution.

Note: H(X) does not depend on the values of X, but only on the distribution p. The values of X can be anything: reals, vectors, subsets of a set, horses.

Let p = (p_1, . . . , p_n) be a probability vector; then define

H(p) = − Σ_i p_i log p_i

Intuitive meaning: H(X) represents
— uncertainty in the outcomes of X
— information gained in revealing an outcome of X
— degree of randomness or disorder of X

We will see later that, subject to some simple natural axioms, there is really only one choice for H (up to the base of the logarithm).

Binary entropy function:

f(p) = H(p, 1 − p) = −p log p − (1 − p) log(1 − p)

f′(p) = − log p − 1 + log(1 − p) + 1 = log((1 − p)/p)   (base e)

So there is a critical point when (1 − p)/p = 1, i.e., p = 1/2.

f″(p) = −1/p − 1/(1 − p) < 0

So f is strictly concave. Graph of f(p).

Prop:
0) H(p) ≥ 0
1) H(p) = 0 iff p is deterministic, i.e., for some unique i, p_i = 1.
2) H(p) is continuous and strictly concave.
3) H(p) ≤ log n, with equality iff p is uniform, i.e., p = (1/n, . . . , 1/n).

Briefly talk through the proof. This agrees with the intuitive meaning of H(p).

STAT MECH: Boltzmann, Gibbs, ideal gas

Micro-state: physical state of the system at a given time, e.g., the positions and momenta of all particles, or a clustering of such vectors.

Macro-state: probability distribution on the set of micro-states.

Laws of thermodynamics:
– 1st law: without any external influences, the energy of the macro-state is fixed.
– 2nd law: the macro-state tends to a state of maximal disorder, i.e., maximum entropy, subject to the fixed energy; such a state is called an "equilibrium state" (our defn. will be different, but at least in this context it agrees with this one).

TRY to make this precise:

Let {s_1, . . . , s_n} be the collection of micro-states.
Let u_i = U(s_i) be the energy of s_i (here U is the energy function).
Let E* be a fixed value of the energy of the system.

FIND arg max_{E_p(U) = E*} H(p)

Constrained optimization problem: maximize H(p) subject to
Σ_i u_i p_i = E*
Σ_i p_i = 1
p_i ≥ 0.

Apply Lagrange multipliers: grad H = β U + λ 1 (viewed as a vector equation):

− log p_i − 1 = β u_i + λ   (assuming base e)

Solution: p_i = c e^{−β u_i} = e^{−β u_i}/Z(β), where Z is the normalization factor

Z(β) = Σ_j e^{−β u_j}.

Call this probability distribution µ_β; note that β = β* is uniquely determined by E_{µ_{β*}}(U) = E*. Check that

H(µ_{β*}) = log Z(β*) + β* E_{µ_{β*}}(U).

(This assumes U is not constant and min U < E* < max U.) Will later show that this really is a global max.
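As a quick numerical illustration (a minimal sketch in Python; the energy values u and the target E* are made up for the example), one can solve E_{µ_β}(U) = E* for β* and check the identity H(µ_{β*}) = log Z(β*) + β*E* stated above:

```python
import numpy as np
from scipy.optimize import brentq

u = np.array([1.0, 2.0, 5.0])   # hypothetical energies u_i = U(s_i)
E_star = 2.5                    # target mean energy; must lie in (min U, max U)

def gibbs(beta):
    """The distribution mu_beta: p_i proportional to exp(-beta * u_i)."""
    w = np.exp(-beta * u)
    return w / w.sum()

# Solve E_{mu_beta}(U) = E* for beta (the map beta -> E_{mu_beta}(U) is strictly decreasing).
beta_star = brentq(lambda b: gibbs(b) @ u - E_star, -50, 50)

p = gibbs(beta_star)
H = -(p * np.log(p)).sum()                      # entropy in nats
logZ = np.log(np.exp(-beta_star * u).sum())

# The two printed numbers agree: H(mu_{beta*}) = log Z(beta*) + beta* E*
print(H, logZ + beta_star * E_star)
```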
A probability distribution of the form µ_β is called a Gibbs state (or Maxwell-Boltzmann state). So every equilibrium state is an (explicit) Gibbs state. A MAJOR GOAL of this course: to show that this statement and its converse hold in great generality, with general definitions of equilibrium state and Gibbs state. More details later.

—————————————–

SKIPPED:

Roots of entropy: Clausius (1865): classically, dS = dQ/T. Idea of entropy maximization: find the most random source subject to constraints (so as to minimize bias). Etymology: Clausius chose a Greek word, entropia, translating roughly as 'transformation' (toward disorder), having to do with the transformation of a body as it loses heat.

Roots of ergodicity: A system is ergodic if, starting from almost any initial micro-state (say, the set of all positions/momenta of all particles), the trajectory (of micro-states) visits approximately every other micro-state. Boltzmann's ergodic hypothesis was that this holds for certain physical systems, such as an ideal gas. Then one can compute the average value of an observable evolving over time as the expected value of the observable w.r.t. the equilibrium state. Etymology: work path or energy path.

—————————————–

Lecture 2:

Recall that for a r.v. X with finite range,

H(X) = − Σ_x p(x) log_b p(x),   H(p) = − Σ_i p_i log p_i.

Let U = (u_1, . . . , u_n) ∈ R^n (viewed as a function on {1, . . . , n}). Let β ∈ R. Let µ_β be the probability vector (p_1, . . . , p_n) defined by

p_i = e^{−β u_i}/Z(β)

where Z is the normalization factor Z(β) = Σ_j e^{−β u_j}.

Theorem ([K], section 1.1): Let U, β, µ_β be as above. Assume that U is NOT constant.
1. The map R → (min U, max U), β ↦ E_{µ_β}(U), is a bijection (actually, a homeomorphism).
2. Let min U < E* < max U. For the unique β* ∈ R s.t. E_{µ_{β*}}(U) = E*,
µ_{β*} = arg max_{E_p(U) = E*} H(p) (and it is the unique maximizer), and H(µ_{β*}) = log Z(β*) + β* E*.

Proof:
1. Onto: lim_{β→−∞} µ_β = uniform on {i : u_i = max U}, and lim_{β→∞} µ_β = uniform on {i : u_i = min U}. Since the map is continuous, it is onto.
1-1: Calculate dE_{µ_β}(U)/dβ = −Variance_{µ_β}(U) < 0. So the map is 1-1.
2. More or less did this last time using Lagrange multipliers. That approach helps you find the arg max. Apply Jensen to log x, which is concave since (log x)″ = −1/x² < 0. Given arbitrary p, we have

H(p) − β* E_p(U) = Σ_{i=1}^n p_i log(e^{−β* u_i}/p_i) ≤ log(Σ_{i=1}^n p_i · e^{−β* u_i}/p_i) = log Z(β*)

with equality iff e^{−β* u_i}/p_i is constant in i, i.e., p = µ_{β*}. (Take care re: some p_i = 0.)

In particular, if we restrict to p that satisfy the constraint E_p(U) = E*, we get

H(p) − β* E* ≤ log Z(β*)

with equality iff p = µ_{β*}. Thus µ_{β*} achieves the max, and the max equals log Z(β*) + β* E*.

Note: If U is constant, the uniform distribution achieves the unconstrained maximum.

INFORMATION THEORY: Shannon

For a sequence of jointly distributed r.v.'s, (X_1, . . . , X_n) is a r.v., and so H(X_1, . . . , X_n) is already defined.

For a stationary process X = X_1, X_2, . . . (one-sided) – or – X = . . . , X_{−1}, X_0, X_1, X_2, . . . (two-sided), define

h(X) = lim_{n→∞} H(X_1, . . . , X_n)/n

called the entropy or entropy rate of the process. Will show later that the limit exists!

– For i.i.d. X, it turns out that h(X) = H(X_1).
– For stationary Markov, there is a simple formula (see the sketch below).
– For more general stationary processes, h(X) can be complicated.
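For reference, the "simple formula" for a stationary Markov chain is the standard closed form h(X) = − Σ_i π_i Σ_j P_{ij} log P_{ij}, where P is the transition matrix and π the stationary distribution. The sketch below (Python; the two-state chain is a hypothetical example) compares H(X_1, . . . , X_n)/n with this closed form:

```python
import itertools
import numpy as np

# Hypothetical two-state stationary Markov chain
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])                       # transition matrix
evals, evecs = np.linalg.eig(P.T)                # stationary distribution: pi P = pi
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()

# Standard closed form for the entropy rate (nats)
h_markov = -sum(pi[i] * P[i, j] * np.log(P[i, j])
                for i in range(2) for j in range(2))

def block_entropy(n):
    """H(X_1, ..., X_n) for the stationary chain, by summing over all length-n words."""
    H = 0.0
    for word in itertools.product(range(2), repeat=n):
        p = pi[word[0]]
        for a, b in zip(word, word[1:]):
            p *= P[a, b]
        H -= p * np.log(p)
    return H

for n in (2, 4, 8, 12):
    print(n, block_entropy(n) / n, h_markov)     # H(X_1..X_n)/n decreases toward h(X)
```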
Fact: h_2(X) quantifies the optimal (smallest) compression bit rate for a stationary random process.
—- a compression encoder: invertibly encodes sequences of process outputs into short binary strings
—- compression bit rate = expected ratio of bits per process symbol: (number of coded binary bits)/(number of process symbols)

THEOREM: For a stationary process X, the optimal compression bit rate is h_2(X) (more precisely, rates h_2(X) + ε are achievable for every ε > 0, and no smaller rate is).

– Specific example: X : iid(.6, .3, .1) with values a, b, c.
Encode symbols to codewords: a ↦ 0, b ↦ 10, c ↦ 11 (high-probability letters ↦ short binary words).
Compress a sequence of process outputs by concatenating codewords: abacb ↦ 01001110.
Decodable, because the codeword list is prefix-free: abacb ↦ |0|10|0|11|10.
Expected codeword length: .6(1) + .3(2) + .1(2) = 1.4 bits per source symbol.

Try again, encoding pairs of symbols: aa ↦ 00, ab ↦ 10, ba ↦ 010, bb ↦ 110, ac ↦ 011, ca ↦ 1110, bc ↦ 11110, cb ↦ 111110, cc ↦ 111111.
Expected codeword length: 1.35 bits per source symbol.
h_2(X) ≈ 1.2955.

Shannon entropy is also used to quantify the optimal (highest) bit rate of transmission across a noisy channel. Draw black box.

Can also consider X = (X_i)_{i∈I}, I = Z, N, Z^d, Z_+^d, or other index sets.

Lecture 3:

– Will return soon to entropy of stationary processes and data compression.
– Everybody OK with Jensen?

ERGODIC THEORY: the study of measure-preserving transformations (MPTs)

For SIMPLICITY, we will for now consider only invertible MPTs (IMPTs):
– (M, A, µ) a probability space
– T : M → M invertible (a.e.), bi-measurable, and bi-measure-preserving:
—- for A ∈ A, T(A), T^{−1}(A) ∈ A and µ(T(A)) = µ(A), µ(T^{−1}(A)) = µ(A).

We will study various properties of MPTs that have to do with iteration of T:
– for x ∈ M, consider the "orbit" {. . . , T^{−1}x, x, Tx, T^2(x) ≡ T(T(x)), . . .}
– or orbits of sets
– classical properties studied (1930s, 1940s):
—– recurrence (most points come back to near where they started): if µ(A) > 0, do almost all points of A return to A?
—– ergodicity (most points visit most of the space; uniform distribution of orbits): if µ(A) > 0, do almost all points of M visit A?
—– mixing (asymptotic independence of the orbit of a set): if µ(A), µ(B) > 0, does µ(T^n(A) ∩ B) → µ(A)µ(B)?

Example 1: (Circle rotation by angle α)
M: the circle, identified with [0, 2π], with normalized Lebesgue measure
T_α(θ) = θ + α mod 2π
Iterate the rotation:
– recurrent: yes
– ergodic: yes iff α/(2π) is irrational
– mixing: no
– RIGID

Example 2: (Baker's transformation)
M: the unit square with Lebesgue measure
T(x, y) = (2x mod 1, (1/2)y) if 0 ≤ x < 1/2
T(x, y) = (2x mod 1, (1/2)y + 1/2) if 1/2 ≤ x < 1
Describe by picture. Iterate the baker's transformation:
– recurrent, ergodic and mixing
– CHAOTIC

Example 3: (Shift of a process X)
M = F^Z = {· · · x_{−1} x̂_0 x_1 x_2 · · · : x_i ∈ F} (F a finite alphabet)
µ: defined by a (two-sided) stationary process X = (· · · X_{−1} X_0 X_1 X_2 · · ·)
T = σ, the left shift map: σ(x) = y, where y_i = x_{i+1}
For a cylinder set A = {x ∈ M : x_i = a_1, . . . , x_j = a_{j−i+1}}, define
µ(A) = p(X_i = a_1, . . . , X_j = a_{j−i+1})
and extend to the sigma-algebra. Then
σ(A) = {x ∈ M : x_{i−1} = a_1, . . . , x_{j−1} = a_{j−i+1}} and µ(σ(A)) = µ(A).
σ is an MPT ⇐⇒ the process X is stationary; its properties depend very heavily on X.
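These properties can be illustrated numerically for Examples 1 and 2. The Monte Carlo sketch below (Python; the sets A, B and the sample size are arbitrary choices) estimates µ(A ∩ T^{−n}B): for the baker's transformation the estimates settle near µ(A)µ(B), consistent with mixing, while for an irrational rotation they keep oscillating. This is only an illustration, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

# --- Baker's transformation (mixing): mu(A ∩ T^{-n}B) -> mu(A)mu(B) = 1/4 ---
x, y = rng.random(N), rng.random(N)
in_A = x < 0.5                                      # A = left half, mu(A) = 1/2
for n in range(1, 7):
    # one step of the baker's map (the new y uses the old x)
    x, y = (2 * x) % 1.0, np.where(x < 0.5, y / 2, y / 2 + 0.5)
    est = np.mean(in_A & (y < 0.5))                 # B = bottom half, mu(B) = 1/2
    print(f"baker    n={n}: {est:.3f}   (target 0.25)")

# --- Circle rotation by an irrational multiple of 2*pi (ergodic, NOT mixing) ---
alpha = 2 * np.pi * (np.sqrt(5) - 1) / 2            # alpha/(2 pi) irrational
theta = rng.random(N) * 2 * np.pi
in_A = theta < np.pi                                # A = B = half circle, measure 1/2
for n in range(1, 7):
    est = np.mean(in_A & ((theta + n * alpha) % (2 * np.pi) < np.pi))
    print(f"rotation n={n}: {est:.3f}   (oscillates, no limit)")
```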
—- All of these properties are invariant under a notion of isomorphism.

– Given IMPTs T on (M, A, µ) and S on (N, B, ν), we say that T and S are isomorphic, and we write T ≃ S, if there exists an invertible (a.e.), bi-measurable, bi-measure-preserving mapping φ : M → N s.t.

φ ∘ T = S ∘ φ

Example: T = iid(1/2, 1/2) shift ≃ S = Baker:

φ(· · · x_{−1} x̂_0 x_1 x_2 · · ·) = (.x_0 x_1 x_2 · · · , .x_{−1} x_{−2} · · ·)   (in binary)

φ ∘ T(· · · x_{−1} x̂_0 x_1 x_2 · · ·) = φ(· · · x_{−1} x_0 x̂_1 x_2 · · ·) = (.x_1 x_2 · · · , .x_0 x_{−1} x_{−2} · · ·)

S ∘ φ(· · · x_{−1} x̂_0 x_1 x_2 · · ·) = S(.x_0 x_1 x_2 · · · , .x_{−1} x_{−2} · · ·) = (.x_1 x_2 · · · , .0 x_{−1} x_{−2} · · · + .x_0 00 · · ·) = (.x_1 x_2 · · · , .x_0 x_{−1} x_{−2} · · ·),

so the two agree.

Define the 3-Baker. Similarly, the iid(1/3, 1/3, 1/3) shift ≃ the 3-Baker transformation.

Q: Is the 2-Baker ≃ the 3-Baker? Equivalently: is the iid(1/2, 1/2) shift ≃ the iid(1/3, 1/3, 1/3) shift? None of the classical invariants distinguished them. The answer is No, thanks to entropy of IMPTs:
– h(2-Baker) = log 2 and h(3-Baker) = log 3.

Lecture 4:

STEPS to compute h(T) for an MPT T:

a. Entropy of a partition. Let α = {A_1, . . . , A_n} be a measurable partition of a probability space (X, B, µ). Let p = (µ(A_1), . . . , µ(A_n)). Define H(α) = H(p).

b. W.r.t. a fixed finite partition α of M, T looks like a (two-sided) stationary process X; define h(T, α) as the Shannon entropy (rate) of X, where

p(x_i . . . x_j) = µ(∩_{k=i}^{j} T^{−k}(A_{x_k})).

T an IMPT =⇒ this process is stationary.

Example: Baker. The partition {A_0 = left half, A_1 = right half} makes (T, α) look like X = iid(1/2, 1/2).

c. Define h(T) = sup_α h(T, α).

– Kolmogorov and Sinai (1957-1958) showed how to compute h(T):
——— If α is sufficiently fine, then h(T) = h(T, α).

Corollary: Let p be a probability vector. Let X^p denote the iid(p) stationary process (two-sided), and let σ_p be the iid(p) shift, as an IMPT. Then h(σ_p) = h(X^p) = H(p).

Thus h(σ_{(1/2,1/2)}) ≠ h(σ_{(1/3,1/3,1/3)}), and so σ_{(1/2,1/2)} ≄ σ_{(1/3,1/3,1/3)}.

Ornstein (1969): Entropy is a complete invariant for the shifts σ_p.

Example: p = (1/4, 1/4, 1/4, 1/4), q = (1/2, 1/8, 1/8, 1/8, 1/8). Both have entropy log 4:
4(−(1/4) log(1/4)) = log 4 and −(1/2) log(1/2) − 4((1/8) log(1/8)) = log 4.

TOPOLOGICAL DYNAMICS:
– M: compact metric space, T : M → M a homeomorphism
– Examples 1 and 3 above (and homeomorphisms that imitate the Baker: hyperbolic toral automorphisms)
– Topological analogues of ergodicity, mixing, entropy (Adler-Konheim-McAndrew, Bowen, 1960s)
– Classify up to topological conjugacy: commutative diagram, homeomorphism

COMBINE ergodic theory and topological dynamics:
– Consider invariant measures for a homeomorphism T of a compact metric space M.
– Variational principle: topological entropy = sup of measure-theoretic entropies, over all invariant measures.
– In many cases, the sup is achieved by a unique measure (which gives a natural measure).
– Generalize to pressure, equilibrium states and Gibbs states.
– Generalize to actions of groups or semi-groups, esp. Z, Z_+, Z^d, Z_+^d.

—————————————
SKIPPED: Review of Jensen's inequality:

Recall Defn: f : D ⊆ R → R is strictly concave if for all x, y and 0 ≤ α ≤ 1,
f(αx + (1 − α)y) ≥ αf(x) + (1 − α)f(y),
with equality iff x = y or α = 0, 1. Draw picture.

Facts:
1. f : D ⊆ R → R is strictly concave iff for all x_1, . . . , x_n and 0 < α_1, . . . , α_n < 1 with Σ_i α_i = 1,
f(Σ_{i=1}^n α_i x_i) ≥ Σ_{i=1}^n α_i f(x_i),
with equality iff all the x_i corresponding to non-zero α_i are the same.
2. If f″ < 0, then f is strictly concave.

Note f(x) = log x is strictly concave: f″(x) = −1/x² (if base b = e). And g(x) = −x log x is also strictly concave: g″(x) = −1/x.

—————————————
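Returning to steps a, b, c above: the following sketch (Python; the sample sizes are arbitrary) estimates h(T, α) for the baker's transformation with the partition α = {left, right} from empirical n-block frequencies. The estimate should come out close to log 2 nats, matching h(σ_{(1/2,1/2)}) = H(1/2, 1/2) = log 2 (the plug-in estimate is slightly biased downward for finite samples).

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)
n_pts, n_steps = 50_000, 10

# Symbolic itineraries of the baker's map w.r.t. the partition {left, right}
x, y = rng.random(n_pts), rng.random(n_pts)
symbols = []
for _ in range(n_steps):
    symbols.append((x >= 0.5).astype(int))          # 0 = left half, 1 = right half
    x, y = (2 * x) % 1.0, np.where(x < 0.5, y / 2, y / 2 + 0.5)
words = ["".join(map(str, w)) for w in np.array(symbols).T]   # length-n itineraries

# Empirical H(X_1 .. X_n)/n should be close to h(T, alpha) = log 2 (nats)
counts = Counter(words)
probs = np.array(list(counts.values()), dtype=float) / n_pts
H_n = -(probs * np.log(probs)).sum()
print(H_n / n_steps, np.log(2))
```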
Lecture 5:

For an MPT T, there exists a (finite) partition which separates points if T is ergodic and h(T) < ∞.

Recall: X a r.v. with finite range.
Prop:
0) H(p) ≥ 0 (H(X) ≥ 0)
1) H(p) = 0 iff p is deterministic, i.e., for some unique i, p_i = 1 (H(X) = 0 iff X is deterministic)
2) H(p) is continuous and concave
3) H(p) ≤ log n, with equality iff p is uniform.

JOINT ENTROPY: Let (X, Y) be a correlated pair of r.v.'s with distribution p(x, y).

H(X, Y) = − Σ_{(x,y)} p(x, y) log p(x, y)

Note: below, when we write p(x) (or p(y)) we mean p_X(x) (or p_Y(y)).

CONDITIONAL ENTROPY:

H(Y|X) = − Σ_{(x,y)} p(x, y) log p(y|x)

Equivalently,

H(Y|X) = Σ_x p(x) (− Σ_y p(y|x) log p(y|x)) = Σ_x p(x) H(Y|X = x),

where H(Y|X = x) means the entropy of the r.v. Y conditioned on X = x, i.e., of the distribution p(· |x). So H(Y|X) is a weighted average of the H(Y|X = x). Note: when p(x, y) = 0 we take the corresponding term to be 0. Explain intuition.

Property 1: H(X, Y) = H(X) + H(Y|X)
Proof: p(x, y) = p(x) p(y|x), so log p(x, y) = log p(x) + log p(y|x). Then
H(X, Y) = −E_{p(x,y)} log p(x, y) = −E_{p(x,y)} log p(x) − E_{p(x,y)} log p(y|x) = −E_{p(x)} log p(x) − E_{p(x,y)} log p(y|x) = H(X) + H(Y|X).

Property 2: H(Y|X) ≤ H(Y), with equality iff X ⊥ Y
Proof:
H(Y|X) = − Σ_{(x,y)} p(x, y) log p(y|x) = Σ_y p(y) Σ_x p(x|y) log(1/p(y|x))
≤ Σ_y p(y) log(Σ_x p(x|y)/p(y|x)) = Σ_y p(y) log(Σ_x p(x)/p(y)) = Σ_y p(y) log(1/p(y)) = H(Y),
by Jensen, with equality iff for each y, p(y|x) is constant as a function of x, and therefore X ⊥ Y.
Note: literally, we have shown that p(x, y) = p(x)p(y) only when p(x, y) > 0. But it then follows for all (x, y).

Defn. Let X be a r.v. (with finite range). Let f : range of X → some set. Then Y = f(X) is the r.v. with distribution p_Y(y) = Σ_{x:f(x)=y} p(x). Convention: (X, f(X)) ∼ p(x, f(x)). Example:

Property 3: H(Y|X) ≥ 0, with equality iff Y = f(X) (for some f)
Proof: H(Y|X) is a weighted average (weighted by p(x)) of entropies of random variables. Each entropy is nonnegative, so H(Y|X) ≥ 0. Equality holds iff for a.e. x, H(Y|X = x) = 0, iff for a.e. x the distribution p(·|x) is deterministic, iff for a.e. x there is a unique y such that p(y|x) = 1; set y = f(x). (Draw graph of f.)

Property 4: H(X, f(X)) = H(X)
Proof: H(X, f(X)) = H(X) + H(f(X)|X) = H(X), by Prop 3.

Property 5: H(X_1, . . . , X_n) = Σ_{i=1}^n H(X_i|X_{i−1}, . . . , X_1)
Proof: By induction and Prop 1:
H(X_1, . . . , X_n) = H(X_1, . . . , X_{n−1}) + H(X_n|X_1, . . . , X_{n−1}) = Σ_{i=1}^{n−1} H(X_i|X_{i−1}, . . . , X_1) + H(X_n|X_1, . . . , X_{n−1}).

Property 6: H(X_1, . . . , X_n) ≤ Σ_i H(X_i), with equality iff the X_i are independent.
Proof: Apply Props 2 and 5.

Property 7: H(f(X)) ≤ H(X), with equality iff f is 1-1 (a.e.)
Proof: By Prop 4, H(X, f(X)) = H(X). Also, H(X, f(X)) = H(f(X)) + H(X|f(X)) by Prop 1, and H(X|f(X)) ≥ 0 by Prop 3 ('if'). So H(f(X)) ≤ H(X) – and – H(f(X)) = H(X) iff H(X|f(X)) = 0 iff X = g(f(X)) for some g iff f is 1-1 a.e. (by Prop 3 ('only if')).

Property 8: H(Y|X) ≤ H(Y|f(X))
Proof: Write z for the values of f(X). Then
LHS = Σ_z Σ_{x:f(x)=z} p(x) H(Y|X = x) = Σ_z Σ_{x:f(x)=z} Σ_y −p(x, y) log p(y|x)
and
RHS = Σ_z p(z) H(Y|f(X) = z) = Σ_z Σ_y −p(z, y) log p(y|z).
We show that the inequality holds term by term for each y and z, or equivalently, dividing by p(z),
Σ_{x:f(x)=z} (p(x, y)/p(z)) (−log p(y|x)) ≤ −p(y|z) log p(y|z).
Proof: LHS = Σ_{x:f(x)=z} (p(x)/p(z)) (−p(y|x) log p(y|x)). By Jensen applied to −u log u, LHS ≤ −u log u, where
u = Σ_{x:f(x)=z} (p(x)/p(z)) p(y|x) = Σ_{x:f(x)=z} p(y, x)/p(z) = p(y|z).
(Equality holds iff for each y and z, either p(y|z) = 0 or the {p(y|x) : f(x) = z} are all the same.)

And many more, e.g.,

Property 9: H(X, Y|Z) = H(X|Z) + H(Y|X, Z)
Proof: Version of Prop 1.

Property 10: H(Y|Z, X) ≤ H(Y|Z), with equality iff X ⊥_Z Y
Proof: Version of Prop 2.
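A quick numerical check of Properties 1 and 2 (a sketch in Python; the joint distribution p(x, y) is made up for illustration):

```python
import numpy as np

# Hypothetical joint distribution p(x, y) on a 2 x 3 range
p_xy = np.array([[0.20, 0.10, 0.05],
                 [0.15, 0.30, 0.20]])

def H(p):
    """Entropy (nats) of an array of probabilities, with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -(p * np.log(p)).sum()

p_x = p_xy.sum(axis=1)                  # marginal of X
p_y = p_xy.sum(axis=0)                  # marginal of Y

# H(Y|X) = sum_x p(x) H(Y | X = x), a weighted average of row entropies
H_Y_given_X = sum(p_x[i] * H(p_xy[i] / p_x[i]) for i in range(len(p_x)))

print(H(p_xy), H(p_x) + H_Y_given_X)    # Property 1: the two numbers agree
print(H_Y_given_X, H(p_y))              # Property 2: H(Y|X) <= H(Y)
```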
Lecture 6:

Another approach to the proof of Property 8 (Hannah Cairns): For any fixed z, the distribution of (Y | f(X) = z) is a convex combination of the distributions of {(Y | X = x)}_{f(x)=z}. Since H(p) is a concave function of p,

Σ_{x:f(x)=z} p(x|z) H(Y|X = x) ≤ H(Y|f(X) = z),

or equivalently,

Σ_{x:f(x)=z} p(x, z) H(Y|X = x) ≤ p(z) H(Y|f(X) = z).

Now add over all values z of f.

Property 11: If X ⊥_Z (Y, W), then H(Y|W, Z, X) = H(Y|W, Z).

Property 12: H(f(X)|Y) ≤ H(X|Y)

Property 13: If X is stationary, then for all i, j, k, H(X_i . . . X_j) = H(X_{i+k} . . . X_{j+k}).

Proposition (Simple Clumping): for n ≥ 2,

H(p_1, . . . , p_n) = H(p_1 + p_2, p_3, . . . , p_n) + (p_1 + p_2) H(p_1/(p_1 + p_2), p_2/(p_1 + p_2)).

Proof: Let X ∼ p. Let Y = f(X), where f(1) = f(2) = 2 and f(i) = i for i ≥ 2. By Prop 1 and Prop 4,
H(p_1, . . . , p_n) = H(X) = H(X, f(X)) = H(f(X)) + H(X|f(X)).
By defn, H(f(X)) = H(p_1 + p_2, p_3, . . . , p_n). And H(X|f(X)) is a weighted avg. of the H(X|f(X) = z), which are all zero except when z = 2, when it is H(p_1/(p_1 + p_2), p_2/(p_1 + p_2)) with weight p_1 + p_2.

A: Axiomatic Treatment of Entropy

Prop: H(p) is the unique function (up to the base of the logarithm) such that:
1. H(p) ≥ 0 and H(p) ≢ 0.
2. (Symmetry) H(p_1, . . . , p_n) is a symmetric function of p for each n.
3. (Continuity) H(p_1, . . . , p_n) is a continuous function of p for each n (n = 2 is sufficient).
4. (Simple Clumping) H(p_1, . . . , p_n) = H(p_1 + p_2, p_3, . . . , p_n) + (p_1 + p_2) H(p_1/(p_1 + p_2), p_2/(p_1 + p_2)).

Note: Shannon (1948) originally proved this with the additional assumption that H(1/n, . . . , 1/n) is monotone increasing in n. The result above was proven in the 1950s.

Information function: the random variable

I(x) = I_X(x) = − log p(x) = log(1/p(x)) ∈ [0, ∞]

quantifies the amount of information revealed when a sample from X with value x is drawn:
– if p(x) ≈ 1, then very little information is revealed;
– if p(x) ≈ 0, then a lot of information is revealed.

Observe: H(X) = E_p I(x) (the expected amount of information revealed).

Note: we could possibly also consider other functions as an information function.

Proposition (Exercise): I(x) = − log p(x) is the unique function s.t.
1. I(x) ≥ 0 and I(x) ≢ 0.
2. I(x) depends only on p(x): I(x) = I(p(x)).
3. If (X, Y) are jointly distributed and the events (X = x) and (Y = y) are independent, then I(x, y) = I(x) + I(y).

————————————–
SKIPPED FOR NOW:

Entropy of a r.v. with countable range: Say p = (p_1, p_2, . . .), with each p_i ≥ 0 and Σ_i p_i = 1; then H(p) is defined as before: H(p) = − Σ_i p_i log p_i.

— 1. Let q_n = 1/(n (log n)²) for n ≥ 2, and S = Σ_n q_n < ∞ by the integral test.
—— Let p_n = q_n/S = (1/(n (log n)²))/S.
—— Then H(p) = ∞:
H(p) = (1/S) Σ_n (1/(n (log n)²)) (log n + 2 log log n + log S)
= (1/S) Σ_n 1/(n log n) + (1/S) Σ_n 2 log log n/(n (log n)²) + log S
= ∞ + finite + finite.

— 2. Let p_n = 1/(n² S), where S = π²/6.
—— Then H(p) < ∞:
H(p) = (1/S) Σ_n (2 log n)/n² + (1/S) Σ_n (log S)/n² < ∞.

Entropy of a r.v. with continuous range (differential entropy).
——————————————

Define the entropy (rate) of a stationary process X (one-sided or two-sided):

h(X) = lim_{n→∞} H(X_1, . . . , X_n)/n

Proposition:
1. H(X_n|X_{n−1} . . . X_1) is monotonically decreasing (non-increasing) in n.
2. h(X) = lim_{n→∞} H(X_n|X_{n−1} . . . X_1), and in particular the limit defining h(X) exists.
3. H(X_1, . . . , X_n)/n is monotonically decreasing (non-increasing) in n.

Proof:
1. By Property 8, H(X_n|X_{n−1} . . . X_1) ≤ H(X_n|X_{n−1} . . . X_2). By stationarity, H(X_n|X_{n−1} . . . X_2) = H(X_{n−1}|X_{n−2} . . . X_1).
2. Apply Property 5:
H(X_1, . . . , X_n)/n = [H(X_1) + H(X_2|X_1) + H(X_3|X_2, X_1) + . . . + H(X_n|X_{n−1} . . . X_1)]/n.
But this is the Cesaro average of the sequence of conditional entropies, which converges. Thus this sequence of averages also converges, and converges to the same limit.
3. The Cesaro average of a monotone sequence is monotone.

So we get two sequences of approximations to h(X) (illustrated in the sketch below).
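To see the two approximating sequences in a concrete case where no simple closed form is available, here is a sketch (Python; the chain and the function f are hypothetical) for a stationary process Y that is a function of a 3-state Markov chain: both H(Y_1, . . . , Y_n)/n and H(Y_n|Y_{n−1} . . . Y_1) decrease with n, as the Proposition asserts.

```python
import itertools
import numpy as np

# Hypothetical 3-state stationary Markov chain X, observed through f: {0,1,2} -> {0,1}
P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])
f = np.array([0, 0, 1])                            # Y_n = f(X_n), a function of a Markov chain

evals, evecs = np.linalg.eig(P.T)                  # stationary distribution of X
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()

def word_prob(y):
    """P(Y_1 ... Y_n = y), by the forward recursion over the hidden states."""
    alpha = pi * (f == y[0])
    for b in y[1:]:
        alpha = (alpha @ P) * (f == b)
    return alpha.sum()

def block_entropy(n):
    probs = np.array([word_prob(y) for y in itertools.product((0, 1), repeat=n)])
    probs = probs[probs > 0]
    return -(probs * np.log(probs)).sum()

Hs = [0.0] + [block_entropy(n) for n in range(1, 11)]
for n in range(1, 11):
    cond = Hs[n] - Hs[n - 1]                       # H(Y_n | Y_{n-1} ... Y_1), by the chain rule
    print(n, Hs[n] / n, cond)                      # both columns are non-increasing in n
```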
Prop (Forwards, Backwards): If X is stationary, then

H(X_n|X_{n−1} . . . X_1) = H(X_1|X_2 . . . X_n),

and so

h(X) = lim_{n→∞} H(X_1|X_2 . . . X_n).

Proof:
H(X_n|X_{n−1} . . . X_1) = H(X_1, X_2, . . . , X_n) − H(X_1, . . . , X_{n−1})
= H(X_1, X_2, . . . , X_n) − H(X_2, . . . , X_n)   (by stationarity, Property 13)
= H(X_1|X_2 . . . X_n).

Note that this holds even if X is not reversible.
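A numerical confirmation of the forwards/backwards identity (a sketch in Python; the chain is an arbitrary non-reversible example): the two conditional entropies below agree even though the process is not reversible.

```python
import itertools
import numpy as np

# A stationary but non-reversible Markov chain: the cyclic bias breaks detailed balance,
# while the uniform distribution is stationary since P is doubly stochastic.
P = np.array([[0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8],
              [0.8, 0.1, 0.1]])
pi = np.array([1 / 3, 1 / 3, 1 / 3])

def entropy_of_marginal(n, coords):
    """Entropy of (X_i : i in coords), computed from the joint law of (X_1, ..., X_n)."""
    dist = {}
    for w in itertools.product(range(3), repeat=n):
        p = pi[w[0]]
        for a, b in zip(w, w[1:]):
            p *= P[a, b]
        key = tuple(w[i] for i in coords)
        dist[key] = dist.get(key, 0.0) + p
    q = np.array([v for v in dist.values() if v > 0])
    return -(q * np.log(q)).sum()

n = 5
H_all  = entropy_of_marginal(n, range(n))          # H(X_1, ..., X_n)
H_init = entropy_of_marginal(n, range(n - 1))      # H(X_1, ..., X_{n-1})
H_tail = entropy_of_marginal(n, range(1, n))       # H(X_2, ..., X_n)

print(H_all - H_init)   # H(X_n | X_{n-1} ... X_1)
print(H_all - H_tail)   # H(X_1 | X_2 ... X_n), equal per the Prop
```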