Lecture 1:
Pass out sign-up sheet
Course is NOT a comprehensive intro. to any of the subjects, but
rather will stress the connections, through entropy, among the subjects.
Pass out Outline and briefly go over it.
– BRIEF discussion of origins of entropy in stat. mech. ([K], ch. 1)
– Entropy in info. theory: concrete, using only prob. theory, no
measure theory ([CT], ch. 2, 3, 4)
– Entropy in ergodic theory: more abstract, using measure theory;
will emphasize single measure-preserving transformations on probability spaces (actions of Z_+), but will consider more general actions
([W], ch. 1, 2, 4, [K], ch. 1, 2, 3)
– Entropy in top. dyn.: continuous transformations of compact
metric spaces ([W], ch. 5, 7, [LM], ch. 2, 4)
– Connections among entropy in erg. theory and top. dynamics;
thermodynamic formalism ([W], ch. 6, 8, 9, [K], ch. 4, 5)
Describe the four texts
Will mention exercises and then go over in class
Regular time for class: MWF 11 (or 12, 1 or 2) ???
——————————————————
(Shannon) ENTROPY: H
1. Entropy of a R.V. X with finite range:
X ∼ p(x), i.e., p(x) = p(X = x):
H(X) = H_b(X) = −Σ_x p(x) log_b p(x)
The base b of the log is mostly irrelevant; changing it just rescales H by a multiplicative constant.
In information theory, use base 2 (bits); in ergodic theory use base
e (nats). We will mostly use base e.
We take: 0 log 0 ≡ 0, since lim_{x→0} x log x = 0
2. Entropy of Prob. dist.
Note: H(X) does not depend on values of X, but only on the
distribution p. The values of X can be anything – reals, vectors,
subsets of a set, horses.
Let p = (p1, . . . , pn) be a probability vector; then define
H(p) = −Σ_i pi log pi
Intuitive meaning: H(X) represents
— uncertainty in outcomes of X
— information gained in revealing an outcome of X.
— degree of randomness or disorder of X
We will see later that subject to some simple natural axioms, there
is really only one choice for H (up to the base of the logarithm).
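As a quick illustration (a minimal Python sketch, not part of the original notes), the definition and the 0 log 0 ≡ 0 convention translate directly into code:

import math

def entropy(p, base=math.e):
    # Entropy of a probability vector p; terms with p_i = 0 are taken to be 0.
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

# Matches the intuitive meaning above: maximal for uniform, zero for deterministic.
print(entropy([0.5, 0.5], base=2))       # 1.0 bit
print(entropy([1.0, 0.0]))               # 0.0
print(entropy([0.25] * 4), math.log(4))  # both equal log 4 (nats)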
Binary entropy function:
f (p) = H(p, 1 − p) = −p log p − (1 − p) log(1 − p)
f′(p) = −log p − 1 + log(1 − p) + 1 = log((1 − p)/p) (base e)
So, critical point when (1 − p)/p = 1, i.e., p = 1/2.
f″(p) = −1/p − 1/(1 − p)
So, strictly concave.
Graph of f (p)
Prop:
0) H(p) ≥ 0
1) H(p) = 0 iff p is deterministic, i.e., for some unique i, pi = 1.
2) H(p) is continuous and strictly concave
3) H(p) ≤ log n with equality iff p is uniform, i.e., p = (1/n, . . . , 1/n)
Briefly talk through the proof
This agrees with intuitive meaning of H(p).
STAT MECH: Boltzmann, Gibbs,
Ideal gas
Micro-state: physical state of system at a given time, e.g., positions
and momenta of all particles or clustering of such vectors
Macro-state: probability distribution on set of micro-states
Laws of Thermodynamics:
– 1st law of thermo: without any external influences, energy of
macro-state is fixed
– 2nd law of thermo: the macro-state tends to a state of maximal
disorder, i.e., maximum entropy, subject to fixed energy; such a
state is called an “equilibrium state” (our defn. will be different but
at least in this context it is the same as this)
TRY to make this precise:
Let {s1, . . . , sn} be the collection of micro-states.
Let ui = U (si) be the energy of si (here, U is the energy function)
Let E ∗ be a fixed value of energy of the system.
FIND
arg max_{p : Ep U = E∗} H(p)
Constrained optimization problem:
Maximize H(p) subject to
Σ_i ui pi = E∗
Σ_i pi = 1
pi ≥ 0.
Apply Lagrange multipliers:
grad H = βU + λ1 (viewed as vector equation)
− log pi − 1 = βui + λ
(assuming base = e)
Solution:
pi = c e^{−βui} = e^{−βui}/Z(β)
where Z is the normalization factor:
Z(β) = Σ_j e^{−βuj}.
Call this prob. dist. µβ; note β = β∗ is uniquely determined by
E_{µβ∗}(U) = E∗
Check that:
H(µβ∗) = log Z(β∗) + β∗ E_{µβ∗}(U).
(Assumes U is not constant and min U < E∗ < max U.)
Will later show that this really is a global max.
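A numerical sanity check of this computation (a Python sketch, not part of the notes; the energy vector U and the target E∗ below are made-up toy values). It finds β∗ by bisection, using the monotonicity of β ↦ E_{µβ}(U) established in the next lecture, and verifies the displayed identity:

import math

def gibbs(U, beta):
    # Gibbs state mu_beta: p_i proportional to exp(-beta * u_i).
    w = [math.exp(-beta * u) for u in U]
    Z = sum(w)
    return [wi / Z for wi in w], Z

def mean_energy(U, beta):
    p, _ = gibbs(U, beta)
    return sum(pi * ui for pi, ui in zip(p, U))

def solve_beta(U, E_star, lo=-50.0, hi=50.0, iters=200):
    # Bisection for beta*: beta -> E_{mu_beta}(U) is strictly decreasing.
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean_energy(U, mid) > E_star:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

U = [1.0, 2.0, 4.0]          # toy energy levels (made up for illustration)
E_star = 2.5                 # any value in (min U, max U)
beta_star = solve_beta(U, E_star)
p, Z = gibbs(U, beta_star)
H = -sum(pi * math.log(pi) for pi in p)
# Check the identity H(mu_beta*) = log Z(beta*) + beta* E*:
print(H, math.log(Z) + beta_star * E_star)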
A prob. dist. of form µβ is called a Gibbs state (or Maxwell-Boltzmann state). So, every equilibrium state is an (explicit)
Gibbs state.
A MAJOR GOAL of this course: to show that this and its converse
hold in great generality, with general defns of equilibrium state and
Gibbs state. More details later.
—————————————– ————————————–
SKIPPED:
Roots of Entropy:
Clausius (1865): classically, dS = dQ/T .
Idea of entropy maximization: find the most random source subject to constraints (so as to minimize bias).
Etymology: chose a Greek word, entropia, translated roughly as ‘transformation’ (toward disorder), having to do with the transformation of a
body as it loses heat.
Roots of Ergodicity:
A system is ergodic if starting at almost any initial micro-state
(say, the set of all positions/momenta of all particles) of a system, the
trajectory (of micro-states) visits approximately every other micro-state; Boltzmann’s ergodic hypothesis was that this holds for certain
physical systems, such as an ideal gas. Then one can compute the
average value of an observable evolving over time as the expected
value of the observable w.r.t the equilibrium state.
Etymology: work path or energy path
—————————————– —————————————–
Lecture 2:
Recall that for a r.v. X with finite range,
H(X) = −Σ_x p(x) log_b p(x)
H(p) = −Σ_i pi log pi
Let U = (u1, . . . , un) ∈ R^n (viewed as a function on {1, . . . , n}).
Let β ∈ R.
Let µβ be the probability vector (p1, . . . , pn) defined by
pi = e^{−βui}/Z(β)
where Z is the normalization factor:
Z(β) = Σ_j e^{−βuj}.
Theorem: ([K], section 1.1)
Let U, β, µβ be as above. Assume that U is NOT constant.
1. The map
R → (min U, max U), β ↦ E_{µβ}(U)
is a bijection (actually, a homeo)
2. Let min U < E∗ < max U.
For the unique β∗ ∈ R s.t. E_{µβ∗}(U) = E∗,
µβ∗ = arg max_{p : Ep U = E∗} H(p)
(unique) and
H(µβ∗) = log Z(β∗) + β∗E∗.
Proof:
1.
Onto:
lim_{β→−∞} µβ = uniform on {i : ui = max U}
lim_{β→∞} µβ = uniform on {i : ui = min U}
Since the map is continuous, it is onto.
1-1:
Calculate:
dE_{µβ}(U)/dβ = −Variance_{µβ}(U) < 0.
So, the map is 1-1.
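The derivative formula can be checked numerically (a sketch with made-up, non-constant energies, not from the notes): compare a centered finite difference of β ↦ E_{µβ}(U) against −Variance_{µβ}(U).

import math

def gibbs_weights(U, beta):
    w = [math.exp(-beta * u) for u in U]
    Z = sum(w)
    return [wi / Z for wi in w]

def gibbs_mean(U, beta):
    # E_{mu_beta}(U)
    p = gibbs_weights(U, beta)
    return sum(pi * ui for pi, ui in zip(p, U))

def gibbs_var(U, beta):
    p = gibbs_weights(U, beta)
    m = sum(pi * ui for pi, ui in zip(p, U))
    return sum(pi * (ui - m) ** 2 for pi, ui in zip(p, U))

U = [1.0, 2.0, 4.0]    # made-up, non-constant energies
beta, h = 0.7, 1e-6
finite_diff = (gibbs_mean(U, beta + h) - gibbs_mean(U, beta - h)) / (2 * h)
print(finite_diff, -gibbs_var(U, beta))   # approximately equal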
2. More or less did this last time using Lagrange multipliers. That
approach helps you find arg max.
Apply Jensen to log(x), which is concave since (log x)″ = −1/x² < 0.
Given arbitrary p, we have:
H(p) − β∗Ep(U) = Σ_{i=1}^n pi log(e^{−β∗ui}/pi) ≤ log(Σ_{i=1}^n pi · e^{−β∗ui}/pi) = log Z(β∗)
with equality iff e^{−β∗ui}/pi is constant in i, i.e., p = µβ∗.
Take care re: some pi = 0
In particular, if we restrict to p that satisfy the constraint, we get
H(p) − β∗E∗ ≤ log Z(β∗)
with equality iff p = µβ∗.
Thus, µβ∗ achieves the max and
max = log Z(β∗) + β∗E∗.
Note: If U is constant, the uniform distribution achieves the unconstrained maximum.
INFORMATION THEORY: Shannon
For a sequence of jointly distributed r.v.’s, (X1, . . . , Xn) is a r.v.
and so H(X1, . . . , Xn) is already defined.
For a stationary process:
X = X1, X2, . . . , (one-sided)
– or –
X = . . . , X−1, X0, X1, X2, . . . , (two-sided)
define:
h(X) = lim_{n→∞} H(X1, . . . , Xn)/n
called the entropy or entropy rate of the process.
Will show later that limit exists!
– For i.i.d. X, it turns out that h(X) = H(X1).
– For stationary Markov, there is a simple formula (see the sketch after this list).
– For more general stationary processes, h(X) can be complicated.
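For the stationary Markov case, the simple formula referred to above is the standard one, h = −Σ_i πi Σ_j Pij log Pij, where P is the transition matrix and π its stationary distribution; a minimal sketch with a made-up two-state chain (not from the notes):

import math

def markov_entropy_rate(P, pi, base=2):
    # h = -sum_i pi_i sum_j P_ij log P_ij for a stationary Markov chain with
    # transition matrix P and stationary distribution pi (standard formula;
    # the notes only assert that such a formula exists).
    return -sum(pii * sum(p * math.log(p, base) for p in row if p > 0)
                for pii, row in zip(pi, P))

# Toy two-state chain (made-up numbers); its stationary distribution is (2/3, 1/3).
P = [[0.9, 0.1],
     [0.2, 0.8]]
pi = [2/3, 1/3]
print(markov_entropy_rate(P, pi))   # entropy rate in bits per symbol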
Fact: h2(X) quantifies the optimal (smallest) compression bit rate
for a stationary random process.
—- a compression encoder: invertibly encodes X process sequences
into short binary strings
—- compression bit rate = expected ratio of bits per process symbol:
(number of coded binary bits)/(number of process symbols)
THEOREM: For a stationary process X, the optimal compression
bit rate is h2(X) + ε (achievable for every ε > 0).
– Specific Example: X : iid(.6, .3, .1) with values a, b, c
Encode symbols to codewords:
a ↦ 0, b ↦ 10, c ↦ 11
High probability letters ↦ short binary words
Compress a sequence of process outputs by concatenation of codewords:
abacb ↦ 01001110
Decodable: because of the prefix-free list:
abacb ↦ |0|10|0|11|10
Expected length of codeword:
.6 + .3 ∗ 2 + .1 ∗ 2 = 1.4 bits per source symbol
Try again:
aa ↦ 00, ab ↦ 10, ba ↦ 010, bb ↦ 110, ac ↦ 011,
ca ↦ 1110, bc ↦ 11110, cb ↦ 111110, cc ↦ 111111
Expected length of codeword: 1.35 bits per source symbol
h2(X) ≈ 1.2955
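A quick check of the numbers in this example (a Python sketch, not part of the notes):

import math
from itertools import product

p = {'a': 0.6, 'b': 0.3, 'c': 0.1}
h2 = -sum(q * math.log2(q) for q in p.values())          # ~1.2955 bits/symbol

code1 = {'a': '0', 'b': '10', 'c': '11'}
rate1 = sum(p[s] * len(code1[s]) for s in p)              # 1.4 bits/symbol

code2 = {'aa': '00', 'ab': '10', 'ba': '010', 'bb': '110', 'ac': '011',
         'ca': '1110', 'bc': '11110', 'cb': '111110', 'cc': '111111'}
rate2 = sum(p[s] * p[t] * len(code2[s + t])
            for s, t in product(p, repeat=2)) / 2         # per source symbol
print(h2, rate1, rate2)                                   # 1.2955..., 1.4, 1.35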
Shannon entropy also used to quantify optimal (highest) bit rate
of transmission across a noisy channel. Draw black box.
Can also consider:
X = (Xi)_{i∈I}, I = Z, N, Z^d, Z_+^d, or other sets.
Lecture 3:
– Will return soon to entropy of stationary processes and data
compression
– Everybody OK with Jensen?
ERGODIC THEORY:
Study of measure-preserving transformations (MPT)
For SIMPLICITY, we will now consider only invertible MPT’s
(IMPT)
– (M, A, µ) a probability space
– T : M → M, invertible (a.e.), bi-measurable, and bi-measure-preserving:
—- for A ∈ A, T(A), T^{−1}(A) ∈ A and µ(T(A)) = µ(A), µ(T^{−1}(A)) = µ(A).
We will study various properties of MPT’s that have to do with
iteration of T:
– for x ∈ M, consider the “orbit” {. . . , T^{−1}x, x, Tx, T^2(x) ≡ T(T(x)), . . .}
– or orbits of sets
– classical properties studied (1930’s, 1940’s):
—– recurrence (most points come back to near where they started):
if µ(A) > 0, do almost all points of A return to A?
—– ergodicity (most points visit most of the space; uniformly
distributed): if µ(A) > 0, do almost all points of M visit A?
—– mixing (asymptotic independence of the orbit of a set): if µ(A), µ(B) > 0,
does µ(T^n(A) ∩ B) → µ(A)µ(B)?
Example 1: (Circle rotation by angle α)
M : circle identified as [0, 2π] with normalized Lebesgue measure
Tα (θ) = θ + α mod 2π
Iterate rotation:
– recurrent, yes
– ergodic: yes iff α/(2π) is irrational
– mixing: no
– RIGID
Example 2: (Baker’s transformation)
M : unit square with Lebesgue measure
T(x, y) = (2x mod 1, (1/2)y)            if 0 ≤ x < 1/2
T(x, y) = (2x mod 1, (1/2)y + 1/2)      if 1/2 ≤ x < 1
Describe by picture:
Iterate Baker:
– recurrent, ergodic and mixing
– CHAOTIC
Example 3: (X shift):
M = F^Z = {· · · x−1 x̂0 x1 x2 · · · : xi ∈ F}
(F is a finite alphabet)
µ: defined by a (two-sided) stationary process X = (· · · X−1 X0 X1 X2 · · ·)
– T = σ, the left shift map:
– σ(x) = y, where yi = xi+1 (left shift)
for
A = {x ∈ M : xi = a1, . . . , xj = aj−i+1}
define
µ(A) = p(Xi = a1, . . . , Xj = aj−i+1)
Extend to the sigma-algebra.
Then
σ(A) = {x ∈ M : xi−1 = a1, . . . , xj−1 = aj−i+1}
µ(σ(A)) = µ(A)
σ is MPT ⇐⇒ X process is stationary
properties depend very heavily on X
—- all of these properties are invariant under a notion of isomorphism.
– Given IMPT’s T on (M, A, µ) and S on (N, B, ν), we say that
T and S are isomorphic, and we write T ≅ S, if there exists an
invertible (a.e.), bi-measurable, bi-measure-preserving mapping φ :
M → N s.t.
φ ∘ T = S ∘ φ
Example: T = iid(1/2, 1/2) shift ≅ S = Baker:
φ(· · · x−1 x̂0 x1 x2 · · ·) = (.x0x1x2 · · · , .x−1x−2 · · ·) (in binary)
φ ∘ T(· · · x−1 x̂0 x1 x2 · · ·) = φ(· · · x−1 x0 x̂1 x2 · · ·) = (.x1x2 · · · , .x0x−1 · · ·)
S ∘ φ(· · · x−1 x̂0 x1 x2 · · ·) = S(.x0x1x2 · · · , .x−1x−2 · · ·) =
(.x1x2 · · · , .0x−1x−2 · · · + .x0000 · · ·) = (.x1x2 · · · , .x0x−1x−2 · · ·)
Define the 3-Baker transformation.
Similarly, iid(1/3, 1/3, 1/3) shift ≅ 3-Baker transformation.
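A numerical check of this conjugacy (a sketch, not from the notes), with truncated binary expansions standing in for genuine bi-infinite sequences:

import random

def baker(x, y):
    # 2-Baker transformation of the unit square.
    if x < 0.5:
        return 2 * x, y / 2
    return 2 * x - 1, y / 2 + 0.5

def phi(left, right):
    # phi(... x_{-1} x^_0 x_1 ...) = (.x0 x1 ..., .x_{-1} x_{-2} ...) in binary,
    # with right = (x0, x1, ...) and left = (x_{-1}, x_{-2}, ...), truncated.
    x = sum(b / 2 ** (k + 1) for k, b in enumerate(right))
    y = sum(b / 2 ** (k + 1) for k, b in enumerate(left))
    return x, y

# Check phi o sigma = Baker o phi on a random truncated sequence.
left = [random.randint(0, 1) for _ in range(40)]     # x_{-1}, x_{-2}, ...
right = [random.randint(0, 1) for _ in range(40)]    # x0, x1, ...
# sigma shifts left: the new x0 is the old x1, and the old x0 joins the left half.
shifted_left, shifted_right = [right[0]] + left, right[1:]
print(phi(shifted_left, shifted_right))
print(baker(*phi(left, right)))          # agrees, up to floating-point rounding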
Q: Is 2-Baker ≅ 3-Baker?
Equivalently:
Q: Is iid(1/2, 1/2)-shift ≅ iid(1/3, 1/3, 1/3)-shift?
None of the classical invariants distinguishes them.
Answer is No! Thanks to entropy of IMPT’s!
– h(2-Baker) = log 2 and h(3-Baker) = log 3
Lecture 4:
STEPS to compute h(T ) for an MPT T :
a. Entropy of a partition
Let α = {A1, . . . , An} be a measurable partition of a probability
space (X, B, µ). Let p = (µ(A1), . . . , µ(An)).
H(α) = H(p)
b. W.r.t. a fixed finite partition α of M, T looks like a (two-sided)
stationary process X; define h(T, α) as the Shannon entropy of X.
p(xi . . . xj) = µ(∩_{k=i}^{j} T^{−k}(A_{x_k}))
T IMPT =⇒ process is stationary
Example: Baker
The partition {A0 = left, A1 = right} makes (T, α) look like
X = iid(1/2, 1/2).
c. Define h(T ) = supα h(T, α)
– Kolmogorov and Sinai (1957-1958) showed how to compute h(T )
——— If α is sufficiently fine, then h(T ) = h(T, α)
Corollary:
Let p be a probability vector.
Let X^p denote the iid(p) stationary process (two-sided).
Let σ_p be the iid(p) shift, as an IMPT.
Then
h(σ_p) = h(X^p) = H(p)
Thus, h(σ_(1/2,1/2)) ≠ h(σ_(1/3,1/3,1/3))
Thus, σ_(1/2,1/2) ≇ σ_(1/3,1/3,1/3).
Ornstein (1969): Entropy is a complete invariant for the σ_p.
Example: p = (1/4, 1/4, 1/4, 1/4), q = (1/2, 1/8, 1/8, 1/8, 1/8)
Both have entropy log 4:
4(−(1/4) log(1/4)) = log 4
−(1/2) log(1/2) − 4((1/8) log(1/8)) = log 4
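Checking the arithmetic (a small sketch, not from the notes):

import math

def H(p):
    return -sum(x * math.log(x) for x in p if x > 0)

print(H([1/4] * 4), H([1/2, 1/8, 1/8, 1/8, 1/8]), math.log(4))   # all equal log 4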
TOPOLOGICAL DYNAMICS:
– M : compact metric space, T : M → M homeomorphism
– Examples 1 and 3 above (and homeos that imitate Baker: hyperbolic toral auto.)
– Topological analogues of ergodicity, mixing, entropy (Adler-Konheim-McAndrew, Bowen, 1960’s)
– Classify up to topological conjugacy: commutative diagram,
homeomorphism
COMBINE Ergodic Theory and Topological Dynamics:
– Consider invariant measures for homeo T of compact metric
space M .
– Variational Principle: Top. Entropy = sup of meas-theo entropies, over all invariant measures
– In many cases, sup is achieved and is unique (gives a natural
measure)
– Generalize to pressure, equilibrium states and Gibbs states.
– Generalize to actions of group or semi-group, esp. Z, Z_+, Z^d, Z_+^d
—————————————
SKIPPED:
Review of Jensen’s inequality:
Recall Defn: f : D ⊆ R → R is strictly concave if for all x, y and
0 ≤ α ≤ 1,
f (αx + (1 − α)y) ≥ αf (x) + (1 − α)f (y)
with equality iff x = y or α = 0, 1
Draw picture
Facts:
1. f : D ⊆ R → R is strictly concave iff for all x1, . . . , xn and
0 < α1, . . . , αn < 1 with Σ_i αi = 1,
f(Σ_{i=1}^n αi xi) ≥ Σ_{i=1}^n αi f(xi)
with equality iff all xi corresponding to non-zero αi are the same.
2. If f″ < 0, then f is strictly concave.
Note f(x) = log x is strictly concave: f″(x) = −1/x² (if base b = e)
And g(x) = −x log(x) is also strictly concave: g″(x) = −1/x
—————————————
Lecture 5:
For an MPT, there exists a (finite) partition which separates points
if T is ergodic and h(T ) < ∞.
Recall:
X: r.v. with finite range
Prop:
0) H(p) ≥ 0 (H(X) ≥ 0)
1) H(p) = 0 iff p is deterministic, i.e., for some unique i, pi = 1
(H(X) = 0 iff X is deterministic)
2) H(p) is continuous and concave
3) H(p) ≤ log n with equality iff p is uniform.
JOINT ENTROPY:
Let (X, Y ) be a correlated pair of r.v.s with dist. p(x, y).
H(X, Y) = −Σ_{(x,y)} p(x, y) log p(x, y)
Note: below when we write p(x) (or p(y)) we mean pX (x) (or
pY (y)).
CONDITIONAL ENTROPY:
H(Y|X) = −Σ_{(x,y)} p(x, y) log p(y|x)
Equivalently,
H(Y|X) = Σ_x p(x)(−Σ_y p(y|x) log p(y|x)) = Σ_x p(x) H(Y|X = x)
where H(Y |X = x) means the entropy of the r.v. Y conditioned on
X = x, i.e., the dist. p(·|x).
So, H(Y |X) is a weighted average of the H(Y |X = x).
Note: when p(x, y) = 0 we take the corresponding term to be 0.
Explain intuition.
Property 1:
H(X, Y ) = H(X) + H(Y |X)
Proof:
p(x, y) = p(x)p(y|x)
log p(x, y) = log p(x) + log p(y|x)
H(X, Y ) = −Ep(x,y) log p(x, y) = −Ep(x,y) log p(x)−Ep(x,y) log p(y|x) =
−Ep(x) log p(x) − Ep(x,y) log p(y|x) = H(X) + H(Y |X)
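Property 1 is easy to verify numerically (a sketch using a made-up joint distribution, not from the notes):

import math

def H(dist):
    # Entropy of a distribution given as {outcome: probability}.
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

# A made-up joint distribution p(x, y) on {0,1} x {0,1}.
pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
px = {x: sum(p for (a, _), p in pxy.items() if a == x) for x in (0, 1)}

H_XY = H(pxy)
H_X = H(px)
H_Y_given_X = -sum(p * math.log(p / px[x]) for (x, y), p in pxy.items())
print(H_XY, H_X + H_Y_given_X)   # equal: Property 1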
Property 2:
H(Y |X) ≤ H(Y ) with equality iff X ⊥ Y
Proof:
H(Y|X) = −Σ_{(x,y)} p(x, y) log p(y|x) = Σ_y p(y) Σ_x p(x|y) log(1/p(y|x))
≤ Σ_y p(y) log(Σ_x p(x|y)/p(y|x)) = Σ_y p(y) log(Σ_x p(x)/p(y)) = Σ_y p(y) log(1/p(y)) = H(Y)
and by Jensen, with equality iff for each y, p(y|x) is constant as a
function of x, and therefore X ⊥ Y .
Note: literally, we have shown that p(x, y) = p(x)p(y) only when
p(x, y) > 0. But it then follows for all (x, y).
Defn. Let X be a r.v. (with finite range). Let f : range of X →
some set. Then Y = f(X) is the r.v. with distribution
p_Y(y) = p(y) = Σ_{x : f(x)=y} p(x).
Convention: (X, f (X)) ∼ p(x, f (x)).
Example:
Property 3:
H(Y |X) ≥ 0 with equality iff Y = f (X) (for some f ).
Proof:
H(Y|X) is a weighted average (weighted by p(x)) of entropies of random variables. Each entropy is nonnegative. So, H(Y|X) ≥ 0.
Equality holds iff for a.e. x, H(Y |X = x) = 0 iff for a.e. x,
p(Y |X = x) is deterministic iff for a.e. x, there is a unique y such
that p(y|x) = 1; set y = f (x).
(draw graph of f ).
Property 4:
H(X, f (X)) = H(X)
Proof:
H(X, f (X)) = H(X) + H(f (X)|X) = H(X)
by Prop 3.
Property 5:
H(X1, . . . , Xn) = Σ_{i=1}^n H(Xi|Xi−1, . . . , X1)
Proof: By induction, and Prop 1:
H(X1, . . . , Xn) = H(X1, . . . , Xn−1) + H(Xn|X1, . . . , Xn−1)
= Σ_{i=1}^{n−1} H(Xi|Xi−1, . . . , X1) + H(Xn|X1, . . . , Xn−1)
Property 6:
H(X1, . . . , Xn) ≤ Σ_i H(Xi) with equality iff the Xi are independent.
Proof: Apply Prop. 2 and 5.
Property 7:
H(f (X)) ≤ H(X) with equality iff f is 1-1 (a.e.)
Proof: By Prop 4,
H(X, f (X)) = H(X)
Also,
H(X, f (X)) = H(f (X)) + H(X|f (X))
by Prop 1, and H(X|f(X)) ≥ 0 by Prop 3.
So, H(f(X)) ≤ H(X) – and –
H(f(X)) = H(X) iff H(X|f(X)) = 0 iff X = g(f(X)) (for some g) iff f is
1-1 a.e. (by Prop 3 (’only if’)).
Property 8:
H(Y |X) ≤ H(Y |f (X))
Proof: Let z denote a value of f(X). Then
LHS = Σ_z Σ_{x:f(x)=z} p(x) H(Y|X = x) = Σ_z Σ_{x:f(x)=z} Σ_y −p(x, y) log p(y|x)
And
RHS = Σ_z p(z) H(Y|f(X) = z) = Σ_z Σ_y −p(z, y) log p(y|z)
We show that the inequality holds term by term for each y and z, or
equivalently, dividing by p(z),
−Σ_{x:f(x)=z} (p(x, y)/p(z)) log p(y|x) ≤ −p(y|z) log p(y|z)
Proof:
LHS = Σ_{x:f(x)=z} (p(x)/p(z))(−p(y|x) log p(y|x))
By Jensen applied to −u log u,
LHS ≤ −u log u
where
u = Σ_{x:f(x)=z} (p(x)/p(z)) p(y|x) = Σ_{x:f(x)=z} p(y, x)/p(z) = p(y|z)
(equality holds iff for each y and z, either p(y|z) = 0 or the {p(y|x) :
f(x) = z} are all the same).
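A numerical illustration of Property 8 (a sketch with made-up probabilities and a hypothetical coarse-graining f, not from the notes):

import math
from collections import defaultdict

def cond_entropy(pxy):
    # H(Y|X) = -sum_{(x,y)} p(x,y) log p(y|x) for a joint dist {(x, y): prob}.
    px = defaultdict(float)
    for (x, _), p in pxy.items():
        px[x] += p
    return -sum(p * math.log(p / px[x]) for (x, y), p in pxy.items() if p > 0)

# Made-up joint distribution of (X, Y), X in {1,2,3,4}, Y in {0,1}.
pxy = {(1, 0): 0.10, (1, 1): 0.15, (2, 0): 0.20, (2, 1): 0.05,
       (3, 0): 0.05, (3, 1): 0.25, (4, 0): 0.15, (4, 1): 0.05}

def f(x):               # a hypothetical coarse-graining of X
    return x % 2

pfy = defaultdict(float)
for (x, y), p in pxy.items():
    pfy[(f(x), y)] += p

print(cond_entropy(pxy), cond_entropy(pfy))   # H(Y|X) <= H(Y|f(X))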
And many more, e.g.,
Property 9: H(X, Y |Z) = H(X|Z) + H(Y |X, Z)
Proof: Version of Prop. 1.
Property 10: H(Y |Z, X) ≤ H(Y |Z) with equality iff X ⊥Z Y
Proof: Version of Prop. 2.
Lecture 6:
Another approach to proof of Property 8 (Hannah Cairns):
For any fixed z, the distribution of (Y |f(X) = z) is a convex combination
of the distributions {(Y |X = x)}_{f(x)=z}. Since H(p) is a concave function of p,
Σ_{x:f(x)=z} p(x|z) H(Y|X = x) ≤ H(Y|f(X) = z),
equivalently
Σ_{x:f(x)=z} p(x, z) H(Y|X = x) ≤ p(z) H(Y|f(X) = z).
Now, add over all values z of f .
Property 11: If X ⊥Z (Y, W ), then H(Y |W, Z, X) = H(Y |W, Z).
Property 12: H(f (X)|Y ) ≤ H(X|Y )
Property 13: If X is stationary, then for all i, j, k
H(Xi . . . Xj ) = H(Xi+k . . . Xj+k )
Proposition (Simple Clumping): for n ≥ 2,
H(p1, . . . , pn) = H(p1 + p2, p3, . . . , pn) + (p1 + p2) H(p1/(p1 + p2), p2/(p1 + p2))
Proof: Let X ∼ p. Let Y = f (X) where f (1) = f (2) = 2 and
f (i) = i for i ≥ 2. By Prop 1 and Prop 4,
H(p1, . . . , pn) = H(X) = H(X, f (X)) = H(f (X)) + H(X|f (X))
By defn,
H(f (X)) = H(p1 + p2, p3, . . . , pn)
And H(X|f(X)) is a weighted avg. of the H(X|f(X) = z), which are
all zero except when z = 2, when it is H(p1/(p1 + p2), p2/(p1 + p2)) with weight
p1 + p2.
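A one-line numerical check of the clumping identity (a sketch with a made-up probability vector):

import math

def H(p):
    return -sum(x * math.log(x) for x in p if x > 0)

p = [0.1, 0.2, 0.3, 0.4]          # made-up probability vector
s = p[0] + p[1]
lhs = H(p)
rhs = H([s] + p[2:]) + s * H([p[0] / s, p[1] / s])
print(lhs, rhs)                   # equal: the clumping identity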
A: Axiomatic Treatment of Entropy
Prop: H(p) is the unique function (up to the base of the logarithm) such that:
1. H(p) ≥ 0 and H(p) ≢ 0.
2. (Symmetry) H(p1, . . . , pn) is a symmetric function of p for each n.
3. (Continuity) H(p1, . . . , pn) is a continuous function of p for each
n (n = 2 is sufficient).
4. (Simple Clumping)
H(p1, . . . , pn) = H(p1 + p2, p3, . . . , pn) + (p1 + p2) H(p1/(p1 + p2), p2/(p1 + p2))
Note: Shannon (1948) originally proved this with the additional
assumption that H(1/n, . . . , 1/n) is monotone increasing in n. The
result above was proven in the 50’s.
Information function: random variable I(x) = I_X(x) = −log p(x) =
log(1/p(x)) ∈ [0, ∞)
Quantifies the amount of information revealed when a sample from
X with value x is drawn:
– if p(x) ≈ 1, then very little information is revealed.
– if p(x) ≈ 0, then a lot of information is revealed.
Observe: H(X) = E_p I(X) (expected amount of information revealed).
Note: we could possibly also consider other functions as an information function.
Proposition (Exercise): I(x) = − log p(x) is the unique function
s.t.
1. I(x) ≥ 0 and I(x) 6≡ 0.
2. I(x) = I(p(x)).
3. If (X, Y ) are jointly distributed and the events (X = x) and
(Y = y) are independent, then I(x, y) = I(x) + I(y).
————————————–
SKIPPED FOR NOW:
Entropy of r.v. with countable range:
P
Say p = (p1, p2, . . .), with each pi ≥ 0 and i pi1, then H(p) is
P
defined as before: − i pi log pi.
P
1
— 1, Let qn = n(log n)2 and S = n qn < ∞ by the integral test.
—— Let pn = ( n(log1 n)2 )/S.
—— Then H(p) = ∞:
X
1
H(p) = (1/S)
)(log n + 2 log log n + log S)
(
2
n(log
n)
n
X
= (1/S)
(
n
X log log n
1
) + (1/S)
(
) + log S
2
n(log n)
n(log
n)
n
∞ + finite + finite
— 2. Let pn = 1/(n2S), where S = π 2/6.
—— Then H(p) < ∞:
X
X
2
H(p) = (1/S)
(2 log n)/n + 1/S
(log S)/n2 < ∞
n
n
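These two examples can be explored numerically (a sketch, not from the notes; the partial sums for Example 1 only hint at the divergence, which is as slow as log log N):

import math

N = 10**6   # truncation level

# Example 1: p_n proportional to 1/(n (log n)^2); the entropy diverges (like log log N).
S1 = sum(1 / (n * math.log(n) ** 2) for n in range(2, N))
H1 = 0.0
for n in range(2, N):
    p = 1 / (n * math.log(n) ** 2) / S1
    H1 -= p * math.log(p)

# Example 2: p_n = 1/(n^2 S); the entropy converges.
S2 = sum(1 / n**2 for n in range(1, N))
H2 = 0.0
for n in range(1, N):
    p = 1 / (n**2 * S2)
    H2 -= p * math.log(p)

print(H1, H2)   # H1 keeps growing (very slowly) as N grows; H2 stabilizes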
Entropy of r.v. with continuous range (differential entropy)
——————————————
Define Entropy (rate) of a stationary process X (one-sided or two-sided):
h(X) = lim_{n→∞} H(X1, . . . , Xn)/n
Proposition:
1. H(Xn|Xn−1 . . . X1) is monotonically decreasing (non-increasing)
2. h(X) = lim_{n→∞} H(Xn|Xn−1 . . . X1), and so the limit exists.
3. H(X1, . . . , Xn)/n is monotonically decreasing (non-increasing).
Proof:
1. By Property 8,
H(Xn|Xn−1 . . . X1) ≤ H(Xn|Xn−1 . . . X2)
By stationarity,
H(Xn|Xn−1 . . . X2) = H(Xn−1|Xn−2 . . . X1)
2. Apply Property 5:
H(X1, . . . , Xn)/n = [H(X1) + H(X2|X1) + H(X3|X2, X1) + . . . + H(Xn|Xn−1 . . . X1)]/n
But this is the Cesàro average of the sequence of conditional entropies,
which converges (being monotone and bounded below). Thus, H(X1, . . . , Xn)/n
also converges, to the same limit.
3. The Cesàro average of a monotone sequence is monotone.
So, we get two sequences of approximations to h(X).
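For a concrete stationary process, both approximating sequences can be computed directly (a sketch with a made-up two-state Markov chain, not from the notes):

import math
from itertools import product

# A made-up stationary two-state Markov chain.
P = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
pi = {0: 2/3, 1: 1/3}            # its stationary distribution

def block_entropy(n):
    # H(X1, ..., Xn) for the stationary chain.
    h = 0.0
    for word in product((0, 1), repeat=n):
        p = pi[word[0]]
        for a, b in zip(word, word[1:]):
            p *= P[a][b]
        h -= p * math.log(p)
    return h

Hn = [block_entropy(n) for n in range(1, 11)]
per_symbol = [h / n for n, h in enumerate(Hn, start=1)]            # non-increasing
conditional = [Hn[0]] + [Hn[k] - Hn[k - 1] for k in range(1, 10)]  # non-increasing
print(per_symbol)
print(conditional)   # both sequences approach h(X) from above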
Prop: (Forwards, Backwards) If X is stationary, then
H(Xn|Xn−1 . . . X1) = H(X1|X2 . . . Xn)
and so
h(X) = lim_{n→∞} H(X1|X2 . . . Xn)
Proof:
H(Xn|Xn−1 . . . X1) = H(X1, X2 . . . Xn) − H(X1 . . . Xn−1) =
H(X1, X2 . . . Xn) − H(X2 . . . Xn) = H(X1|X2 . . . Xn)
Note that this holds even if X is not reversible.