Chapter 6
Entropy and Shannon’s First Theorem
Information
A quantitative measure of the amount of information an event represents. I(p) = the amount of information in the occurrence of an event of probability p (a single symbol from the source).
Axioms:
A. I(p) ≥ 0 for any event of probability p
B. I(p1∙p2) = I(p1) + I(p2) when p1 and p2 are independent events
C. I(p) is a continuous function of p
Axiom B is the Cauchy functional equation.
Existence: I(p) = log(1/p) satisfies the axioms, in any base.
Units of information:
in base 2 = a bit
in base e = a nat
in base 10 = a Hartley
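As a quick illustration (a minimal sketch in Python; the function name is mine, not from the slides), here is I(p) in the three unit systems, plus a check of the additivity axiom:

import math

def information(p, base=2):
    # I(p) = log_base(1/p): bits for base 2, nats for base e, Hartleys for base 10
    return math.log(1.0 / p, base)

p = 1 / 8
print(information(p, 2), information(p, math.e), information(p, 10))  # ~3 bits, ~2.079 nats, ~0.903 Hartleys

# Axiom B: for independent events the information adds.
p1, p2 = 0.5, 0.25
assert abs(information(p1 * p2) - (information(p1) + information(p2))) < 1e-12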
Uniqueness:
Suppose I′(p) satisfies the axioms. Take any 0 < p0 < 1 (with I′(p0) > 0, so the base below is defined) and let k = (1/p0)^(1/I′(p0)). Then k^I′(p0) = 1/p0, and hence logk(1/p0) = I′(p0). Now any z ∈ (0,1) can be written as p0^r for some real r ∈ R+ (namely r = logp0 z). The Cauchy functional equation implies I′(p0^n) = n·I′(p0) for n ∈ Z+ and I′(p0^(1/m)) = (1/m)·I′(p0) for m ∈ Z+, which gives I′(p0^(n/m)) = (n/m)·I′(p0), and hence by continuity I′(p0^r) = r·I′(p0).
Hence I′(z) = r·logk(1/p0) = logk(1/p0^r) = logk(1/z).
Note: In this proof we introduce an arbitrary p0, show how any z relates to it, and then eliminate the dependency on that particular p0.
Entropy
The average amount of information received on a per-symbol basis from a source S = {s1, …, sq} of symbols, where si has probability pi. It measures the information rate.
In radix r, when all the probabilities are independent:
Hr(S) = Σ_{i=1..q} pi·logr(1/pi)        (the weighted arithmetic mean of the information)
      = logr Π_{i=1..q} (1/pi)^pi       (the information of the weighted geometric mean)
• Entropy is the amount of information in a probability distribution.
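A minimal Python sketch of this definition (the function name is mine, not from the slides):

import math

def entropy(probs, r=2):
    # H_r(S) = sum of p_i * log_r(1/p_i); symbols with p_i = 0 contribute nothing
    return sum(p * math.log(1.0 / p, r) for p in probs if p > 0)

print(entropy([0.5, 0.5]))                # 1.0 bit per symbol
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits per symbol
print(entropy([2/3, 1/3]))                # ~0.918 bits per symbol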
Alternative approach: consider a long message of N symbols from S = {s1, …, sq} with probabilities p1, …, pq. You expect si to appear N·pi times, and the probability of this typical message is
P = Π_{i=1..q} pi^(N·pi),
whose information is log(1/P) = Σ_{i=1..q} N·pi·log(1/pi) = N·H(S).
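Numerically, treating the counts N·pi as exact, log2(1/P) for such a typical message equals N·H(S); a small sketch:

import math

p = [0.5, 0.25, 0.25]   # source probabilities
N = 1000                # message length
H = sum(pi * math.log2(1 / pi) for pi in p)

# log2(1/P) when s_i appears exactly N*p_i times:
log2_inv_P = sum(N * pi * math.log2(1 / pi) for pi in p)
print(log2_inv_P, N * H)  # both 1500.0 bits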
Consider f(p) = p·ln(1/p) (the analysis works for any base, not just e):
f′(p) = (−p·ln p)′ = −p·(1/p) − ln p = −1 + ln(1/p)
f″(p) = −1/p < 0 for p ∈ (0,1), so f is concave down.
f′(1/e) = 0 and f(1/e) = 1/e, so the maximum is at p = 1/e; also f′(0+) = +∞, f′(1) = −1, and f(1) = 0.
lim_{p→0+} f(p) = lim_{p→0+} ln(1/p) / (1/p) = lim_{p→0+} (−1/p) / (−1/p²) = lim_{p→0+} p = 0   (L’Hôpital).
[Figure: graph of f(p) on (0,1), rising from 0 at p = 0 to its maximum 1/e at p = 1/e and falling back to 0 at p = 1.]
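A quick numerical sanity check of the maximum at p = 1/e (a crude grid search, just for illustration):

import math

def f(p):
    return p * math.log(1 / p)

grid = [i / 10**6 for i in range(1, 10**6)]
p_star = max(grid, key=f)
print(p_star, 1 / math.e)     # ~0.367879 in both cases
print(f(p_star), 1 / math.e)  # the maximum value is ~0.367879 = 1/e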
Basic information about the logarithm function
Tangent line to y = ln x at x = 1: y − ln 1 = (ln x)′|x=1 · (x − 1), i.e. y = x − 1.
(ln x)″ = (1/x)′ = −1/x² < 0 for all x > 0, so ln x is concave down.
[Figure: graphs of ln x and the tangent line y = x − 1, which lies above ln x except at the point of tangency x = 1.]
Conclusion: ln x ≤ x − 1, with equality only at x = 1.
Fundamental Gibbs inequality
Let (xi) and (yi), i = 1..q, be two probability distributions (Σ xi = 1, Σ yi = 1), and consider
Σ_{i=1..q} xi·log(yi/xi) ≤ 0,   with equality only when xi = yi for all i.
Proof, using ln x ≤ x − 1:
Σ_{i=1..q} xi·log(yi/xi) ≤ Σ_{i=1..q} xi·(yi/xi − 1) = Σ_{i=1..q} yi − Σ_{i=1..q} xi = 1 − 1 = 0.
• Minimum entropy occurs when one pi = 1 and all the others are 0.
• Maximum entropy occurs when? Consider Gibbs with the distribution yi = 1/q:
H(S) − log q = Σ_{i=1..q} pi·log(1/pi) − log q · Σ_{i=1..q} pi = Σ_{i=1..q} pi·log(1/(q·pi)) ≤ 0.
• Hence H(S) ≤ log q, and equality occurs only when every pi = 1/q.
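A small numerical illustration of the Gibbs inequality and of the log q bound (the distributions here are arbitrary examples, not from the slides):

import math

def gibbs_sum(x, y):
    # sum of x_i * log2(y_i / x_i); the Gibbs inequality says this is <= 0
    return sum(xi * math.log2(yi / xi) for xi, yi in zip(x, y) if xi > 0)

x = [0.5, 0.3, 0.2]
y = [0.2, 0.5, 0.3]
print(gibbs_sum(x, y))                 # negative
print(gibbs_sum(x, x))                 # 0.0: equality only when x_i = y_i
print(gibbs_sum(x, [1/3, 1/3, 1/3]))   # H(S) - log2(3), also <= 0

H = sum(xi * math.log2(1 / xi) for xi in x)
print(H, math.log2(3))                 # ~1.485 <= log2(3) ~ 1.585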
Entropy Examples
S = {s1}, p1 = 1:  H(S) = 0  (no information).
S = {s1, s2}, p1 = p2 = 1/2:  H2(S) = 1  (1 bit per symbol).
S = {s1, …, sr}, p1 = … = pr = 1/r:  Hr(S) = 1, but H2(S) = log2 r.
• Run-length coding (for instance, in binary predictive coding): p = 1 − q is the probability of a 0, and H2(S) = p·log2(1/p) + q·log2(1/q). As q → 0 the term q·log2(1/q) dominates (compare the slopes). Compare: the average run length is 1/q and the average number of bits needed to code a run is log2(1/q), so q·log2(1/q) is the average amount of information per bit of the original code.
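For instance, with q = 0.01 the numbers work out as follows (a sketch; independent bits assumed):

import math

q = 0.01          # probability of a 1 (a run of 0s ends)
p = 1 - q
H = p * math.log2(1 / p) + q * math.log2(1 / q)
print(H)                           # ~0.0808 bits per original bit
print(q * math.log2(1 / q))        # ~0.0664: the dominant term
print(1 / q, math.log2(1 / q))     # average run length 100, ~6.64 bits to describe a run
print(math.log2(1 / q) / (1 / q))  # ~0.0664 bits per original bit, matching the dominant term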
Entropy as a Lower Bound for Average Code Length
Given an instantaneous code with lengths li in radix r, let
K = Σ_{i=1..q} 1/r^li ≤ 1 ;   Qi = (1/r^li) / K ;   Σ_{i=1..q} Qi = 1.
So by Gibbs, Σ_{i=1..q} pi·logr(Qi/pi) ≤ 0; applying log(Qi/pi) = log(1/pi) − log(1/Qi),
Hr(S) = Σ_{i=1..q} pi·logr(1/pi) ≤ Σ_{i=1..q} pi·logr(1/Qi) = Σ_{i=1..q} pi·(logr K + li·logr r)
      = logr K + Σ_{i=1..q} pi·li.   Since K ≤ 1, logr K ≤ 0, and hence Hr(S) ≤ L.
By the McMillan inequality, this holds for all uniquely decodable codes. Equality occurs when K = 1 (the decoding tree is complete) and pi = r^(−li).
Shannon-Fano Coding
The simplest variable-length method. Less efficient than Huffman, but it allows one to code symbol si with length li directly from the probability pi:
li = ⌈logr(1/pi)⌉,   so   logr(1/pi) ≤ li < logr(1/pi) + 1   ⟹   1/pi ≤ r^li < r/pi   ⟹   pi ≥ 1/r^li > pi/r.
Summing this inequality over i:
1 = Σ_{i=1..q} pi ≥ Σ_{i=1..q} 1/r^li = K > Σ_{i=1..q} pi/r = 1/r.
The Kraft inequality is satisfied, therefore there is an instantaneous code with these lengths.
Also, Hr(S) = Σ_{i=1..q} pi·logr(1/pi) ≤ Σ_{i=1..q} pi·li = L < Hr(S) + 1,
obtained by multiplying logr(1/pi) ≤ li < logr(1/pi) + 1 by pi and summing over i.
Example: p’s: 1/4, 1/4, 1/8, 1/8, 1/8, 1/8;  l’s: 2, 2, 3, 3, 3, 3;  K = 1;  H2(S) = 2.5;  L = 5/2.
[Figure: the complete binary decoding tree for this code, with branches labeled 0 and 1.]
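A short Python sketch of the Shannon-Fano length assignment, run on the example above (binary radix; the helper names are mine):

import math

def shannon_fano_lengths(probs, r=2):
    # l_i = ceiling of log_r(1/p_i); the small epsilon guards against floating-point round-up
    return [math.ceil(math.log(1 / p, r) - 1e-12) for p in probs]

p = [1/4, 1/4, 1/8, 1/8, 1/8, 1/8]
l = shannon_fano_lengths(p)
K = sum(2 ** -li for li in l)
L = sum(pi * li for pi, li in zip(p, l))
H = sum(pi * math.log2(1 / pi) for pi in p)
print(l)      # [2, 2, 3, 3, 3, 3]
print(K)      # 1.0: Kraft is satisfied and the decoding tree is complete
print(H, L)   # H = 2.5 <= L = 2.5 < H + 1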
The Entropy of Code Extensions
Recall: the nth extension of a source S = {s1, …, sq} with probabilities p1, …, pq is the set of symbols T = S^n = {si1 ⋯ sin : sij ∈ S, 1 ≤ j ≤ n}, where concatenation of symbols corresponds to multiplication of probabilities: ti = si1 ⋯ sin has probability pi1 ⋯ pin = Qi, assuming independent probabilities. Let i = (i1−1, …, in−1)q + 1, an n-digit number base q.
The entropy is:
H(S^n) = H(T) = Σ_{i=1..q^n} Qi·log(1/Qi) = Σ_{i=1..q^n} Qi·log(1/(pi1 ⋯ pin))
       = Σ_{i=1..q^n} Qi·[log(1/pi1) + ⋯ + log(1/pin)]
       = Σ_{i=1..q^n} Qi·log(1/pi1) + ⋯ + Σ_{i=1..q^n} Qi·log(1/pin).
Consider the kth term:
Σ_{i=1..q^n} Qi·log(1/pik) = Σ_{i1=1..q} ⋯ Σ_{in=1..q} pi1 ⋯ pin·log(1/pik)
  = [Σ_{ik=1..q} pik·log(1/pik)] · [Σ_{i1=1..q} ⋯ Σ_{in=1..q} (ik omitted) pi1 ⋯ p̂ik ⋯ pin] = H(S)·1 = H(S),
since pi1 ⋯ p̂ik ⋯ pin (the factor pik omitted) is just a probability in the (n−1)st extension, and adding them all up gives 1. Hence
H(S^n) = n·H(S).
Hence the average S-F code length Ln for T satisfies:
H(T) ≤ Ln < H(T) + 1   ⟹   n·H(S) ≤ Ln < n·H(S) + 1   ⟹   H(S) ≤ Ln/n < H(S) + 1/n   [now let n go to infinity].
Extension Example
S = {s1, s2}, p1 = 2/3, p2 = 1/3.  H2(S) = (2/3)·log2(3/2) + (1/3)·log2(3/1) ≈ 0.9182958…
Huffman: s1 = 0, s2 = 1.  Avg. coded length = (2/3)·1 + (1/3)·1 = 1.
Shannon-Fano: l1 = 1, l2 = 2.  Avg. length = (2/3)·1 + (1/3)·2 = 4/3.
2nd extension: p11 = 4/9, p12 = p21 = 2/9, p22 = 1/9.  S-F lengths:
l11 = ⌈log2(9/4)⌉ = 2, l12 = l21 = ⌈log2(9/2)⌉ = 3, l22 = ⌈log2(9/1)⌉ = 4.
LSF^(2) = avg. coded length = (4/9)·2 + (2/9)·3·2 + (1/9)·4 = 24/9 = 2.666…
In general, S^n = (s1 + s2)^n, and the probabilities are the corresponding terms in (p1 + p2)^n: there are C(n,i) = (n choose i) symbols with probability (2/3)^i·(1/3)^(n−i) = 2^i/3^n, and the corresponding S-F length is ⌈log2(3^n/2^i)⌉ = ⌈n·log2 3 − i⌉ = ⌈n·log2 3⌉ − i.
Extension cont.
LSF^(n) = (1/3^n)·Σ_{i=0..n} C(n,i)·2^i·(⌈n·log2 3⌉ − i)
        = ⌈n·log2 3⌉·(1/3^n)·Σ_{i=0..n} C(n,i)·2^i − (1/3^n)·Σ_{i=0..n} C(n,i)·i·2^i
        = ⌈n·log2 3⌉ − (1/3^n)·2n·3^(n−1)        [using (2 + 1)^n = 3^n and (*)]
        = ⌈n·log2 3⌉ − 2n/3.
Hence LSF^(n)/n = ⌈n·log2 3⌉/n − 2/3 → log2 3 − 2/3 = (2/3)·log2(3/2) + (1/3)·log2 3 = H2(S) as n → ∞.
(*) Differentiate (2 + x)^n = Σ_{i=0..n} C(n,i)·2^i·x^(n−i) with respect to x:
n·(2 + x)^(n−1) = Σ_{i=0..n} C(n,i)·2^i·(n−i)·x^(n−i−1), so at x = 1, n·3^(n−1) = Σ_{i=0..n} C(n,i)·(n−i)·2^i.
Hence Σ_{i=0..n} i·C(n,i)·2^i = n·Σ_{i=0..n} C(n,i)·2^i − Σ_{i=0..n} C(n,i)·(n−i)·2^i = n·3^n − n·3^(n−1) = 2n·3^(n−1).
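The convergence can also be checked by brute force; a sketch that enumerates all 2^n words of the nth extension of the p1 = 2/3, p2 = 1/3 source above (keep n small, since the work grows exponentially in n):

import math
from itertools import product

p = [2/3, 1/3]
H = sum(pi * math.log2(1 / pi) for pi in p)      # ~0.9183

for n in (1, 2, 4, 8, 12):
    L_n = 0.0
    for word in product(range(len(p)), repeat=n):
        Q = math.prod(p[i] for i in word)        # probability of the word
        l = math.ceil(math.log2(1 / Q) - 1e-12)  # its Shannon-Fano length
        L_n += Q * l
    print(n, L_n / n)   # H <= L_n/n < H + 1/n, so this tends to H ~ 0.9183
print(H)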
Markov Process Entropy
p(si | si1 ⋯ sim) = the conditional probability that si follows si1 ⋯ sim.
For an mth-order process, think of letting the state be s = (si1, …, sim). Hence
I(si | s) = log(1/p(si | s)),  and so  H(S | s) = Σ_{si ∈ S} p(si | s)·I(si | s).
Now let p(s) = the probability of being in state s. Then
H(S) = Σ_{s ∈ S^m} p(s)·H(S | s) = Σ_{s ∈ S^m} Σ_{si ∈ S} p(s)·p(si | s)·I(si | s)
     = Σ_{(s,si) ∈ S^(m+1)} p(s, si)·I(si | s) = Σ_{(s,si) ∈ S^(m+1)} p(s, si)·log(1/p(si | s)).
Example (a second-order binary Markov process):
[Figure: state diagram on the previous-two-bit states 0,0; 0,1; 1,0; 1,1 with the transition probabilities tabulated below; e.g. from state 0,0 the next bit is 0 with probability .8 and 1 with probability .2.]
Equilibrium probabilities: p(0,0) = 5/14 = p(1,1),  p(0,1) = 2/14 = p(1,0).

si1 si2 si | p(si | si1, si2) | p(si1, si2) | p(si1, si2, si)
 0   0   0 |       0.8        |    5/14     |      4/14
 0   0   1 |       0.2        |    5/14     |      1/14
 0   1   0 |       0.5        |    2/14     |      1/14
 0   1   1 |       0.5        |    2/14     |      1/14
 1   0   0 |       0.5        |    2/14     |      1/14
 1   0   1 |       0.5        |    2/14     |      1/14
 1   1   0 |       0.2        |    5/14     |      1/14
 1   1   1 |       0.8        |    5/14     |      4/14

H2(S) = Σ_{(si1,si2,si) ∈ {0,1}^3} p(si1, si2, si)·log2(1/p(si | si1, si2))
      = 2·(4/14)·log2(1/0.8) + 2·(1/14)·log2(1/0.2) + 4·(1/14)·log2(1/0.5) ≈ 0.801377.
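The same value can be reproduced programmatically; a sketch that hard-codes the transition and equilibrium probabilities of this example (state = previous two bits):

import math

# p(next bit | state), state = (bit two back, previous bit)
p_next = {
    (0, 0): {0: 0.8, 1: 0.2},
    (0, 1): {0: 0.5, 1: 0.5},
    (1, 0): {0: 0.5, 1: 0.5},
    (1, 1): {0: 0.2, 1: 0.8},
}
# equilibrium state probabilities from the slide
p_state = {(0, 0): 5/14, (0, 1): 2/14, (1, 0): 2/14, (1, 1): 5/14}

H = sum(p_state[s] * c * math.log2(1 / c)
        for s, dist in p_next.items()
        for c in dist.values())
print(H)   # ~0.801377 bits per symbol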
The Fibonacci numbers
Let f0 = 1, f1 = 2, f2 = 3, f3 = 5, f4 = 8, …, be defined by fn+1 = fn + fn−1. Then lim_{n→∞} fn+1/fn = φ = (1 + √5)/2, the golden ratio, a root of the equation x² = x + 1. Use these as the weights for a system of number representation with digits 0 and 1, without adjacent 1’s (because (100)φ = (11)φ).
Base Fibonacci
Representation Theorem: every number from 0 to fn − 1 can be uniquely written as an n-bit number with no adjacent ones.
Existence. Basis: n = 0, 0 ≤ i ≤ 0: 0 is represented by the empty string ε.
Induction: Let 0 ≤ i < fn+1. If i < fn, we are done by the induction hypothesis. Otherwise fn ≤ i < fn+1 = fn−1 + fn, so 0 ≤ i − fn < fn−1, which is representable as i − fn = (bn−2 … b0)φ with bi ∈ {0, 1} and ¬(bi = bi+1 = 1). Hence i = (1 0 bn−2 … b0)φ, which also has no adjacent ones.
Uniqueness: Let i be the smallest number ≥ 0 with two distinct representations (no leading zeros): i = (bn−1 … b0)φ = (b′n−1 … b′0)φ. By minimality of i, bn−1 ≠ b′n−1, so without loss of generality let bn−1 = 1 and b′n−1 = 0. But then i = (b′n−2 … b′0)φ ≥ fn−1, which can’t be true, since an (n−1)-bit representation with no adjacent ones is less than fn−1.
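A Python sketch of the greedy encoding implied by the existence argument (f0 = 1, f1 = 2, f2 = 3, … as above; the function names are mine):

def fib_weights(n):
    # the n weights f_0 = 1, f_1 = 2, ..., f_{n-1} of an n-bit base-Fibonacci number
    f = [1, 2]
    while len(f) < n:
        f.append(f[-1] + f[-2])
    return f[:n]

def to_fibonacci(i, n):
    # greedy: take the largest remaining weight <= i, exactly as in the existence proof
    bits = []
    for w in reversed(fib_weights(n)):
        if w <= i:
            bits.append(1)
            i -= w
        else:
            bits.append(0)
    return bits

# every number 0 .. f_n - 1 gets a distinct n-bit string with no adjacent ones
n = 8
f_n = fib_weights(n + 1)[-1]                 # f_8 = 55
reps = [tuple(to_fibonacci(i, n)) for i in range(f_n)]
assert len(set(reps)) == len(reps)           # all distinct
assert all(not (b[k] == b[k + 1] == 1) for b in reps for k in range(n - 1))
print(to_fibonacci(19, n))                   # 19 = 13 + 5 + 1 -> [0, 0, 1, 0, 1, 0, 0, 1]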
Base Fibonacci
The golden ratio φ = (1 + √5)/2 is a solution to x² − x − 1 = 0 and is equal to the limit of the ratio of adjacent Fibonacci numbers.
[Figure: a source with r equally likely symbols 0, 1, …, r−1 (probability 1/r each) attains the maximum H2 = log2 r.]
1st-order Markov process for base-Fibonacci strings:
[Figure: from state 0, emit 0 with probability 1/φ or 1 with probability 1/φ²; from state 1, emit 0 with probability 1.]
Think of the source as emitting the variable-length symbols 0 and 10, with probabilities 1/φ and 1/φ² (note 1/φ + 1/φ² = 1).
Entropy = (1/φ)·log2 φ + ½·(1/φ²)·log2 φ² = log2 φ ≈ 0.694, which is maximal; the factor ½ takes into account that 10 is a variable-length (two-digit) symbol.
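A numerical check that log2 φ is indeed the ceiling here: the number of n-bit strings with no adjacent ones is fn (with f0 = 1, f1 = 2 as above), so no source obeying the constraint can deliver more than (log2 fn)/n bits per digit, and that bound tends to log2 φ. A small sketch:

import math

phi = (1 + math.sqrt(5)) / 2

def count_no_adjacent_ones(n):
    # number of n-bit strings with no two adjacent 1s (= f_n with f_0 = 1, f_1 = 2)
    a, b = 1, 2
    for _ in range(n):
        a, b = b, a + b
    return a

for n in (10, 100, 1000):
    print(n, math.log2(count_no_adjacent_ones(n)) / n)   # approaches log2(phi)
print(math.log2(phi))                                    # ~0.6942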