Information Theory Lectures

Information Theory
Mike Brookes
E4.40, ISE4.51, SO20
Jan 2008
2
Lectures
Entropy Properties
1 Entropy - 6
2 Mutual Information – 19
Lossless Coding
3 Symbol Codes -30
4 Optimal Codes - 41
5 Stochastic Processes - 55
6 Stream Codes – 68
Channel Capacity
7 Markov Chains - 83
8 Typical Sets - 93
9 Channel Capacity - 105
10 Joint Typicality - 118
11 Coding Theorem - 128
12 Separation Theorem – 135
Continuous Variables
13 Differential Entropy - 145
14 Gaussian Channel - 159
15 Parallel Channels – 172
Lossy Coding
16 Lossy Coding - 184
17 Rate Distortion Bound – 198
Revision
18 Revision - 212
Jan 2008
3
Claude Shannon
• "The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point." (Claude Shannon, 1948)
• Channel Coding Theorem: it is possible to achieve near-perfect communication of information over a noisy channel
• In this course we will:
– Define what we mean by information
– Show how we can compress the information in a source to its theoretical minimum value, and show the tradeoff between data compression and distortion
– Prove the Channel Coding Theorem and derive the information capacity of different channels
1916 - 2001
Jan 2008
4
Textbooks
Book of the course:
• Elements of Information Theory by T M Cover & J A
Thomas, Wiley 2006, 978-0471241959 £30 (Amazon)
Alternative book – a denser but entertaining read that
covers most of the course + much else:
• Information Theory, Inference, and Learning Algorithms,
D MacKay, CUP, 0521642981 £28 or free at
http://www.inference.phy.cam.ac.uk/mackay/itila/
Assessment: Exam only – no coursework.
Acknowledgement: Many of the examples and proofs in these notes are taken from the course textbook “Elements of
Information Theory” by T M Cover & J A Thomas and/or the lecture notes by Dr L Zheng based on the book.
Jan 2008
5
Notation
• Vectors and Matrices
– v=vector, V=matrix, ~=elementwise product
• Scalar Random Variables
– x = R.V, x = specific value, X = alphabet
• Random Column Vector of length N
– x= R.V, x = specific value, XN = alphabet
– xi and xi are particular vector elements
• Ranges
– a:b denotes the range a, a+1, …, b
Jan 2008
6
Discrete Random Variables
• A random variable x takes a value x from
the alphabet X with probability px(x). The
vector of probabilities is px .
Examples:
X = [1;2;3;4;5;6], px = [1/6; 1/6; 1/6; 1/6; 1/6; 1/6]
pX is a “probability mass vector”
“english text”
X = [a; b;…, y; z; <space>]
px = [0.058; 0.013; …; 0.016; 0.0007; 0.193]
Note: we normally drop the subscript from px if unambiguous
Jan 2008
7
Expected Values
• If g(x) is real-valued and defined on X then
    E_x g(x) = Σ_{x∈X} p(x) g(x)        (we often write E for E_x)
Examples:
X = [1;2;3;4;5;6], px = [1/6; 1/6; 1/6; 1/6; 1/6; 1/6]
E x = 3.5 = μ
E x² = 15.17 = σ² + μ²
E sin(0.1x) = 0.338
E −log₂(p(x)) = 2.58        This is the "entropy" of x
Jan 2008
8
Shannon Information Content
• The Shannon Information Content of an
outcome with probability p is –log2p
• Example 1: Coin tossing
– X = [Heads; Tails], p = [½; ½], SIC = [1; 1] bits
• Example 2: Is it my birthday ?
– X = [No; Yes], p = [364/365; 1/365],
SIC = [0.004; 8.512] bits
Unlikely outcomes give more information
Jan 2008
10
Minesweeper
• Where is the bomb ?
• 16 possibilities – needs 4 bits to specify
Guess 1: answer "No", Prob = 15/16, SIC = 0.093 bits
Jan 2008
11
Entropy
The entropy  H(x) = E −log₂(p_x(x)) = −p_xᵀ log₂ p_x
– H(x) = the average Shannon Information Content of x
– H(x) = the average information gained by knowing its value
– the average number of "yes-no" questions needed to find x is in the range [H(x), H(x)+1)
We use log(x) ≡ log₂(x) and measure H(x) in bits
– if you use logₑ it is measured in nats
– 1 nat = log₂(e) bits = 1.44 bits
• log₂(x) = ln(x)/ln(2),   d(log₂ x)/dx = (log₂ e)/x
H(x) depends only on the probability vector p_x, not on the alphabet X, so we can write H(p_x)
Jan 2008
12
Entropy Examples
(1) Bernoulli Random Variable
X = [0;1], px = [1–p; p]
H(x) = −(1−p)log(1−p) − p log p
Very common – we write H(p) to mean H([1–p; p]).
[Plot: H(p) versus p, rising from 0 at p=0 to 1 bit at p=½ and back to 0 at p=1]
(2) Four Coloured Shapes
X = [●; ■; ◆; ▼], px = [½; ¼; ⅛; ⅛]
H(x) = H(px) = Σ −log(p(x)) p(x)
     = 1×½ + 2×¼ + 3×⅛ + 3×⅛ = 1.75 bits
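A minimal numerical check of the entropy definition above (not part of the original notes; a small Python/numpy sketch that reproduces the 1.75-bit result for the four-shapes example):

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a probability mass vector p (convention: 0 log 0 = 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75 bits (four coloured shapes)
print(entropy([1/6] * 6))                   # 2.585 bits (fair die)
```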
Jan 2008
13
Bernoulli Entropy Properties
X = [0;1], px = [1–p; p]
H(p) = −(1−p)log(1−p) − p log p
H′(p) = log(1−p) − log p
H″(p) = −p⁻¹(1−p)⁻¹ log e
Bounds on H(p):
Quadratic bounds
  H(p) ≤ 1 − 2 log(e)·(p−½)² = 1 − 2.89(p−½)²
  H(p) ≥ 1 − 4(p−½)² ≥ 2 min(p, 1−p)
[Plot: H(p), 1−2.89(p−0.5)², 1−4(p−0.5)² and 2 min(p,1−p) versus p]
Proofs in problem sheet
Jan 2008
14
Joint and Conditional Entropy
Joint Entropy: H(x,y)
  p(x,y):        y=0    y=1
        x=0       ½      ¼
        x=1       0      ¼
H(x,y) = E −log p(x,y)
       = −½ log ½ − ¼ log ¼ − 0 log 0 − ¼ log ¼ = 1.5 bits
Note: 0 log 0 = 0
Conditional Entropy: H(y|x)
  p(y|x):        y=0    y=1
        x=0      2/3    1/3
        x=1       0      1
Note: rows sum to 1
H(y|x) = E −log p(y|x)
       = −½ log ⅔ − ¼ log ⅓ − 0 log 0 − ¼ log 1 = 0.689 bits
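A quick check of the joint and conditional entropies above (a Python sketch, not from the notes; it uses the chain rule H(y|x) = H(x,y) − H(x)):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]                      # 0 log 0 = 0
    return float(-np.sum(p * np.log2(p)))

# joint distribution from the slide: rows are x = 0,1; columns are y = 0,1
p_xy = np.array([[0.5, 0.25],
                 [0.0, 0.25]])

H_xy = entropy(p_xy)                  # joint entropy H(x,y)
H_x  = entropy(p_xy.sum(axis=1))      # marginal entropy H(x)
print(H_xy, H_x, H_xy - H_x)          # 1.5, 0.811, 0.689 bits
```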
Jan 2008
15
Conditional Entropy – view 1
Additional Entropy:
  p(x,y):        y=0    y=1    p(x)
        x=0       ½      ¼      ¾
        x=1       0      ¼      ¼
p(y|x) = p(x,y) ÷ p(x)
H(y|x) = E −log p(y|x)
       = E{−log p(x,y)} − E{−log p(x)}
       = H(x,y) − H(x) = H(½,¼,0,¼) − H(¼) = 0.689 bits
H(y|x) is the average additional information in y when you know x
[Diagram: H(x,y) split into H(x) and H(y|x)]
Jan 2008
16
Conditional Entropy – view 2
Average Row Entropy:
  p(x,y):        y=0    y=1    H(y|x=x)   p(x)
        x=0       ½      ¼     H(1/3)      ¾
        x=1       0      ¼     H(1)        ¼
H(y|x) = E −log p(y|x) = Σ_{x,y} −p(x,y) log p(y|x)
       = Σ_{x,y} −p(x) p(y|x) log p(y|x) = Σ_{x∈X} p(x) Σ_{y∈Y} −p(y|x) log p(y|x)
       = Σ_{x∈X} p(x) H(y|x=x) = ¾ × H(⅓) + ¼ × H(0) = 0.689 bits
Take a weighted average of the entropy of each row using p(x) as weight
Jan 2008
17
Chain Rules
• Probabilities
    p(x,y,z) = p(z|x,y) p(y|x) p(x)
• Entropy
    H(x,y,z) = H(z|x,y) + H(y|x) + H(x)
    H(x_1:n) = Σ_{i=1}^{n} H(x_i | x_1:i−1)
The log in the definition of entropy converts products of probability into sums of entropy
Jan 2008
18
Summary
• Entropy: H(x) = Σ_{x∈X} −log₂(p(x)) p(x) = E −log₂(p_x(x))
  – Bounded: 0 ≤ H(x) ≤ log|X| ♦
• Chain Rule: H(x,y) = H(y|x) + H(x)
• Conditional Entropy: H(y|x) = H(x,y) − H(x) = Σ_{x∈X} p(x) H(y|x=x)
  – Conditioning reduces entropy: H(y|x) ≤ H(y) ♦
[Venn diagram: H(x,y) comprising H(x|y), I(x;y) and H(y|x) within H(x) and H(y)]
♦ = inequalities not yet proved
Jan 2008
19
Lecture 2
• Mutual Information
– If x and y are correlated, their mutual information is the average
information that y gives about x
• E.g. Communication Channel: x transmitted but y received
• Jensen’s Inequality
• Relative Entropy
– Is a measure of how different two probability mass vectors are
• Information Inequality and its consequences
– Relative Entropy is always positive
• Mutual information is positive
• Uniform bound
• Conditioning and Correlation reduce entropy
Jan 2008
20
Mutual Information
The mutual information is the average amount of information that you get about x from observing the value of y:
    I(x;y) = H(x) − H(x|y) = H(x) + H(y) − H(x,y)
(H(x) = information in x;  H(x|y) = information in x when you already know y)
Mutual information is symmetrical: I(x;y) = I(y;x)
[Venn diagram: H(x,y) comprising H(x|y), I(x;y) and H(y|x)]
Use ";" to avoid ambiguities between I(x;y,z) and I(x,y;z)
Jan 2008
21
Mutual Information Example
  p(x,y):        y=0    y=1
        x=0       ½      ¼
        x=1       0      ¼
• If you try to guess y you have a 50% chance of being correct.
• However, what if you know x ?
  – Best guess: choose y = x
  – If x=0 (p=0.75) then 66% correct prob
  – If x=1 (p=0.25) then 100% correct prob
  – Overall 75% correct probability
I(x;y) = H(x) − H(x|y) = H(x) + H(y) − H(x,y)
H(x) = 0.811, H(y) = 1, H(x,y) = 1.5  ⇒  I(x;y) = 0.311
[Venn diagram: H(x,y)=1.5 comprising H(x|y)=0.5, I(x;y)=0.311, H(y|x)=0.689; H(x)=0.811, H(y)=1]
Jan 2008
22
Conditional Mutual Information
  I(x;y|z) = H(x|z) − H(x|y,z)
           = H(x|z) + H(y|z) − H(x,y|z)
Note: the z conditioning applies to both x and y
Chain Rule for Mutual Information
  I(x₁,x₂,x₃;y) = I(x₁;y) + I(x₂;y|x₁) + I(x₃;y|x₁,x₂)
  I(x_1:n;y) = Σ_{i=1}^{n} I(x_i;y|x_1:i−1)
Jan 2008
23
Review/Preview
• Entropy: H(x) = Σ_{x∈X} −log₂(p(x)) p(x) = E −log₂(p_x(x))
  – Always positive: H(x) ≥ 0
• Chain Rule: H(x,y) = H(x) + H(y|x) ≤ H(x) + H(y) ♦
  – Conditioning reduces entropy: H(y|x) ≤ H(y) ♦
• Mutual Information: I(y;x) = H(y) − H(y|x) = H(x) + H(y) − H(x,y)
  – Positive and symmetrical: I(x;y) = I(y;x) ≥ 0 ♦
  – x and y independent ⇔ H(x,y) = H(y) + H(x) ⇔ I(x;y) = 0 ♦
[Venn diagram: H(x|y), I(x;y), H(y|x) within H(x) and H(y)]
♦ = inequalities not yet proved
Jan 2008
24
Convex & Concave functions
f(x) is strictly convex over (a,b) if
  f(λu + (1−λ)v) < λf(u) + (1−λ)f(v)   ∀ u ≠ v ∈ (a,b), 0 < λ < 1
– every chord of f(x) lies above f(x)
– f(x) is concave ⇔ –f(x) is convex (concave curves the other way)
• Examples
  – Strictly Convex: x², x⁴, eˣ, x log x  [x ≥ 0]
  – Strictly Concave: log x, √x  [x ≥ 0]
  – Convex and Concave: x
– Test: d²f/dx² > 0 ∀x ∈ (a,b) ⇒ f(x) is strictly convex
"convex" (not strictly) uses "≤" in the definition and "≥" in the test
Jan 2008
25
Jensen’s Inequality
Jensen’s Inequality: (a) f(x) convex ⇒ E f(x) ≥ f(E x)
                     (b) f(x) strictly convex ⇒ E f(x) > f(E x) unless x constant
Proof by induction on |X|:
– |X|=1:  E f(x) = f(E x) = f(x₁)
– |X|=k:  E f(x) = Σ_{i=1}^{k} p_i f(x_i) = p_k f(x_k) + (1−p_k) Σ_{i=1}^{k−1} (p_i/(1−p_k)) f(x_i)
          ≥ p_k f(x_k) + (1−p_k) f( Σ_{i=1}^{k−1} (p_i/(1−p_k)) x_i )    [assume JI true for |X|=k−1; the p_i/(1−p_k) sum to 1]
          ≥ f( p_k x_k + (1−p_k) Σ_{i=1}^{k−1} (p_i/(1−p_k)) x_i ) = f(E x)    [convexity of f]
Can replace ≥ by > if f(x) is strictly convex, unless p_k ∈ {0,1} or x_k = E(x | x∈{x_1:k−1})
Jan 2008
26
Jensen’s Inequality Example
Mnemonic example:
  f(x) = x² : strictly convex
  X = [–1; +1], p = [½; ½]
  E x = 0,  f(E x) = 0
  E f(x) = 1 > f(E x)
[Plot: f(x) = x² with the chord between (–1,1) and (+1,1) lying above the curve]
Jan 2008
27
Relative Entropy
Relative Entropy or Kullback-Leibler Divergence between two probability mass vectors p and q:
  D(p||q) = Σ_{x∈X} p(x) log (p(x)/q(x)) = E_p log (p(x)/q(x)) = E_p(−log q(x)) − H(x)
where E_p denotes an expectation performed using probabilities p
D(p||q) measures the "distance" between the probability mass functions p and q.
We must have p_i = 0 whenever q_i = 0, else D(p||q) = ∞
Beware: D(p||q) is not a true distance because:
– (1) it is asymmetric between p and q, and
– (2) it does not satisfy the triangle inequality.
Jan 2008
28
Relative Entropy Example
X = [1 2 3 4 5 6]ᵀ
p = [1/6  1/6  1/6  1/6  1/6  1/6]    ⇒ H(p) = 2.585
q = [1/10 1/10 1/10 1/10 1/10 1/2]    ⇒ H(q) = 2.161
D(p||q) = E_p(−log q(x)) − H(p) = 2.935 − 2.585 = 0.35
D(q||p) = E_q(−log p(x)) − H(q) = 2.585 − 2.161 = 0.424
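A numerical check of the two divergences above (a small Python sketch, not from the notes), illustrating the asymmetry D(p||q) ≠ D(q||p):

```python
import numpy as np

def kl(p, q):
    """Relative entropy D(p||q) in bits (requires q > 0 wherever p > 0)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log2(p[m] / q[m])))

p = np.full(6, 1/6)                         # fair die
q = np.array([1/10] * 5 + [1/2])            # biased die

print(kl(p, q))   # ≈ 0.35 bits
print(kl(q, p))   # ≈ 0.42 bits  (asymmetric)
```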
Jan 2008
29
Information Inequality
Information (Gibbs') Inequality: D(p||q) ≥ 0
• Define A = {x : p(x) > 0} ⊆ X
• Proof:
  −D(p||q) = −Σ_{x∈A} p(x) log (p(x)/q(x)) = Σ_{x∈A} p(x) log (q(x)/p(x))
           ≤ log( Σ_{x∈A} p(x) q(x)/p(x) )            [Jensen: log is concave]
           = log( Σ_{x∈A} q(x) ) ≤ log( Σ_{x∈X} q(x) ) = log 1 = 0
If D(p||q) = 0: since log( ) is strictly concave we have equality in the proof only if q(x)/p(x), the argument of the log, equals a constant.
But Σ_{x∈X} p(x) = Σ_{x∈X} q(x) = 1, so the constant must be 1 and p ≡ q
Jan 2008
30
Information Inequality Corollaries
• Uniform distribution has highest entropy
  – Set q = [|X|⁻¹, …, |X|⁻¹]ᵀ giving H(q) = log|X| bits
    D(p||q) = E_p{−log q(x)} − H(p) = log|X| − H(p) ≥ 0
• Mutual Information is non-negative
    I(y;x) = H(x) + H(y) − H(x,y) = E log [ p(x,y) / (p(x)p(y)) ] = D(p_x,y || p_x ⊗ p_y) ≥ 0
  with equality only if p(x,y) ≡ p(x)p(y) ⇔ x and y are independent.
Jan 2008
31
More Corollaries
• Conditioning reduces entropy
    0 ≤ I(x;y) = H(y) − H(y|x)  ⇒  H(y|x) ≤ H(y)
  with equality only if x and y are independent.
• Independence Bound
    H(x_1:n) = Σ_{i=1}^{n} H(x_i | x_1:i−1) ≤ Σ_{i=1}^{n} H(x_i)
  with equality only if all x_i are independent.
  E.g.: if all x_i are identical, H(x_1:n) = H(x₁)
Jan 2008
32
Conditional Independence Bound
• Conditional Independence Bound
    H(x_1:n | y_1:n) = Σ_{i=1}^{n} H(x_i | x_1:i−1, y_1:n) ≤ Σ_{i=1}^{n} H(x_i | y_i)
• Mutual Information Independence Bound
  If all x_i are independent or, by symmetry, if all y_i are independent:
    I(x_1:n; y_1:n) = H(x_1:n) − H(x_1:n | y_1:n)
                    ≥ Σ_{i=1}^{n} H(x_i) − Σ_{i=1}^{n} H(x_i | y_i) = Σ_{i=1}^{n} I(x_i; y_i)
E.g.: if n=2 with x_i i.i.d. Bernoulli (p=0.5) and y₁=x₂ and y₂=x₁, then I(x_i;y_i)=0 but I(x_1:2; y_1:2) = 2 bits.
Jan 2008
33
Summary
• Mutual Information: I(x;y) = H(x) − H(x|y) ≤ H(x)
• Jensen's Inequality: f(x) convex ⇒ E f(x) ≥ f(E x)
• Relative Entropy: D(p||q) = E_p log (p(x)/q(x)) ≥ 0
  – D(p||q) = 0 iff p ≡ q
• Corollaries
  – Uniform Bound: uniform p maximizes H(p)
  – I(x;y) ≥ 0 ⇒ Conditioning reduces entropy
  – Independence bounds:
      H(x_1:n) ≤ Σ_{i=1}^{n} H(x_i)
      H(x_1:n | y_1:n) ≤ Σ_{i=1}^{n} H(x_i | y_i)
      I(x_1:n; y_1:n) ≥ Σ_{i=1}^{n} I(x_i; y_i)   if the x_i or the y_i are indep
Jan 2008
34
Lecture 3
• Symbol codes
– uniquely decodable
– prefix
• Kraft Inequality
• Minimum code length
• Fano Code
Jan 2008
35
Symbol Codes
• Symbol Code: C is a mapping X→D+
– D+ = set of all finite length strings from D
– e.g. {E, F, G} →{0,1}+ : C(E)=0, C(F)=10, C(G)=11
• Extension: C+ is mapping X+ →D+ formed by
concatenating C(xi) without punctuation
• e.g. C+(EFEEGE) =01000110
• Non-singular: x1≠ x2 ⇒ C(x1) ≠ C(x2)
• Uniquely Decodable: C+ is non-singular
– that is C+(x+) is unambiguous
Jan 2008
36
Prefix Codes
• Instantaneous or Prefix Code:
No codeword is a prefix of another
• Prefix ⇒ Uniquely Decodable ⇒ Non-singular
Examples:
– C(E,F,G,H) = (0, 1, 00, 11)     neither prefix nor uniquely decodable ("00" = EE or G)
– C(E,F) = (0, 101)               prefix (hence uniquely decodable)
– C(E,F) = (1, 101)               uniquely decodable but not prefix
– C(E,F,G,H) = (00, 01, 10, 11)   prefix
– C(E,F,G,H) = (0, 01, 011, 111)  uniquely decodable but not prefix
Jan 2008
37
Code Tree
Prefix code: C(E,F,G,H) = (00, 11, 100, 101)
Form a D-ary tree where D = |D|
– D branches at each node
– Each node along the path to a leaf is a prefix of the leaf ⇒ it can't be a leaf itself
– Some leaves may be unused
  all used ⇒ |X| − 1 is a multiple of D − 1
[Binary code tree with leaves E=00, F=11, G=100, H=101]
Decoding example: 111011000000 → FHGEE
Jan 2008
38
Kraft Inequality (binary prefix)
• Label each node at depth l with 2^−l
• Each node equals the sum of all its leaves
• Codeword lengths l₁, l₂, …, l_|X|  ⇒  Σ_{i=1}^{|X|} 2^−l_i ≤ 1
• Equality iff all leaves are utilised
• Total code budget = 1
  – Code 00 uses up ¼ of the budget
  – Code 100 uses up ⅛ of the budget
[Binary tree for the code (00, 11, 100, 101), with node weights ½, ¼, ⅛]
Same argument works with a D-ary tree
Jan 2008
40
Kraft Inequality
If a uniquely decodable C has codeword lengths l₁, l₂, …, l_|X|, then Σ_{i=1}^{|X|} D^−l_i ≤ 1
Proof: Let S = Σ_{i=1}^{|X|} D^−l_i and M = max l_i. Then for any N,
  S^N = ( Σ_{i=1}^{|X|} D^−l_i )^N = Σ_{i₁=1}^{|X|} Σ_{i₂=1}^{|X|} … Σ_{i_N=1}^{|X|} D^−(l_{i₁} + l_{i₂} + … + l_{i_N})
      = Σ_{x∈X^N} D^−length{C⁺(x)}
      = Σ_{l=1}^{NM} D^−l · |{x : l = length{C⁺(x)}}|
      ≤ Σ_{l=1}^{NM} D^−l D^l = Σ_{l=1}^{NM} 1 = NM
(the last inequality uses unique decodability: at most D^l distinct sequences x can map onto length-l strings)
If S > 1 then S^N > NM for some N. Hence S ≤ 1.
Jan 2008
41
Converse to Kraft Inequality
If Σ_{i=1}^{|X|} D^−l_i ≤ 1 then ∃ a prefix code with codeword lengths l₁, l₂, …, l_|X|
Proof:
– Assume l_i ≤ l_{i+1} and think of codewords as base-D decimals 0.d₁d₂…d_{l_i}
– Let codeword c_k = Σ_{i=1}^{k−1} D^−l_i, written to l_k digits
– For any j < k we have c_k = c_j + Σ_{i=j}^{k−1} D^−l_i ≥ c_j + D^−l_j
– So c_j cannot be a prefix of c_k because they differ within the first l_j digits.
⇒ non-prefix symbol codes are a waste of time
Jan 2008
42
Kraft Converse Example
Suppose l = [2; 2; 3; 3; 3]  ⇒  Σ_{i=1}^{5} 2^−l_i = 0.875 ≤ 1
c_k = Σ_{i=1}^{k−1} D^−l_i :
    l_k    c_k                 Code
    2      0.0   = 0.00₂       00
    2      0.25  = 0.01₂       01
    3      0.5   = 0.100₂      100
    3      0.625 = 0.101₂      101
    3      0.75  = 0.110₂      110
Each c_k is obtained by adding 1 to the LSB of the previous row
For the code, express c_k in binary and take the first l_k binary places
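The Kraft-converse construction above is easy to mechanise. A small Python sketch (mine, not from the notes) that reproduces the table for l = [2,2,3,3,3]:

```python
def prefix_code_from_lengths(lengths):
    """Build a binary prefix code from lengths satisfying the Kraft inequality,
    using c_k = sum of 2^-l_i over i<k written to l_k binary places."""
    assert sum(2.0 ** -l for l in lengths) <= 1 + 1e-12, "lengths violate Kraft"
    codes, c = [], 0.0
    for l in sorted(lengths):
        codes.append(format(int(round(c * 2 ** l)), f'0{l}b'))  # first l binary places of c
        c += 2.0 ** -l
    return codes

print(prefix_code_from_lengths([2, 2, 3, 3, 3]))   # ['00', '01', '100', '101', '110']
```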
Jan 2008
44
Minimum Code Length
If l(x) = length(C(x)) then C is optimal if L_C = E l(x) is as small as possible.
Uniquely decodable code ⇒ L_C ≥ H(x)/log₂D
Proof: Define q by q(x) = c⁻¹ D^−l(x) where c = Σ_x D^−l(x) ≤ 1
  L_C − H(x)/log₂D = E l(x) + E log_D p(x)
    = E( −log_D D^−l(x) + log_D p(x) ) = E( −log_D c q(x) + log_D p(x) )
    = E log_D (p(x)/q(x)) − log_D c = (log_D 2)(D(p||q) − log c) ≥ 0
with equality only if c = 1 and D(p||q) = 0 ⇒ p = q ⇒ l(x) = −log_D p(x)
Jan 2008
45
Fano Code
Fano Code (also called Shannon-Fano code)
1. Put probabilities in decreasing order
2. Split as close to 50-50 as possible; repeat with each half
    symbol  prob   code
    a       0.20   00
    b       0.19   010
    c       0.17   011
    d       0.15   100
    e       0.14   101
    f       0.06   110
    g       0.05   1110
    h       0.04   1111
H(p) = 2.81 bits,  L_F = 2.89 bits
Always H(p) ≤ L_F ≤ H(p) + 1 − 2 min(p) ≤ H(p) + 1
Not necessarily optimal: the best code for this p actually has L = 2.85 bits
Jan 2008
46
Summary
• Kraft Inequality for D-ary codes:
  – any uniquely decodable C has Σ_{i=1}^{|X|} D^−l_i ≤ 1
  – if Σ_{i=1}^{|X|} D^−l_i ≤ 1 then you can create a prefix code
• Uniquely decodable ⇒ L_C ≥ H(x)/log₂D
• Fano code
  – Order the probabilities, then repeatedly split in half to form a tree.
  – Intuitively natural but not optimal
Jan 2008
47
Lecture 4
• Optimal Symbol Code
– Optimality implications
– Huffman Code
• Optimal Symbol Code lengths
– Entropy Bound
• Shannon Code
Jan 2008
48
Huffman Code
An Optimal Binary prefix code must satisfy:
1. p ( xi ) > p ( x j ) ⇒ li ≤ l j
(else swap codewords)
2. The two longest codewords have the same length
(else chop a bit off the longer codeword)
3. ∃ two longest codewords differing only in the last bit
(else chop a bit off all of them)
Huffman Code construction
1. Take the two smallest p(xi) and assign each a
different last bit. Then merge into a single symbol.
2. Repeat step 1 until only one symbol remains
Jan 2008
49
Huffman Code Example
X = [a, b, c, d, e], px = [0.25 0.25 0.2 0.15 0.15]
Merging steps (always combine the two smallest probabilities):
  d (0.15) + e (0.15) → 0.3
  c (0.2)  + b (0.25) → 0.45
  a (0.25) + 0.3      → 0.55
  0.55 + 0.45         → 1.0
Read the merge diagram backwards for the codewords:
C(X) = [00 10 11 010 011], L_C = 2.3, H(x) = 2.286
For a D-ary code, first add extra zero-probability symbols until |X|−1 is a multiple of D−1 and then group D symbols at a time
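The Huffman construction (repeatedly merge the two least probable nodes) is short to code. A Python sketch, mine rather than the lecturer's: tie-breaking is arbitrary, so the individual codewords may differ from the slide, but the average length of 2.3 bits is reproduced.

```python
import heapq

def huffman(probs):
    """Binary Huffman code: returns one codeword string per input probability."""
    # each heap entry: (probability, unique id, list of (symbol index, codeword so far))
    heap = [(p, i, [(i, '')]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    next_id = len(probs)
    while len(heap) > 1:
        p0, _, s0 = heapq.heappop(heap)          # two least probable nodes
        p1, _, s1 = heapq.heappop(heap)
        merged = [(sym, '0' + c) for sym, c in s0] + [(sym, '1' + c) for sym, c in s1]
        heapq.heappush(heap, (p0 + p1, next_id, merged))
        next_id += 1
    code = dict(heap[0][2])
    return [code[i] for i in range(len(probs))]

p = [0.25, 0.25, 0.2, 0.15, 0.15]
code = huffman(p)
print(code, sum(pi * len(c) for pi, c in zip(p, code)))   # average length 2.3 bits
```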
Jan 2008
51
Huffman Code is Optimal Prefix Code
Huffman traceback gives codes for progressively larger alphabets:
  p2 = [0.55 0.45],                 c2 = [0 1],                 L2 = 1
  p3 = [0.25 0.45 0.3],             c3 = [00 1 01],             L3 = 1.55
  p4 = [0.25 0.25 0.2 0.3],         c4 = [00 10 11 01],         L4 = 2
  p5 = [0.25 0.25 0.2 0.15 0.15],   c5 = [00 10 11 010 011],    L5 = 2.3
We want to show that all these codes are optimal, including C5
Jan 2008
52
Huffman Optimality Proof
Suppose one of these codes is sub-optimal:
– ∃ m > 2 with c_m the first sub-optimal code (note c₂ is definitely optimal)
– An optimal c′_m must have L_C′m < L_Cm
– Rearrange the symbols with the longest codes in c′_m so that the two lowest probabilities p_i and p_j differ only in the last digit (doesn't change optimality)
– Merge x_i and x_j to create a new code c′_{m−1} as in the Huffman procedure
– L_C′m−1 = L_C′m − p_i − p_j since it is identical except 1 bit shorter with prob p_i + p_j
– But also L_Cm−1 = L_Cm − p_i − p_j, hence L_C′m−1 < L_Cm−1, which contradicts the assumption that c_m is the first sub-optimal code
Note: Huffman is just one out of many possible optimal codes
Jan 2008
53
How short are Optimal Codes?
Huffman is optimal but it is hard to estimate its length.
If l(x) = length(C(x)) then C is optimal if L_C = E l(x) is as small as possible.
We want to minimize Σ_{x∈X} p(x) l(x) subject to
  1. Σ_{x∈X} D^−l(x) ≤ 1
  2. all the l(x) are integers
Simplified version:
Ignore condition 2 and assume condition 1 is satisfied with equality.
This is less restrictive, so lengths may be shorter than actually possible ⇒ lower bound
Jan 2008
54
Optimal Codes (non-integer li)
• Minimize Σ_{i=1}^{|X|} p(x_i) l_i subject to Σ_{i=1}^{|X|} D^−l_i = 1
Use a Lagrange multiplier:
  Define J = Σ_{i=1}^{|X|} p(x_i) l_i + λ Σ_{i=1}^{|X|} D^−l_i and set ∂J/∂l_i = 0
  ∂J/∂l_i = p(x_i) − λ ln(D) D^−l_i = 0  ⇒  D^−l_i = p(x_i) / (λ ln D)
  also Σ_{i=1}^{|X|} D^−l_i = 1  ⇒  λ = 1/ln(D)  ⇒  l_i = −log_D(p(x_i))
With these l_i,
  E l(x) = E −log_D(p(x)) = E −log₂(p(x)) / log₂D = H(x) / log₂D
No uniquely decodable code can do better than this (Kraft inequality)
Jan 2008
55
Shannon Code
Round up the optimal code lengths: l_i = ⌈−log_D p(x_i)⌉
• The l_i are bound to satisfy the Kraft Inequality (since the optimum lengths do)
• Hence a prefix code exists: put the l_i into ascending order and set
    c_k = Σ_{i=1}^{k−1} D^−l_i   to l_k places,
  or, equally good,  c_k = Σ_{i=1}^{k−1} p(x_i)   (since p(x_i) ≥ D^−l_i)
• Average length:  H(x)/log₂D ≤ L_C < H(x)/log₂D + 1   (since we added <1 to the optimum values)
Note: since the Huffman code is optimal, it also satisfies these limits
Jan 2008
56
Shannon Code Examples
Example 1 (good) – dyadic probabilities:
  px = [0.5 0.25 0.125 0.125]
  −log₂ px = [1 2 3 3]
  lx = ⌈−log₂ px⌉ = [1 2 3 3]
  L_C = 1.75 bits, H(x) = 1.75 bits
Example 2 (bad):
  px = [0.99 0.01]
  −log₂ px = [0.0145 6.64]
  lx = ⌈−log₂ px⌉ = [1 7]   (obviously stupid to use 7)
  L_C = 1.06 bits, H(x) = 0.08 bits
We can make the H(x)+1 bound tighter by encoding longer blocks as a super-symbol
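Both examples follow directly from the length rule l_i = ⌈−log₂ p_i⌉; a quick Python check (mine, not from the notes):

```python
import math

def shannon_lengths(probs):
    """Shannon code lengths l_i = ceil(-log2 p_i)."""
    return [math.ceil(-math.log2(p)) for p in probs]

for p in ([0.5, 0.25, 0.125, 0.125], [0.99, 0.01]):
    l = shannon_lengths(p)
    avg = sum(pi * li for pi, li in zip(p, l))
    H = -sum(pi * math.log2(pi) for pi in p)
    print(l, round(avg, 3), round(H, 3))
# dyadic case: lengths [1, 2, 3, 3], average 1.75 = H
# skewed case: lengths [1, 7], average 1.06 bits although H ≈ 0.08 bits
```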
Jan 2008
57
Shannon versus Huffman
px = [0.36 0.34 0.25 0.05] ⇒ H(x) = 1.78 bits
Shannon
  −log₂ px = [1.47 1.56 2 4.32]
  l_S = ⌈−log₂ px⌉ = [2 2 2 5]
  L_S = 2.15 bits
Huffman
  l_H = [1 2 3 3]
  L_H = 1.94 bits
  (merges: d 0.05 + c 0.25 → 0.3; 0.3 + b 0.34 → 0.64; 0.64 + a 0.36 → 1.0)
Individual codewords may be longer in Huffman than Shannon, but not on average
Jan 2008
58
Shannon Competitive Optimality
• l(x) is the length of any uniquely decodable code
• l_S(x) = ⌈−log p(x)⌉ is the length of the Shannon code
then  p( l(x) ≤ l_S(x) − c ) ≤ 2^(1−c)
Proof: Define A = {x : p(x) < 2^(−l(x)−c+1)}   (the x with especially short l(x))
  p( l(x) ≤ ⌈−log p(x)⌉ − c ) ≤ p( l(x) < −log p(x) − c + 1 ) = p(x ∈ A)
    = Σ_{x∈A} p(x) < Σ_{x∈A} 2^(−l(x)−c+1)            [defn of A]
    ≤ Σ_x 2^(−l(x)−c+1) = 2^−(c−1) Σ_x 2^−l(x) ≤ 2^−(c−1)    [now over all x; Kraft inequality]
No other symbol code can do much better than the Shannon code most of the time
Jan 2008
59
Dyadic Competitive Optimality
If p is dyadic (⇔ log p(x_i) is an integer ∀i, ⇒ Shannon is optimal) then
  p( l(x) < l_S(x) ) ≤ p( l(x) > l_S(x) )   with equality iff l(x) ≡ l_S(x)
Proof:
– sgn(x) = {–1, 0, +1} for {x<0, x=0, x>0}
– Note: sgn(i) ≤ 2^i − 1 for all integers i, with equality iff i = 0 or 1
  p( l_S(x) > l(x) ) − p( l_S(x) < l(x) ) = Σ_x p(x) sgn( l_S(x) − l(x) )
    ≤ Σ_x p(x) (2^(l_S(x)−l(x)) − 1)                      [sgn() property]   (A)
    = −1 + Σ_x 2^−l_S(x) 2^(l_S(x)−l(x))                  [dyadic ⇒ p = 2^−l_S]
    = −1 + Σ_x 2^−l(x) ≤ −1 + 1 = 0                       [Kraft inequality]  (B)
A rival code cannot be shorter than Shannon more than half the time.
Equality at (A) ⇒ l(x) = l_S(x) − {0,1}, but l(x) < l_S(x) anywhere would violate Kraft at (B) since Shannon has Σ = 1
Jan 2008
60
Shannon with the wrong distribution
If the real distribution of x is p but you assign Shannon lengths using the distribution q, what is the penalty? Answer: D(p||q)
Proof (upper limit):
  E l(x) = Σ_i p_i ⌈−log q_i⌉ < Σ_i p_i (1 − log q_i)
         = Σ_i p_i (1 + log(p_i/q_i) − log p_i)
         = 1 + D(p||q) + H(p)
Therefore  H(p) + D(p||q) ≤ E l(x) < H(p) + D(p||q) + 1
(the proof of the lower limit is similar but without the 1)
If you use the wrong distribution, the penalty is D(p||q)
Jan 2008
61
Summary
• Any uniquely decodable code: E l(x) ≥ H_D(x) = H(x)/log₂D
• Fano Code: H_D(x) ≤ E l(x) ≤ H_D(x) + 1
  – Intuitively natural top-down design
• Huffman Code:
  – Bottom-up design
  – Optimal ⇒ at least as good as Shannon/Fano
• Shannon Code: l_i = ⌈−log_D p_i⌉,  H_D(x) ≤ E l(x) ≤ H_D(x) + 1
  – Close to optimal and easier to prove bounds
Note: Not everyone agrees on the names of Shannon and Fano codes
Jan 2008
62
Lecture 5
• Stochastic Processes
• Entropy Rate
• Markov Processes
• Hidden Markov Processes
Jan 2008
63
Stochastic process
Stochastic Process {x_i} = x₁, x₂, …
Entropy: H({x_i}) = H(x₁) + H(x₂|x₁) + … , often = ∞
Entropy Rate: H(X) = lim_{n→∞} (1/n) H(x_1:n)  if the limit exists
– The entropy rate estimates the additional entropy per new sample.
– Gives a lower bound on the number of code bits per sample.
– If the x_i are not i.i.d. the entropy rate limit may not exist.
Examples:
– x_i i.i.d. random variables:  H(X) = H(x_i)
– x_i indep, H(x_i) = 0 1 00 11 0000 1111 00000000 … : no convergence
Jan 2008
64
Lemma: Limit of Cesàro Mean
If a_n → b then (1/n) Σ_{k=1}^{n} a_k → b
Proof:
• Choose ε > 0 and find N₀ such that |a_n − b| < ½ε ∀n > N₀
• Set N₁ = 2N₀ε⁻¹ max(|a_r − b|) for r ∈ [1, N₀]
• Then ∀n > N₁:
    | n⁻¹ Σ_{k=1}^{n} a_k − b | = | n⁻¹ Σ_{k=1}^{N₀} (a_k − b) + n⁻¹ Σ_{k=N₀+1}^{n} (a_k − b) |
                                ≤ n⁻¹ N₀ max_r |a_r − b| + n⁻¹ n (½ε)
                                ≤ ½ε + ½ε = ε
The partial means of a_k are called Cesàro Means
Jan 2008
65
Stationary Process
Stochastic Process {x_i} is stationary iff
  p(x_1:n = a_1:n) = p(x_{k+(1:n)} = a_1:n)   ∀k, n, a_i ∈ X
If {x_i} is stationary then H(X) exists and
  H(X) = lim_{n→∞} (1/n) H(x_1:n) = lim_{n→∞} H(x_n | x_1:n−1)
Proof:
  0 ≤ H(x_n | x_1:n−1) ≤(a) H(x_n | x_2:n−1) =(b) H(x_{n−1} | x_1:n−2)
  (a) conditioning reduces entropy, (b) stationarity
Hence H(x_n | x_1:n−1) is +ve and decreasing ⇒ it tends to a limit, say b
Hence from the Cesàro Mean lemma:
  H(x_k | x_1:k−1) → b  ⇒  (1/n) H(x_1:n) = (1/n) Σ_{k=1}^{n} H(x_k | x_1:k−1) → b = H(X)
Jan 2008
66
Block Coding
If x_i is a stochastic process
– encode blocks of n symbols
– the 1-bit penalty of Shannon/Huffman is now shared between n symbols:
    n⁻¹H(x_1:n) ≤ n⁻¹ E l(x_1:n) ≤ n⁻¹H(x_1:n) + n⁻¹
If the entropy rate of x_i exists (⇐ x_i is stationary)
    n⁻¹H(x_1:n) → H(X)  ⇒  n⁻¹ E l(x_1:n) → H(X)
The extra 1 bit inefficiency becomes insignificant for large blocks
Jan 2008
67
Block Coding Example
X = [A;B], px = [0.9; 0.1],  H(x_i) = 0.469
[Plot: n⁻¹ E l versus block size n, decreasing towards H(x_i) = 0.469]
• n=1:
    sym   A    B
    prob  0.9  0.1
    code  0    1
    n⁻¹ E l = 1
• n=2:
    sym   AA    AB    BA    BB
    prob  0.81  0.09  0.09  0.01
    code  0     11    100   101
    n⁻¹ E l = 0.645
• n=3:
    sym   AAA    AAB    …   BBA     BBB
    prob  0.729  0.081  …   0.009   0.001
    code  0      101    …   10010   10011
    n⁻¹ E l = 0.583
Jan 2008
68
Markov Process
A discrete-valued Stochastic Process {x_i} is
• Independent iff p(x_n | x_0:n−1) = p(x_n)
• Markov iff p(x_n | x_0:n−1) = p(x_n | x_n−1)
  – time-invariant iff p(x_n=b | x_n−1=a) = p_ab, independent of n
  – Transition matrix: T = {t_ab}
    • Rows sum to 1: T1 = 1 where 1 is a vector of 1's
    • p_n = Tᵀ p_n−1
    • Stationary distribution: p∞ = Tᵀ p∞
[State diagram: four states with transition probabilities t_12, t_13, t_14, t_24, t_34, t_43]
An Independent Stochastic Process is easiest to deal with; Markov is the next easiest
Jan 2008
69
Stationary Markov Process
If a Markov process is
a) irreducible: you can go from any a to any b in a finite number of steps
   • irreducible iff (I+Tᵀ)^(|X|−1) has no zero entries
b) aperiodic: ∀a, the possible times to go from a to a have highest common factor = 1
then it has exactly one stationary distribution, p∞.
– p∞ is the eigenvector of Tᵀ with λ = 1:  Tᵀ p∞ = p∞
– Tⁿ → 1 p∞ᵀ as n → ∞, where 1 = [1 1 … 1]ᵀ
Jan 2008
71
Chess Board Random Walk
• Move to any adjacent square (including diagonals) with equal probability
• p₁ = [1 0 … 0]ᵀ  ⇒  H(p₁) = 0, H(p₁|p₀) = 0
• p∞ = 1/40 × [3 5 3 5 8 5 3 5 3]ᵀ  ⇒  H(p∞) = 3.0855
• H(X) = lim_{n→∞} H(x_n | x_n−1) = Σ_{i,j} −p∞,i t_i,j log(t_i,j) = 2.2365
– Time-invariant and p₁ = p∞ ⇒ stationary
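The entropy-rate formula H(X) = Σ_{i,j} −p∞,i t_i,j log t_i,j is straightforward to evaluate numerically. A generic Python sketch (mine; the two-state chain below is an illustrative example, not the chess board, whose 9×9 king-move matrix would be built the same way):

```python
import numpy as np

def entropy_rate(T):
    """Entropy rate (bits/step) of a stationary, irreducible, aperiodic Markov chain
    with transition matrix T (rows sum to 1): sum_i p_inf[i] * H(row i)."""
    T = np.asarray(T, float)
    w, v = np.linalg.eig(T.T)
    p = np.real(v[:, np.argmin(np.abs(w - 1))])   # left eigenvector for eigenvalue 1
    p /= p.sum()                                  # normalise to a distribution
    with np.errstate(divide='ignore', invalid='ignore'):
        row_h = np.where(T > 0, -T * np.log2(T), 0.0).sum(axis=1)
    return float(p @ row_h)

# Illustrative two-state chain: stay with prob 0.9, switch with prob 0.1
T = np.array([[0.9, 0.1],
              [0.1, 0.9]])
print(entropy_rate(T))   # ≈ 0.469 bits per step = H(0.1)
```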
Jan 2008
72
Chess Board Frames
[Eight frames showing the distribution pₙ spreading across the board: H(pₙ) rises from H(p₁)=0 towards H(p∞)=3.0855, and the conditional entropy H(pₙ|pₙ₋₁) converges to the entropy rate 2.2365]
Jan 2008
73
ALOHA Wireless Example
M users share wireless transmission channel
– For each user independently in each timeslot:
• if its queue is non-empty it transmits with prob q
• a new packet arrives for transmission with prob p
– If two packets collide, they stay in the queues
– At time t, queue sizes are xt = (n1, …, nM)
• {xt} is Markov since p(xt) depends only on xt–1
Transmit vector is y_t:   p(y_i,t = 1) = 0 if x_i,t = 0;  = q if x_i,t > 0
– {yt} is not Markov since p(yt) is determined by xt but is not
determined by yt–1. {yt} is called a Hidden Markov Process.
Jan 2008
75
ALOHA example
[Figure: waiting packets in one queue over 16 timeslots]
Queue size x:    3 0 2 1 1 1 2 1 1 1 0 2 1 2 2 1
TX enable, e:    1 1 0 1 0 0 1 0 1 1 1 1 1 0 1 0
y:               1 0 0 1 0 0 1 0 1 1 0 1 1 0 1 0
y = (x>0)·e is a deterministic function of the Markov process [x; e]
Jan 2008
76
Hidden Markov Process
If {x_i} is a stationary Markov process and y = f(x) then {y_i} is a stationary Hidden Markov process.
What is the entropy rate H(Y) ?
– Stationarity ⇒ (1) H(y_n | y_1:n−1) ≥ H(Y)  and  → H(Y) as n→∞
– Also           (2) H(y_n | y_1:n−1, x₁) ≤ H(Y)  and  → H(Y) as n→∞
So H(Y) is sandwiched between two quantities which converge to the same value for large n.
Proof of (1) and (2) on the next slides
Jan 2008
77
Hidden Markov Process – (1)
Proof (1):  H(y_n | y_1:n−1, x₁) ≤ H(Y)
  H(y_n | y_1:n−1, x₁) = H(y_n | y_1:n−1, x_−k:1)  ∀k                    [x Markov]
    = H(y_n | y_1:n−1, x_−k:1, y_−k:1) = H(y_n | y_−k:n−1, x_−k:1)       [y = f(x)]
    ≤ H(y_n | y_−k:n−1)  ∀k                                              [conditioning reduces entropy]
    = H(y_{k+n} | y_0:k+n−1) → H(Y) as k→∞                               [y stationary]
Just knowing x₁ in addition to y_1:n−1 reduces the conditional entropy to below the entropy rate.
Jan 2008
78
Hidden Markov Process – (2)
Proof (2):  H(y_n | y_1:n−1, x₁) → H(Y) as n→∞
Note that Σ_{n=1}^{k} I(x₁; y_n | y_1:n−1) = I(x₁; y_1:k) ≤ H(x₁)        [chain rule; defn of I(A;B)]
Hence the bounded sum of non-negative terms gives I(x₁; y_n | y_1:n−1) → 0 as n→∞
So  H(y_n | y_1:n−1, x₁) = H(y_n | y_1:n−1) − I(x₁; y_n | y_1:n−1) → H(Y) − 0    [defn of I(A;B)]
The influence of x₁ on y_n decreases over time.
Jan 2008
79
Summary
• Entropy Rate: H(X) = lim_{n→∞} (1/n) H(x_1:n)  if it exists
  – {x_i} stationary: H(X) = lim_{n→∞} H(x_n | x_1:n−1)
  – {x_i} stationary Markov: H(X) = H(x_n | x_n−1) = Σ_{i,j} −p∞,i t_i,j log(t_i,j)
  – y = f(x), Hidden Markov: H(y_n | y_1:n−1, x₁) ≤ H(Y) ≤ H(y_n | y_1:n−1), with both sides tending to H(Y)
Jan 2008
80
Lecture 6
• Stream Codes
• Arithmetic Coding
• Lempel-Ziv Coding
Jan 2008
81
Huffman: Good and Bad
Good
– Shortest possible symbol code:  H(x)/log₂D ≤ L ≤ H(x)/log₂D + 1
Bad
– Redundancy of up to 1 bit per symbol
  • Expensive if H(x) is small
  • Less so if you use a block of N symbols
  • Redundancy equals zero iff p(x_i) = 2^−k(i) ∀i
– Must recompute the entire code if any symbol probability changes
  • A block of N symbols needs |X|^N pre-calculated probabilities
Jan 2008
82
Arithmetic Coding
• Take all possible blocks of N symbols and sort them into lexical order, x_r for r = 1:|X|^N
• Calculate cumulative probabilities in binary:  Q_r = Σ_{i≤r} p(x_i),  Q₀ = 0
• To encode x_r, transmit enough binary places to define the interval (Q_r−1, Q_r) unambiguously.
• Use the first l_r places of m_r 2^−l_r where l_r is the least integer with
    Q_r−1 ≤ m_r 2^−l_r < (m_r + 1) 2^−l_r ≤ Q_r
[Example diagram: X=[a b], p=[0.6 0.4], N=3 with the interval and code for each block]
Jan 2008
83
Arithmetic Coding – Code lengths
• The interval corresponding to x_r has width p(x_r) = Q_r − Q_r−1 ≜ 2^−d_r
• Define k_r = ⌈d_r⌉ ⇒ d_r ≤ k_r < d_r + 1 ⇒ ½p(x_r) < 2^−k_r ≤ p(x_r)
• Set m_r = ⌈2^k_r Q_r−1⌉ ⇒ Q_r−1 ≤ m_r 2^−k_r    (Q_r−1 rounded up to k_r bits)
• If (m_r + 1)2^−k_r ≤ Q_r then set l_r = k_r; otherwise
  – set l_r = k_r + 1 and redefine m_r = ⌈2^l_r Q_r−1⌉ ⇒ (m_r − 1)2^−l_r < Q_r−1 ≤ m_r 2^−l_r
  – now (m_r + 1)2^−l_r = (m_r − 1)2^−l_r + 2^−k_r < Q_r−1 + p(x_r) = Q_r
• We always have l_r ≤ k_r + 1 < d_r + 2 = −log(p_r) + 2
  – always within 2 bits of the optimum code for the block (k_r is the Shannon length)
Example (X=[a b], p=[0.6 0.4], N=3):
  d_6,7 = −log p(x_6,7) = −log 0.096 = 3.38 bits
  Q₅ = 0.7440 = 0.101111₂ ;  Q₆ = 0.8400 = 0.110101₂ ;  Q₇ = 0.9360 = 0.111011₂
  k₇ = 4, m₇ = 14, 15×2⁻⁴ > Q₇ ⇒ l₇ = 5, m₇ = 27, 28×2⁻⁵ ≤ Q₇   (codeword for bba: 11011)
  k₆ = 4, m₆ = 12, 13×2⁻⁴ < Q₆                                    (codeword for bab: 1100)
Jan 2008
84
Arithmetic Coding - Advantages
• Long blocks can be used
– Symbol blocks are sorted lexically rather than in probability
order
– Receiver can start decoding symbols before the entire code has
been received
– Transmitter and receiver can work out the codes on the fly
• no need to store entire codebook
• Transmitter and receiver can use identical finite-precision arithmetic
– rounding errors are the same at transmitter and receiver
– rounding errors affect code lengths slightly but not transmission
accuracy
Jan 2008
85
Arithmetic Coding Receiver
Each additional bit received narrows down the possible interval.
[Diagram: X=[a b], p=[0.6 0.4]; the cumulative probabilities Q_r partition [0,1) into intervals for the blocks aaa…bbb, and the received bits 1 0 0 1 1 1 progressively narrow the interval until it identifies the sequence babbaa…]
Jan 2008
86
Arithmetic Coding/Decoding
Worked example with X = [a b], p = [0.6 0.4] and 8-bit registers.
Transmitter register trace (Min, Max after each input symbol):
  Input:  –          b          a          b          b          a          a          a          b         …
  Min:    00000000   10011001   10011001   10111110   11001101   11001101   11001101   11001101   11001110  …
  Max:    11111111   11111111   11010111   11010111   11010111   11010011   11010000   11001111   11001111  …
  Bits are sent as the leading bits of Min and Max agree: 1, 10, 011, 1, …
[Receiver: reconstructs the same Min/Max registers from the received bits, compares a Test value against them, and outputs b a b b a a …]
Min/Max give the limits of the input or output interval; they are identical in the transmitter and receiver.
Blue denotes transmitted bits – they are compared with the corresponding bits of the receiver's test value, and red bits show the first difference. Grey identifies unchanged words.
Jan 2008
87
Arithmetic Coding Algorithm
Input Symbols: X = [a b], p = [p q]
[min, max] = Input Probability Range
Note: only keep the untransmitted bits of min and max
Coding Algorithm:
  Initialize [min, max] = [000…0, 111…1]
  For each input symbol, s
    If s=a then max = min + p(max–min) else min = min + p(max–min)
    while min and max have the same MSB
      transmit MSB and set min = (min<<1) and max = (max<<1)+1
    end while
  end for
• The decoder is almost identical. Identical rounding errors ⇒ no symbol errors.
• Simple to modify the algorithm for |X|>2 and/or D>2.
• Need to protect against range underflow when [min max] = [011111…, 100000…].
88
Adaptive Probabilities
Number of guesses for next letter (a-z, space):
o r a n g e s a n d l e mo n s
17 7 8 4 1 1 2 1 1 5 1 1 1 1 1 1 1 1
We can change the input symbol probabilities
based on the context (= the past input sequence)
Example: Bernoulli source with unknown p. Adapt p based
on symbol frequencies so far:
X = [a b], pn = [1–pn pn], pn =
1 + count ( xi = b)
1≤ i < n
1+ n
Jan 2008
89
Adaptive Arithmetic Coding
p_n = (1 + count_{1≤i<n}(x_i = b)) / (1 + n)
p₁ = 0.5;  p₂ = 1/3 or 2/3;  p₃ = ¼ or ½ or ¾;  p₄ = …
[Diagram: the adaptive interval boundaries for all 4-symbol blocks aaaa … bbbb]
Coder and decoder only need to calculate the probabilities along the path that actually occurs
Jan 2008
90
Lempel-Ziv Coding
Memorize previously occurring substrings in the input data
– parse the input into the shortest possible distinct 'phrases':
    1 0 11 01 010 00 10 …    (phrases numbered 1–7; 0 is the empty string)
– each phrase consists of a previously occurring phrase (head) followed by an additional 0 or 1 (tail)
– transmit the code for the head followed by the additional bit for the tail:
    (0,1)(0,0)(1,1)(2,1)(4,0)(2,0)(1,0)…
– for the head use enough bits for the max phrase number so far:
    1 00 011 101 1000 0100 0010 …    (head bits underlined in the original slide)
– the decoder constructs an identical dictionary
Jan 2008
91
Lempel-Ziv Example
Input = 1011010100010010001001010010
    Phrase  Dictionary entry   Send     Decode
    1       1                  1        1
    2       0                  00       0
    3       11                 011      11
    4       01                 101      01
    5       010                1000     010
    6       00                 0100     00
    7       10                 0010     10
    8       0100               1010     0100
    9       01001              10001    01001
    10      010010             10010    010010
(entry 0 is the empty string φ; each Send field is the head's dictionary index, written with enough bits for the largest index so far, followed by the tail bit)
Improvement
• Each head can only be used twice, so at its second use we can:
  – Omit the tail bit
  – Delete the head from the dictionary and re-use the dictionary entry
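The parsing step of the table above is easy to reproduce. A short Python sketch (mine, not from the notes) of the LZ78-style phrase parser; any trailing incomplete phrase is simply ignored here:

```python
def lz78_parse(bits):
    """Parse a bit string into distinct phrases; each phrase is a previously seen
    phrase (given by its dictionary index) plus one extra bit."""
    dictionary = {'': 0}          # phrase -> index; 0 is the empty phrase
    phrase, pairs = '', []
    for b in bits:
        if phrase + b in dictionary:
            phrase += b           # keep extending until the phrase is new
        else:
            pairs.append((dictionary[phrase], b))
            dictionary[phrase + b] = len(dictionary)
            phrase = ''
    return pairs

print(lz78_parse('1011010100010'))
# [(0,'1'), (0,'0'), (1,'1'), (2,'1'), (4,'0'), (2,'0'), (1,'0')]
# i.e. the phrases 1, 0, 11, 01, 010, 00, 10 from the slides
```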
Jan 2008
92
Lempel-Ziv Comments
Dictionary D contains K entries D(0), …, D(K–1). We need to send M = ⌈log K⌉ bits to specify a dictionary entry. Initially K=1, D(0) = φ = the null string and M = ⌈log K⌉ = 0 bits.
Input   Action
1       "1" ∉ D so send "1" and set D(1)="1". Now K=2 ⇒ M=1.
0       "0" ∉ D so split it up as "φ"+"0" and send "0" (since D(0)=φ) followed by "0". Then set D(2)="0" making K=3 ⇒ M=2.
1       "1" ∈ D so don't send anything yet – just read the next input bit.
1       "11" ∉ D so split it up as "1"+"1" and send "01" (since D(1)="1" and M=2) followed by "1". Then set D(3)="11" making K=4 ⇒ M=2.
0       "0" ∈ D so don't send anything yet – just read the next input bit.
1       "01" ∉ D so split it up as "0"+"1" and send "10" (since D(2)="0" and M=2) followed by "1". Then set D(4)="01" making K=5 ⇒ M=3.
0       "0" ∈ D so don't send anything yet – just read the next input bit.
1       "01" ∈ D so don't send anything yet – just read the next input bit.
0       "010" ∉ D so split it up as "01"+"0" and send "100" (since D(4)="01" and M=3) followed by "0". Then set D(5)="010" making K=6 ⇒ M=3.
So far we have sent 1000111011000, where the dictionary entry numbers are shown in red in the original slide.
Jan 2008
93
Lempel-Ziv properties
• Widely used
– many versions: compress, gzip, TIFF, LZW, LZ77, …
– different dictionary handling, etc
• Excellent compression in practice
– many files contain repetitive sequences
– worse than arithmetic coding for text files
• Asymptotically optimum on a stationary ergodic source (i.e. achieves the entropy rate)
  – {x_i} stationary ergodic ⇒ lim sup_{n→∞} n⁻¹ l(x_1:n) ≤ H(X) with prob 1
    • Proof: C&T chapter 12.10
  – may only approach this for an enormous file
Jan 2008
94
Summary
• Stream Codes
– Encoder and decoder operate sequentially
• no blocking of input symbols required
– Not forced to send ≥1 bit per input symbol
• can achieve entropy rate even when H(X)<1
• Require a Perfect Channel
– A single transmission error causes multiple
wrong output symbols
– Use finite length blocks to limit the damage
Jan 2008
95
Lecture 7
• Markov Chains
• Data Processing Theorem
– you can’t create information from nothing
• Fano’s Inequality
– lower bound for error in estimating X from Y
Jan 2008
96
Markov Chains
If we have three random variables: x, y, z
p ( x, y , z ) = p ( z | x, y ) p ( y | x ) p ( x )
they form a Markov chain x→y→z if
p ( z | x, y ) = p ( z | y ) ⇔ p ( x, y , z ) = p ( z | y ) p ( y | x ) p ( x )
A Markov chain x→y→z means that
– the only way that x affects z is through the value of y
– if you already know y, then observing x gives you no additional
information about z, i.e. I ( x ; z | y ) = 0 ⇔ H (z | y ) = H (z | x , y )
– if you know y, then observing z gives you no additional
information about x.
A common special case of a Markov chain is when z = f(y)
Jan 2008
97
Markov Chain Symmetry
Iff x→y→z:
  p(x,z|y) = p(x,y,z)/p(y) =(a) p(x,y) p(z|y) / p(y) = p(x|y) p(z|y)
  (a) p(z|x,y) = p(z|y)
Hence x and z are conditionally independent given y
Also x→y→z iff z→y→x since
  p(x|y) = p(x|y) p(z|y) p(y) / p(y,z) =(a) p(x,z|y) p(y) / p(y,z) = p(x,y,z) / p(y,z) = p(x|y,z)
  (a) p(x,z|y) = p(x|y) p(z|y)
The Markov chain property is symmetrical
Jan 2008
98
Data Processing Theorem
If x→y→z then I(x ;y) ≥ I(x ; z)
– processing y cannot add new information about x
If x→y→z then I(x ;y) ≥ I(x ; y | z)
– Knowing z can only decrease the amount x tells you about y
Proof:
I (x ; y , z ) = I (x ; y ) + I (x ; z | y ) = I (x ; z ) + I (x ; y | z )
(a)
but I (x ; z | y ) = 0
hence I (x ; y ) = I (x ; z ) + I (x ; y | z )
so I ( x ; y ) ≥ I ( x ; z ) and I ( x ; y ) ≥ I ( x ; y | z )
(a) I(x ;z)=0 iff x and z are independent; Markov ⇒ p(x,z |y)=p(x |y)p(z |y)
Jan 2008
99
Non-Markov: Conditioning can increase I
Noisy Channel: z = x + y
– X = Y = [0,1]ᵀ, p_x = p_y = [½,½]ᵀ
– I(x;y) = 0 since independent
– but I(x;y|z) = ½
  joint distribution: (x,y)=00 ⇒ z=0 (prob ¼); (x,y)=01 or 10 ⇒ z=1 (prob ¼ each); (x,y)=11 ⇒ z=2 (prob ¼)
  H(x|z) = H(y|z) = H(x,y|z) = 0×¼ + 1×½ + 0×¼ = ½    (since whenever z≠1, H(·)=0)
  I(x;y|z) = H(x|z) + H(y|z) − H(x,y|z) = ½ + ½ − ½ = ½
If you know z, then x and y are no longer independent
Jan 2008
100
Long Markov Chains
If x1→ x2 → x3 → x4 → x5 → x6
then Mutual Information increases as you
get closer together:
– e.g. I(x3, x4) ≥ I(x2, x4) ≥ I(x1, x5) ≥ I(x1, x6)
Jan 2008
101
Sufficient Statistics
If the pdf of x depends on a parameter θ and you extract a statistic T(x) from your observation, then
  θ → x → T(x)  ⇒  I(θ; T(x)) ≤ I(θ; x)
T(x) is sufficient for θ if the stronger condition holds:
  θ → x → T(x) → θ  ⇔  I(θ; T(x)) = I(θ; x)
                     ⇔  θ → T(x) → x → θ
                     ⇔  p(x | T(x), θ) = p(x | T(x))
Example: x_i ~ Bernoulli(θ),  T(x_1:n) = Σ_{i=1}^{n} x_i
  p(x_1:n = x_1:n | θ, Σx_i = k) = 1/ⁿCₖ  if Σx_i = k,  and 0 if Σx_i ≠ k
which is independent of θ ⇒ sufficient
Jan 2008
102
Fano's Inequality
If we estimate x from y (x → y → x̂), what is p_e = p(x̂ ≠ x) ?
  H(x|y) ≤ H(p_e) + p_e log(|X|−1)
  ⇒  p_e ≥ (H(x|y) − H(p_e)) / log(|X|−1)  ≥(a)  (H(x|y) − 1) / log(|X|−1)
  (a) the second form is weaker but easier to use
Proof: Define a random variable e = (x̂ ≠ x) ? 1 : 0
  H(e,x|y) = H(x|y) + H(e|x,y) = H(e|y) + H(x|e,y)               [chain rule]
  ⇒ H(x|y) + 0 ≤ H(e) + H(x|e,y)                                 [H ≥ 0; H(e|y) ≤ H(e)]
    = H(e) + H(x|y, e=0)(1−p_e) + H(x|y, e=1) p_e
    ≤ H(p_e) + 0×(1−p_e) + log(|X|−1) p_e                         [H(e) = H(p_e)]
Fano's inequality is used whenever you need to show that errors are inevitable
Jan 2008
103
Fano Example
X = {1:5},  p_x = [0.35, 0.35, 0.1, 0.1, 0.1]ᵀ
Y = {1:2}: if x ≤ 2 then y = x with probability 6/7, while if x > 2 then y = 1 or 2 with equal prob.
Our best strategy is to guess x̂ = y
– p_x|y=1 = [0.6, 0.1, 0.1, 0.1, 0.1]ᵀ
– actual error prob: p_e = 0.4
Fano bound:  p_e ≥ (H(x|y) − 1) / log(|X|−1) = (1.771 − 1) / log 4 = 0.3855
Main use: to show when error-free transmission is impossible, since then p_e > 0
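The weaker form of the bound is a one-line computation; a tiny Python check of the 0.3855 figure above (mine, not from the notes):

```python
import math

def fano_bound(H_x_given_y, alphabet_size):
    """Weaker Fano bound: pe >= (H(x|y) - 1) / log2(|X| - 1)."""
    return (H_x_given_y - 1) / math.log2(alphabet_size - 1)

print(fano_bound(1.771, 5))   # ≈ 0.386, versus the actual error probability 0.4
```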
Jan 2008
104
Summary
• Markov: x→y→z ⇔ p(z|x,y) = p(z|y) ⇔ I(x;z|y) = 0
• Data Processing Theorem: if x→y→z then
  – I(x;y) ≥ I(x;z)
  – I(x;y) ≥ I(x;y|z)   (can be false if not Markov)
• Fano's Inequality: if x → y → x̂ then
    p_e ≥ (H(x|y) − H(p_e)) / log(|X|−1) ≥ (H(x|y) − 1) / log(|X|−1) ≥ (H(x|y) − 1) / log|X|
  (the later forms are weaker but easier to use since they are independent of p_e)
Jan 2008
105
Lecture 8
• Weak Law of Large Numbers
• The Typical Set
– Size and total probability
• Asymptotic Equipartition Principle
Jan 2008
106
Strong and Weak Typicality
X = {a, b, c, d}, p = [0.5 0.25 0.125 0.125]
–log p = [1 2 3 3] ⇒ H(p) = 1.75 bits
Sample eight i.i.d. values
• strongly typical ⇒ correct proportions
aaaabbcd –log p(x) = 14 = 8×1.75
• [weakly] typical ⇒ log p(x) = nH(x)
aabbbbbb –log p(x) = 14 = 8×1.75
• not typical at all ⇒ log p(x) ≠ nH(x)
dddddddd –log p(x) = 24
Strongly Typical ⇒ Typical
Jan 2008
107
Convergence of Random Numbers
• Convergence
    x_n → y  ⇔  ∀ε>0, ∃m such that ∀n>m, |x_n − y| < ε
  Example: x_n = ±2⁻ⁿ with p = [½; ½];  choose m = 1 − log ε
• Convergence in probability (weaker than convergence)
    x_n →(prob) y  ⇔  ∀ε>0, P(|x_n − y| > ε) → 0 as n→∞
  Example: x_n ∈ {0; 1} with p = [1−n⁻¹; n⁻¹];  for any small ε, p(|x_n| > ε) = n⁻¹ → 0
Note: y can be a constant or another random variable
Jan 2008
108
Weak Law of Large Numbers
Given i.i.d. {x_i}, the Cesàro mean  s_n = (1/n) Σ_{i=1}^{n} x_i  satisfies
  E s_n = E x = μ,    Var s_n = n⁻¹ Var x = n⁻¹σ²
As n increases, Var s_n gets smaller and the values become clustered around the mean
WLLN:  s_n →(prob) μ  ⇔  ∀ε>0, P(|s_n − μ| > ε) → 0 as n→∞
The "strong law of large numbers" says that convergence is actually almost sure, provided that x has finite variance
Jan 2008
109
Proof of WLLN
• Chebyshev's Inequality
    Var y = E(y−μ)² = Σ_{y∈Y} (y−μ)² p(y)
          ≥ Σ_{y:|y−μ|>ε} (y−μ)² p(y) ≥ Σ_{y:|y−μ|>ε} ε² p(y) = ε² p(|y−μ| > ε)
• WLLN
  For any choice of ε:  s_n = (1/n) Σ_{i=1}^{n} x_i  where E x_i = μ and Var x_i = σ²
    ε² p(|s_n − μ| > ε) ≤ Var s_n = σ²/n → 0
  Hence  s_n →(prob) μ
Actually true even if σ = ∞
Jan 2008
110
Typical Set
x is the i.i.d. sequence {x_i} for 1 ≤ i ≤ n
– Prob of a particular sequence is p(x) = Π_{i=1}^{n} p(x_i)
– E −log p(x) = n E −log p(x_i) = nH(x)
– Typical set: T_ε^(n) = { x ∈ Xⁿ : |−n⁻¹ log p(x) − H(x)| < ε }
Example:
– x_i Bernoulli with p(x_i=1) = p
– e.g. p([0 1 1 0 0 0]) = p²(1−p)⁴
– For p = 0.2, H(x) = 0.72 bits
[Histogram of n⁻¹ log p(x) for p=0.2, ε=0.1: the red bar shows T_0.1^(n); its probability grows with n, e.g. p_T = 0%, 0%, 41%, 29%, 45%, 62%, 72%, 85% for n = 1, 2, 4, 8, 16, 32, 64, 128]
Jan 2008
111
Typical Set Frames
[Eight histograms of N⁻¹ log p(x) for p=0.2, ε=0.1 and N = 1, 2, 4, 8, 16, 32, 64, 128, with example sequences marked (e.g. 1111, 00110011, 0010001100010010); the probability of the typical set is p_T = 0%, 0%, 41%, 29%, 45%, 62%, 72%, 85% respectively]
Jan 2008
112
Typical Set: Properties
1. Individual prob:  x ∈ T_ε^(n) ⇒ log p(x) = −nH(x) ± nε
2. Total prob:       p(x ∈ T_ε^(n)) > 1 − ε  for n > N_ε
3. Size:             (1−ε) 2^(n(H(x)−ε)) < |T_ε^(n)| ≤ 2^(n(H(x)+ε))   (lower bound for n > N_ε)
Proof 2:  −n⁻¹ log p(x) = n⁻¹ Σ_{i=1}^{n} −log p(x_i) →(prob) E −log p(x_i) = H(x)
  Hence ∀ε>0 ∃N_ε s.t. ∀n>N_ε:  p( |−n⁻¹ log p(x) − H(x)| > ε ) < ε
Proof 3a:  1 = Σ_x p(x) ≥ Σ_{x∈T_ε^(n)} p(x) ≥ Σ_{x∈T_ε^(n)} 2^(−n(H(x)+ε)) = 2^(−n(H(x)+ε)) |T_ε^(n)|
Proof 3b:  for n > N_ε,  1 − ε < p(x ∈ T_ε^(n)) ≤ Σ_{x∈T_ε^(n)} 2^(−n(H(x)−ε)) = 2^(−n(H(x)−ε)) |T_ε^(n)|
Jan 2008
113
Asymptotic Equipartition Principle
"Almost all events are almost equally surprising"
• For any ε and for n > N_ε:  p(x ∈ T_ε^(n)) > 1 − ε  and  log p(x) = −nH(x) ± nε
Coding consequence (of the |X|ⁿ possible sequences, at most 2^(n(H(x)+ε)) are typical):
– x ∈ T_ε^(n):  '0' + at most 1 + n(H+ε) bits
– x ∉ T_ε^(n):  '1' + at most 1 + n log|X| bits
– L = average code length ≤ 2 + n(H+ε) + ε(n log|X|) = n(H + ε + ε log|X| + 2n⁻¹)
Jan 2008
114
Source Coding & Data Compression
For any choice of δ > 0, we can, by choosing the block size n large enough, do either of the following:
• make a lossless code using only H(x)+δ bits per symbol on average:
    L ≤ n(H + ε + ε log|X| + 2n⁻¹)
• make a code with an error probability < ε using H(x)+δ bits for each symbol
  – just code T_ε using n(H+ε+n⁻¹) bits and use a random wrong code if x ∉ T_ε
115
What N_ε ensures that p(x ∈ T_ε^(n)) > 1 − ε ?
From the WLLN, if Var(−log p(x_i)) = σ² then for any n
  p( |(1/n) Σ_{i=1}^{n} −log p(x_i) − H(x)| > ε ) ≤ σ²/(nε²)        [Chebyshev]
Choose N_ε = σ²ε⁻³  ⇒  p(x ∉ T_ε^(n)) ≤ σ²/(nε²) ≤ ε  ⇒  p(x ∈ T_ε^(n)) > 1 − ε
N_ε increases rapidly for small ε.
For this choice of N_ε, if x ∈ T_ε^(n):
  log p(x) = −nH(x) ± nε = −nH(x) ± σ²ε⁻²
So within T_ε^(n), p(x) can vary by a factor of 2^(2σ²ε⁻²)
Within the Typical Set, p(x) can actually vary a great deal when ε is small
Jan 2008
116
Smallest high-probability Set
T_ε^(n) is a small subset of Xⁿ containing most of the probability mass. Can you get even smaller ?
For any 0 < ε < 1, choose N₀ = −ε⁻¹ log ε. Then for any n > max(N₀, N_ε) and any subset S^(n) satisfying |S^(n)| < 2^(n(H(x)−2ε)):
  p(x ∈ S^(n)) = p(x ∈ S^(n) ∩ T_ε^(n)) + p(x ∈ S^(n) ∩ (T_ε^(n))ᶜ)
               < |S^(n)| max_{x∈T_ε^(n)} p(x) + p(x ∉ T_ε^(n))
               < 2^(n(H−2ε)) 2^(−n(H−ε)) + ε                [for n > N_ε]
               = 2^(−nε) + ε < 2ε                           [for n > N₀, 2^(−nε) < 2^(log ε) = ε]
Answer: No
Jan 2008
117
Summary
• Typical Set
  – Individual Prob:  x ∈ T_ε^(n) ⇒ log p(x) = −nH(x) ± nε
  – Total Prob:       p(x ∈ T_ε^(n)) > 1 − ε  for n > N_ε
  – Size:             (1−ε) 2^(n(H(x)−ε)) < |T_ε^(n)| ≤ 2^(n(H(x)+ε))   (lower bound for n > N_ε)
• No other high-probability set can be much smaller than T_ε^(n)
• Asymptotic Equipartition Principle
  – Almost all event sequences are equally surprising
Jan 2008
118
Lecture 9
• Source and Channel Coding
• Discrete Memoryless Channels
– Symmetric Channels
– Channel capacity
• Binary Symmetric Channel
• Binary Erasure Channel
• Asymmetric Channel
Jan 2008
119
Source and Channel Coding
In → Compress → Encode → Noisy Channel → Decode → Decompress → Out
      (source coding)      (channel coding)
• Source Coding
  – Compresses the data to remove redundancy
• Channel Coding
  – Adds redundancy to protect against channel errors
Jan 2008
120
Discrete Memoryless Channel
• Input x∈X, Output y∈Y:   x → Noisy Channel → y
• Time-Invariant Transition-Probability Matrix
    (Q_y|x)_{i,j} = p(y = y_j | x = x_i)
  – Hence p_y = Q_y|xᵀ p_x
  – Q: each row sum = 1, average column sum = |X||Y|⁻¹
• Memoryless: p(y_n | x_1:n, y_1:n−1) = p(y_n | x_n)
• DMC = Discrete Memoryless Channel
Jan 2008
121
Binary Channels
• Binary Symmetric Channel: X = [0 1], Y = [0 1]
    Q = [1−f  f ;  f  1−f]
• Binary Erasure Channel: X = [0 1], Y = [0 ? 1]
    Q = [1−f  f  0 ;  0  f  1−f]
• Z Channel: X = [0 1], Y = [0 1]
    Q = [1  0 ;  f  1−f]
[Channel transition diagrams for each case]
Symmetric: rows are permutations of each other; columns are permutations of each other
Weakly Symmetric: rows are permutations of each other; columns have the same sum
Jan 2008
122
Weakly Symmetric Channels
Weakly Symmetric:
1. All columns of Q have the same sum = |X||Y|⁻¹
   – If x is uniform (i.e. p(x) = |X|⁻¹) then y is uniform:
       p(y) = Σ_{x∈X} p(y|x) p(x) = |X|⁻¹ Σ_{x∈X} p(y|x) = |X|⁻¹ × |X||Y|⁻¹ = |Y|⁻¹
2. All rows are permutations of each other
   – Each row of Q has the same entropy, so
       H(y|x) = Σ_{x∈X} p(x) H(y|x=x) = H(Q_{1,:}) Σ_{x∈X} p(x) = H(Q_{1,:})
     where H(Q_{1,:}) is the entropy of the first (or any other) row of the Q matrix
Symmetric:  1. all rows are permutations of each other;  2. all columns are permutations of each other
Symmetric ⇒ weakly symmetric
Jan 2008
123
Channel Capacity
• Capacity of a DMC channel:  C = max_{p_x} I(x;y)
  – The maximum is over all possible input distributions p_x
  – ∃ only one maximum since I(x;y) is concave in p_x for fixed p_y|x
  – We want to find the p_x that maximizes I(x;y)
  – Limits on C:  0 ≤ C ≤ min(H(x), H(y)) ≤ min(log|X|, log|Y|)  ♦
• Capacity for n uses of the channel:  C^(n) = (1/n) max_{p_x1:n} I(x_1:n; y_1:n)
♦ = proved two pages on
Jan 2008
124
Mutual Information Plot
Binary Symmetric Channel with Bernoulli input: input 1 with probability p, crossover probability f
  I(x;y) = H(y) − H(y|x) = H(f − 2pf + p) − H(f)
[3-D surface plot of I(x;y) versus the input Bernoulli probability p and the channel error probability f]
Jan 2008
125
Mutual Information Concave in pX
Mutual Information I(x;y) is concave in p_x for fixed p_y|x
Proof: Let u and v have prob mass vectors u and v
– Define z: a Bernoulli random variable with p(1) = λ
– Let x = u if z=1 and x = v if z=0  ⇒  p_x = λu + (1−λ)v
  I(x,z;y) = I(x;y) + I(z;y|x) = I(z;y) + I(x;y|z)
  but I(z;y|x) = H(y|x) − H(y|x,z) = 0, so
  I(x;y) ≥ I(x;y|z) = λ I(x;y|z=1) + (1−λ) I(x;y|z=0) = λ I(u;y) + (1−λ) I(v;y)
Special case: y = x ⇒ I(x;x) = H(x) is concave in p_x
[Diagram: z deterministically selects between sources U and V feeding the channel p(y|x)]
126
Mutual Information Convex in pY|X
Mutual Information I(x;y) is convex in p_y|x for fixed p_x
Proof: define u, v, z and x as before, with z (independent of x) deterministically selecting which channel is used:
  p_y|x = λ p_u|x + (1−λ) p_v|x
  I(x;y,z) = I(x;y|z) + I(x;z) = I(x;y) + I(x;z|y)
  but I(x;z) = 0 and I(x;z|y) ≥ 0, so
  I(x;y) ≤ I(x;y|z) = λ I(x;y|z=1) + (1−λ) I(x;y|z=0) = λ I(x;u) + (1−λ) I(x;v)
Jan 2008
127
n-use Channel Capacity
For a Discrete Memoryless Channel:
  I(x_1:n; y_1:n) = H(y_1:n) − H(y_1:n | x_1:n)
    = Σ_{i=1}^{n} H(y_i | y_1:i−1) − Σ_{i=1}^{n} H(y_i | x_i)                    [chain rule; memoryless]
    ≤ Σ_{i=1}^{n} H(y_i) − Σ_{i=1}^{n} H(y_i | x_i) = Σ_{i=1}^{n} I(x_i; y_i)    [conditioning reduces entropy]
  with equality if the x_i are independent ⇒ the y_i are independent
We can maximize I(x_1:n; y_1:n) by maximizing each I(x_i; y_i) independently and taking the x_i to be i.i.d.
– We will concentrate on maximizing I(x;y) for a single channel use
Jan 2008
128
Capacity of Symmetric Channel
If the channel is weakly symmetric:
  I(x;y) = H(y) − H(y|x) = H(y) − H(Q_{1,:}) ≤ log|Y| − H(Q_{1,:})
with equality iff the input distribution is uniform
∴ the information capacity of a weakly symmetric channel is log|Y| − H(Q_{1,:})
For a binary symmetric channel (BSC): |Y| = 2, H(Q_{1,:}) = H(f) ⇒ I(x;y) ≤ 1 − H(f)
∴ the information capacity of a BSC is 1 − H(f)
[Diagram: BSC with crossover probability f]
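A quick numerical illustration of the BSC capacity formula C = 1 − H(f) (a small Python sketch of mine, not part of the notes):

```python
import math

def h2(p):
    """Binary entropy function H(p) in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(f):
    """Capacity 1 - H(f) of a binary symmetric channel with crossover probability f."""
    return 1 - h2(f)

print(bsc_capacity(0.1))   # ≈ 0.531 bits per use
print(bsc_capacity(0.5))   # 0.0 — the output is then independent of the input
```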
Jan 2008
129
Binary Erasure Channel (BEC)
  Q = [1−f  f  0 ;  0  f  1−f]
[Diagram: inputs 0,1; outputs 0,?,1; erasure probability f]
  I(x;y) = H(x) − H(x|y)
         = H(x) − p(y=0)×0 − p(y=?)H(x) − p(y=1)×0      [H(x|y) = 0 when y=0 or y=1]
         = H(x) − H(x)f = (1−f)H(x)
         ≤ 1 − f      [since the max value of H(x) = 1, with equality when x is uniform]
Since a fraction f of the bits are lost, the capacity is only 1−f, and this is achieved when x is uniform
Jan 2008
130
Asymmetric Channel Capacity
Channel with inputs/outputs {0,1,2}:  Q = [1−f  f  0 ;  f  1−f  0 ;  0  0  1]
(0 and 1 are confused with probability f; 2 is always received correctly)
Let p_x = [a  a  1−2a]ᵀ ⇒ p_y = Qᵀ p_x = p_x
  H(y) = −2a log a − (1−2a) log(1−2a)
  H(y|x) = 2aH(f) + (1−2a)H(1) = 2aH(f)
To find C, maximize I(x;y) = H(y) − H(y|x):
  I = −2a log a − (1−2a) log(1−2a) − 2aH(f)
  dI/da = −2 log e − 2 log a + 2 log e + 2 log(1−2a) − 2H(f) = 0      [note: d(log x)/dx = x⁻¹ log e]
  ⇒ log((1−2a)/a) = log(a⁻¹ − 2) = H(f)  ⇒  a = (2 + 2^H(f))⁻¹
  ⇒ C = −2a log(a·2^H(f)) − (1−2a) log(1−2a) = −log(1−2a)
Examples:
  f = 0 ⇒ H(f) = 0 ⇒ a = 1/3 ⇒ C = log 3 = 1.585 bits/use
  f = ½ ⇒ H(f) = 1 ⇒ a = ¼ ⇒ C = log 2 = 1 bit/use
131
Lecture 10
• Jointly Typical Sets
• Joint AEP
• Channel Coding Theorem
– Random Coding
– Jointly typical decoding
Jan 2008
132
Significance of Mutual Information
• Consider blocks of n symbols:  x_1:n → Noisy Channel → y_1:n
[Diagram: 2^nH(x) typical input sequences, each mapping to about 2^nH(y|x) typical outputs, out of 2^nH(y) typical output sequences in total]
– An average input sequence x_1:n corresponds to about 2^nH(y|x) typical output sequences
– There are a total of 2^nH(y) typical output sequences
– For nearly error-free transmission, we select a number of input sequences whose corresponding sets of output sequences hardly overlap
– The maximum number of distinct sets of output sequences is 2^n(H(y)−H(y|x)) = 2^nI(y;x)
Channel Coding Theorem: for large n we can transmit at any rate < C with negligible errors
Jan 2008
133
Jointly Typical Set
(x,y) is the i.i.d. sequence {x_i, y_i} for 1 ≤ i ≤ n
– Prob of a particular sequence is p(x,y) = Π_{i=1}^{n} p(x_i, y_i)
– E −log p(x,y) = n E −log p(x_i, y_i) = nH(x,y)
– Jointly Typical set:
    J_ε^(n) = { x,y ∈ (XY)ⁿ : |−n⁻¹ log p(x) − H(x)| < ε,
                               |−n⁻¹ log p(y) − H(y)| < ε,
                               |−n⁻¹ log p(x,y) − H(x,y)| < ε }
Jan 2008
134
Jointly Typical Example
Binary Symmetric Channel: f = 0.2, p_x = (0.75  0.25)ᵀ
  p_y = (0.65  0.35)ᵀ,   P_xy = [0.6  0.15 ;  0.05  0.2]
[Diagram: BSC with crossover probability f]
Jointly typical example (for any ε):
  x = 11111000000000000000
  y = 11110111000000000000
all combinations of x and y have exactly the right frequencies
Jan 2008
135
Jointly Typical Diagram
[Diagram: each point defines both an x sequence and a y sequence. Of the 2^(n log|X|) × 2^(n log|Y|) sequence pairs, an inner rectangle of 2^(n(H(x)+H(y))) pairs are typical in x and in y (but not necessarily jointly typical), and within it the dots are the roughly 2^nH(x,y) jointly typical pairs.]
• There are about 2^nH(x) typical x's in all
• Each typical y is jointly typical with about 2^nH(x|y) of these typical x's
• The jointly typical pairs are a fraction 2^−nI(x;y) of the inner rectangle
Channel Code: choose x's whose jointly typical y's don't overlap; use joint typicality for decoding
136
Joint Typical Set Properties
1. Indiv Prob:  x,y ∈ J_ε^(n) ⇒ log p(x,y) = −nH(x,y) ± nε
2. Total Prob:  p(x,y ∈ J_ε^(n)) > 1 − ε  for n > N_ε
3. Size:        (1−ε) 2^(n(H(x,y)−ε)) < |J_ε^(n)| ≤ 2^(n(H(x,y)+ε))   (lower bound for n > N_ε)
Proof 2 (uses the weak law of large numbers):
  Choose N₁ such that ∀n > N₁, p(|−n⁻¹ log p(x) − H(x)| > ε) < ε/3
  Similarly choose N₂, N₃ for the other two conditions and set N_ε = max(N₁, N₂, N₃)
Proof 3:
  1 − ε < Σ_{x,y∈J_ε^(n)} p(x,y) ≤ |J_ε^(n)| max_{x,y∈J_ε^(n)} p(x,y) = |J_ε^(n)| 2^(−n(H(x,y)−ε))    [n > N_ε]
  1 ≥ Σ_{x,y∈J_ε^(n)} p(x,y) ≥ |J_ε^(n)| min_{x,y∈J_ε^(n)} p(x,y) = |J_ε^(n)| 2^(−n(H(x,y)+ε))        [∀n]
137
Joint AEP
If px'=px and py' =py with x' and y' independent:
(1 − ε) 2−n(I(x;y)+3ε) ≤ p(x', y' ∈ Jε(n)) ≤ 2−n(I(x;y)−3ε)      for n > Nε
Proof: |J| × (Min Prob) ≤ Total Prob ≤ |J| × (Max Prob)
p(x', y' ∈ Jε(n)) = ∑x',y'∈Jε(n) p(x') p(y')
p(x', y' ∈ Jε(n)) ≤ |Jε(n)| maxx',y'∈Jε(n) p(x') p(y')
                  ≤ 2n(H(x,y)+ε) 2−n(H(x)−ε) 2−n(H(y)−ε) = 2−n(I(x;y)−3ε)
p(x', y' ∈ Jε(n)) ≥ |Jε(n)| minx',y'∈Jε(n) p(x') p(y')
                  ≥ (1 − ε) 2n(H(x,y)−ε) 2−n(H(x)+ε) 2−n(H(y)+ε) = (1 − ε) 2−n(I(x;y)+3ε)      for n > Nε
Jan 2008
138
Channel Codes
• Assume Discrete Memoryless Channel with known Qy|x
• An (M,n) code is
– A fixed set of M codewords x(w)∈Xn for w=1:M
– A deterministic decoder g(y)∈1:M
• Error probability
λw = p(g(y(w)) ≠ w) = ∑y∈Yn p(y | x(w)) δg(y)≠w
– Maximum Error Probability: λ(n) = max1≤w≤M λw
– Average Error probability: Pe(n) = M–1 ∑w=1..M λw
δC = 1 if C is true or 0 if it is false
Jan 2008
139
Achievable Code Rates
• The rate of an (M,n) code: R=(log M)/n bits/transmission
• A rate, R, is achievable if
– ∃ a sequence of (⌈2nR⌉, n) codes for n = 1, 2, …
– max prob of error λ(n) → 0 as n → ∞
– Note: we will normally write (2nR, n) to mean (⌈2nR⌉, n)
• The capacity of a DMC is the sup of all achievable rates
• Max error probability for a code is hard to determine
– Shannon's idea: consider a randomly chosen code
– show the expected average error probability is small
– show this means ∃ at least one code with small max error prob
– sadly it doesn't tell you how to find the code
Jan 2008
140
Channel Coding Theorem
• A rate R is achievable if R<C and not achievable if R>C
– If R<C, ∃ a sequence of (2nR,n) codes with max prob of error
λ(n)→0 as n→∞
– Any sequence of (2nR,n) codes with max prob of error λ(n)→0 as
n→∞ must have R ≤ C
A very counterintuitive result:
Despite channel errors you can
get arbitrarily low bit error
rates provided that R<C
[Plot: bit error probability vs rate R/C; the region below the curve Pb = H–1(1 − C/R) for R > C is impossible, everything above is achievable]
Jan 2008
141
Lecture 11
• Channel Coding Theorem
Jan 2008
142
Channel Coding Principle
• Consider blocks of n symbols:
[Block diagram: x1:n → Noisy Channel → y1:n; about 2nH(x) typical input sequences, 2nH(y|x) typical outputs per input, 2nH(y) typical outputs in total]
– An average input sequence x1:n corresponds to about
2nH(y|x) typical output sequences
– Random Codes: Choose 2nR random code vectors x(w)
• their typical output sequences are unlikely to overlap much.
– Joint Typical Decoding: A received vector y is very
likely to be in the typical output set of the transmitted
x(w) and no others. Decode as this w.
Channel Coding Theorem: for large n, can transmit at any rate R < C with negligible errors
Jan 2008
143
Random (2nR,n) Code
• Choose ε ≈ error prob, joint typicality ⇒ Nε , choose n>Nε
• Choose px so that I(x ;y)=C, the information capacity
• Use px to choose a code C with random x(w)∈Xn, w=1:2nR
– the receiver knows this code and also the transition matrix Q
• Assume (for now) the message W∈1:2nR is uniformly distributed
• If received value is y; decode the message by seeing how many
x(w)’s are jointly typical with y
– if x(k) is the only one then k is the decoded message
– if there are 0 or ≥2 possible k’s then 1 is the decoded message
– we calculate error probability averaged over all C and all W
p(E) = ∑C p(C) 2–nR ∑w=1..2nR λw(C) = 2–nR ∑w=1..2nR ∑C p(C) λw(C) (a)= ∑C p(C) λ1(C) = p(E | w = 1)
(a) since the error averaged over all possible codes is independent of w
Jan 2008
144
Channel Coding Principle
• Assume we transmit x(1) and receive y
• Define the events e w = {x ( w), y ∈ J ε( n) } for w ∈ 1 : 2 nR
[Diagram: x1:n → Noisy Channel → y1:n]
• We have an error if either e1 is false or ew is true for some w ≥ 2
• The x(w) for w ≠ 1 are independent of x(1) and hence also independent of y.
  So p(ew true) < 2−n(I(x;y)−3ε) for any w ≠ 1      (Joint AEP)
Jan 2008
145
Error Probability for Random Code
• We transmit x(1), receive y and decode using joint typicality
• We have an error if either e1 false or ew true for w ≥ 2
p(E | W = 1) = p(ē1 ∪ e2 ∪ e3 ∪ … ∪ e2nR) ≤ p(ē1) + ∑w=2..2nR p(ew)      p(A∪B) ≤ p(A)+p(B)
             ≤ ε + ∑w=2..2nR 2−n(I(x;y)−3ε) = ε + 2nR 2−n(I(x;y)−3ε)      (1) Joint typicality, (2) Joint AEP
             ≤ ε + 2−n(I(x;y)−R−3ε) ≤ 2ε      for R < C − 3ε and n > −(log ε)/(C − R − 3ε)
• Since the average of Pe(n) over all codes is ≤ 2ε there must be at least
  one code for which this is true: this code has 2–nR ∑w λw ≤ 2ε ♦
• Now throw away the worst half of the codewords; the remaining
  ones must all have λw ≤ 4ε. The resultant code has rate R – n–1 ≅ R. ♦
♦ = proved on next page
Jan 2008
146
Code Selection & Expurgation
• Since average of Pe(n) over all codes is ≤ 2ε there must be at least
one code for which this is true.
Proof:
2ε ≥ K–1 ∑i=1..K Pe,i(n) ≥ K–1 ∑i=1..K mini(Pe,i(n)) = mini(Pe,i(n))      K = num of codes
• Expurgation: Throw away the worst half of the codewords; the
  remaining ones must all have λw ≤ 4ε.
Proof: Assume the λw are in descending order
2ε ≥ M–1 ∑w=1..M λw ≥ M–1 ∑w=1..½M λw ≥ M–1 ∑w=1..½M λ½M = ½ λ½M
⇒ λ½M ≤ 4ε ⇒ λw ≤ 4ε ∀ w > ½M
M′ = ½ × 2nR messages in n channel uses ⇒ R′ = n–1 log M′ = R − n–1
Jan 2008
147
Summary of Procedure
• Given R´<C, choose ε<¼(C–R´) and set R=R´+ε ⇒ R<C – 3ε
• Set n = max{ Nε, −(log ε)/(C − R − 3ε), ε–1 }      see (a),(b),(c) below
• Find the optimum pX so that I(x ; y) = C
• Choosing codewords randomly (using pX) and using joint typicality (a)
as the decoder, construct codes with 2nR codewords
• Since average of Pe(n) over all codes is ≤ 2ε there must be at least
one code for which this is true. Find it by exhaustive search.
• Throw away the worst half of the codewords. Now the worst
  codeword has an error prob ≤ 4ε (b) with rate R´ = R – n–1 > R – ε (c)
• The resultant code transmits at a rate R´ with an error probability
that can be made as small as desired (but n unnecessarily large).
Note: ε determines both error probability and closeness to capacity
Jan 2008
148
Lecture 12
• Converse of Channel Coding Theorem
– Cannot achieve R>C
– Minimum bit-error rate
• Capacity with feedback
– no gain but simpler encode/decode
• Joint Source-Channel Coding
– No point for a DMC
Jan 2008
149
Converse of Coding Theorem
• Fano’s Inequality: if Pe(n) is error prob when estimating w from y,
H(w | y) ≤ 1 + Pe(n) log|W| = 1 + nRPe(n)
• Hence nR = H(w) = H(w | y) + I(w ; y)      Definition of I
            ≤ H(w | y) + I(x(w) ; y)          Markov: w → x → y → ŵ
            ≤ 1 + nRPe(n) + I(x ; y)           Fano
            ≤ 1 + nRPe(n) + nC                 n-use DMC capacity
  ⇒ Pe(n) ≥ (R − C − n–1)/R → (R − C)/R as n → ∞
Hence for large n, Pe(n) has a lower bound of (R–C)/R if w equiprobable
– If R>C was achievable for small n, it could be achieved also for large n by
concatenation. Hence it cannot be achieved for any n.
Jan 2008
150
Minimum Bit-error Rate
Suppose
– v1:nR is i.i.d. bits with H(vi) = 1
– The bit-error rate is Pb ≜ Ei p(v̂i ≠ vi) = Ei{p(ei)}
Then nC (a)≥ I(x1:n ; y1:n) (b)≥ I(v1:nR ; v̂1:nR) = H(v1:nR) − H(v1:nR | v̂1:nR)
   = nR − ∑i=1..nR H(vi | v̂1:nR, v1:i–1) (c)≥ nR − ∑i=1..nR H(vi | v̂i) = nR(1 − Ei{H(vi | v̂i)})
   (d)= nR(1 − Ei{H(ei | v̂i)}) (c)≥ nR(1 − Ei{H(ei)}) (e)≥ nR(1 − H(Ei Pb,i)) = nR(1 − H(Pb))
Hence R ≤ C(1 − H(Pb))–1 ⇔ Pb ≥ H–1(1 − C/R)
(a) n-use capacity
(b) Data processing theorem
(c) Conditioning reduces entropy
(d) ei = vi ⊕ v̂i
(e) Jensen: E H(x) ≤ H(E x)
[Plot: bit error probability vs rate R/C showing the bound Pb ≥ H–1(1 − C/R)]
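The bound Pb ≥ H–1(1 − C/R) can be evaluated by numerically inverting the binary entropy function on [0, ½]; a minimal sketch (bisection, my own code):

```python
import math

def Hb(p):
    return 0.0 if p in (0.0, 1.0) else -p*math.log2(p) - (1 - p)*math.log2(1 - p)

def Hb_inv(h):
    """Smallest p in [0, 0.5] with Hb(p) = h, by bisection (Hb is increasing there)."""
    lo, hi = 0.0, 0.5
    for _ in range(60):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if Hb(mid) < h else (lo, mid)
    return (lo + hi) / 2

def min_bit_error(R_over_C):
    return 0.0 if R_over_C <= 1 else Hb_inv(1 - 1/R_over_C)

for r in (0.5, 1.0, 1.5, 2.0, 3.0):
    print(r, round(min_bit_error(r), 4))   # 0, 0, ~0.06, ~0.11, ~0.17
```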
Jan 2008
151
Channel with Feedback
[Diagram: w ∈ 1:2nR → Encoder (which also sees y1:i–1) → xi → Noisy Channel → yi → Decoder → ŵ]
• Assume error-free feedback: does it increase capacity ?
• A (2nR,n) feedback code is
– A sequence of mappings xi = xi(w,y1:i–1) for i=1:n
– A decoding function wˆ = g (y 1:n )
• A rate R is achievable if ∃ a sequence of (2nR,n) feedback
  codes such that Pe(n) = P(ŵ ≠ w) → 0 as n → ∞
• Feedback capacity, CFB≥C, is the sup of achievable rates
Jan 2008
152
Capacity with Feedback
[Diagram: w ∈ 1:2nR → Encoder (using y1:i–1) → xi → Noisy Channel → yi → Decoder → ŵ]
I(w ; y) = H(y) − H(y | w)
         = H(y) − ∑i=1..n H(yi | y1:i–1, w)               Chain rule
         = H(y) − ∑i=1..n H(yi | y1:i–1, w, xi)            since xi = xi(w, y1:i–1)
         = H(y) − ∑i=1..n H(yi | xi)                       since yi only directly depends on xi
         ≤ ∑i=1..n H(yi) − ∑i=1..n H(yi | xi) = ∑i=1..n I(xi ; yi) ≤ nC      cond reduces ent
Hence nR = H(w) = H(w | y) + I(w ; y) ≤ 1 + nRPe(n) + nC      Fano
⇒ Pe(n) ≥ (R − C − n–1)/R
The DMC does not benefit from feedback: CFB = C
Jan 2008
153
Example: BEC with feedback
• Capacity is 1–f
• Encode algorithm
– If yi=?, retransmit bit i
[BEC diagram: x = 0,1 → y = 0,?,1; correct with prob 1–f, erased to ? with prob f]
– Average number of transmissions per bit: 1 + f + f2 + … = (1 − f)–1
• Average number of bits per transmission = 1–f
• Capacity unchanged but encode/decode algorithm much
simpler.
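A quick simulation of the retransmission scheme (a sketch; the parameter values are my own) confirms that it delivers about 1 − f information bits per channel use:

```python
import random

def bec_feedback_rate(f, n_bits=100_000, seed=0):
    """Send n_bits over a BEC(f), retransmitting each bit until it gets through."""
    rng = random.Random(seed)
    uses = 0
    for _ in range(n_bits):
        uses += 1
        while rng.random() < f:      # erased -> retransmit
            uses += 1
    return n_bits / uses             # bits delivered per channel use

for f in (0.1, 0.5):
    print(f, round(bec_feedback_rate(f), 3))   # ~0.9 and ~0.5, i.e. 1 - f
```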
Jan 2008
154
Joint Source-Channel Coding
[Diagram: v1:n → Encoder → x1:n → Noisy Channel → y1:n → Decoder → v̂1:n]
• Assume vi satisfies AEP and |V|<∞
– Examples: i.i.d.; markov; stationary ergodic
• Capacity of DMC channel is C
– if time-varying: C = lim n −1 I ( x; y )
n →∞
• Joint Source Channel Coding Theorem:
∃ codes with Pe(n) = P(v̂1:n ≠ v1:n) → 0 as n → ∞ iff H(V) < C ♦
– errors arise from incorrect (i) encoding of V or (ii) decoding of Y
• Important result: source coding and channel coding
might as well be done separately since same capacity
♦ = proved on next page
Jan 2008
155
Source-Channel Proof (⇐)
• For n>Nε there are only 2n(H(V)+ε) v’s in the typical
set: encode using n(H(V)+ε) bits
– encoder error < ε
• Transmit with error prob less than ε so long as
H(V)+ ε < C
• Total error prob < 2ε
Jan 2008
156
Source-Channel Proof (⇒)
[Diagram: v1:n → Encoder → x1:n → Noisy Channel → y1:n → Decoder → v̂1:n]
Fano's Inequality: H(v | v̂) ≤ 1 + Pe(n) n log|V|
H(V) ≤ n–1 H(v1:n)                                            entropy rate of stationary process
     = n–1 H(v1:n | v̂1:n) + n–1 I(v1:n ; v̂1:n)               definition of I
     ≤ n–1(1 + Pe(n) n log|V|) + n–1 I(x1:n ; y1:n)            Fano + Data Proc Inequ
     ≤ n–1 + Pe(n) log|V| + C                                  Memoryless channel
Let n → ∞ ⇒ Pe(n) → 0 ⇒ H(V) ≤ C
Jan 2008
157
Separation Theorem
• For a (time-varying) DMC we can design the
source encoder and the channel coder
separately and still get optimum performance
• Not true for
– Correlated Channel and Source
– Multiple access with correlated sources
• Multiple sources transmitting to a single receiver
– Broadcasting channels
• one source transmitting possibly different information to
multiple receivers
Jan 2008
158
Lecture 13
• Continuous Random Variables
• Differential Entropy
– can be negative
– not a measure of the information in x
– coordinate-dependent
• Maximum entropy distributions
– Uniform over a finite range
– Gaussian if the variance is fixed
Jan 2008
159
Continuous Random Variables
Changing Variables
• pdf: fx(x)      CDF: Fx(x) = ∫−∞..x fx(t) dt
• For g(x) monotonic: y = g(x) ⇔ x = g–1(y)
  Fy(y) = Fx(g–1(y)) or 1 − Fx(g–1(y)) according to the slope of g(x)
  fy(y) = dFy(y)/dy = fx(g–1(y)) dg–1(y)/dy = fx(x) dx/dy      where x = g–1(y)
• Examples:
  Suppose fx(x) = 0.5 for x ∈ (0,2) ⇒ Fx(x) = 0.5x
  (a) y = 4x ⇒ x = 0.25y ⇒ fy(y) = 0.5 × 0.25 = 0.125 for y ∈ (0,8)
  (b) z = x4 ⇒ x = z¼ ⇒ fz(z) = 0.5 × ¼ z−¾ = 0.125 z−¾ for z ∈ (0,16)
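Example (b) can be checked by Monte Carlo (a sketch, my own code): draw x ~ U(0,2), form z = x4 and compare a histogram of z with the predicted density 0.125 z−¾:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=1_000_000)
z = x**4

bins = np.linspace(0.5, 15.5, 16)                # a few bins well inside (0,16)
hist, edges = np.histogram(z, bins=bins, density=True)
centres = 0.5 * (edges[:-1] + edges[1:])
predicted = 0.125 * centres**(-0.75)
print(np.round(hist, 4))
print(np.round(predicted, 4))                    # the two rows should agree closely
```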
Jan 2008
160
Joint Distributions
Joint pdf: fx,y(x, y)
Marginal pdf: fx(x) = ∫−∞..∞ fx,y(x, y) dy
Independence: x, y independent ⇔ fx,y(x, y) = fx(x) fy(y)
Conditional pdf: fx|y(x) = fx,y(x, y) / fy(y)
Example:
  fx,y = 1 for y ∈ (0,1), x ∈ (y, y+1)
  fx|y = 1 for x ∈ (y, y+1)
  fy|x = 1/min(x, 2−x) for y ∈ (max(0, x−1), min(x,1))
[Sketches of the region of support and of the marginals fx and fy omitted]
Jan 2008
161
Quantised Random Variables
• Given a continuous pdf f(x), we divide the range of x into
bins of width Δ
– For each i, ∃ xi with f(xi)Δ = ∫iΔ..(i+1)Δ f(x) dx      (mean value theorem)
• Define a discrete random variable Y
– Y = {xi} and py = {f(xi)Δ}
– Scaled, quantised version of f(x) with slightly unevenly spaced xi
• H(y) = −∑ f(xi)Δ log(f(xi)Δ) = −log Δ − ∑ f(xi) log(f(xi)) Δ
       → −log Δ − ∫−∞..∞ f(x) log f(x) dx = −log Δ + h(x)      as Δ → 0
• Differential entropy: h(x) = −∫−∞..∞ fx(x) log fx(x) dx
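The relation H(y) ≈ −log Δ + h(x) can be illustrated numerically; a sketch for a standard Gaussian (my own code), where h(x) = ½ log2(2πe) ≈ 2.05 bits:

```python
import numpy as np

sigma, delta = 1.0, 0.01
x = np.arange(-8, 8, delta)                     # bin centres, width delta
f = np.exp(-0.5*(x/sigma)**2) / np.sqrt(2*np.pi*sigma**2)
p = f * delta                                   # p_y(x_i) ~= f(x_i) * delta
p = p[p > 0]

H_y = -(p * np.log2(p)).sum()                   # entropy of the quantised variable
h_x = 0.5 * np.log2(2*np.pi*np.e*sigma**2)      # differential entropy
print(H_y, -np.log2(delta) + h_x)               # both ~= 8.69 bits
```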
Jan 2008
162
Differential Entropy
Differential Entropy: h(x) ≜ −∫−∞..∞ fx(x) log fx(x) dx = E[−log fx(x)]
Bad News:
– h(x) does not give the amount of information in x
– h(x) is not necessarily positive
– h(x) changes with a change of coordinate system
Good News:
– h1(x)–h2(x) does compare the uncertainty of two continuous
random variables provided they are quantised to the same
precision
– Relative Entropy and Mutual Information still work fine
– If the range of x is normalized to 1 and then x is quantised to n
bits, the entropy of the resultant discrete random variable is
approximately h(x)+n
Jan 2008
163
Differential Entropy Examples
• Uniform Distribution: x ~ U (a, b)
– f(x) = (b − a)–1 for x ∈ (a, b) and f(x) = 0 elsewhere
– h(x) = −∫a..b (b − a)–1 log(b − a)–1 dx = log(b − a)
– Note that h(x) < 0 if (b − a) < 1
• Gaussian Distribution: x ~ N(μ, σ2)
– f(x) = (2πσ2)−½ exp(−½(x − μ)2σ−2)
– h(x) = −(log e) ∫−∞..∞ f(x) ln f(x) dx
       = −(log e) ∫−∞..∞ f(x) (−½ ln(2πσ2) − ½(x − μ)2σ−2) dx
       = ½(log e)(ln(2πσ2) + σ−2 E(x − μ)2)
       = ½(log e)(ln(2πσ2) + 1) = ½ log(2πeσ2) ≅ log(4.1σ) bits
Jan 2008
164
Multivariate Gaussian
Given mean, m, and symmetric +ve definite covariance matrix K,
x1:n ~ N(m, K) ⇔ f(x) = |2πK|−½ exp(−½(x − m)T K–1 (x − m))
h(f) = −(log e) ∫ f(x) × (−½(x − m)T K–1(x − m) − ½ ln|2πK|) dx
     = ½ log(e) × (ln|2πK| + E[(x − m)T K–1 (x − m)])
     = ½ log(e) × (ln|2πK| + E tr((x − m)(x − m)T K–1))
     = ½ log(e) × (ln|2πK| + tr(E[(x − m)(x − m)T] K–1))
     = ½ log(e) × (ln|2πK| + tr(KK–1)) = ½ log(e) × (ln|2πK| + n)
     = ½ n log(e) + ½ log(|2πK|)
     = ½ log(|2πeK|) bits
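A numerical sketch of the result h = ½ log2|2πeK| (my own code), checked against the sum of the marginal entropies for a correlated pair (the gap between the two is the mutual information):

```python
import numpy as np

K = np.array([[2.0, 1.0],
              [1.0, 2.0]])
n = K.shape[0]

sign, logdet = np.linalg.slogdet(2*np.pi*np.e*K)
h_joint = 0.5 * logdet / np.log(2)              # 0.5*log2|2*pi*e*K| in bits

h_marginals = sum(0.5*np.log2(2*np.pi*np.e*K[i, i]) for i in range(n))
print(h_joint, h_marginals)                     # joint is smaller; gap = I(x1;x2)
```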
Jan 2008
165
Other Differential Quantities
Joint Differential Entropy
h(x, y) = −∫∫x,y fx,y(x, y) log fx,y(x, y) dx dy = E[−log fx,y(x, y)]
Conditional Differential Entropy
h(x | y) = −∫∫x,y fx,y(x, y) log fx|y(x | y) dx dy = h(x, y) − h(y)
Mutual Information
I(x ; y) = ∫∫x,y fx,y(x, y) log [ fx,y(x, y) / (fx(x) fy(y)) ] dx dy = h(x) + h(y) − h(x, y)
Relative Differential Entropy of two pdf's:
D(f || g) = ∫ f(x) log( f(x)/g(x) ) dx = −hf(x) − Ef log g(x)
(a) must have f(x)=0 ⇒ g(x)=0
(b) continuity ⇒ 0 log(0/0) = 0
Jan 2008
166
Differential Entropy Properties
Chain Rules
h ( x , y ) = h ( x ) + h (y | x ) = h (y ) + h ( x | y )
I ( x , y ; z ) = I ( x ; z ) + I (y ; z | x )
Information Inequality: D( f || g ) ≥ 0
Proof: Define S = {x : f (x) > 0}
−D(f || g) = ∫x∈S f(x) log( g(x)/f(x) ) dx = E[ log( g(x)/f(x) ) ]
           ≤ log E[ g(x)/f(x) ] = log( ∫S f(x) (g(x)/f(x)) dx )      Jensen + log() is concave
           = log( ∫S g(x) dx ) ≤ log 1 = 0
all the same as for H()
Jan 2008
167
Information Inequality Corollaries
Mutual Information ≥ 0
I (x ; y ) = D( f x ,y || f x f y ) ≥ 0
Conditioning reduces Entropy
h( x ) − h( x | y ) = I ( x ; y ) ≥ 0
Independence Bound
h(x1:n) = ∑i=1..n h(xi | x1:i–1) ≤ ∑i=1..n h(xi)
all the same as for H()
Jan 2008
168
Change of Variable
Change Variable: y = g (x )
fy(y) = fx(g–1(y)) dg–1(y)/dy                                  from earlier
h(y) = −E log(fy(y)) = −E log(fx(g–1(y))) − E log|dx/dy|
     = −E log(fx(x)) − E log|dx/dy| = h(x) − E log|dx/dy|
Examples:
– Translation: y = x + a ⇒ dy/dx = 1 ⇒ h(y) = h(x)
– Scaling: y = cx ⇒ dy/dx = c ⇒ h(y) = h(x) − log|c–1| = h(x) + log|c|
– Vector version: y1:n = Ax1:n ⇒ h(y) = h(x) + log|det(A)|
not the same as for H()
Jan 2008
169
Concavity & Convexity
• Differential Entropy:
– h(x) is a concave function of fx(x) ⇒ ∃ a maximum
• Mutual Information:
– I(x ; y) is a concave function of fx (x) for fixed fy|x (y)
– I(x ; y) is a convex function of fy|x (y) for fixed fx (x)
Proofs:
Exactly the same as for the discrete case: pz = [1–λ, λ]T
[Diagrams: a binary switch z with pz = [1–λ, λ]T selects between two input sources (giving H(x) ≥ H(x|z) and I(x ;y) ≥ I(x ;y|z)) or between two channels f(u|x) and f(v|x) (giving I(x ;y) ≤ I(x ;y|z))]
Jan 2008
170
Uniform Distribution Entropy
What distribution over the finite range (a,b) maximizes the
entropy ?
Answer: A uniform distribution u(x)=(b–a)–1
Proof:
Suppose f(x) is a distribution for x∈(a,b)
0 ≤ D( f || u ) = −h f (x ) − E f log u (x )
= −h f (x ) + log(b − a )
⇒ h f (x ) ≤ log(b − a)
Jan 2008
171
Maximum Entropy Distribution
What zero-mean distribution maximizes the entropy on
(–∞, ∞)n for a given covariance matrix K ?
Answer: A multivariate Gaussian φ(x) = |2πK|−½ exp(−½ xT K–1 x)
Proof:
0 ≤ D(f || φ) = −hf(x) − Ef log φ(x)
⇒ hf(x) ≤ −(log e) Ef(−½ ln|2πK| − ½ xT K–1 x)
         = ½(log e)(ln|2πK| + tr(Ef[xxT] K–1))      Ef xxT = K
         = ½(log e)(ln|2πK| + tr(I))                 tr(I) = n = ln(en)
         = ½ log(|2πeK|) = hφ(x)
Since translation doesn’t affect h(X),
we can assume zero-mean w.l.o.g.
Jan 2008
172
Lecture 14
• Discrete-time Gaussian Channel Capacity
– Sphere packing
• Continuous Typical Set and AEP
• Gaussian Channel Coding Theorem
• Bandlimited Gaussian Channel
– Shannon Capacity
– Channel Codes
Jan 2008
173
Capacity of Gaussian Channel
Discrete-time channel: yi = xi + zi
[Diagram: Xi → + (noise Zi added) → Yi]
– Zero-mean Gaussian i.i.d. zi ~ N(0, N)
– Average power constraint n–1 ∑i=1..n xi2 ≤ P
Ey2 = E(x + z)2 = Ex2 + 2E(x)E(z) + Ez2 ≤ P + N      X, Z indep and EZ = 0
Information Capacity
– Define information capacity: C = max Ex2≤P I(x ; y)
I(x ; y) = h(y) − h(y | x) = h(y) − h(x + z | x)
         = h(y) − h(z | x) = h(y) − h(z)              (a) Translation independence
         ≤ ½ log 2πe(P + N) − ½ log 2πeN              Gaussian limit, with equality when x ~ N(0, P)
         = ½ log(1 + PN–1) ≈ (1/6) [(P+N)/N]dB
The optimal input is Gaussian & the worst noise is Gaussian
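A quick numerical sketch (my own code) of C = ½ log(1 + P/N) and of the rule of thumb that it is about one sixth of the SNR expressed in dB:

```python
import math

def gaussian_capacity(P, N):
    return 0.5 * math.log2(1 + P/N)          # bits per channel use

for snr_db in (0, 10, 20, 30):
    P_over_N = 10**(snr_db/10)
    C = gaussian_capacity(P_over_N, 1.0)
    approx = 10*math.log10(1 + P_over_N) / 6.02   # [(P+N)/N]dB / 6.02
    print(snr_db, round(C, 3), round(approx, 3))  # the two columns agree
```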
Jan 2008
174
Gaussian Channel Code Rate
[Diagram: w ∈ 1:2nR → Encoder → x → + (noise z) → y → Decoder → ŵ ∈ 0:2nR]
• An (M,n) code for a Gaussian Channel with power
constraint is
– A set of M codewords x(w)∈Xn for w=1:M with x(w)Tx(w) ≤ nP ∀w
– A deterministic decoder g(y)∈0:M where 0 denotes failure
– Errors: codeword: λi;  max: λ(n) = maxi λi;  average: Pe(n)
• Rate R is achievable if ∃ seq of (2nR, n) codes with λ(n) → 0 as n → ∞
• Theorem: R achievable iff R < C = ½ log(1 + PN–1) ♦
♦ = proved on next pages
Jan 2008
175
Sphere Packing
• Each transmitted xi is received as a
probabilistic cloud yi
– cloud 'radius' = √Var(y | x) = √(nN)
• Energy of yi constrained to n(P+N) so clouds must fit into a hypersphere of radius √(n(P+N))
• Volume of hypersphere ∝ rn
• Max number of non-overlapping clouds: (nP + nN)½n / (nN)½n = 2½ n log(1+PN–1)
• Max rate is ½log(1+PN–1)
Jan 2008
176
Continuous AEP
Typical Set: Continuous distribution, discrete time i.i.d.
For any ε>0 and any n, the typical set with respect to f(x) is
Tε(n) = { x ∈ Sn : |−n–1 log f(x) − h(x)| ≤ ε }
where S is the support of f ⇔ {x : f(x) > 0}
f(x) = ∏i=1..n f(xi) since the xi are independent
h(x) = E[−log f(x)] = −n–1 E log f(x)
Typical Set Properties
1. p(x ∈ Tε(n)) > 1 − ε for n > Nε                                        Proof: WLLN
2. (1 − ε) 2n(h(x)−ε) ≤ Vol(Tε(n)) ≤ 2n(h(x)+ε)  (lower bound for n > Nε)   Proof: integrate max/min prob
where Vol(A) = ∫x∈A dx
Jan 2008
177
Continuous AEP Proof
Proof 1: By weak law of large numbers
−n–1 log f(x1:n) = −n–1 ∑i=1..n log f(xi) →(prob) E[−log f(x)] = h(x)
Reminder: xn →(prob) y ⇒ ∀ε > 0, ∃Nε such that ∀n > Nε, P(|xn − y| > ε) < ε
Proof 2a:
1 − ε ≤ ∫Tε(n) f(x) dx                                         for n > Nε  (Property 1)
      ≤ 2−n(h(x)−ε) ∫Tε(n) dx = 2−n(h(x)−ε) Vol(Tε(n))          max f(x) within T
Proof 2b:
1 = ∫Sn f(x) dx ≥ ∫Tε(n) f(x) dx
  ≥ 2−n(h(x)+ε) ∫Tε(n) dx = 2−n(h(x)+ε) Vol(Tε(n))              min f(x) within T
Jan 2008
178
Jointly Typical Set
Jointly Typical: xi,yi i.i.d from ℜ2 with fx,y (xi,yi)
Jε(n) = { x, y ∈ ℜ2n : |−n–1 log fX(x) − h(x)| < ε,
                        |−n–1 log fY(y) − h(y)| < ε,
                        |−n–1 log fX,Y(x, y) − h(x, y)| < ε }
Properties:
1. Indiv p.d.:   x, y ∈ Jε(n) ⇒ log fx,y(x, y) = −nh(x, y) ± nε
2. Total Prob:   p(x, y ∈ Jε(n)) > 1 − ε for n > Nε
3. Size:         (1 − ε) 2n(h(x,y)−ε) ≤ Vol(Jε(n)) ≤ 2n(h(x,y)+ε)      (lower bound for n > Nε)
4. Indep x′,y′:  (1 − ε) 2−n(I(x;y)+3ε) ≤ p(x', y' ∈ Jε(n)) ≤ 2−n(I(x;y)−3ε)      for n > Nε
Proof of 4: Integrate max/min f(x´, y´) = f(x´)f(y´), then use the known bounds on Vol(J)
Jan 2008
179
Gaussian Channel Coding Theorem
R is achievable iff R < C = ½ log(1 + PN–1)
Proof (⇐): Choose ε > 0
Random codebook: xw ∈ ℜn for w = 1:2nR where the xw are i.i.d. ~ N(0, P − ε)
Use joint typicality decoding
Errors: 1. Power too big: p(xTx > nP) → 0, so ≤ ε for n > Mε
        2. y not J.T. with x: p(x, y ∉ Jε(n)) < ε for n > Nε
        3. another x J.T. with y: ∑j=2..2nR p(xj, y ∈ Jε(n)) ≤ (2nR − 1) × 2−n(I(x;y)−3ε)
Total Err: Pε ≤ ε + ε + 2−n(I(x;y)−R−3ε) ≤ 3ε for large n if R < I(x ;y) − 3ε
Expurgation: Remove half of the codebook*: now the max error λ(n) < 6ε
We have constructed a code achieving rate R − n–1
*: Worst codebook half includes any xi with xiTxi > nP ⇒ λi = 1
Jan 2008
180
Gaussian Channel Coding Theorem
Proof (⇒): Assume Pe(n) → 0 and n–1xTx ≤ P for each x(w)
nR = H(w) = I(w ; y1:n) + H(w | y1:n)
   ≤ I(x1:n ; y1:n) + H(w | y1:n)                         Data Proc Inequal
   = h(y1:n) − h(y1:n | x1:n) + H(w | y1:n)
   ≤ ∑i=1..n h(yi) − h(z1:n) + H(w | y1:n)                 Indep Bound + Translation
   ≤ ∑i=1..n I(xi ; yi) + 1 + nRPe(n)                       Z i.i.d + Fano, |W| = 2nR
   ≤ ∑i=1..n ½ log(1 + PN–1) + 1 + nRPe(n)                  max Information Capacity
⇒ R ≤ ½ log(1 + PN–1) + n–1 + RPe(n) → ½ log(1 + PN–1)
Jan 2008
181
Bandlimited Channel
• Channel bandlimited to f∈(–W,W) and signal duration T
• Nyquist: Signal is completely defined by 2WT samples
• Can represent as a n=2WT-dimensional vector space with
prolate spheroidal functions as an orthonormal basis
– white noise with double-sided p.s.d. ½N0 becomes
i.i.d gaussian N(0,½N0) added to each coefficient
– Signal power constraint = P ⇒ Signal energy ≤ PT
• Energy constraint per coefficient: n–1xTx<PT/2WT=½W–1P
• Capacity: C = ½ log(1 + ½W–1P (½N0)–1) × 2W = W log(1 + N0–1W–1P) bits/second
Compare discrete time version: ½log(1+PN–1) bits per channel use
Jan 2008
182
Shannon Capacity
Bandwidth = W Hz, Signal variance = ½W −1 P, Noise variance = ½ N 0
Signal Power = P, Noise power = N 0W , Min bit energy = Eb = PC −1
Capacity = C = W log(1 + PN 0−1W −1 ) bits per second
Define W0 = PN0–1 ⇒ C/W0 = (W/W0) log(1 + (W/W0)–1) → log e as W → ∞
⇒ Eb N0–1 = (C/W0)–1 → ln 2 = −1.6 dB as W → ∞
• For fixed power, high bandwidth is better – Ultra wideband
[Plots: for constant P, Capacity/W0 vs Bandwidth/W0 rising towards log e ≈ 1.44; for constant C, Eb/N0 (dB) vs Bandwidth/Capacity (Hz/bps) falling towards −1.6 dB]
2
Jan 2008
183
Practical Channel Codes
Code Classification:
– Very good: arbitrarily small error up to the capacity
– Good: arbitrarily small error up to less than capacity
– Bad: arbitrarily small error only at zero rate (or never)
Coding Theorem: Nearly all codes are very good
– but nearly all codes need encode/decode computation ∝ 2n
Practical Good Codes:
– Practical: Computation & memory ∝ nk for some k
– Convolution Codes: convolve bit stream with a filter
• Concatenation, Interleaving, turbo codes (1993)
– Block codes: encode a block at a time
• Hamming, BCH, Reed-Solomon, Low-density parity check (1995)
Jan 2008
184
Channel Code Performance
• Power Limited
–
–
–
–
High bandwidth
Spacecraft, Pagers
Use QPSK/4-QAM
Block/Convolution Codes
• Bandwidth Limited
– Modems, DVB, Mobile
phones
– 16-QAM to 256-QAM
– Convolution Codes
• Value of 1 dB for space
– Better range, lifetime,
weight, bit rate
– $80 M (1999)
Diagram from “An overview of Communications” by John R Barry, TTC = Japanese Technology Telecommunications Council
Jan 2008
185
Lecture 15
• Parallel Gaussian Channels
– Waterfilling
• Gaussian Channel with Feedback
Jan 2008
186
Parallel Gaussian Channels
• n gaussian channels (or one channel n times)
– e.g. digital audio, digital TV, Broadband ADSL
• Noise is independent zi ~ N(0,Ni)
• Average Power constraint ExTx ≤ P
• Information Capacity: C = max f(x): Ef xTx ≤ P I(x ; y)
• R < C ⇔ R achievable
– proof as before
• What is the optimal f(x)?
[Diagram: xi → + (noise zi) → yi for i = 1..n]
Jan 2008
187
Parallel Gaussian: Max Capacity
Need to find f(x): C = max f(x): Ef xTx ≤ P I(x ; y)
I(x ; y) = h(y) − h(y | x) = h(y) − h(z | x)                               Translation invariance
         = h(y) − h(z) = h(y) − ∑i=1..n h(zi)                               x, z indep; zi indep
         (a)≤ ∑i=1..n (h(yi) − h(zi)) (b)≤ ∑i=1..n ½ log(1 + Pi Ni–1)        (a) indep bound; (b) capacity limit
Equality when: (a) yi indep ⇒ xi indep; (b) xi ~ N(0, Pi)
We need to find the Pi that maximise ∑i=1..n ½ log(1 + Pi Ni–1)
Jan 2008
188
Parallel Gaussian: Optimal Powers
We need to find the Pi that maximise log(e) ∑i=1..n ½ ln(1 + Pi Ni–1)
– subject to power constraint ∑i=1..n Pi = P
– use a Lagrange multiplier
J = ∑i=1..n ½ ln(1 + Pi Ni–1) − λ ∑i=1..n Pi
∂J/∂Pi = ½(Pi + Ni)–1 − λ = 0 ⇒ Pi + Ni = ½λ–1
Also ∑i=1..n Pi = P ⇒ λ = ½n(P + ∑i=1..n Ni)–1, i.e. ½λ–1 = n–1(P + ∑i=1..n Ni)
Water Filling: put most power into the least noisy channels to make the power + noise, Pi + Ni = ½λ–1, equal in each channel
[Diagram: water-filling — noise levels N1, N2, N3 topped up with powers P1, P2, P3 to a common level ½λ–1]
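A minimal water-filling sketch (my own implementation): it finds the common level ½λ–1, setting Pi = 0 on channels whose noise already exceeds the level, as described on the next slide.

```python
import numpy as np

def waterfill(noise, total_power):
    """Powers Pi >= 0 maximising sum(0.5*log2(1+Pi/Ni)) with sum(Pi) = total_power."""
    N = np.asarray(noise, dtype=float)
    active = np.ones(len(N), dtype=bool)
    while True:
        level = (total_power + N[active].sum()) / active.sum()   # candidate level
        if np.all(N[active] < level):
            break
        active &= N < level            # drop channels whose noise exceeds the level
    P = np.where(active, np.maximum(level - N, 0.0), 0.0)
    return P, level

noise = np.array([1.0, 4.0, 9.0])
P, level = waterfill(noise, total_power=5.0)
print(P, level)                                          # most power to the quietest channel
print(0.5*np.log2(1 + P/noise).sum(), "bits/use total")
```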
Jan 2008
189
Very Noisy Channels
• Must have Pj ≥ 0 ∀j
• If ½λ–1 < Nj then set Pj = 0 and recalculate λ
[Diagram: water-filling with the noisiest channel (N3 above the water level ½λ–1) allocated no power]
Kuhn Tucker Conditions: (not examinable)
– Max f(x) subject to Ax + b = 0 and gi(x) ≥ 0 for i ∈ 1:M with f, gi concave
– set J(x) = f(x) − ∑i=1..M μi gi(x) − λT Ax
– Solution x0, λ, μi iff ∇J(x0) = 0, Ax0 + b = 0, gi(x0) ≥ 0, μi ≥ 0, μi gi(x0) = 0
Jan 2008
190
Correlated Noise
• Suppose y = x + z where E zzT = KZ and E xxT = KX
• We want to find KX to maximize capacity subject to the power constraint: E ∑i=1..n xi2 ≤ nP ⇔ tr(KX) ≤ nP
– Find noise eigenvectors: KZ = QDQT with QQT = I
– Now QTy = QTx + QTz = QTx + w where E wwT = QT E[zzT] Q = QT KZ Q = D is diagonal
  ⇒ the wi are now independent
– Power constraint is unchanged: tr(QT KX Q) = tr(KX QQT) = tr(KX)
– Choose QT KX Q = LI − D where L = P + n–1 tr(D) ⇒ KX = Q(LI − D)QT
Jan 2008
191
Power Spectrum Water Filling
• If z is from a stationary process then diag(D) → the noise power spectrum as n → ∞
– To achieve capacity use waterfilling on the noise power spectrum N(f):
  P = ∫−W..W max(L − N(f), 0) df
  C = ∫−W..W ½ log(1 + max(L − N(f), 0)/N(f)) df
[Plots: noise spectrum N(f) with water level L, and the resulting transmit spectrum max(L − N(f), 0)]
Jan 2008
192
Gaussian Channel + Feedback
Does Feedback add capacity?
– White noise – No
– Coloured noise – Not much
[Diagram: w → Coder (using y1:i–1) → xi → + (noise zi) → yi;  xi = xi(w, y1:i–1)]
I(w ; y) = h(y) − h(y | w) = h(y) − ∑i=1..n h(yi | w, y1:i–1)           Chain rule
         = h(y) − ∑i=1..n h(yi | w, y1:i–1, x1:i, z1:i–1)                xi = xi(w, y1:i–1), z = y − x
         = h(y) − ∑i=1..n h(zi | w, y1:i–1, x1:i, z1:i–1)                z = y − x and translation invariance
         = h(y) − ∑i=1..n h(zi | z1:i–1)                                 zi depends only on z1:i–1
         = h(y) − h(z)                                                   Chain rule, h(z) = ½ log|2πeKz| bits
         = ½ log(|Ky| / |Kz|)
⇒ maximize I(w ; y) by maximizing h(y) ⇒ y gaussian
⇒ we can take z and x = y – z jointly gaussian
Jan 2008
193
Gaussian Feedback Coder
x and z jointly gaussian ⇒ x = Bz + v(w)
[Diagram: w → Coder → xi → + (noise zi) → yi;  xi = xi(w, y1:i–1)]
where v is indep of z and B is strictly lower triangular since xi is indep of zj for j ≥ i.
y = x + z = (B + I)z + v
Ky = E yyT = E((B + I)zzT(B + I)T + vvT) = (B + I)Kz(B + I)T + Kv
Kx = E xxT = E(BzzTBT + vvT) = BKzBT + Kv
Capacity: Cn,FB = max Kv,B ½ n–1 log(|Ky|/|Kz|) = max Kv,B ½ n–1 log(|(B + I)Kz(B + I)T + Kv| / |Kz|)
subject to tr(Kx) = tr(BKzBT + Kv) ≤ nP
hard to solve
Jan 2008
194
Gaussian Feedback: Toy Example
n = 2, P = 2, KZ = [2 1; 1 2], B = [0 0; b 0], x = Bz + v
Goal: Maximize det(KY) = det((B + I)Kz(B + I)T + Kv) w.r.t. Kv and b
Subject to: power constraint tr(BKzBT + Kv) ≤ 4; Kv must be positive definite
Note the power split: Px1 = Pv1, Px2 = Pv2 + Pbz1
Solution (via numerical search):
b = 0:    det(Ky) = 16    C = 0.604 bits
b = 0.69: det(Ky) = 20.7  C = 0.697 bits
Feedback increases C by 16%
[Plots: det(Ky) vs b, and the split of the transmit power between v1, v2 and bz1 vs b]
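The numerical search can be reproduced with a short sketch (my own code, not the original search). For each b, maximising det((B+I)Kz(B+I)T + Kv) over positive semi-definite Kv with tr(Kv) ≤ 4 − 2b2 is itself a water-filling problem on the eigenvalues of (B+I)Kz(B+I)T, so a 1-D grid over b suffices:

```python
import numpy as np

Kz = np.array([[2.0, 1.0], [1.0, 2.0]])
n, P_total = 2, 2.0

def best_detKy(b):
    Bp = np.array([[1.0, 0.0], [b, 1.0]])        # B + I
    M = Bp @ Kz @ Bp.T
    budget = n*P_total - b*b*Kz[0, 0]            # tr(B Kz B^T) = b^2 * Kz[0,0]
    lam = np.linalg.eigvalsh(M)
    order = np.argsort(lam)
    pwr = np.zeros(n)
    for k in range(n, 0, -1):                    # waterfill budget over the k smallest eigenvalues
        idx = order[:k]
        level = (budget + lam[idx].sum()) / k
        if np.all(lam[idx] <= level):
            pwr[idx] = level - lam[idx]
            break
    return np.prod(lam + pwr)

bs = np.linspace(-1.4, 1.4, 561)
dets = [best_detKy(b) for b in bs]
i = int(np.argmax(dets))
C = 0.25*np.log2(dets[i]/np.linalg.det(Kz))      # ½ n^-1 log |Ky|/|Kz|
print(round(bs[i], 2), round(dets[i], 1), round(C, 3))   # ~0.69, ~20.7, ~0.70
```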
Jan 2008
195
Max Benefit of Feedback: Lemmas
Lemma 1: Kx+z+Kx–z = 2(Kx+Kz)
Kx+z + Kx−z = E(x + z)(x + z)T + E(x − z)(x − z)T
            = E(xxT + xzT + zxT + zzT) + E(xxT − xzT − zxT + zzT)
            = 2E xxT + 2E zzT = 2(Kx + Kz)
Lemma 2: If F, G are +ve definite then det(F + G) ≥ det(F)
Consider two indep random vectors f ~ N(0, F), g ~ N(0, G)
½ log((2πe)n det(F + G)) = h(f + g)
                         ≥ h(f + g | g) = h(f | g)          Conditioning reduces h(); translation invariance
                         = h(f) = ½ log((2πe)n det(F))       f, g independent
Hence: det(2(Kx+Kz)) = det(Kx+z+Kx–z) ≥ det(Kx+z) = det(Ky)
Jan 2008
196
Maximum Benefit of Feedback
[Diagram: w → Coder → xi → + (noise zi) → yi]
Cn,FB ≤ max tr(Kx)≤nP ½ n–1 log(det(Ky)/det(Kz))                   no constraint on B
      ≤ max tr(Kx)≤nP ½ n–1 log(det(2(Kx + Kz))/det(Kz))            Lemmas 1 & 2: det(2(Kx + Kz)) ≥ det(Ky)
      = max tr(Kx)≤nP ½ n–1 log(2n det(Kx + Kz)/det(Kz))             det(kA) = kn det(A)
      = ½ + max tr(Kx)≤nP ½ n–1 log(det(Kx + Kz)/det(Kz))            Ky = Kx + Kz if no feedback
      = ½ + Cn bits/transmission
Having feedback adds at most ½ bit per transmission
Jan 2008
197
Lecture 16
• Lossy Coding
• Rate Distortion
– Bernoulli Source
– Gaussian Source
• Channel/Source Coding Duality
Jan 2008
198
Lossy Coding
[Diagram: x1:n → Encoder f( ) → f(x1:n) ∈ 1:2nR → Decoder g( ) → x̂1:n]
Distortion function: d(x, x̂) ≥ 0
– sequences: d(x, x̂) ≜ n–1 ∑i=1..n d(xi, x̂i)
– examples: (i) dS(x, x̂) = (x − x̂)2   (ii) dH(x, x̂) = 0 if x = x̂, 1 if x ≠ x̂
Distortion of Code fn(), gn(): D = Ex∈Xn d(x, x̂) = E d(x, g(f(x)))
Rate distortion pair (R,D) is achievable for source X if
∃ a sequence fn() and gn() such that limn→∞ Ex∈Xn d(x, gn(fn(x))) ≤ D
n
Jan 2008
199
Rate Distortion Function
Rate Distortion function for {xi} where p(x) is known is
R( D) = inf R such that (R,D) is achievable
Theorem: R(D) = min I(x ; x̂) over all p(x, x̂) such that:
(a) p(x) is correct
(b) Ex,x̂ d(x, x̂) ≤ D
– this expression is the Information Rate Distortion function for X
We will prove this next lecture
If D=0 then we have R(D)=I(x ; x)=H(x)
Jan 2008
200
R(D) bound for Bernoulli Source
Bernoulli: X = [0,1], pX = [1−p, p], assume p ≤ ½
– Hamming Distance: d(x, x̂) = x ⊕ x̂
– If D ≥ p, R(D) = 0 since we can set g( ) ≡ 0
– For D < p ≤ ½, if E d(x, x̂) ≤ D then
  I(x ; x̂) = H(x) − H(x | x̂)
           = H(p) − H(x ⊕ x̂ | x̂)           ⊕ is one-to-one
           ≥ H(p) − H(x ⊕ x̂)                Conditioning reduces entropy
           ≥ H(p) − H(D)                     since p(x ⊕ x̂ = 1) ≤ D and so for D ≤ ½, H(x ⊕ x̂) ≤ H(D) as H(p) is monotonic
Hence R(D) ≥ H(p) − H(D)
[Plot: R(D) vs D for p = 0.2 and p = 0.5]
Jan 2008
201
R(D) for Bernoulli source
We know optimum satisfies R(D) ≥ H(p) – H(D)
– We show we can find a p(xˆ , x ) that attains this.
– Peculiarly, we consider a channel with xˆ as the
input and error probability D
[Test-channel diagram: a BSC with crossover D from x̂ ~ [1−r, r] to x ~ [1−p, p]]
Now choose r to give x the correct probabilities:
r(1 − D) + (1 − r)D = p ⇒ r = (p − D)(1 − 2D)–1
Now I(x ; x̂) = H(x) − H(x | x̂) = H(p) − H(D) and p(x ≠ x̂) = D
Hence R(D) = H(p) – H(D)
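A small sketch (my own code) tabulating R(D) = H(p) − H(D) and the test-channel parameter r:

```python
import math

def Hb(p):
    return 0.0 if p in (0.0, 1.0) else -p*math.log2(p) - (1 - p)*math.log2(1 - p)

def bernoulli_RD(p, D):
    """Assumes p <= 1/2, as on the slide."""
    if D >= p:
        return 0.0, None
    r = (p - D) / (1 - 2*D)            # p(xhat = 1) in the test channel
    return Hb(p) - Hb(D), r

for D in (0.0, 0.05, 0.1, 0.2):
    R, r = bernoulli_RD(0.25, D)
    print(D, round(R, 3), r)
```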
Jan 2008
202
R(D) bound for Gaussian Source
• Assume X ~ N (0, σ 2 ) and d ( x, xˆ ) = ( x − xˆ ) 2
• Want to minimize I (x ; xˆ ) subject to E (x − xˆ ) 2 ≤ D
I(x ; x̂) = h(x) − h(x | x̂)
         = ½ log 2πeσ2 − h(x − x̂ | x̂)                  Translation Invariance
         ≥ ½ log 2πeσ2 − h(x − x̂)                       Conditioning reduces entropy
         ≥ ½ log 2πeσ2 − ½ log(2πe Var(x − x̂))          Gauss maximizes entropy for given covariance
         ≥ ½ log 2πeσ2 − ½ log 2πeD                      since Var(x − x̂) ≤ E(x − x̂)2 ≤ D
I(x ; x̂) ≥ max(½ log(σ2/D), 0)                           I(x ;y) always positive
Jan 2008
203
R(D) for Gaussian Source
To show that we can find a p ( xˆ , x) that achieves the bound, we
construct a test channel that introduces distortion D<σ2
[Test-channel diagram: x̂ ~ N(0, σ2−D) → + (noise z ~ N(0,D)) → x ~ N(0, σ2)]
I(x ; x̂) = h(x) − h(x | x̂)
         = ½ log 2πeσ2 − h(x − x̂ | x̂)
         = ½ log 2πeσ2 − h(z | x̂)
         = ½ log(σ2/D)
⇒ D(R) = (σ/2R)2
[Plot: R(D) vs D/σ2 showing the achievable region above the curve and the impossible region below]
Jan 2008
204
Lloyd Algorithm
Problem: Find optimum quantization levels for Gaussian pdf
a. Bin boundaries are midway between quantization levels
b. Each quantization level equals the mean value of its own bin
Lloyd algorithm: Pick random quantization levels then apply
conditions (a) and (b) in turn until convergence.
[Plots: quantization levels and bin boundaries vs iteration (solid lines are bin boundaries; initial levels uniform in [–1,+1]), and MSE vs iteration]
Best mean sq error for 8 levels = 0.0345σ2. Predicted D(R) = (σ/8)2 = 0.0156σ2
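A sample-based sketch of the Lloyd iteration for 8 levels on N(0,1) (my own code; a sample-based update behaves like the pdf-based one used on the slide for large sample sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)                 # samples of N(0,1)
levels = np.linspace(-1.0, 1.0, 8)               # initial levels uniform in [-1,1]

for _ in range(100):
    edges = 0.5*(levels[:-1] + levels[1:])       # (a) boundaries midway between levels
    bins = np.digitize(x, edges)
    levels = np.array([x[bins == k].mean() for k in range(8)])   # (b) level = mean of its bin

edges = 0.5*(levels[:-1] + levels[1:])
mse = np.mean((x - levels[np.digitize(x, edges)])**2)
print(np.round(levels, 3))
print(round(mse, 4))                             # ~0.034 sigma^2, cf. D(R) = 0.0156 sigma^2
```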
Jan 2008
205
Vector Quantization
To get D(R), you have to quantize many values together
– True even if the values are independent
[Scatter plots of quantization cells for two Gaussian variables (one quadrant shown): independent scalar quantization gives D = 0.035, vector quantization gives D = 0.028]
– Independent quantization puts dense levels in low prob areas
– Vector quantization is better (even more so if correlated)
Jan 2008
206
Multiple Gaussian Variables
• Assume x1:n are independent gaussian sources with
different variances. How should we apportion the
available total distortion between the sources?
• Assume xi ~ N(0, σi2) and d(x, x̂) = n–1 (x − x̂)T(x − x̂) ≤ D
I(x1:n ; x̂1:n) ≥ ∑i=1..n I(xi ; x̂i)                               Mut Info Independence Bound for independent xi
              ≥ ∑i=1..n R(Di) = ∑i=1..n max(½ log(σi2/Di), 0)       R(D) for individual Gaussian
We must find the Di that minimize ∑i=1..n max(½ log(σi2/Di), 0)
R(D) for individual
Gaussian
Jan 2008
207
Reverse Waterfilling
Minimize ∑i=1..n max(½ log(σi2/Di), 0) subject to ∑i=1..n Di ≤ nD
Use a Lagrange multiplier:
J = ∑i=1..n ½ log(σi2/Di) + λ ∑i=1..n Di
∂J/∂Di = −½Di–1 + λ = 0 ⇒ Di = ½λ–1 = D0
∑i=1..n Di = nD0 = nD ⇒ D0 = D
Choose Ri = ½ log(σi2/D) for equal distortion in every component
• If σi2 < D then set Ri = 0 and increase D0 to maintain the average distortion equal to D
• If the xi are correlated then reverse waterfill the eigenvectors of the correlation matrix
[Diagram: reverse water-filling — each component coded at Ri = ½ log σi2 − ½ log D0, with Ri = 0 for components whose σi2 falls below D0]
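A sketch of reverse water-filling (my own code): given per-component variances and a target average distortion D, it finds D0 and the rates Ri.

```python
import numpy as np

def reverse_waterfill(variances, D):
    """Di = min(D0, sigma_i^2) with mean(Di) = D;  Ri = max(0.5*log2(sigma_i^2/D0), 0)."""
    v = np.asarray(variances, dtype=float)
    n = len(v)
    if D >= v.mean():                          # enough distortion allowed: send nothing
        return v.max(), v.copy(), np.zeros(n)
    order = np.argsort(v)
    for k in range(n):                         # k = number of components coded at rate 0
        D0 = (n*D - v[order[:k]].sum()) / (n - k)
        if (k == 0 or v[order[k-1]] <= D0) and D0 <= v[order[k:]].min():
            break
    Di = np.minimum(D0, v)
    Ri = np.maximum(0.5*np.log2(v/D0), 0.0)
    return D0, Di, Ri

D0, Di, Ri = reverse_waterfill([0.1, 1.0, 4.0], D=0.5)
print(round(D0, 3), Di, np.round(Ri, 3))       # the low-variance component gets R = 0
```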
Jan 2008
208
Channel/Source Coding Duality
• Channel Coding
– Find codes separated enough to
give non-overlapping output
images.
– Image size = channel noise
– The maximum number (highest
rate) is when the images just fill
the sphere.
Noisy channel
Sphere Packing
Lossy encode
• Source Coding
– Find regions that cover the sphere
– Region size = allowed distortion
– The minimum number (lowest rate)
is when they just don’t overlap
Sphere Covering
Jan 2008
209
Channel Decoder as Source Coder
[Diagram: w ∈ 1:2nR → Encoder → y1:n ~ N(0, σ2−D) → + (noise z1:n ~ N(0,D)) → x1:n ~ N(0, σ2) → Decoder → ŵ ∈ 1:2nR]
• For R ≅ C = ½ log(1 + (σ2 − D)D–1), we can find a channel encoder/decoder so that p(ŵ ≠ w) < ε and E(xi − yi)2 = D
• Reverse the roles of encoder and decoder. Since p(w ≠ ŵ) < ε, also p(x̂ ≠ y) < ε and E(xi − x̂i)2 ≅ E(xi − yi)2 = D
[Diagram: roles swapped — source x1:n ~ N(0, σ2) → (channel) Decoder → ŵ ∈ 1:2nR → (channel) Encoder → x̂1:n]
We have encoded x at rate R=½log(σ2D–1) with distortion D
Jan 2008
210
High Dimensional Space
In n dimensions
– “Vol” of unit hypercube: 1
– “Vol” of unit-diameter hypersphere:
  Vn = π½n−½ (½n − ½)! / n!   (n odd)
     = π½n 2−n / (½n)!         (n even)
– "Area" of unit-diameter hypersphere: An = d/dr[(2r)n Vn] at r = ½, = 2nVn
  n     Vn        An
  1     1         2
  2     0.79      3.14
  3     0.52      3.14
  4     0.31      2.47
  10    2.5e–3    5e–2
  100   1.9e–70   3.7e–68
– >63% of Vn is in the shell (1 − n–1)R ≤ r ≤ R
  Proof: (1 − n–1)n < e–1 = 0.3679 (→ e–1 as n → ∞)
Most of n-dimensional
space is in the corners
Jan 2008
211
Lecture 17
• Rate Distortion Theorem
Jan 2008
212
Review
[Diagram: x1:n → Encoder fn( ) → f(x1:n) ∈ 1:2nR → Decoder gn( ) → x̂1:n]
Rate Distortion function for x whose px(x) is known is
R(D) = inf R such that ∃ fn, gn with limn→∞ Ex∈Xn d(x, x̂) ≤ D
Rate Distortion Theorem:
R(D) = min I(x ; x̂) over all p(x̂ | x) such that Ex,x̂ d(x, x̂) ≤ D
We will prove this theorem for discrete X and bounded d(x,y) ≤ dmax
R(D) curve depends on your choice of d(,)
[Plot: R(D) vs D/σ2 with the achievable region above the curve and the impossible region below]
Jan 2008
213
Rate Distortion Bound
Suppose we have found an encoder and decoder at rate R0
with expected distortion D for independent xi (worst case)
We want to prove that R0 ≥ R( D) = R(E d ( x; xˆ ) )
– We show first that R0 ≥ n–1 ∑i I(xi ; x̂i)
– We know that I(xi ; x̂i) ≥ R(E d(xi ; x̂i))                      Defn of R(D)
– and use convexity to show n–1 ∑i R(E d(xi ; x̂i)) ≥ R(E d(x; x̂)) = R(D)
We prove convexity first and then the rest
Jan 2008
214
Convexity of R(D)
If p1(x̂ | x) and p2(x̂ | x) are associated with (D1, R1) and (D2, R2) on the R(D) curve, we define
pλ(x̂ | x) = λ p1(x̂ | x) + (1 − λ) p2(x̂ | x)
Then Epλ d(x, x̂) = λD1 + (1 − λ)D2 = Dλ
R(Dλ) ≤ Ipλ(x ; x̂)                                        R(D) = min p(x̂|x) I(x ; x̂)
      ≤ λ Ip1(x ; x̂) + (1 − λ) Ip2(x ; x̂)                  I(x ; x̂) convex w.r.t. p(x̂ | x)
      = λR(D1) + (1 − λ)R(D2)                               p1 and p2 lie on the R(D) curve
[Plot: R(D) curve with the chord between (D1, R1) and (D2, R2) lying above the point (Dλ, R(Dλ))]
Jan 2008
215
Proof that R ≥ R(D)
nR0 ≥ H(x̂1:n) = H(x̂1:n) − H(x̂1:n | x1:n)                            Uniform bound; H(x̂ | x) = 0
    = I(x̂1:n ; x1:n)                                                  Definition of I(;)
    ≥ ∑i=1..n I(xi ; x̂i)                                              xi indep: Mut Inf Independence Bound
    ≥ ∑i=1..n R(E d(xi ; x̂i)) = n ∑i=1..n n–1 R(E d(xi ; x̂i))          definition of R
    ≥ nR(n–1 ∑i=1..n E d(xi ; x̂i)) = nR(E d(x1:n ; x̂1:n))              convexity; defn of vector d( )
    ≥ nR(D)                                                            original assumption that E(d) ≤ D and R(D) monotonically decreasing
Jan 2008
216
Rate Distortion Achievability
[Diagram: x1:n → Encoder fn( ) → f(x1:n) ∈ 1:2nR → Decoder gn( ) → x̂1:n]
We want to show that for any D, we can find an encoder
and decoder that compresses x1:n to nR(D) bits.
• pX is given
• Assume we know the p( xˆ | x ) that gives I (x ; xˆ ) = R( D)
• Random Decoder: Choose 2nR random xˆ i ~ p xˆ
– There must be at least one code that is as good as the average
• Encoder: Use joint typicality to design
– We show that there is almost always a suitable codeword
First define the typical set we will use, then prove two preliminary results.
Jan 2008
217
Distortion Typical Set
Distortion Typical: (x i , xˆ i )∈ X × Xˆ drawn i.i.d. ~p(x,xˆ)
Jd,ε(n) = { x, x̂ ∈ Xn × X̂n : |−n–1 log p(x) − H(x)| < ε,
                              |−n–1 log p(x̂) − H(x̂)| < ε,
                              |−n–1 log p(x, x̂) − H(x, x̂)| < ε,
                              |d(x, x̂) − E d(x, x̂)| < ε }      new condition
Properties:
1. Indiv p.d.:  x, x̂ ∈ Jd,ε(n) ⇒ log p(x, x̂) = −nH(x, x̂) ± nε
2. Total Prob:  p(x, x̂ ∈ Jd,ε(n)) > 1 − ε for n > Nε            weak law of large numbers; the d(xi, x̂i) are i.i.d.
Jan 2008
218
Conditional Probability Bound
Lemma: x, x̂ ∈ Jd,ε(n) ⇒ p(x̂) ≥ p(x̂ | x) 2−n(I(x;x̂)+3ε)
Proof:
p(x̂ | x) = p(x̂, x)/p(x) = p(x̂) p(x̂, x)/(p(x̂) p(x))              take max of top and min of bottom
         ≤ p(x̂) 2−n(H(x,x̂)−ε) / (2−n(H(x)+ε) 2−n(H(x̂)+ε))          bounds from defn of J
         = p(x̂) 2n(I(x;x̂)+3ε)                                      defn of I
Jan 2008
219
Curious but necessary Inequality
Lemma: u, v ∈ [0,1], m > 0 ⇒ (1 − uv) m ≤ 1 − u + e − vm
Proof:
u=0: e−vm ≥ 0 ⇒ (1 − 0)m ≤ 1 − 0 + e−vm
u=1: Define f(v) = e−v − 1 + v ⇒ f′(v) = 1 − e−v
     f(0) = 0 and f′(v) > 0 for v > 0 ⇒ f(v) ≥ 0 for v ∈ [0,1]
     Hence for v ∈ [0,1], 0 ≤ 1 − v ≤ e−v ⇒ (1 − v)m ≤ e−vm
0<u<1: Define gv(u) = (1 − uv)m
       ⇒ gv″(u) = m(m − 1)v2(1 − uv)m−2 ≥ 0 ⇒ gv(u) convex for u,v ∈ [0,1]
       (1 − uv)m = gv(u) ≤ (1 − u)gv(0) + u gv(1)            convexity for u,v ∈ [0,1]
                 = (1 − u)·1 + u(1 − v)m ≤ 1 − u + ue−vm ≤ 1 − u + e−vm
Jan 2008
220
Achievability of R(D): preliminaries
[Diagram: x1:n → Encoder fn( ) → f(x1:n) ∈ 1:2nR → Decoder gn( ) → x̂1:n]
• Choose D and find a p(x̂ | x) such that I(x ; x̂) = R(D); E d(x, x̂) ≤ D
  Choose δ > 0 and define px̂ = { p(x̂) = ∑x p(x) p(x̂ | x) }
• Decoder: For each w ∈ 1:2nR choose gn(w) = x̂w drawn i.i.d. ~ px̂n
• Encoder: fn(x) = min w such that (x, x̂w) ∈ Jd,ε(n), else 1 if no such w
• Expected Distortion: D̄ = Ex,g d(x, x̂)
  – over all input vectors x and all random decode functions, g
  – for large n we show D̄ ≤ D + δ, so there must be one good code
Jan 2008
221
Expected Distortion
We can divide the input vectors x into two categories:
a) if ∃w such that (x,xˆ w ) ∈ J d( n,ε) then d (x, xˆ w ) < D + ε
since E d (x, xˆ ) ≤ D
b) if no such w exists we must have d(x, x̂w) ≤ dmax since we are
   assuming that d(,) is bounded. Suppose the probability of this situation is Pe.
Hence D̄ = Ex,g d(x, x̂) ≤ (1 − Pe)(D + ε) + Pe dmax ≤ D + ε + Pe dmax
We need to show that the expected value of Pe is small
Jan 2008
222
Error Probability
Define the set of valid inputs for code g
V ( g ) = { x : ∃w with ( x, g ( w)) ∈ J d( n,ε) }
We have
Pe = ∑g p(g) ∑x∉V(g) p(x) = ∑x p(x) ∑g: x∉V(g) p(g)
Define K(x, x̂) = 1 if (x, x̂) ∈ Jd,ε(n), else 0
Prob that a random x̂ does not match x is 1 − ∑x̂ p(x̂) K(x, x̂)
Prob that an entire code does not match is (1 − ∑x̂ p(x̂) K(x, x̂))^(2nR)
Hence Pe = ∑x p(x) (1 − ∑x̂ p(x̂) K(x, x̂))^(2nR)
Jan 2008
223
Achievability for average code
Since x, x̂ ∈ Jd,ε(n) ⇒ p(x̂) ≥ p(x̂ | x) 2−n(I(x;x̂)+3ε)
Pe = ∑x p(x) (1 − ∑x̂ p(x̂) K(x, x̂))^(2nR)
   ≤ ∑x p(x) (1 − ∑x̂ p(x̂ | x) K(x, x̂) 2−n(I(x;x̂)+3ε))^(2nR)
Using (1 − uv)m ≤ 1 − u + e−vm with u = ∑x̂ p(x̂ | x) K(x, x̂);  v = 2−nI(x;x̂)−3nε;  m = 2nR
   ≤ ∑x p(x) (1 − ∑x̂ p(x̂ | x) K(x, x̂) + exp(−2−n(I(x;x̂)+3ε) 2nR))
Note: 0 ≤ u, v ≤ 1 as required
Jan 2008
224
Achievability for average code
Pe ≤ ∑x p(x) (1 − ∑x̂ p(x̂ | x) K(x, x̂) + exp(−2−n(I(x;x̂)+3ε) 2nR))
   = 1 + exp(−2−n(I(x;x̂)+3ε) 2nR) − ∑x,x̂ p(x) p(x̂ | x) K(x, x̂)       take out terms not involving x
   = 1 − ∑x,x̂ p(x, x̂) K(x, x̂) + exp(−2n(R−I(x;x̂)−3ε))
   = P((x, x̂) ∉ Jd,ε(n)) + exp(−2n(R−I(x;x̂)−3ε))
Both terms → 0 as n → ∞ provided R > I(x ; x̂) + 3ε, so Pe → 0 as n → ∞
Hence ∀δ > 0, D̄ = Ex,g d(x, x̂) can be made ≤ D + δ
Jan 2008
225
Achievability
Since ∀δ > 0, D̄ = Ex,g d(x, x̂) can be made ≤ D + δ,
there must be at least one g with Ex d(x, x̂) ≤ D + δ
Hence (R,D) is achievable for any R > R(D),
that is limn→∞ Ex1:n d(x, x̂) ≤ D
[Diagram: x1:n → Encoder fn( ) → f(x1:n) ∈ 1:2nR → Decoder gn( ) → x̂1:n]
In fact a stronger result is true:
∀δ > 0, D and R > R(D), ∃ fn, gn with p(d(x, x̂) ≤ D + δ) → 1 as n → ∞
Jan 2008
226
Lecture 18
• Revision Lecture
Jan 2008
227
Summary (1)
• Entropy:
H(x) = ∑x∈X p(x) × −log2 p(x) = E[−log2(px(x))]
– Bounds: 0 ≤ H(x) ≤ log|X|
– Conditioning reduces entropy: H(y | x) ≤ H(y)
– Chain Rule: H(x1:n) = ∑i=1..n H(xi | x1:i–1) ≤ ∑i=1..n H(xi)
              H(x1:n | y1:n) ≤ ∑i=1..n H(xi | yi)
• Relative Entropy: D(p || q) = Ep log(p(x)/q(x)) ≥ 0
Jan 2008
228
Summary (2)
• Mutual Information: I(y ; x) = H(y) − H(y | x) = H(x) + H(y) − H(x, y) = D(px,y || px py)
  [Venn diagram: H(x, y) comprises H(x | y), I(x ;y) and H(y | x)]
– Positive and Symmetrical: I(x ; y) = I(y ; x) ≥ 0
– x, y indep ⇔ H(x, y) = H(y) + H(x) ⇔ I(x ; y) = 0
– Chain Rule: I(x1:n ; y) = ∑i=1..n I(xi ; y | x1:i–1)
  xi independent ⇒ I(x1:n ; y1:n) ≥ ∑i=1..n I(xi ; yi)
  p(yi | x1:n, y1:i–1) = p(yi | xi) ⇒ I(x1:n ; y1:n) ≤ ∑i=1..n I(xi ; yi)
Jan 2008
229
Summary (3)
• Convexity: f’’ (x) ≥ 0 ⇒ f(x) convex ⇒ Ef(x) ≥ f(Ex)
– H(p) concave in p
– I(x ; y) concave in px for fixed py|x
– I(x ; y) convex in py|x for fixed px
• Markov:
x → y → z ⇔ p ( z | x, y ) = p ( z | y ) ⇔ I ( x ; z | y ) = 0
⇒ I(x ; y) ≥ I(x ; z) and I(x ; y) ≥ I(x; y | z)
• Fano: x → y → x̂ ⇒ p(x̂ ≠ x) ≥ (H(x | y) − 1) / log(|X| − 1)
• Entropy Rate: H(X) = limn→∞ n–1 H(x1:n)
– Stationary process: H(X) ≤ H(xn | x1:n–1)                     (= as n → ∞)
– Markov Process: H(X) = limn→∞ H(xn | xn–1)                    (= if stationary)
– Hidden Markov: H(yn | y1:n–1, x1) ≤ H(Y) ≤ H(yn | y1:n–1)     (= as n → ∞)
Jan 2008
230
Summary (4)
• Kraft: Uniquely Decodable ⇒ ∑i=1..|X| D−li ≤ 1 ⇒ ∃ prefix code
• Average Length: Uniquely Decodable ⇒ LC = E l(x) ≥ HD(x)
• Shannon-Fano: Top-down 50% splits. LSF ≤ HD(x) + 1
• Shannon: lx = ⌈−logD p(x)⌉, LS ≤ HD(x) + 1
• Huffman: Bottom-up design. Optimal. LH ≤ HD(x) + 1
– Designing with wrong probabilities, q ⇒ penalty of D(p||q)
– Long blocks disperse the 1-bit overhead
• Arithmetic Coding: C(xN) = ∑xiN < xN p(xiN)
– Long blocks reduce the 2-bit overhead
– Efficient algorithm without calculating all possible probabilities
– Can have adaptive probabilities
Jan 2008
231
Summary (5)
• Typical Set
– Individual Prob: x ∈ Tε(n) ⇒ log p(x) = −nH(x) ± nε
– Total Prob: p(x ∈ Tε(n)) > 1 − ε for n > Nε
– Size: (1 − ε) 2n(H(x)−ε) < |Tε(n)| ≤ 2n(H(x)+ε)      (lower bound for n > Nε)
– No other high probability set can be much smaller
• Asymptotic Equipartition Principle
– Almost all event sequences are equally surprising
Jan 2008
232
Summary (6)
• DMC Channel Capacity: C = maxpx I(x ; y)
• Coding Theorem
– Can achieve capacity: random codewords, joint typical decoding
– Cannot beat capacity: Fano
• Feedback doesn’t increase capacity but simplifies coder
• Joint Source-Channel Coding doesn’t increase capacity
Jan 2008
233
Summary (7)
• Differential Entropy: h(x) = E[−log fx(x)]
– Not necessarily positive
– h(x+a) = h(x),  h(ax) = h(x) + log|a|,  h(x|y) ≤ h(x)
– I(x; y) = h(x) + h(y) – h(x, y) ≥ 0,  D(f||g) = E log(f/g) ≥ 0
• Bounds:
– Finite range: Uniform distribution has max: h(x) = log(b–a)
– Fixed Covariance: Gaussian has max: h(x) = ½log((2πe)n|K|)
• Gaussian Channel
– Discrete Time: C = ½log(1+PN–1)
– Bandlimited: C = W log(1+PN0–1W–1)
  • For constant C: Eb N0–1 = PC–1N0–1 = (W/C)(2C/W − 1) → ln 2 = −1.6 dB as W → ∞
– Feedback: Adds at most ½ bit for coloured noise
Jan 2008
234
Summary (8)
• Parallel Gaussian Channels: Total power constraint ∑ Pi = P
– White noise: Waterfilling: Pi = max(P0 − Ni, 0)
– Correlated noise: Waterfill on the noise eigenvectors
• Rate Distortion: R(D) = min px̂|x s.t. E d(x,x̂) ≤ D I(x ; x̂)
– Bernoulli Source with Hamming d: R(D) = max(H(px)–H(D), 0)
– Gaussian Source with mean square d: R(D) = max(½log(σ2D–1), 0)
– Can encode at rate R: random decoder, joint typical encoder
– Can't encode below rate R: independence bound
• Lloyd Algorithm: iterative optimal vector quantization