Lecture 4

ENTROPY
Entropy measures the uncertainty in a random experiment.
Let X be a discrete random variable with range S_X = {1, 2, 3, ..., K}
and pmf p_k = P(X = k).
Let A ≡ {X = k}. The uncertainty of A is

I(X = k) = \ln \frac{1}{p_k}

Thus p_k → 1 ⇒ uncertainty = 0, and p_k → 0 ⇒ uncertainty → ∞.
Entropy of X ≡ expected uncertainty of outcomes
H_X = E[I_X] = -\sum_{k=1}^{K} p_k \ln p_k
• If log_2 is used, the units are bits; with ln, the units are nats.
• By convention, a term with P(X = x) = 0 contributes nothing to the sum: -0 log(0) ≡ 0 (just as a term with P(X = x) = 1 contributes -1 log(1) = 0).
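For illustration (a minimal Python sketch, not part of the lecture; the helper name entropy is my own), the definition can be computed directly, with the 0·log 0 convention handled by skipping zero-probability terms:

```python
import math

def entropy(pmf, base=2):
    """Entropy of a pmf (sequence of probabilities summing to 1).
    base=2 gives bits, base=math.e gives nats; terms with p = 0 are skipped,
    implementing the convention 0*log(0) = 0."""
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))           # 1.0 bit (fair coin)
print(entropy([0.5, 0.5], math.e))   # ~0.693 nats = ln 2
print(entropy([1.0, 0.0]))           # 0.0: a certain outcome carries no uncertainty
```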
For a binary random variable X ∈ {0, 1}, let p ≡ P(X = 1). Then

H_X = -(1 - p) \log(1 - p) - p \log p

H_X is maximum when p = 0.5 ↔ 0 and 1 are equally probable ↔ maximum uncertainty.
If p = 1 or p = 0, there is no uncertainty → H_X = 0.
[Figure: binary entropy function H(p) vs. p; image from Wikipedia, http://en.wikipedia.org/wiki/GNU_Free_Documentation_License]
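A short self-contained sketch (not part of the lecture) confirming the endpoints and the maximum of the binary entropy:

```python
import math

# Binary entropy H(p) = -(1-p)*log2(1-p) - p*log2(p), with 0*log2(0) taken as 0.
def h_binary(p):
    return -sum(x * math.log2(x) for x in (p, 1 - p) if x > 0)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, h_binary(p))
# H(0) = H(1) = 0 (no uncertainty); H(0.5) = 1 bit (the maximum).
```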
Example: let S_X = {00, 01, 11, 10}, all outcomes equally probable. Then

H_X = -\sum_{k=0}^{3} p_k \log_2 p_k = -\sum_{k=0}^{3} \frac{1}{4} \log_2 \frac{1}{4} = 2 \text{ bits}
If it is given that the first bit is 1, two equally probable outcomes remain:

H_{X \mid \text{1st bit} = 1} = -\frac{1}{2} \log_2 \frac{1}{2} - \frac{1}{2} \log_2 \frac{1}{2} = 1 \text{ bit}

(one term for each of the two remaining outcomes, 01 and 11).
In general, H_X of 2^n equally probable outcomes = n bits (e.g., n-bit equiprobable numbers → n bits).
As each bit is specified, H_X decreases by 1 bit. When all n bits are specified, H_X = 0.
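A minimal Python sketch (not from the lecture) illustrating both statements:

```python
import math

def entropy_bits(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# n-bit equiprobable numbers: 2**n outcomes, each with probability 2**-n -> n bits.
n = 4
print(entropy_bits([2**-n] * 2**n))              # 4.0
# Specifying the first bit leaves 2**(n-1) equally likely outcomes -> one bit less.
print(entropy_bits([2**-(n - 1)] * 2**(n - 1)))  # 3.0
```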
Relative Entropy:
Let p = (p_1, p_2, ..., p_K) and q = (q_1, q_2, ..., q_K) be two pmf's, with X ~ p and Y ~ q (K outcomes for both X and Y).
H(p; q) ≡ relative entropy of q with respect to p

H(p; q) = \sum_{k=1}^{K} p_k \ln \frac{p_k}{q_k} = -\sum_{k=1}^{K} p_k \ln q_k + \sum_{k=1}^{K} p_k \ln p_k = -\sum_{k=1}^{K} p_k \ln q_k - H_{X \sim p}
H  p; q   0
H  p ; q   0  pk  qk  k  1,.. . , K
H(p; q) is often used as a distance measure between probability distributions and is called the Kullback–Leibler distance.
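A minimal Python sketch (not from the lecture; the function name kl_divergence is my own) computing this quantity in nats:

```python
import math

def kl_divergence(p, q):
    """Relative entropy H(p;q) = sum_k p_k ln(p_k / q_k), in nats.
    Assumes q_k > 0 wherever p_k > 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.5, 0.3, 0.2]
q = [1/3, 1/3, 1/3]
print(kl_divergence(p, q))  # > 0
print(kl_divergence(p, p))  # 0.0: zero iff the two pmfs coincide
```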
To prove these assertions, use the inequality \ln x \le x - 1 (equivalently, \ln(1 + x) \le x), with equality iff x = 1. Applying it with x = q_k / p_k:

H(p; q) = \sum_{k=1}^{K} p_k \ln \frac{p_k}{q_k} \ge \sum_{k=1}^{K} p_k \left( 1 - \frac{q_k}{p_k} \right) = \sum_{k=1}^{K} p_k - \sum_{k=1}^{K} q_k = 0

⇒ H(p; q) ≥ 0

To get H(p; q) = 0, equality must hold in every term, i.e. \frac{q_k}{p_k} = 1 for k = 1, ..., K, i.e. q ≡ p.
If q_k = 1/K for all k, then

H(p; q) = \sum_{k=1}^{K} p_k \ln \frac{p_k}{1/K} = \ln K - H_{X \sim p} \ge 0

⇒ H_X ≤ \ln K, and H_X = \ln K iff p_k = 1/K for all k.
This is called maximum entropy (ME) or the minimum relative
entropy (MRE) situation.
Thus:

0 \le H_X \le \ln K

where H_X = 0 ⟺ only one possible outcome, and H_X = \ln K ⟺ K equally probable outcomes.
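A quick numerical check (a Python sketch, not from the lecture) that random pmfs respect these bounds and that the uniform pmf attains the upper one:

```python
import math, random

def entropy_nats(pmf):
    return -sum(p * math.log(p) for p in pmf if p > 0)

# Check 0 <= H_X <= ln K for a few random pmfs on K outcomes.
K = 5
for _ in range(3):
    w = [random.random() for _ in range(K)]
    p = [x / sum(w) for x in w]
    print(0.0 <= entropy_nats(p) <= math.log(K))   # True

# The uniform pmf attains the upper bound ln K exactly.
print(entropy_nats([1 / K] * K), math.log(K))
```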
Differential Entropy:
For a continuous random variable every individual outcome has probability zero, so each outcome is maximally uncertain
⇒ entropy cannot be defined as for discrete random variables.
Instead, differential entropy is used:

H_X = -\int_{-\infty}^{\infty} f_X(x) \ln f_X(x) \, dx = E[-\ln f_X(x)]

In fact, the integral extends only over the region where f_X(x) > 0, since f_X(x) \ln f_X(x) = 0 where f_X(x) = 0.
e.g. If

f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}

then

E[-\ln f_X(x)] = \ln \sqrt{2\pi\sigma^2} + E\left[ \frac{(x-\mu)^2}{2\sigma^2} \right]
= \ln \sqrt{2\pi\sigma^2} + \frac{\sigma^2}{2\sigma^2}
= \ln \sqrt{2\pi\sigma^2} + \frac{1}{2}
= \ln \sqrt{2\pi\sigma^2} + \ln \sqrt{e}
= \ln \sqrt{2\pi e \sigma^2}

since E[(x-\mu)^2] = \sigma^2. Hence

H_{X \sim \text{Gaussian}} = \frac{1}{2} \ln (2\pi e \sigma^2)

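A sketch (not from the lecture) comparing this closed form with a Monte Carlo estimate of E[-ln f_X(X)]; the sample size and parameters are arbitrary choices:

```python
import math, random

# Differential entropy of a Gaussian: closed form 0.5*ln(2*pi*e*sigma^2)
# versus a Monte Carlo estimate of E[-ln f_X(X)].
mu, sigma = 1.0, 2.0
closed_form = 0.5 * math.log(2 * math.pi * math.e * sigma**2)

def neg_log_pdf(x):
    return 0.5 * math.log(2 * math.pi * sigma**2) + (x - mu)**2 / (2 * sigma**2)

samples = [random.gauss(mu, sigma) for _ in range(200_000)]
mc_estimate = sum(neg_log_pdf(x) for x in samples) / len(samples)
print(closed_form, mc_estimate)  # the two values agree closely
```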
The relative entropy for continuous random variables X and Y is

H(f_X; f_Y) = \int_{-\infty}^{\infty} f_X(x) \ln \frac{f_X(x)}{f_Y(x)} \, dx
Information Theory
Let X be a random variable with S_X = {x_1, ..., x_K}.
Information about outcomes of X is to be sent over a channel.
[Block diagram: Source (X) → Channel → Receiver → Destination]
How can the outcomes {x_1, ..., x_K} be coded so that all the information is carried with maximal efficiency?
Best code → minimum expected codeword length.
The code must be instantaneously decodable, i.e. no codeword is a prefix of any other
→ construct a code tree.
e.g. S = {x_1, x_2, x_3, x_4, x_5} with a binary code tree (0/1 branch labels) giving:
x_1 = 00, x_2 = 01, x_3 = 10, x_4 = 110, x_5 = 111
If l_k = length of the codeword for x_k, the expected codeword length is

E[l_k] = \sum_{k=1}^{K} p(x_k) \, l_k
For instantaneous binary codes

\sum_{k=1}^{K} 2^{-l_k} \le 1

and for D-ary codes

\sum_{k=1}^{K} D^{-l_k} \le 1

This is the Kraft inequality.
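A quick check (a tiny Python sketch, not from the lecture) using the lengths of the example code above:

```python
# Codeword lengths of the example code x1=00, x2=01, x3=10, x4=110, x5=111.
lengths = [2, 2, 2, 3, 3]
print(sum(2 ** -l for l in lengths))  # 1.0, which satisfies the Kraft inequality (<= 1)
```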
Consider E lk   H X

K
p
k 1

p
lk 
 p log p
k 1
K
k
k
 pk log
pk
2 lk
k 1

k
K
K
k 1
k
log pk
 log 2 lk

0
 Relative Entropy of pk and qk  2 lk which is  0

E lk   H X
E lk   H X
iff
pk  2
lk
 1 
i .e. lk  log 2    k
 pk 
Shannon' s source coding theorem
i.e.
1. the minimum average codeword length = the entropy of X
2. the most efficient code is obtained when length(x_k) = -\log p_k, i.e. codeword lengths grow as \log_2(1/p_k): less probable outcomes get longer codewords.

In other words:
1) the bits of information in X = the entropy of X
2) a maximally efficient code can always be found when all p_k are powers of 2; otherwise

H_X \le E[l_k]_{\text{best}} \le H_X + 1
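A sketch (not from the lecture) illustrating the bound: using Shannon-style lengths l_k = ⌈-log_2 p_k⌉ (my choice of construction, not specified above) already gives a Kraft-feasible code with H_X ≤ E[l] < H_X + 1, with equality when all p_k are powers of 2:

```python
import math

# Shannon-style code lengths l_k = ceil(-log2 p_k): they satisfy the Kraft
# inequality and give H_X <= E[l] < H_X + 1 (equality when all p_k are powers of 2).
for pmf in ([0.5, 0.25, 0.125, 0.125], [0.4, 0.3, 0.2, 0.1]):
    H = -sum(p * math.log2(p) for p in pmf)
    lengths = [math.ceil(-math.log2(p)) for p in pmf]
    E_l = sum(p * l for p, l in zip(pmf, lengths))
    print(sum(2**-l for l in lengths) <= 1, H <= E_l < H + 1)  # True True
```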
One such optimal code is the Huffman code, constructed using a Huffman tree.
e.g. Let S_X = {A, B, C, D, E} with pmf = {0.1, 0.3, 0.25, 0.2, 0.15}.
At every step, combine the two nodes with minimal probabilities:
1) A (0.1) + E (0.15) → 0.25
2) D (0.2) + C (0.25) → 0.45
3) {A, E} (0.25) + B (0.3) → 0.55
4) {A, E, B} (0.55) + {C, D} (0.45) → 1
Labeling the two branches of each merge with 0 and 1 gives the codewords:
A = 000, B = 01, C = 10, D = 11, E = 001
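A compact Python sketch of the same construction (not the lecture's exact tree; tie-breaking and branch labels may differ, but the codeword lengths come out the same):

```python
import heapq
from itertools import count

def huffman_code(pmf):
    """Build a binary Huffman code for a dict {symbol: probability}.
    At every step the two nodes with smallest total probability are merged."""
    tiebreak = count()  # keeps heap entries comparable when probabilities tie
    heap = [(p, next(tiebreak), {s: ""}) for s, p in pmf.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

pmf = {"A": 0.1, "B": 0.3, "C": 0.25, "D": 0.2, "E": 0.15}
code = huffman_code(pmf)
print(code)                                           # prefix-free codewords
print(sum(pmf[s] * len(w) for s, w in code.items()))  # expected length, close to H_X
```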
To prove the Kraft inequality

\sum_{k} 2^{-l_k} \le 1

for any binary tree code (each leaf a codeword):
Let l_{\max} = the longest codeword length, so there is a leaf at level l_{\max} (root = level 0).
If all leaves were at level l_{\max}, the number of leaves would be 2^{l_{\max}}.
A leaf at level l_k < l_{\max} eliminates 2^{l_{\max} - l_k} leaves from the full tree.

⇒ \sum_{k} 2^{l_{\max} - l_k} \le 2^{l_{\max}}

(remember, each leaf of the full tree is eliminated by exactly one codeword). Dividing by 2^{l_{\max}}:

\sum_{k} 2^{-l_k} \le 1
1
A
B
2
3
C
A eliminates
D
23 - 1  4 leaves
B eliminates 23 - 2  2 leaves
C, D do not eliminate any leaves
2
k
lk
 2 1  2  2  2 3  2 3  1
In general, if the tree is complete,

\sum_{k} 2^{-l_k} = 1

If not, the sum is < 1, e.g. for a tree with A at level 1 and B, C at level 3:

\sum_{k} 2^{-l_k} = 2^{-1} + 2^{-3} + 2^{-3} = \frac{3}{4}
Maximum Entropy Method
Given a random variable X with S_X = {x_1, ..., x_K}, unknown pmf p_k = p(x_k), and the constraint

E[g(X)] = r        (1)

estimate p_k.
Hypothesis: p_k = c \, e^{-\lambda g(x_k)} is the maximum entropy pmf.
Proof: Suppose a pmf q ≠ p also satisfies (1). Then

0 \le H(q; p) = \sum_{k} q_k \ln \frac{q_k}{p_k}
= \sum_{k} q_k \ln q_k - \sum_{k} q_k (\ln c - \lambda g(x_k))
= -\ln c + \lambda \sum_{k} q_k g(x_k) - H_{X \sim q}
= -\ln c + \lambda r - H_{X \sim q}
= H_{X \sim p} - H_{X \sim q}

(the last step uses H_{X \sim p} = -\sum_k p_k (\ln c - \lambda g(x_k)) = -\ln c + \lambda r, since p also satisfies (1))

⇒ H_{X \sim p} \ge H_{X \sim q}
In general, given n constraints

E[g_1(X)] = r_1        (1-1)
...
E[g_n(X)] = r_n        (1-n)

the ME pmf has the form

p_k = c \, e^{-\lambda_1 g_1(x_k) - \dots - \lambda_n g_n(x_k)}

where c and the \lambda_i are chosen to satisfy (1-1), ..., (1-n) and \sum_k p_k = 1.
If X is continuous, the ME pdf is of the form

f_X(x) = c \, e^{-\lambda_1 g_1(x) - \dots - \lambda_n g_n(x)}

Note that the g_i(x) may be moments
⇒ the ME method allows pmf/pdf estimates when some moments are known.
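As a numerical illustration of the ME hypothesis (a Python sketch, not from the lecture; the support {1,...,6}, the mean constraint r = 2.5, and the bisection on λ are my own choices):

```python
import math

# Find the maximum-entropy pmf on x_k = 1..K subject to E[X] = r.  By the
# hypothesis above it has the form p_k = c * exp(-lam * x_k); lam is found by
# bisection, since the mean of this family decreases monotonically in lam.
K, r = 6, 2.5
xs = list(range(1, K + 1))

def pmf(lam):
    w = [math.exp(-lam * x) for x in xs]
    Z = sum(w)
    return [wk / Z for wk in w]

lo, hi = -20.0, 20.0
for _ in range(100):
    mid = (lo + hi) / 2
    mean = sum(x * p for x, p in zip(xs, pmf(mid)))
    lo, hi = (lo, mid) if mean < r else (mid, hi)

p = pmf((lo + hi) / 2)
H = lambda q: -sum(qk * math.log(qk) for qk in q if qk > 0)
print(sum(x * pk for x, pk in zip(xs, p)), H(p))   # mean ~ 2.5, entropy of the ME pmf

# Any other pmf with the same mean has lower entropy, e.g. uniform on {1,2,3,4}:
q = [0.25, 0.25, 0.25, 0.25, 0.0, 0.0]
print(sum(x * qk for x, qk in zip(xs, q)), H(q))   # mean 2.5, smaller entropy
```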