Information Theory

符号理論 ... Coding Theory
The official language of this course is English:
slides and talks are (basically) in English.
Questions and comments in Japanese are also welcome.
omnibus-style lecture ... a collection of several subjects
“take-home” test ... questions are given; solve them at home
The Name of the Class
Coding Theory?
a branch of Information Theory
properties/constructions of “codes”:
source codes (for data compression)
channel codes (for error correction)
and various codes for various purposes
this class = an intermediate course on Information Theory,
with some emphasis on coding techniques
relation to Information Theory
measuring of information: entropy, mutual information
source coding: Kraft’s inequality, Huffman code, universal code, source coding theorem
channel coding: linear code, Hamming code, analysis of codes, convolutional code, channel coding theorem, Turbo & LDPC codes
and more: codes for data recording, network coding, ...
class plan
seven classes, one test
Oct. 8   review ... brief review of information theory
Oct. 15  compress ... arithmetic code, universal codes
Oct. 22  analyze ... analysis of codes, weight distribution
Oct. 29  struggle ... cyclic code, convolutional code
Nov. 5   Shannon channel coding theorem
Nov. 12  frontier ... Turbo code, LDPC code
Nov. 19  unique ... coding for various purposes
take-home test
Nov. 26  no class
slides ... http://isw3.naist.jp/~kaji/lecture/
Information Theory
Information Theory (情報理論)
was founded by C. E. Shannon in 1948
focuses on the mathematical theory of communication
has had an essential impact on today’s digital technology:
wired/wireless communication/broadcasting
CD/DVD/HDD
data compression
cryptography, linguistics, bioinformatics, games, ...
(Claude E. Shannon, 1916-2001)
the model of communication
A communication system can be modeled as follows
(figure: Shannon’s model of a communication system):
C. E. Shannon, “A Mathematical Theory of Communication,” The Bell System Technical Journal, vol. 27, pp. 379–423, 623–656, 1948.
The encoder and the decoder are engineering artifacts;
the other components are “given” and not controllable.
the first step
precise measurement is essential in engineering
vs.
information cannot be measured
To handle information by engineering means,
we need to develop a quantitative measure of information.
Entropy provides such a measure.
the model of information source
information source = a machine that produces symbols
The produced symbol is determined probabilistically.
Use a random variable X to represent the produced symbol:
X takes one value in D(X) = {x_1, …, x_M}.
P_X(x_i) denotes the probability that X = x_i.
Example (a fair die): D(X) = {1, 2, 3, 4, 5, 6}, P_X(1) = P_X(2) = ⋯ = P_X(6) = 1/6
(We mainly focus on memoryless & stationary sources.)
entropy
the entropy of X:
H(X) = Σ_{x∈D(X)} −P_X(x) log₂ P_X(x)   (bit)
the expected value of −log₂ P_X(x) over all x ∈ D(X)
−log₂ P_X(x) is sometimes called the self-information of x.
H(X) is sometimes called the expected information of X.
Example (a fair die): P_X(1) = P_X(2) = ⋯ = P_X(6) = 1/6
H(X) = −(1/6) log₂(1/6) − ⋯ − (1/6) log₂(1/6) = 2.585 bit
entropy and uncertainty (不確実さ)
a fair die ... P_X(1) = P_X(2) = ⋯ = P_X(6) = 1/6, H(X) = 2.585 bit
a cheat (loaded) die ... easier to guess:
P_X(1) = 0.9, P_X(2) = ⋯ = P_X(6) = 0.02
H(X) = −0.9 log₂ 0.9 − 0.02 log₂ 0.02 − ⋯ − 0.02 log₂ 0.02 = 0.701 bit
The more difficult it is to guess the value of X correctly, the larger the entropy H(X).
entropy H(X) = the amount of uncertainty
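The two entropy values above can be checked with a short computation; here is a minimal Python sketch (not part of the original slides), assuming the source is given as a dictionary of P_X(x):

```python
# A minimal sketch (not from the lecture): computing H(X) for the fair die and
# the cheat die above.  The source is assumed to be given as a dict of P_X(x).
from math import log2

def entropy(px):
    """H(X) = sum over x of -P_X(x) * log2(P_X(x)), in bits."""
    return sum(-p * log2(p) for p in px.values() if p > 0)

fair_die  = {x: 1 / 6 for x in range(1, 7)}
cheat_die = {1: 0.9, 2: 0.02, 3: 0.02, 4: 0.02, 5: 0.02, 6: 0.02}

print(entropy(fair_die))    # about 2.585 bit: hard to guess, large uncertainty
print(entropy(cheat_die))   # about 0.701 bit: easy to guess, small uncertainty
```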
basic properties of entropy
H(X) = Σ_{x∈D(X)} −P_X(x) log₂ P_X(x)
H(X) ≥ 0 ...【nonnegative】
min H(X) = 0 ...【smallest value】
when P_X(x) = 1 for one particular value x ∈ D(X)
max H(X) = log₂ |D(X)| ...【largest value】
when P_X(x) = 1/|D(X)| for all x ∈ D(X)
some more entropies
joint entropy:
H(X, Y) = Σ_{x∈D(X)} Σ_{y∈D(Y)} −P_{X,Y}(x, y) log₂ P_{X,Y}(x, y)
conditional entropy:
H(X|Y) = Σ_{y∈D(Y)} P_Y(y) Σ_{x∈D(X)} −P_{X|Y}(x|y) log₂ P_{X|Y}(x|y)
max(H(X), H(Y)) ≤ H(X, Y) ≤ H(X) + H(Y)
H(X|Y) ≤ H(X)
if X and Y are independent, then H(X, Y) = H(X) + H(Y) and H(X|Y) = H(X)
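A minimal Python sketch (illustrative, not part of the original slides) that evaluates the joint and conditional entropies for a small hypothetical joint distribution and checks the inequalities listed above:

```python
# Joint and conditional entropy for a small hypothetical joint distribution.
from math import log2
from collections import defaultdict

def H(dist):
    """Entropy (in bits) of a distribution given as {outcome: probability}."""
    return sum(-p * log2(p) for p in dist.values() if p > 0)

# hypothetical joint distribution over {0,1} x {0,1}
pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

px, py = defaultdict(float), defaultdict(float)
for (x, y), p in pxy.items():
    px[x] += p
    py[y] += p

H_XY = H(pxy)  # joint entropy H(X,Y)
# conditional entropy H(X|Y), computed directly from the definition above
H_X_given_Y = sum(
    py[y] * sum(-(pxy[x, y] / py[y]) * log2(pxy[x, y] / py[y])
                for x in px if pxy.get((x, y), 0) > 0)
    for y in py
)

print(H_XY, H_X_given_Y)
assert max(H(px), H(py)) <= H_XY + 1e-12 <= H(px) + H(py) + 1e-12
assert H_X_given_Y <= H(px) + 1e-12
assert abs(H_X_given_Y - (H_XY - H(py))) < 1e-12   # chain-rule consistency check
```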
mutual information
mutual information between X and Y:
I(X; Y) = H(X) − H(X|Y)
        = H(Y) − H(Y|X)
        = H(X) + H(Y) − H(X, Y)
(Venn diagram: H(X, Y) is the union of the circles H(X) and H(Y); its regions are H(X|Y), I(X; Y), and H(Y|X).
If X and Y are independent, the circles H(X) and H(Y) do not overlap.)
I(X; Y) = I(Y; X)
I(X; X) = H(X)
I(X; Y) = 0 if X and Y are independent
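The identities above can be verified numerically; the sketch below (illustrative, with a hypothetical distribution) computes I(X; Y) as H(X) + H(Y) − H(X, Y):

```python
# Checking I(X;Y) = I(Y;X), I(X;X) = H(X), and I = 0 under independence,
# using a hypothetical P_X chosen only for the demonstration.
from math import log2

def H(dist):
    return sum(-p * log2(p) for p in dist.values() if p > 0)

def marginals(pxy):
    px, py = {}, {}
    for (x, y), p in pxy.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return px, py

def I(pxy):
    px, py = marginals(pxy)
    return H(px) + H(py) - H(pxy)

px = {0: 0.25, 1: 0.75}
p_dep  = {(0, 0): 0.2, (0, 1): 0.05, (1, 0): 0.1, (1, 1): 0.65}  # dependent X, Y
p_ind  = {(x, y): px[x] * px[y] for x in px for y in px}          # independent X, Y
p_self = {(x, x): px[x] for x in px}                              # Y = X

print(I(p_dep))                          # strictly positive here
print(abs(I(p_ind)) < 1e-12)             # True: independence gives I(X;Y) = 0
print(abs(I(p_self) - H(px)) < 1e-12)    # True: I(X;X) = H(X)
```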
example
binary symmetric channel (BSC):
X ∈ {0, 1} is transmitted, Y ∈ {0, 1} is received
P_{Y|X}(0|0) = P_{Y|X}(1|1) = 1 − p
P_{Y|X}(1|0) = P_{Y|X}(0|1) = p
(channel diagram: 0 → 0 and 1 → 1 with probability 1 − p, 0 → 1 and 1 → 0 with probability p)
Compute I(X; Y), assuming P_X(0) = q, P_X(1) = 1 − q.
For simplicity, define the binary entropy function
ℋ(x) = −x log₂ x − (1 − x) log₂(1 − x)
example solved
compute I(X; Y), assuming P_X(0) = q, P_X(1) = 1 − q

P_{X,Y}(x, y):
            Y = 0           Y = 1             P_X(x)
  X = 0     (1 − p)q        pq                q
  X = 1     p(1 − q)        (1 − p)(1 − q)    1 − q
  P_Y(y)    p + q − 2pq     1 − p − q + 2pq

H(X) = ℋ(q)
H(Y) = ℋ(p + q − 2pq)
H(X, Y) = −(1 − p)q log₂((1 − p)q) − pq log₂(pq)
          − p(1 − q) log₂(p(1 − q)) − (1 − p)(1 − q) log₂((1 − p)(1 − q))
        = ℋ(p) + ℋ(q)
I(X; Y) = H(X) + H(Y) − H(X, Y) = ℋ(p + q − 2pq) − ℋ(p)
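A minimal Python sketch (not part of the original slides) that checks the closed form ℋ(p + q − 2pq) − ℋ(p) against a direct computation from the joint distribution above; the values of p and q are arbitrary examples:

```python
# Numeric check of I(X;Y) = H(p + q - 2pq) - H(p) for the BSC.
from math import log2

def h2(x):
    """binary entropy function"""
    return 0.0 if x in (0.0, 1.0) else -x * log2(x) - (1 - x) * log2(1 - x)

def bsc_mutual_information(p, q):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for the BSC joint distribution above."""
    pxy = [(1 - p) * q, p * q, p * (1 - q), (1 - p) * (1 - q)]
    px  = [q, 1 - q]
    py  = [p + q - 2 * p * q, 1 - p - q + 2 * p * q]
    H = lambda probs: sum(-v * log2(v) for v in probs if v > 0)
    return H(px) + H(py) - H(pxy)

p, q = 0.1, 0.3
print(bsc_mutual_information(p, q))     # direct computation
print(h2(p + q - 2 * p * q) - h2(p))    # closed form: the same value
# sweeping q shows the maximum 1 - h2(p) is attained at q = 0.5
print(max(bsc_mutual_information(p, k / 100) for k in range(101)), 1 - h2(p))
```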
good input and bad input
I(X; Y) = ℋ(p + q − 2pq) − ℋ(p)
p is a channel-specific constant
q is a controllable parameter
min I(X; Y) = 0
... the channel works poorly for inputs with q = 0 or q = 1
max I(X; Y) = 1 − ℋ(p)
... the channel works best for the input with q = 0.5
(plot: I(X; Y) versus q; the curve is 0 at q = 0 and q = 1 and reaches its maximum 1 − ℋ(p) at q = 0.5)
channel capacity
channel capacity = the maximum of I(X; Y) over the input distribution, where
X = input to the channel
Y = output from the channel
the channel capacity of the BSC = 1 − ℋ(p)
the channel capacity of a binary erasure channel = 1 − p,
where p is the probability of erasure
Channel capacities of some practical channels are also studied.
source coding
(encoder: source sequence 121464253 … ↦ codeword sequence 00110101 …)
Source coding gives a representation of information.
The representation must be as small (short) as possible.
problem formulation
source symbols D(X) = {x_1, …, x_M}
construct a code C = {c_1, …, c_M},
where c_i is a sequence (over {0, 1})
that is called the codeword for x_i
(source symbols x_1, x_2, …, x_M are mapped to the codewords c_1, c_2, …, c_M of code C)
our goal is to construct C so that
C is immediately decodable, and
the average codeword length of C,
L = Σ_{i=1}^{M} P_X(x_i) |c_i|,
is as small as possible.
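For concreteness, a minimal Python sketch (not part of the original slides); the codeword assignment below is a hypothetical prefix code chosen only to illustrate the average length L and Kraft's inequality:

```python
# Average codeword length L and Kraft's inequality for a hypothetical
# prefix (immediately decodable) code; the codewords are made up for
# this illustration, the probabilities are those of the example below.
from math import log2

probs = {'A': 0.2, 'B': 0.1, 'C': 0.3, 'D': 0.3, 'E': 0.1}
code  = {'A': '10', 'B': '110', 'C': '00', 'D': '01', 'E': '111'}  # hypothetical

L     = sum(probs[s] * len(code[s]) for s in probs)   # average codeword length
kraft = sum(2 ** -len(w) for w in code.values())      # Kraft sum
H     = sum(-p * log2(p) for p in probs.values())     # source entropy H(X)

print(L)             # 2.2 bit per symbol
print(kraft <= 1)    # True: Kraft's inequality holds for this prefix code
print(H <= L)        # True, consistent with the source coding theorem
```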
Huffman code
Code construction by iterative tree operations (a Python sketch follows below):
1. Prepare M isolated nodes, each attached with the probability of a symbol (node = size-one tree).
2. Repeat the following operation until all trees are joined into one:
   a. select the two trees T_1 and T_2 having the smallest probabilities
   b. join T_1 and T_2 by introducing a new parent node
   c. give the sum of the probabilities of T_1 and T_2 to the new tree
(David Huffman, 1925-1999)
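A minimal Python sketch of this procedure (not part of the original slides), representing each tree simply as a map from symbols to partial codewords; since Huffman codes are not unique, the printed codewords are one possible outcome, not "the" answer for the construction example that follows:

```python
# Huffman construction using a heap to pick the two smallest-probability trees.
import heapq
from itertools import count

def huffman(probs):
    """Return {symbol: codeword} for a distribution {symbol: probability}."""
    tie = count()  # tie-breaker so the heap never has to compare two dicts
    # step 1: M isolated size-one trees, each carrying a symbol probability
    heap = [(p, next(tie), {s: ''}) for s, p in probs.items()]
    heapq.heapify(heap)
    # step 2: repeatedly join the two trees with the smallest probabilities
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)
        p2, _, t2 = heapq.heappop(heap)
        joined = {s: '0' + w for s, w in t1.items()}         # new parent node:
        joined.update({s: '1' + w for s, w in t2.items()})   # children get 0 / 1
        heapq.heappush(heap, (p1 + p2, next(tie), joined))   # sum of probabilities
    return heap[0][2]

probs = {'A': 0.2, 'B': 0.1, 'C': 0.3, 'D': 0.3, 'E': 0.1}
code = huffman(probs)
print(code)
print(sum(probs[s] * len(code[s]) for s in probs))   # average codeword length
```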
construction example
symbol   prob.   codeword
A        0.2
B        0.1
C        0.3
D        0.3
E        0.1
source coding theorem
Shannon’s source coding theorem:
There is no immediately decodable code with L < H(X).
(proof by Kraft’s inequality and Shannon’s lemma)
We can construct an immediately decodable code with H(X) ≤ L < H(X) + ε for any small ε > 0.
(construction of a block Huffman code)
... two faces of source coding
“block” coding
single symbols:
  symbol   prob.
  A        0.2
  B        0.1
  C        0.3
  D        0.3
  E        0.1

pairs of symbols:
  pair   prob.   codewords
  AA     0.04
  AB     0.02
  AC     0.06
  AD     0.06
  AE     0.02
  BA     0.02
  BB     0.01
  BC     0.03
  BD     0.03
  BE     0.01
  CA     0.06
  :      :
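Because the source is memoryless, the pair probabilities above are just products of the single-symbol probabilities; a minimal sketch (illustrative, not part of the original slides):

```python
# For a memoryless source, P(pair) = P(first symbol) * P(second symbol),
# which is how the pair table above is obtained.
probs = {'A': 0.2, 'B': 0.1, 'C': 0.3, 'D': 0.3, 'E': 0.1}
pairs = {a + b: pa * pb for a, pa in probs.items() for b, pb in probs.items()}

print(round(pairs['AA'], 4), round(pairs['AB'], 4), round(pairs['BA'], 4))  # 0.04 0.02 0.02
print(len(pairs), round(sum(pairs.values()), 10))  # 25 pair symbols, total probability 1.0
```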
problems of block Huffman code
The optimum code is obtained by
grouping several symbols into one, and
applying the Huffman code construction.
Practical problems arise:
we need a large amount of storage
we need to know the probability distribution in advance
... solutions to these problems are discussed in this class.
channel coding
Errors are unavoidable in communication.
(illustration: ABCABC vs. ABCADC; one symbol is corrupted)
Some errors are correctable by adding some redundancy.
(illustration: ABC ↦ “Alpha, Bravo, Charlie” ↦ “Alpha, Bravo, Charlie” ↦ ABC; the phonetic alphabet adds redundancy)
Channel coding gives a clever way to introduce the redundancy.
linear code
linear code: a practical class of channel codes
the encoding is done with a generator matrix G = [g_{i,j}] (k rows, n columns):
  codeword (c_1, …, c_n) = (x_1, …, x_k) G
the decoding is done with a parity check matrix H = [h_{i,j}] (m rows, n columns):
  syndrome (s_1, …, s_m)ᵀ = H (r_1, …, r_n)ᵀ
The syndrome indicates the position of errors.
Hamming code
To construct a one-bit error correcting code,
let the column vectors of the parity check matrix H be all different (and nonzero).
(Richard Hamming, 1915-1998)
Hamming code:
determine a parameter m
enumerate all nonzero vectors of length m
use the vectors as the columns of H
Example (m = 3):
      1110100            1000111
  H = 1101010        G = 0100110
      1011001            0010101
                         0001011
(the left 3×4 part of H is the transpose of the right 4×3 part of G)
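A minimal Python sketch (not part of the original slides) that encodes with the G above and locates a single-bit error with the H above; numpy is assumed only for the mod-2 matrix products:

```python
# Encoding and syndrome decoding for the (7,4) Hamming code over GF(2).
import numpy as np

G = np.array([[1, 0, 0, 0, 1, 1, 1],
              [0, 1, 0, 0, 1, 1, 0],
              [0, 0, 1, 0, 1, 0, 1],
              [0, 0, 0, 1, 0, 1, 1]])
H = np.array([[1, 1, 1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0, 1, 0],
              [1, 0, 1, 1, 0, 0, 1]])

x = np.array([1, 0, 1, 1])        # information symbols (x_1, ..., x_4)
c = x @ G % 2                     # codeword (c_1, ..., c_7)
assert not (H @ c % 2).any()      # every codeword has an all-zero syndrome

r = c.copy()
r[4] ^= 1                         # a transmission error flips the 5th bit
s = H @ r % 2                     # syndrome of the received word
# the syndrome equals the 5th column of H, so it points at the error position
print(c, r, s, (s == H[:, 4]).all())
```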
parameters of Hamming code
Hamming code:
determine m
design H to have 2^m − 1 different (nonzero) column vectors
H has m rows and 2^m − 1 columns
code length                 n = 2^m − 1
# of information symbols    k = 2^m − 1 − m
# of parity symbols         m
code rate = k/n

  m   2   3   4    5    6    7
  n   3   7   15   31   63   127
  k   1   4   11   26   57   120
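The table above follows directly from n = 2^m − 1 and k = n − m; a minimal sketch (illustrative, not part of the original slides):

```python
# Reproducing the Hamming-code parameter table from n = 2^m - 1 and k = n - m.
for m in range(2, 8):
    n = 2 ** m - 1          # code length
    k = n - m               # number of information symbols
    print(f"m={m}  n={n}  k={k}  rate={k / n:.3f}")
```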
code rate and performance
If the code rate k/n is large:
more information is carried in one codeword, but
fewer symbols are available for error correction,
so the error-correcting capability is weak in general.
To have good error-correcting capability,
we need to sacrifice the code rate.
(figure: trade-off between error-correcting capability (strong ↔ weak) and code rate (small ↔ large))
channel coding theorem
Shannon’s channel coding theorem:
Let C be the capacity of the communication channel.
Among channel codes with rate ≤ C,
there exists a code which can correct almost all errors.
There are no such codes in the class of codes with rate > C.
... two faces of channel coding
two coding theorems
source coding theorem:
constructive solution given by the Huffman code
almost finished work
channel coding theorem:
no constructive solution
a number of studies have been made
still under investigation:
remarkable classes of channel codes
proof of the theorem
summary
today’s talk ... a brief, not self-contained, summary of Information Theory:
measuring of information
source coding
channel coding
Students are encouraged to review the basics of Information Theory.