Coding Theory (符号理論)
- The official language of this course is English; slides and talks are (basically) in English.
- Questions and comments in Japanese are also accepted.
- Omnibus-style lecture ... a collection of several subjects.
- "Take-home" test ... questions are handed out, and you solve them at home.

The Name of the Class
- Coding Theory? ... a branch of Information Theory.
- It studies properties and constructions of "codes":
  - source codes (for data compression),
  - channel codes (for error correction),
  - and various codes for various purposes.
- This class is an intermediate course on Information Theory, with some emphasis on the techniques of coding.

relation to Information Theory
- measuring of information: entropy, mutual information
- source coding: Kraft's inequality, Huffman code, universal code, source coding theorem
- channel coding: linear code, Hamming code, convolutional code, analysis of codes, channel coding theorem, Turbo & LDPC codes
- and more...: codes for data recording, network coding

class plan
Seven classes, one test.
- Oct. 8  review   ... brief review of information theory
- Oct. 15 compress ... arithmetic code, universal codes
- Oct. 22 analyze  ... analysis of codes, weight distribution
- Oct. 29 struggle ... cyclic code, convolutional code
- Nov. 5  Shannon  ... channel coding theorem
- Nov. 12 frontier ... Turbo code, LDPC code
- Nov. 19 unique   ... coding for various purposes; take-home test
- Nov. 26 no class
Slides ... http://isw3.naist.jp/~kaji/lecture/

Information Theory
- Information Theory (情報理論) was founded by C. E. Shannon (1916-2001) in 1948.
- It focuses on the mathematical theory of communication.
- It has had an essential impact on today's digital technology: wired/wireless communication and broadcasting, CD/DVD/HDD, data compression, cryptography, linguistics, bioinformatics, games, ...

the model of communication
A communication system can be modeled as in Shannon's diagram.
[Figure: Shannon's model of a communication system. The encoder and decoder are engineering artifacts; the other components are "given" and not controllable.]
C. E. Shannon, "A Mathematical Theory of Communication," The Bell System Technical Journal, 27, pp. 379-423, 623-656, 1948.

the first step
- Precise measurement is essential in engineering, but information cannot be measured directly.
- To handle information by engineering means, we need a quantitative measure of information.
- Entropy provides exactly that.

the model of information source
- An information source is a machinery that produces symbols; the symbol produced is determined probabilistically.
- Use a random variable X to represent the produced symbol.
- X takes one value in D(X) = {x_1, ..., x_M}, and P_X(x_i) denotes the probability that X = x_i.
- Example (a die): D(X) = {1, 2, 3, 4, 5, 6}, P_X(1) = P_X(2) = ... = P_X(6) = 1/6.
- (We mainly focus on memoryless and stationary sources.)

entropy
The entropy of X:
  H(X) = - Σ_{x ∈ D(X)} P_X(x) log2 P_X(x)  (bit),
that is, the expected value of -log2 P_X(x) over all x ∈ D(X).
- -log2 P_X(x) is sometimes called the self-information of x.
- H(X) is sometimes called the expected information of X.
- Example (fair die): P_X(1) = ... = P_X(6) = 1/6, so
  H(X) = -(1/6) log2 (1/6) - ... - (1/6) log2 (1/6) = 2.585 bit.

entropy and uncertainty (不確実さ)
- A cheat die is easier to guess: P_X(1) = 0.9, P_X(2) = ... = P_X(6) = 0.02, and
  H(X) = -0.9 log2 0.9 - 0.02 log2 0.02 - ... - 0.02 log2 0.02 = 0.701 bit,
  compared with H(X) = 2.585 bit for the fair die.
- The more difficult it is to guess the value of X correctly, the larger the entropy H(X).
- Entropy H(X) = the size of the uncertainty.

basic properties of entropy
For H(X) = - Σ_{x ∈ D(X)} P_X(x) log2 P_X(x):
- H(X) ≥ 0 ... 【nonnegative】
- min H(X) = 0 ... 【smallest value】, attained when P_X(x) = 1 for one particular value in D(X)
- max H(X) = log2 |D(X)| ... 【largest value】, attained when P_X(x) = 1/|D(X)| for all x ∈ D(X)
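The two entropy values quoted above (2.585 bit for the fair die and 0.701 bit for the cheat die) are easy to verify numerically. Below is a minimal Python sketch of the entropy formula, intended only as a self-study aid; the function name `entropy` is my own choice, not something from the lecture.

```python
import math

def entropy(probs):
    """H(X) = -sum of P_X(x) * log2 P_X(x), in bits; zero-probability symbols are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# fair die: P_X(1) = ... = P_X(6) = 1/6
print(entropy([1/6] * 6))           # about 2.585 bit

# cheat die: P_X(1) = 0.9, P_X(2) = ... = P_X(6) = 0.02
print(entropy([0.9] + [0.02] * 5))  # about 0.701 bit -- easier to guess, so less uncertainty
```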
some more entropies
- Joint entropy:
  H(X,Y) = - Σ_{x ∈ D(X)} Σ_{y ∈ D(Y)} P_{X,Y}(x,y) log2 P_{X,Y}(x,y).
- Conditional entropy:
  H(X|Y) = Σ_{y ∈ D(Y)} P_Y(y) ( - Σ_{x ∈ D(X)} P_{X|Y}(x|y) log2 P_{X|Y}(x|y) ).
- max{H(X), H(Y)} ≤ H(X,Y) ≤ H(X) + H(Y), and H(X|Y) ≤ H(X).
- If X and Y are independent, then H(X,Y) = H(X) + H(Y) and H(X|Y) = H(X).

mutual information
The mutual information between X and Y:
  I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y).
[Figure: Venn diagrams relating H(X), H(Y), H(X|Y), H(Y|X), H(X,Y) and I(X;Y); when X and Y are independent, the two circles do not overlap.]
- I(X;Y) = I(Y;X)
- I(X;X) = H(X)
- I(X;Y) = 0 if X and Y are independent

example
Binary symmetric channel (BSC):
- X ∈ {0,1} is transmitted, Y ∈ {0,1} is received;
- P_{Y|X}(0|0) = P_{Y|X}(1|1) = 1 - p, and P_{Y|X}(1|0) = P_{Y|X}(0|1) = p.
Compute I(X;Y), assuming P_X(0) = q and P_X(1) = 1 - q.
For simplicity, define the binary entropy function
  ℋ(x) = -x log2 x - (1 - x) log2 (1 - x).

example solved
Compute I(X;Y), assuming P_X(0) = q, P_X(1) = 1 - q.

  P_{X,Y}(x,y) | Y = 0       | Y = 1           | P_X(x)
  X = 0        | (1-p)q      | pq              | q
  X = 1        | p(1-q)      | (1-p)(1-q)      | 1 - q
  P_Y(y)       | p + q - 2pq | 1 - p - q + 2pq |

- H(X) = ℋ(q), H(Y) = ℋ(p + q - 2pq),
- H(X,Y) = -(1-p)q log2 (1-p)q - pq log2 pq - p(1-q) log2 p(1-q) - (1-p)(1-q) log2 (1-p)(1-q)
         = ℋ(p) + ℋ(q),
- I(X;Y) = H(X) + H(Y) - H(X,Y) = ℋ(p + q - 2pq) - ℋ(p).

good input and bad input
I(X;Y) = ℋ(p + q - 2pq) - ℋ(p), where p is a channel-specific constant and q is a controllable parameter.
- min I(X;Y) = 0 ... the channel works poorly for input with q = 0 or q = 1.
- max I(X;Y) = 1 - ℋ(p) ... the channel works best for input with q = 0.5.
[Figure: I(X;Y) plotted as a function of q ∈ [0, 1], rising from 0 to its maximum 1 - ℋ(p) at q = 0.5 and falling back to 0.]

channel capacity
- Channel capacity = the maximum of I(X;Y) over the input distribution, with X = input to the channel and Y = output from the channel.
- The channel capacity of the BSC is 1 - ℋ(p).
- The channel capacity of a binary erasure channel is 1 - p, where p is the probability of erasure.
- Channel capacities of some practical channels are also studied.

source coding
[Figure: an encoder maps the source sequence 121464253... to the binary sequence of codewords 00110101...]
Source coding gives a representation of information; the representation must be as small (short) as possible.

problem formulation
- Source symbols: D(X) = {x_1, ..., x_M}.
- Construct a code C = {c_1, ..., c_M}, where c_i is a sequence over {0,1} called the codeword for x_i.
- Our goal is to construct C so that
  - C is immediately decodable, and
  - the average codeword length of C,
      L = Σ_{i=1}^{M} P_X(x_i) |c_i|,
    is as small as possible.

Huffman code
Code construction by iterative tree operations (David Huffman, 1925-1999):
1. Prepare M isolated nodes, each attached with the probability of a symbol (a node = a size-one tree).
2. Repeat the following operation until all trees are joined into one:
   a. select the two trees T_1 and T_2 having the smallest probabilities;
   b. join T_1 and T_2 by introducing a new parent node;
   c. give the sum of the probabilities of T_1 and T_2 to the new tree.

construction example
Construct a Huffman code for the following source (the codewords are to be filled in by the procedure above):

  symbol | A   | B   | C   | D   | E
  prob.  | 0.2 | 0.1 | 0.3 | 0.3 | 0.1
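One minimal way to carry out the tree-joining procedure for these five symbols is sketched below in Python. It is an illustration rather than the lecture's reference implementation; ties between equal probabilities may be broken differently from the in-class answer, so the individual codewords can differ, but the average codeword length (2.2 bit for this source) does not.

```python
import heapq

def huffman(probs):
    """Build a Huffman code; probs maps symbol -> probability. Returns symbol -> codeword."""
    # each heap entry: (probability of the tree, tie-breaker, {symbol: partial codeword})
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        # join the two trees with the smallest probabilities under a new parent node
        p1, _, t1 = heapq.heappop(heap)
        p2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

probs = {"A": 0.2, "B": 0.1, "C": 0.3, "D": 0.3, "E": 0.1}
code = huffman(probs)
avg = sum(probs[s] * len(code[s]) for s in probs)
print(code)  # with this tie-breaking: A->'00', B->'010', E->'011', C->'10', D->'11'
print(avg)   # about 2.2 bit per symbol, above the source entropy of about 2.17 bit
```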
source coding theorem
Shannon's source coding theorem ... the two faces of source coding:
- There is no immediately decodable code with L < H(X).
  (Proved by Kraft's inequality and Shannon's lemma.)
- We can construct an immediately decodable code with H(X) ≤ L < H(X) + ε for any small ε.
  (Shown by the construction of a block Huffman code.)

"block" coding
Group source symbols into blocks and encode the blocks:

  symbol | A   | B   | C   | D   | E
  prob.  | 0.2 | 0.1 | 0.3 | 0.3 | 0.1

  block  | AA   | AB   | AC   | AD   | AE   | BA   | BB   | BC   | BD   | BE   | CA   | ...
  prob.  | 0.04 | 0.02 | 0.06 | 0.06 | 0.02 | 0.02 | 0.01 | 0.03 | 0.03 | 0.01 | 0.06 | ...

problems of block Huffman code
- The optimum code is obtained by grouping several symbols into one block and applying the Huffman code construction.
- Practical problems arise:
  - we need much storage;
  - we need to know the probability distribution in advance.
- Solutions to these problems are discussed in this class.

channel coding
- Errors are unavoidable in communication (e.g., "ABCABC" is sent but "ABCADC" is received).
- Some errors are correctable by adding some redundancy, just as spelling out "ABC" as "Alpha, Bravo, Charlie" lets the receiver recover "ABC".
- Channel coding gives a clever way to introduce the redundancy.

linear code
A linear code is a practical class of channel codes.
- The encoding uses a k × n generator matrix G:
    (c_1, ..., c_n) = (x_1, ..., x_k) G.
- The decoding uses an m × n parity check matrix H, which maps the received word to a syndrome:
    (s_1, ..., s_m)^T = H (r_1, ..., r_n)^T.
- The syndrome indicates the position of the errors.

Hamming code
To construct a one-bit error-correcting code, let the column vectors of the parity check matrix H be all different (Richard Hamming, 1915-1998).
Hamming code:
- determine a parameter m;
- enumerate all nonzero vectors of length m;
- use these vectors as the columns of H.
For m = 3:
      1 1 1 0 1 0 0           1 0 0 0 1 1 1
  H = 1 1 0 1 0 1 0       G = 0 1 0 0 1 1 0
      1 0 1 1 0 0 1           0 0 1 0 1 0 1
                              0 0 0 1 0 1 1
(The parity part of G is the transpose of the non-identity part of H.)

parameters of Hamming code
- Determine m and design H to have 2^m - 1 different column vectors; H has m rows and 2^m - 1 columns.
- Code length n = 2^m - 1.
- Number of information symbols k = 2^m - 1 - m; number of parity symbols m.
- Code rate = k/n.

  m | 2 | 3 | 4  | 5  | 6  | 7
  n | 3 | 7 | 15 | 31 | 63 | 127
  k | 1 | 4 | 11 | 26 | 57 | 120

code rate and performance
- If the code rate k/n is large, one codeword carries more information but fewer symbols are available for error correction, so the error-correcting capability is weak in general.
- To have good error-correcting capability, we need to sacrifice the code rate.
[Figure: trade-off between code rate (small to large) and error-correcting capability (strong to weak).]

channel coding theorem
Shannon's channel coding theorem ... the two faces of channel coding.
Let C be the capacity of the communication channel.
- Among channel codes with rate ≤ C, there exists a code which can correct almost all errors.
- There is no such code in the class of codes with rate > C.

two coding theorems
- Source coding theorem: a constructive solution is given by the Huffman code; the work is almost finished.
- Channel coding theorem: no constructive solution; a number of studies have been made, and the subject is still under investigation, both for remarkable classes of channel codes and for the proof of the theorem.

summary
Today's talk ... not self-contained; a summary of Information Theory:
- measuring of information,
- source coding,
- channel coding.
Students are encouraged to review the basics of Information Theory.
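As a small self-study exercise for the channel coding part, the sketch below works through syndrome decoding for the (7,4) Hamming code, using the G and H matrices shown on the Hamming code slide: it encodes four information bits, flips one bit, and locates the error from the syndrome, which equals the column of H at the error position. It is a minimal Python illustration under those matrices; the function names and the chosen information word are my own.

```python
# (7,4) Hamming code over GF(2), with the G and H from the Hamming code slide
G = [[1,0,0,0,1,1,1],
     [0,1,0,0,1,1,0],
     [0,0,1,0,1,0,1],
     [0,0,0,1,0,1,1]]
H = [[1,1,1,0,1,0,0],
     [1,1,0,1,0,1,0],
     [1,0,1,1,0,0,1]]

def encode(x):
    """Codeword (c_1, ..., c_7) = (x_1, ..., x_4) G, computed modulo 2."""
    return [sum(x[i] * G[i][j] for i in range(4)) % 2 for j in range(7)]

def syndrome(r):
    """Syndrome (s_1, s_2, s_3)^T = H r^T, computed modulo 2."""
    return [sum(H[i][j] * r[j] for j in range(7)) % 2 for i in range(3)]

def correct_single_error(r):
    """If the syndrome is nonzero, flip the bit whose column of H matches the syndrome."""
    s = syndrome(r)
    r = list(r)
    if any(s):
        for j in range(7):
            if [H[i][j] for i in range(3)] == s:
                r[j] ^= 1
                break
    return r

x = [1, 0, 1, 1]          # information bits
c = encode(x)             # [1, 0, 1, 1, 0, 0, 1]
r = list(c)
r[2] ^= 1                 # the channel flips the third bit
print(syndrome(r))        # [1, 0, 1] -- the third column of H, so the error is located
print(correct_single_error(r) == c)  # True: the single error has been corrected
```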