UNIQUE DECIPHERABILITY AND SYNCHRONIZABILITY OF CODES

SOME DEFINITIONS

SOURCE ALPHABET: A finite set of symbols. The elements are called source symbols. Ex: S = {A, B, C, D}.

CODE ALPHABET: A finite set of symbols. The elements are called code symbols. Ex: L = {0, 1, 2}.

CODE WORD: Concatenation of a finite number of code symbols. Ex: 0110, 102, 3.

CODE: A finite number of code words, each representing a source symbol. Examples:

    Code A = [S: 00, C: 01,  R: 10, F: 11]
    Code B = [S: 0,  C: 110, R: 10, F: 1110]
    Code C = [A: 0,  B: 11,  C: 00, D: 01]

Code A is a block code. Code B and code C are variable-length codes.

Code A uses 2 binary digits per source symbol. If the source symbols are equally probable, this block code is adequate. But suppose the probabilities are p(S) = 1/2, p(R) = 1/4, and p(C) = p(F) = 1/8. In this case code B is better than code A, since the average length per source symbol for code B is

$$\sum_i L(s_i)\,p(s_i) = 1\cdot p(S) + 3\cdot p(C) + 2\cdot p(R) + 4\cdot p(F) = \frac{1}{2} + \frac{3}{8} + \frac{2}{4} + \frac{4}{8} = \frac{15}{8}.$$

Compare this with 2 (code A).

CODED MESSAGE: String obtained by concatenating code words, without spacing or any other punctuation. Examples:
a) 1101011011001110 is a coded message in code B, corresponding to the source string CRCCSF.
b) 0000 is a coded message in code C, corresponding to AAAA, AAC, ACA, CAA, or CC.
c) 0000 is a coded message in code B, corresponding to SSSS.
d) 0000 is a coded message in code A, corresponding to SS.

NONSINGULAR CODE: If all code words are distinct, the code is called nonsingular. [Some authors use the term “code” only for nonsingular codes.] If there is at least one duplication of code words, the code is called singular.

THE nth EXTENSION OF A CODE: If the source alphabet of a code has q symbols, S = {s1, s2, ..., sq}, the nth extension of the code has $q^n$ source symbols $\sigma_1, \sigma_2, \ldots, \sigma_{q^n}$. Each $\sigma_i$ corresponds to some sequence of n of the sj, and its code word is the concatenation of the code words of those sj. Example.
Second extension of code C:

    AA: 00,  AB: 011,  AC: 000,  AD: 001,
    BA: 110, BB: 1111, BC: 1100, BD: 1101,
    CA: 000, CB: 0011, CC: 0000, CD: 0001,
    DA: 010, DB: 0111, DC: 0100, DD: 0101.

UNIQUE DECODABILITY (UNIQUE DECIPHERABILITY) (UD): A code is said to be uniquely decodable (UD) if and only if the nth extension of the code is nonsingular for every finite n. Examples:
a) Code C is clearly not UD, since its second extension is singular (AC: 000 and CA: 000).
b) The 2nd and 3rd extensions of code B are nonsingular, but checking finitely many extensions is not sufficient to conclude that code B is UD.

This shows that we need a method for testing a code for unique decodability.

The above definition of UD ensures that any two distinct sequences of source symbols lead to distinct sequences of code symbols, even when the two source sequences have different lengths. This is the content of the following theorem [notation: the sequence of source symbols $S_i$ corresponds to the sequence of code symbols $X_i$; L denotes the length of a source sequence; “$S_i \to X_i$” is used for “$S_i$ corresponds to $X_i$”]:

THEOREM: If the code is UD, then any two distinct sequences of source symbols lead to (i.e. correspond to) distinct sequences of code symbols.

Proof.
a) $L(S_1) = L(S_2) = k$. Consider the kth extension, which is nonsingular by hypothesis. Then clearly $[S_1 \ne S_2] \Rightarrow [X_1 \ne X_2]$.
b) $L(S_1) \ne L(S_2)$. Assume the contrary, i.e. assume that $S_1 \to X$ and also $S_2 \to X$. Then $S_1S_2 \to XX$ and $S_2S_1 \to XX$. Moreover $S_1S_2 \ne S_2S_1$: otherwise the shorter sequence would be a proper prefix of the longer one, say $S_2 = S_1T$ with T nonempty, and $X = X\,X_T$ would force the code of T to be empty, which is impossible. Hence, with $L(S_1S_2) = L(S_2S_1) = m$, two distinct source sequences of length m correspond to the same coded string, although the mth extension should be nonsingular. Contradiction.

Example. Consider code L = [A: 0, B: 01, C: 1010]. We note that both AC and BBA correspond to 01010, hence the code is clearly not UD. Nevertheless, the second, third, and fourth extensions are nonsingular. For example, the 2nd extension is: [AA: 00, AB: 001, AC: 01010, BA: 010, BB: 0101, BC: 011010, CA: 10100, CB: 101001, CC: 10101010]. But L(AC) + L(BBA) = 5, and the 5th extension is bound to be singular: ACBBA $\to$ 0101001010 and BBAAC $\to$ 0101001010. Hence AC $\to$ 01010 and BBA $\to$ 01010. The code is not UD.
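The extension-based definition can be checked mechanically for small n. The following is a minimal sketch, not part of the original notes: the helper names `extension` and `collisions` are hypothetical, and a code is assumed to be given as a dictionary from single-character source symbols to code words. It only tests one extension at a time, so it can show that a code is not UD but can never prove that it is.

```python
from itertools import product


def extension(code, n):
    """n-th extension: every length-n source sequence mapped to its coded string."""
    return {"".join(seq): "".join(code[s] for s in seq)
            for seq in product(code, repeat=n)}


def collisions(code):
    """Group source entries that share a code word; empty result means nonsingular."""
    groups = {}
    for source, word in code.items():
        groups.setdefault(word, []).append(source)
    return {word: srcs for word, srcs in groups.items() if len(srcs) > 1}


# Codes taken from the text above.
CODE_B = {"S": "0", "C": "110", "R": "10", "F": "1110"}
CODE_C = {"A": "0", "B": "11", "C": "00", "D": "01"}
CODE_L = {"A": "0", "B": "01", "C": "1010"}

# Code C: the second extension is already singular (AC and CA both give 000).
print(collisions(extension(CODE_C, 2)))

# Code L: extensions 1 to 4 are nonsingular, the 5th is not.
for n in range(1, 6):
    print(n, "nonsingular" if not collisions(extension(CODE_L, n)) else "singular")
```

The enumeration grows as $q^n$, so this is only practical for small examples; it illustrates why a test that does not enumerate extensions is needed.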
With the above theorem, we see that our definition of UD (“the nth extension is nonsingular for every finite n”) and the definition in Kohavi (“every coded message can be decomposed into a sequence of code words in only one way”) are essentially the same. So, if a code is UD, then any (whole) coded message can be deciphered in a unique way.

UNIQUE DECODABILITY WITH FINITE DELAY (UDF-$\delta$): A code is said to be uniquely decodable with delay $\delta$ iff $\delta$ is the least integer such that knowledge of the first $\delta$ symbols of the coded message suffices to determine its first code word.

If a code is UDF, we can start deciphering without waiting for the end of the message. Hence we can decode “as we receive the message”, with some delay of course, $\delta$ being a measure of the decoding delay.

Example. Code K = [A: 1, B: 10, C: 001]. This code is known to be UDF-4. Hence there exists at least one message of length 3 which does not allow us to identify the first code word uniquely. Consider the coded message 100. Is the first code word 10 (B) or 1 (A)?

    1 0 | 0 ...        B, followed by the beginning of C
    1 | 0 0 ...        A, followed by the beginning of C

But four symbols suffice: if the fourth symbol is 0, the first code word is 10 (B), and if the fourth symbol is 1, the first code word is 1 (A):

    1 0 | 0 0 ...      B, followed by the beginning of C
    1 | 0 0 1          A, followed by C

For all possible coded messages of length four, the first code word can be determined uniquely, so the code is UDF-4.

Testing a code for UD and for UDF is analogous to the IL (information lossless) and ILF (information lossless of finite order) tests for finite-state machines.

EVEN’S TEST FOR UD AND UDF
(Explained on code L = [A: 0, B: 01, C: 1010] and K = [A: 1, B: 10, C: 001])

1. Given a code, insert a separation symbol S at the beginning and at the end of each of its code words. In addition, in every code word representing the source symbol N, insert the symbol Ni between its ith symbol and its (i+1)st symbol.

2. Each code symbol xk is now situated between two separation symbols. We say that the separation symbol to the right of the code symbol xk is the xk-successor, denoted Ri, of the left separation symbol.
    A:  S 0 S
    B:  S 0 B1 1 S
    C:  S 1 C1 0 C2 1 C3 0 S

For example, S is the 0-successor of S, S is the 0-successor of C3, S is the 1-successor of B1, and C3 is the 1-successor of C2.

3. Two successors Ri and Rj are compatible if
a) S xk Ri and S xk Rj occur, or
b) Rp xk Ri and Rq xk Rj occur, and Rp and Rq are compatible.
In such a case, (Ri Rj) is said to be a compatible pair implied by (Rp Rq) under xk [compatible pairs are unordered pairs].

4. Testing table (for UD)
a) Column headings are the symbols of the code alphabet.
b) The first row heading is S. The other row headings are compatible pairs.
c) The entries in row (Rp Rq), column xk, are the compatible pairs implied by (Rp Rq) under xk.

Testing table for code L:

                  0                  1
    ---------------------------------------
    S             (S B1)             -
    (S B1)        -                  (C1 S)
    (C1 S)        (C2 S)(C2 B1)      -
    (C2 S)        -                  (C3 C1)
    (C2 B1)       -                  (C3 S)
    (C3 C1)       (S C2)             -
    (C3 S)        (S B1)(S S)        -
    ---------------------------------------

5. A necessary and sufficient condition for a code to be UD is that the pair (S S) is not generated in the testing table. Since (S S) is generated in the testing table for code L, it is not UD. We don’t proceed further.

6. If the code is not UD, trace back from (S S) to S in order to find an ambiguous message:

    S --0--> (S B1) --1--> (C1 S) --0--> (C2 B1) --1--> (C3 S) --0--> (S S)

    Message 01010:  01 | 01 | 0  (BBA)  or  0 | 1010  (AC)

7. If no (S S) appears in the testing table, the code is UD. Then proceed to construct the testing graph. In the graph there is one vertex for S and one for each compatible pair. If the graph is loop-free, then the code is UDF-$\delta$ with $\delta$ = L + 1, where L is the length of the longest path in the graph.

Example. Code K = [A: 1, B: 10, C: 001]

    A:  S 1 S
    B:  S 1 B1 0 S
    C:  S 0 C1 0 C2 1 S

                  0             1
    ---------------------------------
    S             -             (S B1)
    (S B1)        (C1 S)        -
    (C1 S)        (C2 C1)       -
    (C2 C1)       -             -
    ---------------------------------

Testing graph for code K:

    S --1--> (S B1) --0--> (C1 S) --0--> (C2 C1)

No loops, so the code is UDF. L = 3, $\delta$ = L + 1 = 4.

DECIPHERING A CODED MESSAGE

Example. Code Q = [A: 11, B: 011, C: 001, D: 01, E: 00].

                  0                                             1
    ------------------------------------------------------------------
    S             (B1 C1)(B1 D1)(B1 E1)(C1 D1)(C1 E1)(D1 E1)    -
    (B1 C1)       -                                             -
    (B1 D1)       -                                             (S B2)
    (B1 E1)       -                                             -
    (C1 D1)       -                                             -
    (C1 E1)       (S C2)                                        -
    (D1 E1)       -                                             -
    (S B2)        -                                             (S A1)
    (S C2)        -                                             (S A1)
    (S A1)        -                                             (S A1)
    ------------------------------------------------------------------
No (S S) appears in the table, so code Q is UD. Construct the graph. The graph contains a loop:

    (S A1) --1--> (S A1)

Hence code Q is not UDF. To find an arbitrarily long coded message which does not allow the first code word to be identified, consider the paths starting from S and reaching this loop. There are two such paths:

    S --0--> (C1 E1) --0--> (S C2) --1--> (S A1)
    S --0--> (B1 D1) --1--> (S B2) --1--> (S A1)

Hence the two cases in which arbitrarily long coded messages do not allow the first code word to be identified are:

    Message:    0 0 1 1 1 1 1 . . .
    Reading 1:  0 0 | 1 1 | 1 1 | 1 . . .      E A A . . .
    Reading 2:  0 0 1 | 1 1 | 1 1 | . . .      C A A . . .

No matter how long you keep receiving 1111 ..., the ambiguity about the first code word being C or E will never be resolved.

    Message:    0 1 1 1 1 1 1 . . .
    Reading 1:  0 1 | 1 1 | 1 1 | 1 . . .      D A A . . .
    Reading 2:  0 1 1 | 1 1 | 1 1 | . . .      B A A . . .

Here the ambiguity is between D and B as the first code word.

But since the code is UD, we must be able to decipher any message uniquely once the end of the message is known. For example, if we know that the whole message is 0011111, it corresponds to CAA (i.e. the first code word is 001 = C). On the other hand, if the whole message is 001111, it corresponds to EAA (i.e. the first code word is 00 = E).

In general, given any coded (whole) message in this code, we can decipher it by working in the forward and in the backward directions, analogous to finding the input sequence of an IL machine. This is shown in the following example.

Suppose we want to decode the message 0011101100011010011, which is encoded using code Q. We scan the message in the forward direction (left to right) and insert a lower comma whenever a sequence which corresponds to a legitimate code word is detected. For example, the first lower comma from the left follows the initial 00, since 00 is a word in the code. Next, a comma follows the 1, since the sequence 001 is also a word in the code, and so on.
Although the tenth and eleventh symbols are 0’s, no lower comma is inserted between the eleventh and the twelfth symbols, because there is no comma between the ninth and the tenth symbols, and a new word cannot start unless a comma indicates the end of the preceding word.

Next, we scan the coded message in the backward direction (right to left) and insert an upper comma whenever a sequence which corresponds to the reverse of a legitimate code word is scanned. The reversed code words in our example are [11, 110, 100, 10, 00]. If the code is UD, the message can be decoded by retaining only those commas that occur in the upper and the lower spaces simultaneously:

    Upper commas (backward scan): before the first symbol and after symbols 1, 3, 5, 6, 8, 9, 10, 11, 13, 15, 16, 17
    Lower commas (forward scan):  after symbols 2, 3, 4, 5, 7, 8, 10, 12, 13, 15, 17, 18
    Retained commas:              after symbols 3, 5, 8, 10, 13, 15, 17

    0 0 1 | 1 1 | 0 1 1 | 0 0 | 0 1 1 | 0 1 | 0 0 | 1 1
      C      A      B      E      B      D     E     A

Although the above procedure will in general require keeping track of a number of sequences and the locations of the various commas, it is in principle a simple procedure that can be carried out by a finite-state machine (simple algorithm).

SYNCHRONIZABILITY OF CODES

SYNCHRONIZABILITY: A code is said to be synchronizable of order $\mu$ if $\mu$ is the least integer such that knowledge of any $\mu$ consecutive code symbols is sufficient to determine a separation of code words within these symbols. We shall restrict our attention to synchronizable codes which are UDF.

TEST FOR SYNCHRONIZABILITY

The problem of testing a code for synchronizability is analogous to the problem of testing a machine for finite output memory (FOM). In fact, since in both cases the objective is to specify the sequence at some point, we can use the same testing procedure.

Testing table for synchronizability:
Upper table: The row headings consist of all the separation symbols. The entries in row Ri, column xk, are the xk-successors of Ri.
Lower table: The row headings are all pairs of separation symbols. The entries in row (Ri Rj), column xk, are the pairs implied by (Ri Rj) under xk.
Testing graph for synchronizability: There is a vertex for each row in the lower table. A directed arc labeled xk leads from vertex (Ri Rj) to vertex (Rp Rq), where p $\ne$ q, if and only if (Rp Rq) is the xk-successor of (Ri Rj).

THEOREM. A code is synchronizable of order $\mu$ if and only if it is UD and its testing graph for synchronizability is loop-free, the order being $\mu$ = L + 1, where L is the length of the longest path in the graph.

Example. Code K = [A: 1, B: 10, C: 001]. This code is UD and UDF-4, as shown above.

    A:  S 1 S
    B:  S 1 B1 0 S
    C:  S 0 C1 0 C2 1 S

                  0             1
    ---------------------------------
    S             C1            (S B1)
    B1            S             -
    C1            C2            -
    C2            -             S
    ---------------------------------
    (S B1)        (S C1)        -
    (S C1)        (C1 C2)       -
    (S C2)        -             (S B1)
    (B1 C1)       (S C2)        -
    (B1 C2)       -             -
    (C1 C2)       -             -
    ---------------------------------

Graph:

    (B1 C1) --0--> (S C2) --1--> (S B1) --0--> (S C1) --0--> (C1 C2)        (B1 C2) is isolated

The graph is loop-free. L = 4, hence $\mu$ = L + 1 = 5.

Four symbols are not enough (for at least one case):

    Received:         0      1      0      0
    Reading 1:   B1  -->  S  -->  B1 -->  S  -->  C1      ... B | B | C ...
    Reading 2:   C1  -->  C2 -->  S  -->  C1 -->  C2      ... C | C ...

If the fifth symbol is 0, reading 1 applies, and the separation points are after the first and after the third symbols. If the fifth symbol is 1, reading 2 applies, and the separation point is after the second symbol.

Example. Code P = [A: 0, B: 11, C: 101]. This code is UDF-2.

    A:  S 0 S
    B:  S 1 B1 1 S
    C:  S 1 C1 0 C2 1 S

Testing table for UD:

                  0        1
    -----------------------------
    S             -        (B1 C1)
    (B1 C1)       -        -
    -----------------------------

Testing graph for UDF:

    S --1--> (B1 C1)

L = 1, $\delta$ = 2: UDF-2.

Testing table for synchronizability:

                  0             1
    -----------------------------------------
    S             S             (B1 C1)
    B1            -             S
    C1            C2            -
    C2            -             S
    -----------------------------------------
    (S B1)        -             (S B1)(S C1)
    (S C1)        (S C2)        -
    (S C2)        -             (S B1)(S C1)
    (B1 C1)       -             -
    (B1 C2)       -             (S S)
    (C1 C2)       -             -
    -----------------------------------------

Testing graph for synchronizability:

    (S B1) --1--> (S B1)
    (S B1) --1--> (S C1)
    (S C1) --0--> (S C2)
    (S C2) --1--> (S B1)
    (S C2) --1--> (S C1)

There are three loops. Code P is not synchronizable. Some of the arbitrarily long messages which do not allow a separation point to be identified are given below.

Corresponding to the loop (S B1)–(S B1), message ... 1 1 1 1 ...:

    Reading 1:  ... | 1 1 | 1 1 | ...          B B ...
    Reading 2:  ... 1 | 1 1 | 1 ...            B B B ..., shifted by one symbol

Corresponding to the loop (S B1)–(S C1)–(S C2)–(S B1), message ... 1 0 1 1 0 1 ...:

    Reading 1:  ... | 1 0 1 | 1 0 1 | ...      C C ...
    Reading 2:  ... 1 | 0 | 1 1 | 0 | 1 ...    B A B A B ...
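The table constructions used in these notes (Even's test for UD and UDF, and the pair graph for synchronizability) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not a reference implementation: the function names `successors`, `implied`, `longest_path`, `ud_test`, and `sync_order` are hypothetical, a code is assumed to be a dictionary from source symbols to nonempty code words, and pairs are stored as sorted tuples, so (S B1) appears as `("B1", "S")`. Following the theorem above, `sync_order` is only meaningful for a code that `ud_test` reports as UD.

```python
from itertools import combinations


def successors(code):
    """Map (separation symbol, code symbol) to the set of successor symbols."""
    succ = {}
    for name, word in code.items():
        left = "S"
        for i, x in enumerate(word, start=1):
            right = "S" if i == len(word) else f"{name}{i}"
            succ.setdefault((left, x), set()).add(right)
            left = right
    return succ


def implied(vertex, x, succ):
    """Pairs implied by a vertex under x, as sorted tuples; may include (S, S)."""
    if len(vertex) == 1:  # start row S: pair up its own distinct x-successors
        return set(combinations(sorted(succ.get((vertex[0], x), ())), 2))
    p, q = vertex
    return {tuple(sorted((a, b)))
            for a in succ.get((p, x), ()) for b in succ.get((q, x), ())}


def longest_path(edges):
    """Number of arcs on the longest path, or None if the graph has a loop."""
    state, depth = {}, {}

    def visit(v):
        if state.get(v) == "open":  # v is already on the current path: loop
            return None
        if state.get(v) == "done":
            return depth[v]
        state[v] = "open"
        best = 0
        for w in edges.get(v, ()):
            d = visit(w)
            if d is None:
                return None
            best = max(best, d + 1)
        state[v], depth[v] = "done", best
        return best

    lengths = [visit(v) for v in edges]
    return None if None in lengths else max(lengths, default=0)


def ud_test(code):
    """Return (is_ud, delay); delay is None if the code is not UDF."""
    if len(set(code.values())) < len(code):
        return False, None  # singular code: not UD
    succ = successors(code)
    alphabet = sorted({x for word in code.values() for x in word})
    edges, todo = {}, [("S",)]
    while todo:
        v = todo.pop()
        if v in edges:
            continue
        edges[v] = set().union(*(implied(v, x, succ) for x in alphabet))
        if ("S", "S") in edges[v]:
            return False, None  # (S S) generated: not UD
        todo.extend(edges[v])
    length = longest_path(edges)
    return True, (None if length is None else length + 1)


def sync_order(code):
    """Order given by the pair graph (longest path + 1), or None if it has a loop."""
    succ = successors(code)
    alphabet = sorted({x for word in code.values() for x in word})
    symbols = sorted({"S"} | {r for rs in succ.values() for r in rs})
    edges = {pair: {t for x in alphabet for t in implied(pair, x, succ)
                    if t[0] != t[1]}
             for pair in combinations(symbols, 2)}
    length = longest_path(edges)
    return None if length is None else length + 1


# The four codes used in the examples of these notes.
CODE_L = {"A": "0", "B": "01", "C": "1010"}
CODE_K = {"A": "1", "B": "10", "C": "001"}
CODE_Q = {"A": "11", "B": "011", "C": "001", "D": "01", "E": "00"}
CODE_P = {"A": "0", "B": "11", "C": "101"}

for label, code in [("L", CODE_L), ("K", CODE_K), ("Q", CODE_Q), ("P", CODE_P)]:
    print(label, ud_test(code), sync_order(code))
```

On these four codes the sketch agrees with the hand-built tables: code L is not UD, code K is UDF-4 and synchronizable of order 5, code Q is UD but not UDF, and code P is UDF-2 but not synchronizable. The sketch reports only the verdict; recovering an ambiguous message, as in step 6 of Even's test, would additionally require recording the arc labels along the path to (S S).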