CODES WITH CONSTRAINTS

Example. Code T (no constraints):

  Word      State path                        Example
  A = 0     S --0--> S                        “a”   (a-ta)
  B = 10    S --1--> B1 --0--> S              “ba”  (ba-ba)
  C = 01    S --0--> C1 --1--> S              “at”  (at-kı)
  D = 101   S --1--> D1 --0--> D2 --1--> S    “kat” (kat-kı)

Test table for code T:

       S        SC1          B1D1    SB1         SD1           SD2          C1D2
  0    (SC1)    -            (SD2)   (SC1)(SS)   (C1D2)(SD2)   -            -
  1    (B1D1)   (SB1)(SD1)   -       -           -             (SB1)(SD1)   (SS)

The pair (SS) appears in the table, so code T is not UD.

Some ambiguous messages in code T:

  I)   S --0--> SC1 --1--> SB1 --0--> SS:
       010   = AB (a-ta)     or   CA (at-a)
  II)  0101  = AD (a-teş)    or   CC (at-eş)
  III) 1010  = BB (ka-ra)    or   DA (kar-a)
  IV)  10101 = BD (ya-tak)   or   DC (yat-ak)

In each case the first reading is correct and the second is wrong according to the “rules” of Turkish.

Let code U = code T with these constraints:

  A never follows C = CA forbidden
  C never follows C = CC forbidden
  A never follows D = DA forbidden
  C never follows D = DC forbidden
  (no transitions QC -> PA, QC -> PC, QD -> PA, QD -> PC)

“A code word starting with a 0 never follows a code word ending with a 1” = “A syllable starting with a vowel never follows a syllable ending with a consonant”

  Word      State path
  A = 0     PA --0--> QA
  B = 10    PB --1--> B1 --0--> QB
  C = 01    PC --0--> C1 --1--> QC
  D = 101   PD --1--> D1 --0--> D2 --1--> QD

Test table for code U:

       P        QAC1           B1D1     QCB1   QCD1   QBD2           QDB1   QDD1
  0    (QAC1)   -              (QBD2)   -      -      -              -      -
  1    (B1D1)   (QCB1)(QCD1)   -        -      -      (QDB1)(QDD1)   -      -

No (QxQy) pair in the table: Code U is UD.
Construct the testing graph. No loops in the graph: Code U is UDF.
The longest path in the graph has length L = 3, hence the code is UDF-4. In other words, knowledge of the first 4 code symbols suffices to determine the first code word, but 3 symbols may not be sufficient.

Worst case example: Consider a longest path in G, say 101. When we receive 101, we cannot decide whether this is word D, or word B (= 10) followed by a word D that is just starting.

Corresponding hyphenation problem: Receiving “bat”, we cannot decide whether the first syllable is “bat” (as, say, in “bat-mak”) or “ba” (as in “ba-tak”). But knowing the fourth letter (“m” in “batmak”, “a” in “batak”) we can decide uniquely.

Decoding automaton example (for code U)

The input alphabet is {0, 1, #}, where 0 is used for V, 1 for C, and # means “end of word reached”. The output alphabet is {A, B, C, D, *}, where letters stand for syllable types, and * is an error message.
The automaton does not give an output for some transitions (you may visualize this as M giving an output “wait”). All # transitions go to the initial state, whose label is the empty string. The string that labels a state corresponds to the not yet syllabified portion of the word (starting from the previous syllable breakpoint).

Here are some examples of M’s hyphenation (“-” means no output):

  word     b    a    t    m    a    k
  input    1    0    1    1    0    1    #
  state    1    10   101  1    10   101
  output   -    -    -    D    -    -    D

  word     b    a    t    a    k
  input    1    0    1    0    1    #
  state    1    10   101  10   101
  output   -    -    -    B    -    D

  word     k    o    c    a    e    l    i
  input    1    0    1    0    0    1    0    #
  state    1    10   101  10   0    01   10
  output   -    -    -    B    B    -    A    B

Code U, having four code words, corresponds to a subset of the syllable structure of Turkish. “Proper” Turkish (i.e. excluding foreign borrowings) has six syllable types:

  structure   0       10      01       101      011    1011
              V       CV      VC       CVC      VCC    CVCC
  example     a-dam   ba-ba   ek-mek   al-tın   erk    kork

This code, under the same constraint set (i.e. “A syllable starting with a vowel never follows a syllable ending with a consonant”), can be shown to be UD and UDF-5.

A worst case example (4 is not enough): “kork”. Is the first syllable “kor” or “kork”?
  If the fifth letter is V, “kor”: kor-ku.
  If the fifth letter is C, “kork”: kork-mak.

With the huge influx of borrowings (mainly from French and English) the syllable structure of Turkish has been completely changed. As far as I can detect, there are 8 to 10 new syllable types, in addition to the 6 types of Turkish proper.
The new syllable types are:

  structure   example
  CCV         tra-fik
  VCCC        [antr-lok] (potential)
  CCVC        kon-trol
  CCCV        stra-teji
  CVCCC       kontr-bas
  CCVCC       trans-fer
  CCCVC       strik-nin
  CCVCCC      sfenks
  CCCVCC      sprink-ler
  CCCVCCC     [strontr-yum] (potential)

A search through nearly all the loan words reveals that if the code alphabet is extended from binary {V, C} to ternary {V, R (= the letter “r”), T (= consonant other than “r”)}, then the code becomes UDF-7 for all the proper Turkish words plus most of the loan words. A decoding automaton designed for this purpose has been presented in

  Güney Gönenç, “A finite state automaton for syllabification of Turkish words”, Proc. 6th International Symposium on Computer and Information Sciences (ISCIS), vol. 2, pp. 1039-1046, Elsevier, Amsterdam, 1991.
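
The claims made earlier about the binary {V, C} codes (code T is ambiguous, code U and the six-type code parse uniquely under the constraint) can be checked by brute force. The Python sketch below is not the automaton M and not the automaton of the cited paper: it simply enumerates every split of a word into code words. The names to_bits, parses, hyphenate, CODE_T and SIX_TYPES are hypothetical, and the input is assumed to be a lowercase word written with the eight standard Turkish vowels.

```python
VOWELS = set("aeıioöuü")  # assumption: lowercase input, eight standard vowels

CODE_T = ("0", "10", "01", "101")        # V, CV, VC, CVC (code words A, B, C, D)
SIX_TYPES = CODE_T + ("011", "1011")     # plus VCC and CVCC


def to_bits(word):
    """Encode a word over {0, 1}: vowel -> '0', consonant -> '1'."""
    return "".join("0" if ch in VOWELS else "1" for ch in word)


def parses(bits, codewords, constrained=True):
    """Return every split of `bits` into code words, as lists of strings.

    With constrained=True, a code word starting with '0' may not follow a
    code word ending with '1' (the constraint that turns code T into code U).
    """
    results = []

    def extend(pos, words):
        if pos == len(bits):
            results.append(words)
            return
        for w in codewords:
            if not bits.startswith(w, pos):
                continue
            if (constrained and words
                    and words[-1].endswith("1") and w.startswith("0")):
                continue  # forbidden pair such as CA, CC, DA, DC
            extend(pos + len(w), words + [w])

    extend(0, [])
    return results


def hyphenate(word, codewords=SIX_TYPES):
    """Hyphenate `word` if its constrained parse is unique, else raise."""
    found = parses(to_bits(word), codewords)
    if len(found) != 1:
        raise ValueError(f"{word!r} has {len(found)} parses")
    pieces, pos = [], 0
    for w in found[0]:
        pieces.append(word[pos:pos + len(w)])
        pos += len(w)
    return "-".join(pieces)


if __name__ == "__main__":
    print(parses("010", CODE_T, constrained=False))  # both readings, AB and CA
    print(parses("010", CODE_T))                     # only AB survives
    for w in ("batmak", "batak", "kocaeli", "korku", "korkmak"):
        print(hyphenate(w))
```

Unlike M, this search reads the whole word before answering, so it says nothing about the delay (UDF-4, UDF-5); it only confirms which parses exist with and without the constraint.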