CODES WITH CONSTRAINTS

Example. Code T (no constraints):

word       automaton path       example syllable
A = 0      S 0 S                “a”   (a-ta)
B = 10     S 1 B1 0 S           “ba”  (ba-ba)
C = 01     S 0 C1 1 S           “at”  (at-kı)
D = 101    S 1 D1 0 D2 1 S      “kat” (kat-kı)

pair      0              1
S         (SC1)          (B1D1)
SC1       -              (SB1)(SD1)
B1D1      (SD2)          -
SB1       (SC1)(SS)      -
SD1       (C1D2)(SD2)    -
SD2       -              (SB1)(SD1)
C1D2      -              (SS)

The pair (SS) appears in the table, so code T is not uniquely decodable (UD).
Some ambiguous messages in code T (the first reading is correct and the second is wrong according to the “rules” of Turkish):

        message     correct            wrong
I)      010      =  AB (a-ta)      or  CA (at-a)
II)     0101     =  AD (a-teş)     or  CC (at-eş)
III)    1010     =  BB (ka-ra)     or  DA (kar-a)
IV)     10101    =  BD (ya-tak)    or  DC (yat-ak)

Path in the table for I): S -0-> (SC1) -1-> (SB1) -0-> (SS)
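
The four ambiguities can be checked mechanically. The following is a minimal Python sketch, not part of the original slides; the names CODE_T and parses are illustrative. It lists every way of cutting a message into code words of T.

```python
# Code T from the slides: word name -> binary code word (0 = vowel, 1 = consonant).
CODE_T = {"A": "0", "B": "10", "C": "01", "D": "101"}

def parses(message, code=CODE_T):
    """Return every split of `message` into code words, as strings of word names."""
    if message == "":
        return [""]          # exactly one way to parse the empty message
    results = []
    for name, word in code.items():
        if message.startswith(word):
            for rest in parses(message[len(word):], code):
                results.append(name + rest)
    return results

for msg in ["010", "0101", "1010", "10101"]:
    print(msg, parses(msg))
# 010 ['AB', 'CA']
# 0101 ['AD', 'CC']
# 1010 ['BB', 'DA']
# 10101 ['BD', 'DC']
```

Each message has two parses, which is exactly what the pair (SS) in the table signals.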
Let code U = code T with these constraints:
A never follows C = CA forbidden
C never follows C = CC forbidden
A never follows D = DA forbidden
C never follows D = DC forbidden
(QC is never followed by PA or PC; QD is never followed by PA or PC)
“A code word starting with a 0 never follows a
code word ending with a 1” =
“A syllable starting with a vowel never follows a
syllable ending with a consonant”
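
To see what the constraint buys, here is a small Python sketch (illustrative only; CODE_U, allowed and parses_u are assumed names, not from the slides). It enumerates the parses of a message while rejecting any word that starts with 0 directly after a word that ends with 1.

```python
# Code U uses the same code words as code T; only the adjacency rule is new.
CODE_U = {"A": "0", "B": "10", "C": "01", "D": "101"}

def allowed(prev, nxt):
    """True unless `nxt` starts with 0 while `prev` ends with 1 (CA, CC, DA, DC)."""
    return not (CODE_U[prev].endswith("1") and CODE_U[nxt].startswith("0"))

def parses_u(message, prev=None):
    """All parses of `message` into words of code U that respect the constraint."""
    if message == "":
        return [""]
    results = []
    for name, word in CODE_U.items():
        if message.startswith(word) and (prev is None or allowed(prev, name)):
            results += [name + rest for rest in parses_u(message[len(word):], name)]
    return results

for msg in ["010", "0101", "1010", "10101"]:
    print(msg, parses_u(msg))
# 010 ['AB']
# 0101 ['AD']
# 1010 ['BB']
# 10101 ['BD']
```

Only the reading that agrees with Turkish hyphenation survives for each message.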

word       automaton path
A = 0      PA 0 QA
B = 10     PB 1 B1 0 QB
C = 01     PC 0 C1 1 QC
D = 101    PD 1 D1 0 D2 1 QD

pair      0          1
P         (QAC1)     (B1D1)
QAC1      -          (QCB1)(QCD1)
B1D1      (QBD2)     -
QCB1      -          -
QCD1      -          -
QBD2      -          (QDB1)(QDD1)
QDB1      -          -
QDD1      -          -
No (QxQy) pair in the table: Code U is UD.
Construct the testing graph G.
No loops in the graph: Code U is UDF.
The longest path in G has length L = 3, hence the code is UDF-4; in other words, knowledge of the first 4 code symbols suffices to determine the first code word, but 3 symbols may not be sufficient.
Worst case example: Consider a longest path in G, say 101. When we receive 101, we cannot decide whether this is word D, or word B (= 10) followed by a word D that is just starting. Corresponding hyphenation problem: Receiving “bat”, we cannot decide whether the first syllable is “bat” (as, say, in “bat-mak”) or “ba” (as in “ba-tak”). But knowing the fourth letter (“m” or “a” in these two words) we can uniquely decide which.
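
The table, the UD check and the value L = 3 can be reproduced programmatically. The sketch below is one possible encoding, not the construction used in the slides: a state is assumed to be a pair (word, symbols read), so that ("B", 2) plays the role of QB and ("D", 1) the role of D1. All function names are hypothetical.

```python
from itertools import combinations

WORDS = {"A": "0", "B": "10", "C": "01", "D": "101"}
START = ("P", 0)                      # the start state P

def may_follow(prev, nxt):
    """Code U constraint: a word starting with 0 never follows a word ending with 1."""
    return not (WORDS[prev].endswith("1") and WORDS[nxt].startswith("0"))

def is_end(state):
    """True for the states Q_x (a whole word has just been read)."""
    name, pos = state
    return name != "P" and pos == len(WORDS[name])

def step(state, symbol):
    """Set of states reachable from `state` by reading one symbol."""
    name, pos = state
    if state == START or is_end(state):
        return {(w, 1) for w in WORDS
                if WORDS[w][0] == symbol and (state == START or may_follow(name, w))}
    return {(name, pos + 1)} if WORDS[name][pos] == symbol else set()

def is_ambiguous(node):
    """True for a (Qx Qy) pair: two different parses end at a word boundary."""
    return node != START and all(is_end(s) for s in node)

def pair_table():
    """Rows of the table: START and every reachable pair, with successors per symbol."""
    table, todo = {}, [START]
    while todo:
        node = todo.pop()
        if node in table:
            continue
        table[node] = {"0": set(), "1": set()}
        if is_ambiguous(node):
            continue                  # ambiguity found; no need to expand further
        for symbol in "01":
            if node == START:
                succ = set(combinations(sorted(step(START, symbol)), 2))
            else:
                a, b = node
                succ = {tuple(sorted((x, y)))
                        for x in step(a, symbol) for y in step(b, symbol)}
            table[node][symbol] = succ
            todo.extend(succ)
    return table

def longest_path(table, node=START, on_path=()):
    """Number of symbols on the longest path from `node`; None if the graph has a loop."""
    if node in on_path:
        return None
    best = 0
    for successors in table[node].values():
        for nxt in successors:
            sub = longest_path(table, nxt, on_path + (node,))
            if sub is None:
                return None
            best = max(best, 1 + sub)
    return best

table = pair_table()
print(len(table))                                  # 8 rows, as in the table for code U
print([n for n in table if is_ambiguous(n)])       # [] : no (QxQy) pair, so UD
print(longest_path(table))                         # 3 : L = 3, hence UDF-4
```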
Decoding automaton example (for code U)
The input alphabet is {0, 1, #}, where 0 is used for V, 1 for C, and # means “end of word reached”. The output alphabet is {A, B, C, D, *}, where the letters stand for syllable types and * is an error message. The automaton, M, does not give an output for some transitions (you may visualize this as M giving the output “wait”, shown as - in the traces). All # transitions go to the initial state, whose label is the empty string (written ε here). The string labelling a state corresponds to the yet unsyllabified portion of the word (starting from the previous syllable breakpoint).
Here are some examples of M’s hyphenation:
word          b    a    t    m    a    k
input         1    0    1    1    0    1    #
state    ε    1    10   101  1    10   101  ε
output        -    -    -    D    -    -    D
----------------------------------------------------------------------
word          b    a    t    a    k
input         1    0    1    0    1    #
state    ε    1    10   101  10   101  ε
output        -    -    -    B    -    D
----------------------------------------------------------------------
word          k    o    c    a    e    l    i
input         1    0    1    0    0    1    0    #
state    ε    1    10   101  10   0    01   10   ε
output        -    -    -    B    B    -    A    B
----------------------------------------------------------------------
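
The state diagram of M is not reproduced in this text. The Python sketch below uses a transition table reconstructed from the three traces and from the rule that a state is labelled by the unsyllabified portion; it is an assumption, not the original automaton. A missing transition is taken to be the error output *, and the names M, LENGTH, VOWELS and hyphenate are illustrative.

```python
# (state, input) -> (next state, output); None means "wait" (no output).
M = {
    ("", "0"): ("0", None),     ("", "1"): ("1", None),      ("", "#"): ("", None),
    ("0", "0"): ("0", "A"),     ("0", "1"): ("01", None),    ("0", "#"): ("", "A"),
    ("1", "0"): ("10", None),
    ("10", "0"): ("0", "B"),    ("10", "1"): ("101", None),  ("10", "#"): ("", "B"),
    ("01", "0"): ("10", "A"),   ("01", "1"): ("1", "C"),     ("01", "#"): ("", "C"),
    ("101", "0"): ("10", "B"),  ("101", "1"): ("1", "D"),    ("101", "#"): ("", "D"),
}
LENGTH = {"A": 1, "B": 2, "C": 2, "D": 3}   # letters per syllable type
VOWELS = set("aeıioöuü")                    # lowercase Turkish vowels

def hyphenate(word):
    """Return the syllables of a lowercase word, or None on the error output *."""
    state, pos, syllables = "", 0, []
    for letter in word + "#":
        symbol = "#" if letter == "#" else ("0" if letter in VOWELS else "1")
        if (state, symbol) not in M:
            return None                     # assumed error behaviour: stop with *
        state, output = M[(state, symbol)]
        if output is not None:              # a syllable breakpoint has been decided
            syllables.append(word[pos:pos + LENGTH[output]])
            pos += LENGTH[output]
    return syllables

print(hyphenate("batmak"))    # ['bat', 'mak']
print(hyphenate("batak"))     # ['ba', 'tak']
print(hyphenate("kocaeli"))   # ['ko', 'ca', 'e', 'li']
print(hyphenate("tren"))      # None: a word starting with CC is outside code U
```

Note that an output is always produced one symbol after the syllable it names has ended, which is the UDF-4 look-ahead at work.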
Code U, having four code words, corresponds to a subset of the syllable structure of Turkish. “Proper” Turkish (i.e. excluding foreign borrowings) has six syllable types:

structure          example
0       V          a-dam
10      CV         ba-ba
01      VC         ek-mek
101     CVC        al-tın
011     VCC        erk
1011    CVCC       kork

This code, under the same constraint set (i.e. “A syllable starting with a vowel never follows a syllable ending with a consonant”), can be shown to be UD and UDF-5.
A worst case example (4 is not enough):
“kork”: Is the first syllable “kor” or “kork”?
If the fifth letter is V, “kor”: kor-ku.
If the fifth letter is C, “kork”: kork-mak.
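
The delays quoted for the two codes can be confirmed by brute force. The following Python sketch is an illustration under stated assumptions, not the method of the slides: it enumerates all allowed word sequences of up to k words, appends the end marker #, and looks for the smallest k such that the first k symbols always determine the first word. The end marker is counted as a symbol, and the names messages and decoding_delay are hypothetical.

```python
def messages(words, max_words):
    """Yield (first word, symbols + '#') for every allowed sequence of 1..max_words words."""
    def extend(seq):
        yield seq
        if len(seq) < max_words:
            for w in words:
                # constraint: a word starting with 0 never follows a word ending with 1
                if not (seq[-1].endswith("1") and w.startswith("0")):
                    yield from extend(seq + [w])
    for w in words:
        for seq in extend([w]):
            yield seq[0], "".join(seq) + "#"

def decoding_delay(words, max_k=8):
    """Smallest k for which the first k symbols fix the first word (None if > max_k)."""
    for k in range(1, max_k + 1):
        first_word_of = {}
        # k words always cover at least k symbols, so sequences of k words suffice
        if all(first_word_of.setdefault(text[:k], first) == first
               for first, text in messages(words, k)):
            return k
    return None

print(decoding_delay(["0", "10", "01", "101"]))                  # 4 (code U)
print(decoding_delay(["0", "10", "01", "101", "011", "1011"]))   # 5 (six syllable types)
```

For k = 4 the six-type code fails on the prefix 1011, which is the “kork” case: it may be the whole word, or “kor” followed by a syllable starting with a consonant.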
With the huge influx of borrowings (mainly from French and English) the syllable structure of Turkish has changed completely. As far as I can detect, there are 8 to 10 new syllable types, in addition to the 6 types of Turkish proper. These are:

structure    example
CCV          tra-fik
VCCC         [antr-lok] (potential)
CCVC         kon-trol
CCCV         stra-teji
CVCCC        kontr-bas
CCVCC        trans-fer
CCCVC        strik-nin
CCVCCC       sfenks
CCCVCC       sprink-ler
CCCVCCC      [strontr-yum] (potential)

A search through nearly all the loan words reveals that if the code alphabet is extended from binary {V, C} to ternary {V, R (= the letter “r”), T (= a consonant other than “r”)}, then the code becomes UDF-7 for all the proper Turkish words plus most of the loan words. A decoding automaton designed for this purpose has been presented in
Güney Gönenç, “A finite state automaton for syllabification of Turkish words”, Proc. 6th International Symposium on Computer and Information Sciences (ISCIS), vol. 2, pp. 1039-1046, Elsevier, Amsterdam, 1991.