Hebrew vowel restoration using a bigram HMM model
Kobi Gal
Department of Engineering and Applied Sciences
Harvard University
<gal@eecs.harvard.edu>
CS 287r – Natural Language Processing
May 16th, 2001
Abstract
We present the problem of vowel restoration in Hebrew and attempt to solve it using a
unigram baseline model and a bigram HMM model, with Katz's method for
integrating smoothing and discounting. We achieve 68% word accuracy with our baseline
and 80% word accuracy with the HMM model. Most of the errors are due to unseen
bigrams and unigrams and to the large number of hidden states with low counts. We
conclude that some degree of morphological analysis would greatly assist us in
improving word accuracy.
1. Introduction
In modern Hebrew, written texts contain only consonants. Hebrew has a rich
morphology, and due to missing vowels considerable ambiguity exists already at the
word level. This is because many words with different vowel structures may contain the
same consonants in a vowel-less setting. For example, the non-voweled word ספר,
written in Latin transliteration as SPR¹, may represent the noun “book” (pronounced
/sepher/²), the verb “to count” /saphar/ in the 3rd person singular form, and at least 4 other
interpretations³. This is further complicated by the fact that verb conjugations,
connectives and short particles in Hebrew are all attached to the word as prefixes or
suffixes, resulting in further ambiguity. For example, short particles, such as the definite
article ha- (=the), represented by the letter ה, the connective w (=and), represented by the
letter ו, and the subordinator $e (=of, which), represented by the letter ש, and many short
prepositions, are all written with no space before the following word. The result is that
a string of letters may represent one word, a combination of a particle and a word, or
even several particles and a word (Ornan et al., 95). For example, the word ברכות, written
in Latin transliteration as BRKWT, may represent the Hebrew noun “congratulations”
/brahot/, or the Hebrew noun plus preposition “with softness” /be-rakut/. The word היתר
can represent the Hebrew noun “permit” /hey-ter/ or the Hebrew noun plus definite article “the
rest” /ha-yeter/.
¹ For a full table of the Latin transliteration of Hebrew, see Appendix B.
² Italic letters between slashes are the approximate phonetic pronunciation of the word.
³ The complete list of possible interpretations of this word is given in Appendix C.
When faced with modern Hebrew script, a reader is faced with a double
problem. The reader must not only recognize each word in order to vocalize it, but,
when a consonant string is ambiguous, must also choose the right reading in context. This
is hardly a simple task, for the number of ambiguous representations of words may be
fairly large. Amazingly, in most cases a native speaker of the language can accurately
vocalize each word. Native speakers do so on the basis of their command of the grammar, their
acquaintance with the lexicon, their knowledge of semantic connections, and their
understanding of the real world. In this paper, we devise a model that
employs contextual information and attempts to restore the proper vowel structure of each
word. Note that this is a harder problem than restoring the vocal form of the word, as
many different vowel forms sound the same⁴. For applications such as automatic speech
synthesizers, merely restoring the vocal pattern of the sentence would be sufficient.
However, we wish to fully annotate the vowel pattern of the word, so that non-voweled
text may be restored to its original form and benefit foreign speakers, people with
dyslexia, and children learning the language.
2. Previous and related work
I have located a commercial product named "Nakdan Text" (Choueka, 89). Given a
sentence in modern Hebrew, this program will restore its vowels by first activating a
function that finds all possible morphological analyses and attached vocalizations of
every word in the sentence. Then, "Nakdan Text" chooses for every such word the correct
context-dependent vowel pattern using short-context syntactical rules as well as some
probabilistic and statistical modules. The authors claim to achieve 95% “correctness” in
restoring vowel patterns. It is not clear whether this refers to word accuracy or letter accuracy⁵.
This program was demonstrated at BISFAI-95, the fifth Bar Ilan international symposium
on Artificial Intelligence, but no summary or article was included in its proceedings, and
to the best of my knowledge no article has ever been published describing this work.
Also, my several attempts at contacting the authors have been unsuccessful.
⁴ For a full table of the Hebrew vowels, see Appendix A.
Yarowsky (Yarowsky, 92) has implemented several techniques for the restoration of
accents in French and Spanish. The techniques in question were Bayesian
classification and decision lists, the latter producing the best results. Decision
lists are a hybrid approach combining the strengths of both HMMs and Bayesian
classifiers, taking into account both local and long-term dependencies between words as
well as integrating different types of rules. We believe decision lists to be applicable to
the Hebrew vowel restoration problem and in future work would like to compare their
performance with that of our bigram HMM model⁶.
Levinger and Ornan (Levinger and Ornan, 91) have conducted research in the related
field of morphological analyzers for Hebrew. They attempt to decompose a word into its
distinct features, such as part of speech, lexical entry, tense (verbs only), gender, number,
and person of the word, as well as gender, number, and person of any pronoun suffixes.
For example, the morphological analysis of the Hebrew string וכשבאתי, transliterated
in Latin as WK$B^TY and pronounced /ve-kshe-bati/, is as follows:

• lexical entry – the verb בא, transliterated in Latin as B^ (=to come)
• category – verb
• tense – past
• feminine/masculine, first person singular
• object pronoun – masculine, singular, 3rd person
Clearly, a full morphological analysis of Hebrew would also solve the vowel restoration
problem, because it would disambiguate every word and we could simply look up the
appropriate meaning in the dictionary.
⁵ Word accuracy and letter accuracy measure the percentage of correctly vowel-annotated words or letters,
respectively.
⁶ Due to time limitations and the limited scope of this paper, decision lists have not been implemented in
this work.
Segel (Segel, 97) devised a statistical Hebrew lexical analyzer that takes non-voweled
Hebrew text as input and achieves 95% word accuracy on test data extracted from the Israeli
newspaper Ha-aretz. However, this method requires fully analyzed Hebrew text to train
on. The author used a hand-analyzed training set consisting of only 500 sentences.
Because there is no treebank of analyzed Hebrew text, this method would not carry over
well to other domains, such as novels or medical texts. To date, no statistical
language model of Hebrew has used an N-gram model⁷.
3. Our Approach
We have stated above that contextual information is important in deciphering lexical
ambiguities in Hebrew and is commonly used by native speakers. We believe that an
HMM bigram would sufficiently capture the contextual dependencies between words.
We initially considered using a trigram approach but decided that the added cost in
development time brought about by the increase in the number of hidden states and the
complex smoothing techniques exceeds the limit for this project.
We will evaluate performance by checking the percentage of words in the test set
that receive the correct vowel pattern. We refer to this performance measure as word
accuracy percentage. Because our primary purpose is to assist in the reading of
modern Hebrew, we will not check the percentage of correctly annotated syllables/letters⁸.
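The word accuracy measure described above can be sketched in Python as follows. This is only a minimal illustration, assuming the predicted and reference texts are already aligned word for word; the function name `word_accuracy` is hypothetical and is not taken from the original system.

```python
def word_accuracy(predicted, reference):
    """Return the percentage of words whose vowel-annotated form matches the reference.

    Both arguments are lists of vowel-annotated words, aligned position by position
    (an assumption of this sketch).
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same number of words")
    if not reference:
        return 0.0
    correct = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * correct / len(reference)
```

A word counts as correct only if its entire vowel pattern matches, which is why this measure is stricter than a per-letter score.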
4. Corpus – The Westminster Hebrew Morphological Database
There is an unfortunate lack of data for vowel-annotated text in modern Hebrew. An
obvious substitute is the Hebrew bible. Ancient Hebrew bears enough syntactical and
semantic resemblance to modern Hebrew to warrant usage of a bible corpus. The
Westminster Hebrew Morphological Database is a corpus carrying a complete
transcription of the graphical form of the Massoretic text of the Hebrew bible and
contains 301,2670 words. The text is coded in Latin letters. Apart from the vowels, the
corpus contains the following: open and closed paragraph marks, the accentuation signs,
and morphological divisions for certain morphemes. The codes for these different details
are interspersed with one another. Ninety percent of the corpus was used as training data,
and 10 percent as test data. Given the relatively small size of the training data,
multifold testing and training on different segments of the data might improve the robustness
of the results, but due to time constraints this was not done.
⁷ Levinger and Ornan state that n-grams are inadequate for Hebrew NLP because of the "free word form"
nature of Hebrew. I believe the Hebrew word form to be no freer than that of English.
⁸ Note that it is also possible to divide the 16 Hebrew vowels into 5 separate phonological groups. It would make
sense to check how successful the algorithm is in assigning the correct vowel group to a syllable/letter, as a
correct group assignment of a vowel would adequately assist the reading of a text.
5. Base line – A unigram model
A frequency table was established including a separate entry for each vowel-annotated
word in the training data. We counted the number of times each vowel-annotated word
appeared in the set. In the testing phase, we search through all of the words that have the
same consonant structure in the table and pick the vowel-annotated word that has the
highest count. The following chart describes ambiguity distribution among the training
set.
Chart 1 – ambiguity distribution in the training set (x-axis: number of ambiguities; y-axis: percentage of training set). Bar values, in percent: 33, 24, 18.1, 9, 5.5, 5.3, 2.2, 0.8, 0.7, 0.3, 0.6, 0.04, 0.3, 0.2.
Note that only 33% of the training set was unambiguous. We achieved an overall success
rate of 68% using the base line technique (see results section).
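As an illustration of the baseline, the following Python sketch builds the frequency table and picks the most frequent vowel-annotated form for a given consonant string. It assumes the training data is available as (consonant string, vowel-annotated word) pairs. The names `train_unigram` and `restore_unigram` are hypothetical, and returning `None` for an unseen consonant string is an assumption of the sketch, not necessarily the behaviour of the original program.

```python
from collections import Counter, defaultdict


def train_unigram(pairs):
    """Build a table: consonant string -> Counter of vowel-annotated forms.

    pairs: iterable of (consonant_string, voweled_word) tuples from the training data.
    """
    table = defaultdict(Counter)
    for skeleton, voweled in pairs:
        table[skeleton][voweled] += 1
    return table


def restore_unigram(table, skeleton):
    """Return the most frequent vowel-annotated form, or None if the string is unseen."""
    candidates = table.get(skeleton)
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Toy usage with transliterated forms (illustrative data only).
training = [("SPR", "sepher"), ("SPR", "sepher"), ("SPR", "saphar"), ("BR^", "bara")]
table = train_unigram(training)
best = restore_unigram(table, "SPR")
```

Because the choice ignores context, every occurrence of an ambiguous consonant string receives the same vowel pattern, which is the limitation the bigram model is meant to address.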
6. Bigram model
We constructed a bigram Hidden Markov Model in which the hidden states are vowel-annotated
words and the visible states are vowel-less words. One example of a path in the
HMM is given in chart 2. In our model, there is only one emission from each hidden
state to an observation. Therefore the probability of a sentence \(w_{1,n}\) is the sum, over all
sequences of hidden states that the HMM may traverse while generating \(w_{1,n}\) starting at a
particular hidden state, of the product of the transition probabilities along the sequence:
\[ \Pr(w_{1,n}) = \sum_{T_{1,n+1}} \prod_{i} \Pr(T_i \mid T_{i-1}) \]
These probabilities are approximated by bigram counts (see chart 2). Note that two
special symbols, shown as # #, were prepended to words in the test set. They serve to
“anchor” the initial state of the HMM and facilitate computation.
Chart 2 – HMM model for the non-vowel-annotated phrase “in the beginning god
created…” בראשית ברא אלוהים, transliterated in Latin as BR^$YT - BR^ - ^LWHYM and
pronounced /be-reshit bara elohim/. Observations: BR^$YT, BR^, ^LWHYM. Hidden states
shown: # | #, # | be-re-shit, be-re-shit | bara, bara | elo-him, bara | missing, missing | missing.
Therefore, the hidden states actually consist of vowel-annotated bigrams.
The probability of one possible sequence of transitions generating this phrase can be computed as
\[ \Pr(w_{1,n}) = \prod_{i} \Pr(w_i \mid w_{i-1}) \]
which decomposes into the following probability estimates:
\[
\begin{aligned}
\hat{\Pr}(\# \mid \#) &= 1, \\
\hat{\Pr}(\# \mid \text{be-re-shit}) &= c(\text{be-re-shit}) && \text{(unigram count)} \\
\hat{\Pr}(\text{be-re-shit} \mid \text{bara}) &= \frac{c(\text{be-re-shit}, \text{bara})}{c(\text{bara})} && \text{(bigram counts)} \\
\hat{\Pr}(\text{bara} \mid \text{elo-him}) &= \frac{c(\text{bara}, \text{elo-him})}{c(\text{elo-him})} && \text{(bigram counts)}
\end{aligned}
\]
Note that in chart 2, there exist transitions to hidden states that include missing bigrams.
In the model, each state has a possible transition to a missing bigram state. We elaborate
more on this issue in section 7. We implemented a Viterbi algorithm to find the most
likely path of transitions through the hidden states. More formally, we define the
problem of Hebrew vowel restoration using a bigram HMM model as finding
\[ \arg\max_{t_{1,n}} p(t_{1,n} \mid w_{1,n}) = \arg\max_{t_{1,n}} \frac{p(w_{1,n}, t_{1,n})}{p(w_{1,n})} = \arg\max_{t_{1,n}} p(w_{1,n}, t_{1,n}) \]
We achieve this by keeping track of the most likely path through the hidden states for
each possible hidden state with a non zero probability estimation and each observation.
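The search described above can be sketched in Python as follows. This is a simplified illustration rather than the original implementation: it uses a single start symbol `#` instead of the two anchoring symbols, takes the transition probabilities from a caller-supplied function `trans_prob` (a hypothetical name), works in log space to avoid underflow, and, as an assumption, returns an unseen consonant string unchanged.

```python
import math


def viterbi(skeletons, candidates, trans_prob, start="#"):
    """Return the most likely sequence of vowel-annotated words.

    skeletons:  list of vowel-less words (the observations).
    candidates: dict mapping a vowel-less word to its known vowel-annotated forms.
    trans_prob: function (previous_state, state) -> probability estimate.
    """
    # best[state] = (log-probability of the best path ending in state, that path)
    best = {start: (0.0, [])}
    for skel in skeletons:
        new_best = {}
        # Each hidden state emits exactly one observation, so only the forms
        # known for this consonant string need to be considered.
        for state in candidates.get(skel, [skel]):
            scored = []
            for prev, (logp, path) in best.items():
                p = trans_prob(prev, state)
                if p > 0.0:
                    scored.append((logp + math.log(p), path + [state]))
            if scored:
                new_best[state] = max(scored, key=lambda item: item[0])
        if not new_best:
            raise ValueError("no path with non-zero probability for " + skel)
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]
```

Keeping only the best path into each hidden state at every step is what makes the search linear in sentence length, at the cost of one comparison per pair of adjacent candidate forms.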
7. Sparse data
Because our bigram model is trained from a finite corpus, many bigrams are likely to be
missing from it. In the unigram model, we found that as many as 16 percent of words in
the test set were not to be found in the count look up table. The amount of unseen
bigrams was even higher, as much as 20 percent. This is not surprising, as we expect
some unseen bigrams to consist of words that were both seen before individually. We did
not specifically deal with sparse data in the unigram base line model. As many of the
unseen unigrams were non ambiguous, we would have liked to look up the sparse words
in the Hebrew vowel-annotated dictionary and copy the vowel pattern found in the
dictionary. However, as noted in section 1, Hebrew words carry attached prefixes and
suffixes that represent conjugation, prepositions, pronouns, and connectives. The
dictionary contains only the stem form of verbs and nouns, and without a morphological
analyzer we cannot recover the stem. Therefore we proceed as follows: we employ
Katz’s technique (Katz, 99) to combine a discounting method with a backoff
method in order to obtain a good estimate for unseen bigrams. We use the Good-Turing
discounting method [Gale & Sampson 91] to tell us how much total probability
mass to set aside for all the events we have not seen, and the backoff algorithm to tell us
how to distribute this probability. Formally, we define

\[
\Pr(w_2 \mid w_1) =
\begin{cases}
P_d(w_2 \mid w_1) & \text{if } c(w_1, w_2) > 0 \\
\alpha(w_1)\, p(w_2 \mid w_1) & \text{if } c(w_1, w_2) = 0
\end{cases}
\]
where \(P_d\) is the discounted estimate using the Good-Turing method, \(p\) is a
probability estimated by counts and \(\alpha(w_1)\) is a normalizing factor that divides the
unknown probability mass among the unseen bigrams beginning with \(w_1\):
\[
\alpha(w_1) = \frac{1 - \sum_{w_2 : c(w_1, w_2) > 0} P_d(w_2 \mid w_1)}{1 - \sum_{w_2 : c(w_1, w_2) > 0} p(w_2 \mid w_1)}
\]
In order to compute Pd we create a separate discounting model for each context word w1.
The reason for this is simple: if we used only one model over all of the bigram counts, we
would really be approximating Pd(w2,w1). Because we wish to estimate Pd(w2|w1), we
define the discounted frequency counts as follows:
\[ c^*(w_1, w_2) = \bigl(c(w_1, w_2) + 1\bigr) \, \frac{n_{c(w_1, w_2) + 1}}{n_{c(w_1, w_2)}} \]
where \(n_c\) is the number of different bigrams in the corpus that have frequency c.
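The discounted counts can be illustrated with a short Python sketch of the standard Good-Turing adjustment, applied to the successors of one fixed context word w1. The function name `good_turing_counts` is hypothetical. Two choices here are assumptions of the sketch rather than details from the experiment: n_c is computed within the single context, and the raw count is kept whenever no bigram has frequency c+1 (otherwise the adjusted count would collapse to zero).

```python
from collections import Counter


def good_turing_counts(successor_counts):
    """Return Good-Turing adjusted counts for one context word w1.

    successor_counts: Counter mapping w2 -> c(w1, w2).
    Result: dict mapping w2 -> c* = (c + 1) * n_{c+1} / n_c.
    """
    # n[c] = number of distinct successors w2 seen exactly c times after w1
    n = Counter(successor_counts.values())
    adjusted = {}
    for w2, c in successor_counts.items():
        if n[c + 1] > 0:
            adjusted[w2] = (c + 1) * n[c + 1] / n[c]
        else:
            # Assumption: leave the count undiscounted when n_{c+1} is empty.
            adjusted[w2] = float(c)
    return adjusted
```

Dividing each adjusted count by the total count of w1 gives the discounted estimate Pd(w2|w1); the probability mass removed in this way is what the backoff step redistributes.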
Following Katz, we estimate the probability of unseen bigrams to be
\[
p(w_2 \mid w_1) =
\begin{cases}
p(w_2) & \text{if } c(w_2) > 0 \\
p(\text{unseen} \mid w_1) & \text{if } c(w_2) = 0
\end{cases}
\]
Note that in some cases, w2 itself is unseen and c(w2) cannot be computed. To obtain the
estimated probability for unseen unigrams, p(unseen|w1), we allocate some probability
mass to unseen w2 words by keeping a special count for bigrams (see chart 2) that were
seen fewer than k times⁹.
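The backoff scheme of this section can be sketched in Python as follows. The sketch assumes the discounted probabilities Pd(w2|w1) of the seen bigrams and the unigram estimates p(w2) have already been computed and are passed in as dictionaries; the function name `katz_bigram_prob` and both argument names are hypothetical. It does not model the special count for words that were never seen at all, so such words simply receive probability zero here.

```python
def katz_bigram_prob(w1, w2, discounted, backoff):
    """Return the backed-off estimate of pr(w2 | w1).

    discounted: dict w1 -> {w2: Pd(w2 | w1)} for seen bigrams only.
    backoff:    dict w2 -> p(w2), the estimate used for unseen bigrams.
    """
    seen = discounted.get(w1, {})
    if w2 in seen:
        # Seen bigram: use the discounted estimate directly.
        return seen[w2]
    # Probability mass set aside by discounting the seen bigrams of w1.
    left_over = 1.0 - sum(seen.values())
    # Backoff mass that belongs to successors not seen after w1.
    unseen_mass = 1.0 - sum(backoff.get(w, 0.0) for w in seen)
    if unseen_mass <= 0.0:
        return 0.0
    alpha = left_over / unseen_mass
    return alpha * backoff.get(w2, 0.0)
```

With this normalization the estimates for a fixed w1 sum to one over the vocabulary covered by `backoff`, which is the role of the factor alpha in the definition above.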
8. Results and discussion
As stated previously, the baseline unigram model achieved a word accuracy of 68%.
Using the HMM bigram model, we achieved a word accuracy of 80%. This is an
improvement over our baseline, but the results are unsatisfying, as they mean that, on
average, our model misclassifies 2 words in every sentence. Most of the errors (16%) are
attributed to the misclassification of unseen words. As stated in section 2, some degree of
morphological analysis would help us classify many of these words more accurately.
Another problem of the model is the large number of hidden states that correspond to
vowel-annotated words seen in the training data. This phenomenon is attributed to the
following aspects of the morphological richness of Hebrew:

• Several Hebrew letters are ambiguous. For example, the letter ה may represent
either the consonant /h/ or a vowel. The same applies to the letters י and ו.
Furthermore, all of them (i.e. ה, י and ו) may indicate several different vowels. For
instance, ו may indicate the vowel /o/ or /u/. It is difficult to tell offhand whether
an ambiguous letter denotes a vowel or a consonant. However, when these letters
represent vowels, they are usually left in the text and are not removed. Therefore,
these letters could help us restore the vowel patterns of the preceding letter,
provided we know that they represent vowels and not consonants. Currently my
model treats all appearances of these letters as consonants, as we cannot
disambiguate without the use of an analyzer.
• Verb conjugation is also represented through the use of suffixes and prefixes attached to
the verb stem. For example, the word ראיתי /ra-ity/ (= I saw) conjugates the verb
ראה /ra-a/. We expect that all conjugations of the same verb have equal
frequency throughout the test set, and thus we would like to store counts only for
the stem verb rather than for each conjugation. Currently my model keeps a separate
count for each conjugation, leading to an increase in the number of hidden states.
Both of these problems can again be addressed through the use of morphological analysis.
We predict that even a limited and unsophisticated analyzer would help to increase word
accuracy. Also, we did not integrate sentence boundaries into the model. In future work,
we will make each word beginning a sentence independent of the last word of the
previous sentence.
⁹ k was arbitrarily set to 3 in our experiment. Alternatively, we could get a more exact estimate of the
missing probability mass by discounting the unigram probabilities of w2, but we did not have time to
implement this.
9. Conclusion
We have shown that for the task of restoring Hebrew vowels, using an HMM bigram
model improves word accuracy performance over a unigram model. However, the results
are still unsatisfying. We predict that integrating a limited morphological analyzer would
vastly improve performance.
Acknowledgements
Thanks to Prof. Daniel Moore of the for providing me with the full Latin transcripted
code of the Hebrew bible. Special thanks is due to the CS 287r staff at Harvard
University: Ken Shan, Wheeler Ruml and Prof. Stuart Shieber for engaging discussions
and helpful comments.
Appendix A – The Hebrew Vowels
Vowel              Pronunciation
patach             like "a" in car
qametz             like "a" in father
qametz metheg      like "a" in car
seghol             like "e" in bet
tsere defective    like "ey" in they
tsere plene        like "ey" in they
hireq defective    like "i" in bit
hireq plene        like "i" in machine
qibbutz            like "u" in flute
shureq             like "u" in flute
holem defective    like "o" in boat
holem plene        like "o" in boat
shewa              like "a" in "above"
patach shewa       like "a" in car
qametz shewa       like "a" in car
seghol shewa       like "e" in bet
Appendix B
Given below is the Hebrew transliteration used throughout the paper. This is the accepted
transcription proposed by the Hebrew academy (Academy of the Hebrew Language,
1957).
Latin    Hebrew
^        א
B        ב
G        ג
D        ד
H        ה
W        ו
Z        ז
X        ח
@        ט
Y        י
K        כ
L        ל
M        מ
N        נ
S        ס
&        ע
P        פ
C        צ
Q        ק
R        ר
$        ש
T        ת
Appendix C
An example of word ambiguity in Hebrew: the word ספר. (Technically, I cannot
include the Hebrew vowels in this word processor.)
Vowel-annotated word    Meaning
ספר                     A book
ספר                     A barber
ספר                     The verb count (3rd person singular)
ספר                     The verb cut (3rd person singular)
ספר                     Was told (passive)
ספר                     Was cut (passive)
References
Academy of the Hebrew Language, 1957. The rules for Hebrew-Latin transcription. In
Memoirs of the Academy of the Hebrew Language, pages 5-8. (in Hebrew)
Choueka, Y. and Neeman, Y. Nakdan-Text (an In-Context Text-Vocalizer for Modern
Hebrew). BISFAI-95, The Fifth Bar Ilan Symposium for Artificial Intelligence, 1995.
Ido Dagan, Fernando Pereira, and Lillian Lee. Similarity-based estimation of word
cooccurrence probabilities. In Proceedings of the 32nd Annual Meeting of the
Association for Computational Linguistics, 1994.
Gale, W. A. and Sampson, G. (1995). Good-Turing Frequency Estimation Without
Tears. Journal of Quantitative Linguistics 2, 217-237.
Segel, A. A probabilistic Morphological Analyzer for Hebrew undotted text, MSc thesis
1997, Israeli Institute of Technology. (in Hebrew)
D. Yarowsky. A comparison of corpus-based techniques for restoring accents in
Spanish and French text. In Proceedings, 2nd Annual Workshop on Very Large
Corpora, Kyoto, pages 19--32, 1994.