Hebrew vowel restoration using a bigram HMM model Kobi Gal Department of Engineering and Applied Sciences Harvard University <gal@eecs.harvard.edu> CS 287r – Natural Language Processing May 16th, 2001 Abstract We present the problem of vowel restoration in Hebrew and attempt to solve it by using a unigram base-line model and a bigram HMM model while using Katz's method for integrating smoothing and discounting. We achieve 68% word accuracy with our baseline and 80% word accuracy with the HMM model. Most of the errors are due to unseen bigrams and unigrams and the high number of hidden states consisting of low counts. We conclude that some degree of morphological analysis would greatly assist us in improving word accuracy. 1. Introduction In modern Hebrew, written texts contain only consonants. Hebrew has a rich morphology, and due to missing vowels considerable ambiguity exists already at the word level. This is because many words with different vowel structures may contain the same consonants in a vowel-less setting. For example, the non-voweled word ספר, written in Latin transliteration as SPR1 may represent the noun “book” (pronounced /sepher/2), the verb “to count” /saphar/ in 3rd person singular form, and at least 4 other interpretations3. This is further complicated by the fact that verb conjugations, connectives and short particles in Hebrew are all attached to the word as prefixes or suffixes, resulting in further ambiguity. For example, short particles, such as the definite article ha- (=the), represented by the letter ה, the connective w (=and), represented by the 1 For a full table of Latin transliteration of Hebrew please see appendix. Italic letters in between slashes are the approximate phonetic pronunciations of the word. 3 The complete list of possible interpretation for this word is given in the appendix. 2 letter ו, and the subordinator $e (=of, which), represented by the letter ש, and many short prepositions, are all written with no space before the following word. The result is that strings of letters may represent one word or a combination of a particle and a word or even several particles or a word (Ornan et al., 95). For example, the word ברכות,written in Latin transliteration as BRXWT, may represent the Hebrew noun “congratulations” /brahot/, or the Hebrew noun plus proposition “with softness” /be-rakut/. The word היתר can represent the Hebrew noun “permit” /hey-ter/ or the Hebrew noun plus definitive “the rest” /ha-yeter. When faced with modern Hebrew script, a reader is faced with a double problem. He must not only recognize each word in order to be able to vocalize it, but when the consonant strings are ambiguous, he must choose the right one in context. This is hardly a simple task, for the number of ambiguous representations of words may be fairly large. Amazingly, in most cases a native speaker of the language can accurately vocalize each word. They do it on the basis of their control of the grammar, on their acquaintance with the lexicon, on their knowledge of semantic connections, and by understanding the real world. In this paper, we propose to devise a model that would employ contextual information and attempt to restore the proper vowel structure of the word. Note that this is a harder problem than restoring the vocal form of the word, as many different vowel forms sound the same4. For applications such as automatic speech synthesizers, merely restoring the vocal pattern of the sentence would be sufficient. However, we wish to fully annotate the vowel pattern of the word, so that non-voweled text may be restored to its original form, and benefit foreign speakers, sufferers of dyslexia, children learning the language. 2. Previous and related work I have located a commercial product named "Nakdan Text" (Choueka, 89). Given a sentence in modern Hebrew, this program will restore its vowels by first activating a function that finds all possible morphological analyses and attached vocalizations of every word in the sentence. Then, "Nakdan Text" chooses for every such word the correct context-dependent vowel pattern using short-context syntactical rules as well as some probabilistic and statistical modules. The authors claim to achieve 95% “correctness” in 4 For a full table of the Hebrew vowels see appendix. restoring vowel patterns. It is not clear if this refers to word accuracy or letter accuracy5. This program was demonstrated at BISFAI-95, the fifth Bar Ilan international symposium on Artificial Intelligence, but no summary or article was included in its proceedings, and to the best of my knowledge no article has ever been published describing this work. Also, my several attempts at contacting the authors have been unsuccessful. Yarowsky (Yarowsky, 92) has implemented several techniques for the restoration of vowel patterns in French and in Spanish. The techniques in question were Bayesian classification and decision lists, the latter method producing the best results. Decision lists are a hybrid approach combining the strengths of both HMM's and Bayesian Classifiers, taking into account both local and long term dependencies between words as well as integrating different types of rules. We believe decision lists to be applicable to the Hebrew vowel restoration problem and in future work would like to compare the performance of these with our bigram HMM model. 6 Levinger and Ornan (Levinger and Ornan, 91) have conducted research in the related field of morphological analyzers of Hebrew. They attempt to decompose a word to its distinct features, such as part of speech, lexical entry, tense (verbs only), gender, number, and person of the word as well as gender, number, and person of any pronoun suffixes. For example, the morphological analyses of the Hebrew string וכשבאתי, transliterated in Latin as WK$B^TY, and pronounced /ve-kshe-bati/ is as follows – lexical entry : the verb בא, translated in latin as B^, (=to come) category – verb tense – past feminine/masculine, first person singular object pronoun – masculine, singular, 3-rd person Clearly, full morphological analyses of Hebrew would also solve the vowel restoration problem because it would disambiguate any word and we could simply look up the appropriate meaning in the dictionary. 5 Word accuracy and letter accuracy measure the number of correct vowel-annotated words or letters respectively. 6 Due to time limitations and the limited scope of this paper decision lists have not been implemented in this work. Segel (Segel, 97) devised a statistical Hebrew lexical analyzer that inputs non-voweled Hebrew texts and achieves 95% word accuracy on test data extracted from the Israeli newspaper Ha-aretz. However, this method requires fully analyzed Hebrew text to train on. The author used a hand-analyzed training set consisting of only 500 sentences. Because there is no tree bank in Hebrew of analyzed text this method would not carry well to other domains, such as novels or medical texts. To this date, no statistical language model of Hebrew has used an N-gram model7. 3. Our Approach We have stated above that contextual information is important in deciphering lexical ambiguities in Hebrew and is commonly used by native speakers. We believe that an HMM bigram would sufficiently capture the contextual dependencies between words. We initially considered using a trigram approach but decided that the added cost in development time brought about by the increase in the number of hidden states and the complex smoothing techniques exceeds the limit for this project. We will evaluate performance by checking the percentage of words in the training set containing the correct vowel pattern. We refer to this performance measure as word accuracy percentage. Because our primary purpose of is to assist in the reading of modern Hebrew we will not check the percentage of correctly annotated syllables/letters8. 4. Corpus – The Westminster Hebrew Morphological Database There is an unfortunate lack of data for vowel-annotated text in modern Hebrew. An obvious substitute is the Hebrew bible. Ancient Hebrew bears enough syntactical and semantic resemblance to modern Hebrew to warrant usage of a bible corpus. The Westminster Hebrew Morphological Database is a corpus carrying a complete transcription of the graphical form of the Massoretic text of the Hebrew bible and contains 301,2670 words. The text is coded in Latin letters. Apart from the vowels, the 7 Levinger and Ornan state that n-grams are inadequate for Hebrew NLP because of the "free word form" nature of Hebrew. I believe the Hebrew word form to be no freer than English. 8 Note that it is also possible to divide the 16 Hebrew vowels into 5 separate phonological groups. It makes sense to check how successful the algorithm is in assigning the correct vowel group to a syllable/letter as a correct group assignment of a vowel would adequately assist the reading of a text. corpus contains the following: open and closed paragraph marks, the accentuation signs, and morphological divisions for certain morphemes. The code for these different details is interspersed with one another. Ninety percent of the corpus was used as training data, and 10 percent of it as test data. Due to the relatively small size of the training data, multifold testing and training of different segments of the data might improve robustness of results, but due to the timing constraints they were not used. 5. Base line – A unigram model A frequency table was established including a separate entry for each vowel-annotated word in the training data. We counted the number of times each vowel-annotated word appeared in the set. In the testing phase, we search through all of the words that have the same consonant structure in the table and pick the vowel-annotated word that has the highest count. The following chart describes ambiguity distribution among the training set. percentage of training set chart 1 - ambiguity distribution in training set 35 33 30 24 25 18.1 20 15 9 10 5.5 5.3 2.2 5 0.8 0.7 0.3 0.6 0.04 0.3 0.2 8 9 10 11 14 0 0 2 3 4 5 6 7 12 13 number of ambiguities Note that only 33% of the training set was unambiguous. We achieved an overall success rate of 68% using the base line technique (see results section). 6. Bigram model We constructed a bigram Hidden Markov Model where hidden states were vowelannotated words, and visible states were vowel-less words. One example of a path in the HMM is given in chart 2. In our model, there was only one emission from each hidden state to an observation. Therefore the probability of sentence W1,n is the sum over all hidden states of finding the hidden states the HMM traversed while generating the sentence w1,n starting at a particular hidden state. pr (w1, n) i pr (Ti | Ti 1) T 1, n 1 These probabilities are approximated by bigram counts (see chart 2). Note that two special symbols, shown as # #, were prepended to words in the test set. They serve to “anchor” the initial state of the HMM and facilitate computation. Chart 2 – HMM model for the non-vowel-annotated phrase “in the beginning god created…” בראשית ברא אלוהים, transliterated in Latin as BR^$YT -BR^ - ^LWHYM nad pronounced as /be-reshit bara elohim/ BR^$YT be-re-shit|bara #|# bara|elo-him # | be-re-shit BR^ missing | missing bara | missing ^LWHYM Therefore, the hidden states actually consist of vowel-annotated bigrams. The probability of one possible transition of generating this phrase can be computed as follows : pr ( w1, n) pr ( wi | wi 1 ) which decomposes into the following probability i estimations. pˆ r (# |# ) 1, pˆ r (# | be re shit ) c(be re shit ), Unigram count c(be re shit , bara ) c(bara ) c(bara | elo him ) pˆ r (bara | elo him ) c(elo him ) pˆ r (be re shit | ba ra ) Bigram counts Note that in chart 2, there exist transitions to hidden states that include missing bigrams. In the model, each state has a possible transition to a missing bigram state. We elaborate more on this issue in section 7. We implemented a Viterbi algorithm to find the most likely path transitions throughout the hidden states. More formally, we define the problem as Hebrew vowel restoration using a bigram HMM model as finding arg max p (t1, n | w1, n) arg max t1, n t1, n p( w1,n , t1,n ) p( w1,n ) arg max p ( w1,n , t1,n ) t1, n We achieve this by keeping track of the most likely path through the hidden states for each possible hidden state with a non zero probability estimation and each observation. 7. Sparse data Because our bigram model is trained from a finite corpus, many bigrams are likely to be missing from it. In the unigram model, we found that as many as 16 percent of words in the test set were not to be found in the count look up table. The amount of unseen bigrams was even higher, as much as 20 percent. This is not surprising, as we expect some unseen bigrams to consist of words that were both seen before individually. We did not specifically deal with sparse data in the unigram base line model. As many of the unseen unigrams were non ambiguous, we would have liked to look up the sparse words in the Hebrew vowel-annotated dictionary and copy the vowel pattern found in the dictionary. However, as noted in section 2, Hebrew words are attached with prefixes and suffixes that represent conjugation, propositions, pronouns, and connectives. The dictionary contains only the stem form of verbs and nouns, and without a morphological analyzer we cannot decipher the stem. Therefore we proceed as follows: We employ Katz’s technique (Katz, 99) to combine a discounting method along with a backoff method to try and obtain a good estimate of unseen bigrams. We use the Good and Turing discounting method [Gale & Sampson 91] to tell us how much total probability mass to set aside for all the events we haven’t seen, and the backoff algorithm to tell us how to distribute this probability. Formally, we define pr ( w2 | w1) Pd ( w2 | w1) If c(w2,w1)>0 pr ( w2 | w1) ( w1) p ( w2 | w1) If c(w2,w1)=0 where Pd is the discounted estimate using the Good and Turing method, p is a probability estimated by counts and (w1) is a normalizing factor that divides the unknown probability mass of unseen bigrams beginning with w1. 1 ( w1) Pd (w2 | w1) w 2:c ( w1, w 2 ) 0 1 p(w2 | w1) w 2|c ( w1, w 2 ) 0 In order to compute Pd we create a separate discounting model for each context word w1. The reason for this is simple: If we use only one model over all of the bigram counts, we would be really approximating Pd(w2,w1). Because we wish to estimate Pd(w2|w1), we define the discounted frequency counts as follows – c * ( w1, w2) c( w1, w2) nc ( w1, w 2 ) 1 nc ( w1, w 2 ) where nc is the number of different bigrams in the corpus that have frequency c. Following Katz, we estimate the probability of unseen bigrams to be p(w2) if c(w2) > 0 p(w2|w1) p(unseen|w1) if c(w2) = 0 Note that in some cases, w2 itself is unseen and c(w2) cannot be computed. To get the estimate probability for unseen unigrams p(unseen|w1) we allocate some probability mass to unseen w2 words by keeping a special count for bigrams (see chart 2) that were seen less then k times.9 8. Results and discussion As stated previously, the base line unigram model achieved a word accuracy of 68%. Using the HMM bigram model, we managed to achieve word accuracy of 80%. This is an improvement over our base line, but these results are unsatisfying, as they mean that on average, our model misclassifies 2 words in every sentence. Most of the errors (16%) are attributed to the misclassification of unseen words. As stated in section 2, some degree of morphological analysis will help us to classify many of these words more accurately, Another problem of the model is the large number of hidden states that correspond to vowel-annotated words seen in the training data. This phenomenon is attributed to the following aspects of the morphological richness of Hebrew: Several Hebrew letters are ambiguous. For example, the letter הmay represent either a consonant /h/ or a vowel. The same applies to the letters יand ו. Furthermore, all of them, (i.e. ה יand )וmay indicate several different vowels. For instance, וmay indicate the vowel /o/ or /u/. It is difficult to tell off hand whether an ambiguous letter defines a vowel or a consonant. However, when these letters represent vowels, they are usually left in the text and are not removed. Therefore, these letters could help us restore the vowel patterns of the preceding letter, provided we know that they represent vowels and not consonants. Currently my model treats all appearances of these words as consonants, as we cannot disambiguate without the use of an analyzer. Verb conjugation is also represented through the use of suffixes and prefixes to the verb stem. For example, the word ראיתי/ra-ity/ (= I saw) conjugates the verb ראה/ra-a/. We expect that all verb conjugations of the same verb have equal frequency throughout the test set, and thus we would like to store counts only for the stem verb rather than each conjugation. Currently my model keeps a separate count for each conjugation, leading to an increase in the number of hidden states. 9 K was arbitrarily set to 3 in our experiment. Alternatively, we could get a more exact estimation of the missing probability mass by discounting the unigram probabilities of w2, but we did not have time to implement this. Both of these problems can again be addressed through he use of morphological analysis. We predict that even a limited and unsophisticated analyzer would help to increase word accuracy. Also, We did not integrate sentence boundaries into the model. In future work, we will make each word beginning a sentence independent on the last word of the previous sentence. 9. Conclusion We have shown that for the task of restoring Hebrew vowels, using an HMM bigram model improves word accuracy performance over a unigram model. However, the results are still unsatisfying. We predict that integrating a limited morphological analyzer would vastly improve performance. Acknowledgements Thanks to Prof. Daniel Moore of the for providing me with the full Latin transcripted code of the Hebrew bible. Special thanks is due to the CS 287r staff at Harvard University: Ken Shan, Wheeler Ruml and Prof. Stuart Shieber for engaging discussions and helpful comments. Apendix A – The Hebrew Vowels patach like "a" car hireq defective Like "i" bit qametz "a" (father) hireq plene qametz metheg seghol qibbutz shureq tsere defective tsere plene holem like "a" car defective "e" (bet) holem plene like "u" in shewa flute like "u" in patach shewa flute like "ey" in like "i" machine Like "o" boat Like "o" boat like "a" in "above" Like "a" car qametz shewa Like "a" car they like "ey" in seghol shewa they Like "e" bet Apendix B Given below is the Hebrew transliteration used throughout the paper. This is the accepted transcription proposed by the Hebrew academy (Academy of the Hebrew Language, 1957). Latin Hebrew Latin Hebrew Latin Hebrew P פ @ ט ^ א C צ Y י B ב Q ק K כ G ג R ר L ל D ד $ ש M מ H ה T ת N נ W ו S ס Z ז & ע X ח Appendix C An example of word ambiguity in Hebrew, the word ספר Vowel-annotated word Meaning (Technically, I cannot include the Hebrew vowels in this word processor) ספר A book ספר A barber ספר The verb count (3rd person singular), ספר The verb cut (3rd person singular), ספר Was told (passive) ספר Was cut (passive) References Academy of the Hebrew Language , 1957. The rules for Hebrew-Latin transcription. In Memiors of the Academy of the Hebrew Language, pages 5-8 (in Hebrew) Choueka, Y. and Neeman, Y. Nakdan-Text, (an In-Context Text-Vocalizer for Modern Hebrew). BISFAI-95, The fifth Bar Ilan Symposium for Artificial Intelligence 1995 Ido Dagan, Fernando Pereira, and Lillian Lee. Similarity-based estimation of word cooccurrence probabilities. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 1994. Gale, W. A. and Sampson, G. (1995). Good-Turing Frequency Estimation Without Tears. Journal of Quantitative Linguistics 2, 217-237. Segel, A. A probabilistic Morphological Analyzer for Hebrew undotted text, MSc thesis 1997, Israeli Institute of Technology. (in Hebrew) D. Yarowsky. A comparison of corpus-based techniques for restoring accents in Spanish and French text. In Proceedings, 2nd Annual Workshop on Very Large Corpora, Kyoto, pages 19--32, 1994.