Parsing by Semantic Features

MT2000 - International Conference at the University of Exeter, 20-22 Nov. 2000. Organised by the British Computer Society.

Uzzi Ornan, Israel Gutter
Computer Science, Technion, I.I.T.

Abstract

This article presents a method for analyzing and understanding a sentence not according to syntax but according to the meaning of the words that constitute it, or more precisely by identifying its various components and attaching them as semantic complements to the main verb of the sentence. The innovation in this method is the determination that the sentence components need to be identified not according to their syntactic function but according to their semantic function. In this way the correctness or incorrectness of the sentence can be determined relatively easily, independently of the order of its parts. This applies even when the morphology presents a text with multiple meanings, an outcome far more difficult to obtain by other usual methods such as ordinary grammar rules or RTN and ATN (recursive and augmented transition networks). By this method morphological ambiguity is fully resolved, in that as the sentence is being analyzed all the inappropriate morphological parsings of its words (on whatever grounds, syntactic or semantic) are eliminated. Accordingly, this method is proposed for the purpose of complete morphological disambiguation. This completeness, which also takes a non-fixed order of the sentence parts into account, is not achieved by most of the morphological disambiguation methods common nowadays, be they syntactic or statistical. Morphological disambiguation is one of the hardest problems in languages that use a script in which not all the elements of the word are evident, such as Hebrew. The article sets out the principles of the method and illustrates its application.
In the illustrations the method is applied to Hebrew, in which the order of the sentence parts is fairly free and which has a rich morphology with multiplicity of meaning.

1. Introduction

The usual parsing methods tend to rely on a fixed order of the sentence components. For example, a sentence is usually perceived as a nominal phrase followed by a verbal phrase, as expressed in the grammatical formula S -> NP VP. Usually, grammar assumes that the nominal phrase serving as the sentence subject appears before the verb phrase, which serves as the sentence predicate. This is so in English, and the authors of grammars for the most part devised their methods principally for operation on that language. But languages exist in which the order of the sentence parts is quite free. In Hebrew, for example, all the following six sentences, literally translated, are possible:

A horse ate grass
A horse grass ate
Ate a horse grass
Ate grass a horse
Grass a horse ate
Grass ate a horse

All these sentences are more acceptable and right-sounding if 'horse' (the performer of the act) has the definite article.

Analysis that uses grammatical rewriting rules of the kind given above will be hard pressed to handle a large part of these sentences, let alone a longer and more complex sentence. Of course, the grammatical rules can be duplicated for any reasonable order of sentence components, but this is a clumsy and inefficient solution, and as we shall see, our method makes it superfluous.

Another problem here is the morphological ambiguity of words. In any method this is solved by examining the connection of the word to its close or distant surroundings. But if the order of the sentence parts is not rigid, the relevant surroundings that define the context of the word are far harder to determine. Many methods of morphological disambiguation make do with the short context of the word, and turn to probability or other means.
These methods will be less efficient when the word order in the sentence is not rigid. In any case, such methods do not guarantee full morphological disambiguation.

The problem of morphological ambiguity is especially acute in Hebrew, as it is in Arabic and other languages. This is not just because of the addition of inflexional affixes to the word stem but also because of the defective writing system. First, short particles are written attached to the following word, and together with it they create a single letter chain. This way of writing enlarges the number of possible morphological analyses of the letter chain. Second, double letters are signified with one letter only, not with two. Third, some letters stand for both a consonant and a vowel.

We propose a parsing method that examines the sentence parts according to their semantic function in the expression, not according to their syntactic function. Our method is particularly useful for analyzing a sentence in a language in which the order of its parts is not fixed, but it can work equally successfully, and certainly more easily, when the order is rigid or somewhat rigid. The search for ways to dispel morphological ambiguity is also the outcome of the special problems we had to solve in order to analyze sentences in Hebrew successfully.

2. Thematic complements (roles demanded by the verb)

As stated more than once in many linguistics essays, the verb is the center of the information expressed in the sentence. In Hebrew, and in Arabic and other languages in which verb inflexion is likely to include the personal pronoun, a proper sentence may contain just one word, which is a verb, for example, halakti, 'went-I'. In this case a thematic complement is concealed in the verb, for halakti is composed of the base halak 'went' and the suffix ti 'I', with the meaning 'I went'. This complement, hidden within the verb, shows that the sentence is proper despite being a single word.
The roles required by the verb are defined according to the thematic function that each of them fulfils. This question is discussed by Fillmore (1968) and Chomsky (1965). To meet the demanded function, every such required complement has to maintain certain semantic features (known as Selectional Restrictions). Some complements must also maintain syntactic agreement, such as agreement with person, gender, and number, or agreement with a preposition.

We compiled a semantic lexicon for verbs, in which particular thematic complements are defined for every verb, together with the semantic features demanded for each complement. We placed in the lexicon complements that must appear for a sentence to be formed, as well as optional complements.

3. Examples

To clarify matters in practice, we offer here four examples of various verb entries.

3.1 Verb entry akl (eat)

Here, four thematic complements are identifiable:

A complement that defines an agent, the initiator of an act (which we abbreviate as 'agent'). This complement must maintain the semantic feature of 'living' (e.g., the boy ate. The boy fulfils the thematic function of the agent, and this function must possess the semantic feature 'living').

A complement that describes what is eaten (this thematic function will be called the 'theme' role). It must maintain the semantic feature of 'being eatable' (e.g., he ate ice cream).

A complement that describes the means whereby one eats ('instrument'). This complement must maintain the semantic feature of 'eating utensil' (e.g., he ate with a spoon).

A complement that describes with whom one eats ('co-agent').

With this verb, only the 'agent' function is obligatory; all the others are optional, and do not have to appear in the expression. As stated, in Hebrew, because the verb inflexion contains the pronoun, the 'agent' role does not have to be stated expressly, as for example with the verb akalnu, 'we ate'. In this case the 'agent' role is hidden within the verb.
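As an illustration only, the entry just described might be represented as a small data structure. The Python below is a minimal sketch, not the format of the lexicon compiled for this work: the class and field names (Complement, VerbEntry, required_feature, agent_in_inflexion) are assumptions introduced for the example, and because the text gives no semantic feature for the co-agent, that field is left unspecified.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Complement:
    role: str                        # thematic role, e.g. 'agent'
    required_feature: Optional[str]  # semantic feature the filler must have
    obligatory: bool = False         # must the role appear in the sentence?


@dataclass(frozen=True)
class VerbEntry:
    lemma: str
    complements: Tuple[Complement, ...]

    def obligatory_roles(self):
        """Roles that must be filled for the sentence to be proper."""
        return {c.role for c in self.complements if c.obligatory}

    def missing_roles(self, filled_roles, agent_in_inflexion=False):
        """Obligatory roles not yet filled.

        agent_in_inflexion (hypothetical flag) models a verb form such as
        akalnu, where the agent is hidden within the verb itself.
        """
        missing = self.obligatory_roles() - set(filled_roles)
        if agent_in_inflexion:
            missing.discard("agent")
        return missing


# The entry for akl (eat) as described in the text.
AKL = VerbEntry("akl", (
    Complement("agent", "living", obligatory=True),
    Complement("theme", "eatable"),
    Complement("instrument", "eating utensil"),
    Complement("co-agent", None),  # feature not given in the text
))
```

With this sketch, AKL.missing_roles({"theme"}) reports that the agent is still required, while AKL.missing_roles(set(), agent_in_inflexion=True) reports nothing missing, which mirrors the single-word sentence akalnu.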
Accordingly, the following (translated) sentences will be found proper:

The boy ate.
The boy ate ice cream.
The boy ate ice cream with a spoon.
The boy ate ice cream with a spoon with his friend.

Also:

Ate the boy ice cream with a spoon with his friend.
With his friend with a spoon ate the boy ice cream.

Complements exist that must also observe certain syntactic restrictions. Examples are agreement with the person, gender, and number of the verb, or opening with a certain preposition. Because parsing by our method does not involve syntactic functions at all, the need for a rigid order of the sentence parts is eliminated anyway.

3.2 Intransitive verb and transitive verb

The system of thematic complements distinguishes the various uses of the verb far better, and there is no need for the defective syntactic distinction between transitive and intransitive. Consider, for example, the following:

The ice thawed.
The sun thawed the snow.

Evidently, the same verb appears in both sentences, and in both the words ice and sun fulfil the same syntactic function, namely the subject. Still, it would be wrong to say The ice thawed the snow. The reason for this wrongness is that the thematic function of each word differs in the two cases. In the first sentence the ice fulfils the theme role, 'influenced' by the action of thawing, so it must maintain the semantic feature of 'frozen liquid', while in the second sentence the sun performs the thematic function of 'cause of action', so it must maintain the semantic feature of 'a heat-creating body'. Therefore, in the second sentence sun cannot be exchanged with ice, as ice is not a heat-creating body.

The thematic roles are not connected with the syntactic function of the sentence parts, but are a direct projection of the full lexical value of the verb. Syntactic parsing in this example would not explain why one cannot say The ice thawed the snow.
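The contrast can be made concrete with a small sketch. The Python below is an illustration under stated assumptions, not the implementation used in this work: the feature names follow the text, but the role name 'cause', the toy noun lexicon, and the marking of the theme as the only obligatory role of 'thaw' are assumptions made for the example. It tries every assignment of the nouns to distinct roles, without regard to word order, and keeps those in which the semantic features match.

```python
from itertools import permutations

# Toy noun lexicon: word -> set of semantic features (illustrative only).
NOUNS = {
    "ice": {"frozen liquid"},
    "snow": {"frozen liquid"},
    "sun": {"heat-creating body"},
}

# Roles of 'thaw': role -> (required feature, obligatory?).
# Treating the theme as obligatory is an assumption of this sketch.
THAW = {
    "theme": ("frozen liquid", True),
    "cause": ("heat-creating body", False),
}


def assignments(entry, nouns):
    """Return every way of giving each noun a distinct role of the verb
    such that the semantic features match and all obligatory roles are
    filled. The position of a noun in the sentence plays no part."""
    obligatory = {role for role, (_, must) in entry.items() if must}
    found = []
    for roles in permutations(entry, len(nouns)):
        features_ok = all(entry[role][0] in NOUNS[noun]
                          for role, noun in zip(roles, nouns))
        if features_ok and obligatory <= set(roles):
            found.append(dict(zip(nouns, roles)))
    return found
```

Under these assumptions, assignments(THAW, ["ice"]) yields one reading with ice as theme, assignments(THAW, ["sun", "snow"]) yields one reading with sun as cause and snow as theme, and assignments(THAW, ["ice", "snow"]) yields no reading at all, since neither noun is a heat-creating body.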
3.3 A different expression for equivalent content

Let us observe the following two sentences:

The cook blanched the meat with steam.
The meat was blanched with steam by the cook.

The second sentence is 'like' the first in the sense that the same information is given in both. From the moment we know of the existence of the form 'was blanched' it is clear to us that the complement functioning as the subject in the active sentence has the same thematic role as the complement in the passive sentence that is syntactically characterized by the preposition 'by'; while the complement in the active sentence that performs the syntactic role of direct object has the same thematic role as the complement that has the syntactic function of subject in the passive sentence.

The case concept was explained by Fillmore (1968). Indeed it is the expression of the thematic status of the noun in respect of the action expressed in the sentence. Fillmore's theory illuminated the matter for us, even if the list of thematic roles has grown over the years, and even if the semantic features of the performers of the thematic functions have developed far beyond what he and his followers set forth.

3.4 Syntactic and semantic role

In the following two sentences, apple has the same syntactic role, namely direct object.

(1) I took the apple.
(2) I saw the apple.

But in terms of semantic status there is a difference between the two cases of apple in respect of the action. In sentence (1) the word apple is what receives the action (Theme), and in sentence (2) it is the Aimed-at of the action. Here the difference between syntax and semantics is made very plain.

4. Annexed roles

Apart from the roles demanded by the verb, which are listed as an internal part of the lexical entry of the verb, annexed or external roles may also be added to every sentence. An annexed role is external, not located in the lexical entry.
In principle it can be attached to any verb, for example, yesterday, tomorrow, at the market, at home, etc. For example, The boy ate ice cream yesterday at the market.

5. The parsing process

5.1 A 'sentence' is: the verb and its complements

As stated above, in the sentence parsing process we treat the sentence as a verb together with lexical components that serve either as thematic complements to the verb (each verb according to its lexical entry) or as annexed complements, usually of time or place. The various complements can be arranged in any order among themselves or with the verb.

A thematic complement to the verb can be a nominal phrase or a prepositional phrase, and even another complete sentence (an internal sentence: in this case the whole sentence will be compound). Every thematic complement must be a regular component according to the complement demands of the verb, and sometimes it must also meet syntactic conditions such as agreement with person, gender, and number.

An annexed complement must be a regular component with a meaning of time or of place, and it must not contradict the morphological or semantic elements of the verb. Usually an annexed complement of place will be a prepositional phrase, such as at the market, at home, etc., but it may also be here, there, and the like.

5.2 Stages of sentence parsing

In principle, sentence parsing has the following stages.

a. Identification of every sentence part, with details of all its grammatical, syntactic, and semantic features. By this means the verb, and all the parts that are likely to serve as its thematic complements or annexed complements, are revealed.
Here we have to use:
a.1 first, a morpho-parser that discloses all the various possibilities of reading the letter chain;
a.2 then the semantic lexicon for verbs, in order to discover the required thematic functions and the semantic features demanded from these thematic functions;
a.3 and finally a nominal lexicon, in which the semantic features of nouns are detailed, in order to discover which of them contains the features demanded by the intended thematic function.

b. Now it is necessary to check whether all the words find their place in the sentence structure, and to detect annexed complements, adjectives attached to the nominal phrases, and additional words that may be found to have a function 'on the side'. All the sentence components must satisfy the syntactic conditions and the relevant semantic conditions.

c. The sentence will be found proper if at stage (b) at least one possibility is found wherein, as stated, all the words find their place in the sentence structure, and in this possibility all the obligatory thematic complements required by the verb are provided.

The realization of this process is beset by several difficulties.

At stage (a) semantic parsing is sometimes necessary in order to find the semantic features of a component constructed of a number of words. For example, 'the boy' differs semantically from 'the boy and the horse'. In regular speech that is not a metaphor you can say 'the boy thought' but you cannot say 'the boy and the horse thought'. This is because the semantic lexicon mentioned in connection with stage (a) contains semantic information about isolated words only (or about expressions: see below).

At stage (b), in addition to the semantic conditions, syntactic conditions also have to be checked, for example, agreement with prepositions or with person, gender, and number. All existing possibilities have to be tested too.
For example, if a certain component can serve as either of two thematic complements to the verb, each has to be tested separately.

Finally, more than one possibility may be found at stage (c), in which case the sentence has multiple meanings.

5.3 Word order in the sentence

The following sentence shows that testing all existing possibilities, as done in stage (b), allows a non-rigid order of the sentence components. Let us look once more at the sentence grass ate a horse.

At stage (a) three components will be identified, and for simplicity's sake we shall assume that every component is morphologically unambiguous.

Component 1: grass (nominal phrase, possessing relevant semantic features).
Component 2: ate (verb, for which four thematic complements are defined, as detailed in example 3.1).
Component 3: horse (nominal phrase, possessing relevant semantic features).

Now all the possibilities of defining the nominal phrases (1) and (3) as regular complements of the verb must be tested.

The possibility of annexed complements is disqualified for these two components: as they are not prepositional phrases (nor do they end in the 'directive ah'), they cannot be complements of place; and as they have no meaning of time, they cannot be complements of time.

Since four complement requirements are defined for the verb 'eat', all the following possibilities must be tested, as set out in the table.

TABLE HEADINGS: Possibility no.; Thematic role of component 1; Thematic role of component 3

For each of the above possibilities, syntactic and semantic agreement has to be tested for each component, as defined in the complement demands of the verb.

If none of the above possibilities is obtained, the entire sentence is improper. But even if some possibility is obtained, the whole sentence may still be improper because there is no component that serves in the agent role.

Stage (c).
All the complement requirements of the verb eat are optional, apart from the agent role, which is obligatory. Therefore, any possibility that does not furnish an agent is excluded. For all the rest, it has to be tested whether the component supposed to fill the function of agent is syntactically in agreement (according to morphological features) with the verb, and of course whether its semantic features match the features required according to the semantic lexicon. The entire sentence will be found proper only if at least one possibility is found legitimate at this stage.

On the grounds of the test described, we can see that only possibility 4 comes under consideration as an acceptable possibility. We reject in advance the possibility that the two components will serve as the same semantic complement. In such a case we would expect them to appear side by side as one expanded syntactic component, not apart.

Now we must check that in possibility 4 all the obligatory complements of the verb eat have been covered. The entire sentence is therefore found to be proper.

We note that during the sentence parsing we also understood its meaning, namely we understood who ate what: a horse ate grass, and not grass ate a horse.

5.4 Disambiguation

As stated earlier, all the morphological, syntactic, and semantic possibilities for understanding the sentence components are tested, and those that do not match are disqualified. Through this process, therefore, we acquire complete disambiguation, particularly of morphological multiplicity of meaning.

Obviously, this complete disambiguation extends only as far as the whole sentence can be understood unequivocally. If the whole sentence (every parsed section) can be understood in several ways, it is possible that its isolated components will have multiple meanings also.

5.5 Sentence with coordinated phrase

Here is an example of a sentence with a coordinated phrase: The boy ate an apple and a banana.
The two words apple and banana join together as one component in the sentence. The main problem that such a coordinated component presents is that of the agreement sometimes required between it and other components in the sentence.

Syntactic agreement of the coordinated nominal phrase in Hebrew is determined as follows:

Number: plural
Gender: if all are feminine, then feminine; if not, masculine
Person: according to the lowest among them.

The semantic features are determined as the intersection of the sets of semantic features of each of the components of the coordinated nominal phrase.

5.6 Compound sentence

As an example, let us look here at how a relative clause is to be supported. The nominal phrase has to serve as a legitimate complement in the relative clause, and after it has been identified as such a complement, no obligatory complement should be left unfulfilled in the sentence. An example is:

The boy who came to the yard ate an ice cream that the man bought at the market.

The boy who came to the yard: the word boy fulfils a legitimate agent role in the relative clause. Likewise, after the word ice cream is added to the second relative clause, the sentence The man bought ice cream at the market is obtained. The original sentence is therefore a legitimate sentence.

But the sentence The boy who came to the yard bought a book that the man ate is improper according to the parsing method we showed in the previous example.

6. The semantic lexicon

The semantic lexicon file contains semantic definitions (and sometimes minimal semantic information) about the lexical entries whose lexical category is one word: a noun, a pronoun, an adjective, an adverb, a preposition, or a verb.
Apart from verbs, each of the others is defined singly, on a separate line, which gives information about the prepositions and/or semantic features that exist in the word, or that should exist in the word to be attached to it.

A verb is defined on one line, after which all the required semantic complements are defined. Each complement has an additional separate line. For every complement the following details are defined:

1. Thematic function, together with: whether it is an obligatory complement; whether it has to agree with the verb; and whether, under conditions of imperative or first person past, it may be dispensed with even if it is defined as an obligatory complement.
2. Information about prepositions that should appear. A number of prepositions may be defined. In such a case, if one of the prepositions appears, it will be deemed correct.
3. Information about the semantic features that the complement has to present.

The entries appear in the lexicon in their basic form only, in phonemic script (ISO 259-3), without inflexion. Nouns are in the singular. Verbs are in third person singular past.

Examples of entries from the semantic lexicon are given in Appendix A.

7. Summary

The method of parsing by semantic features as we have defined it in this paper proves to be efficient and effective, and it has great importance for languages in which word order is fairly free, and also for environments of multiple meaning (e.g., Hebrew). It can also serve the purpose of full morphological disambiguation.

This method does not oblige us to apply grammatical formalisms and their like (such as RTN and ATN), and it guides us to parsing and understanding of the sentence directly through the semantics of its components, without recourse to syntactic parsing.
It is thus relatively easy to ascertain the like and the unlike in similar forms of different sentences, for example, active-passive, the same sentence with a change of order of components, various kinds of compound sentences, and so on. A practical use may also be found for it in such applications as translation, search, etc.

This method is already in use in a product: a search engine that finds words according to their meaning. A purely formal search for words written in Hebrew script is bound to bring forth many 'finds' that are not directed at the sought word. The meaning-based search engine is a precision instrument which, as far as we know, has no equal, certainly not for Hebrew. The engine was developed by Multitext Inc. The study was commissioned by the Ministry of Science in the government of Israel, with its encouragement for use in schools in Israel.

8. References

Allen, James (1995). Natural Language Understanding, 2nd edition. The Benjamin/Cummings Publishing Company, Inc. (especially pp. 244-250 on the subject of thematic roles).

Chomsky, N. (1965). Aspects of the Theory of Syntax. Cambridge, MA: MIT Press.

Chomsky, Noam (1981). Lectures on Government and Binding. Foris Pub.

Earley, J. (1970). "An efficient context-free parsing algorithm," Commun. of the ACM 13, 2: 94-102.

Even Shoshan, Abraham (1987). The New Dictionary.

Fillmore, C. J. (1968). "The case for case". In E. Bach and R. Harms (eds.), Universals in Linguistic Theory. New York: Holt, Rinehart, and Winston, 1-90.

Fillmore, C. J. (1977). "The case for case reopened". In P. Cole and J. Sadock (eds.), Syntax and Semantics. Vol. 8: Grammatical Relations. New York: Academic Press, 59-81.

Ide, Nancy and Veronis, Jean (1998). "Introduction to the Special Issue on Word Sense Disambiguation: The State of the Art," Computational Linguistics 24, 1: 1-40.

ISO 259-3 (1999). Conversion of Hebrew Characters into Latin Characters. Part 3: Phonemic Conversion, February 1999.
Levinger, Moshe (1992). Morphological Disambiguation in Hebrew. Research Thesis for the degree of Master of Science in Computer Science, Technion.

Levinger, Moshe, Uzzi Ornan, and Itai Alon (1995). "Learning Morpho-Lexical Probabilities from an Untagged Corpus with an Application to Hebrew," Computational Linguistics 21, 3: 383-404.

Nirenburg, Sergei (1987). Machine Translation: Theoretical and Methodological Issues. Cambridge University Press.

Nirenburg, Sergei (1992). Machine Translation: A Knowledge-Based Approach. San Mateo, CA: Morgan Kaufmann.

Nirenburg, Sergei (1993). Progress in Machine Translation. Amsterdam, Netherlands: IOS Press.

Ornan, Uzzi (1987). "Hebrew Text Processing Based on Unambiguous Script," Mishpatim 17 (1). The Hebrew University of Jerusalem.

Ornan, Uzzi and Michael Katz (1994). "A New Program for Hebrew Index Based on Phonemic Script", TR #LCL 94-7 (revised), Technion, I.I.T.

Ornan, Uzzi (1990). "Machinery for Hebrew Word Formation". In Martin Golumbic (ed.), Advances in Artificial Intelligence: Natural Language and Knowledge-Based Systems, Springer-Verlag, 75-93.

Ornan, Uzzi (1991). "Theoretical gemination in Israeli Hebrew". In Semitic Studies in Honor of Wolf Leslau, edited by Alan S. Kaye, Otto Harrassowitz, 1158-1158.

Resnik, P. (1993). "Semantic classes and syntactic ambiguity", Proc. ARPA Human Language Technology Workshop, San Mateo, CA: Morgan Kaufmann.

Stern, Naftali (1994). The Verb Dictionary. Bar-Ilan University.

Wintner, Shuly and Uzzi Ornan (1995). "Syntactic Analysis of Hebrew Sentences," Natural Language Engineering 1 (3).