Parsing by Semantic Features

MT2000 - International Conference at the University of Exeter, 20-22 Nov. 2000
Organised by the British Computer Society
Uzzi Ornan
Israel Gutter
Computer Science, Technion I.I.T.
Abstract
This article presents a method for analyzing and understanding a sentence not according to syntax
but according to the meaning of the words that constitute it, or more properly by means of
identifying and attaching its various components as semantic complements to the sentence main
verb. The innovation in this method is the determination that there is no need to identify the
sentence components according to their syntactic function but according to their semantic function.
This way, the correctness or incorrectness of the sentence can be determined relatively easily,
regardless of the order of its parts. This applies even when the morphology
presents a text with multiple meanings - an outcome far more difficult to obtain by other usual
methods such as ordinary grammar rules or RTNs and ATNs (recursive and augmented transition networks).
By this method morphological ambiguity is fully dissipated, in that as the sentence is being
analyzed all the inappropriate morphological parsings of the sentence words (on whatever grounds
- syntactic or semantic) are eliminated. Accordingly, this method is proposed for the purpose of
complete morphological disambiguation. This completeness, which takes a non-fixed order of the
sentence parts into account also, is not achieved by most of the morphological disambiguation
methods common nowadays, be they syntactic or statistical. Morphological disambiguation is one
of the hardest problems in languages that use a script in which not all the elements of the word are
evident, such as Hebrew.
The article sets out the principles of the method and illustrates its application. In the
illustrations the method is operated on Hebrew, in which the order of the sentence parts is fairly
free and which has a rich morphology with multiplicity of meaning.
1. Introduction
The usual parsing methods tend to rely on a fixed order of the sentence components. For example, a
sentence usually is perceived as a nominal phrase, followed by a verbal phrase, as expressed in the
grammatical formula S -> NP VP. Usually, grammar assumes that the nominal phrase serving as
the sentence subject appears before the verb phrase, which serves as the sentence predicate. This is
so in English, and most authors of such grammars devised their methods principally
for that language.
But languages exist in which the order of the sentence parts is quite free. In Hebrew, for example,
all the following six sentences, literally translated, are possible:
A horse ate grass A horse grass ate Ate a horse grass
Ate grass a horse Grass a horse ate Grass ate a horse
All these sentences are especially acceptable and natural-sounding if 'horse' (the performer of
the act) has the definite article.
Analysis that uses grammatical rewriting rules in keeping with the foregoing will be hard
pressed to manage a large part of these sentences, let alone a longer and more complex sentence. Of
course, the grammatical rules can be duplicated for any reasonable order of sentence components,
but this is a clumsy and inefficient solution, and as we shall see, our method makes it superfluous.
Another problem here is the morphological ambiguity of words. This is solvable by any
method through examination of the connection of the word to its close or distant surroundings. But
if the order of the sentence parts is not rigid, the relevant surroundings that define the context of the
word are far harder to determine. Many methods of morphological disambiguation make do with
the short context of the word, and turn to probability or other means. These methods will be less
efficient when the word order in the sentence is not rigid. In any case, such methods do not
guarantee full morphological disambiguation.
The problem of morphological ambiguity is especially acute in Hebrew, as it is in Arabic
and other languages. This is not just because of the addition of inflexional affixes to the word stem
but also because of the defective writing system.
One. Short particles are written attached to the following word, and with it they create a single
letter chain. This way of writing enlarges the number of possible morphological analyses of the
letter chain.
Two. Double letters are signified with one letter only, not with two.
Three. Some letters stand for a consonant and also a vowel.
We propose a parsing method that examines the sentence parts according to their semantic function
in expression, not according to their syntactic function.
Our method is particularly useful for analyzing a sentence in a language in which the order
of its parts is not fixed, but it can work equally successfully, and certainly more easily, when the
order is rigid or somewhat rigid. The search for ways to dispel morphological ambiguity is also the
outcome of the special problems we had to solve to analyze sentences in Hebrew successfully.
2. Thematic complements (roles demanded by the verb)
As stated more than once in many linguistics essays, the verb is the center of information expressed
in the sentence. In Hebrew, and in Arabic and other languages in which verb inflexion is likely to
include the personal pronoun, a proper sentence may contain just one word, which is a verb, for
example, halakti, 'went-I'. In this case, concealed in the verb is a thematic complement, for halakti
is composed of the basis halak 'went' and the suffix ti 'I', namely with the meaning 'I went'. This
complement, hidden within the verb, shows that the sentence is proper despite being a single word.
The roles required by the verb are defined according to thematic function that each of them
fulfils. This question is discussed by Fillmore (1968) and Chomsky (1965). To meet the demanded
function, every such required complement has to maintain certain semantic features (known as
Selectional Restrictions). Some complements must also maintain syntactic agreement, such as
agreement in person, gender, and number, or agreement with a preposition.
We compiled a semantic lexicon for verbs, in which for every verb particular thematic
complements are defined, and the semantic features demanded for each complement. We placed in
the lexicon complements that must appear for a sentence to be formed, as well as optional
complements.
3. Examples
To clarify matters in practice, we offer here four examples of various verb entries.
3.1 Verb entry akl (eat)
Here, four thematic complements are identifiable:
A complement that defines an agent, the initiator of an act (which we abbreviate as 'agent').
This complement must maintain the semantic feature of 'living' (e.g., the boy ate: the boy fulfils
the thematic function of the agent, and this function must possess the semantic feature 'living').
A complement that describes what is eaten (this thematic function will be called the 'theme'
role). It must maintain the semantic feature of 'being eatable' (e.g., he ate ice cream).
A complement that describes the means whereby one eats ('instrument'). This complement
must maintain the semantic feature of 'eating utensil' (e.g., he ate with a spoon).
A complement that describes with whom one eats ('co-agent').
With this verb, only the 'agent' function is obligatory; all the others are optional, and do not
have to appear in the expression. As stated, in Hebrew, because the verb inflexion contains the
pronoun, the 'agent' role does not have to be stated expressly, for example, with the verb akalnu 'we ate'.
In this case the 'agent' role is hidden within the verb.
Accordingly, the following (translated) sentences will be found proper:
The boy ate.
The boy ate ice cream.
The boy ate ice cream with a spoon.
The boy ate ice cream with a spoon with his friend.
Also,
Ate the boy ice cream with a spoon with his friend.
With his friend with a spoon ate the boy ice cream.
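The entry just described can be written down as a small data structure. The following Python sketch is only an illustration: the dictionary layout, the name AKL, and the function is_proper are assumptions made for this example and are not the format of the lexicon itself. It checks only which roles are present, not the semantic features of their fillers, and word order plays no part in the check.

```python
# Hypothetical encoding of the verb entry akl (eat):
# role -> (required semantic feature, or None if none is stated; obligatory?)
AKL = {
    "agent": ("living", True),
    "theme": ("eatable", False),
    "instrument": ("eating utensil", False),
    "co-agent": (None, False),
}


def is_proper(entry, filled_roles, agent_in_verb=False):
    """Return True if only roles defined by the entry are used and every
    obligatory role is filled, either by a sentence part or (for the agent)
    by a pronoun hidden in the verb inflexion, as in akalnu 'we ate'."""
    filled = set(filled_roles)
    if not filled <= set(entry):
        return False  # a role this verb does not define
    if agent_in_verb:
        filled.add("agent")
    return all(
        role in filled
        for role, (_feature, obligatory) in entry.items()
        if obligatory
    )


# "The boy ate ice cream with a spoon with his friend."
print(is_proper(AKL, ["agent", "theme", "instrument", "co-agent"]))
# A theme with no agent, stated or hidden, is rejected.
print(is_proper(AKL, ["theme"]))
```

Because the roles are passed as an unordered collection, the two reordered sentences above are accepted in exactly the same way as the first four.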
Complements exist that must also observe certain syntactic restrictions. Examples are
agreement with person, gender, and number of the verb, or opening with a certain preposition.
Because parsing by our method does not involve syntactic functions at all, the need for rigid order
of the sentence parts is eliminated anyway.
3.2 Intransitive verb and transitive verb
The system of thematic complements distinguishes the various uses of the verb far better, and there
is no need for the defective syntactic distinction between transitive and intransitive, for example, as
follows:
The ice thawed.
The sun thawed the snow.
Evidently, the same verb appears in both sentences, and in both the words ice and sun fulfil the
same syntactic function, namely the subject. Still, it would be wrong to say The ice thawed the
snow. The reason is that the thematic function of each word differs
in the two cases. In the first sentence the ice fulfils the theme role, 'influenced' by the action of
thawing, so it must maintain the semantic feature of 'frozen liquid', while in the second sentence the
sun performs the thematic function of 'cause of action', so it must maintain the semantic feature of
'a heat-creating body'. Therefore, in the second sentence sun cannot be exchanged with ice, as ice is
not a heat-creating body.
The thematic roles are not connected with the syntactic function of the sentence parts, but
they are a direct projection of the full lexical value of the verb. Syntactic parsing in this example
would not explain why one cannot say The ice thawed the snow.
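The feature test behind this example can be sketched in a few lines of Python. The feature sets assigned to ice, snow, and sun and the names THAW and can_fill are assumptions made for illustration only; the required features themselves are the ones named in the text.

```python
# Toy nominal lexicon (assumed for this example): noun -> semantic features.
NOUN_FEATURES = {
    "ice": {"frozen liquid"},
    "snow": {"frozen liquid"},
    "sun": {"heat-creating body"},
}

# Features that the verb thaw demands from each thematic role, as in the text.
THAW = {
    "theme": "frozen liquid",        # what is influenced by the thawing
    "cause": "heat-creating body",   # the cause of the action
}


def can_fill(noun, role):
    """A noun may fill a role only if it carries the feature the role demands."""
    return THAW[role] in NOUN_FEATURES[noun]


print(can_fill("ice", "theme"))   # The ice thawed.
print(can_fill("sun", "cause"))   # The sun thawed the snow.
print(can_fill("ice", "cause"))   # The ice thawed the snow.
```

The third call fails for the reason given above: ice is not a heat-creating body, although syntactically it is a subject just as sun is.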
3.3 A different expression for equivalent content
Let us observe the following two sentences:
The cook blanched the meat with steam.
The meat was blanched with steam by the cook.
The second sentence is 'like' the first in the sense that the same information is given in both. From
the moment we know of the existence of the form 'was blanched' it is clear to us that the
complement functioning as the subject in the active sentence functions in the same thematic role as
the complement in the passive sentence, which is syntactically characterized by the preposition 'by';
while the complement in the active sentence that performs the syntactic role of direct object has a
thematic role akin to the complement that has the syntactic function of subject in the passive
sentence. The case concept was explained by Fillmore (1968). Indeed this is the expression of the
thematic status of the noun in respect of the action expressed in the sentence. Fillmore's theory
illuminated the matter for us, even if the list of thematic roles has grown over the years, and even if
the semantic features of the thematic function performers have developed nowadays far beyond what
he and his followers set forth.
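The correspondence between the active and the passive sentence can be illustrated with a short Python sketch. The role names agent, theme, and instrument for the verb blanch, the table ROLE_BY_SYNTAX, and the function thematic_frame are assumptions introduced here; the sketch shows only the correspondence described above and is not a description of the parser.

```python
# Assumed correspondence between surface marking and thematic role
# for the verb blanch in its active and passive forms.
ROLE_BY_SYNTAX = {
    "active": {"subject": "agent", "direct object": "theme", "with": "instrument"},
    "passive": {"subject": "theme", "by": "agent", "with": "instrument"},
}


def thematic_frame(voice, parts):
    """Map each sentence part, keyed by its surface marking, to its thematic role."""
    return {ROLE_BY_SYNTAX[voice][marking]: phrase for marking, phrase in parts.items()}


active = thematic_frame(
    "active",
    {"subject": "the cook", "direct object": "the meat", "with": "steam"},
)
passive = thematic_frame(
    "passive",
    {"subject": "the meat", "with": "steam", "by": "the cook"},
)

# Both sentences yield the same role assignment.
print(active == passive)
```

Once both sentences are reduced to their thematic frames, the fact that they carry the same information becomes a simple equality test.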
3.4 Syntactic and semantic role
In the following two sentences, apple has the same syntactic role, namely direct object.
(1) I took the apple.
(2) I saw the apple.
But in terms of the semantic status there is a difference between the two cases of apple in respect of
the action. In sentence (1) the word apple is what receives the action (Theme), and in sentence (2) it
is the Aimed-at of the action. Here the difference between syntax and semantics is made very plain.
4. Annexed roles
Apart from the roles demanded by the verb, which are listed as an internal part in the lexical entry
of the verb, to every sentence may be added annexed or external roles too. An annexed role is
external, not located in the lexical entry. In principle it can be attached to any verb, for example,
yesterday, tomorrow, at the market, at home, etc. For example, The boy ate ice cream yesterday at
the market.
5. The parsing process
5.1 A 'sentence' is: the verb and its complements
As stated above, in the sentence parsing process we treat the sentence as a verbal phrase and lexical
components serving as thematic complements to the verb - each verb according to its lexical entry,
or annexed, usually of time or place. The various complements can be arranged in any order among
themselves or with the verb. A thematic complement to the verb can be a nominal phrase or a
prepositional phrase, and even another complete sentence (an internal sentence: in this case the
whole sentence will be compound). Every thematic complement must be a regular component
according to the complement demands of the verb, and sometimes it must also meet syntactic
conditions such as agreement with person, gender, and number.
An annexed complement must be a regular component with meaning of time or of place,
and it must not contradict the morphological or semantic elements of the verb. Usually an annexed
complement of place will be a prepositional phrase, such as at the market, at home, etc., but also
here, there, and the like.
5.2 Stages of sentence parsing
In principle, sentence parsing has the following stages.
a. Identification of every sentence part, with details of all its grammatical, syntactic, and
semantic features. By this means the verb, and all the parts that are likely to serve as its thematic
complements or annexed complements, are revealed.
Here we have to use
a.1 first, a morpho-parser that discloses all the various possibilities of reading the letter chain;
a.2 then the semantic lexicon for verbs in order to discover the required thematic functions and the
semantic features demanded from these thematic functions;
a.3 and finally a nominal lexicon, in which the semantic features of nouns are detailed, in order to
discover which of them contains the required features as demanded by the intended thematic
function.
b. Now it is necessary to check if all the words find their place in the sentence structure, and to
detect annexed complements, adjectives attached to the nominal phrases, and additional words that
may be found to have a function 'on the side'. All the sentence components must satisfy the
syntactic conditions and the relevant semantic conditions.
c. The sentence will be found proper if at stage (b) at least one possibility is found wherein, as
stated, all the words find their place in the sentence structure, and in this possibility all the
obligatory thematic complements required by the verb are provided.
The realization of this process is beset by several difficulties:
At stage (a) semantic parsing is sometimes necessary in order to find semantic features of a
component constructed of a number of words. For example, 'the boy' differs semantically from 'the
boy and the horse'. In regular speech that is not a metaphor you can say 'the boy thought' but you
cannot say 'the boy and the horse thought'. This is because the semantic lexicon mentioned in
connection with stage (a) contains semantic information about isolated words only (or about
expressions: see below).
At stage (b) in addition to the semantic conditions, a check has to be made of syntactic
conditions also, for example, agreement with prepositions or with person, gender, and number. All
existing possibilities have to be tested too. For example, if a certain component can serve as either
of two thematic complements to the verb, each has to be tested separately.
Finally, more than one possibility may be found at stage (c), in which case the sentence has
multiple meanings.
5.3 Word order in the sentence
The following sentence shows that testing all existing possibilities as done in stage (b) allows a
non-rigid order of the sentence components. Let us look once more at the sentence grass ate a
horse.
At stage (a) three components will be identified, and for simplicity's sake we shall assume
that every component is morphologically unambiguous.
Component 1: grass (nominal phrase, possessing relevant semantic features).
Component 2: ate (verb, for which four thematic complements are defined, as detailed in
example 3.1).
Component 3: horse (nominal phrase, possessing relevant semantic features).
Now all the possibilities of defining the nominal phrases (1) and (3) as regular complements
of the verb must be tested.
The possibility of annexed complements is disqualified for these two components as they
are not prepositional phrases (nor do they end in the 'directive ah'), so they cannot be complements
of place; and as they have no meaning of time, they cannot be complements of time.
Since four complement requirements are defined for the verb 'eat', all
the following possibilities must be tested, as set out in the table.
Table columns: Possibility no. | Thematic role of component 1 | Thematic role of component 3
For each of the above possibilities, syntactic and semantic agreement has to be tested for
each component as defined in the demands for complement of the verb.
If none of the above possibilities is obtained, the entire sentence is improper. But
even if there is a possibility that is obtained, the whole sentence may be improper because there is
no component that serves in the agent role.
Stage (c). All the requirements for complement of the verb eat are optional, apart from the
agent role, which is obligatory. Therefore, any possibility that does not furnish an agent is
excluded. For all the rest, it has to be tested if the component supposed to fill the function of agent
is syntactically in agreement (according to morphological features) with the verb, and of course if
its semantic features match the features required according to the semantic lexicon. The entire
sentence will be found proper only if at least one possibility is found legitimate at this stage.
On the grounds of the test described, we can see that only possibility 4 comes under
consideration as an acceptable possibility. We reject in advance the possibility that the two
components will serve as the same semantic complement. In such a case we expect them to appear
side by side as one expanded syntactic component, not apart.
Now we must check that in possibility 4 all the obligatory complements of the verb eat have
been covered. The entire sentence is therefore found to be proper. We note that during the sentence
parsing we understood its meaning also, namely we understood who ate whom: a horse ate grass
and not grass ate a horse.
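The test of all possibilities for this sentence can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used by the authors: the toy noun features, the requirement of the preposition 'with' for the instrument and co-agent roles (suggested by the glosses 'with a spoon' and 'with his friend'), and the names EAT, NOUNS, fits, and parse are all introduced here. The verb is taken as already identified, and agreement with the verb is not modelled.

```python
from itertools import permutations

# Assumed entry for eat: role -> (required feature or None,
#                                 required preposition or None, obligatory?)
EAT = {
    "agent": ("living", None, True),
    "theme": ("eatable", None, False),
    "instrument": ("eating utensil", "with", False),
    "co-agent": (None, "with", False),
}

# Toy nominal lexicon (assumed): noun -> semantic features.
NOUNS = {"horse": {"living"}, "grass": {"eatable"}}


def fits(noun, prep, role):
    """Check the preposition and the semantic feature demanded by the role."""
    feature, required_prep, _obligatory = EAT[role]
    return prep == required_prep and (feature is None or feature in NOUNS[noun])


def parse(components):
    """components: (noun, preposition or None) pairs in any order.
    Tries every assignment of distinct roles to the components and keeps
    those in which all conditions hold and every obligatory role is filled."""
    obligatory = {role for role, spec in EAT.items() if spec[2]}
    readings = []
    for roles in permutations(EAT, len(components)):
        covered = obligatory <= set(roles)
        if covered and all(fits(n, p, r) for (n, p), r in zip(components, roles)):
            readings.append({n: r for (n, _p), r in zip(components, roles)})
    return readings


# "Grass ate a horse" and "A horse ate grass" receive the same single reading.
print(parse([("grass", None), ("horse", None)]))
print(parse([("horse", None), ("grass", None)]))
```

Using permutations of distinct roles corresponds to rejecting in advance the possibility that two separate components serve as the same complement. A sentence for which parse returns more than one reading would be one with multiple meanings.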
5.4 Disambiguation
As stated earlier, all the morphological, syntactic, and semantic possibilities for understanding the
sentence components are tested, and those that do not match are disqualified. Through this process,
therefore, we acquire complete disambiguation, particularly of morphological multiplicity of
meaning. Obviously, this complete disambiguation extends to the ability to understand the whole
sentence unequivocally. If the whole sentence (every parsed section) can be understood in
several ways, its isolated components may have multiple meanings as well.
5.5 Sentence with coordinated phrase
Here is an example of a sentence with a coordinated phrase: The boy ate an apple and a banana.
The two words apple and banana join together as one component in the sentence. The main
problem that a coordinated component presents is that of the
agreement sometimes required between this component and others in the sentence.
Syntactic agreement of the coordinated nominal phrase in Hebrew is determined as follows:
Number: plural
Gender: if all are feminine, then feminine; if not, masculine
Person: according to the lowest among them.
The semantic features of the coordinated nominal phrase are the intersection of the sets
of semantic features of its components.
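These rules can be sketched in Python as follows. The class NP, the function coordinate, and the feature sets are assumptions made for illustration, and 'the lowest' person is read here as the numerically lowest (first person before second, second before third), which is our interpretation of the rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NP:
    """A nominal phrase with the properties needed for agreement (illustrative)."""
    gender: str          # "m" or "f"
    person: int          # 1, 2, or 3
    features: frozenset  # semantic features


def coordinate(phrases):
    """Combine coordinated nominal phrases into one component."""
    return {
        "number": "plural",
        # feminine only if all the parts are feminine, otherwise masculine
        "gender": "f" if all(p.gender == "f" for p in phrases) else "m",
        # assumption: 'lowest' means the numerically lowest person
        "person": min(p.person for p in phrases),
        # only the features shared by every part survive
        "features": frozenset.intersection(*(p.features for p in phrases)),
    }


boy = NP("m", 3, frozenset({"living", "human"}))
horse = NP("m", 3, frozenset({"living", "animal"}))
print(coordinate([boy, horse])["features"])
```

With these toy feature sets the coordinated phrase 'the boy and the horse' keeps only the feature 'living', so it can serve as the agent of eat but not of a verb that demands a human agent.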
5.6 Compound sentence
As an example, let us look here at how a relative clause is to be supported. The nominal phrase has
to serve as a legitimate complement in the relative clause, and after it has been identified as such a
complement, an obligatory complement that has not yet been accomplished should not be left in the
sentence.
An example is: The boy who came to the yard ate an ice cream that the man bought at the market.
The boy who came to the yard: the word boy fulfils a legitimate agent role in the relative
clause, and after the word is added to the relative clause the sentence The man bought ice cream at
the market is obtained. The original sentence is therefore a legitimate sentence.
But the sentence The boy who came to the yard bought a book that the man ate is improper
according to the parsing method shown in the previous example, since book lacks the semantic
feature 'eatable' that the theme of eat requires.
6. The semantic lexicon
The semantic lexicon file contains semantic definitions (and sometimes minimal semantic
information) about the lexical entries whose lexical category is one word: a noun, a pronoun, an
adjective, an adverb, a preposition, or a verb.
Apart from verbs, each one of all the others is defined singly, in a separate line, in which
information is given about the features of prepositions and/or semantic features that exist in the
word, or that should exist in the word that should be attached to it.
A verb is defined in a line, after which all the required semantic complements are defined.
Each complement has an additional separate line. For every complement the following details are
defined:
1. Thematic function: whether it is an obligatory complement; whether it has to agree with the verb; whether in
conditions of imperative or first person past it may be dispensed with, even if it is defined as an
obligatory complement.
2. Information about prepositions that should appear. A number of prepositions may be defined. In
such a case, if one of the prepositions appears, it will be deemed correct.
3. Information about the semantic features that the complement has to present.
The entries appear in the lexicon in their basic form only, in phonetic script (ISO 259-3), without
inflexion. Nouns are in the singular. Verbs are in third person singular past. Examples of entries
from the semantic lexicon are given in Appendix A.
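The details listed for each complement can be pictured as one record per complement line. The Python sketch below is only a guess at a convenient in-memory form: the class ComplementSpec, its field names, and the two helper functions are assumptions, and the sketch says nothing about how the lexicon file itself is laid out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComplementSpec:
    """One complement line of a verb entry (hypothetical field names)."""
    role: str                         # thematic function
    obligatory: bool = False
    agrees_with_verb: bool = False
    droppable: bool = False           # may be omitted in imperative or first person past
    prepositions: tuple = ()          # any one of these is accepted; empty: none required
    features: frozenset = frozenset() # semantic features the complement must present


def preposition_ok(spec, prep):
    """If several prepositions are defined, any one of them is deemed correct."""
    if not spec.prepositions:
        return prep is None
    return prep in spec.prepositions


def may_be_missing(spec, verb_form):
    """An obligatory complement may be absent only where the entry allows it."""
    if not spec.obligatory:
        return True
    return spec.droppable and verb_form in {"imperative", "first person past"}


# Illustrative record for the agent of eat; the values are assumptions.
agent = ComplementSpec("agent", obligatory=True, agrees_with_verb=True,
                       droppable=True, features=frozenset({"living"}))
print(may_be_missing(agent, "first person past"))
print(may_be_missing(agent, "third person past"))
```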
7. Summary
The parsing method by semantic features as we have defined it in this paper proves to be efficient
and effective, and it has great importance for languages in which word order is fairly free, and also
for environments of multiple meaning (e.g., Hebrew). It can also serve the purpose of full
morphological disambiguation.
This method does not oblige us to apply grammatical formalisms and their like (such as
RTN and ATN), and it guides us to parsing and understanding of the sentence directly through the
semantics of its components, without recourse to syntactic parsing. It is thus relatively easy to
ascertain the like and the unlike in similar forms of different sentences, for example, active-passive,
the same sentence with a change of order of components, various kinds of compound sentences,
and so on. A practical use may also be found for it in such applications as translation, search, etc.
This method is already in use in a search engine product that searches for words according to their
meaning. A formal search alone of words written in Hebrew script is bound to bring forth many
'finds' that are not directed at the sought word. The search engine according to meaning has created
a precision instrument which, as far as we know, has no equal, certainly not for
Hebrew. The engine was developed by Multitext Inc. The study was commissioned by the
Ministry of Science in the government of Israel, with its encouragement for use in schools in Israel.
8. References
Allen, James (1995). Natural Language Understanding, 2nd edition.
The Benjamin/Cummings Publishing Company, Inc. (especially
pp. 244-250 on the subject of Thematic Roles).
Chomsky, N. (1965). Aspects of the Theory of Syntax. Cambridge, MA: MIT Press.
Chomsky, Noam (1981). Lectures on Government and Binding. Foris Pub.
Earley, J. (1970). "An efficient context-free parsing algorithm," Commun. of the
ACM 13, 2: 94-102.
Even Shoshan, Abraham (1987). The New Dictionary.
Fillmore, C. J. (1968). "The case for case". In E. Bach and R. Harms (eds.),
Universals in Linguistic Theory. New York: Holt, Rinehart, and Winston, 1-90.
Fillmore, C. J. (1977). "The case for case reopened". In P. Cole and J. Sadock (eds.),
Syntax and Semantics. Vol. 8: Grammatical Relations. New York: Academic Press,
pp. 59-81.
Ide, Nancy and Veronis, Jean (1998). "Introduction to the Special Issue on
Word Sense Disambiguation: The State of the Art,"
Computational Linguistics 24, 1: 1-40.
ISO 259-3 (1999). Conversion of Hebrew Characters into Latin Characters.
Part 3: Phonemic Conversion, February 1999.
Levinger, Moshe (1992). Morphological Disambiguation in Hebrew,
Research Thesis for the degree of Master of Science in Computer Science,
Technion.
Levinger, Moshe, Uzzi Ornan, and Itai Alon (1995). "Learning Morpho-Lexical
Probabilities from an Untagged Corpus with an Application to Hebrew,"
Computational Linguistics 21, 3: 383-404.
Nirenburg, Sergei (1987). Machine Translation: Theoretical and Methodological
Issues. Cambridge University Press.
Nirenburg, Sergei (1992). Machine Translation: A Knowledge-Based Approach.
San Mateo, CA: Morgan Kaufmann.
Nirenburg, Sergei (1993). Progress in Machine Translation. Amsterdam,
Netherlands: IOS Press.
Ornan, Uzzi (1987). "Hebrew Text Processing Based on Unambiguous Script,"
Mishpatim 17 (1). The Hebrew University of Jerusalem.
Ornan, Uzzi and Michael Katz (1994). "A New Program for Hebrew Index Based on
Phonemic Script", TR #LCL 94-7 (revised), Technion, I.I.T.
Ornan, Uzzi (1990). "Machinery for Hebrew Word Formation". In Martin Golumbic
(ed.), Advances in Artificial Intelligence. Natural Language and Knowledge-Based
Systems, Springer-Verlag, 75-93.
Ornan, Uzzi (1991). "Theoretical gemination in Israeli Hebrew". In Semitic
Studies in Honor of Wolf Leslau, edited by Alan S. Kaye, Otto Harrassowitz,
1158-1158.
Resnik, P. (1993). "Semantic classes and syntactic ambiguity", Proc. ARPA
Human Language Technology Workshop, San Mateo, CA: Morgan Kaufmann.
Stern, Naftali, (1994), The Verb Dictionary, Bar-Ilan University
Wintner, Shuly and Uzzi Ornan (1995). "Syntactic Analysis of Hebrew Sentences,"
Natural Language Engineering, 1 (3).