Kadri Vider, Heili Orav Estonian wordnet and Lexicography 1. Introduction There has been the need for computer thesauri in Estonian lexicography for a long time. For 1997 it was clear, that besides morphological and syntactical analysis a lexical database based on word semantics was needed. There was the WordNet (WN) created by G.A. Miller and others at Princeton University in the 1980-s, and we followed suit. Compilation of the Estonian Wordnet started in 1997 and the work is still in progress. The work is funded partly by Estonian Science Foundation and partly in the frames of Estonian National Programme of Language Technology. The existing Estonian wordnet contains nouns, verbs and some adjectives and proper nouns. 2. WordNet The design of a semantic database is faced with two fundamental questions: (a) what kind of semantic information is stored: linguistic knowledge or world-knowledge (b) how is this information stored: as a language internal system or using some meta-language Many semantic databases are not explicit with respect to these distinctions and it is often very difficult to keep them apart. In the case of traditional dictionaries you might expect to find ‘linguistic knowledge’ telling you what semantic properties determine the exact usage of a word, but dictionary definition very often also contain fragments of world-knowledge. In the case of conceptual networks that try to capture our common-sense knowledge and reasoning you might expect to only find ‘world knowledge’ but what is defined are often the same words (Bloksma et al. 1996). By focusing on historical evidence, the standard dictionaries neglected questions concerning the synchronic organization of lexical knowledge. Beginning with word association studies at the turn of the 20th century and continuing down to the sophisticated experimental tasks of the past twenty years, psycholinguists have discovered many synchronic properties of the mental lexicon that can be exploited in lexicography (Miller et al, 1990). In 1985 a group of psychologists and linguists at Princeton University undertook to develop a lexical database along lines suggested by these investigations. G.A. Miller and others create WordNet - a semantic database of English. Inasmuch as it instants hypotheses based on results of psycholinguistic research, WordNet can be said to be a dictionary based on psycholinguistic principles (Fellbaum 1998). 126 Kadri Vider, Heili Orav In WordNet fundamental differences in the semantic organization of syntactic categories can be clearly seen and systematically exploited. Nouns are organized in lexical memory as topical hierarchies, adjectives are organized as N-dimensional hyperspaces, and verbs are organized by a variety of entailment relations. The most ambitious feature of WordNet is its attempt to organize lexical information in terms of word meanings, rather than word forms. (Miller et al. 1993) 3. Terms A word can have several meanings and mostly it has. As WordNet is based on word meaning, all of the words that can express a given sense are grouped together in a SYNONYM SET or shortly called SYNSET. For example if you are searching Estonian word ‘viis’ you can find at least four different meanings, i.e. four synsets: Viis_1, 5_1: number 5 (meaning is number five) Viis_2, väga_hea_1: kõrgeim hinne eesti koolisüsteemis (meaning is five – the highest grade in Estonian school system) Viis_3, meloodia_1: helide järgnevus muusikas (meaning is melody – consequence of sounds in music) Viis_4, mood_1, stiil_1, laad_1: väljakujunenud, iseloomulik või äratuntav tegutsemislaad või olemisviis (meaning is manner, style, mode – specific, known manner of performance) WordNet is organised by semantic relations. Since a semantic relation is a relation between meanings, and since meanings can be represented by synsets, it is natural to think of semantic relations as pointers between synsets. The database is restricted to the relations suggested by the psycholinguistic data: synonyms, antonyms, hyponyms, hyperonyms, meronyms, holonyms, causation etc. These and other similar relations serve to organise the mental lexicon. Some examples: Hierarchical: koer (dog) HAS_HYPERONYM loom (animal) Antonymy: halb (bad) ANTONYM hea (good) Part/hole: koer (dog) HAS_MERONYM saba (tail) Role/involved: õpetaja (teacher) ROLE õpetama (to teach) Causation: tapma (to kill) CAUSE surema (to die) As in EuroWordNet project is in Estonian Wordnet also more than 50 different semantic relations, most of them are reciprocated: If A HAS_HYPERONYM B, then B HAS_HYPONYM A. 4. EuroWordNet – a multilingual semantic database Estonian Wordnet and Lexicography 127 In 1996 the European Union supported project EuroWordNet (EWN) started, which aimed to develop a multilingual database with basic semantic relations between words for several European languages (Dutch, Spanish, Italian and English; from 1998 also Czech, French, German and Estonian). The wordnets are stored in a central lexical database system and the word meanings are linked as in the Princeton WordNet. Each wordnet represents a unique language-internal system of lexicalisations (Vossen 1998a). The design of the EuroWordNet database is first of all based on the structure of the Princeton WordNet and specifically version WordNet1.5. The notion of a synset and the main semantic relations has been taken over in EuroWordNet. However, some specific changes have been made to the design of the database, which are mainly motivated by the following objectives: 1) to create a multilingual database; 2) to maintain language-specific relations in the wordnets; 3) to achieve maximal compatibility across the different resources; 4) to build the wordnets relatively independently (re)-using existing resources. To be able to maintain the language-specific structures and to allow for the separate development of independent resources, they make a distinction between the language-specific modules and a separate language-independent module. Each language-specific wordnet is structured along the same lines as WordNet synonyms are grouped in synsets, which in their turn are related by means of basic semantic relations such as hyponymy (between specific and more general concepts), meronymy relations (between parts and wholes). By means of these relations all meanings can be interconnected, constituting a huge network or wordnet. Each language module represents an autonomous and unique language-specific system of language-internal relations between synsets. Nevertheless, some specific measures have been taken to enlarge the compatibility of the different language-specific wordnets, as the definition of a common set of so-called Base Concepts (BC). Base Concepts are meanings (i.e. synsets) that play a major role in the wordnets (Vossen et al. 1997). Equivalence relations between the synsets in different languages and WordNet1.5 are made explicit in the so-called Inter-Lingual-Index (ILI). Each synset in the monolingual wordnets has at least one equivalence relation with a record in this ILI, either directly or indirectly via other related synsets. Language-specific synsets linked to the same ILI-record should thus be equivalent across the languages. Synsets linked to the same WordNet1.5 synset are supposed to be equivalent or close in meaning and can then be compared. 5. Estonian Wordnet Compilation of Estonian wordnet (EstWN) started from 1997 and the work will continue. Its relevant to mention that other partners in EWN project have created their wordnets automatically, but we have preferred manual work. One reason for this is absence of suitable electronical material, and on the other hand, it gives more exact results. 128 Kadri Vider, Heili Orav In addition to the general vocabulary also legal vocabulary as to facilitate the precise translation of legal texts is made. Table 1 gives an overview of characteristic numbers Estonian wordnet is reached to. Table 1: Number of synsets per different part-of-speeches in EstWN Proper Nouns Nouns Verbs Adjectives Total SYNSETS 446 6734 2757 307 10244 Meanings (variants) 471 11230 5746 518 17965 Meanings per synset 1,06 1,67 2,08 1,75 Different words and compounds 470 9518 3797 419 14204 Meanings per word 1 1,18 1,51 1,26 Semantic relations 474 15024 5596 538 21632 Semantic relations per synset 1,06 2,23 2,11 2,03 1,69 1,24 1,75 The major lexical information, what is so necessary for compiling thesaurus, comes from monolingual explanatory and/or sense distinguishing dictionaries or synonyms dictionaries. In Estonian there are not so many of them, more over we don’t have enough machinereadable dictionaries. The data for the compilation of the Estonian WordNet are got from the following sources: (a) to get the word meanings, explanations and examples the Estonian Explanatory Dictionary, which is unfortunately not a completely machine readable dictionary, is used (b) in case of synonymy and antonym relations the dictionaries of synonyms and antonyms are used (c) to link concepts with Inter Lingual Index bilingual dictionaries are used (d) other tools of Estonian language technology are used. For example word frequency records are compiled on the basis of the Corpus of Estonian Literary Language (CELL, 1 mln. words) and the materials of the Corpus are also used to define the different meanings of a word and quotations from the Corpus are used as examples. 6. Weaving Estonian wordnet In general, the wordnet was built in two major cycles as indicated in Figure 1. Each cycle consisted of a building phase and a comparison phase. For creation of database itself we use the EWN Database Editor Polaris (Louw 1998). 129 Estonian Wordnet and Lexicography Add new synsets How existing synsets cover language usage Figure 1. Building cycles of EstWN. Word-frequency lists Find all meanings of a word Monolingual resources Join synonymous word-meanings into synset Link synsets via different semantic relations Multilingual resources Link synsets via ILI relations 6.1. Words from texts EstWN is supposed to cover the Estonian base vocabulary in its initial version. Besides Base Concepts translation from Princeton Wordnet was avoided, main emphasis was given on frequency lists of word forms. The base vocabulary was determined by statistical analysis of Corpus of Estonian Literary Language. Lexical entries in EstWN are presented nominal singular form for nouns and supine form for verbs. In real texts the words are mostly in their full richness of forms. For example, there are fourteen cases in Estonian, it makes 28 forms in every noun paradigm. On average 45% of the word forms are morphologically ambiguous in Estonian texts and this fact causes irrelevant data in lexicon word-list. The number of compounds in Estonian is indefinite. It is quite easy for a user to make up compounds that are not found in any dictionary, but are still understandable by other users. It is not reasonable to enter all compounds that one can imagine into wordnet. The difference can be made by analysing the components of a compound: if the components are of low frequency as independent words in corpora, then the compounds they are building up should be added into lexicon. Estonian is a flective language with a free word order and that makes complicated to figure out all phrases from corpus texts. The elements of phrase can be scattered around sentence in an unpredictable order. Multi-word expressions are included into EstWN if they build up a conceptual unit and are commonly used as lexical units. 130 Kadri Vider, Heili Orav 6.2. A set of synonymous word meanings One of the problems concerned dividing meanings of a word into different synsets in EstWN. Due to lexicographical sources what we use, over-differentiation (word meaning marked as too specific) and over-generalisation (word meaning marked as too general) may occur. Traditional dictionary entries are centred to word and all syntactically or pragmatically specific meanings are represented there, but semantically they are not always significant. Which new words or meanings should be concentrated on to upgrade the EstWN? It is essential that words actually used in text will be added. Results of word sense disambiguation (WSD) of corpus texts turned to be a good way to add missing synsets and senses into our wordnet. There were significant inconsistencies in opinions of these people, who disambiguated the texts. This shows us the most problematic entries in EstWN, the need to reconsider the borders of meaning some concepts. 6.3. Language-dependent lexicalisation vs Inter-Lingual Index As said, wordnets of different languages are linked via Inter Lingual Index in EWN. But lexicalisation features or borders of one concrete word meaning are different in language by language in many cases and finding exact equal relation (EQ_SYNONYM) is often impossible. Fortunately, it is possible to use other eq-relations for describing more exactly what is difference between source language and ILI synsets. The Estonian language as Finno-Ugric language is different from most Indo-European languages as in EuroWordNet. Some specific lexicalisation patterns appear, when we tried to link Estonian synsets to ILI. In the case of verbs there is a quite regular discrepancy between English and Estonian in the expression of causativity. In Estonian causative counterparts of noncausative verbs are regularly lexicalised, also different lexical entries, and the opposition is productive. In addition, there exist cases where in English there are verbs which have only causative or only noncausative meaning, but in Estonian both meanings are lexicalised and used productively. For instance, the English verb ‘to spend (money)’ is translated as ‘kulutama’, but there is also noncausative ‘kuluma’ (also about money) for which no exact English lexicalised counterpart exists (the dictionary translation is ‘be spent’ but the Estonian ‘kuluma’ has no passive meaning). The same helds e.g. about ‘levitama’ (to disseminate, distribute) and noncausative ‘levima’. Problem rises from ILI synset, there is no causativity/noncausativity marked, and so it is impossible to use nor eq_causes non eq_is_caused_by relation for linking source language and ILI synset. For example ‘to move’ equivalents in Estonian are ‘liikuma’ and ‘liigutama’. The last is causative. In Estonian are these differences systematically lexicalized variously from other languages (Vider 2002). Also onomatopoetic-descriptive words are more detailed in Estonian. One example: Estonian bear ‘mõmiseb’ and lion ‘möirgab’. In English they both ‘roar’. Estonian Wordnet and Lexicography 131 7. Conclusion The Estonian wordnet can be used, among others, for monolingual and cross-lingual information retrieval, which is enable to search by meanings and concepts. Also wordnet-type thesaurus is helpful for translation’s tools or for compiling other, different kind of lexicons. For Estonian language technology it is very important to have such lexical-semantic resource. The present paper pointed out some difficulties, which appeared during practical compiling of Estonian wordnet-type thesaurus. References Bloksma, Laura, P. Díez-Orzas, Piek Vossen 1996. User requirements and functional specification of the EuroWordNet project, EuroWordNet (LE-4003) deliverable D001, University of Amsterdam. Available at http://www.illc.uva.nl/EuroWordNet/docs.html Fellbaum, Christiane 1998. Introduction. In: Fellbaum, Christiane (ed.) WordNet: An Electronic Lexical Database, MIT Press, pp. 1-19. Louw, M.1998. Polaris User's Guide.The EuroWordNet Database Editor. EuroWordNet (LE-4003), Deliverable D023D024, Lernout & Hauspie - Antwerp, Belgium. Available at http://www.illc.uva.nl/EuroWordNet/docs.html Miller, George A., Richard Beckwith, Christiane Fellbaum, Derek Gross and Katherine J. Miller 1990. „Introduction to WordNet: an on-line lexical database.” In: International Journal of Lexicography 3 (4), 1990, pp. 235 - 244. Miller, George A., Richard Beckwith, Christiane Fellbaum, Derek Gross and Katherine J. Miller 1993. “Introduction to WordNet: an on-line lexical database.” Available at . ftp://ftp.cogsci.princeton.edu/pub/wordnet/5papers.ps Vider, Kadri 2002. Notes about labelling semantic relations in Estonian WordNet In: D. N. Christodoulakis, C. Kunze, L. Lemnitzer (eds) Proceedings of Workshop on Wordnet Structures and Standardisation, and how these Affect Wordnet Applications and Evaluation; Third International Conference on Language Resources and Evaluation (LREC 2002). ELRA, Las Palmas de Gran Canaria, pp. 56-59 Vossen, Piek, Laura Bloksma, Horacio Rodriguez, S. Climent, Nicoletta Calzolari, Adriana Roventini, Francesca Bertagna, Antonia Alonge, Wim Peters. 1997. The EuroWordNet Base Concepts and Top Ontology. EuroWordNet (LE 4003) Deliverable D017, D034, D036. University of Amsterdam. Available at http://www.illc.uva.nl/EuroWordNet/docs.html Vossen, Piek 1998. EuroWordNet: Building a Multilingual Database with Wordnets for European Languages. In: K. Choukri, D. Fry, M. Nilsson (eds), The ELRA Newsletter, Vol3, n1, 1998. Vossen, Piek 1998a. Introduction to EuroWordNet. Computers and the Humanities, 32 (2-3), pp. 7389