Vojko Gorjanc
University of Ljubljana, Faculty of Arts, Aškerčeva 2, 1000 Ljubljana, Slovenia
e-mail: vojko.gorjanc@guest.arnes.si

Simon Krek
DZS Publishing House, Mestni trg 26, 1000 Ljubljana, Slovenia
e-mail: simon.krek@dzs.si

A corpus-based dictionary database as the source for compiling Slovene-X dictionaries

Abstract
The paper describes the compilation of a corpus-based database which will be used for the production of pocket-size bilingual dictionaries with Slovene as L1. The 100-million-word reference corpus of Slovene called "FIDA" (http://www.fida.net) is used by the compilers as the source of information for the dictionary database. From the corpus, a list of the 20,000 most frequent words was taken to form the initial list of possible entries. The idea is to analyse the way words are used, to record as much lexicographically useful information as possible, and to put it in the form most convenient for the subsequent production of dictionaries with different L2. The database is in SGML format, and a special DTD was created for the purpose.

Keywords: corpus-based lexicography, bilingual dictionaries, Slovene language

1. Introduction
The article presents a corpus-based dictionary database prepared for the subsequent compilation of pocket-size bilingual Slovene-L2 dictionaries. The project is a logical consequence of the creation, in the late nineties, of the "FIDA" corpus, the first reference corpus of the Slovene language, the result of a joint venture of two academic and two industrial partners: the Faculty of Arts at the University of Ljubljana, the Jožef Stefan Institute, the DZS Publishing House and the Amebis software company.

The first part of the article describes the corpus and some potential problems in compiling the dictionary database which originate in the design and characteristics of the "FIDA" corpus. The next part shows how linguistic data found in the corpus are presented in the dictionary database. Three types of information have to be extracted from the corpus. First, the frequency list helps to determine which lemmas are the most likely candidates for entry words. Secondly, corpus concordances help to establish the core sense(s) of the entry words, and the microstructure of the entry is formed according to the findings. Lastly, statistical methods are used to identify and analyse lexical units and their patterns in the corpus. At the end, the text format and the structure of the dictionary database are briefly presented.

2. Linguistic data: the FIDA corpus
The "FIDA" corpus contains just over 100 million words of contemporary Slovene texts from the second half of the 20th century, mainly from the nineties. On the whole, it is a collection of written texts or texts written to be spoken; the transcriptions of parliamentary debates are its only spoken component (http://www.fida.net). As familiarity with the corpus content is important for the interpretation of the data, some of the basic taxonomic parameters are presented here (in %):

Medium                 Text type              Linguistic proofreading
spoken       1.97      literary      5.94     yes        63.92
electronic   0.03      technical    18.46     no          3.13
written     98.00      other        75.60     unknown    32.95

The corpus is lemmatised and morpho-syntactically tagged, but all the tagging was done automatically, without disambiguation in cases where two, three or more lemmas were possible. Since Slovene is a morphologically complex language, double or triple lemmas are frequent, which makes statistical data from the corpus unreliable to some extent.
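To illustrate the problem, the following minimal sketch shows how unresolved lemma ambiguity inflates frequency counts; the token format, the counting strategy and the mini-corpus are our simplification for illustration, not the actual FIDA processing chain:

    # Without disambiguation, every candidate lemma of an ambiguous form
    # is counted. Token format and data are invented for illustration.
    from collections import Counter

    # The Slovene form "je" is ambiguous between the lemmas 'biti' (to be)
    # and 'jesti' (to eat); the tagger records both candidates.
    tagged_tokens = [
        ("je", ["biti", "jesti"]),
        ("je", ["biti", "jesti"]),
        ("bere", ["brati"]),
    ]

    lemma_freq = Counter()
    for form, candidate_lemmas in tagged_tokens:
        for lemma in candidate_lemmas:
            lemma_freq[lemma] += 1

    # 'jesti' is credited with two occurrences even if both instances of
    # "je" are in fact forms of 'biti'.
    print(lemma_freq)  # Counter({'biti': 2, 'jesti': 2, 'brati': 1})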
In recent years, significant steps have been taken to solve the problem, both by testing existing language-independent tools and by developing new ones (Džeroski and Erjavec 1998; Zupan 1999), but the situation is still far from ideal. Although less acute, the lemmatisation of words that remained non-lemmatised is another problem awaiting better text-processing tools for Slovene. Lemmatisation was based on a lexicon developed for spelling-checker purposes by the software company involved in the project; the lexicon is fairly extensive but did not cover all the words in the corpus. Experience shows that in certain cases non-lemmatised words skew the results of statistical analysis. These characteristics of the "FIDA" corpus have to be taken into account when interpreting the data.

3. Extracting the language data from the FIDA corpus

3.1 Word list
A list of the 20,000 most frequent lemmas in the corpus was taken as the initial list of candidates for entry words. The final number depends on the findings in the corpus, but the aim is to reduce the list to approximately 15,000 entries. The shortcomings of the text processing described above are relevant here as well: when compiling a database entry, the lexicographers have to verify that the frequency of the lemma was not overestimated.

3.2 Corpus analysis

3.2.1 The senses of entry words are determined by examining the list of concordances. Since the list is usually too extensive to be examined in its entirety, a random filter is used to reduce the number of concordances to a manageable size. The assumption is that the core senses of the word, the ones pertinent to a small-sized dictionary, remain detectable after filtering. According to the findings, the microstructure of the entry is formed, with a sense indicator (a synonym or a short description) at the beginning of each sense.

3.2.2 To determine which multi-word lexical units should be documented in the dictionary database, the MI3 score, as introduced in McEnery, Langé, Oakes and Véronis (1997: 229), is used in the statistical analysis. The software used for analysing the corpus implements two statistical measures: MI and MI3. A comparison of the results showed that MI is less effective, since the frequency of corpus elements is underweighted: a single co-occurrence of two elements in the corpus yields a high score, which can diminish the importance of more frequent lexical units (Manning and Schütze 1999). This consideration is even more relevant in our case, since specific forms of non-lemmatised words receive high MI scores. MI3 neutralises, to some extent, the effect of an element's low frequency in the corpus, which is why it was chosen as the preferred statistical measure.
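In their standard formulations, for a node n and a collocate c with joint frequency f(c,n), individual frequencies f(c) and f(n), and corpus size N, MI = log2(N * f(c,n) / (f(c) * f(n))), while MI3 cubes the joint frequency. A minimal sketch follows; variable names are ours, the illustrative figures are taken from the table below, and the exact implementation in the corpus software may differ in detail:

    # Standard formulations of the MI and MI3 association scores.
    import math

    def mi(f_cn, f_c, f_n, size):
        # Mutual information: how much more often c and n co-occur than
        # expected if they were independent.
        return math.log2((f_cn * size) / (f_c * f_n))

    def mi3(f_cn, f_c, f_n, size):
        # MI3 cubes the co-occurrence frequency, so a pair observed only
        # once no longer outranks genuinely frequent combinations.
        return math.log2((f_cn ** 3 * size) / (f_c * f_n))

    N = 100_000_000  # corpus size, ~100 million words

    # A near-hapax collocate of 'čaj' (f(c,n) = 1, f(c) = 3) outscores the
    # frequent pair skodelica-čaj (f(c,n) = 271, f(c) = 1483) on MI,
    # while MI3 reverses the ranking.
    print(mi(1, 3, 3082, N), mi3(1, 3, 3082, N))        # ~13.4, ~13.4
    print(mi(271, 1483, 3082, N), mi3(271, 1483, 3082, N))  # ~12.5, ~28.7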
Noun čaj 'tea' and MI/MI3 values (frame 5, a five-word co-occurrence window); '=' marks non-lemmatised forms:

Rank  MI collocate      f(c,n)  f(c)    MI3 collocate  f(c,n)    f(c)
  1   =nefermentiran       1      3     skodelica        271      1483
  2   =superčaj            1      2     kava             336      4902
  3   =bančo               1      2     čaj              200      3082
  4   =neslajenega         1      2     piti             184      5617
  5   =oolong              2      3     zeliščen          74       480
  6   =koprivnega          5      5     in              1169   2729097
  7   =čiren               4      4     kamiličen         31        51
  8   0902                 4      4     pitje             90      1254
  9   =koprivin            3      3     metin             29        56
 10   =luštrekov           3      3     biti            1473   7749214
 11   =belarminu           2      2     leden             97      3692
 12   =nelita              2      2     =izpuš            14        15
 13   =58zeliščni          2      2     ledeneti          49       683
 14   =kamenjaških         2      2     jasminov          18        36
 15   =yena                2      2     on               882   4267671
 16   =počakou             2      2     mat              107      8501
 17   =probavale           2      2     precediti         36       333
 18   =nerefmentiran       2      2     saden             63      1801
 19   =broodje             2      2     popiti            63      1829
 20   =virštajnskimi       2      2     skuhati           59      1613

Based on the information from the corpus, multi-word lexical units are classified into three categories in the dictionary database according to two criteria, frequency and compositionality:
(a) compounds with high frequency in the corpus;
(b) compounds with low frequency;
(c) combinations which are compositional but frequent enough to be recognised as a pattern.

Semantic indicators are required with all multi-word units in which one of the components is used figuratively. Culture-bound lexemes need to be described in an editorial note, for example bela garda - the Slovene collaborationist organisation during WW2.

Collocators are listed in square brackets, with a minimum of two; they are ordered according to frequency and separated by commas. A semicolon separates collocators used figuratively from those used in their literal sense; the order of the two strings again depends on their frequency in the corpus. Depending on the word class of the entry word, the following system of collocator listing is used:

If the entry word is a noun:
  adjectives:        [mlad, pozoren, nepoučen] bralec
  noun complements:  bralka [revije, časopisa], boj z/s [konkurenco, tekmeci, mlini na veter, rakom]
  verbs:             [kotirati, trgovati] na borzi

If the entry word is an adjective:
  adverbs:           bolan [neozdravljivo, duševno, smrtno, kronično]
  nouns:             bolan [otrok, mati, tkivo, pacient]

If the entry word is an adverb:
  verbs:             boleče [občutiti, odjekniti, zarezati]
  adjectives:        bistveno [drugačen, zmanjšan]
  adverbs:           bistveno [manj, bolj]

If the entry word is a verb:
  subject of the verb:                    [veter, burja] brije
  direct and/or indirect object:          beliti [stanovanje, hišo]
  postmodifier in prepositional phrase:   bežati pred [vojno, nacizmom, Turki; resničnostjo]

In database entries:

Noun:
brég sam.
1 (pas zemlje ob vodi)
   biti na dveh/nasprotnih bregovih
   biti na istem bregu
   (stati) na nasprotnih/različnih bregovih
   ostati vsak na svojem bregu
   prestopiti bregove
   rečni breg
2 (strmina)
   strm breg
ID: imeti (nekaj) za bregom

Verb:
bežáti gl. (umikati se)
   bežati pred [vojno, nacizmom, Turki; resničnostjo]
   bežati iz [domovine, vojašnice, mesta, ječe]
   bežati od [doma; odgovornosti, laži, resničnosti]
   bežati v [puščavo, gozd, inozemstvo/tujino, zaklonišče]
   bežati proti [jugu, zahodu]
   [panično/brezglavo, množično] bežati
ID: čas beži (fig.)
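To indicate how such an entry translates into the database format described in section 5, here is a hypothetical sketch of its serialisation; the element names (entry, hw, sense, indicator, colloc, idiom) are our invention for illustration and do not reproduce the project's actual DTD:

    # Hypothetical serialisation of the "breg" entry above; element names
    # are invented and do not reflect the project's actual DTD.
    import xml.etree.ElementTree as ET

    entry = ET.Element("entry")
    ET.SubElement(entry, "hw").text = "breg"
    ET.SubElement(entry, "pos").text = "sam."

    sense1 = ET.SubElement(entry, "sense", n="1")
    ET.SubElement(sense1, "indicator").text = "pas zemlje ob vodi"
    for c in ["biti na dveh/nasprotnih bregovih", "rečni breg"]:
        ET.SubElement(sense1, "colloc").text = c

    sense2 = ET.SubElement(entry, "sense", n="2")
    ET.SubElement(sense2, "indicator").text = "strmina"
    ET.SubElement(sense2, "colloc").text = "strm breg"

    ET.SubElement(entry, "idiom").text = "imeti (nekaj) za bregom"

    print(ET.tostring(entry, encoding="unicode"))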
4. Standard vs. compound entries
One of the standard thorny lexicographical problems is the question of which multi-word lexical units should be given entry headword status. Existing Slovene lexicography is quite traditional, with a preference for listing multi-word units under single-word entry headwords. The new dictionary database tries to break with this tradition by introducing a special kind of entry, the so-called "compound entry". The main criteria for a multi-word lexical unit to be given the status of a compound headword are the following: it has to be a nominal phrase; the sense of one of its components has to be non-transparent; and its frequency in the corpus has to be relatively high. A special non-standard entry structure is specified for this category, with an obligatory sense indicator or gloss.

5. The format and structure
The dictionary database is in SGML/XML format, with a special DTD (Document Type Definition) written for the purpose. Using one of the standard DTDs, such as DocBook or TEI (Text Encoding Initiative), was considered at the beginning, but it was found that their complexity would unnecessarily confuse the compilers. In spite of possible difficulties in getting acquainted with a new text format, its advantages make the initial effort worthwhile, even though an affordable and truly SGML-friendly dictionary-writing application has yet to be found.

6. Conclusions
The project of compiling a corpus-based dictionary database for the subsequent production of pocket-size bilingual dictionaries with Slovene as L1 was briefly described. The project introduces a radically new approach to lexicographic work for the Slovene language. Corpus-based lexicography in Slovenia is at its very beginning and there are still many topics to be discussed. We believe that this database could be a solid starting point.