Vojko Gorjanc
University of Ljubljana, Faculty of Arts, Aškerčeva 2, 1000 Ljubljana, Slovenia
e-mail: vojko.gorjanc@guest.arnes.si
Simon Krek
DZS Publishing House, Mestni trg 26, 1000 Ljubljana, Slovenia
e-mail: simon.krek@dzs.si
A corpus-based dictionary database as the source for compiling Slovene-X dictionaries
Abstract
The paper describes the compilation of a corpus-based database to be used for the
production of pocket-size bilingual dictionaries with Slovene as L1. The 100-million-word
reference corpus of the Slovene language called "FIDA" (http://www.fida.net) is used by the
compilers as the source of information for the dictionary database. From the corpus, a list of
the 20,000 most frequent words was taken to form the initial list of possible entries. The idea
is to analyse the way words are used, to record as much lexicographically useful information
as possible, and to put it in a form most convenient for the subsequent production of
dictionaries with different L2s. The database is in SGML format, and a special DTD was
created for this purpose.
Keywords: corpus-based lexicography, bilingual dictionaries, Slovene language
1. Introduction
The article presents a corpus-based dictionary database prepared for the subsequent
compilation of pocket-size bilingual Slovene-L2 dictionaries. The project is a logical
consequence of the creation, in the late nineties, of the "FIDA" corpus, the first reference
corpus of the Slovene language, which was the result of a joint venture of two academic and
two industrial partners: the Faculty of Arts at the University of Ljubljana, the Jožef Stefan
Institute, the DZS Publishing House and the Amebis software company.
The first part of the article describes the corpus and some possible problems in compiling the
dictionary database which originate in the design and characteristics of the "FIDA" corpus.
The next part shows how linguistic data found in the corpus are presented in the dictionary
database. Three types of information have to be extracted from the corpus. First, a frequency
list helps to determine which lemmas are the most likely candidates for entry words.
Secondly, corpus concordances help to establish the core sense(s) of the entry words, and the
microstructure is formed according to the findings. Lastly, statistical methods are used to
define and analyse lexical units and their patterns in the corpus. At the end of the article, the
text format and the structure of the dictionary database are briefly presented.
2. Linguistic data: the FIDA corpus
The "FIDA" corpus contains just over 100 million words of contemporary Slovene texts from
the second half of the 20th century, mainly from the nineties. On the whole, it is a collection
of written texts or texts written to be spoken; the transcriptions of parliamentary debates are
its only spoken component (http://www.fida.net). As familiarity with the corpus content is
important for the interpretation of the data, some of the basic taxonomic parameters are
presented here (in %):
Medium              Text type           Linguistic proofreading
written      98.00  other        75.60  no         63.92
spoken        1.97  technical    18.46  yes        32.95
electronic    0.03  literary      5.94  unknown     3.13
The corpus is lemmatised and morpho-syntactically tagged, but all the tagging was done
automatically, without disambiguation in cases where two or more lemmas were possible.
Since Slovene is a morphologically complex language, double or triple lemmas are frequent,
which makes statistical data from the corpus unreliable to some extent. In recent years, some
significant steps have been taken to solve the problem, both by testing existing language-independent
tools and by developing new ones (Džeroski and Erjavec 1998; Zupan 1999), but the
situation is still far from ideal. Although less acute, the lemmatisation of non-lemmatised
words is another problem awaiting improvements in text-processing tools for Slovene.
Lemmatisation was based on the lexicon developed for spelling-checker software by the
software company involved in the project; the lexicon is fairly extensive but did not cover all
the words in the corpus. Experience shows that in certain cases non-lemmatised words skew
the results of statistical analysis. These characteristics of the "FIDA" corpus have to be taken
into account when interpreting the data from the corpus.
3. Extracting the language data from the FIDA corpus
3.1 Word list
A word list of the 20,000 most frequent lemmas in the corpus was taken as the initial list of
candidates for entry words. The final number depends on the findings in the corpus, but the
aim is to reduce the list to approximately 15,000 entries. The shortcomings of the text
processing applied are relevant here as well: in the process of making a dictionary database
entry, the lexicographers have to confirm that the frequency of the lemma was not
overestimated.
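The construction of such a frequency list can be sketched as follows. This is a minimal
illustration, assuming a tab-separated vertical corpus format of `wordform, lemma, tag`
per token; the actual FIDA export format may differ.

```python
from collections import Counter


def top_lemmas(corpus_lines, n=20000):
    """Count lemma frequencies and return the n most frequent lemmas.

    Each line is assumed to hold one token as 'wordform<TAB>lemma<TAB>msd'
    (a hypothetical vertical format, not the project's actual one).
    """
    counts = Counter()
    for line in corpus_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2:
            counts[parts[1]] += 1
    return counts.most_common(n)


# Tiny illustrative sample:
sample = [
    "čaja\tčaj\tNcms2",
    "čaj\tčaj\tNcms1",
    "pije\tpiti\tVmpr3s",
]
print(top_lemmas(sample, n=2))  # [('čaj', 2), ('piti', 1)]
```

Note that with ambiguous automatic lemmatisation, one wordform may contribute to
several lemma counts, which is exactly why the lexicographers must verify the list.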
3.2 Corpus analysis
3.2.1 The senses of entry words are determined by examining the list of concordances. Since
the list is usually too extensive to be examined in full, a random filter is used to reduce the
number of concordances to a manageable size. The assumption is that the core senses of the
word, which are the ones pertinent to a small-sized dictionary, remain detectable after
filtering. According to the findings, the microstructure of the entry is formed, with a sense
indicator (a synonym or a short description) at the beginning of each sense.
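Such a random filter amounts to uniform sampling from the concordance list. A minimal
sketch, in which the cut-off of 250 lines and the fixed seed are illustrative choices, not
figures from the paper:

```python
import random


def sample_concordances(concordances, max_lines=250, seed=0):
    """Reduce a concordance list to a manageable random sample.

    If the list already fits within max_lines, return it unchanged;
    otherwise draw max_lines lines uniformly at random. A fixed seed
    keeps the sample reproducible between sessions.
    """
    if len(concordances) <= max_lines:
        return list(concordances)
    rng = random.Random(seed)
    return rng.sample(concordances, max_lines)


lines = [f"... čaj ... ({i})" for i in range(1000)]
subset = sample_concordances(lines, max_lines=50)
print(len(subset))  # 50
```

Uniform sampling preserves the relative proportions of the senses on average, which is
what justifies the assumption that the core senses remain detectable.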
3.2.2 To determine which multi-word lexical units should be documented in the dictionary
database, the MI3 score, as introduced in McEnery, Langé, Oakes and Véronis (1997: 229), is
used in statistical analysis. The software used for analysing the corpus implements two
statistical measures: MI and MI3. A comparison of results showed that MI is less effective,
since the frequency of corpus elements is underestimated and a single co-occurrence of two
elements in the corpus yields a high score, which can diminish the importance of more
frequent lexical units (Manning and Schütze 1999). This consideration is even more relevant
in our case, since specific forms of non-lemmatised words receive high MI scores. MI3
neutralises to some extent the effect of an element's low frequency in the corpus, which is
why it was chosen as the preferred statistical measure.
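The two measures differ only in whether the joint frequency is cubed. A minimal sketch,
assuming the standard definitions MI = log2(N·f(x,y)/(f(x)·f(y))) and
MI3 = log2(N·f(x,y)^3/(f(x)·f(y))); the corpus software's exact formula may differ, and the
counts below are purely illustrative:

```python
import math


def mi(f_xy, f_x, f_y, n):
    """Mutual information: log2(N * f(x,y) / (f(x) * f(y)))."""
    return math.log2(n * f_xy / (f_x * f_y))


def mi3(f_xy, f_x, f_y, n):
    """MI3: the joint frequency is cubed, so a single chance
    co-occurrence of two rare tokens no longer dominates the list."""
    return math.log2(n * f_xy ** 3 / (f_x * f_y))


N = 100_000_000  # corpus size; FIDA is just over 100 million words

# Illustrative counts: a hapax pair (e.g. a rare non-lemmatised form
# next to the node word) vs. a genuinely frequent collocation.
rare = dict(f_xy=1, f_x=1, f_y=200)
frequent = dict(f_xy=271, f_x=1483, f_y=3082)

print(mi(**rare, n=N), mi(**frequent, n=N))    # the hapax wins under MI
print(mi3(**rare, n=N), mi3(**frequent, n=N))  # the collocation wins under MI3
```

Cubing f(x,y) leaves single co-occurrences unchanged (1^3 = 1) while boosting frequent
pairs, which is precisely the neutralising effect described above.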
Noun čaj 'tea' and MI/MI3 values (frame 5)

Rank  MI collocate        MI3 collocate   f(x,y)     f(y)
 1    =nefermentiran      skodelica          271     1483
 2    =superčaj           kava               336     4902
 3    =bančo              čaj                200     3082
 4    =neslajenega        piti               184     5617
 5    =oolong             zeliščen            74      480
 6    =koprivnega         in                1169  2729097
 7    =čiren              kamiličen           31       51
 8    0902                pitje               90     1254
 9    =koprivin           metin               29       56
10    =luštrekov          biti              1473  7749214
11    =belarminu          leden               97     3692
12    =nelita             =izpuš              14       15
13    =58zeliščni         ledeneti            49      683
14    =kamenjaških        jasminov            18       36
15    =yena               on                 882  4267671
16    =počakou            mat                107     8501
17    =probavale          precediti           36      333
18    =nerefmentiran      saden               63     1801
19    =broodje            popiti              63     1829
20    =virštajnskimi      skuhati             59     1613

= marks non-lemmatised forms; f(x,y) is the co-occurrence frequency of the MI3 collocate
with čaj, f(y) its overall corpus frequency.
Based on the information from the corpus, multi-word lexical units are classified into three
categories in the dictionary database according to two criteria, frequency and
compositionality: (a) compounds with high frequency in the corpus; (b) compounds with low
frequency; (c) combinations which are not compositional but frequent enough to be
recognised as a pattern. Sense indicators are required with all multi-word units in which one
of the components is used figuratively. Culture-bound lexemes need to be described in an
editorial note, for example bela garda, a Slovene collaborationist organization during WW2.
Collocators are listed in square brackets, with a minimum of two; they are ordered according
to frequency and separated by commas. A semi-colon separates collocators used figuratively
from those used in their literal sense. The order of the two strings again depends on their
frequency in the corpus.
Depending on the specific word class, the following system of collocator listing is used:

If the entry word is a noun:
  adjectives:        [mlad, pozoren, nepoučen] bralec
  noun complements:  bralka [revije, časopisa], boj z/s [konkurenco, tekmeci, mlini na veter, rakom]
  verbs:             [kotirati, trgovati] na borzi

If the entry word is an adjective:
  adverbs:  [neozdravljivo, duševno, smrtno, kronično] bolan
  nouns:    bolan [otrok, mati, tkivo, pacient]

If the entry word is an adverb:
  verbs:       boleče [občutiti, odjekniti, zarezati]
  adjectives:  bistveno [drugačen, zmanjšan]
  adverbs:     bistveno [manj, bolj]

If the entry word is a verb:
  subject of the verb:                   [veter, burja] brije
  direct and/or indirect object:         beliti [stanovanje, hišo]
  postmodifier in prepositional phrase:  bežati pred [vojno, nacizmom, Turki; resničnostjo]
In database entries:
Noun:
brég sam.
1 (pas zemlje ob vodi)
biti na dveh/nasprotnih bregovih
biti na istem bregu
(stati) na nasprotnih/različnih bregovih
ostati vsak na svojem bregu
prestopiti bregove
rečni breg
2 (strmina)
strm breg
ID: imeti (nekaj) za bregom
Verb:
bežáti gl.
(umikati se)
bežati pred [vojno, nacizmom, Turki; resničnostjo]
bežati iz [domovine, vojašnice, mesta, ječe]
bežati od [doma; odgovornosti, laži, resničnosti]
bežati v [puščavo, gozd, inozemstvo/tujino, zaklonišče]
bežati proti [jugu, zahodu]
[panično/brezglavo, množično] bežati
ID: čas beži (fig.)
4. Standard vs. compound entries
One of the standard thorny lexicographical problems is the question of which multi-word
lexical units should be given entry headword status. Existing Slovene lexicography is quite
traditional, preferring to list multi-word units under single-word entry headwords. The new
dictionary database tries to break with this tradition by introducing a special kind of entry,
the so-called "compound entry". The main criteria for a multi-word lexical unit to be given
the status of a compound headword are the following: it has to be a nominal phrase; the sense
of one of its components has to be non-transparent; and its frequency in the corpus has to be
relatively high. A special non-standard entry structure is specified for this category, with an
obligatory sense indicator or gloss.
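The three criteria act as a simple conjunctive filter over candidate units. A minimal sketch,
in which the frequency threshold of 50 and the candidate records are illustrative
assumptions, not figures from the paper:

```python
def compound_entry_candidates(units, freq_threshold=50):
    """Apply the three criteria for compound-headword status:
    nominal phrase, non-transparent sense, relatively high frequency.
    The threshold value is an illustrative choice."""
    return [
        u["form"] for u in units
        if u["is_nominal_phrase"]
        and not u["sense_transparent"]
        and u["freq"] >= freq_threshold
    ]


units = [
    # culture-bound, non-transparent nominal phrase -> compound entry
    {"form": "bela garda", "is_nominal_phrase": True,
     "sense_transparent": False, "freq": 120},
    # transparent collocation -> stays under the single-word headword
    {"form": "rečni breg", "is_nominal_phrase": True,
     "sense_transparent": True, "freq": 300},
]
print(compound_entry_candidates(units))  # ['bela garda']
```

In practice the compositionality judgment cannot be automated and remains with the
lexicographer; the filter only organises the decision.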
5. The format and structure
The dictionary database is in SGML/XML format with a special DTD (Document Type
Definition) written for the purpose. Using one of the standard DTDs such as DocBook or TEI
(Text Encoding Initiative) was considered at the beginning, but it was found that their
complexity would unnecessarily confuse the compilers. In spite of possible difficulties in
becoming acquainted with a new text format, the advantages of the format make the initial
effort worthwhile, even though ideal affordable SGML-friendly dictionary-making software
has yet to be found.
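For illustration, an entry structure consistent with the examples in section 3 could be
declared along the following lines; since the paper does not reproduce the project's actual
DTD, the element names here are purely hypothetical:

```dtd
<!-- Hypothetical sketch of a dictionary-entry DTD fragment -->
<!ELEMENT entry     (hw, pos, sense+, idiom*)>
<!ELEMENT hw        (#PCDATA)>   <!-- headword, e.g. breg -->
<!ELEMENT pos       (#PCDATA)>   <!-- word class, e.g. sam. -->
<!ELEMENT sense     (indicator?, colloc*)>
<!ELEMENT indicator (#PCDATA)>   <!-- synonym or short description -->
<!ELEMENT colloc    (#PCDATA)>   <!-- collocation pattern with [ ] brackets -->
<!ELEMENT idiom     (#PCDATA)>   <!-- the ID: lines in the examples -->
```

A custom DTD of this size can be learned quickly, which is the advantage over DocBook or
TEI mentioned above.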
6. Conclusions
The project of compiling a corpus-based dictionary database for the subsequent production of
pocket-size bilingual dictionaries with Slovene as L1 was briefly described. The project
introduces a radically new approach to lexicographic work for the Slovene language. Corpus-based
lexicography in Slovenia is at its very beginning, and many topics remain to be
discussed. We believe that this project could be a solid starting point.