On the Corpus Size Needed for Compiling a Comprehensive Computational Lexicon by Automatic Lexical Acquisition

Dan-Hee Yang (Dept. of Computer Science, Pyogtack Univ., Korea)
Ik-Hwan Lee (Dept. of English, Yonsei Univ., Korea)
Pascual Cantos (Dept. of English, Murcia Univ., Spain)
Abstract

Comprehensive computational lexicons are essential to practical natural language processing (NLP). To compile such computational lexicons by automatically acquiring lexical information, however, we first require sufficiently large corpora. This study aims at predicting the ideal size of such automatic-lexical-acquisition-oriented corpora, focusing on six specific factors: (1) specific versus general purpose prediction, (2) variation among corpora, (3) base forms versus inflected forms, (4) open class items, (5) homographs, and (6) unknown words. Another important and related issue regarding predictability is data sparseness. Research using the TOTAL Corpus reveals serious data sparseness in this corpus. This, again, points towards the importance and necessity of reducing data sparseness to a satisfactory level for automatic lexical acquisition and reliable corpus predictions. The functions for predicting the number of tokens and lemmas in a corpus are based on the piecewise curve-fitting algorithm. Unfortunately, the predicted corpus size for automatic lexical acquisition is too astronomical to compile using presently existing compiling strategies. Therefore, we suggest a practical and efficient alternative method. We are confident that this study will shed new light on issues such as corpus predictability, compiling strategies and linguistic comprehensiveness.
1. Introduction
In principle, practical semantic analyzers should be able to process virtually any phrase or sentence, even one containing words not included in a concise medium-size dictionary. This would only be possible if we had at our disposal a complete computational lexicon, both in its size and in the amount and depth of information encoded for each entry. This is probably one of the main reasons for the urgent necessity of a complete computational lexicon for practical natural language processing (NLP). The problem, however, is how to compile such a complete computational lexicon for an automatic semantic analyzer. Currently, we find a great deal of interesting research on the automatic acquisition of linguistic information from corpora by means of statistical or machine learning mechanisms (Church 1994, Resnik 1993, Weischedel 1994, Dan-Hee Yang 1999b). Nevertheless, all these works suffer from data sparseness: a given corpus fails to provide sufficient information or data on the relevant word-to-word relationships. This is an intrinsic problem and a serious drawback for most corpus-based NLP work.
In various studies using small-scale corpora, low-frequency words detracted from the reliability of the probabilities obtained from the corpus. In other words, even if a certain word never occurs in a corpus, we cannot be confident that its probability is zero; it is simply that the experimental data (corpus) is not large enough to warrant confidence in the statistics obtained from it. This situation suggests that we should compile larger corpora for more reliable statistical NLP.
Regarding corpora, data sparseness relates both to quantity (corpus size) and to quality (composition of a corpus). In this study, however, we focus mainly on quantity or corpus size. That is, how large does a corpus need to be in order to build a comprehensive computational lexicon by means of automatic lexical acquisition procedures and techniques (Zernik 1991, Yang 1999b)? We probably do not need to predict the required size of a corpus with great accuracy, but we do need enough accuracy to estimate the total corpus compiling cost: time, money, human effort, etc. Such cost estimates allow us to lay down more practical compiling strategies than ever before.
Katz (1996) emphasized that corpora of different domains will simply have different words in them, and that assuming there are probabilities for domain-specific words to occur in a language in general is a bad idea. Assume that there are domains D1, D2, ..., Dn and their corresponding corpora C1, C2, ..., Cn in a language. Then, the probability of a domain-specific word wt in Ci is Prob(wt|Ci). Here, we can define an omni-domain Ds including all Dj and its corresponding omni-corpus Cs consisting of all Cj. Then, the probability of wt in Cs is Prob(wt|Cs). As a matter of fact, Prob(wt|Ci) will, in general, not coincide with Prob(wt|Cs) (Hays 1994). However, the aim of this study is to estimate the size of Cs that can justify the probability Prob(wt|Cs) “in general” rather than that of a universal set¹ (Stewart 1977), though we conjecture that Prob(wt|Ci) will gradually converge to Prob(wt|Cs) as the size of Ci grows.
Most previous investigations on this matter, however, have not considered issues such as:
(1) the implausibility of general purpose prediction, (2) the similarities and differences among
corpora, (3) the importance of base forms, (4) the need for discriminating open class items, (5)
the limitations of morphological analyzers and taggers, particularly when dealing with
homographs and unknown words. We believe that researchers' failure to consider these factors might have made their investigations less reliable, particularly given the potential contribution of these factors to corpus size prediction.
In this study, we shall first discuss the above issues and then experiment on the TOTAL Corpus to assess its actual state regarding data sparseness. Finally, we shall try to predict the corpus size needed for compiling a comprehensive computational lexicon by means of the piecewise curve-fitting algorithm (for details of this algorithm, see Dan-Hee Yang 2000).
¹ An ideal set which includes absolutely everything; here, every word in a language.
2. Corpus Size Prediction: Preliminary Considerations
2.1. Specific versus General Purpose Prediction
Lauer (1995a, 1995b) outlined the basic requirements regarding the size of linguistic corpora for
general statistical NLP. He suggested a virtual statistical language learning system by
establishing an upper boundary on the expected error rate of a class of statistical language
learners as a function of the size of training data. However, his approach partially lacks validity
because the size of training data that one can extract from the same corpus is significantly
different depending on the type of linguistic knowledge to be learned and on the procedure and
technique used to extract data from the corpus.
Similarly, De Haan (1992) reported that the suitability of data seems to depend on the
specific study undertaken and added that there is no such thing as the best or optimal size. Note
that Weischedel et al. (1994) demonstrated that a tri-tag model using 47 possible parts of speech (POS) would need a little more than one million words of training. However, most
researchers will agree that this size is neither suitable nor sufficient for semantic acquisition,
among others. This implies that the corpus size is heavily dependent on the linguistic research
we want to carry out. Consequently, there is no use trying to predict a corpus size that can
satisfy all linguistic requirements and expectations.
2.2. Variation among Corpora
Empirical corpus-based data depends heavily on the corpora from which this information has
been extracted. This means that different linguistic domains (economy, medicine, science,
computing, etc.), different authors, social strata, degrees of formality, and media (radio, TV,
newspaper, etc.) result in different token-lemma² relationships (see Sánchez and Cantos 1998).
For an illustration of this issue, consider variation among various corpora (Table 1 and Figure 1).
In order to get the most universally applicable data, we would need a very balanced corpus, which should entail all major linguistic domains, media, styles, etc., emulating real linguistic performance; in other words, a corpus that reliably models the language.

² Lemmas are also referred to in the literature as lexemes, base forms or dictionary entries.
Table 1. Corpora used for the experiment

Corpus              YSC I            YSC II         YSC V      YSC VI     YSC VII
Short Name          STANDARD         DEWEY          1980s      1970s      1960s
Number of Tokens    2,022,291        1,046,833      5,177,744  7,033,594  6,882,884
Sampling Criteria   Reading pattern  Dewey decimal  80s texts  70s texts  60s texts

Corpus              YSC III    YSC VIII   YSC IX            NEWS        TDMS
Short Name          1990s      TEXT       CHILD             NEWS        TDMS
Number of Tokens    7,171,653  674,521    1,153,783         10,556,514  9,060,973
Sampling Criteria   90s texts  Textbooks  Children's books  Newspaper   Sampling
Figure 1. Lemma growth in various corpora (x-axis: corpus size in units of 100,000 tokens; y-axis: number of different lemmas; upper panel: STANDARD, DEWEY, TEXT, CHILD; lower panel: 1960s, 1970s, 1980s, 1990s, NEWS, TDMS)
The Yonsei Corpus (YSC) consists of eight subcorpora (YSC I ~ YSC IX) compiled in accordance with different sampling criteria. The Center for Language and Information Development (CLID) paid special attention to the sampling criteria in order to get as balanced a corpus as possible (see Chan-Sup Jeong et al. 1990). In Figure 1, the x-axis gives the corpus size in words (tokens) and the y-axis the number of different lemmas. The TEXT Corpus consists of Korean textbooks (ranging from elementary to high school levels) written by native Korean-speaking authors. The CHILD Corpus was compiled from samples taken from children's books (fairy tales, children's fiction, etc.).

The upper graph in Figure 1 shows that the number of different lemmas in the TEXT and CHILD Corpora increases at a slower rate than in the DEWEY and STANDARD Corpora. This seems obvious, as both the DEWEY and the STANDARD Corpus span various linguistic domains and, additionally, their texts reflect adult language, which is, in principle, more varied and lexically and semantically more complex.
Note that the 1960s ~ 1990s Corpora were compiled with chronological criteria in mind,
the Dewey Corpus by the Dewey decimal classification criteria and the STANDARD Corpus by
means of the reading pattern criteria. The TDMS (Text and Dictionary Management System)
Corpus was compiled by the Korean Advanced Institute of Science and Technology (KAIST), in
a similar manner as the STANDARD Corpus. The NEWS Corpus is a CD-ROM title, consisting
of all the Chosun-ilbo newspapers issued between 1993 and 1994.
The lower graph in Figure 1 shows that the NEWS Corpus has a lower lemma growth than its counterparts on the same graph. The way the 1960s Corpus behaves compared to the 1970s, 1980s and 1990s Corpora is also interesting. Notice that the difference between the 1960s and the 1990s Corpus is more marked than that between the NEWS and the 1990s Corpus, even though the 1960s and the 1990s Corpora (unlike the NEWS Corpus) were compiled following similar sampling criteria.
From these observations it follows that sampling strategies affect the lemma-token slope to some degree. More important, however, is the fact that all corpora share one inherent characteristic: a monotone, increasing, convex-up curve.

Furthermore, in order to overcome these differences in sampling and in the lemma-token relationship, we decided to merge all 10 corpora above into a single one, on the assumption that each corpus is a reliable and balanced model. It seems plausible that merging various reliable and balanced models can only result in an even better language-like model. The 10 corpora gave rise to the TOTAL Corpus (totaling 50,780,790 tokens), the model on which we shall base our research.
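As an aside, lemma-growth curves of the kind plotted in Figure 1 can be obtained by counting distinct lemmas over growing prefixes of a lemmatized corpus. The sketch below is our own minimal illustration, assuming a hypothetical file with one lemma per token per line; it is not the tool actually used to build Figure 1.

```python
# Minimal sketch: count distinct lemmas as a corpus grows, as in Figure 1.
# Assumes "corpus.lemmas" holds one lemma per line (hypothetical file name).
def lemma_growth(path, step=100_000):
    """Yield (tokens_seen, distinct_lemmas) every `step` tokens."""
    seen = set()
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            seen.add(line.strip())
            if i % step == 0:
                yield i, len(seen)

if __name__ == "__main__":
    for tokens, lemmas in lemma_growth("corpus.lemmas"):
        print(f"{tokens:>12,} tokens -> {lemmas:,} different lemmas")
```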
2.3. Base Form versus Inflected Form
Base forms are most valuable for constructing lexicons for NLP and indexes for information
retrieval systems. It seems wise, therefore, to consider these forms (lemmas) rather than
inflected ones (types), though Heaps (1978) and Young-Mi Jeong (1995) did not consider
lemmas for information retrieval systems. To get a taste of the sentence structure and
grammatical relationship in Korean, consider the following sentence:
철수가 그 논문을 썼다.
Chelswu-ka ku nonmwun-ul ssessta.
Chelswu-SM the thesis-OM wrote
'Chelswu wrote the thesis.'
-가 -ka (subject marker) and -을 -ul (object marker) are case particles, occasionally having functions similar to prepositions in other languages. As case particles are bound to other items and cannot appear in isolation or on their own (a main feature of agglutinative languages), we shall not consider them when measuring or predicting the size of a corpus. Consequently, the number of different tokens in the above sentence is the same in Korean and English, whereas the number of different lemmas differs: 6 (including case particles) for Korean and 4 for English. Notice, however, that this has little if any effect on the total number of different lemmas in a large corpus, as the total number of case particles in Korean and of prepositions in English are each below 100.
Sánchez and Cantos (1997) also pointed out the need for such a distinction (token versus lemma). However, they focused more on the distinction and relationship between token and type (word form), and their approach to the token-lemma relationship was based solely on a hand-lemmatized sample of just 250,000 tokens, which we consider too small for this study and insufficient to draw any conclusions about corpus size. To determine the number of different lemmas in a corpus, we lemmatized and tagged the corpus by means of the NM-KTS (New Morphological analyzer and the KAIST Tagging System), which reaches an accuracy rate of 96%, and of 75% when guessing words not registered in its internal dictionary (Dan-Hee Yang 1998).
2.4. Open Class Items
Nouns, verbs, adjectives and adverbs are considered open class items. These parts of speech
(POS) are open in the sense that new items are constantly added to the language. Clearly, nouns
increase much more than any other POS, particularly because of jargon, new compounds,
proper nouns and borrowings from other languages. The other open class items (verbs,
adjectives and adverbs) are clearly less productive, which leads us to consider different degrees
of productivity among open classes. In addition, open class items are also considered to be
lexical items due to their contribution to the meaning of propositions in a language.
Table 2 shows the distribution of the various open class items found in the 우리말큰사전 Ulimal Keun Dictionary 'Korean Grand Dictionary' (the Society of Hangul 1992), which contains 399,217 lemmas (or entries) and is presently considered to be the most comprehensive Korean dictionary. The row HD in Table 2 gives the number of lemmas for each of the four open classes found in this dictionary. Together, the four open classes make up 98.27% of all the lemmas in this dictionary. This clearly shows that lexical items are the main targets of automatic lexical acquisition. In addition, the very high proportion of open class items (more than 98%) is a persuasive argument against attempting to compile a comprehensive computational lexicon manually.
Table 2. Number of entries distributed according to open classes in Korean

          Noun      Verb     Adjective  Adverb   Total
HD        305,030   55,677   15,934     15,679   392,320
Ratio     76.47%    13.96%   3.99%      3.93%    98.27%
HI        249,748   45,241   14,825     14,602   324,416
HD / HI   1.22      1.23     1.07       1.07     1.20
Regarding the distribution of open class items, there might be cases where finding
predicting functions by POS could be necessary. For instance, if we try to acquire a certain
number of items (for example, 20,000 items), regardless of POS, most of these items might
belong to a specific POS (probably mostly nouns). However, we sometimes need to acquire
items with certain POS constraints, for example, 20,000 nouns, 3,000 verbs, 1,000 adjectives
and 500 adverbs.
2.5. Homographs
Homographs, that is, different linguistic items with the same spelling, are problematic, particularly when they additionally share the same POS: such items cannot be distinguished, at least at the morphological level. This has led us to disregard them. If N is the actual number of different lemmas that we find in a given corpus, then the maximum number of different lemmas M we can get is

M = N × (HD / HI)

where HD stands for the number of entries in a dictionary and HI for the number of "homograph-independent" entries in the same dictionary, that is, counting all homographs of a given spelling as a single entry. The average rate of error AE is thus

AE = (HD / HI − 1) / 2 × 100.

For instance, the AE for nouns is (305,030 / 249,748 − 1) / 2 × 100 ≈ 11% (see Table 2).
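A quick arithmetic check of these two formulas, using the HD and HI counts from Table 2 (the script is our own illustration; the function names are ours):

```python
# Check of M = N * (HD / HI) and AE = (HD / HI - 1) / 2 * 100,
# using the HD and HI counts from Table 2.
HD = {"noun": 305_030, "verb": 55_677, "adjective": 15_934, "adverb": 15_679}
HI = {"noun": 249_748, "verb": 45_241, "adjective": 14_825, "adverb": 14_602}

def max_lemmas(n_observed, pos):
    """Upper bound on different lemmas, correcting for conflated homographs."""
    return n_observed * HD[pos] / HI[pos]

def average_error(pos):
    """Average rate of error AE, in percent."""
    return (HD[pos] / HI[pos] - 1) / 2 * 100

for pos in HD:
    print(f"{pos}: HD/HI = {HD[pos] / HI[pos]:.2f}, AE = {average_error(pos):.1f}%")
# nouns: AE ~ 11%, matching the worked example in the text
```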
2.6. Unknown Words
There are cases where the analyses returned by the morphological analyzer (for example, the NM-KTS) are not found in the reference dictionary (e.g., the Ulimal Keun Dictionary). This might be due to two reasons: (1) errors in word spacing, typing, spelling, etc.; or (2) the system's inability to analyze unknown words such as proper nouns, compounds or derivative items (shifted forms, i.e., a noun becoming a verb, etc.), as they are not in the lexical database of the
NM-KTS (Dan-Hee Yang 1999a).

Figure 2. Number of items (nouns and verbs) not found in the dictionary (x-axis: corpus size in tokens; y-axis: number of different lemmas absent from the dictionary; panel (1): nouns, panel (2): verbs)
Figure 2 illustrates the number of unknown words (according to the NM-KTS) found in the TOTAL Corpus. The x-axis represents the corpus size in total items or tokens and the y-axis the number of different items absent from the dictionary. The graph clearly shows that the increase of unknown nouns is steady, almost linear. Note that we found about 600,000 unknown nouns, which is over twice the number of nouns present (see HI in Table 2) in the Ulimal Keun Dictionary (our reference dictionary for Korean). Regarding verbs (see Figure 2 (2)), the number of unknown verbs is roughly 60,000, which is more than the number of verbs appearing in the Ulimal Keun Dictionary.
This is an important issue that needs careful examination; otherwise, misconceptions or misleading conclusions might easily lead us to believe that the 50 million word TOTAL Corpus is large enough to contain almost all the lemmas present in the reference dictionary. A second look at the statistics on the lemma slopes, not on mere frequencies, is worth taking: we insist that if a morphological analyzer/tagger analyzes a word and the resulting word is not in the reference dictionary, then it should be excluded from the statistical processing for corpus size prediction. This point has not been considered by related studies such as Heaps (1978), Young-Mi Jeong (1995) and Sánchez and Cantos (1997).
Consequently, the number of actual different lemmas in a corpus should always be
underestimated, as all unregistered items, though correct (such as proper nouns, compounds,
borrowings, etc.), will be excluded statistically in this study. Since the goal of our study is to
reliably predict the size of a corpus in order to compile a computational lexicon that contains
almost all the lemmas present in a current contemporary reference dictionary, the strategy of
underestimating the total number of different lemmas in corpus size prediction seems justifiable,
efficient and valid. Obviously, this also implies that all lemmas present in a current
contemporary reference dictionary need to be considered as part of the contemporary
vocabulary of a language, in our case, Korean. This leads us to set up the following natural
inclusion hypothesis:
While collecting/compiling those lemmas present in a contemporary
comprehensive dictionary from a corpus, one can, as a side effect, also
collect/compile other contemporary vocabulary items.
It follows from this hypothesis that any corpus whose size has been predicted to contain almost
all the lemmas present in a current contemporary reference dictionary, might also contain
additional contemporary vocabulary items. Clearly, the most optimistic consequence of the above hypothesis is that we might not just pick up some additional items of the contemporary vocabulary of the language under investigation, but acquire almost all the contemporary vocabulary of that language.
3. Analyzing the Total Corpus
3.1. Data Sparseness Curves
The idea of relative frequency is connected with probability through Bernoulli's theorem. This theorem states that if the probability of occurrence of an event X is p(X) and N independent trials are made under exactly the same conditions, the probability that the relative frequency of occurrence of X differs from p(X) by more than any fixed amount, however small, approaches zero as the number of trials grows indefinitely large. Focusing on the phrase "indefinitely large": if we consider each occurrence of a word (token) in a corpus as a trial, then the set of trials amounts to the sample space, that is, the corpus itself.
Figure 3. Data sparseness in nouns (x-axis: corpus size in units of 100,000 tokens; y-axis: number of lemmas; panel (1): frequency = 1, panel (2): frequency ≤ 3, panel (3): frequency ≤ 5)
The question now is: what is a reasonable size for a sample space, or corpus? To answer this question, let us go one step further and paraphrase it: if we hope to observe each lemma with a frequency higher than some desired value, how large does the corpus need to be? This is an important issue, particularly if we are interested in the variety of linguistic knowledge that can be automatically extracted and acquired from linguistic corpora. It is therefore of prime importance that each linguistic event occur at least as often as the desired frequency in the corpus.

As Figure 3 shows, the data sparseness slopes are almost horizontal, i.e., nearly parallel to the x-axis, from about 22.5 million tokens onwards. Around 9,000 different nouns occur only once (hapax legomena) in the TOTAL Corpus, around 14,000 different nouns occur three times or less, and around 18,000 different nouns appear with a frequency of five or less. It is important to notice that these frequency statistics do not refer to the same set of different nouns, but to a new set of different nouns at each corpus size xi.
It would, however, be a hasty generalization to assume that the near-horizontal slopes in Figure 3 continue unchanged, since all the slopes in Figure 5 still keep rising as new lemmas keep occurring. For that reason, judging from the predicting curve for nouns in Figure 6 (1), it is mathematically certain that we shall eventually find all the nouns listed in Table 2, regardless of how huge the corpus would have to be.
3.2. Shared and Unique Vocabulary
Table 3 shows the number of different lemmas found in the TOTAL Corpus. Considering the
number of dictionary entries in the reference dictionary (see Table 2), adjectives in the TOTAL
Corpus with a frequency of one or more just happen 4,720 / 14,825 × 100 = 31.8%, and the
other POS are not higher than 25%, which seems to be a quite low percentage. This evidence is
also ratified by other studies. For example, Hwi-Young Lee (1974) reported that people use on
average about 24,000 lemmas in actual conversation (nouns: 12,000 verbs: 4,500 adjectives:
5,500 adverbs: 1,000).
Table 3. The number of different lemmas in the TOTAL Corpus, arranged by POS

Frequency   Noun     Verb     Adjective   Adverb
≤ 3         16,474   3,631    1,398       990
≤ 5         20,457   4,413    1,760       1,180
≥ 1         62,497   12,188   4,720       3,138
≥ 10        37,015   6,950    2,689       1,876
≥ 30        26,640   5,340    2,113       1,426
≥ 50        21,869   4,604    1,860       1,225
One might think that the TOTAL Corpus is likely to supply us with sufficient information on nouns, verbs and adverbs (though not adjectives), since the number of different lemmas with frequency ≥ 50 for each of these POS is greater than the corresponding figure obtained by Hwi-Young Lee (1974). However, notice that each individual has unique linguistic competence, unique living experiences and, consequently, a unique vocabulary, and these features differ among individuals' speech to varying degrees. So, how many vocabulary items do speakers of the same language share? In other words, what can we consider to be the common core vocabulary of a language, in contrast to more idiolectal vocabulary items? This problem is inevitably related to corpus representativeness.
In this respect, Hyeok-Cheol Kwen (1997) compared the high frequency words (tokens) of the YSC Corpus (totaling 40 million tokens) and the KAIST Corpus (totaling 40 million tokens, compiled by the Korea Advanced Institute of Science and Technology). He reported that the two corpora share between 78% and 85% of their words: among the 200,000 most frequent items the intersection is 85%, among the 300,000 most frequent items it reaches 83%, among the 500,000 most frequent tokens it is 78%, and at 40 million tokens it is 84%. From this observation, we conclude that individuals speaking the same language are likely to share roughly 80% of their vocabulary, though this sharing rate would need to be either refined or defended at greater length.
This shared vocabulary (80% in this case) is easy to find because it is used throughout the same language-speaking community. By contrast, the remaining 20% amounts to the less frequent vocabulary, the vocabulary unique to certain linguistic domains or individuals, and hence needs to be gathered from elsewhere. This means that we still need to devise a tactic to cover the remaining 20%. Intuitively, we can conclude that the TOTAL Corpus cannot supply us with sufficient information, at least for our research purposes.
3.3. Reusability and Data Sparseness
Figure 4 shows reusability and data sparseness by POS in the TOTAL Corpus. Reusability is calculated as Count(freq ≥ 10, 30, or 50, respectively) / Count(freq ≥ 1); similarly, data sparseness is calculated as Count(freq ≤ 3 or 5, respectively) / Count(freq ≥ 1). In terms of reusability, about 58% of lemmas for freq ≥ 10, 44% for freq ≥ 30 and 38% for freq ≥ 50 are reused, irrespective of POS. In terms of data sparseness, about 29% of lemmas for freq ≤ 3 and 36% for freq ≤ 5 are sparse, irrespective of POS, even though nouns are a little sparser than the other POS.
Figure 4. Reusability and data sparseness by POS (y-axis: percentage; bars for freq ≤ 3, freq ≤ 5, freq ≥ 10, freq ≥ 30 and freq ≥ 50, for each of noun, verb, adjective and adverb)
These observations reveal that there is little difference among POS in reusability and data
sparseness, at least as far as the TOTAL Corpus (probably the KAIST Corpus too) is concerned.
However, further considerations are needed for nouns as unknown words have been excluded in
this experiment.
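These reusability and sparseness ratios can be recomputed directly from the counts in Table 3; the short script below is our own illustrative sketch (not the authors' code), with the values hard-coded from Table 3.

```python
# Reusability = Count(freq >= k) / Count(freq >= 1);
# data sparseness = Count(freq <= k) / Count(freq >= 1),
# computed from the per-POS counts in Table 3 of the TOTAL Corpus.
table3 = {
    #  POS:        (<=3,    <=5,    >=1,    >=10,   >=30,   >=50)
    "noun":      (16_474, 20_457, 62_497, 37_015, 26_640, 21_869),
    "verb":      ( 3_631,  4_413, 12_188,  6_950,  5_340,  4_604),
    "adjective": ( 1_398,  1_760,  4_720,  2_689,  2_113,  1_860),
    "adverb":    (   990,  1_180,  3_138,  1_876,  1_426,  1_225),
}

for pos, (le3, le5, ge1, ge10, ge30, ge50) in table3.items():
    reuse = {k: v / ge1 for k, v in (("≥10", ge10), ("≥30", ge30), ("≥50", ge50))}
    sparse = {k: v / ge1 for k, v in (("≤3", le3), ("≤5", le5))}
    print(pos,
          {k: f"{v:.0%}" for k, v in reuse.items()},
          {k: f"{v:.0%}" for k, v in sparse.items()})
# e.g. nouns: reusability ~59% (≥10), ~43% (≥30), ~35% (≥50); sparseness ~26% (≤3), ~33% (≤5)
```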
Assume that we can acquire the meaning of a lemma automatically when the lemma occurs with freq ≥ 10. Then we can define only about 58% of the total lemmas. If we consider the lemmas with freq ≤ 5 sparse, we should apply techniques for treating data sparseness (e.g., smoothing methods, proximity methods and clustering) to about 36% of the total lemmas. There is, however, no reason why we should not try to compile a very large corpus in order to reduce that 36% significantly. This is precisely what has motivated our study.
Figure 5. The number of different nouns having at least the specified frequencies (x-axis: corpus size in units of 100,000 tokens; y-axis: number of lemmas, actual versus estimated; panel (1): frequency ≥ 10, panel (2): frequency ≥ 30, panel (3): frequency ≥ 50)
4. Results of the Experiment
4.1. Prediction of Corpus Sizes
The desired frequency might vary depending on the aim of the investigation and/or the task one wants to carry out. Figure 5 shows the number of lemmas occurring at least 10, 30 and 50 times in the TOTAL Corpus. The number of different lemmas with frequency ≥ 10 is about 37,000; with frequency ≥ 30, about 26,500; and with frequency ≥ 50, about 22,000. Unfortunately, there is no research on desirable frequencies for automatic lexical acquisition that could guide us here. In this study, simply to minimize the error due to our relatively small-scale experimental data, we stick to a minimum frequency of 10 for predicting the corpus size.
Figure 6. Predicting curves by POS (frequency ≥ 10), plotting lemmas against tokens: (1) nouns, y = 100.22257 × x^0.333427; (2) verbs, y = 305.96220 × x^0.176116; (3) adjectives, y = 55.06041 × x^0.221324; (4) adverbs, y = 153.32264 × x^0.138917
Using the piecewise curve-fitting algorithm suggested by Dan-Hee Yang (2000) and considering a word frequency of ≥ 10, we obtain the predicting functions of Table 4 by POS and hence the predicting curves of Figure 6. Figure 6 clearly shows that as the corpus size grows, the rate of increase in the number of different lemmas becomes significantly smaller. However, we cannot rule out the possibility that the rate of increase of newly occurring lemmas falls abruptly somewhere in an interval beyond the range of the graphs in Figure 5; this is an intrinsic limitation of inductive reasoning.
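The predictions themselves come from the piecewise curve-fitting algorithm of Dan-Hee Yang (2000). As a rough illustration of the underlying idea only, a single power-law function y = a × x^b can be fitted to observed (corpus size, lemma count) pairs, for example by least squares in log-log space. The sketch below is our own simplification, not the authors' algorithm, and the sample data points are hypothetical.

```python
# Rough sketch: fit y = a * x**b to (tokens, lemmas) observations via a
# least-squares line in log-log space (log y = log a + b * log x).
# This is a simplification of the piecewise curve fitting used in the paper.
import numpy as np

def fit_power_law(tokens, lemmas):
    """Return (a, b) such that lemmas ~= a * tokens**b."""
    log_x, log_y = np.log10(tokens), np.log10(lemmas)
    b, log_a = np.polyfit(log_x, log_y, 1)   # slope, intercept
    return 10 ** log_a, b

# Hypothetical/approximate observations: noun lemmas (freq >= 10) at growing corpus sizes.
tokens = np.array([10e6, 20e6, 30e6, 40e6, 50e6])
lemmas = np.array([21_625, 27_247, 31_192, 34_400, 37_000])

a, b = fit_power_law(tokens, lemmas)
print(f"y = {a:.5f} * x^{b:.6f}")
```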
Table 4. Predicting functions and transformed functions by POS (y is the required number of items, x is the predicted corpus size, frequency ≥ 10)

POS          Predicting Function            Transformed Function
Nouns        y = 100.22257 × x^0.333427     x = 10^((log y − log 100.22257) / 0.333427)
Verbs        y = 305.96220 × x^0.176116     x = 10^((log y − log 305.96220) / 0.176116)
Adjectives   y = 55.06041 × x^0.221324      x = 10^((log y − log 55.06041) / 0.221324)
Adverbs      y = 153.32264 × x^0.138917     x = 10^((log y − log 153.32264) / 0.138917)
To obtain each transformed function in Table 4, we take logarithms of both sides of the predicting function. For nouns, for example,

y = 100.22257 × x^0.333427  ⟹  log y = log 100.22257 + 0.333427 log x,

so the predicted corpus size x follows from

0.333427 log x = log y − log 100.22257  ⟹  log x = (log y − log 100.22257) / 0.333427,

giving the transformed function

x = 10^((log y − log 100.22257) / 0.333427).

The other transformed functions in Table 4 are derived in the same way. We can use these functions to obtain the corpus size x required for a given number of nouns y.
To obtain 100,000 different nouns, the transformed function x = 10^((log y − log 100.22257) / 0.333427) shows that a corpus of 987 million tokens would be required (by solving x = 10^((log 100,000 − log 100.22257) / 0.333427)), and about 7.9 billion tokens would be needed to get 200,000 different nouns. To get 20,000 different verbs, x = 10^((log y − log 305.96220) / 0.176116) shows that roughly 20.3 billion tokens would be needed, and about 1.04 trillion tokens for 40,000 different verbs. The predicted corpus sizes by POS in Table 5 were calculated by taking the predicting functions of Table 4 and the number of dictionary entries per POS in Table 2. All in all, most of the predicted results clearly exceed the size of most presently available corpora and current corpus compilation expectations.
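These worked examples follow mechanically from the transformed functions; the following minimal sketch (our own, with the coefficients taken from Table 4) reproduces them:

```python
# Required corpus size x = 10**((log y - log a) / b) for the predicting
# functions y = a * x**b of Table 4 (frequency >= 10).
from math import log10

COEFFS = {  # POS: (a, b)
    "noun":      (100.22257, 0.333427),
    "verb":      (305.96220, 0.176116),
    "adjective": ( 55.06041, 0.221324),
    "adverb":    (153.32264, 0.138917),
}

def required_tokens(pos, lemmas_wanted):
    """Corpus size (tokens) needed for `lemmas_wanted` different lemmas of `pos`."""
    a, b = COEFFS[pos]
    return 10 ** ((log10(lemmas_wanted) - log10(a)) / b)

print(f"{required_tokens('noun', 100_000):.3e}")   # ~9.9e+08 (987 million tokens)
print(f"{required_tokens('noun', 200_000):.3e}")   # ~7.9e+09
print(f"{required_tokens('verb',  20_000):.3e}")   # ~2.0e+10
print(f"{required_tokens('verb',  40_000):.3e}")   # ~1.0e+12
```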
Table 5. The predicted corpus sizes by POS according to the transformed functions in Table 4 (frequency ≥ 10). For each POS, the first figure is the predicted number of different words and the second the increase over the previous corpus size.

Corpus size     Nouns                Verbs              Adjectives         Adverbs
10 million       21,625      -        5,230      -       1,950      -       1,439      -
20 million       27,247    5,623      5,909     679      2,274     323      1,584     145
30 million       31,192    3,944      6,346     437      2,487     213      1,676      92
100 million      46,600   15,408      7,845   1,499      3,247     759      1,981     305
200 million      58,716   12,116      8,864   1,019      3,785     538      2,182     200
300 million      67,215    8,500      9,520     656      4,140     355      2,308     126
1 billion       100,417   33,202     11,768   2,249      5,404   1,264      2,728     420
2 billion       126,526   26,109     13,296   1,528      6,301     896      3,004     276
3 billion       144,842   18,316     14,280     984      6,892     592      3,178     174
10 billion      216,389   71,547     17,653   3,373      8,997   2,104      3,756     579
20 billion      272,651   56,262     19,945   2,292     10,488   1,492      4,136     380
30 billion                           21,422   1,476     11,473     985      4,376     240
100 billion                          26,482   5,060     14,976   3,503      5,172     797
200 billion                          29,920   3,438                         5,695     523
300 billion                          32,135   2,215                         6,025     330
1 trillion                           39,725   7,590                         7,122   1,097
2 trillion                           44,882   5,158                         7,842     720
3 trillion                           48,205   3,322                         8,296     454
10 trillion                                                                 9,807   1,510
20 trillion                                                                10,798     991
30 trillion                                                                11,424     626
100 trillion                                                               13,503   2,080
200 trillion                                                               14,868   1,365
4.2. Explanation of the Extreme Size
The astronomical sizes in Table 5 call for some brief comments: (1) it is likely that many dictionary entries are not used in our everyday language, and (2) there are significant inconsistencies and discrepancies between the criteria for selecting dictionary entries and the rules by which morphological analysis is carried out.

Most comprehensive paper/compact disc (CD) dictionaries normally include entries without considering frequency, except for some recent corpus-based dictionaries (see Collins COBUILD, CUMBRE Gran Diccionario de la Lengua Española). Regarding nouns, for example, many reference dictionaries include a great number of proper nouns, compound nouns, derivative nouns, technical terms, and so on. However, many of these terms are little used in real life. Similarly, many adverbs and adjectives are most likely to be used only in literary works. It follows intuitively that we cannot find such terms occurring more often than the given frequency unless we have at our disposal a corpus big enough to include virtually all possible text types.
Regarding verbs, passivization and causativization are realized morphologically in Korean. Korean dictionaries normally include these variants or derivations as entries whenever they are frequently used in everyday conversation. However, most morphological analyzers, including the NM-KTS, return the base form and a suffix. For example, the NM-KTS analyzes the term 공부하다 kongpu-hata 'make a study' as 공부 kongpu 'study' (noun) plus -하다 -hata 'make' (verbal suffix), whereas our reference dictionary includes the term as a full verb (lemma). Accordingly, the analyzer underestimates the number of verbs and overestimates the number of nouns. We seriously doubt whether Korean actually has around 55,677 verbs.
As previously mentioned in Section 2.4, the NM-KTS analyzer is highly reliable
regarding adverbs. Nevertheless, since many adverbs are not likely to be used in everyday
language, we cannot take adverb predictions totally for granted (see Table 5).
4.3. Increasing Effects
Figure 7 displays the effects of increasing the corpus size, based on Table 5. The first step, up to 10 million tokens, produces the maximal effect. The next step of 10 million tokens is obviously less effective: 26.0% (5,623 / 21,625) for nouns, 13.0% (679 / 5,230) for verbs, 16.6% (323 / 1,950) for adjectives and 10.1% (145 / 1,439) for adverbs. Successive 10 million token increases are even less effective. These graphs give us some clues about the economic efficiency of trying to build an enormous corpus (e.g., 3 trillion tokens); for instance, a 2 trillion token corpus contains 7,842 adverbs occurring at least 10 times, whereas
adding a further trillion tokens to the corpus gives just 454 new adverbs, according to Table 5.

Figure 7. Effect of increasing the corpus size (frequency ≥ 10; x-axis: tokens; y-axis: lemmas): (1) Nouns, (2) Verbs, (3) Adjectives, (4) Adverbs
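The 'Increase' figures in Table 5 and the percentages above are simply successive differences of the predicting functions; a minimal sketch of that calculation (our own illustration, with the coefficients taken from Table 4):

```python
# Marginal gain in lemmas per 10-million-token step, from the predicting
# functions y = a * x**b of Table 4 (frequency >= 10).
COEFFS = {  # POS: (a, b)
    "noun":      (100.22257, 0.333427),
    "verb":      (305.96220, 0.176116),
    "adjective": ( 55.06041, 0.221324),
    "adverb":    (153.32264, 0.138917),
}

def predicted_lemmas(pos, tokens):
    """Predicted number of different lemmas of `pos` in a corpus of `tokens` tokens."""
    a, b = COEFFS[pos]
    return a * tokens ** b

for pos in COEFFS:
    first = predicted_lemmas(pos, 10e6)            # first 10M-token step
    gain = predicted_lemmas(pos, 20e6) - first     # gain from the next 10M tokens
    print(f"{pos}: first step {first:,.0f}, next step +{gain:,.0f} ({gain / first:.1%})")
# nouns: ~21,625 then +5,622 (26.0%); verbs: ~5,230 then +679 (13.0%); etc.
```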
5. Conclusion
In this study, an attempt has been made to predict the corpus sizes needed for compiling a comprehensive computational lexicon by automatic lexical acquisition. Such predictions also help to estimate elaboration and compilation costs. The primacy of quantity need not be maintained at all costs, regardless of issues and problems such as: (1) the relevance of the new data offered, (2) economy, (3) complexity and (4) the financial support required. Predictive tools like the ones offered here can be very useful in this respect, as they help us to estimate the total compiling cost and to determine the most efficient compiling strategy in accordance with the economy principle.
To achieve this twofold goal, our research has tried to overcome several problems that previous studies have failed to notice. First, we have insisted on finding a specific predicting function rather than a universal one. Second, there is no special variation among corpora for prediction purposes, in the sense that their curves are all monotone, increasing and convex up. Third, base forms of words, rather than just tokens (word forms), should be considered when compiling dictionaries for NLP. Fourth, it is worthwhile distinguishing between open class items. Finally, it makes sense to disregard unknown words in the results of morphological analysis.

Traditionally, most corpora have been compiled for lexicography and/or corpus linguistics, where quality (representativeness or balance), rather than quantity, has been regarded as the goal worth pursuing. In a sense, Bernoulli's theorem has been overshadowed by sampling theory in statistics. This has obviously resulted in underestimating very important issues such as data sparseness. In order to overcome data sparseness in NLP-oriented corpora for developing practical NLP software, however, we should prefer quantity to quality, so that the probabilities obtained from a corpus can be validated according to Bernoulli's theorem.
A seemingly serious drawback of our study is the huge predicted sizes, which seem 'out of this world' and unlikely to be compiled, at least by means of presently existing compiling strategies (oral transcriptions, manual typing, etc.). Instead, we prefer to determine accurately the corpus size needed for each specific research task; that is, the number of lemmas that we need will determine the required size of the corpus. In addition, we also need to consider the total cost of compiling such a corpus. Therefore, what constitutes a reasonable corpus size is also closely related to whether we can afford the investment. Often, appending huge amounts of new data yields only a few new lemmas.
In the worst case, if we inevitably need a 200 trillion token corpus, we might consider two possible compiling methods or strategies: (1) randomly collect as many different text types as possible, namely electronic texts from the Internet, publishing companies, electronic newspapers, electronic libraries, BBSs, and so on; or (2) treat the Internet itself as a virtual corpus. In either case, it is necessary to collect on-line texts without any further condition, as long as they are correct in word spacing and spelling. We strongly believe that, as long as we do not deliberately try to compile an unbalanced corpus, the larger the corpus is, the more balanced and representative it gets. However, there is always the possibility that some linguistic phenomena, structures, or words are not found even in a very large corpus.
In the future, we shall try to resolve some of the limitations of this study: (1) the small size of the experimental corpus (the TOTAL Corpus); of course, the larger the corpus, the more accurate the results; (2) some inconsistencies between the morphological analyzer (NM-KTS) and the reference dictionary (Ulimal Keun Dictionary); and (3) the inaccuracy of the morphological analyzer. Regrettably, the last point creates a deadlock, in the sense that an enormous corpus is required to develop a more accurate morphological analyzer in statistical NLP.
All in all, we are confident that this study will shed new light on issues such as corpus
predictability, corpus-compiling policy, and the reliability of corpus-based NLP.
Acknowledgments
This research was funded by the Ministry of Information and Communication of Korea under
contract 98-86. This work was also partly supported by grant No. 2001-2-52200-001-2 from the Basic Research Program of the Korea Science & Engineering Foundation. Thanks go to Dr.
David Walton at the University of Murcia for reading over the final draft of this paper and for
offering some useful suggestions.
References
Church, Kenneth W. and Robert L. Mercer (1994). Introduction to the Special Issue on
Computational Linguistics Using Large Corpora. Using Large Corpora, 1-24, edited by
Susan Armstrong. The MIT Press.
De Haan, Pieter (1992). The Optimum Corpus Sample Size? In Leitner, Gerhard (ed.): New Directions in English Language Corpora: Methodology, Results, Software Development, 3-19. Mouton de Gruyter, New York.
Hays, William (1994). Statistics, 42-47, 94-111. Harcourt Brace College Publishers, Florida.
Heaps, H. S. (1978). Information Retrieval: Computational and Theoretical Aspects, 206-208.
Academic Press, New York.
Jeong, Chan-Sup, Sang-Sup Lee, Ki-Sim Nam, et al. (1990). Selection Criteria of Sampling for
Frequency Survey in Korean Words. Lexicographic Study, Vol. 3, 7-69. Tap Press, Seoul.
Jeong, Young-Mi (1995). Statistical Characteristics of Korean Vocabulary and Its Application.
Lexicographic Study, Vol. 5, 6, 134-163. Tap Press, Seoul.
Kwen, Hyeok-Cheol (1997). Performance Improvement of Korean Information Processing
System by Using a Corpus. Corpus and the Korean Language Information. The 9th Annual
Meeting of Korean Lexicographic Center. Korean Lexicographic Center at Yonsei
University. Seoul.
Lauer, Mark (1995a). Conserving Fuel in Statistical Language Learning: Predicting Data
Requirements. In the 8th Australian Joint Conference on Artificial Intelligence. Canberra.
Lauer, Mark (1995b). How much is enough?: Data requirements for statistical NLP. In the 2nd Conference of the Pacific Association for Computational Linguistics. Brisbane, Australia.
Lee, Hwi-Young (1974). An Introduction to French Linguistics, 47. Jeong-Eum-Sa, Seoul.
Resnik, Philip (1993). Selection and Information: A Class-Based Approach to Lexical
Relationships, 6-33. Ph.D. Dissertation of Department of Computer and Information
Science, University of Pennsylvania.
Sánchez, Aquilino and Pascual Cantos (1997). Predictability of Word Forms (Types) and
Lemmas in Linguistic Corpora. A Case Study Based on the Analysis of the CUMBRE
Corpus: An 8-Million-Word Corpus of Contemporary Spanish. In International Journal of
Corpus Linguistics 2(2), 259-280.
Sánchez, Aquilino and Pascual Cantos (1998). El ritmo incremental de palabras nuevas en los
repertorios de textos. Estudio experimental y comparativo basado en dos corpus
lingüísticos equivalentes de cuatro millones de palabras, de las lenguas inglesa y española y
en cinco autores de ambas lenguas, Atlantis (Revista de la Asociación Española de Estudios
Anglo-Norteamericanos), Vol. XIX (1), 205-223, Spain.
Katz, Slava (1996). Distribution of Content Words and Phrases in Text and Language
Modelling. In Journal of Natural Language Engineering 2(1), 15-59.
Stewart, Ian and David Tall (1977). The Foundations of Mathematics, 41-61. Oxford University
Press.
The Society of Hangul (1992). 우리말 큰사전 'Ulimal Keun Dictionary'. Eomun-gak Press, Seoul.
Weischedel, Ralph et al. (1994). Coping with Ambiguity and Unknown Words through
Probabilistic Models. Using Large Corpora, edited by Susan Armstrong, 323-326. The
MIT Press.
Yang, Dan-Hee, Mansuk Song (1998). How Much Training Data Is Required to Remove Data
Sparseness in Statistical language Learning? In Proceedings of the First Workshop on Text,
Speech, Dialogue (TSD'98), 141-146, Brno.
Yang, Dan-Hee, Su-Jong Lim, Mansuk Song (1999a). The Estimate of the Corpus Size
for Solving Data Sparseness. In Journal of KISS, Vol. 26, No. 4, 568-583.
Yang, Dan-Hee, Mansuk Song (1999b). Representation and Acquisition of the Word Meaning
for Picking out Thematic Roles, In International Journal of Computer Processing of
Oriental Languages (CPOL), Vol. 12, No. 2, 161-177, the Oriental Languages Computer
Society.
Yang, Dan-Hee, Pascual Cantos Gómez, Mansuk Song (2000). "An Algorithm for Predicting
the Relationship between Lemmas and Corpus Size", In ETRI Journal, Vol. 22, No. 2,
Electronics and Telecommunications Research Institute.
Zernik, Uri (1991). Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon, 1-26.
Lawrence Erlbaum Associates.