On the Corpus Size Needed for Compiling a Comprehensive Computational Lexicon by Automatic Lexical Acquisition

Dan-Hee Yang, Dept. of Computer Science, Pyongtaek Univ., Korea
Ik-Hwan Lee, Dept. of English, Yonsei Univ., Korea
Pascual Cantos, Dept. of English, Murcia Univ., Spain

Abstract

Comprehensive computational lexicons are essential to practical natural language processing (NLP). To compile such computational lexicons by automatically acquiring lexical information, however, we first require sufficiently large corpora. This study aims at predicting the ideal size of such automatic-lexical-acquisition oriented corpora, focusing on six specific factors: (1) specific versus general purpose prediction, (2) variation among corpora, (3) base forms versus inflected forms, (4) open class items, (5) homographs, and (6) unknown words. Another important and related issue for predictability is data sparseness. Research using the TOTAL Corpus reveals serious data sparseness in this corpus. This, again, points towards the importance and necessity of reducing data sparseness to a satisfactory level for automatic lexical acquisition and reliable corpus predictions. The functions for predicting the number of tokens and lemmas in a corpus are based on the piecewise curve-fitting algorithm. Unfortunately, the predicted size of a corpus for automatic lexical acquisition is too astronomical to compile with presently existing compiling strategies. Therefore, we suggest a practical and efficient alternative method. We are confident that this study will shed new light on issues such as corpus predictability, compiling strategies and linguistic comprehensiveness.

1. Introduction

In principle, practical semantic analyzers should be able to process virtually any phrase or sentence, even one containing words not included in a concise medium-size dictionary. This would only be possible if we had at our disposal a computational lexicon that is complete in the amount and depth of information encoded for each entry as well as in its size. This is probably one of the main reasons for the urgent need of a complete computational lexicon for practical natural language processing (NLP). The problem, however, is how to compile such a complete computational lexicon for an automatic semantic analyzer.

Currently, we find a great deal of interesting research on the automatic acquisition of linguistic information from corpora by means of statistical or machine learning mechanisms (Church 1994, Resnik 1993, Weischedel 1994, Dan-Hee Yang 1999b). Nevertheless, all these works suffer from data sparseness: a phenomenon whereby a given corpus fails to provide sufficient information or data on the relevant word-to-word relationships. This is an intrinsic problem and a serious drawback for most corpus-based NLP work. In various studies using a small-scale corpus, low frequency words detracted from the reliability of the probabilities obtained from the corpus. In other words, even if a certain word never occurs in a corpus, we cannot be confident that its probability is zero, simply because the experimental data (corpus) is not large enough to warrant confidence in the statistics obtained from it. This situation suggests that we should compile larger corpora for more reliable statistical NLP. Regarding corpora, data sparseness relates to both quantity (corpus size) and quality (composition of a corpus). In this study, however, we focus mainly on quantity, or corpus size.
That is, how large does a corpus need to be in order to build a comprehensive computational lexicon by means of automatic lexical acquisition procedures and techniques (Zernik 1991, Yang 1999b)? We probably do not need to predict the required size of a corpus with great accuracy, but we do need enough accuracy to estimate the total corpus compiling cost: time, money, human effort, etc. Such cost estimation allows us to lay down more practical compiling strategies than before.

Katz (1996) emphasized that corpora of different domains will simply have different words in them, and that assuming there are probabilities for domain-specific words to occur in a language in general is a bad idea. Assume that there are domains D1, D2, ..., Dn and their corresponding corpora C1, C2, ..., Cn in a language. Then, the probability for a domain-specific word wt in Ci becomes Prob(wt|Ci). Here, we can define an omni-domain Ds including all Dj and its corresponding omni-corpus Cs consisting of all Cj. Then, the probability for wt in Cs becomes Prob(wt|Cs). As a matter of fact, Prob(wt|Ci) will, in general, not coincide with Prob(wt|Cs) (Hays 1994). However, the aim of this study is to estimate the size of Cs that can justify the probability Prob(wt|Cs) "in general", rather than the size of a universal set [1] (Stewart 1977), though we expect that Prob(wt|Ci) will gradually converge to Prob(wt|Cs) as the size of Ci grows.

[1] An ideal set which includes absolutely everything; here, every word in a language.

Most previous investigations on this matter, however, have not considered issues such as: (1) the implausibility of general purpose prediction, (2) the similarities and differences among corpora, (3) the importance of base forms, (4) the need for discriminating open class items, and (5) the limitations of morphological analyzers and taggers, particularly when dealing with homographs and unknown words. We believe that neglecting these factors might have made previous investigations less reliable, particularly considering their potential contribution to corpus size prediction. In this study, we shall first discuss the above issues and then experiment on the TOTAL Corpus to understand its actual state regarding data sparseness. Finally, we shall try to predict the corpus size needed for compiling a comprehensive computational lexicon by the piecewise curve-fitting algorithm (for more details of this algorithm, see Dan-Hee Yang 2000).

2. Corpus Size Prediction: Preliminary Considerations

2.1. Specific versus General Purpose Prediction

Lauer (1995a, 1995b) outlined the basic requirements regarding the size of linguistic corpora for general statistical NLP. He suggested a virtual statistical language learning system by establishing an upper bound on the expected error rate of a class of statistical language learners as a function of the size of the training data. However, his approach partially lacks validity because the amount of training data that one can extract from the same corpus differs significantly depending on the type of linguistic knowledge to be learned and on the procedure and technique used to extract data from the corpus. Similarly, De Haan (1992) reported that the suitability of data seems to depend on the specific study undertaken and added that there is no such thing as the best or optimal size. Note that Weischedel et al.
(1994) demonstrated that a tri-tag model using 47 possible parts-of-speech (POS) would need a little more than one million words of training data. However, most researchers will agree that this size is neither suitable nor sufficient for semantic acquisition, among other tasks. This implies that the required corpus size is heavily dependent on the linguistic research we want to carry out. Consequently, there is no use trying to predict a corpus size that can satisfy all linguistic requirements and expectations.

2.2. Variation among Corpora

Empirical corpus-based data depend heavily on the corpora from which the information has been extracted. This means that different linguistic domains (economy, medicine, science, computing, etc.), different authors, social strata, degrees of formality, and media (radio, TV, newspaper, etc.) result in different token-lemma [2] relationships (see Sánchez and Cantos 1998). For an illustration of this issue, consider the variation among various corpora (Table 1 and Figure 1). In order to get the most universally applicable data, we would need a very balanced corpus, which should entail all major linguistic domains, media, styles, etc., emulating real linguistic performance; in other words, a corpus that reliably models the language.

[2] Lemmas are also referred to in the literature as lexemes, base forms or dictionary entries.

Table 1. Corpora used for the experiment

Short Name   Number of Tokens   Sampling Criteria
STANDARD     2,022,291          Reading pattern
DEWEY        1,046,833          Dewey decimal
1980s        5,177,744          80s texts
1970s        7,033,594          70s texts
1960s        6,882,884          60s texts
1990s        7,171,653          90s texts
TEXT         674,521            Textbooks
CHILD        1,153,783          Children's books
NEWS         10,556,514         Newspaper
TDMS         9,060,973          Sampling

[Figure 1. Lemma growth in various corpora. Upper panel: the STANDARD, DEWEY, TEXT and CHILD Corpora; lower panel: the 1960s, 1970s, 1980s, 1990s, TDMS and NEWS Corpora. x-axis: corpus size (unit: 100,000 tokens); y-axis: number of lemmas.]

The Yonsei Corpus (YSC) consists of eight subcorpora (YSC I ~ YSC IX) compiled in accordance with different sampling criteria. The Center for Language and Information Development (CLID) paid special attention to the sampling criteria in order to get a most balanced corpus (see Chan-Sup Jeong et al. 1990). In Figure 1, the x-axes give the corpus size in words (tokens) and the y-axes the number of different lemmas. The TEXT Corpus consists of Korean textbooks (ranging from elementary to high school levels) written by native Korean-speaking authors. The CHILD Corpus was compiled from samples taken from children's books (fairy tales, children's fiction, etc.). The upper graph in Figure 1 shows that the number of different lemmas in the TEXT and CHILD Corpora increases at a slower rate than in the DEWEY and STANDARD Corpora. This seems obvious, as both the DEWEY and the STANDARD Corpus contain different linguistic domains and, additionally, their texts reflect adult language, which is, in principle, more varied and lexically and semantically more complex.
Note that the 1960s ~ 1990s Corpora were compiled with chronological criteria in mind, the DEWEY Corpus by the Dewey decimal classification criteria, and the STANDARD Corpus by means of reading pattern criteria. The TDMS (Text and Dictionary Management System) Corpus was compiled by the Korea Advanced Institute of Science and Technology (KAIST), in a similar manner to the STANDARD Corpus. The NEWS Corpus is a CD-ROM title consisting of all the Chosun-ilbo newspapers issued between 1993 and 1994. The lower graph in Figure 1 shows that the NEWS Corpus has a lower lemma growth than its counterparts on the same graph. It is also interesting how the 1960s Corpus behaves compared with the 1970s, 1980s and 1990s Corpora. Notice that the difference between the 1960s and the 1990s Corpus is more marked than that between the NEWS and the 1990s Corpus, even though the 1960s and the 1990s Corpus (not the NEWS Corpus) were compiled following similar sampling criteria.

From these observations it follows that sampling strategies affect the lemma-token slope to some degree. More important, however, is the fact that all corpora share one inherent characteristic: a monotone, increasing, convex-up curve. Furthermore, in order to overcome all these differences regarding sampling and the lemma-token relationship, we decided to merge all of the 10 corpora above into a single one, on the assumption that each corpus is a reliable and balanced model. It seems plausible to think that merging various reliable and balanced models can only result in an even better language-like model. The 10 corpora gave rise to the TOTAL Corpus (totaling 50,780,790 tokens), the model on which we shall base our research.

2.3. Base Form versus Inflected Form

Base forms are most valuable for constructing lexicons for NLP and indexes for information retrieval systems. It seems wise, therefore, to consider these forms (lemmas) rather than inflected ones (types), though Heaps (1978) and Young-Mi Jeong (1995) did not consider lemmas for information retrieval systems. To get a taste of sentence structure and grammatical relationships in Korean, consider the following sentence:

철수가 그 논문을 썼다.
Chelswu-ka ku nonmwun-ul ssessta.
Chelswu-SM the thesis-OM wrote
'Chelswu wrote the thesis.'

-가 -ka (subject marker) and -을 -ul (object marker) are case particles, occasionally having functions similar to prepositions in other languages. As case particles are bound to other items and cannot appear in isolation or on their own (a main feature of agglutinative languages), we shall not consider them for measuring and predicting the size of a corpus. Consequently, the number of different tokens in the above sentence is the same in Korean and English, whereas the number of different lemmas differs: 6 (including case particles) for Korean and 4 for English. Notice, however, that this has little if any effect on the total number of different lemmas in a large corpus, as the total numbers of case particles in Korean and of prepositions in English are each below 100.

Sánchez and Cantos (1997) also pointed out the need for such discrimination (token versus lemma). However, they focused more on the distinction and relationship between token and type (word form). Their approach to the token-lemma relationship was based solely on a hand-lemmatized sample of just 250,000 tokens, which we consider too small for this study and insufficient to draw any conclusions on corpus size.
To elucidate the number of different lemmas in a corpus, we lemmatized and tagged the corpus by means of the NM-KTS (New Morphological analyzer and the KAIST Tagging System), which reaches an accuracy rate of 96%, and 75% in guessing words not registered in its internal dictionary (Dan-Hee Yang 1998).

2.4. Open Class Items

Nouns, verbs, adjectives and adverbs are considered open class items. These parts of speech (POS) are open in the sense that new items are constantly added to the language. Clearly, nouns increase much more than any other POS, particularly because of jargon, new compounds, proper nouns and borrowings from other languages. The other open class items (verbs, adjectives and adverbs) are clearly less productive, which leads us to consider different degrees of productivity among the open classes. In addition, open class items are also considered to be lexical items due to their contribution to the meaning of propositions in a language.

Table 2 shows the distribution of the various open class items found in the 우리말큰사전 Ulimal Keun Dictionary 'Korean Grand Dictionary' (the Society of Hangul 1992), which consists of 399,217 lemmas (or entries) and is presently considered to be the most comprehensive Korean dictionary. HD (see Table 2) gives the number of lemmas for each of the four open classes found in this dictionary. The four open classes account for 98.27% of all the lemmas in this dictionary. This clearly shows that lexical items are the main targets in automatic lexical acquisition. In addition, the very high proportion of open class items (more than 98%) is a persuasive argument against attempting to compile a comprehensive computational lexicon manually.

Table 2. Number of entries distributed according to open classes in Korean

           Noun      Verb     Adjective   Adverb   Total
HD         305,030   55,677   15,934      15,679   392,320
Ratio      76.47%    13.96%   3.99%       3.93%    98.27%
HI         249,748   45,241   14,825      14,602   324,416
HD / HI    1.22      1.23     1.07        1.07     1.20

Regarding the distribution of open class items, there might be cases where finding predicting functions by POS is necessary. For instance, if we try to acquire a certain number of items (for example, 20,000 items) regardless of POS, most of these items might belong to a specific POS (probably mostly nouns). However, we sometimes need to acquire items under certain POS constraints, for example, 20,000 nouns, 3,000 verbs, 1,000 adjectives and 500 adverbs.

2.5. Homographs

Homographs, that is, different linguistic items with the same spelling, are problematic, particularly when they additionally share the same POS. Such items cannot be distinguished, at least at a morphological level. This has led us to disregard them, considering that if N is the actual number of different lemmas that we find in any given corpus, then the maximum number of different lemmas M we can get will be

  M = N × (HD / HI)

where HD stands for the number of entries in a dictionary and HI for the total number of "homograph-independent" entries in the same dictionary, that is, counting all homographs with the same spelling as a single entry. Thus the average rate of error AE is

  AE = (HD / HI - 1) / 2 × 100.

For instance, the AE for nouns is (305,030 / 249,748 - 1) / 2 × 100 ≈ 11% (see Table 2).
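To make the homograph adjustment concrete, the short Python sketch below applies the two formulas above to the HD and HI counts of Table 2. The variable and function names are ours, introduced purely for illustration; this is not part of the original experiment.

```python
# Homograph adjustment of Section 2.5, using the HD and HI counts of Table 2.
HD = {"noun": 305_030, "verb": 55_677, "adjective": 15_934, "adverb": 15_679}
HI = {"noun": 249_748, "verb": 45_241, "adjective": 14_825, "adverb": 14_602}

def max_lemmas(n_observed: int, pos: str) -> float:
    """M = N * (HD / HI): upper bound on the number of different lemmas
    once homographs sharing spelling and POS are taken into account."""
    return n_observed * HD[pos] / HI[pos]

def average_error(pos: str) -> float:
    """AE = (HD / HI - 1) / 2 * 100, in percent."""
    return (HD[pos] / HI[pos] - 1) / 2 * 100

print(round(average_error("noun")))   # ~11 percent, as computed in the paper
```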
2.6. Unknown Words

There are cases where the analyses returned by the morphological analyzer, for example the NM-KTS, are not listed in the reference dictionary (e.g., the Ulimal Keun Dictionary). This might be due to two reasons: (1) errors in word spacing, typing, spelling, etc.; or (2) the incapacity of the system to analyze unknown words such as proper nouns, compounds or derivative items (shifted forms, i.e. a noun becoming a verb, etc.), as they are not in the lexical database of the NM-KTS (Dan-Hee Yang 1999a).

[Figure 2. Number of items (nouns and verbs) not found in the dictionary. Panels: (1) Noun; (2) Verb. x-axis: corpus size; y-axis: lemmas.]

Figure 2 illustrates the number of unknown words (according to the NM-KTS) found in the TOTAL Corpus. The x-axis represents the corpus size in total items or tokens and the y-axis the number of different items absent from the dictionary. The graph clearly shows that the increase of unknown nouns is steady, almost linear. Note that we found about 600,000 unknown nouns, which is over twice the number of nouns present in the Ulimal Keun Dictionary, our reference dictionary for Korean (see HI in Table 2). Regarding verbs (see Figure 2 (2)), the number of unknown verbs is roughly 60,000, which amounts to more verbs than appear in the Ulimal Keun Dictionary.

This is an important issue that needs careful examination; otherwise misconceptions or misleading conclusions might easily lead us to believe that the 50 million word TOTAL Corpus could be sufficiently large to find almost all the lemmas present in the reference dictionary. A second look at the statistics on the lemma slopes, not on mere frequencies, is worth considering: we insist here that if a morphological analyzer/tagger analyzes a word and the resulting word is not in the reference dictionary, then it should be excluded from the statistical processing for corpus size prediction. This fact has not been considered or realized by other related studies such as Heaps (1978), Young-Mi Jeong (1995) and Sánchez and Cantos (1997). Consequently, the number of actual different lemmas in a corpus will always be underestimated, as all unregistered items, even correct ones (such as proper nouns, compounds, borrowings, etc.), are excluded statistically in this study.

Since the goal of our study is to reliably predict the size of a corpus for compiling a computational lexicon that contains almost all the lemmas present in a current contemporary reference dictionary, the strategy of underestimating the total number of different lemmas in corpus size prediction seems justifiable, efficient and valid. Obviously, this also implies that all lemmas present in a current contemporary reference dictionary need to be considered part of the contemporary vocabulary of a language, in our case Korean. This leads us to set up the following natural inclusion hypothesis: while collecting/compiling from a corpus those lemmas present in a contemporary comprehensive dictionary, one can, as a side effect, also collect/compile other contemporary vocabulary items. It follows from this hypothesis that any corpus whose size has been predicted to contain almost all the lemmas present in a current contemporary reference dictionary might also contain additional contemporary vocabulary items.
Clearly, the most optimistic consequence of the above hypothesis is that we might not just get some additional items of the contemporary vocabulary of the language under investigation, but almost all the contemporary vocabulary of that language.

3. Analyzing the TOTAL Corpus

3.1. Data Sparseness Curves

The idea of relative frequency is connected with probability through Bernoulli's theorem. This theorem states that if the probability of occurrence of an event X is p(X) and N trials are made, independently and under exactly the same conditions, then the probability that the relative frequency of occurrence of X differs from p(X) by any amount, however small, approaches zero as the number of trials grows indefinitely large. Focusing on the term "indefinitely large", if we consider each occurrence of a word (token) in a corpus as a trial, then the total number of trials amounts to the sample space, that is, the corpus itself.

[Figure 3. Data sparseness in nouns. Panels: (1) frequency = 1; (2) frequency ≤ 3; (3) frequency ≤ 5. x-axis: corpus size (unit: 100,000 tokens); y-axis: lemmas.]

The question now is: what is a reasonable size for a sample space or corpus? To answer this question, let us go one step further and paraphrase the question: if we hope to observe the occurrence of each lemma with a frequency higher than some desirable one, how large does the corpus then need to be? This seems to be an important issue, particularly if we are interested in the variety of linguistic knowledge that can be automatically extracted and acquired from linguistic corpora. Therefore, it is of prime importance that each linguistic event occur more frequently than the desirable frequency in the corpus.

In Figure 3, the data sparseness slopes are almost horizontally linear, or parallel to the x-axis, from xi = 22.5 million tokens onwards. Around 9,000 different nouns occur only once (hapax legomena) in the TOTAL Corpus, around 14,000 different nouns occur three times or less, and around 18,000 different nouns appear with a frequency of five or less. It is important to notice that these frequency statistics do not refer to the same set of different nouns, but to a new set of different nouns for each xi. It would be a hasty generalization to assume that the horizontal linearity of the slopes in Figure 3 continues without any change, as all the slopes in Figure 5 still keep going up while new lemmas keep occurring. For that reason, judging from the predicting curve for nouns in Figure 6 (1), it is mathematically certain that we shall eventually find all the nouns shown in Table 2, however huge the corpus may have to be.
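The curves of Figure 3 amount to counting, at each corpus-size checkpoint, how many distinct lemmas have so far occurred at most once, three times, or five times. A minimal Python sketch of that bookkeeping is given below; it assumes a hypothetical stream of lemmatized tokens in corpus order and is not the authors' implementation.

```python
from collections import Counter

def sparseness_curve(lemma_stream, checkpoints, max_freq=1):
    """For each corpus-size checkpoint (in tokens), count how many distinct
    lemmas have occurred at most `max_freq` times so far: the quantity
    plotted in Figure 3 for max_freq = 1, 3 and 5."""
    counts = Counter()
    targets = sorted(checkpoints)
    curve = []
    for i, lemma in enumerate(lemma_stream, start=1):
        counts[lemma] += 1
        if targets and i == targets[0]:
            targets.pop(0)
            curve.append(sum(1 for c in counts.values() if c <= max_freq))
    return curve

# e.g. sparseness_curve(noun_lemmas, range(1_500_000, 49_500_001, 1_500_000), max_freq=3)
```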
3.2. Shared and Unique Vocabulary

Table 3 shows the number of different lemmas found in the TOTAL Corpus. Considering the number of dictionary entries in the reference dictionary (see Table 2), the adjectives occurring in the TOTAL Corpus with a frequency of one or more amount to only 4,720 / 14,825 × 100 = 31.8% of the dictionary entries, and the other POS are not higher than 25%, which seems quite a low percentage. This evidence is also corroborated by other studies. For example, Hwi-Young Lee (1974) reported that people use on average about 24,000 lemmas in actual conversation (nouns: 12,000; verbs: 4,500; adjectives: 5,500; adverbs: 1,000).

Table 3. Number of different lemmas in the TOTAL Corpus, arranged by POS and frequency

Frequency   Noun     Verb     Adjective   Adverb
≥ 1         62,497   12,188   4,720       3,138
≤ 3         16,474   3,631    1,398       990
≤ 5         20,457   4,413    1,760       1,180
≥ 10        37,015   6,950    2,689       1,876
≥ 30        26,640   5,340    2,113       1,426
≥ 50        21,869   4,604    1,860       1,225

One might think that the TOTAL Corpus is likely to supply us with sufficient information regarding nouns, verbs and adverbs (though not adjectives), since the number of different lemmas with a frequency of 50 or more for each of these POS is greater than the corresponding figure that Hwi-Young Lee (1974) obtained. Notice, however, that each individual has a unique linguistic competence, unique living experiences and, consequently, a unique vocabulary; all these features differ among individuals' speech to varying degrees. So, how many vocabulary items do speakers of the same language share? In other words, what can we consider to be the common core vocabulary of a language, in contrast to more idiolectal vocabulary items? This problem is inevitably related to corpus representativeness. In this sense, Hyeok-Cheol Kwen (1997) compared high frequency words (tokens) occurring in the YSC Corpus (totaling 40 million tokens) and the KAIST Corpus (totaling 40 million tokens, compiled by the Korea Advanced Institute of Science and Technology). He reported that the two corpora share between 78% and 85% of their words (among the 200,000 most frequent items the intersection is 85%, among the 300,000 most common items it reaches 83%, it is 78% among the 500,000 most common tokens, and 84% at 40 million tokens). From this observation, we conclude that individuals speaking the same language are likely to share roughly 80% of their vocabulary, though this sharing rate should be either tightened up or defended at greater length. This shared vocabulary (80% in this case) is very easy to find because it is mostly used across the same language-speaking community. By contrast, the remaining 20% amounts to the less frequent vocabulary, or the unique vocabulary of certain linguistic domains or individuals, and hence needs to be gathered from elsewhere. This means that we still need to devise a tactic to complement the remaining 20%. Intuitively, we can conclude that the TOTAL Corpus cannot supply us with sufficient information, at least for our research purposes.

3.3. Reusability and Data Sparseness

Figure 4 shows reusability and data sparseness by POS in the TOTAL Corpus. Reusability is calculated as Count(freq ≥ 10, 30, or 50, respectively) / Count(freq ≥ 1). Similarly, data sparseness is calculated as Count(freq ≤ 3 or ≤ 5, respectively) / Count(freq ≥ 1). In terms of reusability, about 58% of lemmas for freq ≥ 10, 44% for freq ≥ 30 and 38% for freq ≥ 50 are reused, irrespective of their POS. In terms of data sparseness, about 29% for freq ≤ 3 and 36% for freq ≤ 5 are sparse, irrespective of their POS, even though nouns are a little sparser than the other POS.
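As a quick check, these percentages can be recomputed directly from the noun column of Table 3; the snippet below is only an illustration (the variable names are ours), not part of the original experiment.

```python
# Noun counts from Table 3 of the paper.
noun = {"ge1": 62_497, "le3": 16_474, "le5": 20_457,
        "ge10": 37_015, "ge30": 26_640, "ge50": 21_869}

reusability_10 = noun["ge10"] / noun["ge1"]   # ~0.59  (about 58% in the paper)
sparseness_5   = noun["le5"] / noun["ge1"]    # ~0.33  (about 36% averaged over all POS)
print(f"{reusability_10:.1%}  {sparseness_5:.1%}")
```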
[Figure 4. Reusability and data sparseness by POS (Noun, Verb, Adjective, Adverb). Series: Freq. ≤ 3, Freq. ≤ 5, Freq. ≥ 10, Freq. ≥ 30, Freq. ≥ 50; y-axis: percent.]

These observations reveal that there is little difference among POS in reusability and data sparseness, at least as far as the TOTAL Corpus (and probably the KAIST Corpus too) is concerned. However, further consideration is needed for nouns, as unknown words have been excluded from this experiment. Assume that we can acquire the meaning of a lemma automatically when the lemma occurs with freq ≥ 10. Then we can define only about 58% of the total lemmas. If we consider the lemmas of freq ≤ 5 sparse, we should apply techniques for treating data sparseness (e.g., smoothing methods, proximity methods and clustering) to about 36% of the total lemmas. However, there is no reason why we should not try to compile a very large corpus in order to reduce that 36% significantly. This is precisely what has motivated our study.

[Figure 5. The number of different nouns having at least the specified frequencies (Actual vs. Estimated). Panels: (1) frequency ≥ 10; (2) frequency ≥ 30; (3) frequency ≥ 50. x-axis: corpus size (unit: 100,000 tokens); y-axis: lemmas.]

4. Results of the Experiment

4.1. Prediction of Corpus Sizes

The desired frequency might vary depending on the aim of the investigation and/or the task one wants to carry out. Figure 5 shows the number of newly occurring lemmas that appear 10, 30, and 50 or more times in the TOTAL Corpus. The number of different lemmas with a frequency of at least 10 is about 37,000 in the corpus; with a frequency of at least 30 it is about 26,500, and with a frequency of at least 50 about 22,000. Unfortunately, there is no research on desirable frequencies for automatic lexical acquisition that could help us in this sense. In this study, merely to minimize the error due to our relatively small-scale experimental data, we stick to a minimal frequency of 10 for predicting the corpus size.

[Figure 6. Predicting curves by POS (frequency ≥ 10). Panels: (1) Nouns, y = 100.22257 × x^0.333427; (2) Verbs, y = 305.96220 × x^0.176116; (3) Adjectives, y = 55.06041 × x^0.221324; (4) Adverbs, y = 153.32264 × x^0.138917. x-axis: tokens; y-axis: lemmas.]

Using the piecewise curve-fitting algorithm suggested by Dan-Hee Yang (2000) and considering a word frequency of 10, we can obtain the predicting functions of Table 4 by POS and hence the predicting curves of Figure 6.
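For readers who wish to reproduce the general shape of these predicting functions, a plain power-law fit in log-log space can serve as a simplified stand-in for the piecewise curve-fitting algorithm of Yang (2000), which we do not reproduce here. The sketch below assumes hypothetical input arrays of observed corpus sizes and lemma counts.

```python
import numpy as np

def fit_power_law(tokens, lemmas):
    """Least-squares fit of y = a * x**b in log-log space: the slope of the
    regression line is the exponent b and the intercept is log10(a)."""
    x = np.log10(np.asarray(tokens, dtype=float))
    y = np.log10(np.asarray(lemmas, dtype=float))
    b, log_a = np.polyfit(x, y, 1)
    return 10 ** log_a, b

# a, b = fit_power_law(corpus_sizes, noun_counts)
# For nouns the paper reports a = 100.22257 and b = 0.333427 (Table 4).
```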
Figure 6 clearly shows that as the corpus size grows, the rate of increase in the number of different lemmas gets significantly smaller. However, we cannot ignore the possibility that the rate at which new lemmas occur might fall abruptly somewhere in an interval beyond the graphs of Figure 5; this is an intrinsic limitation of inductive reasoning.

Table 4. Predicting functions and transformed functions by POS (y is the required number of items, x is the predicted corpus size; frequency ≥ 10)

POS          Predicting Function            Transformed Function
Nouns        y = 100.22257 × x^0.333427     x = 10^((log y - log 100.22257) / 0.333427)
Verbs        y = 305.96220 × x^0.176116     x = 10^((log y - log 305.96220) / 0.176116)
Adjectives   y = 55.06041 × x^0.221324      x = 10^((log y - log 55.06041) / 0.221324)
Adverbs      y = 153.32264 × x^0.138917     x = 10^((log y - log 153.32264) / 0.138917)

To get each transformed function in Table 4, we take logarithms of both sides of the predicting function. For nouns, for example,

  y = 100.22257 × x^0.333427
  log y = log 100.22257 + 0.333427 log x

so that the predicted corpus size x results from

  0.333427 log x = log y - log 100.22257
  log x = (log y - log 100.22257) / 0.333427

giving the transformed function

  x = 10^((log y - log 100.22257) / 0.333427).

The other transformed functions of Table 4 are derived in the same way. We can use these functions to obtain the corpus size x corresponding to a required number of lemmas y. To obtain 100,000 different nouns, the transformed function x = 10^((log y - log 100.22257) / 0.333427) reveals that a corpus of 987 million tokens would be required (by solving x = 10^((log 100,000 - log 100.22257) / 0.333427)), and about 7.9 billion tokens to get 200,000 different nouns. To get 20,000 different verbs, x = 10^((log y - log 305.96220) / 0.176116) reveals that roughly 20.3 billion tokens would be needed, and about 1.04 trillion tokens for 40,000 different verbs.

The predicted corpus sizes by POS in Table 5 were calculated by applying the predicting functions of Table 4 and by considering the number of dictionary entries for each POS in Table 2. All in all, most of the predicted results clearly exceed the size of most presently available corpora and current corpus compilation expectations.

Table 5. Predicted corpus sizes by POS according to the transformed functions in Table 4 (frequency ≥ 10). For each POS, "Diff. words" is the predicted number of different words and "Increase" the gain over the preceding corpus size.

                  Nouns                  Verbs                 Adjectives            Adverbs
Corpus size       Diff. words  Increase  Diff. words  Increase  Diff. words  Increase  Diff. words  Increase
10 million        21,625       -         5,230        -         1,950        -         1,439        -
20 million        27,247       5,623     5,909        679       2,274        323       1,584        145
30 million        31,192       3,944     6,346        437       2,487        213       1,676        92
100 million       46,600       15,408    7,845        1,499     3,247        759       1,981        305
200 million       58,716       12,116    8,864        1,019     3,785        538       2,182        200
300 million       67,215       8,500     9,520        656       4,140        355       2,308        126
1 billion         100,417      33,202    11,768       2,249     5,404        1,264     2,728        420
2 billion         126,526      26,109    13,296       1,528     6,301        896       3,004        276
3 billion         144,842      18,316    14,280       984       6,892        592       3,178        174
10 billion        216,389      71,547    17,653       3,373     8,997        2,104     3,756        579
20 billion        272,651      56,262    19,945       2,292     10,488       1,492     4,136        380
30 billion                               21,422       1,476     11,473       985       4,376        240
100 billion                              26,482       5,060     14,976       3,503     5,172        797
200 billion                              29,920       3,438                            5,695        523
300 billion                              32,135       2,215                            6,025        330
1 trillion                               39,725       7,590                            7,122        1,097
2 trillion                               44,882       5,158                            7,842        720
3 trillion                               48,205       3,322                            8,296        454
10 trillion                                                                            9,807        1,510
20 trillion                                                                            10,798       991
30 trillion                                                                            11,424       626
100 trillion                                                                           13,503       2,080
200 trillion                                                                           14,868       1,365
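The worked examples of Section 4.1 can be checked by inverting the predicting function y = a × x^b, i.e. x = 10^((log y - log a) / b). The following snippet does this with the Table 4 parameters; it is a verification aid, not the authors' implementation.

```python
import math

# Predicting-function parameters (a, b) from Table 4, frequency >= 10.
PARAMS = {"noun": (100.22257, 0.333427), "verb": (305.96220, 0.176116),
          "adjective": (55.06041, 0.221324), "adverb": (153.32264, 0.138917)}

def required_corpus_size(target_lemmas: float, pos: str) -> float:
    """Corpus size (tokens) needed to reach `target_lemmas` different lemmas of `pos`."""
    a, b = PARAMS[pos]
    return 10 ** ((math.log10(target_lemmas) - math.log10(a)) / b)

print(f"{required_corpus_size(100_000, 'noun'):.3g}")  # ~9.87e+08 tokens (987 million)
print(f"{required_corpus_size(20_000, 'verb'):.3g}")   # ~2.03e+10 tokens (20.3 billion)
```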
4.2. Explanation of the Extreme Size

The astronomical sizes in Table 5 call for some brief considerations: (1) it is likely that many dictionary entries are not used in our everyday language, and (2) there are significant inconsistencies and discrepancies between the criteria for selecting dictionary entries and the rules on which morphological analysis is carried out.

Most comprehensive paper and compact disk (CD) dictionaries normally include entries without considering frequency, except for some recent corpus-based dictionaries (see the Collins COBUILD dictionary, CUMBRE Gran Diccionario de la Lengua Española). For example, regarding nouns, many reference dictionaries include a great number of proper nouns, compound nouns, derivative nouns, technical terms, and so on. However, many of these terms are little used in real life. Similarly, many adverbs and adjectives are most likely to be used only in literary works. It follows intuitively that we cannot find such terms occurring more often than the given frequency unless we have at our disposal a corpus big enough to include virtually all possible text types.

Regarding verbs, passivization and causativization can be realized morphologically in Korean. Korean dictionaries normally include these variants or derivations as entries whenever they are frequently used in everyday conversation. However, most morphological analyzers, including the NM-KTS, return the base form and suffix. For example, the NM-KTS analyzes the term 공부하다 kongpu-hata 'make a study' as 공부 kongpu 'study' (noun) and -하다 hata 'make' (verbal suffix), whereas our reference dictionary includes the term as a full verb (lemma). Accordingly, the analyzer underestimates the number of verbs and overestimates the number of nouns. We seriously doubt whether Korean actually has around 55,677 verbs.

As previously mentioned in Section 2.4, the NM-KTS analyzer is highly reliable regarding adverbs. Nevertheless, since many adverbs are not likely to be used in everyday language, we cannot take the adverb predictions totally for granted (see Table 5).

4.3. Increasing Effects

Figure 7 displays the effects of increasing the corpus size in relation to Table 5. The first increasing step, to 10 million tokens, produces the maximal effect. Obviously, the next increasing step of 10 million tokens is less effective: 26.0% (5,623 / 21,625) for nouns, 13.0% (679 / 5,230) for verbs, 16.6% (323 / 1,950) for adjectives and 10.1% (145 / 1,439) for adverbs. Successive 10 million token increases are even less effective. These graphs can give us some clues about the economic efficiency of trying to build an enormous corpus (e.g., 3 trillion tokens); for instance, a 2 trillion token corpus contains 7,842 adverbs occurring 10 or more times, yet adding a further trillion tokens to the corpus gives just 454 new adverbs, according to Table 5.
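The diminishing returns described above can also be read off the predicting functions directly, as in the small check below (again an illustration using the Table 4 adverb parameters, not the authors' code).

```python
def predicted_lemmas(tokens: float, a: float, b: float) -> float:
    """Predicting function y = a * x**b from Table 4."""
    return a * tokens ** b

a_adv, b_adv = 153.32264, 0.138917          # adverbs, frequency >= 10
gain = predicted_lemmas(3e12, a_adv, b_adv) - predicted_lemmas(2e12, a_adv, b_adv)
print(round(gain))   # ~454 new adverbs for a further trillion tokens (cf. Table 5)
```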
[Figure 7. Effect of increasing the corpus size (frequency ≥ 10). Panels: (1) Nouns; (2) Verbs; (3) Adjectives; (4) Adverbs. x-axis: tokens; y-axis: lemmas.]

5. Conclusion

In this study, an attempt has been made to predict the corpus sizes needed for compiling a comprehensive computational lexicon by automatic lexical acquisition. In addition, any such prediction might also help to estimate elaboration and compilation costs. The primacy of quantity need not be maintained at all costs, regardless of issues and problems such as: (1) the relevance of the new data offered, (2) economy, (3) complexity and (4) the financial support required. Predictive tools like the ones offered here might be very useful in this direction, as they help us to estimate the total compiling cost and to determine the most efficient compiling strategy in accordance with the economy principle.

To achieve this twofold goal, our research has tried to overcome several problems that previous studies have failed to notice. First, we have insisted on finding a specific predicting function rather than a universal one. Second, there is no special variation among corpora for prediction purposes, in the sense that their curves are all monotone, increasing, and convex up. Third, base forms of words, rather than just tokens (word forms), should be considered in order to compile dictionaries for NLP. Fourth, it is worthwhile distinguishing between open class items. Finally, it makes sense to disregard unknown words in the results of morphological analysis.

Traditionally, most corpora have been compiled for lexicography and/or corpus linguistics, where quality (representativeness or balance), rather than quantity, has been regarded as most worth pursuing. In a sense, Bernoulli's theorem has been overwhelmed by sampling theory in statistics. This has obviously resulted in underestimating very important issues such as data sparseness. In order to overcome data sparseness in NLP-oriented corpora for developing practical NLP software, however, we should prefer quantity to quality, so that probabilities obtained from a corpus can be validated according to Bernoulli's theorem.

An apparently serious drawback of our study is the huge predicted sizes, which seem 'out of this world' and unlikely to be compiled, at least by means of the presently existing compiling strategies (oral transcriptions, manual typing, etc.). Instead, we prefer to determine accurately the corpus size needed for each specific piece of research. That is, the number of lemmas that we need will determine the required size of the corpus. In addition, we also need to consider the total cost of compiling such a corpus. Therefore, what constitutes a reasonable corpus size is also closely related to whether we can afford to invest in it. Often, appending huge amounts of new data results in only a few new different lemmas.
In the worst case, if we inevitably need a 200 trillion token corpus, we might consider two possible compiling methods or strategies: (1) collect randomly as many different text types as possible, naturally in electronic form, from the Internet, publishing companies, electronic newspapers, electronic libraries, BBSs, and so on; or (2) consider the Internet itself as a virtual corpus. Consequently, it is necessary to collect on-line texts without any further condition, provided they are correct in word spacing and spelling. We strongly believe that as long as we do not deliberately try to compile an unbalanced corpus, the larger the corpus is, the more balanced and representative it gets. However, there is always the possibility that some linguistic phenomena, structures, or words are not found even in a very large corpus.

In the future, we shall try to resolve some of the limitations of this study: (1) the small size of the experimental corpus (the TOTAL Corpus): of course, the larger the corpus, the more accuracy we get; (2) some inconsistencies found between the morphological analyzer (NM-KTS) and the reference dictionary (Ulimal Keun Dictionary); and (3) the inaccuracy of the morphological analyzer. Regrettably, this creates a deadlock in the sense that an enormous corpus is required to develop a more accurate morphological analyzer in statistical NLP. All in all, we are confident that this study will shed new light on issues such as corpus predictability, corpus-compiling policy, and the reliability of corpus-based NLP.

Acknowledgments

This research was funded by the Ministry of Information and Communication of Korea under contract 98-86. It is also being partly supported by grant No. 2001-2-52200-001-2 from the Basic Research Program of the Korea Science & Engineering Foundation. Thanks go to Dr. David Walton at the University of Murcia for reading over the final draft of this paper and for offering some useful suggestions.

References

Church, Kenneth W. and Robert L. Mercer (1994). Introduction to the Special Issue on Computational Linguistics Using Large Corpora. In Using Large Corpora, edited by Susan Armstrong, 1-24. The MIT Press.

De Haan, Pieter (1992). The Optimum Corpus Sample Size? In Leitner, Gerhard (ed.): New Directions in English Language Corpora: Methodology, Results, Software Development, 3-19. Mouton de Gruyter, New York.

Hays, William (1994). Statistics, 42-47, 94-111. Harcourt Brace College Publishers, Florida.

Heaps, H. S. (1978). Information Retrieval: Computational and Theoretical Aspects, 206-208. Academic Press, New York.

Jeong, Chan-Sup, Sang-Sup Lee, Ki-Sim Nam, et al. (1990). Selection Criteria of Sampling for Frequency Survey in Korean Words. Lexicographic Study, Vol. 3, 7-69. Tap Press, Seoul.

Jeong, Young-Mi (1995). Statistical Characteristics of Korean Vocabulary and Its Application. Lexicographic Study, Vol. 5-6, 134-163. Tap Press, Seoul.

Kwen, Hyeok-Cheol (1997). Performance Improvement of Korean Information Processing Systems by Using a Corpus. In Corpus and Korean Language Information, The 9th Annual Meeting of the Korean Lexicographic Center. Korean Lexicographic Center at Yonsei University, Seoul.

Lauer, Mark (1995a). Conserving Fuel in Statistical Language Learning: Predicting Data Requirements. In The 8th Australian Joint Conference on Artificial Intelligence. Canberra.

Lauer, Mark (1995b). How Much Is Enough?: Data Requirements for Statistical NLP. In The 2nd Conference of the Pacific Association for Computational Linguistics. Brisbane, Australia.
Lee, Hwi-Young (1974). An Introduction to French Linguistics, 47. Jeong-Eum-Sa, Seoul.

Resnik, Philip (1993). Selection and Information: A Class-Based Approach to Lexical Relationships, 6-33. Ph.D. dissertation, Department of Computer and Information Science, University of Pennsylvania.

Sánchez, Aquilino and Pascual Cantos (1997). Predictability of Word Forms (Types) and Lemmas in Linguistic Corpora. A Case Study Based on the Analysis of the CUMBRE Corpus: An 8-Million-Word Corpus of Contemporary Spanish. In International Journal of Corpus Linguistics 2(2), 259-280.

Sánchez, Aquilino and Pascual Cantos (1998). El ritmo incremental de palabras nuevas en los repertorios de textos. Estudio experimental y comparativo basado en dos corpus lingüísticos equivalentes de cuatro millones de palabras, de las lenguas inglesa y española y en cinco autores de ambas lenguas. Atlantis (Revista de la Asociación Española de Estudios Anglo-Norteamericanos), Vol. XIX (1), 205-223, Spain.

Katz, Slava (1996). Distribution of Content Words and Phrases in Text and Language Modelling. In Journal of Natural Language Engineering 2(1), 15-59.

Stewart, Ian and David Tall (1977). The Foundations of Mathematics, 41-61. Oxford University Press.

The Society of Hangul (1992). 우리말 큰사전 Ulimal Keun Dictionary ['Korean Grand Dictionary']. Eomun-gak Press, Seoul.

Weischedel, Ralph et al. (1994). Coping with Ambiguity and Unknown Words through Probabilistic Models. In Using Large Corpora, edited by Susan Armstrong, 323-326. The MIT Press.

Yang, Dan-Hee and Mansuk Song (1998). How Much Training Data Is Required to Remove Data Sparseness in Statistical Language Learning? In Proceedings of the First Workshop on Text, Speech, Dialogue (TSD'98), 141-146, Brno.

Yang, Dan-Hee, Su-Jong Lim and Mansuk Song (1999a). The Estimate of the Corpus Size for Solving Data Sparseness. In Journal of KISS, Vol. 26, No. 4, 568-583.

Yang, Dan-Hee and Mansuk Song (1999b). Representation and Acquisition of the Word Meaning for Picking out Thematic Roles. In International Journal of Computer Processing of Oriental Languages (CPOL), Vol. 12, No. 2, 161-177. The Oriental Languages Computer Society.

Yang, Dan-Hee, Pascual Cantos Gómez and Mansuk Song (2000). An Algorithm for Predicting the Relationship between Lemmas and Corpus Size. In ETRI Journal, Vol. 22, No. 2. Electronics and Telecommunications Research Institute.

Zernik, Uri (1991). Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon, 1-26. Lawrence Erlbaum Associates.