František Čermák
The Institute of the Czech National Corpus, Charles University, Prague
frantisek.cermak@ff.cuni.cz
KGA 03 (2011), 33–44

Some of Current Problems of Corpus and Computational Linguistics or Fifteen Commandments and General Truths

A lot is going on in the field, and research here seems to be almost without limits. Much of this is due to the vast possibilities offered by computers, corpora, the Web etc.; much is due, however, to a disregard, evident with many, for what real linguistics has to say and offer, whether this may be a good sign of bridging the gap or not. Recent corpus-based grammars and dictionaries may certainly be viewed as a good sign. Likewise, there is no doubt that a number of insights, techniques and useful results have been achieved in the field over the years that are of real interest and inspiration to linguists, such as machine translation, to name just one field (which in itself took decades to reach its present, limited state, with linguists always behind it). Due to its different, more linguistic orientation, it is also evident that corpus linguistics is no longer a mere branch of computational linguistics, although there is a lot of overlap and mutual inspiration. Praise along these lines is not difficult to formulate, but are we really after praise and a feeling of self-satisfaction only? Despite the considerable progress recorded here, some real problems that seem to be of importance persist, and new ones have emerged. In what follows, some of these are summarized and presented in a special and, perhaps, idiosyncratic list of problems which I choose to call commandments because of their urgent nature. In an attempt to sum up, in a terse and perhaps somewhat aphoristic form, some of the most salient aspects of the present state of affairs, based on the sobering and often disconcerting experience of recent years, they are offered here in a nutshell, each followed by comments. Technically, this contribution is a build-up and modification of a similar but much smaller list and survey published five years ago (Čermák 2002).

(1) Garbage in, garbage out.

A Comment: This is an old, general and, hopefully, broadly accepted piece of experience and re-interpreted knowledge, having, in the case of corpora, a number of implications for data and their treatment. One cannot expect reasonable and useful results from bad data fed into a computer, whether because of their skewed nature, because they are not representative of the phenomenon to be described, or simply because they are not sufficient. A special case of the latter is a wanton choice of data, artificially cut off from other data with which they are interconnected. This leads to another point: to make things worse, once a clumsy query or a badly formulated algorithm is used, the results one gets may be even more useless.

(2) The more data the better. But there is never enough data to help solve everything.

A Comment: This points to a need for more and also alternative resources, such as spoken data. However, one has to discern between data proper and their necessary contexts. Some language phenomena, even words, are quite rare and require a lot of recorded occurrences if a decent description is to be achieved, while on the other hand some data are to be found in special types of texts only, including real spoken data. However, once one has, perhaps, enough data, one is faced with a rather different challenge.
You will never get at the meaning hidden behind form directly, its elusiveness often being a source of vexation. To get help, you will have to resort to larger contexts, often extralinguistic in nature, and use your powers of introspection, your knowledge of the rules of ellipsis, etc. The worst case is found in spontaneous spoken data, which are not, as a rule, accompanied by external contexts. Hence, the call for maximum data is also an expression of hope to find help there.

A Corollary: A popular blunder committed over and over again is to believe that once we are dealing with written texts (which are so easy to come by) we are dealing with the whole language. For a number of reasons, historical and informational, the truest form of language is the spoken (spontaneous) language. Here, true linguistics, including corpus linguistics, has only barely started, and the lack of this kind of data is appalling.

(3) The best and really true information comes from direct data.

A Comment: There is always the alternative of pure, unadulterated texts, forms devoid of any annotation, which many people are extremely afraid of, pretending these are not sufficient for their goals while, in fact, they do not know how best to handle them. The other alternative, annotation, now offered as the primary solution, is, to a varying extent, always biased and adulterates both the data input and the results obtained. Hence, annotated texts should always be viewed as an alternative only, although a fast and popular one. Fast may often mean superficial, simplified or distorted, however.

A Corollary: Don't sneer at plain text corpora, since many kinds of information are to be got only from them. Linguistically, tagged texts are always a problem and a great misleader. Much textual information is lost by tagging and lemmatization; moreover, certain semantic, collocational and other aspects, being found with word forms only, do not obtain with their lemmas (a small sketch below illustrates this). Computational linguists seem to be obsessed with ever improving their tagging and lemmatization scores but forget to measure the degree of information loss incurred. Perhaps they do not know how to do it, provided they recognize this as a problem at all. Any markup is additional information only, and there might be as many markups of the same corpus as there are people available and willing to do it. The result, should this be undertaken, would be an array of rather different views while the language behind them remains the same.

Another Corollary: Data are sacrosanct and should not be tampered with, even with the best of intentions. Yet some computer people find odd text mistakes, seeming misspellings etc. so tempting that they are inclined to "improve" the bad and naughty data to fit their nice little programmes better. There is always a reason for this quality of data, most often because we are dealing with natural language, which is imperfect by definition. In fact, we will never know for certain how much tampering the plain data have gone through once they left the author, because of too diligent, idiosyncratic editors, censors, programme failures etc. One of Murphy's laws puts this rather succinctly: In nature, nothing is ever right. Therefore, if everything is going right ... something is wrong. Naive and at the same time dangerous people endorsing this view may look upon the first part of the law as a justification for what they are about to do, while it is the second part that a real linguist should take seriously, both as a principle and a warning.
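To make the corollary about word forms and lemmas concrete, here is a minimal sketch in Python comparing the collocate profile of each word form with the profile of the collapsed lemma. The toy sentence material, the two-entry lemma table and the window size are hypothetical, chosen only for illustration:

    from collections import Counter, defaultdict

    # Toy data standing in for a corpus; the lemma table and window size
    # are illustrative assumptions, not any particular corpus pipeline.
    tokens = ("visible to the naked eye , all eyes were on her , "
              "he kept an eye on them , and their eyes met").split()
    lemma_of = {"eye": "eye", "eyes": "eye"}
    window = 2  # collocates taken from two positions left and right

    by_form = defaultdict(Counter)
    by_lemma = defaultdict(Counter)

    for i, tok in enumerate(tokens):
        if tok not in lemma_of:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i]
        by_form[tok].update(context)             # profile kept per word form
        by_lemma[lemma_of[tok]].update(context)  # profiles merged under the lemma

    for form, collocates in sorted(by_form.items()):
        print(form, collocates.most_common(3))
    print("lemma eye:", by_lemma["eye"].most_common(3))

On real data the effect is the same at scale: naked attaches to the singular form eye, all ... were to the plural eyes, and the merged lemma profile averages the distinction away.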
To widen the gap between plain and annotated data even more, one should remind oneself, as many corpus lexicographers have done for some time now, that much of the sense and meaning of words is associated with specific word forms or their combinations only, and cannot be generalized to lemmas.

(4) Representativeness of a corpus is possible and, in fact, advisable, being always related to a goal.

A Comment: This has largely been misinterpreted to mean that a (general) representative corpus is nonsense. Well, it is not. There are in fact as many kinds of representativeness as there are goals we choose to pursue. What is not usually emphasized is that representativeness vis-à-vis a general multi-purpose monolingual dictionary is a legitimate goal, too; otherwise corpus lexicographers would be out of business. The whole business of lexicography is oriented towards users, offering them a fair coverage of what is to be found in texts and, accordingly (hopefully), in dictionaries. Coverage of a language in its totality requires a lot of thinking and research about what kinds of texts must be used for corpus-based lexicography and in what proportions. But there are other kinds of representativeness, of course, perhaps better known for their more clear-cut goals. In a sense, it is thanks to modern large corpora that one may, though still only to a degree, envisage and perhaps even try to achieve some sort of exhaustive coverage of the language features being studied. Superficially and to a lay linguist, that would seem only natural. Yet many approaches, content with partial, limited and therefore problematic results, unfortunately seem to aim at the very opposite.

(5) There is no all-embracing algorithm that is universal in its field and transferable to all other languages.

To believe this, one has to be particularly narrow-minded, a fact which is, understandably, never admitted. Yet some do act sometimes as if this were true.

A Corollary: Language is both regular and irregular, and there is no comfortable and handy set of algorithms to describe it as a whole. In fact, since language is not directly formalizable as a whole, techniques must be developed to handle the rest that is not. The linguist's task is to delimit the necessary line between the two and decide on appropriate approaches. Blind and patently partial trials on a small scale, meant to prove one's cherished little programmes and algorithms, are mostly useless and a waste of time. Although statistical coverage is often a great help, it is generally also too rough, and statistical results as such cannot stand alone, without further filtering and interpretation.

Another Corollary: Formalization must not be mistaken for statistics. Next to differences between cognate languages, there are major and more serious typological differences. For instance, no one has so far been able to handle the complex agglutinative text constructions of little-known polysynthetic languages, where the boundaries between morphemes, words and sentences collapse (overlapping being too mild a label to use here).

Another Corollary: There is no ontology or, rather, thesaurus design possible that would be freely interchangeable between all sorts of languages.
This rather popular approach of ontologizing as much as possible, often drawing on developments of WordNet, a move both to be applauded and criticized in principle, is both antilinguistic, in that it openly ignores the centuries-old tradition of thesauri, and ill-based, in that it is often a medley of impressions and some knowledge adopted from existing dictionaries or, even, from English. Of what these ontologies are really representative is very much an open question. Admittedly, some might be useful, despite their drawbacks.

(6) Tagging and lemmatization of corpora, useful as they are but with obvious drawbacks, will always lag behind the needs and size of corpora.

A Comment: The steady growth of corpora, and of the requirements brought forth by the constant flow of new forms and words in new texts, will always lead to imperfect performance of lemmatizers and taggers. An obvious problem that will not go away is seen in the ever new variants coming in as alternatives, as well as in foreign imported words and proper names. A fact well known, but hardly ever mentioned, let alone subjected to a serious analysis aiming at a solution, is hapaxes, which do not seem to go away. Since their number is so high (up to 50 per cent of word forms, even in corpora with hundreds of millions of words), it follows that in order to make up for the information loss there, corpora should be much larger. It also means that a, say, 100-million-word corpus represents only a fragment of the reality and, partly, of the language potentiality that is used as a basis for search and research. In a way, hapaxes, following Zipf's law, are inevitable, and their number, usually not published, says a lot about how a language has been treated automatically (a counting sketch is given below, after commandment (7)'s comment). Hence,

A Corollary: It seems that the smaller the number of hapaxes, the more suspicious the treatment. Unlike English, Chinese, Vietnamese etc., most languages, having some kind of inflection and offering a lot of homonymous endings, seem to be in constant change.

Perhaps the largest and, in fact, enormous field, one which has not been covered in full in any language, is multiword units and constructions, including idioms (phrasemes) and terms. No language has covered this to any degree of satisfaction so far. It is obvious that these are systemic units (lexemes) and that it is dangerous to wave them away as mere collocations of specific word forms. The truth is that we often do not know where to draw the line between normal open-class collocations and those that are anomalous. Moreover, some words do not exist in isolation, and one may reasonably view them only as parts of larger units, mostly idioms. This leads to a categorical demand that more types of language forms be recognized in linguistic corpus analysis than just one or two. Taking all of this into account, this appears to be a formidable task to grapple with. Hence, saying that a corpus is tagged and lemmatized is rather immodest.

(7) Don't get fascinated by your own system of coding; it may not be the only one, nor the best one.

A Comment: There are so many authors with such brilliant theories (clashing, however, with each other) that one should be wary of them. Moreover, the theories used in applications do not always come from the best theoreticians available. Rather, they may be concoctions, often recooked with new ingredients, of a number of partial and ad hoc technical solutions standing at some distance from language reality.
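To return briefly to the hapax counts raised under (6): establishing, and publishing, the share of hapax legomena costs next to nothing. Here is a minimal sketch, assuming nothing more than a plain text corpus in one file; the file name and the crude tokenizer are hypothetical placeholders for whatever a given pipeline actually uses:

    from collections import Counter
    import re

    # Read one plain-text corpus file and tokenize it crudely.
    with open("corpus.txt", encoding="utf-8") as f:
        tokens = re.findall(r"\w+", f.read().lower())

    freq = Counter(tokens)
    hapaxes = sum(1 for count in freq.values() if count == 1)

    print(f"tokens: {len(tokens)}, types: {len(freq)}")
    print(f"hapax types: {hapaxes} ({hapaxes / len(freq):.1%} of all types)")

Since, by Zipf's law, the share of hapax types stays stubbornly high as a corpus grows, a suspiciously low published figure suggests that it is the treatment, not the language, that has changed.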
A Corollary: The more interpretation a phenomenon gets in various approaches, the more unstable and suspicious its solution may be. This particularly holds for syntax, but it holds more generally, too.

Another Corollary: Black-and-white, yes-or-no solutions are never foolproof. Instead, more options offered along a scale would often correspond more closely to language facts. It seems that, due to history, linguists were originally forced by the machines into this strange black-and-white binary approach almost everywhere. But isn't it high time to reverse the state of things and force machines to view facts as (at least some) linguists do?

(8) Lemmatizers have invented imaginary new worlds, often creating non-existent monsters or suggesting false ones.

A Comment: It is mostly over-generation lurking behind this problem that presents the serious danger criticized by linguists. This must not be mistaken for language potential and potentiality, which reflect, among other things, the future development of the language in question. Languages never follow forecasts, let alone rule-based ones. Obviously, over-generation must be avoided, as it creates fictitious worlds that are hardly ever part of what is possible and potential. Over-generation, depending on the language in question, is perhaps best known from languages with rich inflection, where the algorithms suggest, for instance, a fictitious lemma for a form, or label a form as part of an inflectional paradigm that does not exist in reality.

A Corollary: On the other hand, language potentiality is not identical with wild, boundless over-generation. The prevailing binary yes-no approach basically prevents any change in this. A headache, if perceived by computational linguists at all, is the frequent and natural variability of forms, closely related to this. Much of this holds for morphological tagging, too.

(9) It is not all research that glitters statistically.

A Comment: Statistics can be a great help as well as a great misleader, e.g. in the business of stochastic tagging. At best, any numeric results are only a kind of indirect and mediated picture of the state of language affairs, often based on too small and haphazard training data, much of which linguists still do not know how to translate into linguistic terms. The tacit assumption that a tagger trained on a million forms will really grapple with a corpus of a hundred million words is false.

A Corollary: There is no such thing as a converse of reductionism, though that in itself might often be a help, perhaps, elsewhere in language. This, in turn, is related to the sample and its type. Linguists should not be intimidated by the formalisms by which a sample is to be arrived at and insisted on, if pure statistics has its say. Statistical formulas are just too general and insensitive to many factors, including the type of text and the frequency and dependence of the phenomenon pursued. This also holds for corpora based entirely on samples: advocating this is often asking for trouble, as some texts make sense as a whole only.

Another Corollary: There is no universal formula to easily calculate your sample. Before accepting one, use your common sense.

(10) Language is both regular and irregular; not everything may be captured by algorithms automatically.

A Comment: Apart from irregularities in grammar, this points, yet again, to the much neglected field of idioms, mostly, and to the grey zones of metaphoricity. The need for stepwise, alternative approaches along a scale is obvious.
Due to theory-dependence and the lack of a general methodology, no one knows where the borderline between regular and irregular, rule and anomaly, etc. really is. Obviously, blind formalistic onslaughts on everything in language are gone, but do all people realize where to stop and weigh the situation? Here, statistics is often too tempting for some to be avoided, as it will always show some results, no matter how superficial and useless. Despite the familiar disputes of grammarians there is, fortunately, a kind of regularity hardly to be disputed, namely phenomena with high frequency, whether grammatical or lexical, that are central to language usage and are often called typical. It is here that a new basis for regularity must be sought. Since corpus results are not black-and-white, or yes-no, in their nature, a new kind of methodology must be developed to grapple with this. It will have to account for clines of many kinds, but also for language potentiality.

(11) The main goal of language is to code and decode meaning. Since meaning is not limited to words only, it is wrong to concentrate on words only.

A Comment: This point, often raised by many and by John Sinclair in particular, refers, among other things, to multi-word lexemes and the problems of compositionality of meaning. As yet, no reliable and general techniques for handling this are available. At one end of the scale, one has to have a sufficient context, repeating itself many times over, to be able to make any generalization. To put this in an old-fashioned way, perhaps, there has to be a sufficient basis for analogy; without that, one cannot do anything. This age-old truth is easily ignored. To be true to this requirement should mean, then, that making a generalization and judgement on one per cent of the data is a problem. Yet, despite the fact that fewer than one per cent of the occurrences of the word accordance in the BNC stand alone, the rest being locked in the multi-word preposition in accordance with, the word is declared to be a noun in dictionaries. In other words, it is given the status of a non-existent entity. Similarly, dictionaries declare the word amok to be an adverb, while it is not used outside the idiom run amok, its sole form of existence, etc. How does one know it is a noun (the first example) or an adverb (the second example), when there is no or almost no analogy? Is this all right, or is there something basically wrong? At the other end of the scale of complexity, one may deduce and understand meaning only from very large chunks of text and a particularly lengthy context. All of this points to a need to search for larger units of meaning in text, whatever these might be, and, subsequently, to try to find ways and techniques for their analysis.

(12) Trust collocations as much as possible, as they amount to immediate context and decipher the use for you.

A Comment: Sufficient context is the alpha and omega of any meaningful research, an item devoid of context being worthless. Naturally, the size and type of context depend on the phenomenon observed and studied. At the lower end, minimum contexts often contain collocations, these being immediate indicators of an item's behaviour.

A Corollary: Using collocations, always keep asking which measures to use for your goal and what criteria to apply. There are a number of statistical formulas that sometimes help you decide what to use, though not always (one of the oldest, mutual information, is sketched below).

Another Corollary: Saying this, do not forget that you are always dealing with collocations of forms, not lemmas.
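One such formula, discussed in the literature cited here (Church and Hanks 1992), is pointwise mutual information. Below is a minimal sketch over adjacent word-form bigrams; the file name and tokenizer are again hypothetical placeholders, and the frequency threshold is a necessary crutch, since mutual information notoriously overrates rare pairs:

    import math
    import re
    from collections import Counter

    # Read one plain-text corpus file and tokenize it crudely.
    with open("corpus.txt", encoding="utf-8") as f:
        tokens = re.findall(r"\w+", f.read().lower())

    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)

    def mi(w1, w2):
        # Pointwise mutual information of the adjacent pair (w1, w2).
        p_pair = bigrams[(w1, w2)] / n
        return math.log2(p_pair / ((unigrams[w1] / n) * (unigrams[w2] / n)))

    # Rank pairs seen at least five times, strongest association first.
    frequent = [pair for pair, c in bigrams.items() if c >= 5]
    for pair in sorted(frequent, key=lambda p: mi(*p), reverse=True)[:10]:
        print(pair, round(mi(*pair), 2))

The counts are deliberately taken over word forms, exactly as the corollary above demands; running the same computation over lemmas would merge, and often mask, form-bound collocations.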
There is no direct way of getting at meaning while avoiding form; moreover, most information can be got from combinations of forms only.

(13) It is so easy to prefer data from the Internet (Web) for your corpus to other types of sources.

A Comment: However, since the Internet is constantly changing, you can never refer back to it, nor does it offer all the sorts of data you may need. Its amoebic, slippery character should make one wary. Hence

A Corollary: Shy away from Internet data if you care about their quality, balance and their being anchored in time; build your reliable corpus on replicable and quotable texts. All real science is based on replicable data. Internet data may be useful as a secondary source, however. The recently popular notion of web mining may give you a lot of insight, but you never know to what extent, and how much is actually left behind. Since, by definition, the Web is not tagged and lemmatized, attempts to find meaning units in clear proportions that may be used as a referential basis and quoted are foolish. The same holds for semantic disambiguation etc.: one cannot disambiguate what one has got from something as elusive as the Web. Any notion of this kind of disambiguation or, for that matter, grammatical disambiguation makes sense only if a lot of prior work along these lines is done and used in conjunction. This holds for many other ventures here, such as the automatic finding or discovery of structures, etc. Unless you have a working theory, a list etc. and can use it as a filter on the results obtained from the Web, you risk drowning in the bulk of those results.

Another Corollary: There is no easy way to comfortably get the results you are after from the Web unless you do a lot of uncomfortable prior work.

(14) There are no aligners that will do the job for you automatically. Much of this has to be done manually anyway.

A Comment: This points to the difficult manual problems of prealignment, which are hardly ever mentioned while over-optimistic plans for larger parallel corpora are made, since much yield is expected from them. At the same time, while technologies and techniques for exploring this type of data are being developed, some of the standard approaches and notions from monolingual corpora, such as word-level alignment and POS tagging, are carried over here without much thought and believed useful. Since languages, even closely related ones, are always different, this seems a particularly useless and misleading thing to do.

A Corollary: There is no one-to-one correlation between words in any two languages observed. Once word-to-word alignment is perhaps achieved, it tells one precious little, except for showing the seeming and not really surprising absence of a word in the other language, or a combination of forms that may or may not be an equivalent of what is sought. A particularly difficult case, and one hard to imagine, is an aligned parallel corpus of spoken language, hardly ever attempted so far; its differences may be too great for one's imagination.

(15) It is high time to ask computational linguists what their theories and programmes cannot really do, how much of the field goes by the board and is never mentioned. Their alleged comprehensive coverage may be deceptive.

A Comment: The experience behind this hardly needs any comment for anyone who has followed closely the development of computational linguistics in recent years.
People working here have invented their own criteria of success (rates) that may not be shared and appreciated by real linguists, and they often delude themselves that a real achievement has been reached.

A Summary

Obviously, much of what I had to say here is biased and one-sided, for a purpose. Should at least some of these critical notes find some positive response, then they have not been entirely useless. My conviction is that one has to keep learning from mistakes, especially before embarking blindly on a new venture. Having a computer, a corpus and some mathematical training may not be enough.

This is, basically, a reprint of the paper published in The Third Baltic Conference on Human Language Technologies. Proceedings, October 4-5, 2007, Kaunas, Lithuania, eds. F. Čermák, R. Marcienkevičienè, E. Rimkutè, J. Zabarskaitè, Vytauto Didžijo Universitas, Lietuviu Kalbos Institutas, Kaunas, 61-69.

References

Aston, G., L. Burnard, 1998. The BNC Handbook. Edinburgh: Edinburgh University Press.
Atkins, S., J. Clear, N. Ostler, 1992. Corpus Design Criteria. Literary and Linguistic Computing 7(1): 1-16.
Biber, D., 1993. Representativeness in Corpus Design. Literary and Linguistic Computing 8(4): 243-257.
Biber, D., S. Conrad, R. Reppen, 1998. Corpus Linguistics. Investigating Language Structure and Use. Cambridge: Cambridge University Press.
Čermák, F., 1995. Jazykový korpus: Prostředek a zdroj poznání (Language Corpus: Means and Source of Knowledge). Slovo a slovesnost 56: 119-140.
Čermák, F., 1997. Czech National Corpus: A Case in Many Contexts. International Journal of Corpus Linguistics 2(2): 181-197.
Čermák, F., 1998. Czech National Corpus: Its Character, Goal and Background. In Text, Speech, Dialogue. Proceedings of the First Workshop on Text, Speech, Dialogue (TSD'98), Brno, Czech Republic, September, eds. P. Sojka, V. Matoušek, K. Pala, I. Kopeček. Brno: Masaryk University, 9-14.
Čermák, F., 2000. Linguistics, Corpora and Information. In PALC'99: Practical Applications in Language Corpora, Łódź Studies in Language, eds. B. Lewandowska-Tomaszcyk, P. J. Melia. Frankfurt am Main, Berlin: P. Lang, 193-201.
Čermák, F., 2001. Jazyk a jazykověda. Přehled a slovníky (Language and Linguistics. A Survey and Lists). Praha: Karolinum.
Čermák, F., 2002. Research Methods in Linguistics. Praha: Karolinum.
Čermák, F., 2003. Ontologies in Today's Computational Linguistics. In PALC 2001: Practical Applications in Language Corpora, ed. B. Lewandowska-Tomaszcyk. Frankfurt am Main, Berlin: P. Lang, 43-45.
Čermák, F., 2003. Today's Corpus Linguistics: Some Open Questions. International Journal of Corpus Linguistics 7(2): 265-282.
Čermák, F., V. Petkevič, 2005. Linguistically Motivated Tagging as the Base for a Corpus-Based Grammar. In Corpus Linguistics 2005, Vol. 1, No. 1 (Birmingham, July 14-17), eds. P. Danielsson, M. Wagenmakers. Proceedings from the Corpus Linguistics Conference Series, http://www.corpus.bham.ac.uk/PCLC.
Church, K. W., P. Hanks, 1992. Word Association Norms, Mutual Information, and Lexicography. Computational Linguistics 16: 22-29.
Kilgarriff, A., C. Yallop, 2000. What's in a Thesaurus? In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC), Athens, Vol. III. Athens: ELRA, 1371-1378.
Kruyt, J. G., 1993. Design Criteria for Corpora Construction in the Framework of a European Corpora Network. Final Report. Leiden: Institute for Dutch Lexicology INL.
McEnery, T., A. Wilson, 1996. Corpus Linguistics. Edinburgh: Edinburgh University Press.
Norling-Christensen, O., 1992. Preparing a Text Corpus.
Computational Tools and Methods for Standardizing, Tagging and Structuring Text Data. In Papers in Computational Lexicography COMPLEX '92, ed. R. Kiefer et al. Budapest: Research Institute for Linguistics, Hungarian Academy of Sciences, 251-259.
Rubio, A., et al., 1998. First International Conference on Language Resources and Evaluation, Vols. I-II. Granada: ELRA.
Sinclair, J., 1991. Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Sinclair, J., 1996. The Empty Lexicon. International Journal of Corpus Linguistics 1(1): 99-119.
Sinclair, J., 2004. Trust the Text: Language, Corpus and Discourse. London: Routledge.
Sinclair, J., A. Mauranen, 2006. Linear Unit Grammar. Integrating Speech and Writing. Amsterdam/Philadelphia: J. Benjamins.
Svartvik, J., ed., 1992. Directions in Corpus Linguistics: Proceedings of Nobel Symposium 82. Berlin: Mouton.