Some Current Problems of Corpus and Computational Linguistics, or Fifteen Commandments and General Truths

František Čermák
The Institute of the Czech National Corpus, Charles University, Prague, frantisek.cermak@ff.cuni.cz
KGA 03 (2011), 33–44
A lot is going on in the field, and research here seems to be almost without limits. Much of this is due to the vast possibilities offered by computers, corpora, the Web, etc.; much, however, is due to a disregard, evident in many quarters, for what real linguistics has to say and offer, whether or not this might be a good sign of bridging the gap. Recent corpus-based grammars and dictionaries may certainly be viewed as a good sign.
Likewise, there is no doubt that a number of insights, techniques and useful results have been achieved in the field over the years that are of real interest and inspiration to linguists, machine translation being just one example (a field which in itself took decades to get to its present and limited state, while linguists have always been behind it). Due to its different, more linguistic orientation, it is also evident that corpus linguistics is no longer a mere branch of computational linguistics, although there is a lot of overlap and mutual inspiration. Praise along these lines is not difficult to formulate, but are we really after praise and a feeling of self-satisfaction only?
However, despite the considerable progress recorded here, some real problems that seem to be of importance persist, and new ones have emerged. In what follows, some of these are summarized and presented in a special and, perhaps, idiosyncratic list of problems which I choose to call commandments because of their urgent nature. In an attempt to sum up, in a terse and perhaps somewhat aphoristic form, some of the most salient aspects characteristic of the present state of affairs, based on sobering and often disconcerting experience of recent years, these are offered here in a nutshell, followed by comments. Technically, this contribution builds on and modifies a similar but much smaller list and survey published five years ago (Čermák 2002).
(1) Garbage in, garbage out.
A Comment: This is an old, general and, hopefully, broadly accepted piece of experience and re-interpreted knowledge, having, in the case of corpora, a number of implications for data and their treatment. One cannot expect reasonable and useful results from bad data fed into a computer, whether because of their skewed nature, because they are not representative of the phenomenon to be described, or simply because they are not sufficient. A special case of the latter is a wanton choice of data, artificially cut off from other data with which they are interconnected. This leads to another point: to make things worse, once a clumsy query or a badly formulated algorithm is used, the results one gets may be even more useless.
(2) The more data the better. But there is never enough data to help solve everything.
A Comment: This points to a need for more and also alternative resources, such as spoken data. However, one has to distinguish between data proper and their necessary contexts. Some language phenomena, even words, are quite rare and require a lot of recorded occurrences if a decent description is to be achieved, while on the other hand some data are to be found in special types of texts only, including real spoken data. However, once you have, perhaps, enough data, you are faced with a rather different challenge. You will never get at the meaning hidden behind form directly, its elusiveness often being a source of vexation. To get help, you will have to resort to larger contexts, often extralinguistic in nature, use your power of introspection and knowledge of the rules of ellipsis, etc. The worst case is found in spontaneous spoken data, which, as a rule, come with no external context. Hence, the call for maximum data is also an expression of hope to find help there.
A Corollary: A popular blunder, committed over and over again, is to believe that once we are dealing with written texts (which are so easy to come by) we are dealing with the whole language.
For a number of reasons, historical and informational, the truest form of language is spoken (spontaneous) language. Here, true linguistics, including corpus linguistics, has only just started, and the lack of this kind of data is appalling.
(3) The best and really true information comes from direct data.
A Comment: There is always the alternative of pure, unadulterated texts, forms devoid of any annotation, which many people are extremely afraid of, pretending these are not sufficient for their goals, while, in fact, they do not know how best to handle them. The other alternative, annotation, now offered as the primary solution, is, to a varying extent, always biased and adulterates both the data input and the results obtained. Hence, annotated texts should always be viewed as an alternative only, although a fast and popular one. However, fast may often mean superficial, simplified or distorted.
A Corollary: Don't sneer at plain-text corpora, since many kinds of information are to be got from them only.
Linguistically, tagged texts are always a problem and a great misleader. Much textual information is lost through tagging and lemmatisation; moreover, certain semantic, collocational and other aspects, being found with word forms only, do not obtain with their lemmas. Computational linguists seem to be obsessed with ever improving their tagging and lemmatisation scores but forget to measure the degree of information loss incurred in the process. Perhaps they do not know how to do it, provided they recognize this as a problem at all. Any markup is additional information only, and there might be as many markups of the same corpus as there are people available and willing to do it. The result, should this be undertaken, would be an array of rather different views, while the language behind them remains the same.
Another Corollary: Data are sacrosanct and should not be tampered with, even with the best of intentions.
Yet some computer people find odd text mistakes, seeming misspellings, etc. so tempting that they are inclined to "improve" the bad and naughty data to fit their nice little programmes better. There is always a reason for this quality of data, most often the fact that we are dealing with natural language, which is imperfect by definition. In fact, we will never know for certain how much tampering the plain data have gone through once they left the author, at the hands of over-diligent, idiosyncratic editors, censors, programme failures, etc.
One of Murphy's laws puts this rather succinctly: In nature, nothing is ever right. Therefore, if everything is going right ... something is wrong. Naive and at the same time dangerous people endorsing this view may look upon the first part of the law as a justification for what they are about to do, while it is the second part that a real linguist should take seriously, both as a principle and as a warning.
To widen the gap between plain and annotated data even more, one should remember, as many corpus lexicographers have now been doing for some time, that much of the sense and meaning of words is associated with specific word forms or their combinations only, and cannot be generalized to lemmas.
(4) Representativeness of a corpus is possible and, in fact, advisable, being always related to a goal.
A Comment: This has largely been misinterpreted to mean that a (general) representative corpus is nonsense. Well, it is not. There are in fact as many kinds of representativeness as there are goals we choose to pursue. What is not usually emphasized is that representativeness vis-à-vis a general multi-purpose monolingual dictionary is a legitimate goal, too; otherwise corpus lexicographers would be out of business. The whole business of lexicography is oriented towards users, offering them fair coverage of what is to be found in texts and, accordingly (hopefully), in dictionaries. Coverage of a language in its totality requires a lot of thinking and research into what kinds of texts must be used for corpus-based lexicography, and in what proportions. But there are other kinds of representativeness, of course, perhaps better known for their more clear-cut goals.
In a sense, it is thanks to modern large corpora that one may, though still only to a degree, envisage and perhaps even try to achieve some sort of exhaustive coverage of the language features being studied. Superficially, and to a lay linguist, that would seem only natural. Yet with many approaches, content with partial, limited and therefore problematic results, the very opposite seems, unfortunately, to be the aim.
(5) There is no all-embracing algorithm that is universal in its field and transferable to all other languages.
To believe this, one has to be particularly narrow-minded, a fact which is, understandably, never admitted. Yet some do act at times as if this were true.
A Corollary: Language is both regular and irregular, and there is no comfortable and handy set of algorithms to describe it as a whole.
In fact, since language is not directly formalizable as a whole, techniques must be developed to handle the part that is not. The linguist's task is to draw the necessary line between the two and decide on appropriate approaches. Blind and patently partial small-scale trials designed to prove one's cherished little programmes and algorithms are mostly useless and a waste of time. Although statistical coverage is often a great help, it is generally too rough as well, and statistical results as such cannot stand alone, without further filtering and interpretation.
Another Corollary: Formalisation must not be mistaken for statistics.
Next to differences between cognate languages, there are major and more serious typological differences. For instance, no one has so far been able to handle the complex agglutinative text constructions of little-known polysynthetic languages, where boundaries between morphemes, words and sentences collapse (overlapping being too mild a label to use here).
Another Corollary: There is no ontology or, rather, thesaurus design possible that would be freely
interchangeable between all sorts of languages.
This rather popular approach of ontologising as much as possible, often drawing on developments of WordNet (a move both to be applauded and to be criticised in principle), is both antilinguistic, in that it openly ignores the centuries-old tradition of thesauri, and ill-founded, in that it is often a medley of impressions and some knowledge adopted from existing dictionaries or even from English. What these ontologies are really representative of is very much an open question. Admittedly, some might be useful, despite their drawbacks.
(6) Tagging and lemmatisation of corpora, however useful and despite their obvious drawbacks, will always lag behind the needs and size of corpora.
A Comment: The steady growth of corpora and the requirements brought forth by the constant flow of new forms and words in new texts will always lead to imperfect performance of lemmatizers and taggers. An obvious problem that will not go away lies in the ever new variants coming in as alternatives, as well as in foreign imported words and proper names.
A well-known fact, though hardly ever mentioned, let alone subjected to a serious analysis aiming at a solution, is the persistence of hapaxes. Since their number is so high (up to 50 per cent of word forms, even in corpora with hundreds of millions of words), it follows that in order to make up for the information loss there, corpora should be much larger. It also means that a, say, 100-million-word corpus used as a basis for search and research represents only a fragment of reality and, in part, of language potentiality. In a way, hapaxes, following Zipf's law, are inevitable, and their number, usually not published, says a lot about how the language is treated automatically. Hence,
A Corollary: It seems that the smaller the number of hapaxes, the more suspicious the treatment.
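As an aside, the figure is trivial to check on any plain-text corpus. The following Python sketch counts hapax legomena and their share of the vocabulary; the file name corpus.txt and the crude whitespace tokenization are placeholders, not a real resource.

```python
# A minimal sketch: count hapax legomena (word forms occurring exactly
# once) and their share of the vocabulary. The file name and the
# whitespace tokenization are assumptions for illustration only.
from collections import Counter

def hapax_share(tokens):
    """Return the proportion of vocabulary items that occur only once."""
    freq = Counter(tokens)
    hapaxes = sum(1 for count in freq.values() if count == 1)
    return hapaxes / len(freq)

with open("corpus.txt", encoding="utf-8") as f:
    tokens = f.read().split()

print(f"hapax share of vocabulary: {hapax_share(tokens):.1%}")
# Following Zipf's law, this share stays stubbornly high (often around
# half of all word forms) even in corpora of hundreds of millions of words.
```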
Unlike English, Chinese, Vietnamese, etc., most languages have some kind of inflection, offer a lot of homonymous endings, and seem to be in constant change.
Perhaps the largest, in fact enormous, field, which has not been covered in full in any language, is that of multi-word units and constructions, including idioms (phrasemes) and terms; no language has covered this to any degree of satisfaction so far. It is obvious that these are systemic units (lexemes) and that it is dangerous to wave them away as mere collocations of specific word forms. The truth is that we often do not know where to draw the line between normal open-class collocations and those that are anomalous. Moreover, some words do not exist in isolation, and one may reasonably view them only as parts of larger units, mostly idioms. This leads to a categorical demand that more types of language forms be recognized in linguistic corpus analysis than just one or two.
Taking all of this into account, this appears to be a formidable task to grapple with. Hence, saying that a corpus is tagged and lemmatized is rather immodest.
(7) Don't get fascinated by your own system of coding; it may be neither the only one nor the best one.
A Comment: There are many authors with brilliant theories (clashing, however, with each other) of whom one should be wary. Moreover, the theories used in applications do not always come from the best theoreticians available. Rather, they may be concoctions, often recooked with new ingredients, of a number of partial and ad hoc technical solutions standing at some distance from language reality.
A Corollary: The more interpretation a phenomenon receives across various approaches, the more unstable and suspicious its solution may be. This particularly holds for syntax, but it holds more generally, too.
Another Corollary: Black-and-white and yes-or-no solutions are never foolproof.
Instead, more options offered along a scale would often correspond more closely to language facts. It seems that, for historical reasons, linguists were originally forced by the machines into this strange black-and-white binary approach almost everywhere. But isn't it high time to reverse the state of things and force machines to view facts as (at least some) linguists do?
(8) Lemmatizers have invented imaginary new worlds, often creating non-existent monsters or suggesting false
ones.
A Comment: It is mostly over-generation lurking behind this problem that presents a serious danger criticised by linguists. It must not be mistaken for language potential and potentiality, which reflect, among other things, the future development of the language in question. Languages never follow recasts, let alone rule-based forecasts. Obviously, over-generation must be avoided, as it creates fictitious worlds that are hardly ever part of what is possible and potential. Over-generation is perhaps best known, depending on the language in question, from languages with rich inflection, where the algorithms suggest, for instance, a fictitious lemma for a form, or label a form as part of an inflectional paradigm that does not exist in reality.
A Corollary: On the other hand, language potentiality is not identical with wild, boundless over-generation.
The prevailing binary yes-no approach basically prevents any change in this. A headache, if perceived by computational linguists at all, is the frequent and natural variability of forms, closely related to this. Much of this holds for morphological tagging, too.
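To make the over-generation problem concrete, consider a deliberately naive suffix-stripping lemmatizer, sketched below in Python; the three rules are invented for this illustration and stand in for no real tool. Even such innocuous-looking rules promptly invent non-existent lemmas, the "monsters" referred to above.

```python
# A deliberately naive suffix-stripping "lemmatizer" illustrating
# over-generation. The rules are hypothetical, invented for this sketch.
RULES = [("ies", "y"), ("es", ""), ("s", "")]

def naive_lemma(form):
    """Strip the first matching suffix -- and happily invent monsters."""
    for suffix, replacement in RULES:
        if form.endswith(suffix):
            return form[: -len(suffix)] + replacement
    return form

for form in ["cats", "species", "lens", "analyses"]:
    print(form, "->", naive_lemma(form))
# cats     -> cat     (correct, by luck)
# species  -> specy   (a non-existent lemma)
# lens     -> len     (a non-existent lemma)
# analyses -> analys  (wrong: the lemma is "analysis")
```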
(9) It is not all research that glitters statistically.
A Comment: Statistics can be a great help as well as a great misleader, e.g. in the business of stochastic tagging. At best, any numeric results are only a kind of indirect and mediated picture of the state of language affairs, often based on training data that are too small and haphazard, much of which linguists still do not know how to translate into linguistic terms. The tacit assumption that a tagger trained on a million forms will really grapple with a corpus of a hundred million words is false.
A Corollary: There is no such thing as a converse of reductionism.
That in itself may often be a help, though perhaps elsewhere in language. This, in turn, is related to the sample and its type. Linguists should not be intimidated by the formalisms by which, if pure statistics has its say, a sample is to be arrived at and insisted upon. Statistical formulas are just too general and insensitive to many factors, including the type of text and the frequency and dependence of the phenomenon pursued. This also holds for corpora based entirely on samples: advocating these is often asking for trouble, as some texts make sense only as a whole.
Another Corollary: There is no universal formula to easily calculate your sample. Before accepting one, use your common sense.
(10) Language is both regular and irregular; not everything may be captured by algorithms automatically.
A Comment: Apart from irregularities in grammar, this points, yet again, to the much neglected field of idioms, mostly, and to the grey zones of metaphoricity. The need for stepwise, alternative approaches along a scale is obvious. Due to theory-dependence and a lack of general methodology, no one knows where the borderline between regular and irregular, rule and anomaly, etc. really lies. Obviously, blind formalistic onslaughts on everything in language are gone, but do all people realize where to stop and weigh the situation? Here, statistics is often too tempting for some to be avoided, as it will always show some results, no matter how superficial and useless.
Despite the familiar disputes of grammarians there is, fortunately, a kind of regularity that can hardly be disputed, namely phenomena of high frequency, whether grammatical or lexical, that are central to language usage and are often called typical. It is here that a new basis for regularity must be sought. Since corpus results are not black-and-white, or yes-no, in their nature, a new kind of methodology must be developed to grapple with this. It will have to account for clines of many kinds, but also for language potentiality.
(11) The main goal of language is to code and decode meaning. Since meaning is not limited to words only, it is
wrong to concentrate on words only.
A Comment: This point, raised by many and by John Sinclair in particular, refers, among other things, to multi-word lexemes and the problems of compositionality of meaning. As yet, no reliable and general techniques for handling this are available. At one end of the scale, one has to have a sufficient context, repeated many times over, to be able to make any generalization; to put it in a perhaps old-fashioned way, there has to be a sufficient basis for analogy. Without that, one cannot do anything. This age-old truth is easily ignored. Being true to this requirement should mean, then, that making a generalization and judgment on one per cent of the data is a problem. Yet, despite the fact that the free occurrences of a word such as accordance in the BNC amount to under one per cent, the rest being locked in the multi-word preposition in accordance with, the word is declared to be a noun in dictionaries. In other words, it is given the status of a non-existent entity. Similarly, dictionaries declare the word amok to be an adverb, while it is not used outside the idiom run amok, its sole form of existence, etc. How does one know it is a noun (the first example) or an adverb (the second example), when there is no, or almost no, analogy? Is this all right, or is there something basically wrong?
At the other end of the scale of complexity, one may deduce and understand meaning only from very large chunks of text and a particularly lengthy context. All of this points to a need to search for larger units of meaning in text, whatever these might be, and, subsequently, to try to find ways and techniques for their analysis.
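The accordance observation above is easy to replicate in outline. The Python sketch below computes what share of a form's occurrences is locked inside a given multi-word unit; the file name corpus.txt and the crude regex matching are assumptions, and the BNC figures quoted above were, of course, obtained by other means.

```python
# A sketch of the kind of check behind the "accordance" observation:
# what share of a form's occurrences is locked inside a multi-word
# unit? File name and regex matching are illustrative assumptions.
import re

def locked_share(text, form, multiword):
    """Share of `form` tokens that occur inside `multiword`."""
    total = len(re.findall(rf"\b{form}\b", text, re.IGNORECASE))
    locked = len(re.findall(rf"\b{multiword}\b", text, re.IGNORECASE))
    return locked / total if total else 0.0

with open("corpus.txt", encoding="utf-8") as f:
    text = f.read()

share = locked_share(text, "accordance", "in accordance with")
print(f"'accordance' locked in 'in accordance with': {share:.1%}")
# If the share approaches 100 per cent, treating the bare form as a
# free noun rests on little or no analogy in actual usage.
```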
(12) Trust collocations as much as possible, as they amount to immediate context and decipher the use for you.
A Comment: Sufficient context is the alpha and omega of any meaningful research, an item devoid of context being worthless. Naturally, the size and type of context depend on the phenomenon observed and studied. At the lower end, minimum contexts often contain collocations, these being immediate indicators of an item's behaviour.
A Corollary: When using collocations, always keep asking which measures to use for your goal and what criteria to apply. There are a number of statistical formulas that may sometimes help you decide what to use, though not always.
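One such measure is pointwise mutual information, proposed for collocation work by Church and Hanks (see the references). The sketch below scores adjacent word-form pairs; the file name and whitespace tokenization are assumptions, and note that, in line with the corollary that follows, it deliberately operates on word forms, not lemmas.

```python
# A minimal sketch: pointwise mutual information (PMI) for adjacent
# word-form pairs, in the spirit of Church and Hanks. The corpus file
# and tokenization are assumptions; no lemmatization is applied.
import math
from collections import Counter

def pmi_scores(tokens, min_count=5):
    """Score adjacent bigrams by PMI, skipping rare, unreliable pairs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # PMI is notoriously unstable for rare pairs
        p_pair = count / (n - 1)
        p_w1, p_w2 = unigrams[w1] / n, unigrams[w2] / n
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

tokens = open("corpus.txt", encoding="utf-8").read().split()
for (w1, w2), score in sorted(pmi_scores(tokens).items(),
                              key=lambda kv: -kv[1])[:10]:
    print(f"{w1} {w2}\t{score:.2f}")
```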
Another Corollary: That said, do not forget that you are always dealing with collocations of forms, not lemmas.
There is no direct way of getting at meaning while avoiding form; indeed, most information can be got from combinations of forms only.
(13) It is so easy to prefer data from the Internet (Web) for your corpus to other types of sources.
A Comment: However, since the Internet is constantly changing, you can never refer back to it, nor does it offer all the sorts of data you may need. Its amoebic, slippery character should make one wary. Hence
A Corollary: Shy away from Internet data if you care about quality, balance and anchoring in time; build your reliable corpus on replicable and quotable texts.
All real science is based on replicable data. Internet data may be useful as a secondary source, however. The recently popular notion of web mining may give you a lot of insight, but you never know to what extent, or how much is actually left behind. Since, by definition, the Web is not tagged and lemmatized, attempts to find meaning units in clear proportions that may be used as a referential basis and quoted are foolish. The same holds for semantic disambiguation, etc.: one cannot disambiguate what one has got from something as elusive as the Web. Any notion of this kind of disambiguation or, for that matter, of grammatical disambiguation makes sense only if a lot of prior work along these lines has been done and is used in conjunction. This holds for many other ventures here, such as the automatic finding or discovery of structures, etc. Unless you have a working theory, a list, etc. and can use it as a filter on the results obtained from the Web, you can safely count on getting drowned in their bulk.
Another Corollary: There is no easy way to comfortably get the results you are after from the Web unless you do a lot of uncomfortable prior work.
(14) There are no aligners that will do the job for you automatically. Much of this has to be done manually
anyway.
A Comment: This points to the difficult manual problems of prealignment, which are hardly ever mentioned while over-optimistic plans for larger parallel corpora are being made, a great yield being expected of them. At the same time, while technologies and techniques for exploring this type of data are being developed, some standard approaches and notions from monolingual corpora, such as word-level alignment and POS tagging, are carried over here without much thought and deemed useful. Since languages, even closely related ones, are always different, this seems a particularly useless and misleading thing to do.
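What automatic aligners actually rely on is worth spelling out. The toy Python sketch below pairs sentences purely by character length through dynamic programming, loosely in the spirit of the well-known length-based approach of Gale and Church; all names, the penalty value and the example sentences are illustrative. That such a procedure never looks at meaning at all is precisely why manual prealignment and checking remain indispensable.

```python
# A toy sentence aligner driven purely by character lengths, loosely in
# the spirit of length-based alignment (Gale & Church). Names and the
# skip penalty are illustrative; nothing here ever inspects meaning.

def length_align(src, tgt, skip_penalty=3.0):
    """Pair up two sentence lists by minimising a length-based cost."""
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # 1-1 match: penalise divergent lengths
                c = cost[i][j] + abs(len(src[i]) - len(tgt[j])) / max(
                    len(src[i]), len(tgt[j]), 1)
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], back[i + 1][j + 1] = c, (i, j)
            if i < n and cost[i][j] + skip_penalty < cost[i + 1][j]:
                cost[i + 1][j] = cost[i][j] + skip_penalty  # unmatched source
                back[i + 1][j] = (i, j)
            if j < m and cost[i][j] + skip_penalty < cost[i][j + 1]:
                cost[i][j + 1] = cost[i][j] + skip_penalty  # unmatched target
                back[i][j + 1] = (i, j)
    pairs, i, j = [], n, m  # trace the cheapest path back to (0, 0)
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi] if pi < i else None, tgt[pj] if pj < j else None))
        i, j = pi, pj
    return list(reversed(pairs))

print(length_align(["Krátká věta.", "A pak jedna delší věta."],
                   ["A short sentence.", "And then a longer one."]))
```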
A Corollary: There is no one-to-one correlation between words in any two languages observed. Even once word-to-word alignment is somehow achieved, it tells one precious little, except for showing a seeming, and not really surprising, absence of a word in the other language, or a combination of forms that may or may not be an equivalent of what is sought.
Particularly difficult, and hard to imagine, is an aligned parallel corpus of spoken language, hardly ever attempted so far. The differences involved may be too much for one's imagination.
(15) It is high time to ask computational linguists what their theories and programmes cannot really do, and how much of the field goes by the board and is never mentioned. Their alleged comprehensive coverage may be deceptive.
A Comment: The experience behind this hardly needs any comment for anyone who has closely followed developments of recent years in computational linguistics. People working here have invented their own criteria of success (rates) that may not be shared and appreciated by real linguists, and they often delude themselves that a real achievement has been reached.
A Summary
Obviously, much of what I have had to say here is biased and one-sided, for a purpose. Should at least some of these critical notes find a positive response, they will not have been entirely useless. My conviction is that one has to keep learning from mistakes, especially before embarking blindly on a new venture. Having a computer, a corpus and some mathematical training may not be enough.
This is, basically, a reprint of the paper published in The Third Baltic Conference on Human Language Technologies: Proceedings, October 4-5, 2007, Kaunas, Lithuania, eds. F. Čermák, R. Marcienkevičienè, E. Rimkutè, J. Zabarskaitè, Vytauto Didžijo Universitas, Lietuviu Kalbos Institutas: Kaunas, 61-69.
References
Aston, G., L. Burnard, 1998. The BNC Handbook. Edinburgh University Press: Edinburgh.
Atkins, S., J. Clear, N. Ostler, 1992. Corpus Design Criteria. Literary and Linguistic Computing 7(1): 1-16.
Biber, D., 1993. Representativeness in Corpus Design. Literary and Linguistic Computing 8(4): 243-257.
Biber, D., S. Conrad, R. Reppen, 1998. Corpus Linguistics: Investigating Language Structure and Use. Cambridge University Press: Cambridge.
Čermák, F., 1995. Jazykový korpus: Prostředek a zdroj poznání (Language Corpus: Means and Source of Knowledge). Slovo a slovesnost 56: 119-140.
Čermák, F., 1997. Czech National Corpus: A Case in Many Contexts. International Journal of Corpus Linguistics 2(2): 181-197.
Čermák, F., 1998. Czech National Corpus: Its Character, Goal and Background. In Text, Speech, Dialogue: Proceedings of the First Workshop on Text, Speech, Dialogue (TSD'98), Brno, Czech Republic, September, eds. P. Sojka, V. Matoušek, K. Pala, I. Kopeček, Masaryk University: Brno, 9-14.
Čermák, F., 2000. Linguistics, Corpora and Information. In PALC'99: Practical Applications in Language Corpora, Łódź Studies in Language, eds. B. Lewandowska-Tomaszcyk, P. J. Melia, P. Lang: Frankfurt am Main, Berlin, 193-201.
Čermák, F., 2001. Jazyk a jazykověda: Přehled a slovníky (Language and Linguistics: A Survey and Lists). Karolinum: Praha.
Čermák, F., 2002. Research Methods in Linguistics. Karolinum: Praha.
Čermák, F., 2003. Ontologies in Today's Computational Linguistics. In PALC 2001: Practical Applications in Language Corpora, ed. B. Lewandowska-Tomaszcyk, P. Lang: Frankfurt am Main, Berlin, 43-45.
Čermák, F., 2003. Today's Corpus Linguistics: Some Open Questions. International Journal of Corpus Linguistics 7(2): 265-282.
Čermák, F., V. Petkevič, 2005. Linguistically Motivated Tagging as the Base for a Corpus-Based Grammar. In Corpus Linguistics 2005, Vol. 1, No. 1 (Birmingham, July 14-17), eds. P. Danielsson, M. Wagenmakers, Proceedings from the Corpus Linguistics Conference Series, http://www.corpus.bham.ac.uk/PCLC.
Church, K. W., P. Hanks, 1992. Word Association Norms, Mutual Information, and Lexicography. Computational Linguistics 16: 22-29.
Kilgarriff, A., C. Yallop, 2000. What's in a Thesaurus? In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC), Vol. III, ELRA: Athens, 1371-1378.
Kruyt, J. G., 1993. Design Criteria for Corpora Construction in the Framework of a European Corpora Network: Final Report. Institute for Dutch Lexicology INL: Leiden.
McEnery, T., A. Wilson, 1996. Corpus Linguistics. Edinburgh University Press: Edinburgh.
Norling-Christensen, O., 1992. Preparing a Text Corpus: Computational Tools and Methods for Standardizing, Tagging and Structuring Text Data. In Papers in Computational Lexicography COMPLEX '92, ed. R. Kiefer et al., Research Institute for Linguistics, Hungarian Academy of Sciences: Budapest, 251-259.
Rubio, A. et al., 1998. First International Conference on Language Resources and Evaluation, Vols. I, II. ELRA: Granada.
Sinclair, J., 1991. Corpus, Concordance, Collocation. Oxford University Press: Oxford.
Sinclair, J., 1996. The Empty Lexicon. International Journal of Corpus Linguistics 1: 99-119.
Sinclair, J., 2004. Trust the Text: Language, Corpus and Discourse. Routledge: London.
Sinclair, J., A. Mauranen, 2006. Linear Unit Grammar: Integrating Speech and Writing. J. Benjamins: Amsterdam/Philadelphia.
Svartvik, J., ed., 1992. Directions in Corpus Linguistics: Proceedings of Nobel Symposium 82. Mouton: Berlin.