FRANTIŠEK ČERMÁK
Institute of the Czech National Corpus, Charles University Prague

InterCorp: A Contribution to Interlinguistics

Prace filologiczne LXIII, Warszawa 2012, 67-83

1. Corpus: Monolingual or Multilingual.

The opposition in the title above may seem real, but in fact it depends on the point of view taken. However, from both points of view that are historically relevant to our topic it is, in fact, largely useless, since one is a prerequisite for the other. Generally, there are at least three main aspects, relevant to our topic and forming a framework, that should be remembered and seen as interrelated.

To avoid circular thinking about the features of one's own language, one must gain some distance from it and, sooner or later, arrive at some objectivity and perhaps generalizations about it that are possible only through comparison with other languages. A human community has basically never lived in isolation; it has always been surrounded by others, usually speaking a different language. The primitive idea still to be found, for example, in Polish and Czech, hidden etymologically behind the names for our neighbours, namely Niemiec and Němec, which cast the German neighbours as dumb (mute), is no longer upheld. The early recognition and awareness that every community has foreign-language neighbours has necessarily led to (1) comparison, finding differences and sometimes similarities, too. In fact, the very first dictionaries were bilingual, preceding the monolingual ones by far. Eventually, this kind of comparison has gone on to give insights to all sorts of practical and theoretical fields, including typology, universals and general linguistics. Yet all of this comparison has been, until now, very limited, both in data and in the number of languages, and has largely depended on a few linguists who spoke several languages and were thus able to suggest general observations about languages, however problematic these might have been.
Nevertheless, the bulk of knowledge was still limited to bilingual comparison based on selected and hence unsystematic observations. With the advent of (2) corpora, however, the lack of data has suddenly been overcome by such modern resources as the Czech National Corpus or the Polish National Corpus, offering text collections of hundreds of millions or even billions of words in context. It should be stressed unequivocally that any information and subsequent generalisation is to be found in the text first, from where it is deduced and generalized only thanks to the contexts provided. Corpora resemble, to a large extent, the real language world we live in and are thus a record of our language life. Compared to the present situation, the old pre-corpus linguistics never had enough data and, more importantly, never had as many contexts as it has now. Speaking linguistically, contexts, made of combinations, take us from the old item-and-slot or member-and-class approach, i.e. a rather paradigmatic one, to a badly needed syntagmatic one, based on combinations.

Realizing that one can have more than one corpus, and then establishing two, three or more corpora in juxtaposition, has brought us full circle, i.e. to the point from which we started, though (3) parallel corpora, combining both (1) and (2), are, or may be, far more than the sum total of them. The enhanced and practical political emphasis on as many relations as possible, in fact on generalized relations, has been given the name of globalisation, which certainly applies in practical linguistics first but, hopefully, will not end there.

Today, (bilingual) parallel corpora exist for many language pairs and the technology needed to build, process and exploit them is widely explored, though it is far from ideal. Considerable progress has been made since the days when “parallel” meant “bilingual”, when the only substantial sources were restricted to a special kind of English and French, as in the Hansard Corpus, i.e. transcriptions of parliamentary debates, or were available only for a type of language somewhat distant from contemporary or common use, as in the case of the languages of the Bible or of some classical authors. Obviously, before any build-up of a parallel corpus begins it is necessary to take stock of one's needs today and, possibly, tomorrow, since this kind of corpus, too, is built for keeps and it must be evident why it should be built. The kind of contrastive (comparative) language research and the various applications based on such corpora are obvious, since the previous lack of data basically prevented projects of multi-language comparison.

Today, parallel corpora are quite common, existing at least for many language pairs, and their technology is widely explored (cf., for example, Proceedings 2003, Proceedings 2005). However, except for a limited type of parallel corpora (limited by their one-sided kind of data), such as the Canadian Hansard and Europarl ones, most of the attention paid to this idea has so far been restricted mostly to two things. On the one hand, computer scientists seem to compete fiercely in the field of tools (machine translation, cross-lingual information retrieval etc.), including the search for optimal methods of text alignment. When they become convinced that there is no more to be technically achieved here, they drop the subject and their interest in it as well (see a survey in Čermák-Rosen, in print). On the other hand, the term parallel corpora hardly ever means anything more than bilingual parallel corpora. Thus, the whole field seems rather one-sided, lacking interest in real and general use and language exploitation. This exploitation and research should preferably be linguistic, with the broader goal of comparing and researching more languages, an obvious goal in today's multilingual Europe.
It might give substance to and justify not only the old dictum that language is an instrument of transmission of meaning from thought to form but, perhaps, an additional one, namely that language comparison is also a bridge enabling the transfer of meaning between languages.

Of course, parallel corpora need data to be built on. Yet the supply of available data is usually limited, both in size and in type. This limitation, just one of many issues in standard monolingual corpus linguistics, becomes ever more central here. For those languages that are unable to benefit from a pool of literary translations (from or into the language), or from a role in an international context (being, for example, one of the official languages of the EU), this bottleneck may become prohibitive, preventing any further growth of the corpus as there are simply no more data available. Some problems related to the business of a parallel corpus are obvious: sentence segmentation, tokenization, alignment and concordancing. Linguistic annotation, useful as such, requires language-specific tools or tagged data, and that makes the job more difficult. Yet there are also other, more specific and demanding needs to be covered.

In a sense, the emerging field of comparative corpus linguistics may be given a substantial boost if multilingual corpora are built more extensively, with some concern for representativeness, and are researched systematically with the multilingual perspective in mind. The obvious desideratum behind this is to be sure of one's tertium comparationis, against which comparisons are safely made, and of a broader framework, preferably a typological one. So far, neither of these is common practice.

2. Language Contacts and Translations. Czech Language Situation.

The fact that both parallel bilingual and multilingual corpora are based on available translations between languages, and that these grow only gradually, has some reasons as well as consequences.
From a cultural and historical point of view, the sum of available translations from one language into another represents the sum of various strands of interest, whether historically conditioned (such as fashionable novels) or real and useful, that a community has had, sometimes over a well-defined period of time, in another community and its specific texts. This is especially clear when comparing the sum of what has been translated between two small languages where, often, anything useful and interesting eventually came into the translators' focus. Generalizing a bit along these lines, it seems to be true that, for a multitude of languages, the intersection of available texts decreases with the growing number of languages included; hence the number of texts shared by many languages goes down. In this way, cultural, political and other influences can easily be seen if the number, type and spread of translations are examined in their totality, not only for one language and the ethnic group behind it but also for a larger, multilingual community such as Europe.

Though there exist many types of translation from (and to) a large language (source language), in most cases the recipients of translations are small languages, i.e. those that the texts are translated into (target languages). Given the present geo-political situation, most attention is paid, with only a few exceptions, to bilingual parallel corpora made up of pairs of two large languages (such as English and French in the Hansard Corpus), or of language pairs where at least one is a large language, such as English. Due to the widespread knowledge of English and some other languages it is, in a way, the pair of two small languages that must be viewed as wanting in this respect. However, the existing needs point elsewhere, to a large-scale comparison and a more qualified study of (all kinds of) languages and all types of texts.
Hence the data should come from as many languages as reasonably possible. This is also true of the Czech language, a Slavic language spoken by some 10 million people, which is one such small language. As it is typologically inflectional, it has features that are hardly to be found in English, French or German, such as rich inflection (a 7-case system), verb aspect, free word order, rich verb prefixation, rich noun derivation, a lot of particles, etc., though most of these features are familiar to and shared by other Slavic languages as well. Historically, since it is used by people living in the middle of Europe, Czech has always been a crossroads language, influenced by many languages, such as the neighbouring German or Polish for centuries or the non-neighbouring Russian for decades.

The Czech language has traditionally had two kinds of close linguistic contact with its neighbours, one Slavic (Slovak and Polish) and one German (Austrian German and the German of Germany), each of them representing a different type of research challenge. Here, specifically, the blurring of differences between two closely related languages (especially with Slovak) might be worth investigating in a parallel corpus, a really fine-grained one in this case. On the other hand, the long-standing contact with German, with its rich history, might become more interesting if one goes more deeply, beyond mere loan-words, into semantics, calques or influences on the grammar system. All of these factors have left their mark on a language that might be worth researching, in general as well as from the typological point of view, and not only from the point of view of Czech native users but also from that of users elsewhere, from the outside. Hence the idea of a large multilingual corpus having Czech at its hub and, accordingly, the idea of InterCorp.

3. InterCorp Project.
Unlike most other projects, the InterCorp project (see also more in Čermák-Rosen ?) is an open one striving for perpetual growth wherever possible, i.e. in so far as texts and financing are available, the major philosophy behind this being the same as with a large monolingual corpus: the more data the better. Obviously, since the Czech texts have been available before, it is basically only non-Czech texts that have to be found or scanned and given a suitable form, though nowadays many “bilingual” texts are added that have to be scanned first for InterCorp. The list and number of languages included so far are open to further inclusions, the constraints being only pragmatic, namely the availability of texts; many texts are still waiting for inclusion. Obviously, each of the language pairs is different, both in size and contents, and the original assumption that there might be texts common to most if not all languages has not proved true so far, as there are not so many texts shared by the bulk of languages or because these have not been acquired yet.

Having this kind of broad goal, the policy behind the InterCorp project efforts is rather straightforward and modest:

(1) Only contemporary texts, i.e. those dating no further back than 1945, are used (including older texts if reprinted after this date). This time line is set deliberately: except for classical literature, most of the actual readership, and hence the language use, starts about here. In other words, this is not only a practical solution but also a way to ensure the diachrony-synchrony distinction. Unfortunately, for some corpus architects this basic linguistic dividing line does not seem to be of importance. An open problem, difficult to solve in general and depending on the particular situation in some languages, is to be seen in those cases where the source language text may be older while its translation originated after 1945.
Here, a pragmatic solution respecting the text quality and usefulness may be best.

(2) Although an obvious desideratum, it is almost impossible to achieve any kind of balance between the number of titles translated into Czech and from Czech; hence the idea has not been made a criterion (so far), though later, with more texts on both sides, it will become important.

(3) The lack of texts shared by more languages being obvious and natural, it has been decided that some texts whose original language is neither Czech nor the other language in the pair are also admitted, namely those that are more widely translated. Thus, for example, 6 out of 15 titles in the Czech-Serbian subcorpus currently available on-line are translations from a third language, mostly English, but also Italian, Polish, Portuguese and Russian. The general policy is to have titles with a wide array of translations into other languages and to be able to ensure as broad a link between more than two languages as possible. Thus titles translated into more languages are preferred. A list of titles with a high translation rate serves as a suggestion for participants in charge of the individual languages. This fact, i.e. having non-original titles on both sides, has to be taken into account in some kinds of analysis, while for other purposes it may not be really important. Techniques for evaluating the relevance of this kind of indirect translation from a third language, in comparison to direct equivalents, have yet to be found.

(4) InterCorp strives to be linguistically general so that it might be used for many different purposes: linguistic, non-linguistic, academic, teaching, etc. Hence, it is desirable to capture types of language and vocabulary that are as diverse as possible. On the other hand, a balanced parallel corpus is much harder to build than a monolingual one. The reasons for this are at least four.
(a) Some text types and most speech types are hardly ever translated, including spoken language or some types of newspaper language, which is so important in monolingual corpora. This is why a pragmatic solution has been adopted, centering on what is really available: hence InterCorp consists entirely of written texts, mostly fiction.

(b) As for non-fiction and its most prevalent genre, the language of the press, we have already tapped one fairly reliable multilingual resource (Project Syndicate, an international association of newspapers publishing commentaries and analyses by foremost opinion leaders of today), and there is another promising candidate, Presseurope (http://www.presseurop.eu), a portal monitoring European dailies, presently translated into 10 languages.

(c) Efforts are made to include more types of text, especially the more specific language of EU parliamentary debates (Europarl http://www.europarl.europa.eu/), legal documents (EUR-Lex http://eur-lex.europa.eu, JRC-ACQUIS Multilingual Parallel Corpus http://wt.jrc.it/lt/Acquis/), or various open-source technical and software manuals (as in OPUS, Open Source Parallel Corpus http://urd.let.rug.nl/tiedeman/OPUS/), etc. The choice of these texts is largely pragmatic, depending on their (A) existence, (B) availability and (C) the legal issues regulating their access. In any case, corpus users will always be free to select a set of texts to be searched and exploited according to their needs and preferences and, if necessary, a kind of restricted access will have to be introduced. Due to this pragmatic feature of the corpus build-up, it is difficult to plan the final shape of the corpus to any high degree as it is constantly changing.

(d) In any case, the predominance of general, non-specific language is seen as a priority, aiming primarily at coverage of the basic vocabulary, which is considered more important than specific types of language, if available.
Such theoretical and practical reasons are to be found behind the idea of a large multilingual corpus with Czech at the centre. InterCorp (http://korpus.cz/intercorp) is currently a part of the Czech National Corpus project (CNC, http://korpus.cz). The idea at the heart of InterCorp is linguistically trivial, yet not very often voiced: having one's own language amply covered by monolingual corpora may not be enough; the language must also be studied from the outside. The project seems to be unique in its scope and in its choice of texts (so far, it consists mostly of fiction), but also in its substantial share of manual work (resulting in a higher quality of alignment and sentence boundary recognition, and in fewer typos). The project participants, invited in 2005 to join the team headed by the Institute of the Czech National Corpus (Ústav Českého národního korpusu), come from most linguistic departments of the Faculty of Arts at Charles University in Prague and a few other academic institutions, with a number of student helpers. The current number of “active” languages is 25 (plus Czech), with Czech always being one of the two languages in a pair. For the time being, 21 languages are available for on-line searches using a parallel concordancer at http://korpus.cz/Park (free to use after registration as a CNC user at http://korpus.cz/english/dohody.php).

The table below gives figures for the languages available in the present release of the corpus (the Czech figures being high because of repetition in various language pairs). The “title” here means mostly a novel, as fiction is the predominant genre of InterCorp. However, some languages have the advantage of a more balanced choice of texts; the letter S in the column showing the number of titles in the following table indicates that the subcorpus for the given language includes a selection of political commentaries published by the Project Syndicate (http://www.project-syndicate.org/).
The currently available Czech, English, French, German, Russian and Spanish issues, dated 2000-2008, will be followed by more recent texts in future releases, also in Arabic and Chinese. Project Syndicate data are included in the counts; their size is approximately 1.5 to 2 million words for a given language.

Figures for the part of InterCorp available on-line as of March 2010 (word token figures are in thousands):

Language (L2)    Czech word tokens x 1,000    L2 word tokens x 1,000    Number of titles
Bulgarian                            1,057                     1,049    14
Croatian                             4,363                     4,599    69
Danish                                  80                       102    4
Dutch                                2,448                     2,046    45
English                              4,041                     4,705    S + 34
Finnish                                497                       423    11
French                               2,415                     3,120    S + 21
German                               6,466                     7,480    S + 70
Hungarian                            1,030                       985    15
Italian                              2,254                     2,591    26
Latvian                              1,121                     1,067    23
Lithuanian                             318                       272    7
Polish                               2,450                     2,422    40
Portuguese                           1,261                     1,436    18
Romanian                               461                       564    4
Russian                              2,873                     2,902    S + 23
Serbian                              1,129                     1,209    19
Slovak                                 352                       351    7
Slovene                                813                       901    15
Spanish                              7,210                     8,427    S + 82
Swedish                              1,439                     1,643    25
Total                               44,077                    49,293    572

Obviously, each of the language pairs is different, both in size and content, and the original assumption that there might be more of a non-trivial common core of titles shared by most if not all languages has not turned out to be true so far. Currently the title available on-line in the highest number of languages is Milan Kundera's novel The Unbearable Lightness of Being (Nesnesitelná lehkost bytí, in 9 languages including Czech), and there are 7 more translations of this novel waiting in the pipeline. Another novel by Kundera, The Joke (Žert), totals 18 items, followed by J. K. Rowling's Harry Potter and the Philosopher's Stone with 14 items, and J. R. R. Tolkien's The Lord of the Rings I with 12 items. Again, not all of them are on-line at the moment.

Since InterCorp is being constantly developed and enlarged, these figures from spring 2010 can be, at the moment, updated by the autumn 2010 figures.
Briefly, the total number of words per language (figures in full numbers) is now: bg 995 577, cs 31 748 363, da 189 642, de 7 775 391, en 5 058 517, es 9 268 754, fi 869 433, fr 3 140 914, hr 5 465 110, hu 703 826, it 2 590 835, lt 200 953, lv 1 069 232, nl 3 418 173, no 1 561 786, pl 3 635 273, pt 1 259 741, ro 564 467, ru 2 841 360, sk 388 773, sl 901 062, sr 1 387 242, sv 1 786 575, sy 132 029, the sum total being 86 953 028 words altogether (sy standing for Serbian texts in Cyrillic).

4. Technical Aspects of InterCorp and Search Possibilities Available and Illustrated.

The coordination of a project involving such a number of participants is a challenge that is gradually being solved pragmatically. Since the majority of partners and users are not computational linguists, this has to be taken into consideration. Therefore, it was necessary to use the participants' expertise but also different technologies. The real start of the project came when a predominantly single-user parallel corpus builder and concordancer, ParaConc http://www.athel.com/para.html (Barlow 2002), was brought in. At the same time, common procedures, tools and text formats proved necessary for the results to be integrated into a single corpus (for more technical detail see Vavřín & Rosen 2008 and Čermák-Rosen in print).

On a practical level, each language has a coordinator in charge of text acquisition, conversion into a standard electronic format (in case it is needed), text cleanup and proofreading, most of the work being done by students for a modest fee. Then the text is uploaded into the project database for formatting checks and automatic detection of sentence boundaries. For Czech, a rule-based splitter is used; for other languages a stochastic method is used (Punkt from http://nltk.org/).
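To make the idea of rule-based sentence boundary detection more concrete, the following is a minimal sketch only; it is not the splitter actually used in the project, whose rules are not described here. The abbreviation list, the function name split_sentences and the boundary rule (sentence-final punctuation followed by whitespace and an uppercase letter) are all assumptions made for illustration.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (an assumption for this
# sketch); a production splitter would need a far larger, curated list.
ABBREVIATIONS = {"např.", "tj.", "tzv.", "p.", "dr.", "sv."}

# Candidate boundary: ., ! or ? followed by whitespace and an uppercase letter
# (ASCII or Czech).
BOUNDARY = re.compile(r"[.!?]+(?=\s+[A-ZÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ])")


def split_sentences(text: str) -> list[str]:
    """Split text into sentences using two simple rules.

    Rule 1: a boundary candidate is sentence-final punctuation followed by
            whitespace and an uppercase letter.
    Rule 2: the candidate is rejected if the token before it is a known
            abbreviation.
    """
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        candidate = text[start:match.end()]
        last_token = candidate.split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # e.g. "dr. Černý" is not a sentence boundary
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = "Hagrid se k němu naklonil přes stůl. Přišel p. Novák a dr. Černý. Kde je?"
    for sentence in split_sentences(sample):
        print(sentence)
```

A stochastic method such as Punkt learns abbreviations and boundary cues from unannotated text instead of relying on a hand-written list, which is presumably why it suits languages for which no such rules have been compiled.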
In the next step, the aligned texts are cleaned (ParaConc may insert tags in somewhat erratic ways) and transformed into an XML format, including bibliographical data extracted from a database of titles available within the project. This database is also used for tracking the passage of the title through the pre-processing stages. The texts are aligned using a web-based tool, Intertext, which integrates Hunalign, an automatic aligner (Varga et al.), with an alignment editor for proofreading the results. This tool also handles the problem of maintaining a single copy of a (Czech) text, potentially modified in multiple alignment pairs.

Finally, the texts can be morphologically tagged and lemmatized. This option depends on the availability and performance of suitable language-specific tools. At the moment, 11 languages are tagged.

To allow for multiple-language search and use, Park, a server-based parallel concordancer, is being built, using the corpus manager Manatee (by Pavel Rychlý). Starting with the present release of the corpus, the texts are fed into the manager with a stand-off alignment annotation. The search interface is accessible through a web browser. However, only a restricted set of search and display functions is available so far, due to be extended in the future.
The currently available options include:

• Restrictions on the search scope by language and title

• Queries into one or more languages by word form, by a string of word forms (a phrase), by a CQL expression (including regular expressions), for some languages by lemma (base form) and/or morphosyntactic tag; with a virtual keyboard to type in foreign characters; with an option to recall a previous query

• Displaying parallel concordances side by side or in rows; displaying more context; displaying/suppressing structural tags (paragraphs, sentences, segments), bibliographical data and concordance id, lemma and/or morphosyntactic tag for keyword or all displayed words (for some languages); export of concordances as a spreadsheet file.

Below is a screenshot of the corpus search tool after specifying Czech and Russian as the languages to be queried, which may illustrate the existing state of affairs. The list of available titles shrinks, depending on the choice of languages. Next to Czech and Russian, two additional languages were selected, English and Polish in this case, whereupon the list of common texts shrinks even further, to only two novels by Milan Kundera: The Unbearable Lightness of Being and The Joke (not shown).

[Figure: Specifying languages and titles]

A query can be specified for any language or any combination of languages. The next figure shows a CQP query into the Czech part, showing negated forms of the verb věřit ‘believe’. The first three hits are shown in the following figure. The number of tokens in the column headings refers to the total number of word tokens present in the titles selected for the query.

[Figure: Query results in vertical view]

The search results can be displayed with different types of metatextual information and/or in the horizontal view. So far, the option to restrict morphological annotation to the keyword is not available for the other languages because no keyword can be identified when it is not specified in the query.
This will also change with the planned introduction of word-to-word alignment, however problematic this may be, or with the option to select the most likely keyword equivalents from a list of the most frequent content words in the parallel concordances.

[Figure: Query result in horizontal view with tags]

It is important to mention that all InterCorp texts available in Park can also be queried as a set of monolingual corpora through a web-based version of Bonito, the interface used for the monolingual parts of the Czech National Corpus, offering a more extensive choice of features (filters, sorting, collocations, frequency distribution, random sampling etc., see http://korpus.cz/corpora/intercorp/). The present form and content of the corpus data, together with the (pre-)processing and search infrastructure, have not yet reached their final stage. The interface of the corpus, or rather its outer appearance, that of the parallel concordancer Park, is gradually acquiring a richer set of features, to match those available in its monolingual counterparts and to offer features relevant for parallel data. So far, search is possible on 4 levels: lemma (if the text is lemmatized), phrase, word form, or any of the three or a combination of them using CQL, which requires more elaborate knowledge.

Consider a very simple search starting from Czech: a simple lemma query for stůl (table), which is basically monosemous, looking for its counterparts in the English and Italian texts. Perhaps surprisingly, it yields an array of possible equivalents used in the two texts scanned (namely M. Kundera's Nesmrtelnost and J. K. Rowling's Harry Potter and the Philosopher's Stone, i.e. one text originally in Czech and the other in English). A full 100 occurrences of the Czech lemma stůl give 89 (89%) English counterparts of table, 5 of desk, 1 of desktop and 5 cases of no translation equivalent (i.e. 5%).
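The English figures just reported can be tabulated with a few lines of code. The sketch below is only an illustration of how such an equivalent distribution might be summarized once the counterparts have been counted by hand or exported from a concordancer; the names english_equivalents, percentage_shares and distinct_equivalents are hypothetical and not part of any InterCorp tool.

```python
from collections import Counter

# Counts as reported in the text for 100 hits of the Czech lemma "stůl".
english_equivalents = Counter({
    "table": 89,
    "desk": 5,
    "desktop": 1,
    "(no equivalent)": 5,
})


def percentage_shares(counts: Counter) -> dict[str, float]:
    """Return each equivalent's share of all hits, in percent."""
    total = sum(counts.values())
    return {item: round(100 * n / total, 1) for item, n in counts.most_common()}


def distinct_equivalents(counts: Counter, missing_label: str = "(no equivalent)") -> list[str]:
    """List the positive equivalents, leaving out untranslated cases."""
    return sorted(item for item in counts if item != missing_label)


if __name__ == "__main__":
    for item, share in percentage_shares(english_equivalents).items():
        print(f"{item}: {share}%")
    print("positive equivalents:", distinct_equivalents(english_equivalents))
```

The same two functions can be applied unchanged to the counts for any other target language, which makes the distributions directly comparable.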
The Italian results are more varied, with tavola used 83 times (83%), banco three times, cattedra twice, and banchetto, scrivania and scrittoio once each, with no equivalent in 9 cases (9%). At first look, this may seem straightforward enough, but the rather high number of cases with no equivalent, higher in Italian than in English, makes one wonder why this is so. Two examples may suggest different possibilities. The first, Czech-English, case is based on an implication (operations are carried out on a table, hence this does not have to be spelled out); the second, Czech/English-Italian, case is a deliberate omission of the English table, though the immediate context does not warrant it.

CZ při nějaké nevinné operaci zemřela na operačním stole mladá pacientka kvůli nedbale provedenému uspání
ENG a young woman who in the course of a completely minor operation died because of carelessly administered anaesthetic

CZ Hagrid se k němu naklonil přes stůl.
ENG Hagrid leaned across the table.
IT Hagrid si chinò verso di lui.

However, it is probably more important to take into account the diversity of all the equivalent options offered here, namely 3 positive equivalents in English and 6 in Italian, which is fertile soil for the possible improvement of dictionaries, etc. This possibility is evident from the fact that most of the Italian equivalents with a low frequency are rarely found in dictionaries.

5. Research Needs and Possibilities.

It seems that the InterCorp project will be a useful resource, as can be observed in its initial results, some of which have found their way into two volumes of contributions published as a result of a conference in 2009 in Prague (Čermák-Klégr-Corness and Čermák-Kocek). Though more possibilities are offered, research on a multilingual corpus might, primarily, be twofold: (A) applied and (B) theoretical.
The applied research (A) will depend on actual demand and might, traditionally, be related mostly to translation studies and lexicography (Teubert 2001, 2007). Problems of interpretation of the same text in a number of different translations are an interesting possibility. Since every single translation captures only part of the meaning, one may ask, yet again, what is actually, usually or always lost in translation, etc. Multilingual lexicography does not seem popular at the moment (apart from terminology, such as Eurodicautom, renamed IATE, i.e. InterActive Terminology for Europe), but this might change. Let us just mention that it could be useful to have, for example, a dictionary of closely related languages such as Czech, Polish and Slovak, or the Scandinavian or Romance ones, often used only for checking or for avoiding false friends. A definite practical use of multilingual corpora can also be seen in the areas of machine translation, automatic text-mining and word-sense disambiguation.

The latter, (B) theoretical, line of research in advanced multilingual comparison may also open some new vistas, hitherto unexplored because of lack of data. It is in comparative corpus linguistics that a multilingual corpus might prove useful, offering better data to general linguistics, typology, pragmatics and discourse studies at least. A number of general issues may be raised in this new framework. Thus, the old and rather general statements about the relationships of languages in smaller and larger groups would certainly call for a more precise formulation. On the other hand, research into the seemingly endless diversity of non-related languages, covered so far by typology and universals only, would be an open-ended venture where inspiration can be drawn from the data and from the typology of the differences.
While the strong point of any monolingual corpus research has always been its study of authentic texts and real contexts, bilingual and multilingual corpora are different in that translations are not original, authentic texts (and, for that matter, neither are the contexts, which are translated too). Obviously, a methodology for evaluating translated counterparts will have to be found here. It is evident that, moving upwards from lexical items through collocations to sentences and their combinations, the value of each step must inevitably become more problematic and prone to various interpretations. Yet, sticking to meaning, which must be taken as the starting point, it seems that more interesting results must be sought at higher levels rather than at lower ones, such as words. Having a parallel corpus or corpora offering profuse contexts and a variety of equivalents of an item on a scale that can be statistically evaluated means much more than the old-time manual contrastive study based only on odd and unsystematic examples.

6. A Conclusion. With so many open-ended issues and desiderata that could be raised, these must be given due attention in specialized contributions. Instead, it might perhaps be of some interest to recall the Final Panel Discussion of the 2009 InterCorp conference in Prague (Čermák, Klégr, Corness). Five broad topics brought before the participants, eliciting a lively discussion, seem to indicate answers to some of the more pressing problems.

1 The Role of a Third Language in Bilingual Corpora: Extent and Methodology
Views Expressed: A third language is indispensable if the number of translation texts is not very large. Its extent does not seem to be relevant. However, it is necessary to distinguish (as to its relevance) between the original text and the translation; the final results may be checked against a large balanced monolingual corpus.
2 Balancing of Two Languages in a Parallel Corpus
Views Expressed: This is desirable, though pragmatic factors limiting the number of accessible texts may distort the balance.

3 A Joint Text Core for More Languages
Views Expressed: As a possibility it is certainly desirable, though one does not really know in advance how many users might use this feature enabling comparison of more than two languages; however, other factors in favour of this can also be found.

4 Legal Problems Relating to Copyright and Ownership of Texts
Views Expressed: Let’s not worry. No corpus linguist has ever been taken to court for breaching copyright by including a text in a corpus. One should not stop collecting parallel texts because of legal formalities; there is always the option of providing limited access, if necessary by means of a password. The practice of text sampling or of disarranging the sequence of text parts does not seem useful, however.

5 Critical Number of Words or Size of a Parallel Corpus for Practical Purposes, Specifically in Lexicography
Views Expressed: Size depends on the goal; in (bilingual) lexicography aiming at c. 20 thousand lemmas, millions of words are necessary.

6 Miscellaneous Suggestions
A parallel corpus should include more text types (i.e. in addition to fiction and professional texts), if possible. A parallel corpus is useful in language teaching.

References

Barlow M. 1992: Using Concordance Software in Language Teaching and Research, in: Shinjo, W. et al., „Proceedings of the Second International Conference on Foreign Language Education and Technology“, Kasugai, Japan: LLAJ & IALL.
Barlow M. 2000: Parallel texts in linguistic analysis, in M. Barlow and S. Kemmer (eds.) Usage-based models of language, in Botley, S. P., T. McEnery, A. Wilson (eds.), Multilingual Corpora in Teaching and Research, Amsterdam, pp. 106-115.
Barlow M. 2002: ParaConc: Concordance software for multilingual parallel corpora, in: „Language Resources for Translation Work and Research, LREC 2002“, pp. 20–24.
Botley S., A. McEnery & A. Wilson (eds.) 2000: Multilingual Corpora: Teaching and Research. Amsterdam.
Čermák F., A. Klégr & P. Corness (eds.) 2010: InterCorp: Exploring a Multilingual Corpus. Praha.
Čermák F., J. Kocek (eds.) 2010: Mnohojazyčný korpus InterCorp: Možnosti studia. Praha.
Čermák F., A. Rosen (in print): The Case of InterCorp, a multilingual parallel corpus, in: „International Journal of Corpus Linguistics“.
Gage W. W. 1961: Contrastive Studies in Linguistics: A Bibliographical Checklist. Washington, DC.
Hammer J. H., F. A. Rice 1965: A bibliography of contrastive linguistics. Washington, DC.
Johansson S. 2007: Seeing through Multilingual Corpora: On the use of corpora in contrastive studies, in: „Studies in Corpus Linguistics“, Amsterdam.
Melamed I. Dan 2001: Empirical Methods for Exploiting Parallel Texts. Cambridge, MA: MIT Press.
Proceedings of the 2003 Workshop on Building and Using Parallel Texts. Available at: www.llas.ac.uk/resources/goodpractice.aspx?resourceid=1444&PHPSESSID=d9b58ba3f2a870be08f2e417e57d8326 and www.cse.unt.edu/~rada/wpt/
Proceedings of the 2005 Workshop on Building and Using Parallel Texts. Available at: www.aclweb.org/anthology-new/W/W05/W050800.pdf
Resnik Ph. 1999: Mining the web for bilingual text, in: Proc. 37th ACL, 527-534, University of Maryland Press.
Teubert W. 2001: Corpus Linguistics and Lexicography, in: „International Journal of Corpus Linguistics“, 6, Special Issue, pp. 125-153.
Teubert W. (ed.) 2007: Text Corpora and Multilingual Lexicography. Birmingham.
Varga D., L. Németh, P. Halácsy, A. Kornai, V. Trón and V. Nagy 2005: Parallel corpora for medium density languages, in: „Proceedings of the RANLP 2005“, pp. 590-596.
Vavřín M., A. Rosen 2008: Intercorp: A Multilingual Parallel Corpus, in: „Труды Международной конференции Корпусная лингвистика 2008“, Санкт-Петербург, pp. 156-162.
