FRANTIŠEK ČERMÁK
Institute of the Czech National Corpus, Charles
University Prague
InterCorp: A Contribution to
Interlinguistics
Prace filologiczne LXIII, Warszawa 2012, 67-83
1. Corpus: Monolingual or Multilingual.
The opposition in the title of this paper may
seem real, but in fact it depends on the point of view
taken. From both points of view that are historically
relevant for our topic, however, it is of little use,
since one is a prerequisite for the other. In general,
at least three main aspects form the framework of our
topic and should be remembered and seen as
interrelated.
If one is not to fall into circular thinking about
the features of one’s own language, one must gain
some distance from it and, sooner or later, arrive at
a degree of objectivity, and perhaps at
generalizations, that are possible only through
comparison with other languages. Human communities
have practically never lived in isolation: they have
been surrounded by others, usually speaking a
different language. The primitive idea still
traceable in Polish and Czech, hidden etymologically
behind the names for our German neighbours, Niemiec
and Němec, which cast them as dumb (mute), is no
longer upheld. The early awareness that every
community has foreign-language neighbours necessarily
led to (1) comparison, finding differences and
sometimes similarities, too. In fact, the very first
dictionaries were bilingual, preceding the
monolingual ones by a long time.
Over time, this kind of comparison has yielded
insights in all sorts of practical and theoretical
fields, including typology, universals and general
linguistics. Yet all such comparison has, until now,
been very limited, both in data and in the number of
languages, and has largely depended on a few
linguists who spoke several languages and were able
to suggest general observations about them, however
problematic these might have been. The bulk of
knowledge thus remained limited to bilingual
comparison based on selected, and hence unsystematic,
observations.
With the advent of (2) corpora, however, the lack
of data has suddenly been overcome by such modern
resources as the Czech National Corpus or the Polish
National Corpus, offering text collections of
hundreds of millions or even billions of words in
context. It should be stressed unequivocally that any
information, and any subsequent generalisation, is to
be found in the text first, from which it is derived
and generalized only thanks to the contexts provided.
Corpora resemble, to a large extent, the real
language world we live in and are thus a record of
our language life. Compared with the present
situation, the old pre-corpus linguistics never had
enough data or, more importantly, the wealth of
contexts available now. Speaking linguistically,
contexts, made of combinations, take us from the old
item-and-slot or member-and-class approach, i.e. an
essentially paradigmatic one, to a badly needed
syntagmatic approach based on combinations.
The realization that one can have more than one
corpus, and the subsequent juxtaposition of two,
three or more corpora, has brought us full circle,
i.e. back to the point where we started, though (3)
parallel corpora, combining both (1) and (2), are, or
may be, far more than the sum total of the two. The
growing practical and political emphasis on as many
relations as possible, in fact on generalized
relations, has been given the name of globalisation,
which certainly applies in practical linguistics
first but, hopefully, will not end there.
Today, (bilingual) parallel corpora exist for many
language pairs, and the technology needed to build,
process and exploit them is widely explored, though
it is far from ideal. Considerable progress has been
made since the days when “parallel” meant
“bilingual”, when the only substantial sources were
restricted to a special kind of English and French,
as in the Hansard Corpus, i.e. transcriptions of
parliamentary debates, or were available only for a
type of language somewhat distant from contemporary
or common use, as in the case of the languages of the
Bible or of some classical authors. Obviously, before
the build-up of any parallel corpus begins, it is
necessary to take stock of one’s needs today and,
possibly, tomorrow, since this kind of corpus, too,
is built for keeps, and it must be evident why it
should be built.
The kinds of contrastive (comparative) language
research and the various applications based on such
corpora are obvious, since the previous lack of data
basically prevented projects of multi-language
comparison. Today, parallel corpora are quite common,
existing at least for many language pairs, and their
technology is widely explored (cf., for example,
Proceedings 2003, Proceedings 2005). However, apart
from parallel corpora of a limited type (owing to
their one-sided data), such as the Canadian Hansard
and Europarl, most of the attention paid to this idea
has so far been restricted, mostly to two things.
On the one hand, computer scientists seem to
compete fiercely in the field of tools (machine
translation, cross-lingual information retrieval,
etc.), including the search for optimal methods of
text alignment. Once they become convinced that there
is nothing more to be achieved technically, they drop
the subject and lose interest in it (see a survey in
Čermák-Rosen, in print). On the other hand, parallel
corpora hardly ever mean anything more than bilingual
parallel corpora. Thus, the whole field seems rather
one-sided, lacking interest in real and general use
and in the exploitation of language. This
exploitation and research should preferably be
linguistic, with the broader goal of comparing and
researching more languages, an obvious goal in
today’s multilingual Europe. It might give substance
to, and justify, not only the old dictum that
language is an instrument for transmitting meaning
from thought to form but, perhaps, an additional one,
namely that language comparison is also a bridge
enabling the transfer of meaning between languages.
Of course, parallel corpora need data to be built
on. Yet the supply is usually limited, both in size
and in type. This limitation, just one of many issues
in standard monolingual corpus linguistics, becomes
ever more central here. For those languages that
cannot benefit from a pool of literary translations
(from or into the language), or even from a role in
an international context (being, for example, one of
the official languages of the EU), this bottleneck
may become prohibitive, preventing any further growth
of the corpus, as there are simply no more data
available.
Some problems related to building a parallel corpus
are obvious: sentence segmentation, tokenization,
alignment and concordancing. Linguistic annotation,
useful as it is, requires language-specific tools or
tagged data, which makes the job more difficult. Yet
there are also other, more specific and demanding
needs to be covered.
In a sense, the emerging field of comparative corpus
linguistics may be given a substantial boost if
multilingual corpora are built more extensively, with
some concern about representativeness, and are
researched systematically with the multilingual
perspective in mind. The obvious desideratum
behind this is to be sure of one’s tertium
comparationis, against which comparisons are safely
made, and a broader framework, preferably a
typological one. So far, none of this is common
practice.
2. Language Contacts and Translations. Czech
Language Situation.
The fact that both bilingual and multilingual
parallel corpora are based on the available
translations between languages, and that these grow
only gradually, has its reasons as well as
consequences. From a cultural and historical point of
view, the sum of available translations from one
language into another represents the sum of various
strands of interest, whether historically conditioned
(such as fashionable novels) or real and useful, that
one community has had in another community and its
specific texts, sometimes over a well-defined period
of time. This is especially clear when comparing what
has been translated between two small languages
where, often, anything useful and interesting
eventually came into the translators’ focus.
Generalizing a little along these lines, it seems
that for a multitude of languages the intersection of
available texts shrinks as the number of languages
included grows; hence the number of texts shared by
many languages goes down. In this way, cultural,
political and other influences can easily be seen if
the number, type and spread of translations are
examined in their totality, not only for one language
and the ethnic group behind it but also for a larger,
multilingual community such as Europe. Though there
exist many types of translation from (and into) a
large language (the source language), in most cases
the recipients of translations are small languages,
i.e. those into which the texts are translated
(target languages).
Given the present geo-political situation, most
attention is paid, with only a few exceptions, to
bilingual parallel corpora oriented towards pairs
made up of two large languages (such as English and
French in the Hansard Corpus), or towards language
pairs in which at least one is a large language, such
as English. Owing to the widespread knowledge of
English and some other languages, it is, in a way,
pairs of two small languages that must be viewed as
wanting in this respect. However, the existing needs
point elsewhere, to large-scale comparison and a more
qualified study of (all kinds of) languages and all
types of texts. Hence the data should come from as
many languages as is reasonably possible.
This is also true of Czech, a Slavic language
spoken by some 10 million people, which is just such
a small language. Being typologically inflectional,
it has features that are hardly to be found in
English, French or German, such as rich inflection (a
7-case system), verbal aspect, free word order, rich
verb prefixation, rich noun derivation, a large
number of particles, etc., though most of these
features are familiar to, and shared by, other Slavic
languages as well. Historically, being used by people
living in the middle of Europe, Czech has always been
a crossroads language, influenced by many languages,
such as the neighbouring German and Polish for
centuries or the non-neighbouring Russian for
decades.
The Czech language has traditionally had two kinds
of close linguistic contact with its neighbours, one
Slavic (Slovak and Polish) and one German (Austrian
German and the German of Germany), each representing
a different type of research challenge. Here,
specifically, the blurring of differences between two
closely related languages (especially in the case of
Slovak) might be worth investigating in a parallel
corpus, a really fine-grained one in this case. On
the other hand, the long-standing contact with
German, with its rich history, might prove more
interesting if one goes deeper, beyond mere
loan-words, into semantics, calques or influences on
the grammatical system.
All of these factors have left their mark on a
language that may be worth researching, in general as
well as from the typological point of view, and not
only from the point of view of native users of Czech
but also from that of users elsewhere, from the
outside. Hence the idea of a large multilingual
corpus with Czech at its hub and, accordingly, the
idea of InterCorp.
3. InterCorp Project.
Unlike most other projects, the InterCorp
project (see also more in Čermák-Rosen ?) is an
open one, striving for perpetual growth wherever
possible, i.e. in so far as texts and financing are
available. The philosophy behind this is the same as
with a large monolingual corpus: the more data the
better. Obviously, since the Czech texts have been
available before, it is mostly the non-Czech texts
that have to be handled, i.e. found or scanned and
given a suitable form, though nowadays many
“bilingual” texts are added that have to be scanned
first for InterCorp.
The list and number of languages included so far are
open to further additions, the only constraints being
pragmatic, namely the availability of texts; many
texts are still waiting for inclusion. Obviously,
each of the language pairs is different, both in size
and in content, and the original assumption that
there might be texts common to most if not all
languages has not proved true so far, either because
not many texts are shared by the bulk of the
languages or because they have not been acquired yet.
Given this kind of broad goal, the policy behind the
InterCorp project is rather straightforward and
modest:
(1) Only contemporary texts are used, i.e. those
dating no further back than 1945 (including older
texts if reprinted after this date). This time line
is set deliberately: except for classical literature,
most of the actual readership, and hence language
use, starts about here. In other words, this is not
only a practical solution but also a way to ensure
the diachrony-synchrony distinction. Unfortunately,
for some corpus architects this basic linguistic
dividing line does not seem to be of importance. An
open problem, difficult to solve in general and
dependent on the particular situation in some
languages, arises in those cases where the
source-language text is older while its translation
originated after 1945. Here, a pragmatic solution
respecting the quality and usefulness of the text may
be best.
(2) Although it is an obvious desideratum, it is
almost impossible to achieve any kind of balance
between the number of titles translated into Czech
and from Czech; hence the idea has not been made a
criterion (so far), though later, with more texts on
both sides, it will become important.
(3) The lack of texts shared by more languages
being obvious and natural, it has been decided to
admit also some texts whose original language is
neither Czech nor the other language in the pair,
namely those that are more widely translated. Thus,
for example, 6 out of 15 titles in the Czech-Serbian
subcorpus currently available on line are
translations from a third language, mostly English,
but also Italian, Polish, Portuguese and Russian. The
general policy is to have titles with a wide array of
translations into other languages and thus to ensure
as broad a link between more than two languages as
possible. Titles translated into more languages are
therefore preferred. A list of titles with a high
translation rate serves as a suggestion for the
participants in charge of the individual languages.
This fact, i.e. having non-original titles on both
sides, has to be taken into account in some kinds of
analysis, while for other purposes it may not be
really important. Techniques for evaluating the
relevance of such indirect translations from a third
language, in comparison with direct equivalents, have
yet to be found.
(4) InterCorp strives to be linguistically general so
that it might be used for many different purposes:
linguistic, non-linguistic, academic, teaching, etc.
Hence, it is desirable to capture types of language
and vocabulary that are as diverse as possible. On
the other hand, a balanced parallel corpus is much
harder to build than a monolingual one. The reasons
for this are at least four.
(a) Some text types, and most types of speech, are
hardly ever translated; this includes spoken language
and some types of newspaper language, which is so
important in monolingual corpora. This is why a
pragmatic solution has been adopted, centring on what
is really available: hence InterCorp consists
entirely of written texts, mostly fiction.
(b) As for non-fiction and its most prevalent genre,
the language of the press, we have already tapped
one fairly reliable multilingual resource (Project
Syndicate, an international association of
newspapers publishing commentaries and analyses
by foremost opinion leaders of today), and there is
another promising candidate, Presseurop
(http://www.presseurop.eu), a portal monitoring
European dailies, presently translated into 10
languages.
(c) Efforts are made to include more types of text,
especially the more specific language of EU
parliamentary debates (Europarl
http://www.europarl.europa.eu/), legal documents
(EUR-Lex http://eur-lex.europa.eu, JRC-ACQUIS
Multilingual Parallel Corpus
http://wt.jrc.it/lt/Acquis/), or various open-source
technical and software manuals (as in OPUS, Open
Source Parallel Corpus
http://urd.let.rug.nl/tiedeman/OPUS/), etc. The
choice of these texts is largely pragmatic, depending
on (A) their existence, (B) their availability and
(C) the legal issues regulating access to them. In
any case, corpus users will always be free to select
a set of texts to be searched and exploited according
to their needs and preferences and, if necessary, a
kind of restricted access will have to be introduced.
Owing to this pragmatic approach to building the
corpus, it is difficult to plan its final shape in
any detail, as it is constantly changing.
(d) In any case, priority is given to general,
non-specific language, aiming primarily at coverage
of the basic vocabulary, which is considered more
important than specific types of language, even where
these are available.
Such theoretical and practical reasons are to be
found behind the idea of a large multilingual corpus
with Czech at the centre. InterCorp
(http://korpus.cz/intercorp) is currently a part of the
Czech National Corpus project (CNC –
http://korpus.cz). The idea at the heart of InterCorp
is linguistically trivial, yet not very often voiced:
having one’s own language amply covered by
monolingual corpora may not be enough; the language
must also be studied from the outside.
The project seems to be unique in its scope and
choice of texts (so far, it consists mostly of
fiction), but also in its substantial share of manual
work (resulting in higher-quality alignment and
sentence boundary recognition and in fewer typos).
The project participants, invited in 2005 to join the
team headed by the Institute of the Czech National
Corpus (Ústav Českého národního korpusu), come from
most linguistic departments of the Faculty of Arts at
Charles University in Prague and from a few other
academic institutions, with a number of student
helpers. The current number of “active” languages is
25 (plus Czech), with Czech always being one of the
two languages in a pair. For the time being, 21
languages are available for on-line searches using a
parallel concordancer at http://korpus.cz/Park (free
to use after registration as a CNC user at
http://korpus.cz/english/dohody.php).
The table below gives figures for the languages
available in the present release of the corpus (the
Czech figures are high because Czech texts are
repeated in various language pairs). A “title” here
mostly means a novel, as fiction is the predominant
genre of InterCorp. However, some languages have the
advantage of a more balanced choice of texts: the
letter S in the column showing the number of titles
indicates that the subcorpus for the given language
includes a selection of political commentaries
published by Project Syndicate
(http://www.project-syndicate.org/). The currently
available Czech, English, French, German, Russian and
Spanish issues, dated 2000-2008, will be followed by
more recent texts in future releases, also in Arabic
and Chinese. Project Syndicate data are included in
the counts; their size is approximately 1.5-2 million
words for a given language. Figures for the part of
InterCorp available on-line as of March 2010 (figures
are in thousands):
Language (L2)   Czech word tokens   L2 word tokens   Number of titles
                (x 1,000)           (x 1,000)
Bulgarian                   1,057            1,049   14
Croatian                    4,363            4,599   69
Danish                         80              102   4
Dutch                       2,448            2,046   45
English                     4,041            4,705   S + 34
Finnish                       497              423   11
French                      2,415            3,120   S + 21
German                      6,466            7,480   S + 70
Hungarian                   1,030              985   15
Italian                     2,254            2,591   26
Latvian                     1,121            1,067   23
Lithuanian                    318              272   7
Polish                      2,450            2,422   40
Portuguese                  1,261            1,436   18
Romanian                      461              564   4
Russian                     2,873            2,902   S + 23
Serbian                     1,129            1,209   19
Slovak                        352              351   7
Slovene                       813              901   15
Spanish                     7,210            8,427   S + 82
Swedish                     1,439            1,643   25
Total                      44,077           49,293   572
Obviously, each of the language pairs is different,
both in size and in content, and the original
assumption that there might be a non-trivial common
core of titles shared by most if not all languages
has not turned out to be true so far. Currently, the
title available on line in the highest number of
languages is Milan Kundera’s novel The Unbearable
Lightness of Being (Nesnesitelná lehkost bytí, in 9
languages including Czech), and 7 more translations
of this novel are waiting in the pipeline. Another
novel by Kundera, The Joke (Žert), totals 18 items,
followed by J. K. Rowling’s Harry Potter and the
Philosopher’s Stone with 14 items and J. R. R.
Tolkien’s The Lord of the Rings I with 12 items.
Again, not all of them are on line at the moment.
Since InterCorp is constantly being developed and
enlarged, these figures from spring 2010 can now be
updated with the autumn 2010 figures. Briefly, the
total number of words for each language (in full
numbers) is now:
bg 995 577, cs 31 748 363, da 189 642,
de 7 775 391, en 5 058 517, es 9 268 754,
fi 869 433, fr 3 140 914, hr 5 465 110,
hu 703 826, it 2 590 835, lt 200 953,
lv 1 069 232, nl 3 418 173, no 1 561 786,
pl 3 635 273, pt 1 259 741, ro 564 467,
ru 2 841 360, sk 388 773, sl 901 062,
sr 1 387 242, sv 1 786 575, sy 132 029,
the sum total being 86 953 028 words altogether (sy
standing for Serbian texts in Cyrillic).
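
As a simple consistency check, the per-language figures just listed can be summed and compared with the stated total. The short Python sketch below does only that; the dictionary name is illustrative, the two-letter keys are those used in the list, and nothing here is part of the InterCorp tooling.

```python
# Word counts per language code, copied from the autumn 2010 list above.
word_counts = {
    "bg": 995_577, "cs": 31_748_363, "da": 189_642, "de": 7_775_391,
    "en": 5_058_517, "es": 9_268_754, "fi": 869_433, "fr": 3_140_914,
    "hr": 5_465_110, "hu": 703_826, "it": 2_590_835, "lt": 200_953,
    "lv": 1_069_232, "nl": 3_418_173, "no": 1_561_786, "pl": 3_635_273,
    "pt": 1_259_741, "ro": 564_467, "ru": 2_841_360, "sk": 388_773,
    "sl": 901_062, "sr": 1_387_242, "sv": 1_786_575, "sy": 132_029,
}

total = sum(word_counts.values())

# Share of the Czech side in the whole corpus, in per cent.
czech_share = 100 * word_counts["cs"] / total

print(f"{len(word_counts)} codes, {total:,} words, Czech share {czech_share:.1f} %")
```

The sum agrees with the total of 86 953 028 words given in the text.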
4. Technical Aspects of InterCorp and Search
Possibilities Available and Illustrated.
Coordinating a project that involves so many
participants is a challenge that is gradually being
solved pragmatically. The fact that the majority of
partners and users are not computational linguists
has to be taken into consideration. It was therefore
necessary to draw on the participants’ expertise but
also on different technologies. The project really
started when ParaConc, a predominantly single-user
parallel corpus builder and concordancer
(http://www.athel.com/para.html, Barlow 2002), was
brought in. At the same time, common procedures,
tools and text formats proved to be needed for the
results to be integrated into a single corpus (for
more technical detail see Vavřín & Rosen 2008 and
Čermák-Rosen in print).
On a practical level, each language has a coordinator
in charge of text acquisition, conversion into a
standard electronic format (where needed), text
cleanup and proofreading, most of the work being done
by students for a modest fee. The text is then
uploaded into the project database for formatting
checks and automatic detection of sentence
boundaries. For Czech, a rule-based splitter is used;
for other languages, a stochastic method (Punkt from
http://nltk.org/).
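
To make the idea of rule-based sentence splitting concrete, the following minimal Python sketch splits a text at sentence-final punctuation followed by a capital letter, unless the token ending in the punctuation is a known abbreviation. It is an illustration only: the abbreviation list, the function name split_sentences and the single rule are assumptions made here, not the splitter actually used for Czech in the project, and the stochastic Punkt method is not shown.

```python
import re
from typing import List

# Hypothetical abbreviation list; the project's real rules are not described here.
ABBREVIATIONS = {"např.", "tj.", "tzv.", "atd.", "dr.", "str."}

# Sentence-final punctuation followed by whitespace and an uppercase letter.
BOUNDARY = re.compile(r"[.!?]+(?=\s+[A-ZÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ])")

def split_sentences(text: str) -> List[str]:
    """Split text into sentences using one boundary rule and an abbreviation list."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        end = match.end()
        last_token = text[start:end].split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, not a sentence end
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

sample = "Přišel dr. Novák. Hagrid se k němu naklonil přes stůl. Co teď?"
for sentence in split_sentences(sample):
    print(sentence)
```

A single rule of this kind will miss many cases (quotations, ordinal numbers, initials), which is one reason why manual proofreading remains part of the workflow described above.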
In the next step, the aligned texts are cleaned
(ParaConc may insert tags in somewhat erratic
ways) and transformed into an XML format,
including bibliographical data extracted from a
database of titles available within the project. This
database is also used for tracking the passage of the
title through the pre-processing stages.
The texts are aligned using the web-based tool
Intertext, which integrates Hunalign, an automatic
aligner (Varga et al.), with an alignment editor for
proofreading the results. This tool also handles the
problem of maintaining a single copy of a (Czech)
text that is potentially modified in multiple
alignment pairs. Finally, the texts can be
morphologically tagged and lemmatized. This option
depends on the availability and performance of
suitable language-specific tools. At the moment, 11
languages are tagged.
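
The principle of automatic sentence alignment can be illustrated by a small dynamic-programming sketch in Python. It is explicitly not Hunalign: it uses character length only, whereas real aligners also draw on other evidence, and the bead penalties, the cost function and all names below are assumptions chosen for the illustration.

```python
from typing import List, Tuple

# One bead = (indices of source sentences, indices of target sentences).
Bead = Tuple[List[int], List[int]]

# Allowed bead shapes (source count, target count) with illustrative penalties.
BEAD_PENALTY = {(1, 1): 0.0, (2, 1): 1.0, (1, 2): 1.0, (1, 0): 3.0, (0, 1): 3.0}

def length_cost(src_len: int, tgt_len: int) -> float:
    """Relative length mismatch between 0 (equal lengths) and 1."""
    longest = max(src_len, tgt_len)
    return abs(src_len - tgt_len) / longest if longest else 0.0

def align(src: List[str], tgt: List[str]) -> List[Bead]:
    """Return the cheapest sequence of beads covering both sentence lists."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEAD_PENALTY.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in src[i:ni])
                tgt_len = sum(len(t) for t in tgt[j:nj])
                candidate = cost[i][j] + penalty + length_cost(src_len, tgt_len)
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)
    # Follow the back-pointers from the end to recover the beads.
    beads: List[Bead] = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads

czech = ["Byl to dlouhý den.", "Harry usnul."]
english = ["It was a long day and Harry fell asleep."]
print(align(czech, english))
```

In the sample, the two Czech sentences are grouped into one bead with the single English sentence. Output of this kind is what an alignment editor then presents for manual correction.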
To allow for a multiple-language search and use,
Park, a server-based parallel concordancer, is being
built, using the corpus manager Manatee (by Pavel
Rychlý). Starting with the present release of the
corpus, the texts are fed into the manager with a
stand-off alignment annotation. The search interface
is accessible by a web browser. However, only a
restricted set of search and display functions is
available so far, due to be extended in the future.
The currently available options include:
• Restrictions on the search scope by language and
title
• Queries into one or more languages by word form,
by a string of word forms (a phrase), by a CQL
expression (including regular expressions), for some
languages by lemma (base form) and/or
morphosyntactic tag; with a virtual keyboard to type
in foreign characters; with an option to recall a
previous query
• Displaying parallel concordances side by side or in
rows; displaying more context; displaying/suppressing
structural tags (paragraphs, sentences, segments),
bibliographical data and concordance id, lemma and/or
morphosyntactic tag for keyword or all displayed
words (for some languages); export of concordances
as a spreadsheet file.
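
The logic of a query by word form and lemma can be emulated in a few lines of Python. The sketch below is not CQL and not the Park or Manatee implementation; it only assumes that each token carries a word form and a lemma, that every queried attribute must match its regular expression in full, and that negated Czech verb forms share the lemma of the positive verb. The token list and the function names are invented for the illustration.

```python
import re
from typing import Dict, List

Token = Dict[str, str]  # attribute name -> value, e.g. {"word": ..., "lemma": ...}

def matches(token: Token, query: Dict[str, str]) -> bool:
    """True if every queried attribute fully matches its regular expression."""
    return all(
        re.fullmatch(pattern, token.get(attribute, "")) is not None
        for attribute, pattern in query.items()
    )

def search(tokens: List[Token], query: Dict[str, str]) -> List[int]:
    """Return the positions of all tokens satisfying the query."""
    return [i for i, token in enumerate(tokens) if matches(token, query)]

# Invented sample sentence with hand-assigned lemmas.
tokens = [
    {"word": "Hagrid", "lemma": "Hagrid"},
    {"word": "tomu", "lemma": "ten"},
    {"word": "nevěřil", "lemma": "věřit"},
    {"word": ",", "lemma": ","},
    {"word": "ale", "lemma": "ale"},
    {"word": "Harry", "lemma": "Harry"},
    {"word": "věřil", "lemma": "věřit"},
    {"word": ".", "lemma": "."},
]

all_forms = search(tokens, {"lemma": "věřit"})
negated = search(tokens, {"lemma": "věřit", "word": "[Nn]e.*"})
print(all_forms, negated)
```

Combining a lemma constraint with a pattern on the word form is one simple way to narrow a query to negated forms of a verb such as věřit ‘believe’; for tagged languages a constraint on the morphosyntactic tag could be used instead.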
Below is a screenshot of the corpus search tool after
Czech and Russian have been specified as the
languages to be queried, which may illustrate the
existing state of affairs. The list of available
titles shrinks depending on the choice of languages.
When two additional languages, English and Polish in
this case, are selected next to Czech and Russian,
the list of common texts shrinks even further, to
only two novels by Milan Kundera: The Unbearable
Lightness of Being and The Joke (not shown).
Specifying languages and titles
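
The shrinking of the title list is simply a set intersection, which the following Python sketch illustrates. The catalogue is invented: only the two Kundera novels and the four language codes come from the text, while the placeholder titles and their language sets are hypothetical and do not describe the actual content of InterCorp.

```python
from typing import Dict, List, Set

def shared_titles(catalogue: Dict[str, Set[str]], languages: Set[str]) -> List[str]:
    """Titles available in every one of the selected languages."""
    return sorted(
        title for title, available in catalogue.items() if languages <= available
    )

# Hypothetical catalogue: title -> languages in which it is available.
catalogue = {
    "The Unbearable Lightness of Being": {"cs", "ru", "en", "pl"},
    "The Joke": {"cs", "ru", "en", "pl"},
    "Placeholder title A": {"cs", "ru"},
    "Placeholder title B": {"cs", "en"},
    "Placeholder title C": {"cs", "ru", "en"},
}

two_languages = shared_titles(catalogue, {"cs", "ru"})
four_languages = shared_titles(catalogue, {"cs", "ru", "en", "pl"})
print(len(two_languages), four_languages)
```

Each language added to the selection can only keep the list the same or make it shorter, which is the effect described above.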
A query can be specified for any language or any
combination of languages. The next figure shows a
CQL query into the Czech part, retrieving negated
forms of the verb věřit ‘believe’:
The first three hits are shown in the following figure.
The number of tokens in the column headings refers
to the total number of word tokens present in the
titles selected for the query.
Query results in vertical view
The search results can be displayed with different
types of metatextual information and/or in the
horizontal view. So far, the option to restrict
morphological annotation to the keyword is not
available for the other languages, because no keyword
can be identified in a language for which none is
specified in the query. This will change with the
planned introduction of word-to-word alignment,
however problematic this may be, or with the option
to select the most likely keyword equivalents from a
list of the most frequent content words in the
parallel concordances.
Query result in horizontal view with tags
It is important to mention that all InterCorp texts
available in Park can be queried also as a set of
monolingual corpora through a web-based version of
Bonito, the interface used for the monolingual parts
of the Czech National Corpus, offering a more
extensive choice of features (filters, sorting,
collocations, frequency distribution, random
sampling etc., see
http://korpus.cz/corpora/intercorp/).
The present form and content of the corpus data,
together with the (pre-)processing and search
infrastructure, are not yet final. The interface of
the corpus, or rather its outward face, the parallel
concordancer Park, is gradually acquiring a richer
set of features, to match those available in its
monolingual counterparts and to offer features
relevant for parallel data.
So far, search is possible on 4 levels: lemma (if
the text is lemmatized), phrase, word form, or a
combination of these using CQL, which requires more
elaborate knowledge. A very simple search starting
from Czech, using a simple query for the lemma stůl
(table), which is basically monosemous, and looking
for its counterparts in the English and Italian texts
yields, perhaps surprisingly, an array of possible
equivalents in the two texts scanned (namely M.
Kundera’s Nesmrtelnost and J. K. Rowling’s Harry
Potter and the Philosopher’s Stone, i.e. one text
originally in Czech and the other in English).
All 100 occurrences of the Czech lemma stůl give
89 (89%) English counterparts of table, 5 of desk, 1
of desktop and 5 cases of no translation equivalent
(i.e. 5%). The Italian results are more varied:
tavola is used 83 times (83%), banco three times and
cattedra twice, while banchetto, scrivania and
scrittoio each come up once, with no equivalent in 9
cases (9%). At first sight, this may seem
straightforward enough, but the rather high number of
cases with no equivalent, higher in Italian than in
English, makes one wonder why this is so. Two
examples may begin to suggest different
possibilities. The first, Czech-English, case is
based on an implication (operations are carried out
on a table, hence this does not have to be spelled
out); the second, Czech/English-Italian, case is a
deliberate omission of the English table, though the
immediate context does not warrant it.
CZ při nějaké nevinné operaci zemřela na
operačním stole mladá pacientka kvůli nedbale
provedenému uspání
ENG a young woman who in the course of a
completely minor operation died because of
carelessly administered anaesthetic
CZ Hagrid se k němu naklonil přes stůl.
ENG Hagrid leaned across the table.
IT Hagrid si chinò verso di lui.
However, it is probably more important to take into
account the diversity of all the equivalents offered
here, namely 3 positive equivalents in English and 6
in Italian, which is fertile soil for the possible
improvement of dictionaries, etc. This possibility is
evident from the fact that most of the low-frequency
Italian equivalents are rarely found in dictionaries.
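
The figures quoted for stůl can be tabulated mechanically, as the following Python sketch shows. The counts are those reported above; the label for untranslated cases and the function name summarize are choices made here for the illustration, not part of any InterCorp tool.

```python
from collections import Counter

NO_EQUIVALENT = "(no equivalent)"  # label chosen here for untranslated cases

# Counts reported above for the 100 occurrences of the Czech lemma "stůl".
english = Counter({"table": 89, "desk": 5, "desktop": 1, NO_EQUIVALENT: 5})
italian = Counter({"tavola": 83, "banco": 3, "cattedra": 2, "banchetto": 1,
                   "scrivania": 1, "scrittoio": 1, NO_EQUIVALENT: 9})

def summarize(counts):
    """Return (total, percentage per equivalent, number of positive equivalents)."""
    total = sum(counts.values())
    shares = {item: 100 * n / total for item, n in counts.items()}
    positive = [item for item in counts if item != NO_EQUIVALENT]
    return total, shares, len(positive)

for name, counts in (("English", english), ("Italian", italian)):
    total, shares, n_positive = summarize(counts)
    print(f"{name}: {total} occurrences, {n_positive} positive equivalents")
    for item, n in counts.most_common():
        print(f"  {item:<16}{n:>4}  {shares[item]:.0f} %")
```

With larger samples, the same tabulation gives the kind of frequency-ranked list of equivalents that could be checked against existing dictionary entries.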
5. Research Needs and Possibilities.
It seems that the InterCorp project will be a
useful resource, as can be observed in its initial
results, some of which have found their way into two
volumes of contributions published as a result of a
conference held in Prague in 2009
(Čermák-Klégr-Corness and Čermák-Kocek).
Though more possibilities are offered, research on
a multilingual corpus might primarily be twofold: (A)
applied and (B) theoretical.
The applied research (A) will depend on actual
demand and might be related, traditionally, mostly to
translation studies and lexicography (Teubert 2001,
2007).
Problems of interpretation of the same text in a
number of different translations are an interesting
possibility. Since every single translation captures
only part of the meaning, one may ask, yet again,
what is actually, usually or always lost in translation,
etc.
Multilingual lexicography does not seem popular at
the moment (apart from terminology, such as
Eurodicautom, renamed IATE, i.e. InterActive
Terminology for Europe), but this might change. Let
us just mention that it could be useful to have, for
example, a dictionary of closely related languages
such as Czech, Polish and Slovak, or the Scandinavian
or Romance ones, etc., often used only for checking
or for avoiding false friends, etc.
A practical use of multilingual corpora can
definitely be seen in the areas of machine
translation, automatic text mining and word-sense
disambiguation, too.
The latter, (B) theoretical, line of research in
advanced multilingual comparison may also open some
new vistas, hitherto unexplored because of the lack
of data.
But it is in comparative corpus linguistics that a
multilingual corpus might prove useful, offering
better data to general linguistics, typology,
pragmatics and discourse studies at least.
A number of general issues may be raised in this
new framework. Thus, one of the old and rather
general statements about the relationships of
languages in smaller and larger groups would
certainly call for a more precise formulation. On the
other hand, research into the seemingly endless
diversity of non-related languages, covered so far by
typology and universals only, would be an open-ended
venture where inspiration can be drawn from the data
and the typology of the differences.
While the strong point of any monolingual corpus
research has always been its study of authentic
texts and real contexts, bilingual and multilingual
corpora are different in that translations are not
original, authentic texts (nor, for that matter, are
their contexts, which are translated too). Obviously, a
methodology for evaluating translated counterparts
will have to be found here.
It is evident that, as one moves upwards from lexical
items through collocations to sentences and their
combinations, the value of each step must inevitably
become more problematic and open to various
interpretations. Yet, sticking to meaning, which must
be taken as the starting point, it seems that more
interesting results must be sought at higher levels
rather than at lower ones, such as words. Having a
parallel corpus or corpora offering profuse contexts
and a variety of equivalents of an item on a scale
that can be statistically evaluated means much
more than the old-time manual contrastive study
based on odd and unsystematic examples only.
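The statistical evaluation of equivalents mentioned above can be illustrated by a minimal sketch. It is not part of InterCorp or its tools: the function name equivalent_distribution, the toy Czech-English sentence pairs and the list of candidate equivalents are all invented here for illustration, and sentence alignment is simply assumed to be given.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def equivalent_distribution(aligned_pairs, source_item, candidates):
    """Relative frequency of candidate equivalents of source_item.

    aligned_pairs: (source sentence, target sentence) tuples, assumed
    to be already sentence-aligned (hypothetical toy input).
    """
    counts = Counter()
    for source, target in aligned_pairs:
        if source_item not in tokenize(source):
            continue  # the item does not occur in this source sentence
        target_tokens = tokenize(target)
        matched = [c for c in candidates if c in target_tokens]
        if matched:
            counts.update(matched)
        else:
            # no listed equivalent: rendered otherwise or left out
            counts["(other/none)"] += 1
    total = sum(counts.values())
    return {eq: n / total for eq, n in counts.items()} if total else {}


# Invented examples for the Czech connective "však"
pairs = [
    ("To však není pravda.", "However, that is not true."),
    ("On však přišel.", "But he came."),
    ("To však nevíme.", "However, we do not know that."),
    ("Ona však mlčela.", "She remained silent."),
    ("Přišel pozdě.", "He came late."),
]
dist = equivalent_distribution(pairs, "však", ["however", "but", "though"])
print(dist)
```

On real data the candidate list would itself have to be derived from the corpus, and word forms would have to be lemmatized; the sketch only shows that, once contexts are aligned, the spread of equivalents of an item becomes a countable quantity.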
6. A Conclusion.
Since so many open-ended issues and
desiderata could be brought up, they must be
given due attention in specialized contributions.
Instead, it might perhaps be of some interest to recall
the Final Panel Discussion of the 2009 InterCorp
conference in Prague (Čermák, Klégr, Corness).
Five broad topics brought before the participants,
eliciting a lively discussion, seem to indicate
answers to some of the more pressing problems.
1 The Role of a Third Language in Bilingual
Corpora: Extent and Methodology
Views Expressed: A third language is
indispensable if the number of translated texts is not
very large. Its extent does not seem to be relevant.
However, it is necessary to distinguish (as to its
relevance) between the original text and the
translation; the final results may be checked against
a large balanced monolingual corpus.
2 Balancing of Two Languages in a Parallel
Corpus
Views Expressed: This is desirable, though
pragmatic factors limiting the number of accessible
texts may distort the balance.
3 A Joint Text Core for More Languages
Views Expressed: As a possibility it is certainly
desirable, though one does not really know in
advance how many users might use this feature
enabling comparison of more than two languages;
however, other factors in favour of this can also be
found.
4 Legal Problems Relating to Copyright and
Ownership of Texts
Views Expressed: Let’s not worry. No corpus
linguist has ever been sent to court for breaching
copyright by including a text in a corpus. One
should not stop collecting parallel texts because of
legal formalities; there is always the option of
granting limited access, if necessary by means of a
password; the practice of text sampling or
disarranging the sequence of text parts does not
seem useful, however.
5 Critical Number of Words or Size of a Parallel
Corpus for Practical Purposes, Specifically in
Lexicography
Views Expressed: Size depends on the goal; in
(bilingual) lexicography aiming at c. 20 thousand
lemmas, millions of words are necessary.
6 Miscellaneous
Suggestions: A parallel corpus should include more
text types (i.e. in addition to fiction and professional
texts), if possible. A parallel corpus is useful in
language teaching.
References
Barlow, M. 1992: Using Concordance Software in
Language Teaching and Research, in Shinjo, W. et al.
„Proceedings of the Second International Conference on
Foreign Language Education and Technology“, Kasugai,
Japan: LLAJ & IALL.
Barlow M. 2000: Parallel texts in linguistic analysis, in
M. Barlow and S. Kemmer (eds.) Usage-based models of
language, in Botley, S. P., T. McEnery, A. Wilson (eds.),
Multilingual Corpora in Teaching and Research,
Amsterdam, pp. 106-115.
Barlow M. 2002: ParaConc: Concordance software for
multilingual parallel corpora, in: „Language Resources
for Translation Work and Research, LREC 2002“, pp.
20–24.
Botley S., A. McEnery & A. Wilson (eds.) 2000:
Multilingual Corpora: Teaching and Research.
Amsterdam.
Čermák F., A. Klégr, P. Corness (eds.) 2010:
InterCorp: Exploring a Multilingual Corpus. Praha.
Čermák F., J. Kocek (eds.) 2010: Mnohojazyčný korpus
InterCorp: Možnosti studia. Praha.
Čermák F., A. Rosen (in print): The Case of
InterCorp, a multilingual parallel corpus (International
Journal of Corpus Linguistics).
Gage W. W. 1961: Contrastive Studies in Linguistics: A
Bibliographical Checklist. Washington, DC.
Hammer J. H., F. A. Rice 1965: A bibliography of
contrastive linguistics. Washington, DC.
Johansson S. 2007: Seeing through Multilingual Corpora;
On the use of corpora in contrastive studies, in: “Studies
in Corpus Linguistics”, Amsterdam.
Melamed I. Dan 2001: Empirical Methods for Exploiting
Parallel Texts. Cambridge, MA: MIT Press.
Proceedings of the 2003 Workshop on Building and
Using Parallel Texts. Available at:
www.llas.ac.uk/resources/goodpractice.aspx?resourceid=1444&PHPSESSID=d9b58ba3f2a870be08f2e417e57d8326
www.cse.unt.edu/~rada/wpt/
Proceedings of the 2005 Workshop on Building and
Using Parallel Texts. Available at:
www.aclweb.org/anthology-new/W/W05/W050800.pdf.
Resnik Ph. 1999: Mining the web for bilingual text, in:
Proc. 37th ACL, 527-534, University of Maryland Press.
Teubert Wolfgang 2001: Corpus Linguistics and
Lexicography, in: „International Journal of Corpus
Linguistics“, 6, Special Issue, pp. 125-153.
Teubert Wolfgang ed. 2007: Text Corpora and
Multilingual Lexicography. Birmingham.
Varga D., L. Németh, P. Halácsy, A. Kornai, V. Trón and
V. Nagy 2005. Parallel corpora for medium density
languages, in: „Proceedings of the RANLP 2005“, pp.
590-596.
Vavřín M., A. Rosen 2008: Intercorp: A Multilingual
parallel Corpus, in: „Труды Международной
конференции Корпусная лингвистика 2008“, Санкт-Петербург,
pp. 156-162.