Parallel corpora and contrastive studies

Hilde Hasselgård
University of Oslo
From monolingual to
multilingual corpus linguistics
Corpus linguistics – the study of language by means of large(ish),
structured databases of text compiled and prepared for use in
linguistic research.
Largely developed within English linguistics, with the Brown
corpus as the first (1960s), followed by the Lancaster-Oslo/Bergen
(LOB) corpus.
Greatly facilitated the access to material.
Opened up new possibilities for quantitative studies & variation
studies.
Parallel corpora: a more recent development (1990s), requiring
new technology and new research methods.
Structure of talk
•Multilingual corpus linguistics
– Multilingual corpora
– The English-Norwegian Parallel Corpus
– Contrastive analysis
•The use of parallel corpora in contrastive studies
– The contribution of parallel corpora
– Methodology
•The Oslo Multilingual Corpus and the work of ”Språk i Kontrast”
(Languages in Contrast) in Oslo
•Case study: two future-referring expressions
•Summing up
What is a parallel corpus?
original texts with translations into one or more other languages → a translation corpus
comparable original texts in different languages → a comparable corpus
bi-directional translation corpus → parallel corpus
Translation corpus
A corpus that contains the ‘same’ texts in more than one language, in
other words a corpus with both original and translated texts.
Original text(s) → Translation, language 1
                 → (Translation, language 2)
                 → (Translation, language 3)
Comparable corpus
a corpus that contains original texts in more than one language and where the
texts in each language have been selected according to the same criteria
(genre, content, publication date etc.)
Language 1     Language 2     Language 3
criterion A    criterion A    criterion A
criterion B    criterion B    criterion B
criterion C    criterion C    criterion C
criterion D    criterion D    criterion D
Parallel corpus (ENPC model)
Combination of translation and comparable corpus
The original texts are comparable (genre, number of words)
The translations go in both directions – a bidirectional translation corpus
The English-Norwegian Parallel Corpus
(ENPC) – Some facts
Started as a research project at the Department of British and American Studies
in 1994 and completed in 1997. Prof. Stig Johansson initiated and directed the
project.
Original texts with translations (English-Norwegian and Norwegian-English)
Fiction and non-fiction
Compiled for use in applied and theoretical linguistic research
Development of software for alignment of the texts (Knut Hofland, UiB) and for
searching the corpus (Jarle Ebeling, UiO)
Sister projects: the English-Swedish Parallel Corpus (Lund/Göteborg) and the English-Finnish Parallel Corpus (Jyväskylä/Savonlinna/Tampere) – same principle of
compilation; to some extent also shared texts.
Other corpora built on the ENPC model in Germany (Chemnitz), France/Belgium
(Poitiers/Louvain-la-Neuve: the PLECI corpus), Spain (University of León).
Contrastive analysis
Contrastive analysis is the systematic comparison of two or more
languages, with the aim of describing their similarities and
differences. (Johansson 2007: 1)
CA [contrastive analysis] is a linguistic enterprise aimed at
producing inverted (i.e. contrastive, not comparative) two-valued
typologies (a CA is always concerned with a pair of languages),
and founded on the assumption that languages can be compared.
(James 1980: 3)
Executing a CA involves two steps: description and comparison;
and the steps are taken in that order. (James 1980: 63)
Contrastive analysis
A CA presupposes a tertium comparationis, i.e. a measure by
which we can be fairly certain we are comparing like with like.
The items to be compared across languages are selected on the
basis of perceived similarity (Chesterman 1998), such as
translation equivalence, semantic/etymological similarity,
grammatical or functional categories.
A frequently suggested tertium comparationis is translation
equivalence (e.g. James 1980, Chesterman 1998); which implies
that the items in the two languages convey (more or less) the
same meaning.
What can multilingual corpora
contribute?
They give insights into the languages compared – insights that are
likely to be unnoticed in studies of monolingual corpora.
They can be used for a range of comparative purposes and increase
our understanding of language-specific, typological and cultural
differences, as well as of universal features.
They illuminate differences between source texts and translations,
and between native and non-native texts.
They can be used for a number of practical applications, e.g. in
lexicography, language teaching, and translation.
(Aijmer & Altenberg 1996: 12)
Other benefits of a parallel
corpus such as the ENPC
Ready access to (relatively) large quantities of bilingual data
Sentence alignment
Comparable original and translated texts in both languages
Control for translation bias
In-built tertium comparationis through translation equivalence and
text comparability
“the paired texts reveal the interlingual identifications made by
translators” (Johansson 1999: 117)
Methodology: Classifying
correspondences
Correspondence
– expressed
   – congruent (same realisation type)
   – divergent (different realisation type)
– zero
Example: English correspondences of imidlertid (‘however’) in ENPC
Alle "innrømmelsene" hadde imidlertid en pris. (GL1) → However, all these
"concessions" had a price.
Det endte imidlertid godt: (…) (UD1) → But it ended well (…)
Reguleringstiltakene har imidlertid gitt resultater (…). (ABJH1) → The regulations
have shown results (…).
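The classification into congruent, divergent and zero correspondences can be sketched in code. This is a minimal illustration only: it assumes that the analyst has already labelled each aligned item with a realisation type, and the labels (`adverb`, `conjunction`) and the function name `classify_correspondence` are hypothetical, not part of the ENPC software.

```python
from typing import Optional


def classify_correspondence(source_type: str, target_type: Optional[str]) -> str:
    """Classify a correspondence as 'zero', 'congruent' or 'divergent'.

    source_type and target_type are hypothetical realisation-type labels
    assigned by the analyst; target_type is None when nothing in the
    translation corresponds to the source item.
    """
    if target_type is None:
        return "zero"
    if source_type == target_type:
        return "congruent"
    return "divergent"


# The three imidlertid examples, with assumed analyst-assigned labels.
examples = [
    ("imidlertid -> however", "adverb", "adverb"),
    ("imidlertid -> but", "adverb", "conjunction"),
    ("imidlertid -> (nothing)", "adverb", None),
]
for label, source_type, target_type in examples:
    print(label, "=", classify_correspondence(source_type, target_type))
```

In practice the labelling step is the hard part and is done manually; the function only records the decision rule.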
Paradigms of correspondences
Swedish translations of however      English translations of emellertid
emellertid (51 = 47%)                however (83 = 81%)
men (‘but’) (36 = 33%)               but (3)
dock (14 = 13%)                      yet (3)
ändå (2)                             anyway (1)
däremot (1)                          Ø (13)
i alla fall (1)
Ø (4)
(Altenberg 1999)
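A paradigm of this kind is a frequency list with percentages of the total. The sketch below recomputes the Swedish side from the counts on the slide. It assumes that Ø (4) belongs to the Swedish column, which makes the counts sum to 109, the source total for however used in the mutual correspondence calculation; the function name `paradigm` is introduced here for illustration.

```python
from collections import Counter

# Swedish translations of "however"; counts copied from the slide
# (Altenberg 1999). Assigning Ø = 4 to this column is an assumption
# that makes the total 109.
swedish_translations_of_however = Counter({
    "emellertid": 51,
    "men": 36,
    "dock": 14,
    "ändå": 2,
    "däremot": 1,
    "i alla fall": 1,
    "Ø": 4,
})


def paradigm(counts):
    """Return (expression, count, rounded percent) rows, most frequent first."""
    total = sum(counts.values())
    return [
        (expression, n, round(100 * n / total))
        for expression, n in counts.most_common()
    ]


for expression, n, percent in paradigm(swedish_translations_of_however):
    print(f"{expression:12} {n:3} {percent:3}%")
```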
Mutual correspondence (MC)
(Altenberg 1999)
The frequency with which different (grammatical, semantic and
lexical) expressions are translated into each other.
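Calculated and expressed as a percentage by means of the formula
$$MC = \frac{(A_t + B_t) \times 100}{A_s + B_s}$$
where A_t and B_t are the numbers of times A and B are translated by each other, and A_s and B_s are their frequencies in the source texts.
The MC of however and emellertid in the ESPC is thus
(51 + 83) × 100 / (109 + 103) = 63.2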
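The calculation can be checked with a few lines of code. This is a small sketch of the formula as given on the slide; the function and parameter names are chosen here for illustration.

```python
def mutual_correspondence(a_t: int, b_t: int, a_s: int, b_s: int) -> float:
    """Mutual correspondence as a percentage: (At + Bt) * 100 / (As + Bs).

    a_t, b_t: how often A is translated by B, and B by A.
    a_s, b_s: how often A and B occur in the source texts.
    """
    if a_s + b_s == 0:
        raise ValueError("source frequencies must not both be zero")
    return (a_t + b_t) * 100 / (a_s + b_s)


# however / emellertid in the ESPC, using the counts from the slide.
mc = mutual_correspondence(a_t=51, b_t=83, a_s=109, b_s=103)
print(f"MC = {mc:.1f}")  # MC = 63.2
```

An MC of 100 would mean that the two items are always translated by each other, and 0 that they never are.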
Lexicogrammar
Paradigms of correspondence highlight the fuzzy borderlines
between lexis and grammar and grammar and discourse.
Example: A modal verb will have a wide range of
correspondences
Norwegian kan (‘can’) (Løken 2007)
Modal aux: can, could, may, might, ‘ll, will, would, should
Other verbs: know, enable, have, have to, had better
Adjectives: possible, able, capable
Adverbs: maybe, perhaps
Suffix: -able
Valget av tidspunkt kan også inneholde et stenk av egoisme.
(KH1)
Maybe his choice of timing also contained a touch of egotism.
From ENPC to OMC under the
SPRIK umbrella (SPRåk I Kontrast)
New languages have been added, first (mainly) German, then French
Focus on English – Norwegian – German in the first phase of the SPRIK project: original texts in each language with translations into the other
two.
Same principles for text selection, text sampling and preparation as for
the ENPC (exception: even more biased towards fiction because of the
lack of translated non-fiction)
Same (or later versions of same) software for alignment, searching etc.
Expanded search facilities and research possibilities:
– Three-way comparison of translations and originals
– Possibilities of investigating two different translations of the same text
(translation strategies, translationese)
Current stock of multilingual
corpora at Oslo
OMC:
Parallel corpora: English-Norwegian, French-Norwegian, German-Norwegian; three-way English-German-Norwegian.
Translation corpora: Norwegian – English – French – German, Norwegian
– French – German, English-Dutch, English-Portuguese.
Multiple translations corpus (English-Norwegian)
Outside OMC:
Russian – English – Norwegian (RuN)
Multilingual corpora of historical texts (two projects)
Trilingual parallel corpus model
Searching in No-En-Fr-Ge
Jeg kommer til å si det til ham likevel.” (KF1)
Ich werde es ihm sowieso sagen.” (KF1TD)
I 'll tell him about it anyway.” (KF1TE)
De toute façon, je le lui dirai.” (KF1TF)
"You're going to have a book reissued … (BHH1TE)
Du skal få en bok trykt opp igjen ... (BHH1)
"Ein Buch von dir wird neu aufgelegt, ... (BHH1TD)
Un de tes livres va être réédité ... (BHH1TF)
Using the ENPC/OMC for
research
Particularly well suited for studies of lexis / lexico-grammar (or
phenomena that can take lexis as their starting point)
A broad range of phenomena have been (are being) investigated, e.g.
the use of individual verbs (bli, få, take, give, see), modality, particular
syntactic constructions, connectives, sentence openings and other
discourse phenomena.
The methodology is not tied to any particular theoretical approach
A range of theoretical approaches, e.g. SFL, cognitive linguistics,
pattern grammar, lexis-based approach à la Sinclair + traditional
grammar / basic linguistic theory.
Limitations
(As with corpus linguistics in general:) you can only search for
something that is explicit in the text
Restricted to texts / text types that have been translated
The size of the corpus restricts studies of less frequent lexical/
grammatical constructions
Faulty and less successful translations
The corpus has been word-class tagged, but not parsed (syntactically
annotated), i.e. it is not possible to search for grammatical
constructions, patterns of word order etc.
Tagging errors
Ways around the limitations?
Identify typical (and searchable!) expressions of a grammatical construction, e.g.
presentatives, clefting, phrasal verbs, inversion.
Use a combination of word class tagging, filters and wildcards. Example: tense /
aspect, participle clauses (e.g. BE + V-ing).
In any case – a lot of work involved in tidying up the search results (precision).
Possibility of searching with regular expressions
Errors in the tagging: Never possible to make sure that you have found all the
relevant instances (recall).
Errors/idiosyncrasies in the translation: Weed out? Ignore translations that occur
only once, or in only one text?
Manual searches in running text, e.g. for Theme, subjects.
Supplement results of parallel corpus study with (larger) monolingual corpora.
Supplement corpus study with e.g. experimental data.
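To illustrate the wildcard and regular-expression workaround and the precision problem it brings, here is a minimal sketch of a BE + V-ing search over plain, untagged sentences. It is not the ENPC search software: the pattern and the function name `find_be_ving` are assumptions made for this example, and a real search would combine such a pattern with word-class tags.

```python
import re
from typing import List

# A form of BE (full or cliticised), optionally one intervening word
# (e.g. "not", "only"), then any word ending in -ing.
# The pattern is deliberately loose, so it over-generates.
BE_VING = re.compile(
    r"(?:\b(?:am|is|are|was|were|be|been|being)|'(?:m|s|re))"
    r"\s+(?:\w+\s+)?\w+ing\b",
    re.IGNORECASE,
)


def find_be_ving(sentence: str) -> List[str]:
    """Return every stretch of text matching the loose BE + V-ing pattern."""
    return [match.group(0) for match in BE_VING.finditer(sentence)]


sentences = [
    "Neither of them knew what was going to happen.",
    "I know what he's going to say even before he says it.",
    "This is something else.",  # false positive: "something" is not a verb
]
for sentence in sentences:
    print(find_be_ving(sentence), "<-", sentence)
```

The third sentence shows why the results need manual tidying: the pattern has no way of telling a participle from any other word in -ing.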
Examples of studies based on
ENPC/ OMC / ESPC
Bengt Altenberg: Work on adverbial connectors, sentence openings, subject
selection etc. in English and Swedish.
Karin Aijmer: Work on modality and discourse markers in English and Swedish.
Åke Viberg: Work on verbs of motion and cognition in English and Swedish.
Helge Dyvik: Translations as semantic mirrors; ENPC as basis for bilingual
wordnet.
Jarle Ebeling (2000): Presentative constructions in English and Norwegian : a
corpus-based contrastive study (PhD, University of Oslo)
Mats Johansson (2002): Clefts in English and Swedish: A contrastive study of IT-clefts and WH-clefts in original texts and translations (PhD, Lund University)
Signe Oksefjell Ebeling (2003): The Norwegian verbs bli and få and their
correspondences in English: a corpus-based contrastive study (PhD, University
of Oslo)
Berit Løken: Beyond modals: A corpus-based study of English and
Norwegian expressions of possibility (PhD, Oslo, 2007)
Lene Nordrum: English lexical nominalizations in a Norwegian-Swedish
contrastive perspective. (PhD, Göteborg, 2007)
Wiebke Ramm: Sentence boundary adjustments in translation (German /
Norwegian): Consequences on information distribution and discourse
structure (PhD, Oslo, ongoing)
Astrid Nome: Ongoing PhD work on connectors in Norwegian and
French. (Oslo)
Cathrine Fabricius-Hansen et al.: Big Events, Small Clauses. The
Grammar of Elaboration. (Forthcoming book with multiple authors and
multiple languages)
Master theses (English, German, French) studying individual verbs,
syntactic constructions, connectors, metaphor …
My own contrastive work
2009. A textual perspective on the pragmatic markers in fact and faktisk. In S.
Slembrouck, M. Taverniers, M. Van Herreweghe (eds.) From will to well: Studies
in Linguistics offered to Anne-Marie Simon-Vandenbergen. Ghent: Academia
Press.
2007. Using the ENPC and the ESPC as a parallel translation corpus: adverbs of
frequency and usuality. Nordic Journal of English Studies 6:1,
http://ojs.ub.gu.se/ojs/index.php/njes/issue/view/6
2006. “Not now” – on non-correspondence between the cognate adverbs now and nå.
In K. Aijmer & A.-M. Simon Vandenbergen (eds.) Pragmatic Markers in Contrast.
Elsevier, 93-114.
2005. Theme in Norwegian. In K.L. Berge, & E. Maagerø (eds.). Semiotics from the
North: Nordic Approaches to Systemic Functional Linguistics. Oslo: Novus, 35-48.
2004. Spatial linking in English and Norwegian. In K. Aijmer & H. Hasselgård (eds.).
Translation and Corpora. Göteborg: Acta Universitatis Gothoburgensis, 163-188.
2004. Thematic choice in English and Norwegian. Functions of Language 11:2, 187-212.
2000. English multiple Themes in translation. In A. Klinge (ed.) Contrastive Studies in
Syntax. Special issue of Copenhagen Studies in Language, Vol 25. Copenhagen:
Samfundslitteratur, 11-38.
Case study: be going to and
komme til å (‘come to’)
Future-referring expressions based on motion verb +
infinitive
Both described in grammars as common expressions,
though less common than expressions with English will,
Norwegian skal
Meanings
be going to
– ‘future fulfilment of the present’; present intention or present cause (Quirk et
al 1985)
– associated with present intention or arrangement; was going to quite often
has ‘an implicature of non-actualisation’. (Huddleston & Pullum 2002)
– Two meanings: ‘futurish’, linked to a present situation, and ‘future tense’,
simply expressing future time reference. (Declerck 2006)
komme til å
– the speaker predicts what will happen based on his knowledge at the moment
of speaking (Faarlund et al 1997)
– Past tense kom til å V: also ‘accidentally V’ or ‘was led to V’ / ‘grew to V’
(Vannebo 1979 and Engelsk Stor Ordbok)
Examples
1. I know what he’s going to say even before he says it. (FW1)
2. Jeg vet hva han kommer til å si selv før han sier det. (FW1T)
3. "I was going to wait until another time we met, but I may as well tell
you now. (AH1)
4. Meningen var å vente til en annen gang, men jeg kan like godt si det
nå. (AH1T)
5. Ingen av dem visste hva som kom til å skje. (TTH1)
6. Neither of them knew what was going to happen. (TTH1T)
7. Kanskje hun kom til å svelge dem ved et uhell? (LSC1)
8. Maybe she happened to swallow them by accident? (LSC1T)
9. Og siden ble det jeg som kom til å se mest til henne. (EHA1)
10. And then I became the one who ended up seeing her most often.
(EHA1T)
be going to and komme til å in ENPC fiction (raw frequencies)
[Bar chart: raw frequencies (axis 0 to 250) of going to and komme til å in original texts and in translations]
Preliminary observations
Be going to is more common than komme til å in original texts
Be going to is more common in original texts than in translations
Komme til å is less common in original texts than in translations
– i.e. translations in both directions can be assumed to be coloured by the
source texts.
The frequency differences between originals and translations
(particularly with komme til å) indicate that the two expressions
can often be used in the same contexts, but may tend not to be.
Correspondences of be going to (percentages)
[Bar chart, axis 0 to 35: Norwegian correspondences of be going to in Norwegian translations and Norwegian originals. Categories include komme til å, skal/skulle + infinitive, vil/ville + infinitive, ha tenkt, simple tense and other]
Correspondences of komme til å (percentages)
[Bar chart, axis 0 to 35: English correspondences of komme til å in English translations and English originals. Categories include going to, will/would/might + infinitive, simple tense, be to, happen to and other]
Correspondences
The mutual correspondence between be going to and komme til å is
surprisingly low: 12.6%
The correspondence is asymmetrical:
– 15% of be going to are translated as komme til å
– 7% of komme til å are translated as be going to
Komme til å has meanings not covered by be going to (‘accidentally’,
‘grow to’, ‘be led to’).
The ‘present cause/intention’ meaning works differently for the two
expressions; apparently also speaker certainty/non-actualisation.
1. What are we going to do, says Ruth, … (BV2T)
2. Hva skal vi gjøre, sier Rut …(BV2)
3. Hun kommer bare til å bli redd." (THA1)
4. She 'll only be frightened." (THA1T)
Uncertain outcome, no intentionality
Confident prediction – speaker knowledge
5. "Are you going to run a hotel?" enquired Frederick reasonably, … (DL1)
6. "Har dere tenkt å drive hotell?" spurte Frederick fornuftig, … (DL1T)
Intention, but uncertain outcome
Thus, in spite of shared meanings, English be going to and
Norwegian komme til å differ as to
– The frequency with which the item is chosen
– The extent to which they compete with other future-referring expressions
– The extent to which they convey confident predictions, ‘present intention’ and
‘actualised future in past’.
Some other explanations may be
– Translators in both directions tend to normalize be going to / komme til å into a more
common future-referring expression (will/would INF and skal/skulle INF); Will/would and
skal/skulle are also the most common sources of komme til å / be going to
– Sometimes more lexically explicit forms have been used to translate be going
to/komme til å: ha tenkt å / intend to (subject’s intention); was to (‘was led/destined to’)
– Be going to may be needed for syntactic reasons, as English modals lack non-finite
forms and do not show tense clearly.
– Norwegian modal auxiliaries are more flexible, having non-finite and tensed forms →
skal/skulle + INF fits into more syntactic environments than will/would + INF
The verb forms
[Stacked bar chart, 0% to 100%: distribution of verb forms (present, past, modalised, other) for going to OT, going to TT, komme til OT and komme til TT]
• The present tense be going to occurs to a great extent in direct
speech.
• The meanings of ‘accidentally do’ and ‘grow to’/ ‘be led to’ of
komme til å occur mainly with the past tense, the former also
with modalisation.
1. Hun kjenner at hun er søvnig, at hun kan komme til å sovne mot fars jakke,
hun vil ikke det. (BV2)
2. She feels that she is sleepy, that she might fall asleep against father's
jacket, but she doesn't want to do that. (BV2T)
3. … og at den kvinnen jeg leter efter egentlig var et barn den gangen hun
kom til å bety noe for meg.“ (FC1)
4. … and that the woman I'm searching for was really a child when she came
to mean something to me. (FC1T)
Some reflections on findings
and further work
The picture of correspondence is a complex one, in spite of the
rather similar descriptions in grammars of be going to and komme
til å.
Syntactic differences between will/skal-future expressions may go
some way towards explaining the difference in distribution.
Correspondence types will have to be correlated with tense forms.
Subtle differences of meaning regarding speaker certainty and
present cause/intention come to the surface when studying
correspondences.
be going to is closer to a neutral future meaning than komme til å;
further grammaticalized as a future tense.
Summing up
Parallel corpora enhance contrastive studies in a number of ways
– by ensuring that observations are based on authentic language use
– by yielding paradigms and patterns of correspondences
– thus often revealing meanings and nuances we might not have thought of
– and showing how the same meaning may be expressed by means of different
linguistic categories
– by providing quantitative data
– … thus also giving insights into ‘preferred ways of putting things’
– (if the corpus is bidirectional) by providing control for translation bias
– (if the corpus is representative) by controlling for the idiosyncrasies of
individual authors/translators
Why undertake corpus-based
contrastive investigations?
The importance of multilingual corpora
extends beyond contrastive studies. It is
up to the user to define fruitful research
questions and use the corpora creatively.
In this process we learn not only about
individual languages and their
relationships, about translation and
foreign-language acquisition, but also
about language in general – provided that
the study becomes truly multilingual.
Seeing through corpora we can see
through language.
Stig Johansson (2007: 316)
Information on the OMC /
ENPC
About the corpora:
OMC: www.hf.uio.no/ilos/english/originalfiler/services/omc/
ENPC: www.hf.uio.no/ilos/english/originalfiler/services/omc/enpc/
www.helsinki.fi/varieng/CoRD/corpora/ENPC/
About publications based on the OMC (up to 2006):
www.hf.uio.no/ilos/forskning/prosjekter/sprik/english/publications/
References
Aijmer, K. & B. Altenberg. 1996. Introduction. In K. Aijmer, B. Altenberg, M. Johansson
(eds.) Languages in Contrast. Lund University Press, 11-16.
Altenberg, B. 1999. Adverbial connectors in English and Swedish: Semantic and lexical
correspondences. In Hasselgård & Oksefjell (eds.) Out of Corpora. Amsterdam: Rodopi,
249-268.
Berglund, Y. 2005. Expressions of Future in Present-day English. A Corpus-based
Approach. Uppsala University.
Chesterman, A. 1998. Contrastive Functional Analysis. Amsterdam/Philadelphia: John
Benjamins Publishing Company.
Declerck, R. 2006. The Grammar of the English Verb Phrase, Vol. 1. Berlin: Mouton de
Gruyter.
Faarlund, J. T., S. Lie, K. I. Vannebo. 1997. Norsk Referansegrammatikk. Oslo:
Universitetsforlaget.
Huddleston, R. and G. K. Pullum. 2002. The Cambridge Grammar of the English Language.
Cambridge: Cambridge University Press.
James, C. 1980. Contrastive Analysis. London: Longman.
Johansson, S. 1999. Corpora and contrastive studies. In P. Pietilä & O-P. Salo (eds.)
Multiple Languages – Multiple Perspectives. AFinLA Yearbook 1999 / No. 57, 116-125.
Johansson, S. 2007. Seeing through multilingual corpora. Amsterdam: Benjamins.
Quirk, R., S. Greenbaum, G. Leech, J. Svartvik. 1985. A Comprehensive Grammar of the
English Language. London: Longman.
Vannebo, K. I. 1979. Tempus og tidsreferanse. Oslo: Novus.