Parallel corpora and contrastive studies
Hilde Hasselgård, University of Oslo

From monolingual to multilingual corpus linguistics
• Corpus linguistics: the study of language by means of large(ish), structured databases of text compiled and prepared for use in linguistic research.
• Largely developed within English linguistics, with the Brown corpus as the first (1960s), followed by the Lancaster-Oslo/Bergen (LOB) corpus.
• Greatly facilitated access to material.
• Opened up new possibilities for quantitative studies and variation studies.
• Parallel corpora: a more recent development (1990s), requiring new technology and new research methods.

Structure of talk
• Multilingual corpus linguistics
  – Multilingual corpora
  – The English-Norwegian Parallel Corpus
  – Contrastive analysis
• The use of parallel corpora in contrastive studies
  – The contribution of parallel corpora
  – Methodology
• The Oslo Multilingual Corpus and the work of “Språk i Kontrast” (Languages in Contrast) in Oslo
• Case study: two future-referring expressions
• Summing up

What is a parallel corpus?
• A translation corpus: original texts with translations into one or more other languages
• A comparable corpus: comparable original texts in different languages
• Parallel corpus: bi-directional translation corpus

Translation corpus
A corpus that contains the ‘same’ texts in more than one language, in other words a corpus with both original and translated texts.
[Diagram: Original text(s); Translation, language 1; (Translation, language 2); (Translation, language 3)]

Comparable corpus
A corpus that contains original texts in more than one language and where the texts in each language have been selected according to the same criteria (genre, content, publication date etc.)
Language 1     Language 2     Language 3
criterion A    criterion A    criterion A
criterion B    criterion B    criterion B
criterion C    criterion C    criterion C
criterion D    criterion D    criterion D

Parallel corpus (ENPC model)
• Combination of translation and comparable corpus
• The original texts are comparable (genre, number of words)
• The translations go in both directions: a bidirectional translation corpus

The English-Norwegian Parallel Corpus (ENPC) – Some facts
• Started as a research project at the Department of British and American Studies in 1994 and completed in 1997. Prof. Stig Johansson initiated and directed the project.
• Original texts with translations (English-Norwegian and Norwegian-English)
• Fiction and non-fiction
• Compiled for use in applied and theoretical linguistic research
• Development of software for alignment of the texts (Knut Hofland, UiB) and for searching the corpus (Jarle Ebeling, UiO)
• Sister projects: The English-Swedish Parallel Corpus (Lund/Göteborg), English-Finnish Parallel Corpus (Jyväskylä/Savonlinna/Tampere): same principle of compilation; to some extent also shared texts.
• Other corpora built on the ENPC model in Germany (Chemnitz), France/Belgium (Poitiers/Louvain-la-Neuve: the PLECI corpus), Spain (University of León).

Contrastive analysis
• Contrastive analysis is the systematic comparison of two or more languages, with the aim of describing their similarities and differences. (Johansson 2007: 1)
• CA [contrastive analysis] is a linguistic enterprise aimed at producing inverted (i.e. contrastive, not comparative) two-valued typologies (a CA is always concerned with a pair of languages), and founded on the assumption that languages can be compared. (James 1980: 3)
• Executing a CA involves two steps: description and comparison; and the steps are taken in that order. (James 1980: 63)

Contrastive analysis
• A CA presupposes a tertium comparationis, i.e. a measure by which we can be fairly certain we are comparing like with like.
• The items to be compared across languages are selected on the basis of perceived similarity (Chesterman 1998), such as translation equivalence, semantic/etymological similarity, grammatical or functional categories.
• A frequently suggested tertium comparationis is translation equivalence (e.g. James 1980, Chesterman 1998), which implies that the items in the two languages convey (more or less) the same meaning.

What can multilingual corpora contribute?
• They give insights into the languages compared – insights that are likely to be unnoticed in studies of monolingual corpora.
• They can be used for a range of comparative purposes and increase our understanding of language-specific, typological and cultural differences, as well as of universal features.
• They illuminate differences between source texts and translations, and between native and non-native texts.
• They can be used for a number of practical applications, e.g. in lexicography, language teaching, and translation.
(Aijmer & Altenberg 1996: 12)

Other benefits of a parallel corpus such as the ENPC
• Ready access to (relatively) large quantities of bilingual data
• Sentence alignment
• Comparable original and translated texts in both languages
• Control for translation bias
• In-built tertium comparationis through translation equivalence and text comparability
• “the paired texts reveal the interlingual identifications made by translators” (Johansson 1999: 117)

Methodology: Classifying correspondences
Correspondence
  – expressed
      congruent: same realisation type
      divergent: different realisation type
  – zero

Example: English correspondences of imidlertid (‘however’) in ENPC
  Alle "innrømmelsene" hadde imidlertid en pris. (GL1)
  However, all these "concessions" had a price.

  Det endte imidlertid godt: (…) (UD1)
  But it ended well (…)

  Reguleringstiltakene har imidlertid gitt resultater (…). (ABJH1)
  The regulations have shown results (…).
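The classification scheme above can be sketched in a few lines of Python. This is only an illustration, not the procedure used for the ENPC: the word-class lookup (`TARGET_WORD_CLASS`) and the function name `classify` are hypothetical, and treating "same word class" as "same realisation type" is a simplifying assumption.

```python
from collections import Counter

# Assumption: imidlertid is treated as an adverb, and "same realisation type"
# is approximated by "same word class". Both lookups are illustrative only.
SOURCE_WORD_CLASS = "adverb"
TARGET_WORD_CLASS = {"however": "adverb", "but": "conjunction"}


def classify(correspondence):
    """Return 'zero', 'congruent' or 'divergent' for one correspondence."""
    if correspondence is None:
        # Nothing in the translation corresponds to the source item.
        return "zero"
    if TARGET_WORD_CLASS.get(correspondence.lower()) == SOURCE_WORD_CLASS:
        return "congruent"
    # Unknown expressions also end up here; a real study would check them by hand.
    return "divergent"


# The three ENPC examples from the slide: (text code, English correspondence).
examples = [("GL1", "However"), ("UD1", "But"), ("ABJH1", None)]

labels = {code: classify(corr) for code, corr in examples}
tally = Counter(labels.values())

for code, label in labels.items():
    print(code, label)
```

In practice the correspondence for each hit has to be identified manually in the aligned sentence pair; the sketch only covers the labelling and counting step.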
Paradigms of correspondences

Swedish translations of however     English translations of emellertid
emellertid (51 = 47%)               however (83 = 81%)
men (‘but’) (36 = 33%)              but (3)
dock (14 = 13%)                     yet (3)
ändå (2)                            anyway (1)
däremot (1)                         Ø (13)
i alla fall (1)
Ø (4)
(Altenberg 1999)

Mutual correspondence (MC) (Altenberg 1999)
• The frequency with which different (grammatical, semantic and lexical) expressions are translated into each other.
• Calculated and expressed as a percentage by means of the formula
  $$\mathrm{MC} = \frac{(A_t + B_t) \times 100}{A_s + B_s}$$
  where $A_t$ and $B_t$ are the numbers of times items A and B are translated by each other, and $A_s$ and $B_s$ are their numbers of occurrences in the source texts.
• The MC of however and emellertid in the ESPC is thus
  $$\frac{(51 + 83) \times 100}{109 + 103} = 63.2$$

Lexicogrammar
• Paradigms of correspondence highlight the fuzzy borderlines between lexis and grammar and between grammar and discourse.
• Example: A modal verb will have a wide range of correspondences
Norwegian kan (‘can’)
  – Modal aux: can, could, may, might, ‘ll, will, would, should
  – Other verbs: know, enable, have, have to, had better
  – Adjectives: possible, able, capable
  – Adverbs: maybe, perhaps
  – Suffix: -able
  (Løken 2007)
  Valget av tidspunkt kan også inneholde et stenk av egoisme. (KH1)
  Maybe his choice of timing also contained a touch of egotism.

From ENPC to OMC under the SPRIK umbrella (SPRåk I Kontrast)
• New languages have been added, first (mainly) German, then French
• Focus on English – Norwegian – German in the first phase of the SPRIK project: original texts in each language with translations into the other two.
• Same principles for text selection, text sampling and preparation as for the ENPC (exception: even more biased towards fiction because of the lack of translated non-fiction)
• Same (or later versions of same) software for alignment, searching etc.
• Expanded search facilities and research possibilities:
  – Three-way comparison of translations and originals
  – Possibilities of investigating two different translations of the same text (translation strategies, translationese)

Current stock of multilingual corpora at Oslo
OMC:
• Parallel corpora: English-Norwegian, French-Norwegian, German-Norwegian; three-way English-German-Norwegian.
• Translation corpora: Norwegian – English – French – German, Norwegian – French – German, English-Dutch, English-Portuguese.
• Multiple translations corpus (English-Norwegian)
Outside OMC:
• Russian – English – Norwegian (RuN)
• Multilingual corpora of historical texts (two projects)

[Figure: Trilingual parallel corpus model]

Searching in No-En-Fr-Ge
  Jeg kommer til å si det til ham likevel.” (KF1)
  Ich werde es ihm sowieso sagen.” (KF1TD)
  I 'll tell him about it anyway.” (KF1TE)
  De toute façon, je le lui dirai.” (KF1TF)

  "You're going to have a book reissued … (BHH1TE)
  Du skal få en bok trykt opp igjen ... (BHH1)
  "Ein Buch von dir wird neu aufgelegt, ... (BHH1TD)
  Un de tes livres va être réédité ... (BHH1TF)

Using the ENPC/OMC for research
• Particularly well suited for studies of lexis / lexico-grammar (or phenomena that can take lexis as their starting point)
• A broad range of phenomena have been (are being) investigated, e.g. the use of individual verbs (bli, få, take, give, see), modality, particular syntactic constructions, connectives, sentence openings and other discourse phenomena.
• The methodology is not tied to any particular theoretical approach
  – A range of theoretical approaches, e.g. SFL, cognitive linguistics, pattern grammar, lexis-based approach à la Sinclair + traditional grammar / basic linguistic theory.
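As a worked illustration of the quantitative side of such studies, the mutual correspondence measure (Altenberg 1999) presented under Methodology, MC = (At + Bt) x 100 / (As + Bs), can be checked against the however / emellertid paradigms. The sketch below is a minimal one: the function name `mutual_correspondence` and the dictionary layout are assumptions made for illustration, while the counts are those given in the paradigm of correspondences.

```python
def mutual_correspondence(a_t, b_t, a_s, b_s):
    """Mutual correspondence (Altenberg 1999), expressed as a percentage.

    a_t, b_t: times item A is translated by B, and B by A.
    a_s, b_s: occurrences of A and B in the source texts.
    """
    if a_s + b_s == 0:
        raise ValueError("no source occurrences to compare")
    return (a_t + b_t) * 100 / (a_s + b_s)


# Counts from the paradigms of correspondences (Altenberg 1999).
however_to_swedish = {
    "emellertid": 51, "men": 36, "dock": 14, "ändå": 2,
    "däremot": 1, "i alla fall": 1, "Ø": 4,
}
emellertid_to_english = {"however": 83, "but": 3, "yet": 3, "anyway": 1, "Ø": 13}

a_s = sum(however_to_swedish.values())     # occurrences of however in English originals
b_s = sum(emellertid_to_english.values())  # occurrences of emellertid in Swedish originals

mc = mutual_correspondence(
    however_to_swedish["emellertid"],
    emellertid_to_english["however"],
    a_s,
    b_s,
)
print(round(mc, 1))  # 63.2
```

Summing the paradigm columns gives the source frequencies 109 and 103 used in the formula, which is a quick consistency check on a manually compiled paradigm.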
Limitations
• (As with corpus linguistics in general:) you can only search for something that is explicit in the text
• Restricted to texts / text types that have been translated
• The size of the corpus restricts studies of less frequent lexical/grammatical constructions
• Faulty and less successful translations
• The corpus has been word-class tagged, but not parsed (syntactically annotated), i.e. it is not possible to search for grammatical constructions, patterns of word order etc.
• Tagging errors

Ways around the limitations?
• Identify typical (and searchable!) expressions of a grammatical construction, e.g. presentatives, clefting, phrasal verbs, inversion.
• Use a combination of word class tagging, filters and wildcards. Example: tense / aspect, participle clauses (e.g. BE + Ving).
• In any case – a lot of work involved in tidying up the search results (precision).
• Possibility of searching with regular expressions
• Errors in the tagging: Never possible to make sure that you have found all the relevant instances (recall).
• Errors/idiosyncrasies in the translation: Weed out? Ignore translations that occur only once, or in only one text?
• Manual searches in running text, e.g. for Theme, subjects.
• Supplement results of parallel corpus study with (larger) monolingual corpora.
• Supplement corpus study with e.g. experimental data.

Examples of studies based on ENPC / OMC / ESPC
• Bengt Altenberg: Work on adverbial connectors, sentence openings, subject selection etc. in English and Swedish.
• Karin Aijmer: Work on modality and discourse markers in English and Swedish.
• Åke Viberg: Work on verbs of motion and cognition in English and Swedish.
• Helge Dyvik: Translations as semantic mirrors; ENPC as basis for bilingual wordnet.
• Jarle Ebeling (2000): Presentative constructions in English and Norwegian: a corpus-based contrastive study (PhD, University of Oslo)
• Mats Johansson (2002): Clefts in English and Swedish: A contrastive study of IT-clefts and WH-clefts in original texts and translations. (PhD, Lund University)
• Signe Oksefjell Ebeling (2003): The Norwegian verbs bli and få and their correspondences in English: a corpus-based contrastive study (PhD, University of Oslo)
• Berit Løken: Beyond modals: A corpus-based study of English and Norwegian expressions of possibility (PhD, Oslo, 2007)
• Lene Nordrum: English lexical nominalizations in a Norwegian-Swedish contrastive perspective. (PhD, Göteborg, 2007)
• Wiebke Ramm: Sentence boundary adjustments in translation (German / Norwegian): Consequences on information distribution and discourse structure (PhD, Oslo, ongoing)
• Astrid Nome: Ongoing PhD work on connectors in Norwegian and French. (Oslo)
• Cathrine Fabricius Hansen et al: Big Events, Small Clauses. The Grammar of Elaboration. (Forthcoming book with multiple authors and multiple languages)
• Master theses (English, German, French) studying individual verbs, syntactic constructions, connectors, metaphor …

My own contrastive work
2009. A textual perspective on the pragmatic markers in fact and faktisk. In S. Slembrouck, M. Taverniers, M. Van Herreweghe (eds.) From will to well: Studies in Linguistics offered to Anne-Marie Simon-Vandenbergen. Ghent: Academia Press.
2007. Using the ENPC and the ESPC as a parallel translation corpus: adverbs of frequency and usuality. Nordic Journal of English Studies 6:1, http://ojs.ub.gu.se/ojs/index.php/njes/issue/view/6
2006. “Not now” – on non-correspondence between the cognate adverbs now and nå. In K. Aijmer & A.-M. Simon-Vandenbergen (eds.) Pragmatic Markers in Contrast. Elsevier, 93-114.
2005. Theme in Norwegian. In K.L. Berge & E. Maagerø (eds.). Semiotics from the North: Nordic Approaches to Systemic Functional Linguistics.
Oslo: Novus, 35-48.
2004. Spatial linking in English and Norwegian. In K. Aijmer & H. Hasselgård (eds.). Translation and Corpora. Göteborg: Acta Universitatis Gothoburgensis, 163-188.
2004. Thematic choice in English and Norwegian. Functions of Language 11:2, 187-212.
2000. English multiple Themes in translation. In A. Klinge (ed.) Contrastive Studies in Syntax. Special issue of Copenhagen Studies in Language, Vol 25. Copenhagen: Samfundslitteratur, 11-38.

Case study: be going to and komme til å (‘come to’)
• Future-referring expressions based on motion verb + infinitive
• Both described in grammars as common expressions, though less common than expressions with English will, Norwegian skal

Meanings
be going to
  – ‘future fulfilment of the present’; present intention or present cause (Quirk et al 1985)
  – associated with present intention or arrangement; was going to quite often has ‘an implicature of non-actualisation’. (Huddleston & Pullum 2002)
  – Two meanings: ‘futurish’, linked to a present situation, and ‘future tense’, simply expressing future time reference. (Declerck 2006)
komme til å
  – the speaker predicts what will happen based on his knowledge at the moment of speaking (Faarlund et al 1997)
  – Past tense kom til å V – also ‘accidentally V’ or ‘was led to V’ / ‘grew to V’ (Vannebo 1979 and Engelsk Stor Ordbok)

Examples
1. I know what he’s going to say even before he says it. (FW1)
2. Jeg vet hva han kommer til å si selv før han sier det. (FW1T)
3. "I was going to wait until another time we met, but I may as well tell you now. (AH1)
4. Meningen var å vente til en annen gang, men jeg kan like godt si det nå. (AH1T)
5. Ingen av dem visste hva som kom til å skje. (TTH1)
6. Neither of them knew what was going to happen. (TTH1T)
7. Kanskje hun kom til å svelge dem ved et uhell? (LSC1)
8. Maybe she happened to swallow them by accident? (LSC1T)
9. Og siden ble det jeg som kom til å se mest til henne. (EHA1)
10.
And then I became the one who ended up seeing her most often. (EHA1T)

[Figure: be going to and komme til å in ENPC fiction (raw frequencies), originals vs. translations]

Preliminary observations
• Be going to is more common than komme til å in original texts
• Be going to is more common in original texts than in translations
• Komme til å is less common in original texts than in translations
  – i.e. translations in both directions can be assumed to be coloured by the source texts.
• The frequency differences between originals and translations (particularly with komme til å) indicate that the two expressions can often be used in the same contexts, but may tend not to be.

[Figure: Correspondences of be going to (percentages) in Norwegian translations and Norwegian originals; categories: komme til å, skal INF, skulle INF, vil INF, ville INF, ha tenkt, simple tense, other]

[Figure: Correspondences of komme til å (percentages) in English translations and English originals; categories: going to, will INF, would INF, might INF, be to INF, simple tense, happen to, other]

Correspondences
• The mutual correspondence between be going to and komme til å is surprisingly low: 12.6%
• The correspondence is asymmetrical:
  – 15% of be going to are translated as komme til å
  – 7% of komme til å are translated as be going to
• Komme til å has meanings not covered by be going to (‘accidentally’, ‘grow to’, ‘be led to’).
• The ‘present cause/intention’ meaning works differently for the two expressions; apparently also speaker certainty/non-actualisation.
1. What are we going to do, says Ruth, … (BV2T)
2. Hva skal vi gjøre, sier Rut … (BV2)
   Uncertain outcome, no intentionality
3. Hun kommer bare til å bli redd." (THA1)
4. She 'll only be frightened." (THA1T)
   Confident prediction – speaker knowledge
5. "Are you going to run a hotel?" enquired Frederick reasonably, … (DL1)
6. "Har dere tenkt å drive hotell?"
spurte Frederick fornuftig, … (DL1T)
   Intention, but uncertain outcome

Thus, in spite of shared meanings, English be going to and Norwegian komme til å differ as to
  – The frequency with which the item is chosen
  – The extent to which they compete with other future-referring expressions
  – The extent to which they convey confident predictions, ‘present intention’ and ‘actualised future in past’.
Some other explanations may be
  – Translators in both directions tend to normalize be going to / komme til å into a more common future-referring expression (will/would INF and skal/skulle INF); will/would and skal/skulle are also the most common sources of komme til å / be going to
  – Sometimes more lexically explicit forms have been used to translate be going to / komme til å: ha tenkt å / intend to (subject’s intention); was to (‘was led/destined to’)
  – Be going to may be needed for syntactic reasons, as English modals lack non-finite forms and do not show tense clearly.
  – Norwegian modal auxiliaries are more flexible, having non-finite and tensed forms; skal/skulle + INF fits into more syntactic environments than will/would + INF

[Figure: The verb forms (present, past, modalised, other) of going to and komme til in original texts (OT) and translated texts (TT), as percentages]

• The present tense be going to occurs to a great extent in direct speech.
• The meanings of ‘accidentally do’ and ‘grow to’ / ‘be led to’ of komme til å occur mainly with the past tense, the former also with modalisation.
1. Hun kjenner at hun er søvnig, at hun kan komme til å sovne mot fars jakke, hun vil ikke det. (BV2)
2. She feels that she is sleepy, that she might fall asleep against father's jacket, but she doesn't want to do that. (BV2T)
3. … og at den kvinnen jeg leter efter egentlig var et barn den gangen hun kom til å bety noe for meg.“ (FC1)
4. … and that the woman I'm searching for was really a child when she came to mean something to me.
(FC1T)

Some reflections on findings and further work
• The picture of correspondence is a complex one, in spite of the rather similar descriptions in grammars of be going to and komme til å.
• Syntactic differences between will/skal-future expressions may go some way towards explaining the difference in distribution.
• Correspondence types will have to be correlated with tense forms.
• Subtle differences of meaning regarding speaker certainty and present cause/intention come to the surface when studying correspondences.
• be going to is closer to a neutral future meaning than komme til å; further grammaticalized as a future tense.

Summing up
Parallel corpora enhance contrastive studies in a number of ways
  – by ensuring that observations are based on authentic language use
  – by yielding paradigms and patterns of correspondences
      thus often revealing meanings and nuances we might not have thought of
      and showing how the same meaning may be expressed by means of different linguistic categories
  – by providing quantitative data
      … thus also giving insights into ‘preferred ways of putting things’
  – (if the corpus is bidirectional) by providing control for translation bias
  – (if the corpus is representative) by controlling for the idiosyncrasies of individual authors/translators

Why undertake corpus-based contrastive investigations?
The importance of multilingual corpora extends beyond contrastive studies. It is up to the user to define fruitful research questions and use the corpora creatively. In this process we learn not only about individual languages and their relationships, about translation and foreign-language acquisition, but also about language in general – provided that the study becomes truly multilingual. Seeing through corpora we can see through language.
Stig Johansson (2007: 316)

Information on the OMC / ENPC
About the corpora:
  OMC: www.hf.uio.no/ilos/english/originalfiler/services/omc/
  ENPC: www.hf.uio.no/ilos/english/originalfiler/services/omc/enpc/
        www.helsinki.fi/varieng/CoRD/corpora/ENPC/
About publications based on the OMC (up to 2006):
  www.hf.uio.no/ilos/forskning/prosjekter/sprik/english/publications/

References
Aijmer, K. & B. Altenberg. 1996. Introduction. In K. Aijmer, B. Altenberg, M. Johansson (eds.) Languages in Contrast. Lund University Press, 11-16.
Altenberg, B. 1999. Adverbial connectors in English and Swedish: Semantic and lexical correspondences. In Hasselgård & Oksefjell (eds.) Out of Corpora. Amsterdam: Rodopi, 249-268.
Berglund, Y. 2005. Expressions of Future in Present-day English. A Corpus-based Approach. Uppsala University.
Chesterman, A. 1998. Contrastive Functional Analysis. Amsterdam/Philadelphia: John Benjamins Publishing Company.
Declerck, R. 2006. The Grammar of the English Verb Phrase, Vol. 1. Berlin: Mouton de Gruyter.
Faarlund, J. T., S. Lie, K. I. Vannebo. 1997. Norsk Referansegrammatikk. Oslo: Universitetsforlaget.
Huddleston, R. & G. K. Pullum. 2002. The Cambridge Grammar of the English Language. Cambridge: Cambridge University Press.
James, C. 1980. Contrastive Analysis. London: Longman.
Johansson, S. 1999. Corpora and contrastive studies. In P. Pietilä & O-P. Salo (eds.) Multiple Languages – Multiple Perspectives. AFinLA Yearbook 1999 / No. 57, 116-125.
Johansson, S. 2007. Seeing through multilingual corpora. Amsterdam: Benjamins.
Quirk, R., S. Greenbaum, G. Leech, J. Svartvik. 1985. A Comprehensive Grammar of the English Language. London: Longman.
Vannebo, K. I. 1979. Tempus og tidsreferanse. Oslo: Novus.