Using a parallel corpus in translation practice and research
  Ana Frankenberg-Garcia
  ana.frankenberg@sapo.pt

Machine Translation
Using machines to analyse Human Translation

The study of human translation
  Traditionally not a hard science
  Difficult to be systematic
  But with the technology of corpus linguistics, things can change …

What is a corpus?
  large
  machine-readable
  specific criteria
  text-retrieval software

Advantages of using corpora to study human translation
  An enormous amount of translated texts
  Systematic analyses
  Quantifiable results

COMPARA
  A bi-directional parallel corpus of Portuguese and English
  Project leaders: Ana Frankenberg-Garcia & Diana Santos
  Research assistants: Rosário Silva & Susana Inácio
  Initial support (1999-2000): FCT (Portugal), ISLA (Lisboa), Oxford University (Language Centre)
  Present funding (2001-2006): Linguateca: FCT/POSI (POSI/PLP/43931/2001)

COMPARA structure
  PT source texts with EN translations
  EN source texts with PT translations
  [Diagram: source texts (original Portuguese, original English) and translations (translated English, translated Portuguese)]

COMPARA 8.0 varieties
  PORTUGUESE: Portugal, Brazil, Mozambique, Angola
  ENGLISH: UK, US, South Africa
  Unbalanced distribution!

COMPARA 8.0 publication dates
  [Timeline: 1837, 1880, 1914, 1988, 1997, 2002]

COMPARA 8.0 genre
  Published fiction
  Other genres: EXTENSIBLE

COMPARA 8.0 authors
  Portuguese writers: Camilo Castelo Branco, Eça de Queirós, José Cardoso Pires, José Saramago, Jorge de Sena, Lídia Jorge, Mário de Carvalho, Sá Carneiro
  Brazilian writers: Aluísio Azevedo, Autran Dourado, Chico Buarque, Jô Soares, José de Alencar, Machado de Assis, Manuel Antônio de Almeida, Marcos Rey, Patrícia Melo, Paulo Coelho, Rubem Fonseca
  Angolan writers: José Eduardo Agualusa
  Mozambican writers: Mia Couto
  British writers: David Lodge, Ian McEwan, Julian Barnes, Joseph Conrad, Joanna Trollope, Kazuo Ishiguro, Lewis Carroll, Mary Shelley, Oscar Wilde
  American writers: Henry James, Edgar Allan Poe, Richard Zimler
  South African writers: Nadine Gordimer

Can any text be included in the corpus?
  Only published source texts and translations
  Only English translated directly from Portuguese, and Portuguese translated directly from English
  Only human translations!

COMPARA 8.0 texts
  71 source texts (extracts)
  74 translations

COMPARA 8.0 size
  1,536,269 words in English
  1,423,937 words in Portuguese
  Largest edited parallel corpus containing Portuguese

COMPARA users and uses
  Language learners - bilingual dictionary with examples
  Language teachers - exercises and tests
  Translators - language equivalents
  Translation lecturers - exercises & problems
  Translation theorists - test translation hypotheses
  Lexicographers - bilingual dictionaries
  Computational linguists - machine translation
  Latest statistics: more than 6000 queries per month

COMPARA availability
  Free, online
  For research and education

COMPARA access
  www.linguateca.pt/COMPARA/
  Example query: “nodded”

Studies using COMPARA
  1. Observing source texts and translations
  2. Contrasting Portuguese and English
  3. Comparing translated and untranslated language
  4. Examining the characteristics of translated texts

1. Observing source texts & translations
  Improving bilingual dictionaries and machine-translation programs
  Frankenberg-Garcia (2002): nod
  Ribeiro & Dias (2005): grande
  Specia et al. (2005): word-sense disambiguation

2. Contrasting English and Portuguese
  Contrasting original fiction in English and Portuguese
  Frankenberg-Garcia (2005)
  [Charts: loan words in PT and EN; loan languages in PT and EN]

3. Comparing translated and untranslated language

  Query            Translations*   Ratio   Source texts*
  diferente(s)     30.7            2x      15.4
  simplesmente     15.6            3x      5.1
  end.* up         13.5            4x      2.8
  lemma “rezar”    5.6             2x      12.4

  * frequency per 100K words in COMPARA 7.0.4

4. Examining the characteristics of translated texts
  Are translations longer than source texts?
  Frankenberg-Garcia (2004)
  Explicitation Hypothesis
  [Diagram: source texts are extracts of 1500 words each, by 8 PT authors and 8 EN authors; translations by 8 PT translators and 8 EN translators; length of translations: ?]
  Matched t-test: 95% probability TT longer than ST
  [Diagram: 16 TT compared with their ST; TT = ST + 5%]

To conclude....
  Studies such as these were unthinkable before corpora
  Many other studies are possible!
  COMPARA is free and available online
  Contact us: ana.frankenberg@sapo.pt, diana.santos@sintef.no
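
Two calculations behind studies 3 and 4 are not shown on the slides: normalising a hit count to a frequency per 100K words, and the matched t-test on source-text (ST) and translation (TT) lengths. The following is a minimal sketch under stated assumptions, not the procedure actually used in the cited studies. The function names `per_100k` and `paired_t` are hypothetical, and all numbers in the example are made up for illustration; they are not COMPARA data.

```python
import math
from statistics import mean, stdev


def per_100k(hits: int, corpus_words: int) -> float:
    """Normalise a raw hit count to a frequency per 100,000 words."""
    return hits * 100_000 / corpus_words


def paired_t(st_lengths, tt_lengths):
    """Return (t, degrees of freedom) for a matched (paired) t-test.

    Each translation is paired with its own source text.
    A positive t means the translations are longer on average.
    """
    if len(st_lengths) != len(tt_lengths):
        raise ValueError("each source text needs exactly one translation")
    diffs = [tt - st for st, tt in zip(st_lengths, tt_lengths)]
    n = len(diffs)
    # t = mean difference divided by the standard error of the differences
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical data: 16 source-text extracts of 1500 words each,
# and invented word counts for their translations.
st = [1500] * 16
tt = [1575, 1540, 1610, 1490, 1560, 1620, 1530, 1585,
      1600, 1510, 1555, 1640, 1480, 1570, 1595, 1545]

t, df = paired_t(st, tt)
print(f"t = {t:.2f}, df = {df}")

# Hypothetical query: 50 hits in a 200,000-word subcorpus.
print(per_100k(50, 200_000))
```

To interpret the result, compare `t` with a critical value from a standard t table; for 15 degrees of freedom the two-tailed 5% critical value is 2.131. Pairing each translation with its own source text matters here because extracts differ in style and difficulty, so the test works on the per-pair differences rather than on two independent groups.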