The Construction (and Some Uses) of the COMPARA Corpus

Using a parallel corpus in translation practice and research
Ana Frankenberg-Garcia
ana.frankenberg@sapo.pt
Machine Translation
Using machines to analyse Human Translation
The study of human translation
Traditionally not a hard science
Difficult to be systematic
But with the technology of corpus linguistics, things can change …
What is a corpus?
A large, machine-readable collection of texts, selected according to specific criteria and searched with text-retrieval software
Advantages of using corpora to study human translation
An enormous amount of translated texts
Systematic analyses
Quantifiable results
A bi-directional parallel corpus of Portuguese and English
COMPARA
Project leaders
Ana Frankenberg-Garcia & Diana Santos
Research assistants
Rosário Silva & Susana Inácio
Initial support (1999-2000)
FCT (Portugal)
ISLA (Lisboa)
Oxford University (Language Centre)
Present funding (2001-2006)
Linguateca: FCT/ POSI (POSI/PLP/43931/2001)
COMPARA structure
PT source texts and their EN translations
EN source texts and their PT translations
COMPARA
Source Texts: Original Portuguese, Original English
Translations: Translated English, Translated Portuguese
COMPARA 8.0 varieties
PORTUGUESE: Portugal, Brazil, Angola, Mozambique
ENGLISH: UK, US, South Africa
Unbalanced distribution!
COMPARA 8.0
Publication dates
2002
1997
1988
1914
1880
1837
COMPARA 8.0 genre
Published fiction
EXTENSIBLE: other genres
COMPARA 8.0 authors
Portuguese writers
Camilo Castelo Branco
Eça de Queirós
José Cardoso Pires
José Saramago
Jorge de Sena
Lídia Jorge
Mário de Carvalho
Sá Carneiro
COMPARA 8.0 authors
Brazilian writers
Aluísio Azevedo
Autran Dourado
Chico Buarque
Jô Soares
José de Alencar
Machado de Assis
Manuel Antônio de Almeida
Marcos Rey
Patrícia Melo
Paulo Coelho
Rubem Fonseca
COMPARA 8.0 authors
Angolan writers
José Eduardo Agualusa
Mozambican writers
Mia Couto
COMPARA 8.0 authors
British writers
David Lodge
Ian McEwan
Julian Barnes
Joseph Conrad
Joanna Trollope
Kazuo Ishiguro
Lewis Carroll
Mary Shelley
Oscar Wilde
COMPARA 8.0 authors
American writers
Henry James
Edgar Allan Poe
Richard Zimler
South African writers
Nadine Gordimer
Can any text be included in the corpus?
Only published source texts and translations
Only English translated directly from Portuguese and Portuguese translated directly from English
Only human translations!
COMPARA 8.0
texts
74 translations
71 source texts (extracts)
COMPARA 8.0 size
1,536,269 words in English
1,423,937 words in Portuguese
Largest edited parallel corpus containing Portuguese
COMPARA users and uses
Language learners - bilingual dictionary with examples
Language teachers - exercises and tests
Translators - language equivalents
Translation lecturers - exercises & problems
Translation theorists - test translation hypotheses
Lexicographers - bilingual dictionaries
Computational linguists - machine translation
Latest statistics: over 6,000 queries per month
COMPARA availability
Free, online
For research and
education
COMPARA access
www.linguateca.pt/COMPARA/
“nodded”
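A query such as “nodded” retrieves every aligned pair whose source side contains the search term. The sketch below only illustrates that idea and is not COMPARA's actual search interface: the function name `search_pairs` is hypothetical and the sentence pairs are invented for the example, not taken from the corpus.

```python
import re

def search_pairs(pairs, pattern):
    """Return the (source, translation) pairs whose source side matches pattern.

    `pairs` is a list of aligned (source, translation) strings.
    This is a hypothetical helper, not part of COMPARA.
    """
    rx = re.compile(pattern, re.IGNORECASE)
    return [(src, tgt) for src, tgt in pairs if rx.search(src)]

# Invented example data (NOT from COMPARA).
aligned = [
    ("She nodded slowly.", "Ela acenou lentamente com a cabeça."),
    ("He opened the door.", "Ele abriu a porta."),
    ("They nodded in agreement.", "Eles concordaram com um aceno."),
]

hits = search_pairs(aligned, r"\bnodded\b")
for src, tgt in hits:
    print(src, "|", tgt)
```

Reading the translation side of each hit shows how one English form is rendered in different ways in Portuguese.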
Studies using COMPARA
1. Observing source texts and translations
2. Contrasting Portuguese and English
3. Comparing translated and untranslated language
4. Examining the characteristics of translated texts
1. Observing source texts & translations
Improving bilingual dictionaries and machine-translation programs
Frankenberg-Garcia (2002): nod
Ribeiro & Dias (2005): grande
Specia et al. (2005): word-sense disambiguation
2. Contrasting English and Portuguese
Contrasting original fiction in English and Portuguese
Frankenberg-Garcia (2005)
[Charts: loan words in PT and EN; loan languages in PT and EN]
3. Comparing translated and untranslated language
                  translations *   source texts *   ratio
diferente(s)      30.7             15.4             2x
simplesmente      15.6             5.1              3x
end.* up          13.5             2.8              4x
lemma “rezar”     5.6              12.4             2x
* frequency/100 K words in COMPARA 7.0.4
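Frequencies per 100,000 words make counts from subcorpora of different sizes comparable. A minimal sketch of that normalisation follows; the function name `per_100k` and the counts in the usage line are hypothetical, not figures from COMPARA.

```python
def per_100k(hits: int, corpus_size: int) -> float:
    """Return the number of occurrences per 100,000 words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return hits * 100_000 / corpus_size

# Hypothetical counts: 150 hits in a 500,000-word subcorpus.
print(per_100k(150, 500_000))  # 30.0
```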
4. Examining the characteristics of translated texts
Are translations longer than source texts?
Frankenberg-Garcia (2004)
Explicitation Hypothesis
[Figure: 1,500-word extracts of source texts by 8 PT authors and 8 EN authors, paired with their translations by 8 PT translators and 8 EN translators; the length of the translations is the unknown (?)]
Matched t-test: 95% probability TT longer than ST
[Figure: the 16 translations (TT) compared with their source texts (ST); + 5%]
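A matched (paired) t-test works on the length difference within each source/translation pair. The sketch below shows the calculation of the t statistic using only the standard library. The function name `paired_t` and the translation lengths are hypothetical illustrations, not the data of the study.

```python
import math
from statistics import mean, stdev

def paired_t(st_lengths, tt_lengths):
    """Return (t statistic, degrees of freedom) for paired samples.

    The statistic is the mean of the pairwise differences (TT - ST)
    divided by its standard error.
    """
    if len(st_lengths) != len(tt_lengths) or len(st_lengths) < 2:
        raise ValueError("need two equally long samples of at least 2 pairs")
    diffs = [tt - st for st, tt in zip(st_lengths, tt_lengths)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

# Hypothetical word counts (NOT the study's data): 1,500-word source
# extracts and invented lengths for their translations.
st = [1500, 1500, 1500, 1500]
tt = [1560, 1590, 1530, 1620]
t, df = paired_t(st, tt)
print(f"t = {t:.2f}, df = {df}")  # t = 3.87, df = 3
```

The resulting t is then compared with the critical value for the given degrees of freedom at the chosen significance level.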
To conclude....
Studies such as these were unthinkable before corpora
Many other studies are possible!
COMPARA is free and available online
Contact us: ana.frankenberg@sapo.pt
diana.santos@sintef.no